Python Web Scraping Tutorial

Web scraping has become a widely adopted tool for data collection, and Python is a popular language for the job for several reasons.

Python boasts a rich collection of libraries and frameworks, making it a valuable asset for web scraping. Its simplicity and readability flatten the learning curve, and its versatility lets it handle different data formats, including HTML, JSON, and XML.

Moreover, Python offers built-in support for tasks that are crucial for web scraping, such as regular expressions and string manipulation. Additionally, Python supports multi-threading, which can speed up scraping applications that spend most of their time waiting on network responses.
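As a quick illustration of the multi-threading point, here is a minimal sketch, with placeholder URLs, that uses the standard-library concurrent.futures module to download several pages at once:

import concurrent.futures
import requests

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder URLs

def fetch(url):
    # Each worker thread downloads one page and returns its HTML
    return requests.get(url, timeout=10).text

# Threads overlap the time spent waiting on the network across requests
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    pages = list(executor.map(fetch, urls))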

Furthermore, Python has a vast and active developer community that continuously creates and maintains web scraping libraries and frameworks. Overall, Python is a suitable language for data extraction, and its adoption for web scraping is continually growing.

Here are the steps to build a web scraper with Python:

1. Choose a scraping library

There are many Python libraries available for web scraping, such as Beautiful Soup, lxml, Selenium, and Requests. Choose the one that best suits your needs and skill level.

2. Choose a good coding environment

PyCharm and Visual Studio Code are both excellent Integrated Development Environments (IDEs) for Python development. Choosing one over the other largely depends on your specific needs and preferences.

PyCharm is a full-featured IDE developed by JetBrains. It has a comprehensive set of tools for Python development, including debugging, testing, profiling, and code analysis. PyCharm has an intuitive user interface, and its code completion and debugging features are highly regarded. It is a paid tool, but it offers a free community edition that is suitable for many developers.

On the other hand, Visual Studio Code (VS Code) is a lightweight and free IDE developed by Microsoft. It has a broad range of extensions and plugins available that can be customized to fit your specific needs. VS Code is an open-source project with a thriving community, which regularly updates the software with new features and bug fixes.

Both IDEs offer similar features, including code completion, debugging, and version control integration. However, PyCharm is generally more suited for large-scale Python projects with more complex code bases, while VS Code is an excellent choice for smaller projects and beginners due to its ease of use and lightweight nature.

Ultimately, the choice between PyCharm and VS Code depends on your development needs and preferences. It’s worth trying both to see which one suits you better.

3. Import and use libraries

Let’s start by importing the following libraries:

import pandas as pd
from bs4 import BeautifulSoup
import requests

Then, using the requests library, we will get the page we want to scrape and extract its HTML:

f = requests.get('http://targetwebsite.com/')

Next, we will pass the site’s HTML text to BeautifulSoup, which will parse this raw data so it can be easily scraped.

soup = BeautifulSoup(f.text, 'html.parser')  # name a parser explicitly so BeautifulSoup does not emit a warning

Now all of the target website’s HTML is stored in the soup object, and we can extract any data we need with BeautifulSoup’s built-in functions.

For example, you can easily extract all of the visible text on the target page with the following code:

print(soup.get_text())

Using your chosen scraping library, write the code to extract the data from the website. You may need techniques like CSS selectors, XPath expressions, or regular expressions to pinpoint the data you want to extract.
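For instance, Beautiful Soup exposes CSS selectors through its select() method; a short sketch with a hypothetical selector matching the title markup shown further below:

# Hypothetical selector; adjust it to the structure of your target page
for link in soup.select('h4.title a'):
    print(link.get('href'), link.text)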

The next step is getting the exact information you need from the target website. This could be titles, categories, product specifications, or specific headings (H1, H2, etc.). You can also scrape text that appears after specific characters, such as quotes. In Google Chrome, right-click on the target page and choose “Inspect” so that the Chrome Developer Tools appear on your screen.

Test your scraper to make sure it is working correctly and extracting the data you need.
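One lightweight way to test is to assert on the shape of the output before trusting a full run; a minimal sketch, assuming a results list like the one built below:

# Fail fast if the page structure changed and nothing was extracted
assert len(results) > 0, 'No data extracted - check your selectors'
# Spot-check a few records by eye
print(results[:5])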

You can use the find() and find_all() functions (findAll() is an older name for the same method) to extract all the information within specific tags.

Take the following page source snippet as an example:

<h4 class="title">
<a href="...">This is a Title</a>
</h4>

Now you can create a loop that goes through the entire page source and finds all occurrences of the class we need. Here is an example of how to extract the titles from the markup above.

results = []
for element in soup.findAll(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)

Websites may change their HTML structure or block scrapers, so make sure your code handles such failures gracefully.
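A minimal sketch of such error handling, using the exception classes that the requests library exposes:

import requests

try:
    response = requests.get('http://targetwebsite.com/', timeout=10)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx status codes
except requests.exceptions.Timeout:
    print('The request timed out')
except requests.exceptions.HTTPError as err:
    print(f'The server returned an error: {err}')
except requests.exceptions.RequestException as err:
    print(f'The request failed: {err}')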

Once you have extracted the data, you may want to store it in a database or a file for further analysis.
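For example, the standard-library sqlite3 module is enough for a simple local store; a minimal sketch, assuming a results list of title strings like the one built above:

import sqlite3

# Open (or create) a local database file and a simple table for the titles
conn = sqlite3.connect('scraped_data.db')
conn.execute('CREATE TABLE IF NOT EXISTS titles (name TEXT)')
conn.executemany('INSERT INTO titles (name) VALUES (?)', [(r,) for r in results])
conn.commit()
conn.close()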

If the target website requires authentication, has verification mechanisms like CAPTCHAs in place, or relies on JavaScript running in the browser while the page loads, you will have to use a browser automation tool like Selenium (covered below) to aid with the scraping. And if you need to scrape the website regularly, you can automate your scraper using tools like Cron or Task Scheduler to run the script at a specified time.
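For instance, this hypothetical crontab entry (the interpreter and script paths are placeholders) would run a scraper every day at 6 AM:

0 6 * * * /usr/bin/python3 /path/to/scraper.py

On Windows, Task Scheduler offers an equivalent way to schedule the same script.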

Selenium is an open-source tool that automates web browsers. It provides a single interface that lets you write test scripts in programming languages like Ruby, Java, Node.js, PHP, Perl, Python, and C#, among others.

Selenium requires three components:

  • A supported browser such as Chrome, Edge, Firefox and Safari
  • A driver for the browser (the Selenium documentation lists download links for each browser)
  • The selenium package

Install the selenium package by running the following command in the terminal:

pip install selenium

Then you need to import the appropriate class for the browser you are using:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

# Selenium 3 style; in Selenium 4.6+ you can call Chrome() with no arguments
# and Selenium Manager will locate a matching driver automatically
driver = Chrome(executable_path='/path/to/driver')

Now, you can load any webpage using the get() method:

driver.get('https://hydraproxy.com/blog')
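The By class imported above provides locator strategies for pulling elements straight out of the loaded page, without going through BeautifulSoup; a short sketch with hypothetical selectors:

# Find the first element matching a CSS selector (hypothetical selector)
heading = driver.find_element(By.CSS_SELECTOR, 'h4.title a')
print(heading.text)

# Find every link on the page
links = driver.find_elements(By.TAG_NAME, 'a')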


Let’s put all the packages we installed earlier to use:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

Next, we should define the browser we are using. For example, for Chrome:

driver = webdriver.Chrome(executable_path=r'c:\path\to\windows\webdriver\executable.exe')  # raw string keeps the backslashes in the Windows path literal

Using the same approach as before to find all occurrences of the class, our full code should look like this:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
driver.quit()  # the page source is already in memory, so the browser can close
soup = BeautifulSoup(content, 'html.parser')
for element in soup.findAll(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)

To display the results, you can use a simple loop:

for x in results:
    print(x)

However, in most cases you will want to process the dataset for further analysis. This is where the pandas library comes in. Here is the code to export the data in CSV format:

df = pd.DataFrame({'Data': results})
df.to_csv('data.csv', index=False, encoding='utf-8')

Remember to follow ethical scraping practices and respect websites’ terms of service to avoid legal issues. Here are some best practices to consider:

  1. Check the website’s terms of service: Before scraping a website, review its terms of service to ensure that web scraping is allowed. Some websites explicitly prohibit scraping, while others may have specific rules or guidelines.
  2. Respect robots.txt: The robots.txt file is a standard used by websites to communicate which parts of the site are open for crawling and which parts are off-limits. Make sure to review and adhere to the directives specified in the robots.txt file of the website you’re scraping.
  3. Set a reasonable scraping rate: Avoid sending too many requests in a short period, as it can put a strain on the website’s server. Set a reasonable scraping rate by adding delays between requests using the time.sleep() function (see the sketch after this list).
  4. Use headers and user agents: Include appropriate headers and user agents in your requests to simulate a real browser. Some websites may block requests without valid user agents, so it’s important to set them to avoid being blocked.
  5. Handle errors and exceptions: Implement error handling in your scraping code to gracefully handle situations such as connection errors, timeouts, or unexpected HTML structures.
  6. Be mindful of the website’s resources: Especially if you’re scraping a high-traffic site, avoid excessive concurrent connections or downloading large files unnecessarily.
  7. Be considerate of bandwidth and storage: Avoid downloading or storing excessive amounts of data unless necessary. Minimize the size of scraped data by extracting only the relevant information needed for your project.
  8. Be ethical and legal: Respect the website’s content and intellectual property. Do not use scraped data for illegal purposes, such as spamming, copyright infringement, or unauthorized distribution.
  9. Cache data when possible: If the website’s content doesn’t change frequently, consider caching the scraped data to reduce the number of requests and improve performance. However, ensure the data remains up to date when needed.
  10. Use proxies: After following the above recommendations, you should also consider using proxies to avoid getting blocked or banned. For more details on how to implement proxy usage in your Python code, please check here: https://hydraproxy.com/how-to-use-hydraproxy-in-python/
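As a rough sketch combining several of these practices (robots.txt, rate limiting, and a user agent; the URLs, user agent string, and delay are placeholders):

import time
import urllib.robotparser
import requests

# Check robots.txt before crawling (practice 2)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://targetwebsite.com/robots.txt')
rp.read()

# Identify your client with a User-Agent header (practice 4)
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}

urls = ['http://targetwebsite.com/page1', 'http://targetwebsite.com/page2']
for url in urls:
    if not rp.can_fetch(headers['User-Agent'], url):
        continue  # skip pages that robots.txt disallows
    response = requests.get(url, headers=headers, timeout=10)
    time.sleep(2)  # rate-limit requests (practice 3)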
