Scraping data from websites is a common task for data scientists and researchers, as it allows them to gather large amounts of information from various sources for analysis and study. In this blog post, we will discuss how to use Python to scrape data from Medium, a popular online publishing platform.
Introduction to BeautifulSoup
The first step in scraping data from Medium using Python is to install the BeautifulSoup library. BeautifulSoup is a Python library that is used for web scraping and parsing HTML and XML files. It provides a convenient way to extract information from HTML pages, allowing you to focus on processing the data rather than dealing with the low-level details of HTML parsing.
To install BeautifulSoup, you can use the following command in your terminal or command prompt:
pip install beautifulsoup4
You can also install the lxml library, a parser that BeautifulSoup can use to parse HTML and XML documents. You can install it with the following command:
pip install lxml
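If you want to confirm that both packages are available, a quick check is to import them from Python (this is just a sanity check and is not required for the rest of the tutorial):

import bs4
import lxml

print("BeautifulSoup version:", bs4.__version__)
print("bs4 and lxml imported successfully")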
Scraping Data from Medium
To start scraping data from Medium, you first need to identify the information that you want to extract. In this example, we will scrape the titles and URLs of the latest articles on Medium.
Next, you need to inspect the HTML of the Medium page that you want to scrape. You can do this by right-clicking on the page and selecting “Inspect Element.” This will open the developer tools in your browser, where you can view the HTML code for the page.
In the HTML code, you can see the structure of the page and locate the information that you want to extract. In this case, we want to extract the titles and URLs of the latest articles, which are contained in a div element with the class js-postArticle and an h3 element with the class graf--title.
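To make that structure concrete, here is a small, hypothetical HTML fragment in the same shape, parsed with BeautifulSoup. The real Medium markup is more complex and may change over time, so treat the class names as an example:

from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the structure described above.
html = """
<div class="js-postArticle">
  <h3 class="graf--title">Example article title</h3>
  <a href="/@author/example-article">Read more</a>
</div>
"""
soup = BeautifulSoup(html, "lxml")
article = soup.find("div", class_="js-postArticle")
print(article.find("h3", class_="graf--title").text)  # Example article title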
With the target information located, you can start writing your Python code to extract the data. The first step is to make an HTTP request to the Medium page using the requests library. You can use the following code to make the request:
import requests

url = "https://medium.com/"
response = requests.get(url)
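Keep in mind that Medium may reject requests that do not look like they come from a browser, and the page can occasionally return an error status. A slightly more defensive version of the same request, with a placeholder User-Agent string and a timeout that you can adjust, might look like this:

import requests

url = "https://medium.com/"
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}  # placeholder value
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early if Medium returns an error status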
Once you have the HTML content of the page, you can use BeautifulSoup to parse it and extract the information that you want. The following code uses BeautifulSoup to parse the HTML content and extract the titles and URLs of the latest articles:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")
articles = soup.find_all("div", class_="js-postArticle")

titles = []
urls = []
for article in articles:
    title = article.find("h3", class_="graf--title").text
    url = article.find("a")["href"]
    titles.append(title)
    urls.append(url)
In this code, the find_all method is used to locate all of the div elements with the class js-postArticle, and the find method is used to extract the title and URL from each article element.
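Two practical details are worth noting: an article block may not contain the expected h3 or a tags, and the href attribute may hold a relative path rather than a full URL. A slightly more defensive version of the extraction loop, under those assumptions, could look like this:

from urllib.parse import urljoin

base_url = "https://medium.com/"
titles = []
urls = []
for article in articles:
    title_tag = article.find("h3", class_="graf--title")
    link_tag = article.find("a")
    if title_tag is None or link_tag is None:
        continue  # skip entries that do not match the expected structure
    titles.append(title_tag.text.strip())
    urls.append(urljoin(base_url, link_tag["href"]))  # turn relative links into full URLs

Once the titles and urls lists are populated, you can print them or save them for later analysis, for example as a CSV file (the file name here is just an example):

import csv

with open("medium_articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])
    writer.writerows(zip(titles, urls))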