
Scraping data from websites is a common task for data scientists and researchers, as it allows them to gather large amounts of information from various sources for analysis and study. In this blog post, we will discuss how to use Python to scrape data from Medium, a popular online publishing platform.

Introduction to BeautifulSoup

The first step in scraping data from Medium using Python is to install the BeautifulSoup library. BeautifulSoup is a Python library that is used for web scraping and parsing HTML and XML files. It provides a convenient way to extract information from HTML pages, allowing you to focus on processing the data rather than dealing with the low-level details of HTML parsing.
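As a quick, minimal sketch of what this looks like in practice (the HTML fragment here is invented purely for demonstration), BeautifulSoup can parse a string of HTML and pull out a tag's text:

from bs4 import BeautifulSoup

# A tiny, invented HTML fragment used only to demonstrate parsing
html = "<html><body><h1>Hello</h1><p class='intro'>Welcome to scraping.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Extract text from the first h1, and from the p tag with class "intro"
print(soup.find("h1").text)                  # Hello
print(soup.find("p", class_="intro").text)   # Welcome to scraping.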

To install BeautifulSoup, you can use the following command in your terminal or command prompt:

pip install beautifulsoup4

You can also install the lxml library, which is a parser that BeautifulSoup can use to parse HTML and XML documents. You can install it with the following command:

pip install lxml

Scraping Data from Medium

To start scraping data from Medium, you first need to identify the information that you want to extract. In this example, we will scrape the titles and URLs of the latest articles on Medium.

Next, you need to inspect the HTML of the Medium page that you want to scrape. You can do this by right-clicking on the page and selecting “Inspect Element.” This will open the developer tools in your browser, where you can view the HTML code for the page.

In the HTML code, you can see the structure of the page and locate the information that you want to extract. In this case, the titles and URLs of the latest articles are contained in div elements with the class js-postArticle, with each title inside an h3 element with the class graf--title.
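To make that structure concrete, here is a small self-contained sketch using a simplified, invented fragment of markup (Medium's real pages are more complex and their class names change over time, so treat this as illustrative only):

from bs4 import BeautifulSoup

# Simplified, invented stand-in for Medium's article markup
html = """
<div class="js-postArticle">
  <h3 class="graf--title">A First Article</h3>
  <a href="https://medium.com/@someone/a-first-article">Read more</a>
</div>
"""

soup = BeautifulSoup(html, "lxml")
article = soup.find("div", class_="js-postArticle")
print(article.find("h3", class_="graf--title").text)  # A First Article
print(article.find("a")["href"])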

With the target information located, you can start writing your Python code to extract the data. The first step is to make an HTTP request to the Medium page using the requests library. You can use the following code to make the request:

import requests

# Request the Medium home page and fail loudly on an HTTP error
url = "https://medium.com/"
response = requests.get(url)
response.raise_for_status()

Once you have the HTML content of the page, you can use BeautifulSoup to parse it and extract the information that you want. The following code uses BeautifulSoup to parse the HTML content and extract the titles and URLs of the latest articles:

from bs4 import BeautifulSoup

# Parse the HTML content with the lxml parser
soup = BeautifulSoup(response.text, "lxml")

# Find every article container on the page
articles = soup.find_all("div", class_="js-postArticle")

titles = []
urls = []

for article in articles:
    title_tag = article.find("h3", class_="graf--title")
    link_tag = article.find("a")
    # Skip articles whose markup doesn't match the expected structure
    if title_tag is None or link_tag is None:
        continue
    titles.append(title_tag.text)
    urls.append(link_tag["href"])

In this code, the find_all method locates all of the div elements with the class js-postArticle, and the find method extracts the title element and link from each article. The None checks skip any article whose markup does not match the expected structure.
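To inspect what was collected, you can print the paired results. Note that this approach assumes Medium serves the article markup in the initial HTML response; if the page is rendered largely with JavaScript, the lists may come back empty, and a browser-automation tool such as Selenium would be needed instead.

# Print each scraped title alongside its URL
for title, url in zip(titles, urls):
    print(f"{title}\n  {url}")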

By Hari Haran

I'm an aspiring data scientist eager to learn more about AI and to explore as many AI learning resources as I can.
