How to Use Python to Crawl the Top 250 Movies on Douban

Douban is a Chinese website that provides reviews, ratings, and recommendations for movies, books, music, and other media. The Douban Top 250 movies list is a ranking of the most popular movies on Douban, as determined by user votes.

In this blog post, we will show you how to use Python to crawl the top 250 movies on Douban. We will use the requests library to get the HTML content of the Douban Top 250 movies list page, and the Beautiful Soup library to parse the HTML content and extract the movie titles.

Prerequisites

To follow along with this tutorial, you will need the following:

  • A Python programming environment
  • The requests library
  • The Beautiful Soup library (installed as beautifulsoup4, imported as bs4)

Step 1: Get the HTML Content of the Douban Top 250 Movies List Page

The first step is to get the HTML content of the Douban Top 250 movies list page. We can do this using the requests library.

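import requests

url = "https://movie.douban.com/top250"
# Douban may reject requests that lack a browser-like User-Agent header.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an exception on an HTTP error status

The requests.get() function takes the URL of the page as its first argument and accepts optional keyword arguments such as headers and timeout. It returns a requests.Response object, whose content attribute holds the raw HTML of the page.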
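A single request returns only one page of the list, not all 250 movies. The sketch below builds one URL per page, assuming the list is split into pages of 25 movies selected by a start query parameter. That URL scheme is an assumption worth confirming in a browser, and build_page_urls is a hypothetical helper, not part of requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def build_page_urls(total=250, page_size=25):
    """Return one URL per page, assuming a `start` offset parameter."""
    return [
        f"{BASE_URL}?{urlencode({'start': offset})}"
        for offset in range(0, total, page_size)
    ]


# Ten URLs with offsets 0, 25, ..., 225 (assumed pagination scheme).
urls = build_page_urls()
```

Each URL could then be passed to requests.get() in a loop, ideally with a short pause between requests so the site is not overloaded.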

Step 2: Parse the HTML Content and Extract the Movie Titles

The next step is to parse the HTML content of the page and extract the movie titles. We can do this using the Beautiful Soup library.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")

movies = soup.find_all("div", class_="item")

Beautiful Soup provides a number of methods for parsing HTML content. Here we pass the HTML content of the page to the BeautifulSoup() constructor to create a BeautifulSoup object, then call its find_all() method to find all div elements with the class item.

The find_all() method returns a list-like ResultSet of Tag objects. Each Tag represents one movie in the Douban Top 250 movies list.
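To see how a title comes out of each matched element, here is a minimal sketch that runs on a small inline HTML sample rather than the live page. The sample markup and the extract_titles helper are hypothetical. They only mirror the structure this tutorial relies on: a div with class item that contains a span with class title.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure assumed in this tutorial.
SAMPLE_HTML = """
<ol>
  <li><div class="item">
    <span class="title">Movie A</span>
    <span class="title"> / Alternate Title A</span>
  </div></li>
  <li><div class="item">
    <span class="title">Movie B</span>
  </div></li>
</ol>
"""


def extract_titles(html):
    """Return the first title found inside each div.item element."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="item"):
        span = item.find("span", class_="title")  # first matching span only
        if span is not None:  # skip items that have no title span
            titles.append(span.get_text(strip=True))
    return titles


titles = extract_titles(SAMPLE_HTML)
```

Because find() returns only the first match, an item with several title spans contributes just its first one. The None check keeps the loop from failing if an item has no title span.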

Step 3: Save the Movie Titles to a CSV File

The final step is to save the movie titles to a CSV file. We can do this using the csv module.

import csv

with open("top250.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title"])
    for movie in movies:
        title = movie.find("span", class_="title").text
        writer.writerow([title])

The csv module provides functions for reading and writing CSV files. Here, csv.writer() creates a writer object, and its writerow() method writes one row of data to the CSV file.

The writerow() method takes a list of values as its argument. In this case each list holds a single value, the title of the movie.
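The behavior of writerow() can be checked without touching the disk by writing into an in-memory buffer. This is a small sketch: write_titles is a hypothetical helper and the titles are made-up sample values.

```python
import csv
import io


def write_titles(titles, fileobj):
    """Write a header row and one row per title to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["Title"])
    for title in titles:
        writer.writerow([title])


buffer = io.StringIO()
write_titles(["Movie A", "Movie B, Part 2"], buffer)
csv_text = buffer.getvalue()
```

The writer ends each row with "\r\n" by default and quotes any field that contains a comma. When writing to a real file, opening it with newline="" stops Python from translating those line endings a second time.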

Conclusion

In this blog post, we showed you how to use Python to crawl the top 250 movies on Douban. We used the requests library to get the HTML content of the Douban Top 250 movies list page, and the Beautiful Soup library to parse the HTML content and extract the movie titles. We then saved the movie titles to a CSV file.