How to Use Python to Crawl the Top 250 Movies on Douban
Douban is a Chinese website that provides reviews, ratings, and recommendations for movies, books, music, and other media. The Douban Top 250 movies list is a ranking of the most popular movies on Douban, as determined by user votes.
In this blog post, we will show you how to use Python to crawl the top 250 movies on Douban. We will use the requests library to get the HTML content of the Douban Top 250 movies list page, and the Beautiful Soup library to parse the HTML content and extract the movie titles.
Prerequisites
To follow along with this tutorial, you will need the following:
- A Python programming environment
- The requests library
- The Beautiful Soup library
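If you are not sure whether both libraries are installed, a small check like the one below can help. It is only a sketch: the helper name missing_packages is made up for this post, and it assumes the usual PyPI package names (requests, and beautifulsoup4 for the bs4 module).

```python
import importlib.util

# Map each import name to the PyPI package that provides it.
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4"}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of required packages that cannot be imported."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All prerequisites are available.")
```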
Step 1: Get the HTML Content of the Douban Top 250 Movies List Page
The first step is to get the HTML content of the Douban Top 250 movies list page. We can do this using the requests library.
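import requests

url = "https://movie.douban.com/top250"
# Send a browser-like User-Agent; some sites reject the library's default one.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)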
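The requests.get() function takes the URL of the page as its first argument, along with optional keyword arguments such as headers and timeout. It returns a requests.Response object, which holds the HTML of the page in its content attribute (raw bytes) and its text attribute (decoded string).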
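A request can fail or be refused, so it is worth checking the status before parsing anything. The sketch below is one way to do that, not the only one. The helper name fetch_html and the User-Agent string are our own choices, and the function accepts any object with a requests-style get() method, so you can pass the requests module itself or a requests.Session.

```python
# Assumed header value; any browser-like User-Agent string would do here.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(client, url, timeout=10):
    """Fetch a page and return its decoded HTML.

    `client` is anything with a requests-style get() method, such as the
    requests module or a requests.Session instance.
    """
    response = client.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (needs network access and the requests library):
#   import requests
#   html = fetch_html(requests, "https://movie.douban.com/top250")
```

Passing the client in also makes it easy to reuse one requests.Session across several pages.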
Step 2: Parse the HTML Content and Extract the Movie Titles
The next step is to parse the HTML content of the page and extract the movie titles. We can do this using the Beautiful Soup library.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
movies = soup.find_all("div", class_="item")
The Beautiful Soup library provides a number of methods for parsing HTML content. In this case, we pass the HTML content of the page to the BeautifulSoup() constructor, together with the name of Python's built-in parser ("html.parser"), to create a BeautifulSoup object. We then use the find_all() method to find all div elements with the class item.
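To see what these calls do without touching the network, you can run them on a small inline document. The markup below is a made-up, simplified stand-in for the div.item / span.title structure this tutorial relies on (the real page has more nesting), and extract_titles is a helper name we chose for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, much simpler than the real page.
SAMPLE_HTML = """
<ol>
  <li><div class="item"><span class="title">Movie A</span>
      <span class="title">/ Alt Title A</span></div></li>
  <li><div class="item"><span class="title">Movie B</span></div></li>
</ol>
"""


def extract_titles(html):
    """Return the text of the first span.title inside each div.item."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="item"):
        span = item.find("span", class_="title")
        if span is not None:  # skip entries that have no title span
            titles.append(span.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))  # ['Movie A', 'Movie B']
```

Note that find() returns only the first match, so an entry with several title spans contributes just its first one.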
The find_all() method returns a list-like ResultSet of Tag objects. Each Tag represents one movie entry on the page we fetched. Because the Top 250 list is paginated, this covers only the movies on the first page.
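To cover the whole list you need one request per page. The sketch below builds the page URLs under two assumptions that are not stated on the page itself and are worth confirming in your browser: each page holds 25 movies, and the page is selected with a start query parameter giving the offset of its first entry. The helper name page_urls is ours.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Build one URL per page, assuming a `start` offset query parameter."""
    return [
        f"{BASE_URL}?{urlencode({'start': offset})}"
        for offset in range(0, total, page_size)
    ]


for url in page_urls():
    print(url)
```

You would then fetch and parse each URL in turn, ideally pausing briefly between requests so the crawl stays polite.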
Step 3: Save the Movie Titles to a CSV File
The final step is to save the movie titles to a CSV file. We can do this using the csv module.
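import csv

# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("top250.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title"])  # header row
    for movie in movies:
        # find() returns the first span with class "title" in this entry
        title = movie.find("span", class_="title").text
        writer.writerow([title])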
The csv module provides a number of functions for working with CSV files. In this case, we use the csv.writer() function to create a writer object for the open file. We then use its writerow() method to write one row of data at a time: first the header row, then one row per movie.
The writerow() method takes a list of values as its argument, one value per column. Here each row has a single value, the title of the movie.
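If you want to check the output, you can write the file and read it straight back. The following is a minimal sketch: write_titles and read_titles are helper names we made up, the sample titles are placeholders rather than scraped data, and a temporary directory is used so that nothing is left behind in your project folder.

```python
import csv
import os
import tempfile


def write_titles(titles, path):
    """Write one title per row under a 'Title' header."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Title"])
        writer.writerows([title] for title in titles)


def read_titles(path):
    """Read the titles back, skipping the header row."""
    with open(path, encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))
    return [row[0] for row in rows[1:]]


# Placeholder titles, including non-ASCII text and a comma, to exercise quoting.
sample = ["电影甲", "Movie B, Part 2"]
path = os.path.join(tempfile.mkdtemp(), "top250.csv")
write_titles(sample, path)
print(read_titles(path) == sample)  # True
```

The csv module quotes values that contain commas, which is why the second title survives the round trip intact.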
Conclusion
In this blog post, we showed you how to use Python to crawl the top 250 movies on Douban. We used the requests library to get the HTML content of the Douban Top 250 movies list page, and the Beautiful Soup library to parse the HTML content and extract the movie titles. We then saved the movie titles to a CSV file.