Python in Practice: Efficiently Scraping Movie Review Data and Text Analysis Techniques

Published: 2026-01-13 15:31


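Introduction

In the digital era, film review platforms such as Douban gather large volumes of user comments and ratings. This data influences what audiences choose to watch and gives filmmakers and marketers a useful reference. This article walks through writing a Python crawler that scrapes Douban movie reviews, then analyzes the text with sentiment analysis and a word cloud.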

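I. Environment Setup

Before writing the crawler, install the required Python libraries:

    pip install requests beautifulsoup4 pandas textblob jieba pyecharts

requests: sends HTTP requests.
beautifulsoup4: parses HTML pages.
pandas: data analysis and processing.
textblob: sentiment analysis.
jieba: Chinese word segmentation.
pyecharts: data visualization.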
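Before moving on, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library. Note that beautifulsoup4 is imported under the name `bs4`; the helper name `missing_packages` is made up for this illustration.

```python
from importlib.util import find_spec

# Import names for the packages installed above
# (beautifulsoup4 is imported as "bs4").
REQUIRED = ["requests", "bs4", "pandas", "textblob", "jieba", "pyecharts"]


def missing_packages(names):
    """Return the names that cannot be found by this interpreter."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```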

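II. Writing the Crawler

1. Set the Target Page and Request Headers

First, decide which movie to scrape and find its Douban link. Taking The Shawshank Redemption as the example, its Douban page is https://movie.douban.com/subject/1292052/. The short reviews are listed under the /comments path of that page.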
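A single request returns only one page of short reviews. Assuming the `/comments` listing accepts offset-style `start` and `limit` query parameters (an assumption not stated in this article; compare with the URLs your browser shows when paging), the page URLs could be built roughly like this. `build_page_urls` is an illustrative name.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/1292052/comments"


def build_page_urls(base_url, pages, page_size=20):
    """Build one URL per page using offset-style pagination.

    Assumption: the listing accepts `start` (offset) and `limit` (page size).
    """
    urls = []
    for page in range(pages):
        query = urlencode({"start": page * page_size, "limit": page_size})
        urls.append(f"{base_url}?{query}")
    return urls


for page_url in build_page_urls(BASE_URL, pages=3):
    print(page_url)
```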

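    import requests
    from bs4 import BeautifulSoup

    # Short-review (comments) page for The Shawshank Redemption
    url = 'https://movie.douban.com/subject/1292052/comments'
    # Browser-like User-Agent; a site may reject the default python-requests one
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

2. Define the Function for Downloading Review Data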

We need a function that downloads the review data: the username, the rating, and the review text.

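    def get_comments(url):
        """Fetch one comments page and return a list of review dicts."""
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')
        comments = []
        for item in soup.find_all('div', class_='comment'):
            info = item.find('span', class_='comment-info')
            rating_tag = item.find('span', class_='rating')
            text_tag = item.find('p')
            if info is None or info.a is None or text_tag is None:
                continue  # skip blocks that do not look like a review
            comments.append({
                'username': info.a.text.strip(),
                # Some users post without a rating, so the tag may be missing
                'rating': rating_tag.get('title') if rating_tag else None,
                'comment': text_tag.text.strip(),
            })
        return comments

3. Create a Folder to Save the Data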

To save the scraped data, create a folder and write the reviews to a file.

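    import csv
    import os

    def save_comments(comments, filename='comments.csv'):
        """Write the reviews to data/<filename> as CSV."""
        os.makedirs('data', exist_ok=True)
        path = os.path.join('data', filename)
        with open(path, 'w', encoding='utf-8', newline='') as f:
            # csv.writer quotes fields that contain commas, quotes, or newlines
            writer = csv.writer(f)
            writer.writerow(['username', 'rating', 'comment'])
            for comment in comments:
                writer.writerow([comment['username'], comment['rating'], comment['comment']])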
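Review text often contains commas, quotation marks, and line breaks, which corrupt a CSV file if rows are assembled by plain string concatenation. The sketch below, using made-up sample rows and an in-memory buffer instead of a real file, shows that Python's `csv` module round-trips such text unchanged.

```python
import csv
import io

# Made-up sample rows; the second comment contains a comma, quotes, and a newline.
rows = [
    {"username": "user_a", "rating": "力荐", "comment": "Hope, friendship, and a great ending."},
    {"username": "user_b", "rating": "", "comment": 'He said "get busy living",\nand I agree.'},
]

# Write to an in-memory buffer instead of data/comments.csv for the demo.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["username", "rating", "comment"])
writer.writeheader()
writer.writerows(rows)

# Read it back and compare with what was written.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print("round trip intact:", restored == rows)
```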

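III. Sentiment Analysis

Use the TextBlob library to run sentiment analysis on the reviews and obtain a polarity score between -1.0 and 1.0 for each one.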

    from textblob import TextBlob

    def analyze_sentiment(comments):
        """Attach a polarity score in [-1.0, 1.0] to each review."""
        for comment in comments:
            # Note: TextBlob's default analyzer is built for English text
            blob = TextBlob(comment['comment'])
            comment['sentiment'] = blob.sentiment.polarity
        return comments
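TextBlob's default analyzer relies on an English lexicon, so Chinese reviews may mostly score 0.0. One cheap cross-check is the star rating each user left. The sketch below assumes the scraped `rating` field holds one of five Chinese labels (an assumption; verify against your own data) and flags reviews whose polarity sign contradicts the rating. `RATING_STARS`, `sentiment_from_rating`, and `disagreements` are illustrative names.

```python
# Assumed label-to-stars mapping; verify against the labels in your own data.
RATING_STARS = {"力荐": 5, "推荐": 4, "还行": 3, "较差": 2, "很差": 1}


def sentiment_from_rating(label):
    """Return 'positive', 'neutral', 'negative', or None for unrated reviews."""
    stars = RATING_STARS.get(label)
    if stars is None:
        return None
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


def disagreements(comments):
    """Yield reviews whose polarity sign contradicts their star rating."""
    for c in comments:
        expected = sentiment_from_rating(c.get("rating"))
        score = c.get("sentiment", 0.0)
        if expected == "positive" and score < 0:
            yield c
        elif expected == "negative" and score > 0:
            yield c


# Made-up sample records in the shape produced by analyze_sentiment.
sample = [
    {"rating": "力荐", "sentiment": -0.4},
    {"rating": "很差", "sentiment": -0.2},
    {"rating": None, "sentiment": 0.9},
]
print(len(list(disagreements(sample))), "review(s) disagree with their rating")
```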

IV. Data Visualization

1. Word Cloud

Use jieba to segment the Chinese text and generate a word cloud.
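A word cloud is driven by word frequencies, so the segmented tokens need to be filtered and counted first. The sketch below works on a hand-written token list standing in for jieba's output; the stopword set is a deliberately tiny, hypothetical placeholder and `count_keywords` is an illustrative name.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"电影", "一个", "没有", "就是"}


def count_keywords(tokens, stopwords=STOPWORDS, top_n=100):
    """Turn segmented tokens into (word, count) pairs, most frequent first."""
    kept = [
        t.strip() for t in tokens
        # Drop whitespace, punctuation, and single-character tokens, plus stopwords
        if len(t.strip()) > 1 and t.strip() not in stopwords
    ]
    return Counter(kept).most_common(top_n)


# Hand-written tokens standing in for the output of jieba.cut.
tokens = ["希望", "是", "美好", "的", "，", "希望", "让", "人", "自由", "电影"]
print(count_keywords(tokens))
```

The returned list of (word, count) pairs is the shape that the pyecharts word cloud takes as its data.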

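    from collections import Counter

    import jieba
    from pyecharts.charts import WordCloud
    from pyecharts import options as opts

    def generate_wordcloud(comments):
        text = ' '.join(comment['comment'] for comment in comments)
        # Drop whitespace and single-character tokens (mostly particles and punctuation)
        words = [w for w in jieba.cut(text) if len(w.strip()) > 1]
        # WordCloud.add expects (word, count) pairs, not a bare list of words
        data = Counter(words).most_common(200)
        wordcloud = WordCloud()
        wordcloud.add('', data, word_size_range=[20, 100])
        wordcloud.set_global_opts(title_opts=opts.TitleOpts(title='Review Word Cloud'))
        wordcloud.render('wordcloud.html')

2. Sentiment Distribution Chart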

Use pyecharts to draw the sentiment distribution as a pie chart.

    from pyecharts.charts import Pie
    from pyecharts import options as opts

    def generate_sentiment_pie(comments):
        # Count reviews by the sign of their polarity score
        positive = sum(1 for c in comments if c['sentiment'] > 0)
        neutral = sum(1 for c in comments if c['sentiment'] == 0)
        negative = sum(1 for c in comments if c['sentiment'] < 0)
        pie = Pie()
        pie.add('', [('Positive', positive), ('Neutral', neutral), ('Negative', negative)])
        pie.set_global_opts(title_opts=opts.TitleOpts(title='Sentiment Distribution'))
        pie.render('sentiment_pie.html')
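Comparing a floating-point polarity score with exactly 0 makes the neutral bucket very narrow: a score of 0.01 counts as positive. A possible refinement, sketched here with an arbitrary tolerance `eps` (an assumed value to tune, not one taken from the article), treats near-zero scores as neutral. The function names are illustrative.

```python
from collections import Counter


def bucket_polarity(score, eps=0.05):
    """Label a polarity score, treating anything within +/-eps of zero as neutral."""
    if score > eps:
        return "Positive"
    if score < -eps:
        return "Negative"
    return "Neutral"


def sentiment_counts(comments, eps=0.05):
    """Return (label, count) pairs in a fixed order, including empty buckets."""
    counts = Counter(bucket_polarity(c["sentiment"], eps) for c in comments)
    return [(label, counts.get(label, 0)) for label in ("Positive", "Neutral", "Negative")]


# Made-up sample scores.
sample = [{"sentiment": 0.5}, {"sentiment": 0.01}, {"sentiment": 0.6}]
print(sentiment_counts(sample))
```

The resulting list can be passed as the data argument of `pie.add` in place of the three hand-built tuples.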

V. Complete Code Example

Putting the steps above together gives the complete script:

    import csv
    import os
    from collections import Counter

    import jieba
    import requests
    from bs4 import BeautifulSoup
    from pyecharts import options as opts
    from pyecharts.charts import Pie, WordCloud
    from textblob import TextBlob

    # Defined at module level so get_comments can see headers even when imported
    url = 'https://movie.douban.com/subject/1292052/comments'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    def get_comments(url):
        """Fetch one comments page and return a list of review dicts."""
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')
        comments = []
        for item in soup.find_all('div', class_='comment'):
            info = item.find('span', class_='comment-info')
            rating_tag = item.find('span', class_='rating')
            text_tag = item.find('p')
            if info is None or info.a is None or text_tag is None:
                continue  # skip blocks that do not look like a review
            comments.append({
                'username': info.a.text.strip(),
                # Some users post without a rating, so the tag may be missing
                'rating': rating_tag.get('title') if rating_tag else None,
                'comment': text_tag.text.strip(),
            })
        return comments

    def save_comments(comments, filename='comments.csv'):
        """Write the reviews to data/<filename> as CSV."""
        os.makedirs('data', exist_ok=True)
        path = os.path.join('data', filename)
        with open(path, 'w', encoding='utf-8', newline='') as f:
            # csv.writer quotes fields that contain commas, quotes, or newlines
            writer = csv.writer(f)
            writer.writerow(['username', 'rating', 'comment'])
            for comment in comments:
                writer.writerow([comment['username'], comment['rating'], comment['comment']])

    def analyze_sentiment(comments):
        """Attach a polarity score in [-1.0, 1.0] to each review."""
        for comment in comments:
            # Note: TextBlob's default analyzer is built for English text
            blob = TextBlob(comment['comment'])
            comment['sentiment'] = blob.sentiment.polarity
        return comments

    def generate_wordcloud(comments):
        text = ' '.join(comment['comment'] for comment in comments)
        # Drop whitespace and single-character tokens (mostly particles and punctuation)
        words = [w for w in jieba.cut(text) if len(w.strip()) > 1]
        # WordCloud.add expects (word, count) pairs, not a bare list of words
        data = Counter(words).most_common(200)
        wordcloud = WordCloud()
        wordcloud.add('', data, word_size_range=[20, 100])
        wordcloud.set_global_opts(title_opts=opts.TitleOpts(title='Review Word Cloud'))
        wordcloud.render('wordcloud.html')

    def generate_sentiment_pie(comments):
        # Count reviews by the sign of their polarity score
        positive = sum(1 for c in comments if c['sentiment'] > 0)
        neutral = sum(1 for c in comments if c['sentiment'] == 0)
        negative = sum(1 for c in comments if c['sentiment'] < 0)
        pie = Pie()
        pie.add('', [('Positive', positive), ('Neutral', neutral), ('Negative', negative)])
        pie.set_global_opts(title_opts=opts.TitleOpts(title='Sentiment Distribution'))
        pie.render('sentiment_pie.html')

    if __name__ == '__main__':
        comments = get_comments(url)
        comments = analyze_sentiment(comments)
        save_comments(comments)
        generate_wordcloud(comments)
        generate_sentiment_pie(comments)
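The script above requests a single page. To collect more reviews, one option is a loop that fetches several page URLs with a pause between requests. The sketch below is a hedged illustration rather than part of the original script: `crawl_pages` is a made-up helper, `fetch_page` stands for any callable that takes a URL and returns a list of reviews (such as `get_comments`), and the delay range is an arbitrary choice.

```python
import random
import time


def crawl_pages(urls, fetch_page, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    fetch_page: callable taking a URL and returning a list of reviews.
    Pages that raise an exception are skipped; an empty page stops the loop.
    """
    all_comments = []
    for index, url in enumerate(urls):
        try:
            page = fetch_page(url)
        except Exception as exc:  # keep going if one page fails
            print(f"Skipping {url}: {exc}")
            continue
        if not page:
            break  # an empty page usually means there is nothing left
        all_comments.extend(page)
        if index < len(urls) - 1:
            sleep(random.uniform(min_delay, max_delay))
    return all_comments


if __name__ == "__main__":
    # Offline demo with a stand-in fetcher; swap in get_comments for real use.
    def fake_fetch(url):
        return [{"username": "demo", "rating": None, "comment": url}]

    result = crawl_pages(["page-1", "page-2"], fake_fetch, min_delay=0, max_delay=0)
    print(len(result), "reviews collected")
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.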