Simple Python Web Scraper Project Examples

wen | Python examples | 2026-06-07 06:22:29

No Experience Required: 5 Simple Python Web Scraper Projects to Get You Started with Data Collection

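Contents

Why Learn Python Web Scraping?
Scrape a Page's Title and Text
Scrape the Douban Movie Top 250
Collect Weather Data
Scrape the Zhihu Hot List
Download Wallpaper Images
FAQ and Common Pitfalls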

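Why Learn Python Web Scraping?

Decisions are increasingly driven by data. Whether you are doing market research, competitor analysis, or personal study, being able to collect public data from the internet efficiently has become a core skill. With its concise syntax and rich libraries (such as Requests, BeautifulSoup, and Scrapy), Python is the scraping language most beginners choose first.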

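Key questions:

Q: Is web scraping illegal?
A: Scraping is generally lawful when you collect only public data, follow the site's robots.txt, avoid excessive requests, and do not harvest private user information. Every example in this tutorial uses a public data source.

Q: Can a complete beginner learn this?
A: Yes. The examples here assume no prior knowledge. You only need Python installed (3.8 or later recommended) and any code editor, such as VS Code. The examples also use the third-party packages requests and beautifulsoup4, which can be installed with pip install requests beautifulsoup4.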
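
The robots.txt rule mentioned in the first answer can be checked in code before any page is requested. The following is a minimal sketch using the standard library's urllib.robotparser. The helper name is_allowed and the sample rules are made up for illustration, and a real crawler would load the site's actual robots.txt rather than a hard-coded string.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent='*'):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, 'https://example.com/index.html'))    # True
print(is_allowed(SAMPLE_ROBOTS, 'https://example.com/private/data'))  # False
```

To check a live site, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.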

Project 1: Scrape a Page's Title and Text

Goal: Fetch the title of any web page and the first 100 characters of its body text.

Code:

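import requests
from bs4 import BeautifulSoup


def get_web_info(url):
    """Return the page title and the first 100 characters of its paragraph text."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')
    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string if soup.title and soup.title.string else 'No title'
    # Collect the text of every <p> element on the page
    paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
    content = ' '.join(paragraphs)[:100] + '...'
    return title, content


# Test
url = 'https://example.com'
title, content = get_web_info(url)
print(f'Title: {title}\nSummary: {content}')

Sample output:
Title: Example Domain
Summary: This domain is for use in illustrative examples in documents. You may use this domain in literature without...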

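Key points:

The User-Agent header makes the request look like it comes from a browser, which lowers the chance of the server blocking it.
BeautifulSoup extracts the paragraph text through find_all('p').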
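
One detail of the listing above is that its slice always appends '...', even when the page has fewer than 100 characters of text. A possible refinement is sketched below. The helper name summarize is hypothetical and is not part of the original code.

```python
def summarize(text, limit=100):
    """Collapse whitespace and append '...' only if the text was actually cut."""
    text = ' '.join(text.split())  # normalize runs of spaces and newlines
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + '...'


print(summarize('A short page.'))     # A short page.
print(len(summarize('x' * 150)))      # 103 (100 characters plus the ellipsis)
```

In get_web_info, the line that builds content could then call summarize(' '.join(paragraphs)) instead of slicing directly.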

Project 2: Scrape the Douban Movie Top 250

Goal: Collect the title, rating, and link of each movie on the Douban Top 250 chart.

Code walkthrough:

import time

import requests
from bs4 import BeautifulSoup


def crawl_douban_top250():
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    headers = {'User-Agent': 'Mozilla/5.0'}
    movies = []
    for start in range(0, 250, 25):  # pagination: 25 movies per page
        url = base_url.format(start)
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Each movie sits in its own <div class="item"> block
        for item in soup.find_all('div', class_='item'):
            title = item.find('span', class_='title').text
            rating = item.find('span', class_='rating_num').text
            link = item.find('a')['href']
            movies.append({'title': title, 'rating': rating, 'link': link})
        time.sleep(1)  # pause between pages to keep the request rate low
    return movies


result = crawl_douban_top250()
for movie in result[:5]:
    print(f"{movie['title']} - Rating: {movie['rating']}")

Sample output:
肖申克的救赎 - Rating: 9.7
霸王别姬 - Rating: 9.6

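Note: Douban has anti-scraping measures. Sending a User-Agent header and walking through the pages in order is usually enough to get past them, but keep your request rate under control and do not use the data for commercial purposes.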
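
The pagination and rate-limiting advice can be looked at separately from the network code. The sketch below builds the ten page URLs from the same start offset used above and computes a randomized pause between requests. The helper names page_urls and polite_delay and the delay values are assumptions chosen for illustration, not limits published by Douban.

```python
import random

BASE_URL = 'https://movie.douban.com/top250?start={}&filter='


def page_urls(total=250, page_size=25):
    """Build one URL per page by stepping the start offset."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


def polite_delay(base=1.0, jitter=1.0):
    """Return a pause in seconds: a fixed base plus random jitter."""
    return base + random.uniform(0, jitter)


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])    # https://movie.douban.com/top250?start=225&filter=
```

Inside a crawl loop, time.sleep(polite_delay()) after each request spaces the requests out irregularly instead of sending them back to back.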

Project 3: Collect Weather Data

Goal: Fetch the 7-day weather forecast for a given city.

Using an API to keep things simple:

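import requests


def get_weather(city):
    # Uses the QWeather (和风天气) free API as an example; register to obtain a key
    api_key = 'YOUR_API_KEY'
    # The 7d endpoint returns a 7-day forecast (the 3d endpoint covers only 3 days)
    url = 'https://devapi.qweather.com/v7/weather/7d'
    params = {'location': city, 'key': api_key}
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()
    # 'daily' may be absent if the API rejects the request (for example, a bad key)
    for day in data.get('daily', []):
        print(f"Date: {day['fxDate']}, high: {day['tempMax']}°C, low: {day['tempMin']}°C")


# Example: Beijing
get_weather('101010100')  # city code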

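Alternative: You could also scrape the static pages of a weather forecast site (such as weather.com), but an API is more stable and is the access route the provider has explicitly authorized.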
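
Part of why an API is more stable is that the response is structured JSON rather than HTML that changes with the page layout. The sketch below parses a made-up payload offline. Only the field names (daily, fxDate, tempMax, tempMin) come from the code above, while the values, the string-typed temperatures, and the helper name summarize_forecast are assumptions for illustration.

```python
import json

# Made-up payload; only the field names are taken from the code above
SAMPLE_RESPONSE = """
{
  "daily": [
    {"fxDate": "2026-06-07", "tempMax": "31", "tempMin": "19"},
    {"fxDate": "2026-06-08", "tempMax": "29", "tempMin": "18"}
  ]
}
"""


def summarize_forecast(data):
    """Turn the 'daily' list into (date, low, high) tuples with integer temperatures."""
    rows = []
    for day in data.get('daily', []):  # an empty list if 'daily' is missing
        rows.append((day['fxDate'], int(day['tempMin']), int(day['tempMax'])))
    return rows


forecast = summarize_forecast(json.loads(SAMPLE_RESPONSE))
for date, low, high in forecast:
    print(f"{date}: {low} to {high}°C")
```

Returning tuples instead of printing inside the function makes the result easy to sort, filter, or write to a CSV file later.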

Project 4: Scrape the Zhihu Hot List

Goal: Fetch the top 10 questions on Zhihu's real-time hot list.

A technique for dynamic pages:

import requests


def get_zhihu_hot():
    url = 'https://www.zhihu.com/api/v3/feed/topstory/hot-lists/total?limit=10'
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.zhihu.com/hot'
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    data = response.json()  # the endpoint returns JSON, so no HTML parsing is needed
    for item in data['data']:
        title = item['target']['title']
        hot = item['target']['metrics_area']
        print(f"{title} | Heat: {hot}")


get_zhihu_hot()

Sample output:
如何看待2025年AI发展趋势？ | Heat: 10.2亿
建议专家不要建议 | Heat: 8.7亿

How it works: Zhihu's hot-list data is returned as JSON by an AJAX endpoint, so it can be parsed directly without rendering any JavaScript.
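
JSON returned by an undocumented endpoint can change shape without notice, so direct indexing such as item['target']['title'] may raise KeyError. The sketch below shows one defensive way to walk nested JSON. The helper name dig and the sample payload are hypothetical and do not reflect Zhihu's actual response.

```python
def dig(obj, *keys, default=None):
    """Follow keys (dict keys or list indexes) into nested data, or return default."""
    for key in keys:
        if isinstance(obj, dict) and key in obj:
            obj = obj[key]
        elif isinstance(obj, list) and isinstance(key, int) and -len(obj) <= key < len(obj):
            obj = obj[key]
        else:
            return default
    return obj


# Made-up payload shaped like the fields read in the code above
sample = {
    'data': [
        {'target': {'title': 'Sample question A'}},
        {'target': {}},  # an entry with the title missing
    ]
}

titles = [dig(item, 'target', 'title', default='(no title)') for item in sample['data']]
print(titles)  # ['Sample question A', '(no title)']
```

With this helper, one malformed entry yields a placeholder instead of aborting the whole loop.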

Project 5: Download Wallpaper Images

Goal: Download a high-resolution wallpaper from Unsplash.

Code:

import requests


def download_wallpaper(url, filename):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 200:
        # Write the raw bytes; 'wb' keeps the image data unmodified
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f'Downloaded: {filename}')
    else:
        print(f'Download failed (HTTP {response.status_code})')


# Example: a public Unsplash image
img_url = 'https://images.unsplash.com/photo-1682687220-bb5c1a1e7e4a'
download_wallpaper(img_url, 'wallpaper.jpg')

Going further: To download images in bulk, you need to parse the image links out of the HTML and deal with lazy loading.
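
The link-extraction step might look like the sketch below, which uses only the standard library. It assumes the page follows the common lazy-loading convention of putting the real address in a data-src attribute and a placeholder in src. That attribute name, the class name ImageLinkParser, and the sample HTML are illustrative assumptions, since sites differ in how they implement lazy loading.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect absolute image URLs, preferring data-src over src."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attr = dict(attrs)
        # Lazy-loaded images often keep the real URL in data-src (assumed convention)
        src = attr.get('data-src') or attr.get('src')
        if src:
            self.links.append(urljoin(self.base_url, src))  # resolve relative paths


# Made-up HTML for illustration
SAMPLE_HTML = """
<img src="placeholder.gif" data-src="https://example.com/photos/a.jpg">
<img src="/photos/b.png">
"""

parser = ImageLinkParser('https://example.com/gallery/')
parser.feed(SAMPLE_HTML)
print(parser.links)
# ['https://example.com/photos/a.jpg', 'https://example.com/photos/b.png']
```

Each collected link can then be passed to download_wallpaper. Pages that insert images through JavaScript after loading will not expose them in the raw HTML, so this approach only covers links present in the page source.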