How to Count the Most Frequent Words in a Text with Python's collections.Counter

Guest · Natural Language Processing · 2026-06-05 12:55:15

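Counting the Most Frequent Words in a Text with Python's collections.Counter: From Basics to Practice

📖 Table of Contents

Why Counter? The pain points of frequency counting and what Counter offers
Core Counter usage in detail: basic functions and data operations
Hands-on example: extracting frequent words from a TXT file, with complete code and a step-by-step walkthrough
Advanced techniques: excluding stop words, handling Chinese, performance tuning
Frequently asked questions (Q&A): six common questions answered in one place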

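Why choose collections.Counter?

When we need to find out which word appears most often in a text, the traditional approach is to loop over the words by hand, count them in a dictionary, and then sort the result. With collections.Counter from Python's standard library, those steps collapse into a single line: Counter(text.split()).most_common(10).

What Counter is: a subclass of dict designed for counting hashable objects. It handles the following automatically:

Iterates over an iterable (such as a list or a string)
Counts how many times each element occurs
Provides .most_common(n), which returns the n most frequent elements together with their counts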

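Compared with a hand-written implementation, Counter needs noticeably less code, and in CPython the counting loop is implemented in C, so it is usually faster as well. More importantly, it supports Top-N queries out of the box, so you do not have to write the sorting and truncation logic yourself.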
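To make the comparison concrete, here is a minimal sketch that counts the same words both ways. The sample sentence and the variable names are illustrative assumptions, not part of any particular project.

```python
from collections import Counter

text = "apple banana apple cherry banana apple"
words = text.split()

# Manual approach: count with a plain dict, then sort and truncate.
manual_counts = {}
for word in words:
    manual_counts[word] = manual_counts.get(word, 0) + 1
manual_top2 = sorted(manual_counts.items(), key=lambda item: item[1], reverse=True)[:2]

# Counter approach: counting, sorting, and truncation in one expression.
counter_top2 = Counter(words).most_common(2)

print(manual_top2)   # [('apple', 3), ('banana', 2)]
print(counter_top2)  # [('apple', 3), ('banana', 2)]
```

Both versions give the same result; the Counter version simply leaves less code to get wrong.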

Core Counter usage in detail

1 Basic operations

from collections import Counter
text = "apple banana apple cherry banana apple"
words = text.split()
counter = Counter(words)
# Print all word counts
print(counter)  # Counter({'apple': 3, 'banana': 2, 'cherry': 1})
# Get the 2 most frequent words
print(counter.most_common(2))  # [('apple', 3), ('banana', 2)]

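2 Combining and updating

Counter supports addition and subtraction, which makes it well suited to merging counts from several texts:

counter1 = Counter({"a": 3, "b": 2})
counter2 = Counter({"b": 1, "c": 4})
# Merge the counts
result = counter1 + counter2  # Counter({'c': 4, 'a': 3, 'b': 3})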
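Subtraction and in-place updating behave a little differently from addition, so a short sketch may help. It reuses the two counters from the example above; the name running_total is only illustrative.

```python
from collections import Counter

counter1 = Counter({"a": 3, "b": 2})
counter2 = Counter({"b": 1, "c": 4})

# Binary subtraction keeps only positive counts: "c" would be -4, so it is dropped.
difference = counter1 - counter2
print(difference)  # Counter({'a': 3, 'b': 1})

# update() adds counts in place instead of building a new Counter.
running_total = Counter()
running_total.update(counter1)
running_total.update(["c", "c", "a"])  # any iterable of hashable items works
print(running_total)  # Counter({'a': 4, 'b': 2, 'c': 2})

# subtract() also works in place, but keeps zero and negative counts.
running_total.subtract(counter2)
print(running_total)  # Counter({'a': 4, 'b': 1, 'c': -2})
```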

3 Getting a list of elements

Use .elements() to get each key repeated as many times as its count:

list(Counter({"a":2, "b":1}).elements())  # ['a', 'a', 'b']
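One detail worth seeing in action is how .elements() treats counts below one. The counter contents here are made up for illustration.

```python
from collections import Counter

counts = Counter({"a": 2, "b": 1, "c": 0, "d": -1})

# elements() repeats each key by its count and skips counts below one.
expanded = list(counts.elements())
print(expanded)  # ['a', 'a', 'b']

# Feeding the expanded list back into Counter recovers the positive counts.
print(Counter(expanded))  # Counter({'a': 2, 'b': 1})
```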

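Hands-on: counting frequent words in a TXT file

Suppose we have an English file article.txt of about 1,000 words, and we want the 10 most frequent words while ignoring common stop words such as "the" and "and".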

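Complete code

from collections import Counter
import re
# 1. Define the stop word list (extend as needed)
stop_words = {"the", "a", "an", "in", "on", "for", "and", "or", "to", "of", "is", "it", "that", "this", "with", "as", "at", "be", "by", "was", "are"}
# 2. Read the file
with open("article.txt", "r", encoding="utf-8") as f:
    text = f.read()
# 3. Clean the text: keep only letters and whitespace, then lowercase
clean_text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
# 4. Split into words and filter out stop words
words = [word for word in clean_text.split() if word not in stop_words and len(word) > 1]
# 5. Count and take the top 10
counter = Counter(words)
top10 = counter.most_common(10)
# 6. Print the result
print("Top 10 words:")
for rank, (word, count) in enumerate(top10, 1):
    print(f"{rank}. {word}: {count}")

Code walkthrough

Regex cleaning: re.sub(r'[^a-zA-Z\s]', '', text) removes punctuation and digits, so that "apple." and "apple" are not treated as different words. Note that it also removes apostrophes and hyphens, so "don't" becomes "dont".
Length filter: len(word) > 1 drops single-letter tokens that the stop word list may not cover (for example "i").
Case normalization: every word is lowercased so that "Apple" and "apple" are counted together.

Sample output

Top 10 words:
1. data: 25
2. python: 18
3. analysis: 14
4. learning: 12
...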
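The cleaning regex in the example deletes apostrophes, which merges "don't" into "dont". One possible alternative, sketched below, is to extract tokens with re.findall instead of deleting characters. The function name top_words and the sample sentence are assumptions for illustration, and the pattern only handles the straight ASCII apostrophe.

```python
import re
from collections import Counter

def top_words(text, stop_words, n=10):
    """Return the n most common words, keeping internal apostrophes."""
    # [a-z]+(?:'[a-z]+)* matches "don't" as one token instead of "dont".
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    words = [w for w in tokens if w not in stop_words and len(w) > 1]
    return Counter(words).most_common(n)

sample = "Data is the new oil. Don't waste data; the data won't wait."
print(top_words(sample, {"the", "is"}, n=2))
# [('data', 3), ('new', 1)]
```

Wrapping the steps in a function also makes it easier to reuse the same cleaning rules across several files.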

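Advanced techniques: Chinese text, performance, and large data

1 Processing Chinese text

Chinese text has no spaces between words, so it needs a third-party word segmentation library such as jieba:

import jieba
from collections import Counter
text = "Python是一种高效的编程语言，适合数据分析和机器学习。"
words = jieba.lcut(text)  # ['Python', '是', '一种', '高效', '的', '编程', '语言', '，'...]
# Example stop word set (placeholder; replace with your own list)
stop_chinese = {"一种", "适合"}
# Drop punctuation and single-character tokens (keeping words of two or more characters is recommended)
words = [w for w in words if len(w) > 1 and w not in stop_chinese]
counter = Counter(words)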
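The len(w) > 1 filter lets through multi-character punctuation and numbers. A stricter filter could look like the sketch below. The token list is hard-coded as an assumption about what a segmenter such as jieba.lcut might return, so the example runs without jieba installed, and the stop list is hypothetical.

```python
import re
from collections import Counter

# Tokens as a segmenter such as jieba.lcut might return them (hard-coded here,
# an assumption, so the sketch runs without jieba installed).
tokens = ["Python", "是", "一种", "高效", "的", "编程", "语言", "，",
          "适合", "数据", "分析", "和", "机器", "学习", "……", "2024", "数据"]

stop_chinese = {"一种", "适合"}  # hypothetical stop list; extend as needed

# Keep tokens of two or more characters made only of CJK ideographs or ASCII letters.
word_pattern = re.compile(r"[\u4e00-\u9fffA-Za-z]{2,}")
words = [t for t in tokens if word_pattern.fullmatch(t) and t not in stop_chinese]

print(Counter(words).most_common(1))  # [('数据', 2)]
```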

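2 Processing very large texts (streaming reads)

If the file is very large (for example several GB), do not read it into memory all at once; iterate over it line by line instead: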

from collections import Counter
counter = Counter()
with open("big_text.txt", "r", encoding="utf-8") as f:
    for line in f:
        words = line.strip().split()
        counter.update(words)  # update the counts incrementally
top5 = counter.most_common(5)

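With this approach, memory use depends on the number of distinct words rather than on the size of the file: only the words and their counts are stored, not the full text.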
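The streaming loop above splits on whitespace only. A minimal sketch that adds lowercasing and punctuation handling per line might look like this. The function name count_words_streaming is an assumption, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io
import re
from collections import Counter

def count_words_streaming(lines):
    """Count lowercase alphabetic words from any iterable of text lines."""
    counter = Counter()
    for line in lines:
        # Tokenize one line at a time so the whole text is never held in memory.
        counter.update(re.findall(r"[a-z]+", line.lower()))
    return counter

# io.StringIO stands in for an open file here; a real file object iterates the same way.
fake_file = io.StringIO("The cat sat.\nThe cat ran!\nA dog sat.\n")
print(count_words_streaming(fake_file).most_common(2))
# [('the', 2), ('cat', 2)]
```

Because the function accepts any iterable of lines, the same code works for an open file, a list of strings, or a generator.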

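3 Performance tips

Prefer .update() over repeated addition: counter.update(iterable) modifies the counter in place, whereas counter = counter + Counter(iterable) builds a new Counter on every call, which gets slower as the counter grows.
Faster ranking: if you only need the Top-N, .most_common(n) (which uses heapq.nlargest internally) avoids fully sorting all entries and then slicing.
Memory control: when the number of distinct items is extremely large, heapq.nlargest on a plain dict is an alternative, but Counter is sufficient for most scenarios.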
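The tips above can be sketched as follows: the same Top-N taken through Counter and through heapq.nlargest on a plain dict, plus in-place accumulation with update(). The sample data is illustrative only.

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter("abracadabra")  # counts characters: a=5, b=2, r=2, c=1, d=1

# Top-N via Counter.
top_counter = counts.most_common(1)

# Top-N via heapq.nlargest on a plain dict holding the same counts.
plain = dict(counts)
top_heap = heapq.nlargest(1, plain.items(), key=itemgetter(1))

print(top_counter)  # [('a', 5)]
print(top_heap)     # [('a', 5)]

# In-place accumulation: update() mutates the existing Counter.
total = Counter()
for chunk in (["x", "y"], ["x"], ["z", "x"]):
    total.update(chunk)
print(total["x"])  # 3
```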

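Frequently asked questions (Q&A)

Q1: Can Counter count Chinese characters?
Yes, but note that calling Counter("你好世界") directly on a string counts each character as a separate element. To count words, segment the text first with a library such as jieba.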
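A quick illustration of the character-level behavior, using a made-up string:

```python
from collections import Counter

# Passing a string directly counts individual characters, not words.
char_counts = Counter("你好世界你好")
print(char_counts.most_common(2))  # [('你', 2), ('好', 2)]
```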

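Q2: How do I ignore case?
Convert the text to lowercase before splitting: clean_text = text.lower(). For Chinese text, which has no letter case, this step can be skipped.

Q3: Why does my result contain punctuation?
Because the text was not cleaned before splitting. Remove punctuation with re.sub(r'[^\w\s]', '', text), or split on runs of non-word characters with re.split(r'\W+', text). Note that \w also matches digits and the underscore, and that re.split can return empty strings at the start or end of the list.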
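Both options from Q3, applied to a short illustrative sentence:

```python
import re

text = "Hello, world! Hello."

# Option 1: delete punctuation, then split on whitespace.
cleaned = re.sub(r"[^\w\s]", "", text)
print(cleaned.split())  # ['Hello', 'world', 'Hello']

# Option 2: split on runs of non-word characters; drop the empty strings
# that appear when the text starts or ends with punctuation.
parts = [p for p in re.split(r"\W+", text) if p]
print(parts)  # ['Hello', 'world', 'Hello']
```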

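Q4: What if I want to count N-grams of different lengths (such as bigrams or trigrams)?
Counter only counts individual elements, so you need to build the list of N-grams yourself. For bigrams: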

words = "my name is Tom".split()
bigrams = [f"{words[i]}_{words[i+1]}" for i in range(len(words)-1)]
Counter(bigrams)
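The same idea generalizes to any N. The helper below is a sketch (the name ngrams is an assumption) that uses tuples as keys rather than underscore-joined strings, which avoids ambiguity when a word itself contains an underscore.

```python
from collections import Counter

def ngrams(words, n):
    """Return the list of n-grams (as tuples) from a list of words."""
    # zip over n staggered slices: words[0:], words[1:], ..., words[n-1:]
    return list(zip(*(words[i:] for i in range(n))))

words = "to be or not to be".split()
print(Counter(ngrams(words, 2)).most_common(1))  # [(('to', 'be'), 2)]
print(ngrams(words, 3)[0])                        # ('to', 'be', 'or')
```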

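Q5: Which is better, Counter or defaultdict(int)?
Both give the same counts, but Counter adds convenience methods such as .most_common() and .elements() and supports addition and subtraction directly. For plain counting the performance difference is usually small, so measure it if it matters; in most scenarios Counter is the more concise choice.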
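One behavioral difference is easy to overlook: what happens when you look up a key that was never counted. A minimal sketch with made-up data:

```python
from collections import Counter, defaultdict

words = ["a", "b", "a"]

counter = Counter(words)
dd = defaultdict(int)
for w in words:
    dd[w] += 1

print(dict(counter) == dict(dd))  # True

# Looking up a missing key: Counter returns 0 without storing the key,
# while defaultdict(int) inserts the key with value 0.
print(counter["zzz"], "zzz" in counter)  # 0 False
print(dd["zzz"], "zzz" in dd)            # 0 True
```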

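Q6: How can I visualize the Counter result?
A common approach is to convert .most_common(10) into a dict or a pandas DataFrame and then draw a bar chart with matplotlib or a word cloud with wordcloud. The snippet assumes counter is a Counter built as in the earlier examples.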

import matplotlib.pyplot as plt
data = counter.most_common(10)
plt.bar([w[0] for w in data], [w[1] for w in data])
plt.show()

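collections.Counter is one of the most basic and efficient tools for text analysis in Python. It handles the core task of counting frequent words with very little code, and it extends to real-world use through cleaning, filtering, and stop word handling. Whether you are working with English novels, Chinese news, or social media data, combining Counter with a segmentation library gives you a solid entry point into text analysis.

Tip: in real projects, keep the stop word list in a stopwords.txt file and load it at runtime instead of hard-coding it. For professional SEO keyword analysis, you can go further and combine the counts with TF-IDF to weight keywords more accurately.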
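Loading the stop word list from a file could look like the sketch below. The file format (one word per line, with '#' marking comment lines) and the function name load_stop_words are assumptions; adapt them to however your own list is stored.

```python
from pathlib import Path

def load_stop_words(path):
    """Read one stop word per line; blank lines and '#' comments are skipped."""
    stop_words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            stop_words.add(word)
    return stop_words

# Usage, assuming stopwords.txt exists next to the script:
# stop_words = load_stop_words("stopwords.txt")
```

Returning a set keeps the membership test in the filtering step fast even when the list grows to thousands of entries.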

Tags: Counter, text