An English Keyword Extractor (Yake)

Published on Aug. 22, 2023, 12:11 p.m.

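There are plenty of options for keyword extraction, and the recent KeyBERT works well, but bringing in deep learning just to pull out a few keywords is rather resource-hungry. Fortunately, I came across Yake.
As its authors describe it, Yake is a lightweight, unsupervised automatic keyword extraction method that selects the most important keywords of a text using statistical features extracted from that single document. The system does not need to be trained on a particular set of documents, and it does not depend on dictionaries, external corpora, text size, language, or domain. To demonstrate the merits of the approach, the authors compared it against ten state-of-the-art unsupervised methods (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank, and MultipartiteRank) and one supervised method (KEA). They report that experiments on 20 datasets (see the benchmark section of the project documentation) show Yake significantly outperforming these methods across collections of different sizes, languages, and domains. Besides the Python package, the project also provides a demo, an API, and a mobile app.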
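
To make the idea of "statistical features from a single document" concrete, here is a toy sketch. It is not YAKE's actual scoring; the function name `toy_keyword_scores`, the tiny stopword list, and the formula are all assumptions made for illustration. It ranks single words using only two statistics from the document itself: how often a word occurs and how early it first appears.

```python
import re
from collections import Counter


def toy_keyword_scores(text, top=5):
    """Rank single words by two single-document statistics.

    Illustrative only: this is NOT the YAKE algorithm. A lower score
    means "more important", a convention chosen for this sketch.
    """
    words = re.findall(r"[a-z]+", text.lower())
    # Tiny hypothetical stopword list; a real extractor would use a fuller one.
    stopwords = {"a", "an", "and", "the", "of", "is", "that",
                 "in", "to", "for", "on", "it"}
    freq = Counter(w for w in words if w not in stopwords)

    # Record the index of the first occurrence of each candidate word.
    first_pos = {}
    for i, w in enumerate(words):
        if w in freq and w not in first_pos:
            first_pos[w] = i

    # Frequent words and words that appear early get a smaller score.
    scores = {
        w: (1 + first_pos[w] / len(words)) / freq[w]
        for w in freq
    }
    return sorted(scores.items(), key=lambda kv: kv[1])[:top]
```

No training data or external corpus is involved: everything the function needs comes from the input text, which is the property the description above emphasizes.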

Rationale

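As information grows in complexity and scale, extracting keywords from text has become a challenge for individuals and organizations. The need to automate this task, so that text can be processed in a timely and adequate manner, has led to the emergence of automatic keyword extraction tools. Despite the progress made, there is still a clear lack of multilingual online tools for automatically extracting keywords from single documents. YAKE! is a novel feature-based system for multilingual keyword extraction that supports texts of different sizes, domains, and languages. Unlike other approaches, YAKE! does not rely on dictionaries or thesauri, and it is not trained against any corpus. Instead, it follows an unsupervised approach built on features extracted from the text itself, which makes it applicable to documents written in different languages without any further knowledge. This is useful for a large number of tasks, and for the many situations in which access to training corpora is limited or restricted.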

YAKE! is available online [ http://yake.inesctec.pt ], on Google Play, as an open-source Python package [ https://github.com/LIAAD/yake ], and as an API.

REST API server in a Docker container
This installation gives you a mirror of the original YAKE! REST API.

docker run -p 5000:5000 -d liaad/yake-server:latest
Once started, the container runs in the background at http://127.0.0.1:5000. To browse the YAKE! API documentation, visit http://127.0.0.1:5000/apidocs/.

You can test the RESTful API with curl:

curl -X POST "http://localhost:5000/yake/" -H "accept: application/json" -H "Content-Type: application/json" \
-d @- <<'EOF'
{
  "language": "en",
  "max_ngram_size": 3,
  "number_of_keywords": 10,
  "text": "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists ..."
}
EOF
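
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above: the same URL, headers, and JSON fields. The helper names `build_yake_request` and `extract_keywords` are made up for this example, and since the response format is not shown here, the code simply returns whatever JSON the server sends back without assuming its shape.

```python
import json
import urllib.request


def build_yake_request(text, language="en", max_ngram_size=3,
                       number_of_keywords=10,
                       url="http://localhost:5000/yake/"):
    """Build a POST request matching the curl example (hypothetical helper)."""
    payload = {
        "language": language,
        "max_ngram_size": max_ngram_size,
        "number_of_keywords": number_of_keywords,
        "text": text,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_keywords(text, **options):
    """Send the request to the local container and return the parsed JSON.

    Requires the Docker container from the previous step to be running.
    """
    request = build_yake_request(text, **options)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, calling `extract_keywords("Sources tell us that Google is acquiring Kaggle ...")` should return the server's JSON response; printing it once is the easiest way to see what fields it contains.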