TextTiling: A Text Segmentation Algorithm

Published on Aug. 22, 2023, 12:11 p.m.

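TextTiling exploits patterns of lexical co-occurrence and distribution (word-level patterns, not part-of-speech patterns). The algorithm has three parts: 1. split the document into sentence-sized units; 2. compute a score for each unit; 3. plot the units against their scores and read the subtopic boundaries off the resulting graph, where boundaries are assumed to lie at the deepest valleys.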
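The three steps can be sketched in plain Python. This is a minimal illustration, not the NLTK implementation. The function names, the fixed-size word blocks used as units, the cosine similarity, and the cutoff of mean plus half a standard deviation are all assumptions chosen to keep the example short.

```python
import math
import re
import statistics
from collections import Counter


def pseudo_sentences(text, w=5):
    """Step 1: split the text into units of w words each (hypothetical helper)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + w] for i in range(0, len(words), w)]


def cosine(a, b):
    """Cosine similarity of two word-count vectors, rounded to hide float noise."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return round(dot / norm, 6) if norm else 0.0


def gap_scores(units, k=2):
    """Step 2: score each gap by comparing the k units before it with the k units after it."""
    scores = []
    for gap in range(1, len(units)):
        left = Counter(t for u in units[max(0, gap - k):gap] for t in u)
        right = Counter(t for u in units[gap:gap + k] for t in u)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each gap: how far its score sits below the nearest peak on each side."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths


def boundaries(depths):
    """Step 3: keep gaps whose depth exceeds the cutoff (assumed: mean + std / 2)."""
    if not depths:
        return []
    cutoff = statistics.mean(depths) + statistics.pstdev(depths) / 2
    # Gap i lies between units[i] and units[i + 1], so the cut index is i + 1.
    return [i + 1 for i, d in enumerate(depths) if d > cutoff]


def text_tiling(text, w=5, k=2):
    """Run the three steps and return the segments as strings."""
    units = pseudo_sentences(text, w)
    cuts = boundaries(depth_scores(gap_scores(units, k)))
    segments, start = [], 0
    for cut in cuts + [len(units)]:
        segments.append(" ".join(word for u in units[start:cut] for word in u))
        start = cut
    return segments
```

For example, `text_tiling("cat dog " * 8 + "sun moon " * 8, w=4, k=2)` splits the toy text at the point where the vocabulary changes. Real text needs larger values of `w` and `k`, plus stop-word removal and smoothing, which this sketch leaves out.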

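import nltk  # the default tokenizer also needs the stopwords corpus: nltk.download("stopwords")
# Note: TextTilingTokenizer expects a longer text whose paragraphs are separated by blank
# lines ("\n\n"); a single short sentence like this one is likely to raise ValueError.
text = "I'm messing around with this one myself just now for the same reason you are and had the same"
ttt = nltk.tokenize.TextTilingTokenizer()
tiles = ttt.tokenize(text)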

Reference

https://www.nltk.org/_modules/nltk/tokenize/texttiling.html
