Bert for Topic Modeling (Bert vs LDA)

Published on Aug. 22, 2023, 12:10 p.m.

In this post I will do topic modeling both with LDA (Latent Dirichlet Allocation, which was designed specifically for this purpose) and with word embeddings. I will try different combinations of algorithms (TF-IDF, LDA and Bert) and different dimensionality reduction techniques (PCA, t-SNE, UMAP).
Original notebook: https://github.com/mcelikkaya/medium_articles2/blob/main/bertlda_topic_modeling.ipynb

from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
import umap.umap_ as umap
import string 
import time
from gensim import corpora
import gensim
ntopic = 20
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mcelikkaya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

True
!pip install stop_words
Requirement already satisfied: stop_words in c:\users\mcelikkaya\anaconda3\envs\p37_tensor23\lib\site-packages (2018.7.23)

from stop_words import get_stop_words

stop_words = get_stop_words('en')
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
from pprint import pprint
num_topics = len( set(newsgroups_train.target_names) )
print("num_topics : ",num_topics )
pprint(list(newsgroups_train.target_names))
num_topics :  20
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
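Each training document also carries its ground-truth category. The clustering below does not use these labels, but as a quick illustrative sketch (doc_id and the expected output are my own example), a document's true label can be looked up like this:

#Sketch: look up the true newsgroup of a single training document
doc_id = 10
label_id = newsgroups_train.target[doc_id]
print(newsgroups_train.target_names[label_id])  #expected: 'rec.motorcycles' for the Ducati post shown below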
#Append sentences from newsgroup documents to raw sentences
raw_sentences = []
for s in newsgroups_train.data:
  raw_sentences.append( s )
import re

def only_letters(tested_string):
    for letter in tested_string:
        if letter not in "abcdefghijklmnopqrstuvwxyz":
            return False
    return True

#I just did an ad hoc cleaning: the documents contain some non-English
#characters, so I use the only_letters method above to filter words
#instead of Python's default isalpha method
def clean_data(s): 
    s = s.replace(">","").lower()
    if "lines:" in s :
        index = s.index("lines:")
        s = s[index+10:] 

    word_list = word_tokenize(s)
    cleaned = []
    for w in word_list:
        if w not in stop_words:
            if w in string.punctuation or only_letters(w):
                if w in string.punctuation or len( set(w) ) > 1:
                    cleaned.append( w)
    return " ".join(cleaned) ,cleaned           

#clean each document and return both the cleaned sentences and their token lists
def build_data(docs):

    sentences = []    # cleaned sentences
    token_lists = []  # word tokens of each sentence

    for i in range(len(docs)):
        sentence,token_list = clean_data(docs[i])
        if token_list: #if not all items eleminated
            sentences.append(sentence)
            token_lists.append(token_list)

    return sentences, token_lists

print("Number of raw sentences ", len(raw_sentences))
Number of raw sentences  11314
print("Sample raw sentence \n", raw_sentences[10])
Sample raw sentence 
 From: [email protected] (Irwin Arnstein)
Subject: Re: Recommendation on Duc
Summary: What's it worth?
Distribution: usa
Expires: Sat, 1 May 1993 05:00:00 GMT
Organization: CompuTrac Inc., Richardson TX
Keywords: Ducati, GTS, How much? 
Lines: 13

I have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs
very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel.  The shop will fix trans and oil 
leak.  They sold the bike to the 1 and only owner.  They want $3495, and
I am thinking more like $3K.  Any opinions out there?  Please email me.
Thanks.  It would be a nice stable mate to the Beemer.  Then I'll get
a jap bike and call myself Axis Motors!

-- 
-----------------------------------------------------------------------
"Tuba" (Irwin)      "I honk therefore I am"     CompuTrac-Richardson,Tx
[email protected]    DoD #0826          (R75/6)
-----------------------------------------------------------------------
sentences, token_lists = build_data(raw_sentences )
print(len(sentences))

11300
print("Sentence after cleaning :\n", sentences[10])
Sentence after cleaning :
 line ducati model clock . runs well , paint faded , leaks bit oil pops hard accel . shop will fix trans oil leak . sold bike owner . want $ , thinking like $ . opinions ? please email . thanks . nice stable mate beemer . get jap bike call axis motors ! - tuba ( irwin ) honk therefore , tx irwin @ dod # ( ) -
#get tfidf of documents
def get_tfidf_embedding(items):
  tfidf = TfidfVectorizer()
  embeddings = tfidf.fit_transform(items)
  return embeddings

#Generate embedding with tfidf
embedding_tf_idf = get_tfidf_embedding( sentences )
print("Shape of sentences applied tf-idf :", embedding_tf_idf.shape)
Shape of sentences applied tf-idf : (11300, 70990)

print("Type of tf-idf vector :", type( embedding_tf_idf[10] ) )
print("Sample of tf-idf vector :",  embedding_tf_idf[10] ) 
#each row is a scipy sparse vector; below you can see the nonzero tf-idf scores for row 10
Type of tf-idf vector : 
Sample of tf-idf vector : (0, 17044) 0.10181313112043715
 (0, 63993) 0.1300522422417
 (0, 61894) 0.10249541873723259
 (0, 27777) 0.1911862461978002
 (0, 30960) 0.3957008553364163
 (0, 63730) 0.2020739231752
 (0, 40449) 0.16216587766976334
 (0, 4461) 0.16216587766976334
 (0, 8475) 0.08485716723991746
 (0, 31414) 0.18845779142458444
 (0, 5483) 0.1838092364452786
 (0, 37593) 0.1911862461978002
 (0, 58568) 0.14265663715924579
 (0, 61995) 0.1028762233336384
 (0, 67264) 0.06666743118273906
 (0, 44987) 0.12607131535204325
 (0, 6153) 0.22108600346236493
 (0, 57525) 0.11855223645126424
 (0, 34836) 0.16672268959335923
 (0, 63098) 0.178222695383795
 (0, 56178) 0.13755888350927367
 (0, 298) 0.19427951309782668
 (0, 47696) 0.16300833746692905
 (0, 43798) 0.26178940407773144
 (0, 34841) 0.17240097371055277
 (0, 20955) 0.19785042766820815
 (0, 45132) 0.13813804095719331
 (0, 53808) 0.1080934078302259
 (0, 17736) 0.16775241873124694
 (0, 26277) 0.08590962363698813
 (0, 22006) 0.12375287289810279
 (0, 42343) 0.09562755236527265
 (0, 18877) 0.0857868943894454
 (0, 24168) 0.05398130992761945
 (0, 44142) 0.08892829883473839
 (0, 35482) 0.05021189293190694
 (0, 35570) 0.08868427003326287
 (0, 6347) 0.08500462216993547
 (0, 67675) 0.06195315958832872
 (0, 68125) 0.04877498605091992
 (0, 10746) 0.12150206579524199
 (0, 61763) 0.06688292195796296
 (0, 47239) 0.06820541000348619
 (0, 39906) 0.10746087457231246
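Each "(0, index) score" pair above is one nonzero entry of a sparse row: the column index points into the vectorizer's vocabulary. As a sketch of how those indices could be mapped back to actual words (this re-fits a TfidfVectorizer locally, since get_tfidf_embedding above does not return the fitted vectorizer):

#Sketch: show the ten highest-weighted terms of document 10
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(sentences)
#get_feature_names() on older scikit-learn; use get_feature_names_out() on >= 1.0
feature_names = np.array(tfidf.get_feature_names())
row = X[10].toarray().flatten()
for idx in row.argsort()[::-1][:10]:
    print(feature_names[idx], round(float(row[idx]), 4))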
def predict_topics_with_kmeans(embeddings,num_topics):
  kmeans_model = KMeans(num_topics)
  kmeans_model.fit(embeddings)
  topics_labels = kmeans_model.predict(embeddings)
  return topics_labels

def plot_embeddings(embedding, labels,title):

    labels = np.array( labels )
    distinct_labels =  set( labels )

    n = len(embedding)
    counter = Counter(labels)
    for i in range(len( distinct_labels )):
        ratio = (counter[i] / n )* 100
        cluster_label = f"cluster {i}: { round(ratio,2)}"
        x = embedding[:, 0][labels == i]
        y = embedding[:, 1][labels == i]
        plt.plot(x, y, '.', alpha=0.4, label= cluster_label)
    plt.legend(title="Topic",loc = 'upper left', bbox_to_anchor=(1.01,1))
    plt.title(title)

def reduce_umap(embedding):
  reducer = umap.UMAP()  #default n_components=2
  embedding_umap = reducer.fit_transform( embedding  )
  return embedding_umap

def reduce_pca(embedding):
    pca = PCA(n_components=2)
    reduced = pca.fit_transform( embedding )
    print( "pca explained_variance_ ",pca.explained_variance_)
    print( "pca explained_variance_ratio_ ",pca.explained_variance_ratio_)

    return reduced

def reduce_tsne(embedding):
    tsne = TSNE(n_components=2)
    reduced = tsne.fit_transform( embedding )

    return reduced
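The same pattern (reduce the embedding, cluster with KMeans, plot, score) is repeated below for the TF-IDF, LDA and Bert representations. Purely as a sketch, a hypothetical wrapper that bundles these steps could look like this (the notebook keeps calling the functions above directly):

#Hypothetical helper: UMAP reduction + KMeans + plot + silhouette score in one call
def cluster_and_score(embedding, title, k=num_topics):
    reduced = reduce_umap(embedding)                 #(n_docs, 2)
    labels = predict_topics_with_kmeans(reduced, k)  #cluster id per document
    plot_embeddings(reduced, labels, title)
    print(title, "silhouette:", silhouette_score(reduced, labels))
    return reduced, labels

For example, cluster_and_score(embedding_tf_idf, "Tf-idf with Umap") would reproduce the UMAP experiment that follows.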
#Apply kmeans to raw vectors
labels_tfidf_raw  = predict_topics_with_kmeans(embedding_tf_idf,num_topics)

print("Embedding Tf-idf shape :",embedding_tf_idf.shape)
Embedding Tf-idf shape : (11300, 70990)
#Apply kmeans to umap vectors
embedding_tf_idf_umap =  reduce_umap( embedding_tf_idf )
labels_tfidf_umap  = predict_topics_with_kmeans(embedding_tf_idf_umap,num_topics)

print("Embedding shape after umap",embedding_tf_idf_umap.shape)
Embedding shape after umap (11300, 2)
plot_embeddings(embedding_tf_idf_umap,labels_tfidf_umap,"Tf-idf with Umap")

[plot: Tf-idf with Umap]

embedding_tf_idf_tsne =  reduce_tsne( embedding_tf_idf )
labels_tfidf_tsne  = predict_topics_with_kmeans(embedding_tf_idf_tsne,num_topics)

plot_embeddings(embedding_tf_idf_tsne,labels_tfidf_tsne,"Tf-idf with T-sne")

[plot: Tf-idf with T-sne]

#The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters
#The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. 
#Negative values generally indicate that a sample has 
#been assigned to the wrong cluster, as a different cluster is more similar.
print("Silhouette score:" )
print("without dim reduction :", silhouette_score(embedding_tf_idf , labels_tfidf_raw) )
print("with Tf-idf   Umap    :", silhouette_score(embedding_tf_idf_umap, labels_tfidf_umap) )
print("with Tf-idf   T-sne   :",  silhouette_score(embedding_tf_idf_tsne, labels_tfidf_tsne) )
Silhouette score:
without dim reduction : 0.0058927913065180484
with Tf-idf   Umap    : 0.35885185
with Tf-idf   T-sne   : 0.34634998
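As a quick sanity check of the metric (a toy example of my own, not part of the original notebook): two well-separated blobs score near 1 with the correct labels and near 0 with random ones.

#Toy check of silhouette_score behaviour on two obvious clusters
toy = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 10])
true_labels = np.array([0] * 50 + [1] * 50)
random_labels = np.random.randint(0, 2, size=100)
print("separated clusters :", silhouette_score(toy, true_labels))    #close to 1
print("random labels      :", silhouette_score(toy, random_labels))  #close to 0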
def get_document_topic_lda(model, corpus, k):
    n_doc = len(corpus)
    #init a matrix of size number of docs x number of topics
    document_topic_mapping = np.zeros((n_doc, k))
    for i in range(n_doc):
        #for each document, fill in the probability of it belonging to each topic
        for topic, prob in model.get_document_topics(corpus[i]):
            document_topic_mapping[i, topic] = prob

    return document_topic_mapping
print("Number of words in token list :", len( token_lists ))
Number of words in token list : 11300

dictionary = corpora.Dictionary(token_lists)
corpus = [dictionary.doc2bow(text) for text in token_lists]
k = ntopic
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=k, id2word=dictionary,passes=20)

embedding_lda = get_document_topic_lda(ldamodel, corpus, k)
#This is not a real document embedding; I just use the document-topic distribution as one.
print("LDA vector shape :", embedding_lda.shape )
LDA vector shape : (11300, 20)

for i, prob in enumerate(embedding_lda[10].flatten()):
    print("Topic ", i+1, ") ", prob)
Topic  1 )  0.0
Topic  2 )  0.0
Topic  3 )  0.0
Topic  4 )  0.7270360589027405
Topic  5 )  0.0
Topic  6 )  0.0
Topic  7 )  0.0
Topic  8 )  0.01592746376991272
Topic  9 )  0.0
Topic  10 )  0.0
Topic  11 )  0.0
Topic  12 )  0.0
Topic  13 )  0.0
Topic  14 )  0.2096281200647354
Topic  15 )  0.0
Topic  16 )  0.0
Topic  17 )  0.0
Topic  18 )  0.036269474774599075
Topic  19 )  0.0
Topic  20 )  0.0
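Most entries above are exactly 0.0 because gensim's get_document_topics drops topics whose probability is below the model's minimum_probability threshold (0.01 by default), so get_document_topic_lda only fills the surviving topics. If the full distribution is wanted, the cutoff can be lowered explicitly, for example:

#Sketch: request the (almost) full topic distribution for document 10
#gensim clamps minimum_probability internally to a tiny positive value
full_dist = ldamodel.get_document_topics(corpus[10], minimum_probability=0.0)
print(full_dist)  #one (topic_id, probability) pair per topic above the tiny cutoff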
print("Number of tokens : ", len( token_lists ))
print("Sample item from corpus length :", len( corpus[100] ) )
print("Sample item from corpus vector :", corpus[100]  )
Number of token lists (documents) :  11300
Sample item from corpus length : 55
Sample item from corpus vector : [(0, 3), (1, 17), (114, 1), (179, 1), (262, 1), (271, 5), (297, 3), (300, 2), (313, 17), (376, 1), (465, 1), (488, 1), (494, 1), (831, 2), (1001, 4), (1160, 1), (1161, 1), (1189, 3), (1672, 1), (1688, 8), (1784, 1), (4299, 1), (4380, 3), (5095, 1), (5096, 1), (5097, 1), (5098, 1), (5099, 1), (5100, 1), (5101, 2), (5102, 1), (5103, 1), (5104, 2), (5105, 2), (5106, 1), (5107, 2), (5108, 1), (5109, 1), (5110, 1), (5111, 1), (5112, 2), (5113, 1), (5114, 1), (5115, 2), (5116, 1), (5117, 1), (5118, 1), (5119, 1), (5120, 1), (5121, 1), (5122, 1), (5123, 3), (5124, 1), (5125, 1), (5126, 1)]
print("Token 0 : ", dictionary.id2token[0] )
Token 0 :  ,
list(token_lists[100] ).count(  dictionary.id2token[10] )
0
ldamodel.get_document_topics( corpus[0] )
[(3, 0.98441637)]
srtd = sorted( ldamodel.get_document_topics( corpus[0] ) , key=lambda x: x[1], reverse=True)
print( srtd )
print( srtd[0][0] )
[(3, 0.98441637)]
3
labels_lda = []
for line in corpus :
  line_labels = sorted( ldamodel.get_document_topics( line ) , key=lambda x: x[1], reverse=True)
  #the first [0] selects the top-ranked topic; the second [0] takes the topic id from the (topic, prob) tuple
  top_topic = line_labels[0][0]
  labels_lda.append(  top_topic)
np.array(labels_lda ).shape
(11300,)
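To see what each of these topic ids actually stands for, the top words of an LDA topic can be listed with gensim's show_topic, for example:

#Print the ten highest-probability words of each of the k LDA topics
for topic_id in range(k):
    top_words = ldamodel.show_topic(topic_id, topn=10)  #list of (word, probability)
    print(topic_id, " ".join(word for word, _ in top_words))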
#Since the LDA representation already has a low dimension (num_topics = 20),
#dimensionality reduction will not yield good results
embedding_umap_lda = reduce_umap( embedding_lda  )
plot_embeddings(embedding_umap_lda, labels_lda,"LDA with Umap")

[plot: LDA with Umap]

embedding_pca_lda = reduce_pca(embedding_lda  )
plot_embeddings(embedding_pca_lda, labels_lda,"LDA with PCA")
pca explained_variance_  [0.11024877 0.07530292]
pca explained_variance_ratio_  [0.36910586 0.25210936]

[plot: LDA with PCA]

embedding_tsne_lda = reduce_tsne(embedding_lda  )
plot_embeddings(embedding_tsne_lda, labels_lda,"LDA with T-sne")

[plot: LDA with T-sne]

print("Silhouette score:" )
print("LDA          : ", silhouette_score(embedding_lda, labels_lda) )

print("LDA with PCA : ", silhouette_score(embedding_pca_lda, labels_lda) )

print("LDA with TSNE : ", silhouette_score(embedding_tsne_lda, labels_lda) )

print("LDA with UMAP : ", silhouette_score(embedding_umap_lda, labels_lda) )
Silhouette score:
LDA          :  0.3518280869100735
LDA with PCA :  0.028769437366574727
LDA with TSNE :  -0.076625
LDA with UMAP :  0.0077518793

Topic Modeling with Bert

from sentence_transformers import SentenceTransformer
model_bert = SentenceTransformer('bert-base-nli-max-tokens')
embedding_bert = np.array(model_bert.encode(sentences, show_progress_bar=True))

#Bert embeddings have 768 dimensions
print("Bert Embedding shape", embedding_bert.shape)
print("Bert Embedding sample", embedding_bert[0][0:50])
Bert Embedding shape (11300, 768)
Bert Embedding sample [ 0.47721025  1.4122621   1.2466793   0.25712243  1.3472135   0.05440127
  0.5105701   0.5357235   0.882528   -0.00200403  1.2273117   1.3862449
  1.5305245   0.35195014  0.07823949  0.68975717  1.7512573   1.070853
  1.169689    0.6520605  -0.03964749 -0.0158121   0.72166926  0.0570051
  1.2987412   1.5801806   0.73180366  1.0034966  -0.50588244 -0.04584667
  0.17520961  1.5908399   0.51227933  0.7542705   0.9528048   0.5564427
  0.81947947  0.05092382  0.26165453  0.9130406   0.63735193  0.40963185
  0.7895509   0.46444982 -0.31118888  1.1499051   0.7721692   1.4973241
  0.6381602   1.0090775 ]
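Unlike the tf-idf rows, these are dense vectors that place documents by meaning rather than by shared words. As a small illustration (my own check, not in the original notebook), the cosine similarity between two document embeddings can be computed directly:

from sklearn.metrics.pairwise import cosine_similarity

#Cosine similarity between the Bert embeddings of two documents;
#higher values suggest semantically closer texts
sim = cosine_similarity(embedding_bert[10].reshape(1, -1),
                        embedding_bert[11].reshape(1, -1))
print("cosine similarity of docs 10 and 11 :", float(sim[0][0]))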
#Apply Kmeans without dimension reduction
labels_bert_raw  = predict_topics_with_kmeans(embedding_bert,num_topics)
#Apply Kmeans for Bert vectors with Umap dimension reduction

embedding_umap_bert = reduce_umap( embedding_bert )
labels_bert_umap  = predict_topics_with_kmeans(embedding_umap_bert,num_topics)
plot_embeddings(embedding_umap_bert, labels_bert_umap,"Bert with Umap")

[plot: Bert with Umap]

#Apply Kmeans for Bert Vectors  with PCA  dimension reduction

embedding_bert_pca =  reduce_pca( embedding_bert )
labels_bert_pca  = predict_topics_with_kmeans(embedding_bert_pca,num_topics)

plot_embeddings(embedding_bert_pca,labels_bert_pca,"Bert with PCA")
pca explained_variance_  [10.201032   5.7837133]
pca explained_variance_ratio_  [0.0993989  0.05635653]
#Apply Kmeans for Bert Vectors  with T-sne  dimension reduction

embedding_bert_tsne =  reduce_tsne( embedding_bert )
labels_bert_tsne  = predict_topics_with_kmeans(embedding_bert_tsne,num_topics)
plot_embeddings(embedding_bert_tsne,labels_bert_tsne,"Bert with T-sne")

print("Silhouette score:" )

print("Raw Bert" ,silhouette_score(embedding_bert, labels_bert_raw) )

print("Bert with PCA" ,  silhouette_score(embedding_bert_pca, labels_bert_pca) )

print("Bert with Tsne" , silhouette_score(embedding_bert_tsne, labels_bert_tsne) )

print("Bert with Umap" ,  silhouette_score(embedding_umap_bert , labels_bert_umap ) )
Silhouette score:
Raw Bert 0.047337186
Bert with PCA 0.32503554
Bert with Tsne 0.39274785
Bert with Umap 0.4647518

Colab link

https://colab.research.google.com/drive/1n8bTSuHnD3ya4CbhGvOL90vvVAo9s3Si#scrollTo=pFhF6RKiln5s