Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. Within this framework, Latent Dirichlet Allocation (LDA) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned by the model may be characterized by a set of irrelevant or unrelated words, which makes them useless for interpretation. In topic quality evaluation, the pairwise semantic cohesion among the top-N most probable words of a given topic is calculated by measures based on word co-occurrences. Many topic-quality metrics have been proposed, each defining a different score: Pointwise Mutual Information (PMI), also called UCI; an asymmetrical measure called UMass; Normalized Pointwise Mutual Information (NPMI); a measure based on tf-idf scores; and a measure called CV proposed by Röder et al. Although these measures already treat co-occurrence between words as a measure of association, none has taken a statistical approach based on hypothesis testing to assess whether the co-occurrence observed between two words can be attributed to chance or whether these links carry relevant information about the structure of topics. Thus, we propose a new coherence measure based on Statistically Validated Networks to evaluate the interpretability of the top words of a topic.
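As an illustration of the contrast drawn above, the following minimal sketch computes NPMI for a word pair and, separately, tests whether the pair's co-occurrence exceeds what chance would produce. The abstract does not specify the test or the aggregation into a coherence score; the sketch assumes a one-sided hypergeometric test on document counts with a Bonferroni correction, which is a common choice for statistically validated networks. All function names, the toy corpus, and the `alpha` default are hypothetical.

```python
from itertools import combinations
from math import comb, log


def doc_frequencies(docs, words):
    """Count the documents containing each word and each unordered word pair."""
    doc_sets = [set(d) for d in docs]
    single = {w: sum(w in d for d in doc_sets) for w in words}
    pair = {
        (a, b): sum(a in d and b in d for d in doc_sets)
        for a, b in combinations(words, 2)
    }
    return single, pair


def npmi(n_a, n_b, n_ab, n_docs):
    """Normalized PMI from document counts; -1 when the words never co-occur."""
    if n_ab == 0:
        return -1.0
    if n_ab == n_docs:
        return 1.0  # both words occur in every document
    p_a, p_b, p_ab = n_a / n_docs, n_b / n_docs, n_ab / n_docs
    return log(p_ab / (p_a * p_b)) / -log(p_ab)


def hypergeom_pvalue(n_a, n_b, n_ab, n_docs):
    """P(X >= n_ab) for X ~ Hypergeometric(n_docs, n_a, n_b).

    Null hypothesis (an assumption of this sketch): the n_b documents
    containing word b are drawn at random from the corpus.
    """
    tail = sum(
        comb(n_a, k) * comb(n_docs - n_a, n_b - k)
        for k in range(n_ab, min(n_a, n_b) + 1)
    )
    return tail / comb(n_docs, n_b)


def validated_pairs(docs, top_words, alpha=0.05):
    """Return the word pairs whose co-occurrence passes the Bonferroni threshold."""
    single, pair = doc_frequencies(docs, top_words)
    threshold = alpha / len(pair)  # Bonferroni: one test per word pair
    result = {}
    for (a, b), n_ab in pair.items():
        p = hypergeom_pvalue(single[a], single[b], n_ab, len(docs))
        if p < threshold:
            result[(a, b)] = p
    return result


# Toy corpus: "cat" and "dog" always appear together, "tax" never joins them.
toy_docs = [["cat", "dog"]] * 5 + [["tax", "law"]] * 5
links = validated_pairs(toy_docs, ["cat", "dog", "tax"])
```

On this toy corpus only the pair ("cat", "dog") is retained as a validated link. How validated links are combined into a single coherence score for a topic is not described in the abstract, so the sketch stops at the set of validated pairs.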
Andrea Simonetti, Alessandro Albano, Antonella Plaia, Michele Tumminello (2022). Statistically Validated Networks for evaluating coherence in topic models. In The 10th International Conference on Complex Networks and their Applications - Book of Abstracts.
Statistically Validated Networks for evaluating coherence in topic models
Andrea Simonetti; Alessandro Albano; Antonella Plaia; Michele Tumminello
2022-01-01


