Probabilistic topic models have become one of the most widespread machine learning techniques for textual analysis. Within this framework, Latent Dirichlet Allocation (LDA) has gained increasing popularity as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned by the model may be characterized by a set of irrelevant or unrelated words, making them useless for interpretation. In topic quality evaluation, the pairwise semantic cohesion among the top-N most probable words of a given topic is calculated by measures based on word co-occurrences. Many topic-quality metrics have been proposed, defining different score measures such as: Pointwise Mutual Information (PMI), also called UCI; an asymmetrical measure called UMass; Normalized Pointwise Mutual Information (NPMI); a measure based on tf-idf scores; and a measure called CV proposed by Röder et al. Although these measures already consider co-occurrence between words as a measure of association, none has undertaken a statistical approach based on hypothesis testing to assess whether the co-occurrence observed between two words can be attributed to chance or whether these links carry relevant information about the structure of topics. Thus, we propose a new coherence measure based on Statistically Validated Networks to evaluate the interpretability of the top words of a topic.
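To make the contrast between a co-occurrence score and a co-occurrence test concrete, the following is a minimal sketch, not the authors' method. It computes NPMI from document-level counts and, alongside it, a one-sided p-value for the null hypothesis that two words co-occur by chance. The abstract does not name the test; the hypergeometric null used here is an assumption, chosen because it is a common choice for statistically validated networks. The function names (`doc_frequencies`, `npmi`, `cooccurrence_p_value`) and the toy corpus are hypothetical.

```python
from math import comb, log


def doc_frequencies(docs, w1, w2):
    """Return (N, n1, n2, n12): number of documents, documents containing w1,
    documents containing w2, and documents containing both."""
    sets = [set(d) for d in docs]
    n1 = sum(w1 in s for s in sets)
    n2 = sum(w2 in s for s in sets)
    n12 = sum(w1 in s and w2 in s for s in sets)
    return len(sets), n1, n2, n12


def npmi(N, n1, n2, n12):
    """Normalized PMI from document counts, in [-1, 1]."""
    if n12 == 0:
        return -1.0  # the words never co-occur
    if n12 == N:
        return 0.0  # NPMI is undefined here (0/0); 0.0 is a convention of this sketch
    p1, p2, p12 = n1 / N, n2 / N, n12 / N
    return log(p12 / (p1 * p2)) / -log(p12)


def cooccurrence_p_value(N, n1, n2, n12):
    """P(X >= n12) for X ~ Hypergeometric(N, n1, n2).

    Assumed null model: the n2 documents containing w2 are drawn at random
    from the N documents, n1 of which contain w1.
    """
    total = comb(N, n2)
    tail = sum(
        comb(n1, k) * comb(N - n1, n2 - k)
        for k in range(n12, min(n1, n2) + 1)
    )
    return tail / total


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["topic", "model", "lda"],
    ["topic", "model"],
    ["topic", "model", "word"],
    ["network", "graph"],
    ["network", "graph", "word"],
    ["graph", "word"],
]

counts = doc_frequencies(docs, "topic", "model")
score = npmi(*counts)
p_value = cooccurrence_p_value(*counts)
print(counts, score, p_value)
```

A small p-value suggests that the pair co-occurs more often than chance alone would explain, which is the kind of evidence a score such as NPMI does not provide on its own. A validated network over many word pairs would typically also apply a multiple-testing correction, which this sketch leaves out.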

Andrea Simonetti, Alessandro Albano, Antonella Plaia, Michele Tumminello (2022). Statistically Validated Networks for evaluating coherence in topic models. In The 10th International Conference on Complex Networks and their Applications - Book of Abstracts.

Statistically Validated Networks for evaluating coherence in topic models

Andrea Simonetti; Alessandro Albano; Antonella Plaia; Michele Tumminello
2022-01-01

Jan 2022
Sector SECS-S/01 - Statistics
Sector SECS-S/06 - Mathematical Methods for Economics, Actuarial and Financial Sciences
978-2-9557050-5-6
Files in this item:
File: Extended_Abstract_complex_Network_Conference.pdf
Access: archive managers only
Description: Full contribution
Type: Post-print
Size: 106.96 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/531495