Institutional Research Archive of the Università degli Studi di Palermo

Andrea Simonetti, Alessandro Albano, Antonella Plaia, Michele Tumminello (2022). Statistically Validated Networks for evaluating coherence in topic models. In The 10th International Conference on Complex Networks and their Applications - Book of Abstracts.

Statistically Validated Networks for evaluating coherence in topic models

Andrea Simonetti; Alessandro Albano; Antonella Plaia; Michele Tumminello

2022-01-01

Abstract

Probabilistic topic models have become one of the most widespread machine learning techniques for text analysis. Within this framework, Latent Dirichlet Allocation (LDA) has become increasingly popular as a text modelling technique. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models do not guarantee the interpretability of their outputs. The topics learned by the model may be characterized by a set of irrelevant or unrelated words, which makes them useless for interpretation. In topic quality evaluation, the pairwise semantic cohesion among the top-N most probable words of a given topic is computed with measures based on word co-occurrences. Many topic-quality metrics have been proposed, each defining a different score: Pointwise Mutual Information (PMI), also called UCI; an asymmetric measure called UMass; Normalized Pointwise Mutual Information (NPMI); a measure based on tf-idf scores; and a measure called CV, proposed by Röder et al. Although these measures already treat co-occurrence between words as a measure of association, none of them takes a statistical approach based on hypothesis testing to assess whether the co-occurrence observed between two words can be attributed to chance or whether the link carries relevant information about the structure of the topics. We therefore propose a new coherence measure, based on Statistically Validated Networks, to evaluate the interpretability of the top words of a topic.
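As a rough illustration of the contrast drawn in the abstract, the sketch below computes document-level PMI, NPMI and UMass scores for word pairs and then asks whether each co-occurrence count is larger than expected by chance. The abstract does not specify the test, so the one-sided hypergeometric test and the Bonferroni correction used here are assumptions, chosen because they are common in work on statistically validated networks. Counting co-occurrence per document (rather than in sliding windows over a reference corpus) is a further simplification, and all function names are hypothetical.

```python
import math
from itertools import combinations


def doc_frequencies(docs, words):
    """Count the documents containing each word and each unordered word pair."""
    doc_sets = [set(d) for d in docs]
    single = {w: sum(w in s for s in doc_sets) for w in words}
    pair = {
        (a, b): sum(a in s and b in s for s in doc_sets)
        for a, b in combinations(words, 2)
    }
    return single, pair


def pmi(n_a, n_b, n_ab, n_docs):
    """Pointwise mutual information from document counts."""
    if n_ab == 0:
        return float("-inf")
    return math.log((n_ab / n_docs) / ((n_a / n_docs) * (n_b / n_docs)))


def npmi(n_a, n_b, n_ab, n_docs):
    """PMI normalized to [-1, 1]."""
    if n_ab == 0:
        return -1.0
    if n_ab == n_docs:
        # Undefined (0/0) when the pair is in every document;
        # returning 0.0 is an arbitrary choice made for this sketch.
        return 0.0
    return pmi(n_a, n_b, n_ab, n_docs) / -math.log(n_ab / n_docs)


def umass(n_b, n_ab):
    """Asymmetric UMass score: log((D(a, b) + 1) / D(b))."""
    return math.log((n_ab + 1) / n_b)


def cooccurrence_pvalue(n_a, n_b, n_ab, n_docs):
    """P(X >= n_ab) for X ~ Hypergeometric(n_docs, n_a, n_b).

    Assumed null hypothesis: the two words are spread over the
    documents independently of each other.
    """
    upper = min(n_a, n_b)
    tail = sum(
        math.comb(n_a, k) * math.comb(n_docs - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / math.comb(n_docs, n_b)


def validated_links(docs, words, alpha=0.05):
    """Return the word pairs whose co-occurrence survives a Bonferroni
    correction over all tested pairs (assumed correction)."""
    single, pair = doc_frequencies(docs, words)
    threshold = alpha / max(len(pair), 1)
    n_docs = len(docs)
    return [
        (a, b)
        for (a, b), n_ab in pair.items()
        if cooccurrence_pvalue(single[a], single[b], n_ab, n_docs) < threshold
    ]


# Toy corpus, invented purely for illustration.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["cat", "pet"],
    ["stock", "market"],
    ["stock", "market", "trade"],
    ["dog", "market"],
]
top_words = ["cat", "dog", "pet"]
single, pair = doc_frequencies(docs, top_words)
for (a, b), n_ab in pair.items():
    print(a, b, npmi(single[a], single[b], n_ab, len(docs)),
          cooccurrence_pvalue(single[a], single[b], n_ab, len(docs)))
```

On a corpus this small no pair can pass the corrected threshold, which is the point of the comparison: a pair may receive a positive PMI or NPMI score even though its co-occurrence count is entirely compatible with chance.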

Date: January 2022
ISBN of the monograph: 978-2-9557050-5-6
Citation: Andrea Simonetti, Alessandro Albano, Antonella Plaia, Michele Tumminello (2022). Statistically Validated Networks for evaluating coherence in topic models. In The 10th International Conference on Complex Networks and their Applications - Book of Abstracts.
Appears in types: 2.07 Conference proceedings contribution published in a volume

Files in this item:

File	Size	Format
Extended_Abstract_complex_Network_Conference.pdf (access: archive managers only; description: full contribution; type: post-print)	106.96 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/531495
