Institutional research archive of the Università degli Studi di Palermo

Alessandro Albano, Andrea Simonetti (2020). MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS. In Book of Abstracts, Third International Conference on Data Science & Social Research.

MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS

Alessandro Albano; Andrea Simonetti

2020-01-01

Abstract

Topic models arise from the need to understand and explore large collections of text documents and to predict their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models give no guarantee of the interpretability of their outputs. The topics learned from texts may be characterized by a set of irrelevant or unrelated words. Therefore, topic models require validation of the coherence of the estimated topics. However, the automatic evaluation of the latent space of a topic model is a difficult task. Formerly, the most widely used metric for evaluating the quality of a topic model was the held-out likelihood. Still, the literature has shown that this metric favours model complexity over interpretability. Although many procedures have recently been proposed (Röder et al., 2015), the automatic evaluation of topic coherence remains an open research area. Our work aims to provide a new technique based on Statistically Validated Networks (Tumminello et al., 2011). Our approach consists of representing each topic as a network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrences in sentences against the null hypothesis of random co-occurrence. We then propose a new coherence measure based on the structure of the statistically validated network. Furthermore, the new measure provides a ranking of topics and distinguishes high-quality from low-quality topics. The intuition is that the pairwise associations between words are closely related to the semantic coherence and interpretability of a topic.
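The link-validation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which test, multiple-testing correction, or network statistic the authors use, so everything below is an assumption: a one-sided hypergeometric test (a common choice for statistically validated networks), a Bonferroni-corrected threshold, and link density as a placeholder coherence score. The function names `hypergeom_sf`, `validated_links` and `link_density`, the `alpha` parameter and the toy corpus are all hypothetical.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_sentences, n_a, n_b):
    """P(X >= k), where X is the number of sentences containing both words
    if the n_b sentences with word b were drawn at random from the corpus."""
    total = comb(n_sentences, n_b)
    tail = sum(
        comb(n_a, x) * comb(n_sentences - n_a, n_b - x)
        for x in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


def validated_links(sentences, top_words, alpha=0.01):
    """Return the word pairs whose sentence co-occurrence rejects the
    null hypothesis of random co-occurrence (assumed test, see lead-in)."""
    sentence_sets = [set(s) for s in sentences]
    n_sentences = len(sentence_sets)
    counts = {w: sum(w in s for s in sentence_sets) for w in top_words}
    pairs = list(combinations(top_words, 2))
    threshold = alpha / len(pairs)  # Bonferroni correction (assumption)
    links = set()
    for a, b in pairs:
        k = sum(a in s and b in s for s in sentence_sets)
        if hypergeom_sf(k, n_sentences, counts[a], counts[b]) < threshold:
            links.add((a, b))
    return links


def link_density(links, top_words):
    """Placeholder coherence score: share of word pairs that are linked."""
    n = len(top_words)
    return len(links) / (n * (n - 1) / 2)


# Toy corpus (made up): "bank" and "loan" always appear together.
sentences = [["bank", "loan"]] * 10 + [["tree"]] * 30
top_words = ["bank", "loan", "tree"]

links = validated_links(sentences, top_words)
score = link_density(links, top_words)
print(links, score)
```

Under these assumptions a topic whose top words form a dense validated network scores higher than one whose words rarely co-occur beyond chance, which matches the intuition stated in the abstract. The actual measure proposed by the authors may use a different statistic of the network.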

Date: 2020
Keywords: topic model, topic coherence, LDA, statistically validated networks
ISBN of the monograph: 978-886629-051-3
Publication type: 2.08 Abstract in conference proceedings published in a volume

Files in this record:

File	Size	Format
Abstract MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS.pdf (post-print; access restricted to archive managers)	199.98 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/455292
