Topic models arise from the need of understanding and exploring large text document collections and predicting their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models give no guaranty on the interpretability of their outputs. The topics learned from texts may be characterized by a set of irrelevant or unchained words. Therefore, topic models require validation of the coherence of estimated topics. However, the automatic evaluation of the latent space of a topic model is a difficult task. Formerly, the most used metric for evaluating the quality of a topic model was the held-out likelihood. Still, the literature has shown that this method emphasizes complexity rather than interpretability. Although many procedures were recently proposed (Röder et al., 2015), the automatic evaluation of topic coherence remains an open research area. Our work aims to provide a new technique based on Statistically Validated Network (Tumminello et al., 2011). Our approach consists in representing each topic as a network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrences in sentences against the null hypothesis of random co-occurrence. Thus, we propose a new coherence measure based on the structure of the statistically validated network. Furthermore, the new measure provides a ranking of topics and distinguishes high-quality from low-quality topics. The intuition is that the pairwise associations of words is strictly related to the semantic coherence and interpretability of a topic.

Alessandro Albano, Andrea Simonetti (2020). MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS. In Book of Abstracts Third international conference on Data Science & Social Research,.

MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS

Alessandro Albano;Andrea Simonetti
2020-01-01

Abstract

Topic models arise from the need of understanding and exploring large text document collections and predicting their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where a distribution over words characterizes each topic. Unfortunately, topic models give no guaranty on the interpretability of their outputs. The topics learned from texts may be characterized by a set of irrelevant or unchained words. Therefore, topic models require validation of the coherence of estimated topics. However, the automatic evaluation of the latent space of a topic model is a difficult task. Formerly, the most used metric for evaluating the quality of a topic model was the held-out likelihood. Still, the literature has shown that this method emphasizes complexity rather than interpretability. Although many procedures were recently proposed (Röder et al., 2015), the automatic evaluation of topic coherence remains an open research area. Our work aims to provide a new technique based on Statistically Validated Network (Tumminello et al., 2011). Our approach consists in representing each topic as a network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrences in sentences against the null hypothesis of random co-occurrence. Thus, we propose a new coherence measure based on the structure of the statistically validated network. Furthermore, the new measure provides a ranking of topics and distinguishes high-quality from low-quality topics. The intuition is that the pairwise associations of words is strictly related to the semantic coherence and interpretability of a topic.
2020
topic model, topic coherence, LDA, statistically validated networks.
978-886629-051-3
Alessandro Albano, Andrea Simonetti (2020). MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS. In Book of Abstracts Third international conference on Data Science & Social Research,.
File in questo prodotto:
File Dimensione Formato  
Abstract MEASURING TOPIC COHERENCE THROUGH STATISTICALLY VALIDATED NETWORKS.pdf

Solo gestori archvio

Tipologia: Post-print
Dimensione 199.98 kB
Formato Adobe PDF
199.98 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/455292
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact