Probabilistic topic models have become one of the most widespread machine learning techniques in textual analysis. Topic discovery is an unsupervised process that does not guarantee the interpretability of its output. Hence, the automatic evaluation of topic coherence has attracted the interest of many researchers over the last decade and remains an open research area. The present article offers a new quality evaluation method based on Statistically Validated Networks (SVNs). The proposed probabilistic approach represents each topic as a weighted network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrence in sentences against the null hypothesis of random co-occurrence. By making use of a battery of statistical tests, the proposed method allows one to distinguish between high-quality and low-quality topics. The statistically significant pairwise associations of words represented by the links in the SVN might reasonably be expected to be closely related to the semantic coherence and interpretability of a topic. Therefore, the more connected the network, the more coherent the topic in question. We demonstrate the effectiveness of the method through an analysis of a real text corpus, which shows that the proposed measure correlates more strongly with human judgement than state-of-the-art coherence measures.
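The link-validation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tests used, so the one-sided hypergeometric test, the Bonferroni correction, the `alpha` level, and the density-style score are all assumptions made for illustration, as are the function names `hypergeom_sf`, `validated_links` and `connectivity`.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_total, n_a, n_b):
    """P(X >= k) for X ~ Hypergeometric: n_b draws out of n_total
    sentences, n_a of which contain the first word."""
    upper = min(n_a, n_b)
    total = comb(n_total, n_b)
    tail = sum(comb(n_a, x) * comb(n_total - n_a, n_b - x)
               for x in range(k, upper + 1))
    return tail / total


def validated_links(sentences, top_words, alpha=0.01):
    """Return {(word_a, word_b): p_value} for pairs whose sentence-level
    co-occurrence rejects the null hypothesis of random co-occurrence."""
    sets = [set(s) for s in sentences]
    n = len(sets)
    count = {w: sum(w in s for s in sets) for w in top_words}
    pairs = list(combinations(top_words, 2))
    # Assumption: Bonferroni correction over all tested pairs.
    threshold = alpha / len(pairs)
    links = {}
    for a, b in pairs:
        both = sum(a in s and b in s for s in sets)
        p = hypergeom_sf(both, n, count[a], count[b])
        if both > 0 and p < threshold:
            links[(a, b)] = p
    return links


def connectivity(links, top_words):
    """Hypothetical coherence score: share of word pairs that are linked."""
    n_pairs = len(top_words) * (len(top_words) - 1) // 2
    return len(links) / n_pairs


# Toy corpus: "cat" and "dog" always appear together, "tax" never joins them.
sentences = [["cat", "dog"]] * 10 + [["tax", "law"]] * 10
top_words = ["cat", "dog", "tax"]
links = validated_links(sentences, top_words)
score = connectivity(links, top_words)
```

In this toy corpus only the pair (`cat`, `dog`) is validated, so one of the three possible links is present. A topic whose top words co-occur more systematically would yield a denser network and, under the assumed score, a higher value.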
Andrea Simonetti, Alessandro Albano, Antonella Plaia, Michele Tumminello (2023). Ranking coherence in Topic Models using Statistically Validated Networks. JOURNAL OF INFORMATION SCIENCE [10.1177/01655515221148369].
Ranking coherence in Topic Models using Statistically Validated Networks
Andrea Simonetti;Alessandro Albano;Antonella Plaia;Michele Tumminello
2023-01-01
File | Description | Type | Access | Size | Format
---|---|---|---|---|---
New_Jis.pdf | Full article | Post-print | Open access | 659.31 kB | Adobe PDF
simonetti-et-al-2023-ranking-coherence-in-topic-models-using-statistically-validated-networks.pdf | | Publisher's version | Open access | 605.5 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.