Institutional Research Archive of the University of Palermo

Andrea Simonetti, Alessandro Albano, Antonella Plaia, Michele Tumminello (2023). Ranking coherence in Topic Models using Statistically Validated Networks. JOURNAL OF INFORMATION SCIENCE [10.1177/01655515221148369].

Ranking coherence in Topic Models using Statistically Validated Networks

Andrea Simonetti; Alessandro Albano; Antonella Plaia; Michele Tumminello

2023-01-01

Abstract

Probabilistic topic models have become one of the most widespread machine learning techniques in textual analysis. Topic discovery is an unsupervised process that does not guarantee the interpretability of its output. Hence, the automatic evaluation of topic coherence has attracted the interest of many researchers over the last decade and remains an open research area. The present article offers a new quality evaluation method based on Statistically Validated Networks (SVNs). The proposed probabilistic approach represents each topic as a weighted network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrence in sentences against the null hypothesis of random co-occurrence. By making use of a battery of statistical tests, the proposed method allows one to distinguish between high-quality and low-quality topics. The statistically significant pairwise associations of words represented by the links in the SVN might reasonably be expected to be closely related to the semantic coherence and interpretability of a topic. Therefore, the more connected the network, the more coherent the topic in question. We demonstrate the effectiveness of the method through an analysis of a real text corpus, which shows that the proposed measure correlates more strongly with human judgement than state-of-the-art coherence measures.
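
The link-validation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific tests, the multiple-comparison correction, or how link weights and the final score are defined, so everything concrete here is an assumption made for illustration: a one-sided hypergeometric test as the null model of random co-occurrence, a Bonferroni-corrected threshold, and a score equal to the fraction of validated word pairs. The names `hypergeom_sf`, `validated_links`, and `connectivity` are hypothetical and not taken from the article.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_total, n_a, n_b):
    """P(X >= k), where X is the number of sentences containing both words
    if word A (in n_a sentences) and word B (in n_b sentences) were placed
    at random among n_total sentences (assumed null model)."""
    upper = min(n_a, n_b)
    denom = comb(n_total, n_b)
    tail = sum(comb(n_a, x) * comb(n_total - n_a, n_b - x)
               for x in range(k, upper + 1))
    return tail / denom


def validated_links(sentences, top_words, alpha=0.01):
    """Return {(word_a, word_b): p_value} for pairs whose sentence-level
    co-occurrence is significant after a Bonferroni correction (assumed)."""
    sentence_sets = [set(s) for s in sentences]
    n_total = len(sentence_sets)
    counts = {w: sum(w in s for s in sentence_sets) for w in top_words}
    pairs = list(combinations(top_words, 2))
    threshold = alpha / len(pairs)  # Bonferroni: one test per word pair
    links = {}
    for a, b in pairs:
        both = sum(a in s and b in s for s in sentence_sets)
        p_value = hypergeom_sf(both, n_total, counts[a], counts[b])
        if both > 0 and p_value < threshold:
            links[(a, b)] = p_value
    return links


def connectivity(links, top_words):
    """Fraction of possible word pairs that are validated links
    (a hypothetical stand-in for the article's coherence measure)."""
    n_pairs = len(top_words) * (len(top_words) - 1) // 2
    return len(links) / n_pairs


# Toy corpus: "cat" and "dog" always appear together, "tax" never joins them.
sentences = ([["cat", "dog", "pet"]] * 10
             + [["tax", "law"]] * 10
             + [["misc"]] * 20)
top_words = ["cat", "dog", "tax"]
links = validated_links(sentences, top_words)
score = connectivity(links, top_words)
print(sorted(links), score)
```

In this toy corpus only the pair that systematically shares sentences is validated, so a topic mixing unrelated words receives a low connectivity score. A real application would use the topic model's most probable words and a sentence-tokenized corpus.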

Date: 2023
Journal title: JOURNAL OF INFORMATION SCIENCE
DOI: https://dx.doi.org/10.1177/01655515221148369
Alternative URL (other than the publisher's): https://journals.sagepub.com/share/RKVEQ3HKXIYKT7JVEXWX?target=10.1177/01655515221148369
Citation: Andrea Simonetti, Alessandro Albano, Antonella Plaia, Michele Tumminello (2023). Ranking coherence in Topic Models using Statistically Validated Networks. JOURNAL OF INFORMATION SCIENCE [10.1177/01655515221148369].
Appears in types: 1.01 Journal article

Files in this record:

File	Size	Format
New_Jis.pdf (open access; description: full article; type: post-print)	659.31 kB	Adobe PDF
simonetti-et-al-2023-ranking-coherence-in-topic-models-using-statistically-validated-networks.pdf (open access; type: publisher's version)	605.5 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/574748
