Institutional Research Archive of the Università degli Studi di Palermo

Simonetti, A., Albano, A., Tumminello, M., Di Matteo, T. (2025). Statistically validated network for analysing textual data. APPLIED NETWORK SCIENCE, 10(1) [10.1007/s41109-025-00693-z].

Statistically validated network for analysing textual data

Simonetti, Andrea; Albano, Alessandro; Tumminello, Michele; Di Matteo, T.

2025-02-01

Abstract

This paper presents a novel methodology, the Word Co-occurrence SVN topic model (WCSVNtm), for document clustering and topic modeling in textual datasets. The method represents the corpus as a bipartite network of words and documents and uses it to assess the statistical significance of word co-occurrences within documents and of document overlap based on shared vocabulary, yielding a statistically validated network (SVN). By applying the Leiden community detection algorithm to the SVN, distinct communities of words can be identified and interpreted as topics. Similarly, documents can be sorted into groups based on their thematic similarities. We demonstrate the effectiveness of our approach by analyzing three datasets: a set of 120 Wikipedia articles; the arXiv10 dataset, which consists of 100,000 abstracts from scientific papers; and a sampled subset of 10,000 documents from the original arXiv10. To benchmark our results, we compare our approach with several well-established models for topic modeling and document clustering, including the hierarchical Stochastic Block Model (hSBM), BERTopic, and Latent Dirichlet Allocation (LDA). The results show that WCSVNtm achieves competitive performance across all datasets while automatically selecting the number of topics and document clusters, whereas state-of-the-art methods require prior knowledge or additional tuning for optimization. Finally, any advancements in community detection algorithms could further improve our method.
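The word-level validation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that significance is measured with a one-sided hypergeometric test on the number of documents two words share and that multiple testing is handled with a Bonferroni correction, neither of which is stated in the abstract. The function names `cooccurrence_pvalue` and `validated_word_links`, the whitespace tokenization, and the default `alpha` are hypothetical. The Leiden community detection step is omitted because it needs an external library.

```python
from itertools import combinations
from math import comb


def cooccurrence_pvalue(n_docs, n_i, n_j, n_ij):
    """P(X >= n_ij) for X ~ Hypergeometric(n_docs, n_i, n_j).

    n_i and n_j are the numbers of documents containing each word,
    n_ij is the number of documents containing both.
    Assumption: the abstract does not name the test; this is one common choice.
    """
    upper = min(n_i, n_j)
    total = comb(n_docs, n_j)
    tail = sum(
        comb(n_i, k) * comb(n_docs - n_i, n_j - k)
        for k in range(n_ij, upper + 1)
    )
    return tail / total


def validated_word_links(docs, alpha=0.01):
    """Return (word1, word2, p_value) for significantly co-occurring word pairs."""
    # Bipartite representation: each word maps to the set of documents containing it.
    word_docs = {}
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            word_docs.setdefault(word, set()).add(doc_id)

    words = sorted(word_docs)
    n_tests = len(words) * (len(words) - 1) // 2
    # Bonferroni threshold (an assumption; other corrections are possible).
    threshold = alpha / n_tests if n_tests else alpha

    links = []
    for w1, w2 in combinations(words, 2):
        shared = len(word_docs[w1] & word_docs[w2])
        if shared == 0:
            continue  # no co-occurrence, nothing to validate
        p = cooccurrence_pvalue(
            len(docs), len(word_docs[w1]), len(word_docs[w2]), shared
        )
        if p < threshold:
            links.append((w1, w2, p))
    return links
```

Under these assumptions, the returned pairs would form the edges of a word network on which a community detection algorithm could then be run; the same test applied to pairs of documents, counting shared words, would give the document-side network.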

Date: Feb 2025
Scientific disciplinary sector of the contribution: Sector STAT-01/A - Statistics; Sector STAT-04/A - Mathematical methods for economics and for actuarial and financial sciences
Journal title: APPLIED NETWORK SCIENCE
DOI of the contribution: https://dx.doi.org/10.1007/s41109-025-00693-z
Publisher URL (open access where possible): https://appliednetsci.springeropen.com/articles/10.1007/s41109-025-00693-z
Appears in types: 1.01 Journal article

Files in this product:

File	Size	Format
Simonetti et al editorial version.pdf (open access; description: Full paper; type: publisher's version)	2.68 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/673424
