Institutional Research Archive of the Università degli Studi di Palermo

Andrea Simonetti, Alessandro Albano (2023). Statistically Validated Network approach for document clustering and topic modeling. In THE 12TH INTERNATIONAL CONFERENCE ON COMPLEX NETWORKS AND THEIR APPLICATIONS, BOOK OF ABSTRACTS.

Statistically Validated Network approach for document clustering and topic modeling

Andrea Simonetti; Alessandro Albano

2023-01-01

Abstract

In machine learning, document clustering and topic modeling are scientific challenges concerning the extraction of useful information from a collection of texts. Traditional approaches, such as Latent Dirichlet Allocation (LDA), rely on maximising likelihood functions. In this paper, we explore a paradigm shift towards a network representation of textual data and the associated challenges of community detection [3]. We propose a new method to address the tasks of document clustering and topic modeling by representing a collection of documents as a bipartite network. We then apply Statistically Validated Networks (SVN) to filter out irrelevant connections within the projected networks of words and documents. The SVN method is promising in the framework of topic modeling. For instance, Simonetti et al. (2022) recently proposed a new application of SVN to measure the coherence of topics. Here, instead, we aim to identify the topics themselves. By doing so, we naturally find topics with high coherence according to the measure proposed by those authors. Moreover, the modularity contribution of each community (topic) can be interpreted as a measure of coherence, since it is an intensive quantity that assesses the tendency of words within a given topic to occur jointly in the same sentences.
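
The abstract does not spell out the procedure, so the following Python fragment is only a minimal sketch of the kind of pipeline it describes, not the authors' implementation. It assumes a hypergeometric null model for word co-occurrence with a Bonferroni-corrected threshold (a common choice in the SVN literature), treats whole documents rather than sentences as the co-occurrence unit, and uses connected components as a simple stand-in for a community detection algorithm. All function names and the toy corpus are hypothetical.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws)."""
    upper = min(K, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)) / comb(N, n)


def validated_word_network(docs, alpha=0.01):
    """Project the document-word bipartite network onto words and keep only
    the co-occurrences that are over-expressed under the hypergeometric null."""
    n_docs = len(docs)
    # Bipartite structure: each word maps to the set of documents containing it.
    occurrences = {}
    for i, words in enumerate(docs):
        for w in set(words):
            occurrences.setdefault(w, set()).add(i)
    pairs = list(combinations(sorted(occurrences), 2))
    if not pairs:
        return {}
    threshold = alpha / len(pairs)  # Bonferroni correction over all tested pairs
    edges = {}
    for a, b in pairs:
        shared = len(occurrences[a] & occurrences[b])
        if shared == 0:
            continue
        p = hypergeom_sf(shared, n_docs, len(occurrences[a]), len(occurrences[b]))
        if p < threshold:
            edges[(a, b)] = shared
    return edges


def connected_components(edges):
    """Stand-in for community detection: connected components of the validated network."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def modularity_contribution(edges, community):
    """Unweighted modularity term of one community: L_c / m - (d_c / (2 m))^2."""
    m = len(edges)
    internal = sum(1 for a, b in edges if a in community and b in community)
    degree_sum = sum((a in community) + (b in community) for a, b in edges)
    return internal / m - (degree_sum / (2 * m)) ** 2


# Toy corpus (hypothetical): two themes plus the uninformative word "data".
docs = [["network", "graph", "node", "data"]] * 8 + [["gene", "protein", "cell", "data"]] * 8
edges = validated_word_network(docs)
topics = connected_components(edges)
contributions = [modularity_contribution(edges, t) for t in topics]
```

In this toy corpus the word "data" appears in every document, so its co-occurrence with any other word is exactly what the null model predicts and none of its links are validated. The two remaining groups of words form separate communities, each with its own modularity contribution. In practice a dedicated community detection algorithm would replace the connected-components step.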

Date: 2023
ISBN of the monograph: 978-2-9557050-7-0
Citation: Andrea Simonetti, Alessandro Albano (2023). Statistically Validated Network approach for document clustering and topic modeling. In THE 12TH INTERNATIONAL CONFERENCE ON COMPLEX NETWORKS AND THEIR APPLICATIONS, BOOK OF ABSTRACTS.
Appears in the categories: 2.07 Conference proceedings contribution published in a volume

Files in this record:

File	Size	Format
2023_CNA.pdf (access: archive managers only; description: full contribution; type: publisher's version)	2.1 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/631817
