Andrea Simonetti, Alessandro Albano (2023). Statistically Validated Network approach for document clustering and topic modeling. In The 12th International Conference on Complex Networks and Their Applications, Book of Abstracts.

Statistically Validated Network approach for document clustering and topic modeling

Andrea Simonetti; Alessandro Albano
2023-01-01

Abstract

In machine learning, document clustering and topic modeling are scientific challenges concerning the extraction of useful information from a collection of texts. Traditional approaches, such as Latent Dirichlet Allocation (LDA), rely on maximising likelihood functions. In this paper, we explore a paradigm shift towards a network representation of textual data and the associated challenges of community detection [3]. We propose a new method to address the tasks of document clustering and topic modeling, representing a collection of documents as a bipartite network. We then apply Statistically Validated Networks (SVN) to filter out irrelevant connections within the projected networks of words and documents. The SVN method is promising in the framework of topic modeling. For instance, Simonetti et al. (2022) recently proposed a new application of SVN to measure the coherence of topics. Here, instead, we aim to identify the topics themselves. By doing so, we naturally find topics with high coherence according to the measure proposed by those authors. Moreover, the modularity contribution of each community (topic) can be interpreted as a measure of coherence, since it is an intensive quantity that assesses the tendency of words within a given topic to co-occur in the same sentences.
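The filtering step described above can be illustrated with a minimal Python sketch. The abstract does not specify the null model, the multiple-testing correction, or the unit of co-occurrence, so this sketch assumes the commonly used SVN formulation: a hypergeometric null for the number of documents shared by two words, with a Bonferroni threshold over all tested pairs. The function names `hypergeom_sf` and `validated_word_network`, the `alpha` parameter, and the toy corpus are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws)."""
    upper = min(K, n)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / comb(N, n)


def validated_word_network(docs, alpha=0.01):
    """Keep word pairs that co-occur more often than a hypergeometric null predicts.

    Assumption: a Bonferroni threshold alpha / (number of tested pairs) is used.
    """
    # Bipartite network stored as a mapping: word -> set of document indices.
    occurrences = {}
    for i, doc in enumerate(docs):
        for word in set(doc):
            occurrences.setdefault(word, set()).add(i)

    pairs = list(combinations(sorted(occurrences), 2))
    if not pairs:
        return {}
    threshold = alpha / len(pairs)

    n_docs = len(docs)
    edges = {}
    for a, b in pairs:
        # Edge weight in the word projection: number of shared documents.
        shared = len(occurrences[a] & occurrences[b])
        if shared == 0:
            continue
        p = hypergeom_sf(shared, n_docs, len(occurrences[a]), len(occurrences[b]))
        if p < threshold:
            edges[(a, b)] = p
    return edges


# Toy corpus (hypothetical): two topics plus a word present in every document.
docs = [["gene", "dna", "the"]] * 10 + [["market", "stock", "the"]] * 10
edges = validated_word_network(docs)
print(sorted(edges))
```

In this toy corpus the ubiquitous word "the" shares documents with every other word, yet none of its links is validated, because sharing all documents is exactly what the null model expects for a word that appears everywhere. A community detection algorithm would then be run on the validated network to obtain the topics; that step is omitted here.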
Year: 2023
ISBN: 978-2-9557050-7-0
Use this identifier to cite or link to this document: https://hdl.handle.net/10447/631817