Probabilistic topic models are machine learning tools for processing and understanding large text document collections. Among the different models in the literature, Latent Dirichlet Allocation (LDA) has turned out to be the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures over latent semantic structures called topics. Each topic follows a multinomial distribution over the vocabulary words. In order to understand the result of a topic model, researchers usually select the top-n (essential words) words with the highest probability given a topic and look for meaningful and interpretable semantic themes. This work proposes a new method for exploring topics in LDA models, using Statistically Validated Networks (SVNs). The main idea of the proposed method is to consider co-occurrence between essential words as a measure of association. Two different approaches, called undirected and directed are proposed. Firstly, the symmetrical asso- ciation between two words is taken into account, i.e. how many times two words are found in the same sentence. Conversely, in the directed approach, the order in which the words are in the sentence is also considered. We use hypothesis testing to assess whether the co-occurrence between two words can be attributed to the chance or if these links carry relevant information about the structure of topics. Specifically, textual data is represented as a bipartite network in which one set of nodes is made by sentences, and the other set of nodes is made by a list of essential words associated with a given topic. A link between a word and a sentence is set if the word belongs to that sentence. Therefore, the projection of the bipartite network on the set of words results in a word-co-occurrence network. Note that the directed approach produces a directed network while the undirected one an undirected network. Indeed, a directed link from one word to another may be val- idated, but not the other way around. The two methods are applied to a real dataset, highlighting the differences.

Alessandro Albano, Mariangela Sciandra, Antonella Plaia (2022). Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches. In Proceedings of the 11th International Conference on Complex Networks and their Applications.

Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches

Alessandro Albano;Mariangela Sciandra;Antonella Plaia
2022-01-01

Abstract

Probabilistic topic models are machine learning tools for processing and understanding large text document collections. Among the different models in the literature, Latent Dirichlet Allocation (LDA) has turned out to be the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures over latent semantic structures called topics. Each topic follows a multinomial distribution over the vocabulary words. In order to understand the result of a topic model, researchers usually select the top-n (essential words) words with the highest probability given a topic and look for meaningful and interpretable semantic themes. This work proposes a new method for exploring topics in LDA models, using Statistically Validated Networks (SVNs). The main idea of the proposed method is to consider co-occurrence between essential words as a measure of association. Two different approaches, called undirected and directed are proposed. Firstly, the symmetrical asso- ciation between two words is taken into account, i.e. how many times two words are found in the same sentence. Conversely, in the directed approach, the order in which the words are in the sentence is also considered. We use hypothesis testing to assess whether the co-occurrence between two words can be attributed to the chance or if these links carry relevant information about the structure of topics. Specifically, textual data is represented as a bipartite network in which one set of nodes is made by sentences, and the other set of nodes is made by a list of essential words associated with a given topic. A link between a word and a sentence is set if the word belongs to that sentence. Therefore, the projection of the bipartite network on the set of words results in a word-co-occurrence network. Note that the directed approach produces a directed network while the undirected one an undirected network. Indeed, a directed link from one word to another may be val- idated, but not the other way around. The two methods are applied to a real dataset, highlighting the differences.
Alessandro Albano, Mariangela Sciandra, Antonella Plaia (2022). Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches. In Proceedings of the 11th International Conference on Complex Networks and their Applications.
File in questo prodotto:
File Dimensione Formato  
Complex_Networks_2022_Albanoetal.pdf

Solo gestori archvio

Descrizione: Articolo principale
Tipologia: Pre-print
Dimensione 108.88 kB
Formato Adobe PDF
108.88 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10447/576096
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact