Probabilistic topic models are machine learning tools for processing and understanding large collections of text documents. Among the models in the literature, Latent Dirichlet Allocation (LDA) has become the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures over latent semantic structures called topics, where each topic follows a multinomial distribution over the words of the vocabulary. To interpret the result of a topic model, researchers usually select the top-n words with the highest probability given a topic (the essential words) and look for meaningful and interpretable semantic themes. This work proposes a new method for exploring topics in LDA models using Statistically Validated Networks (SVNs). The main idea is to use the co-occurrence of essential words as a measure of association. Two approaches, called undirected and directed, are proposed. The undirected approach considers the symmetrical association between two words, i.e. how many times the two words are found in the same sentence. The directed approach also considers the order in which the words appear in the sentence. We use hypothesis testing to assess whether the co-occurrence of two words can be attributed to chance or whether the link carries relevant information about the structure of the topics. Specifically, textual data is represented as a bipartite network in which one set of nodes consists of sentences and the other consists of the essential words associated with a given topic. A link between a word and a sentence is set if the word belongs to that sentence. The projection of the bipartite network onto the set of words then yields a word co-occurrence network. The directed approach produces a directed network, while the undirected approach produces an undirected one: in the directed case, a link from one word to another may be validated while the link in the opposite direction is not. The two methods are applied to a real dataset, highlighting the differences.
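The pipeline described in the abstract (bipartite sentence-word network, projection onto words, statistical validation of each link) can be illustrated with a minimal Python sketch. Several choices here are assumptions and are not stated in the abstract: the null model is a hypergeometric test with a Bonferroni correction, as is common in the SVN literature; sentences are tokenized by lowercasing and splitting on whitespace; and a directed co-occurrence from word a to word b is counted when the first occurrence of a precedes the first occurrence of b in a sentence. All function names are hypothetical.

```python
from itertools import combinations
from math import comb


def tokenize(sentence):
    # Assumption: naive lowercase whitespace tokenization.
    return sentence.lower().split()


def cooccurrence_counts(sentences, essential_words):
    """Project the sentence-word bipartite network onto the words."""
    essential = set(essential_words)
    # Bipartite links: for each sentence, the essential words it contains.
    links = [essential & set(tokenize(s)) for s in sentences]
    # Degree of a word = number of sentences it belongs to.
    degree = {w: sum(w in l for l in links) for w in essential}
    pair_counts = {}
    for l in links:
        for a, b in combinations(sorted(l), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
    return degree, pair_counts


def hypergeom_pvalue(n_sentences, n_a, n_b, n_ab):
    """P(X >= n_ab) under a hypergeometric null (assumed null model)."""
    total = comb(n_sentences, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_sentences - n_a, n_b - k)
        for k in range(n_ab, min(n_a, n_b) + 1)
    )
    return tail / total


def validated_links(sentences, essential_words, alpha=0.01):
    """Keep undirected links whose p-value passes a Bonferroni threshold."""
    degree, pair_counts = cooccurrence_counts(sentences, essential_words)
    n_words = len(set(essential_words))
    n_tests = max(1, n_words * (n_words - 1) // 2)  # one test per word pair
    threshold = alpha / n_tests
    validated = {}
    for (a, b), n_ab in pair_counts.items():
        p = hypergeom_pvalue(len(sentences), degree[a], degree[b], n_ab)
        if p < threshold:
            validated[(a, b)] = p
    return validated


def directed_counts(sentences, essential_words):
    """Count a -> b when a first appears before b in a sentence (assumption)."""
    essential = set(essential_words)
    counts = {}
    for s in sentences:
        first_pos = {}
        for i, tok in enumerate(tokenize(s)):
            if tok in essential and tok not in first_pos:
                first_pos[tok] = i
        for a in first_pos:
            for b in first_pos:
                if first_pos[a] < first_pos[b]:
                    counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts
```

In this sketch the directed counts are asymmetric by construction, so a validation step applied to them could retain a link from a to b without retaining the reverse link, which is the behaviour the abstract describes for the directed approach. The abstract does not specify how directed links are tested, so that step is left out here.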
Alessandro Albano, Mariangela Sciandra, Antonella Plaia (2022). Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches. In Proceedings of the 11th International Conference on Complex Networks and their Applications (pp. 189-191).
Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches
Alessandro Albano; Mariangela Sciandra; Antonella Plaia
2022-01-01