Institutional Research Archive of the Università degli Studi di Palermo

Alessandro Albano, Mariangela Sciandra, Antonella Plaia (2022). Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches. In Proceedings of the 11th International Conference on Complex Networks and their Applications (pp. 189-191).

Exploring topics in LDA models through Statistically Validated Networks: directed and undirected approaches

Alessandro Albano; Mariangela Sciandra; Antonella Plaia

2022-01-01

Abstract

Probabilistic topic models are machine learning tools for processing and understanding large collections of text documents. Among the models in the literature, Latent Dirichlet Allocation (LDA) has become the benchmark of the topic modelling community. The key idea is to represent text documents as random mixtures over latent semantic structures called topics, where each topic is a multinomial distribution over the vocabulary words. To interpret the result of a topic model, researchers usually select the top-n words with the highest probability given a topic (the essential words) and look for meaningful and interpretable semantic themes. This work proposes a new method for exploring topics in LDA models using Statistically Validated Networks (SVNs). The main idea is to consider co-occurrence between essential words as a measure of association. Two approaches, called undirected and directed, are proposed. The undirected approach takes into account the symmetrical association between two words, i.e. how many times the two words are found in the same sentence. The directed approach also considers the order in which the words appear in the sentence. We use hypothesis testing to assess whether the co-occurrence between two words can be attributed to chance or whether these links carry relevant information about the structure of topics. Specifically, textual data is represented as a bipartite network in which one set of nodes consists of sentences and the other consists of the essential words associated with a given topic. A link between a word and a sentence is set if the word belongs to that sentence. The projection of the bipartite network onto the set of words therefore yields a word co-occurrence network. Note that the directed approach produces a directed network, while the undirected one produces an undirected network. In the directed network, a link from one word to another may be validated while the reverse link is not. The two methods are applied to a real dataset, highlighting the differences.
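
The sentence-by-word bipartite construction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which null hypothesis or multiple-testing correction the authors use, so the hypergeometric test and the Bonferroni threshold below are assumptions, chosen because they are a common choice in the SVN literature. All function names (`hypergeom_sf`, `cooccurrence_counts`, `validated_undirected_links`) and the toy sentences are hypothetical. For the directed approach the sketch only counts ordered co-occurrences and does not attempt to validate them.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_total, n_a, n_b):
    """P(X >= k) when X is hypergeometric: n_b draws from n_total items, n_a marked."""
    upper = min(n_a, n_b)
    total = comb(n_total, n_b)
    tail = sum(comb(n_a, i) * comb(n_total - n_a, n_b - i) for i in range(k, upper + 1))
    return tail / total


def cooccurrence_counts(sentences, essential_words):
    """Project the sentence-word bipartite network onto the essential words.

    Returns per-word sentence counts, undirected pair counts (keys sorted
    alphabetically) and directed pair counts (key = (earlier word, later word)).
    """
    vocab = set(essential_words)
    occ = {w: 0 for w in vocab}
    undirected, directed = {}, {}
    for tokens in sentences:
        # Position of the first occurrence of each essential word in the sentence.
        first_pos = {}
        for i, tok in enumerate(tokens):
            if tok in vocab and tok not in first_pos:
                first_pos[tok] = i
        for w in first_pos:
            occ[w] += 1
        for a, b in combinations(sorted(first_pos), 2):
            undirected[(a, b)] = undirected.get((a, b), 0) + 1
            src, dst = (a, b) if first_pos[a] < first_pos[b] else (b, a)
            directed[(src, dst)] = directed.get((src, dst), 0) + 1
    return occ, undirected, directed


def validated_undirected_links(sentences, essential_words, alpha=0.05):
    """Keep undirected links whose co-occurrence is unlikely under the assumed null."""
    occ, undirected, _ = cooccurrence_counts(sentences, essential_words)
    n_words = len(set(essential_words))
    n_tests = max(n_words * (n_words - 1) // 2, 1)
    threshold = alpha / n_tests  # Bonferroni correction (an assumption of this sketch)
    links = {}
    for (a, b), k in undirected.items():
        p = hypergeom_sf(k, len(sentences), occ[a], occ[b])
        if p < threshold:
            links[(a, b)] = p
    return links


# Toy corpus: each sentence is a list of tokens (hypothetical data).
sentences = (
    [["cat", "chases", "dog"]] * 6
    + [["dog", "sees", "cat"]] * 2
    + [["fish", "swims"]] * 6
    + [["nothing", "here"]] * 6
)
links = validated_undirected_links(sentences, ["cat", "dog", "fish"])
```

In this toy corpus the pair ("cat", "dog") co-occurs in every sentence where either word appears, so it passes the corrected threshold, whereas "fish" never shares a sentence with the other two words and receives no link. The directed counts separate the sentences in which "cat" precedes "dog" from those with the opposite order, which is the extra information the directed approach builds on.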

Date: 2022
ISBN of the monograph: 978-2-9557050-6-3
Publisher URL (open access where possible): https://2022.complexnetworks.org/
Appears in types: 2.07 Conference proceedings contribution published in a volume

Files in this record:

File	Access	Type	Size	Format
Complex_Networks_2022_Albanoetal.pdf (main article)	Archive managers only	Post-print	108.88 kB	Adobe PDF
BookOfAbstractsCNA2022_LDA.pdf	Open access	Publisher's version	2.92 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/576096
