Probabilistic topic models, such as latent Dirichlet allocation (LDA), are standard text analysis algorithms that provide a predictive, latent topic representation of a corpus. However, because training is unsupervised, it is difficult to verify the assumption that the latent space discovered by these models is meaningful and useful. This paper introduces a two-stage LDA algorithm that estimates latent topics in text documents and uses readability scores to link the identified topics to a linguistically motivated latent structure. We define a new interpretative tool called induced topic readability, which is used to rank topics from the one with the most complex linguistic structure to the one that is most easily readable. The usefulness of our method is shown with an application to real data, using articles from the New York Times.
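The abstract does not spell out how induced topic readability is computed, so the following is only a rough sketch of one plausible reading, not the authors' algorithm. It assumes that a document-topic proportion matrix has already been estimated by LDA (here the hypothetical `doc_topic`, filled with made-up numbers), scores each document with the Flesch Reading Ease formula using a crude vowel-group syllable count, and ranks topics by the topic-weighted mean of those scores. All function names and the toy data are illustrative assumptions.

```python
import re


def count_syllables(word):
    """Crude heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))


def rank_topics_by_readability(doc_topic, texts):
    """Return (topic index, weighted mean score) pairs, hardest topic first.

    Assumption: a topic's score is the mean document readability,
    weighted by each document's proportion of that topic.
    """
    scores = [flesch_reading_ease(t) for t in texts]
    ranked = []
    for k in range(len(doc_topic[0])):
        weight = sum(row[k] for row in doc_topic)
        mean = sum(row[k] * s for row, s in zip(doc_topic, scores)) / weight
        ranked.append((k, mean))
    # Ascending Flesch score puts the least readable topic first.
    return sorted(ranked, key=lambda pair: pair[1])


# Toy inputs (hypothetical): two documents, two topics.
texts = [
    "The cat sat on the mat. The dog ran.",
    "Macroeconomic stabilization necessitates considerable "
    "institutional coordination.",
]
doc_topic = [[0.9, 0.1],   # document 0 is mostly topic 0
             [0.2, 0.8]]   # document 1 is mostly topic 1

ranking = rank_topics_by_readability(doc_topic, texts)
for topic, score in ranking:
    print(f"topic {topic}: weighted Flesch score {score:.1f}")
```

In this toy setup the topic dominated by the long, polysyllabic document is ranked as the hardest. Any other readability index could be swapped in for `flesch_reading_ease` without changing the ranking step.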

Mariangela Sciandra, Alessandro Albano (2022). A two-stage LDA algorithm for ranking induced topic readability. In JADT 2022 proceedings book.

A two-stage LDA algorithm for ranking induced topic readability

Mariangela Sciandra; Alessandro Albano
2022-07-01

Jul-2022
ISBN 979-12-80153-31-9


Use this identifier to cite or link to this document: https://hdl.handle.net/10447/564283