Old Church Slavonic (OCS) is an ancient language that poses distinctive challenges for natural language processing. There is currently a lack of Python libraries designed for the analysis of OCS texts. This research fills a crucial gap in the computational treatment of OCS and also produces resources that scholars in historical linguistics, cultural studies, and the humanities can use for further research in ancient language processing. The main contribution of this work is an algorithm for the lemmatization of OCS texts based on a learned dictionary. The approach can handle ancient languages without requiring prior linguistic knowledge. Using a prepared dataset of more than 330K OCS words and their corresponding lemmas, the approach integrates the algorithm and the dictionary efficiently to achieve accurate lemmatization on test data.
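To make the idea of lemmatization from a learned dictionary concrete, the following is a minimal Python sketch. It is not the authors' algorithm, which this record does not describe in detail. It assumes the simplest possible reading: a mapping from word form to lemma is learned from (form, lemma) pairs by keeping the most frequent lemma for each form, and unseen forms are returned unchanged. The function names `learn_dictionary` and `lemmatize` and the toy pairs are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter, defaultdict


def learn_dictionary(pairs):
    """Build a form -> lemma mapping from (form, lemma) training pairs.

    Assumption for this sketch: when a form occurs with several lemmas,
    the most frequent one is kept.
    """
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def lemmatize(tokens, dictionary):
    """Look each token up in the dictionary; unseen tokens are returned as-is."""
    return [dictionary.get(token, token) for token in tokens]


# Illustrative toy pairs only (hypothetical, not the paper's data).
TOY_PAIRS = [
    ("бога", "богъ"),
    ("богъ", "богъ"),
    ("словеса", "слово"),
    ("слово", "слово"),
]

toy_dictionary = learn_dictionary(TOY_PAIRS)
print(lemmatize(["словеса", "бога"], toy_dictionary))
```

A pure lookup like this needs no hand-written morphological rules, which is one way to read the claim that no prior linguistic knowledge is required. How the published approach treats forms missing from the dictionary is not stated in this record.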
Nawaz, U., Lo Presti, L., Napolitano, M., La Cascia, M. (2024). Automatic Lemmatization of Old Church Slavonic Language Using A Novel Dictionary-Based Approach. In G. Sfikas, G. Retsinas (Eds.), Document Analysis Systems: 16th IAPR International Workshop, DAS 2024, Athens, Greece, August 30–31, 2024, Proceedings (pp. 408-421) [10.1007/978-3-031-70442-0_25].
Automatic Lemmatization of Old Church Slavonic Language Using A Novel Dictionary-Based Approach
Nawaz, Usman; Lo Presti, Liliana; La Cascia, Marco
2024-09-11
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.