Automatic Extraction of Regesta for Medieval Latin Text Summarization

Ilaria Sabbatini
Member of the Collaboration Group
2025-01-01

Abstract

The contribution presents the construction of the REVERINO dataset, consisting of 4,533 pairs of regesta and medieval Latin texts, created through a structured pipeline involving manual annotation, segmentation model training, OCR-based text extraction, and post-processing. The aim is to support the automatic summarization of historical documents, particularly 13th-century papal texts. The study uses the dataset to evaluate the performance of language models (GPT-4 and Llama) in generating regesta, comparing direct and translation-based approaches. The results show promising potential but also significant limitations, especially in accurately identifying key elements such as names, dates, and recipients. The project demonstrates that AI can contribute to the summarization of historical sources, but further improvements in both models and data are needed to ensure reliability and accuracy.
2025
Disciplinary sector HIST-04/D - Palaeography
Sabbatini, I., Righi, L., Puccetti, G., Esuli, A. (2025). Automatic Extraction of Regesta for Medieval Latin Text Summarization. ERCIM News (141), 31-32.
Files in this record:
ERCIM EN141-web.pdf
Access: open access
Description: Main article
Type: Publisher's version
Size: 3.01 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/705552