Sabbatini, I. (2025). Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin. In A. Melloni, F. Cadeddu (Eds.), Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin (pp. 121-148). Göttingen: Vandenhoeck & Ruprecht Verlage [10.14220/9783737018944.121].

Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin

Ilaria Sabbatini
Co-first author
2025-12-01

Abstract

This paper outlines the methodological approach developed by the REVER project to digitally reproduce the regestation process starting from documentary texts. The first phase concerned the definition of the corpus, a complex task due to the large number of papal documents and the fragmentation of the editions: some collections include both the regesta and the extended texts, while others contain only the regesta with references to hundreds of volumes. All editions, produced between the nineteenth and twentieth centuries, must be converted from digitised images into fully machine-readable text. This requires a recognition step using OCR/HTR, made difficult by complex layouts, ancient characters, abbreviations, and printing noise. After surveying the main collections, the project selected a group of volumes by Auvray, Berger, Haluscynskyj, the MGH, and part of Potthast’s regesta, prioritising editions that include both regesta and extended texts. The goal is to build a dataset of approximately 23,000 regestum/extended text pairs for training machine-learning tools. The paper also presents a comparative analysis of the most widely used OCR/HTR systems (Transkribus, eScriptorium, OCR4all, Rescribe, Treventus), assessing their features, limitations, recognition models, output quality, and technical requirements. The defined workflow makes it possible to standardise text preparation, facilitate annotation, enable automatic summarisation, and support the testing of the REVER model, providing replicable guidelines for the digitisation of other historical corpora.
Dec-2025
Sector HIST-04/A - History of Religions
Sector HIST-04/B - History of Christianity and the Churches
Files in this item:
Regexta Sabbatini_Puccetti Melloni_Cadeddu_10_Puccetti-Sabbatini[11152].pdf

Description: full chapter
Type: Pre-print
Format: Adobe PDF
Size: 2.23 MB
Access: archive managers only

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/695183