Sabbatini, I. (2025). Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin. In A. Melloni, F. Cadeddu (Eds.), The Digital Turn in Religious Studies. Research, Services, Infrastructures (pp. 121-148). Vandenhoeck & Ruprecht unipress.

Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin

Sabbatini, Ilaria
2025-12-01

Abstract

This paper outlines the methodological approach developed by the REVER project to digitally reproduce the regestation process starting from documentary texts. The first phase was to define the corpus, a task complicated by the large number of papal documents and the fragmentation of the editions: some collections include both the regesta and the extended texts, while others contain only the regesta with references to hundreds of volumes. All editions, produced between the nineteenth and twentieth centuries, must be converted from digitised images into fully machine-readable text. This requires a recognition step using OCR/HTR, which is made difficult by complex layouts, ancient characters, abbreviations, and printing noise. After surveying the main collections, the project selected a group of volumes by Auvray, Berger, Haluscynskyj, the MGH, and part of Potthast’s regesta, prioritising editions that include both regesta and extended texts. The goal is to build a dataset of approximately 23,000 regestum/extended text pairs for training machine-learning tools. The paper also presents a comparative analysis of the most widely used OCR/HTR systems (Transkribus, eScriptorium, OCR4all, Rescribe, Treventus), assessing their features, limitations, recognition models, output quality, and technical requirements. The resulting workflow standardises text preparation, facilitates annotation, enables automatic summarisation, and supports the testing of the REVER model, providing replicable guidelines for the digitisation of other historical corpora.
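
The regestum/extended text pairs mentioned in the abstract could be stored along the following lines. This is only an illustrative sketch, not the REVER project's actual format: the class `RegestumPair`, its field names, and the JSON Lines layout are all assumptions introduced here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RegestumPair:
    """One training example (hypothetical schema, not the project's own)."""
    source: str     # edition label, e.g. "Auvray" (assumed field)
    doc_id: str     # identifier within the edition (assumed field)
    regestum: str   # the short summary: target of summarisation
    full_text: str  # the extended Latin text: input to summarisation


def to_jsonl(pairs):
    """Serialise pairs as JSON Lines, skipping entries that lack either half."""
    lines = []
    for pair in pairs:
        if pair.regestum.strip() and pair.full_text.strip():
            # ensure_ascii=False keeps Latin diacritics and ligatures readable
            lines.append(json.dumps(asdict(pair), ensure_ascii=False))
    return "\n".join(lines)


def from_jsonl(text):
    """Rebuild the list of pairs from JSON Lines text."""
    return [RegestumPair(**json.loads(line))
            for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Placeholder strings only; no real edition text is quoted here.
    sample = [
        RegestumPair("Auvray", "example-1", "<regestum text>", "<extended text>"),
        RegestumPair("Potthast", "example-2", "<regestum text>", ""),
    ]
    print(to_jsonl(sample))
```

The second sample entry has a regestum but no extended text, so `to_jsonl` drops it. This mirrors the abstract's preference for editions that provide both halves of each pair.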
Dec 2025
Sector HIST-04/D - Palaeography
Files in this record:

File: The Digital Turn in Religious Studies_Puccetti-Sabbatini.pdf
Access: Archive managers only
Description: Complete main article with the volume's title page and table of contents
Type: Publisher's version
Size: 848.4 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/695184