Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin

Sabbatini, I.

doi:10.14220/9783737018944.121

This paper outlines the methodological approach developed by the REVER project to digitally reproduce the regestation process starting from documentary texts. The first phase concerned the definition of the corpus, a complex task due to the large number of papal documents and the fragmentation of the editions: some collections include both the regesta and the extended texts, while others contain only the regesta with references to hundreds of volumes. All editions, produced between the nineteenth and twentieth centuries, must be converted from digitised images into fully machine-readable text. This requires a recognition step using OCR/HTR, made difficult by complex layouts, ancient characters, abbreviations, and printing noise. After surveying the main collections, the project selected a group of volumes by Auvray, Berger, Haluscynskyj, the MGH, and part of Potthast’s regesta, prioritising editions that include both regesta and extended texts. The goal is to build a dataset of approximately 23,000 regestum/extended text pairs for training machine-learning tools. The paper also presents a comparative analysis of the most widely used OCR/HTR systems (Transkribus, eScriptorium, OCR4all, Rescribe, Treventus), assessing their features, limitations, recognition models, output quality, and technical requirements. The defined workflow makes it possible to standardise text preparation, facilitate annotation, enable automatic summarisation, and support the testing of the REVER model, providing replicable guidelines for the digitisation of other historical corpora.

Sabbatini, I. (2025). Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin. In A. Melloni, F. Cadeddu (eds.), Definition and Automatic Collection of a New Medieval Corpus for Text Summarisation in Latin (pp. 121-148). Göttingen: Vandenhoeck & Ruprecht Verlage [10.14220/9783737018944.121].