Caffagni, D., Cocchi, F., Mambelli, A., Tutrone, F., Zanella, M., Cornia, M., et al. (2026). Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval. In W.T. Balke, K. Golub, Y. Manolopoulos, K. Stefanidis, Z. Zhang (Eds.), Linking Theory and Practice of Digital Libraries: 29th International Conference on Theory and Practice of Digital Libraries, TPDL 2025, Tampere, Finland, September 23–26, 2025, Proceedings (pp. 36–52). Cham: Springer.
Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval
Mambelli, Anna; Tutrone, Fabio
2026-01-01
Abstract
Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available at https://github.com/aimagelab/biblical-retrieval-synthesis.
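The abstract outlines a two-step recipe: generate domain-specific sentence pairs with an LLM, then fine-tune a sentence embedding model on them so it can rank candidate sentences by similarity. As a rough illustration only, the following minimal sketch uses the sentence-transformers library with a contrastive in-batch-negatives objective; the base model, the Latin example pairs, and all hyperparameters are placeholder assumptions, not the authors' actual configuration (their released code is at the repository linked above).

```python
# Minimal sketch of the pipeline described in the abstract, NOT the authors'
# released code. Base model, example pairs, and hyperparameters are
# illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Start from a pretrained multilingual sentence encoder (assumption).
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# LLM-generated domain-specific pairs, e.g. a Latin Bible verse and a
# generated rewording that alludes to it (hypothetical examples), mixed
# with the limited real annotated pairs.
pairs = [
    ("in principio creavit Deus caelum et terram",
     "Deus caelum terramque in exordio condidit"),
    ("dixitque Deus fiat lux et facta est lux",
     "Deus lucem esse iussit, et lux exorta est"),
    # ... many more synthetic pairs
]
train_examples = [InputExample(texts=[a, b]) for a, b in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Contrastive fine-tuning with in-batch negatives: each pair is a positive,
# and the other sentences in the batch act as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)

# Retrieval: embed a query and a corpus, then rank by cosine similarity.
corpus = ["fiat lux", "Deus caelum terramque in exordio condidit"]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("in principio creavit Deus caelum et terram",
                         convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)  # shape: (1, len(corpus))
print(scores)
```

In this setup the LLM acts as a teacher: its generated pairs define which sentences should embed close together, and the contrastive loss distills that signal into the smaller retrieval model.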
| File | Description | Type | Size | Format | Access |
|---|---|---|---|---|---|
| Springer_Digital Libraries-FINAL.pdf | Full article, including the volume's front matter and table of contents | Published version | 2.08 MB | Adobe PDF | Archive managers only (View/Open: request a copy) |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


