Cross Lingual Embeddings for Clinical Text: A Statistical Framework for Validating Real and Synthetic Electronic Health Records

Speciale Marco; Albano Alessandro; Sciandra Mariangela; Plaia Antonella
2025-01-01

Abstract

The effective integration of real and synthetic clinical data in multiple languages is essential to advance healthcare research. In this study, we propose a statistical framework that leverages cross-lingual embeddings to validate semantic alignment between authentic Italian EHRs and synthetic English clinical notes. Using two state-of-the-art models, E5 and BGE, we encode the texts and employ Fuzzy C-Means clustering along with multidimensional scaling to assess their semantic coherence. Our analysis reveals distinct language-specific patterns alongside robust cross-lingual alignment, highlighting the promise of synthetic data augmentation in mitigating resource scarcity.
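The pipeline summarized in the abstract (embed the texts, cluster with Fuzzy C-Means, project with multidimensional scaling) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the "embeddings" are random placeholder vectors rather than real E5 or BGE output, the helper names `fuzzy_c_means` and `classical_mds` are hypothetical, the fuzzifier `m=2` and the choice of two clusters are arbitrary, and classical (Torgerson) MDS stands in for whichever MDS variant the paper uses.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain Fuzzy C-Means; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the data.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Point-to-center distances, floored to avoid division by zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS on Euclidean distances between rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n               # centering matrix
    B = -0.5 * J @ D2 @ J                             # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]       # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Placeholder "embeddings": two well-separated Gaussian blobs standing in
# for two groups of notes. These are NOT real E5/BGE vectors.
rng = np.random.default_rng(42)
emb_a = rng.normal(loc=0.0, scale=0.1, size=(30, 16))
emb_b = rng.normal(loc=1.0, scale=0.1, size=(30, 16))
X = np.vstack([emb_a, emb_b])

centers, U = fuzzy_c_means(X, n_clusters=2)
coords = classical_mds(X, n_components=2)
hard = U.argmax(axis=1)  # hard label = cluster with the highest membership
```

In a real analysis, `X` would hold the model embeddings of the Italian and English notes, and the soft memberships in `U` would show how strongly each note belongs to each cluster rather than forcing a single label.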
2025
Sector STAT-01/A - Statistics
9783031960321
9783031960338
Files in this record:
File: Marco_Speciale_iris.pdf
Access: Archive managers only
Type: Publisher's version
Size: 1.29 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/684744