Speciale Marco, Albano Alessandro, Sciandra Mariangela, Plaia Antonella (2025). Cross Lingual Embeddings for Clinical Text: A Statistical Framework for Validating Real and Synthetic Electronic Health Records. In Statistics for Innovation IV SIS 2025, Short Papers, Contributed Sessions 3 (pp. 351-356) [10.1007/978-3-031-96033-8_57].
Cross Lingual Embeddings for Clinical Text: A Statistical Framework for Validating Real and Synthetic Electronic Health Records
Speciale Marco; Albano Alessandro; Sciandra Mariangela; Plaia Antonella
2025-01-01
Abstract
The effective integration of real and synthetic clinical data in multiple languages is essential to advance healthcare research. In this study, we propose a statistical framework that leverages cross-lingual embeddings to validate semantic alignment between authentic Italian EHRs and synthetic English clinical notes. Using two state-of-the-art models, E5 and BGE, we encode the texts and employ Fuzzy C-Means clustering along with multidimensional scaling to assess their semantic coherence. Our analysis reveals distinct language-specific patterns alongside robust cross-lingual alignment, highlighting the promise of synthetic data augmentation in mitigating resource scarcity.
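The pipeline named in the abstract (embed the texts, cluster with Fuzzy C-Means, project with multidimensional scaling) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the vectors are random stand-ins rather than E5 or BGE embeddings, the helper names `fuzzy_c_means` and `classical_mds` are hypothetical, and the fuzzifier, cluster count, and classical (Torgerson) MDS variant are arbitrary choices, since this record does not state the authors' actual settings.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=300, tol=1e-6, seed=0):
    """Plain NumPy Fuzzy C-Means (hypothetical helper); returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalised to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance of every point to every center, floored to avoid division by zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS on Euclidean distances (hypothetical helper)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ D2 @ J                    # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Toy stand-ins for embeddings: two clinical "topics", each present in both
# corpora. These are NOT real E5/BGE vectors; they only mimic the setting.
rng = np.random.default_rng(42)
dim, n_per_group = 16, 20
topic_centers = rng.normal(size=(2, dim)) * 5.0
lang_offset = rng.normal(size=dim) * 0.3     # assumed small language-specific shift

def make_group(topic, lang):
    noise = rng.normal(scale=0.5, size=(n_per_group, dim))
    return topic_centers[topic] + lang * lang_offset + noise

# lang 0 = "real Italian", lang 1 = "synthetic English" (labels for the toy data only)
X = np.vstack([make_group(0, 0), make_group(0, 1), make_group(1, 0), make_group(1, 1)])
topics = np.repeat([0, 0, 1, 1], n_per_group)
langs = np.repeat([0, 1, 0, 1], n_per_group)

centers, U = fuzzy_c_means(X, n_clusters=2)
hard = U.argmax(axis=1)

# Agreement between hard cluster labels and topics, up to label permutation.
agreement = max(np.mean(hard == topics), np.mean(hard != topics))
# Share of "synthetic English" points in each cluster: values near 0.5 mean
# both corpora populate the same clusters in this toy setup.
english_share = [float(np.mean(langs[hard == k])) for k in range(2)]

coords = classical_mds(X, n_components=2)    # 2-D layout for visual inspection
print("topic agreement:", agreement)
print("English share per cluster:", english_share)
print("MDS coordinates shape:", coords.shape)
```

In a real application the matrix `X` would hold the sentence embeddings of the Italian and English notes, and the soft memberships in `U` would be inspected per language rather than reduced to hard labels as done here for brevity.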
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


