Textual analysis has gained significant interest in medical research, particularly for automated patient diagnosis based on clinical narratives. While traditional approaches often focus on associational methods, this paper explores the application of causal forests to analyze textual data from electronic health records (EHRs), aiming to identify causal relationships between specific words and the likelihood of receiving certain medical diagnoses. Utilizing the MIMIC-III dataset, we assess how linguistic factors influence diagnosis probabilities for three conditions: diabetes, hypothyroidism, and adrenal gland disorders. Our findings reveal significant causal links between certain clinical terms and diagnosis probabilities, emphasizing the potential of causal inference techniques to improve the analysis of language in clinical narratives. Additionally, we uncover heterogeneity in treatment effects, demonstrating that specific words can identify high-risk patient subgroups. This study highlights the importance of integrating causal inference in natural language processing within healthcare settings.

Albano, A., Di Maria, C., Sciandra, M., Plaia, A. (2025). Causal Forests for Discovering Diagnostic Language in Electronic Health Records. APPLIED STOCHASTIC MODELS IN BUSINESS AND INDUSTRY, 41 [10.1002/asmb.70038].

Causal Forests for Discovering Diagnostic Language in Electronic Health Records

Alessandro Albano; Chiara Di Maria; Mariangela Sciandra; Antonella Plaia
2025-01-01

Files in this item:
File: Appl Stoch Models Bus Ind - 2025 - Albano - Causal Forests for Discovering Diagnostic Language in Electronic Health.pdf
Access: open access
Description: Manuscript
Type: Publisher's version
Size: 976.49 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/10447/688363