Textual analysis has gained significant interest in medical research, particularly for automated patient diagnosis based on clinical narratives. While traditional approaches often focus on associational methods, this paper explores the application of causal forests to analyze textual data from electronic health records (EHRs), aiming to identify causal relationships between specific words and the likelihood of receiving certain medical diagnoses. Utilizing the MIMIC-III dataset, we assess how linguistic factors influence diagnosis probabilities for three conditions: diabetes, hypothyroidism, and adrenal gland disorders. Our findings reveal significant causal links between certain clinical terms and diagnosis probabilities, emphasizing the potential of causal inference techniques to improve the analysis of language in clinical narratives. Additionally, we uncover heterogeneity in treatment effects, demonstrating that specific words can identify high-risk patient subgroups. This study highlights the importance of integrating causal inference in natural language processing within healthcare settings.
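One way to make the estimand behind this abstract concrete, as a sketch rather than the authors' exact specification: assume a binary indicator $W_i$ for whether a given word appears in patient $i$'s clinical note, a binary outcome $Y_i$ for whether the patient receives the diagnosis, and covariates $X_i$ (these symbols are illustrative assumptions, not notation taken from the paper). A causal forest then targets the conditional average treatment effect:

```latex
\tau(x) = \mathbb{E}\left[\, Y_i(1) - Y_i(0) \mid X_i = x \,\right]
```

Here $Y_i(1)$ and $Y_i(0)$ denote the potential outcomes with and without the word ($W_i = 1$ and $W_i = 0$). Variation of $\tau(x)$ across $x$ corresponds to what the abstract calls heterogeneity in treatment effects.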
Albano, A., Di Maria, C., Sciandra, M., Plaia, A. (2025). Causal Forests for Discovering Diagnostic Language in Electronic Health Records. Applied Stochastic Models in Business and Industry, 41. DOI: 10.1002/asmb.70038.
Causal Forests for Discovering Diagnostic Language in Electronic Health Records
Alessandro Albano; Chiara Di Maria; Mariangela Sciandra; Antonella Plaia
2025-01-01
| File | Size | Format |
|---|---|---|
| Appl Stoch Models Bus Ind - 2025 - Albano - Causal Forests for Discovering Diagnostic Language in Electronic Health.pdf (open access; description: Manuscript; type: Publisher's version) | 976.49 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


