This paper presents the development of Minerva Diagnostic Retriever (DR-Minerva), a Visual Language Model specialized in the medical domain. Prompted with a textual input containing the patient’s information together with a CT or MR scan, the model provides information about the body part and the scanning modality of the given image. The model relies on the Flamingo architecture, which is well known for its in-context and few-shot learning capabilities, and it encodes textual data using Minerva, a novel Large Language Model trained on English and Italian data. Model performance is improved by fine-tuning and by exploiting external knowledge through a Retrieval-Augmented Generation approach: at inference time, the retrieved examples are injected into the model's prompt in the form of in-context learning. The authors developed a rearranged version of the MedPix® multi-modal medical dataset, which was used for the development and testing of the model as well as for retrieval. A detailed description of the system is reported, along with experimental results that are discussed thoroughly. The dataset and the models used are available on GitHub (https://github.com/CHILab1/MedPix-2.0).
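The retrieval-then-prompt idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` record, the toy two-dimensional embeddings, the cosine-similarity ranking, the `<image>` placeholder, and the "Answer:" prompt layout are all assumptions made purely for illustration. The sketch only shows how retrieved cases might be placed before the query as in-context examples.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """A stored case (hypothetical): an embedding plus the text shown to the model."""
    embedding: list[float]
    patient_info: str
    answer: str


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], store: list[Example], k: int = 2) -> list[Example]:
    """Return the k stored examples most similar to the query embedding."""
    ranked = sorted(store, key=lambda ex: cosine(query, ex.embedding), reverse=True)
    return ranked[:k]


def build_prompt(shots: list[Example], patient_info: str) -> str:
    """Place retrieved examples before the query; <image> is an assumed placeholder."""
    parts = [f"<image>{ex.patient_info} Answer: {ex.answer}" for ex in shots]
    parts.append(f"<image>{patient_info} Answer:")
    return "\n".join(parts)


# Toy knowledge base with made-up embeddings and texts (illustration only).
store = [
    Example([1.0, 0.0], "Patient A.", "CT scan of the head"),
    Example([0.0, 1.0], "Patient B.", "MR scan of the knee"),
    Example([0.9, 0.1], "Patient C.", "CT scan of the chest"),
]
shots = retrieve([1.0, 0.05], store, k=2)
prompt = build_prompt(shots, "Patient D.")
print(prompt)
```

In a real system the embeddings would come from an encoder and each `<image>` slot would be paired with an actual scan; here they are stand-ins so the ranking and prompt assembly can be followed end to end.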

Siragusa, I., Contino, S., Pirrone, R. (2025). DR-Minerva: A Multimodal Language Model Based on Minerva for Diagnostic Information Retrieval. In AIxIA 2024 – Advances in Artificial Intelligence (pp. 288-300). Springer [10.1007/978-3-031-80607-0_22].

DR-Minerva: A Multimodal Language Model Based on Minerva for Diagnostic Information Retrieval

Siragusa, Irene (first author; Methodology);
Contino, Salvatore (second author; Data Curation);
Pirrone, Roberto (last author; Project Administration)
2025-01-01

1 January 2025
ISBN: 9783031806063
ISBN: 9783031806070
Files in this record:
File: 978-3-031-80607-0_22.pdf
Access: archive managers only
Type: Publisher's version
Size: 1.23 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/668610