

A Comparative Study of Vision Language Models for Italian Cultural Heritage

Chiara Vitaloni
2025-03-02

Abstract

Human communication has long relied on visual media, and today it is increasingly mediated by electronic devices that access visual data. Traditionally, this exchange was unidirectional, constrained to text-based queries. However, advancements in human–computer interaction have introduced technologies like reverse image search and large language models (LLMs), enabling both textual and visual queries. These innovations are particularly valuable in Cultural Heritage applications, such as connecting tourists with point-of-interest recognition systems during city visits. This paper investigates the use of various Vision Language Models (VLMs) for Cultural Heritage visual question answering, including Bing’s search engine with GPT-4 and open models such as Qwen2-VL and Pixtral. Twenty Italian landmarks were selected for the study, including the Colosseum, Milan Cathedral, and Michelangelo’s David. For each landmark, two images were chosen: one from Wikipedia and another from a scientific database or private collection. These images were input into each VLM together with textual queries about their content. We assessed the quality of the responses in terms of their completeness, measuring the impact of different levels of detail in the queries. Additionally, we explored the effect of language (English vs. Italian) on the models’ ability to provide accurate answers. Our findings indicate that larger models, such as Qwen2-VL and Bing+GPT-4, which are trained on multilingual datasets, perform better in both English and Italian. Iconic landmarks like the Colosseum and Florence’s Duomo are easily recognized, and providing context (e.g., the city) improves identification accuracy. Surprisingly, the Wikimedia dataset did not perform as expected, with varying results across models. Open models like Qwen2-VL, which can run on consumer workstations, showed performance similar to that of larger models. While the algorithms demonstrated strong results, they also generated occasional hallucinated responses, highlighting the need for ongoing refinement of AI systems for Cultural Heritage applications.
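As a rough illustration of the kind of visual question answering pipeline the abstract describes, the sketch below queries an open VLM (Qwen2-VL, via the Hugging Face transformers library) with a landmark photo and a textual question. This is a minimal sketch, not the authors' exact setup: the checkpoint name, image path, and prompt wording are illustrative assumptions.

# Minimal sketch: ask an open VLM a question about a landmark image.
# Assumes transformers >= 4.45 with Qwen2-VL support; checkpoint, image
# path, and prompt are hypothetical, not taken from the paper.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed variant; the study's exact size may differ
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("colosseum.jpg")  # hypothetical local copy of a landmark photo
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What monument is shown in this image, and in which city is it located?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the echoed prompt.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)

Varying the prompt here (e.g., adding the city as context, or switching the question to Italian) corresponds to the query-detail and language conditions compared in the study.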
Vitaloni, C., et al. (2025). A Comparative Study of Vision Language Models for Italian Cultural Heritage. Heritage, 8(3). https://doi.org/10.3390/heritage8030095
Files in this record:

Vitaloni et ali_ a comparative study of vlm_heritage-08-00095.pdf
Description: Main article
Type: Publisher's version
Access: open access
Size: 5.03 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/674143