Vitaloni, C., D.S. (2025). A Comparative Study of Vision Language Models for Italian Cultural Heritage. Heritage, 8(3). doi:10.3390/heritage8030095
A Comparative Study of Vision Language Models for Italian Cultural Heritage
Chiara Vitaloni (first author)
2025-03-02
Abstract
Human communication has long relied on visual media for interaction, and is facilitated by electronic devices that access visual data. Traditionally, this exchange was unidirectional, constrained to text-based queries. However, advancements in human–computer interaction have introduced technologies like reverse image search and large language models (LLMs), enabling both textual and visual queries. These innovations are particularly valuable in Cultural Heritage applications, such as connecting tourists with point-of-interest recognition systems during city visits. This paper investigates the use of various Vision Language Models (VLMs) for Cultural Heritage visual question answering, including Bing’s search engine with GPT-4 and open models such as Qwen2-VL and Pixtral. Twenty Italian landmarks were selected for the study, including the Colosseum, Milan Cathedral, and Michelangelo’s David. For each landmark, two images were chosen: one from Wikipedia and another from a scientific database or private collection. These images were input into each VLM with textual queries regarding their content. We studied the quality of the responses in terms of their completeness, assessing the impact of various levels of detail in the queries. Additionally, we explored the effect of language (English vs. Italian) on the models’ ability to provide accurate answers. Our findings indicate that larger models, such as Qwen2-VL and Bing+ChatGPT-4, which are trained on multilingual datasets, perform better in both English and Italian. Iconic landmarks like the Colosseum and Florence’s Duomo are easily recognized, and providing context (e.g., the city) improves identification accuracy. Surprisingly, the Wikimedia dataset did not perform as expected, with varying results across models. Open models like Qwen2-VL, which can run on consumer workstations, showed performance similar to larger models.
While the algorithms demonstrated strong results, they also generated occasional hallucinated responses, highlighting the need for ongoing refinement of AI systems for Cultural Heritage applications.
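The evaluation protocol described in the abstract (one image per landmark paired with textual queries at varying levels of detail, in both English and Italian) can be sketched as follows. This is a minimal illustration, not the authors' actual code: the prompt wordings, image filename, and helper function are assumptions, and model loading for an open VLM such as Qwen2-VL (via Hugging Face transformers) is only indicated in comments.

```python
# Sketch of one evaluation step: pair a landmark image with a textual query
# in the single-turn multimodal chat format used by open VLMs such as
# Qwen2-VL. Prompt wordings and the image path are illustrative, not the
# paper's exact inputs.

def make_vlm_messages(image_path: str, question: str) -> list:
    """Build a single-turn multimodal message: one image plus one question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]

# The study varies query detail (with/without city context) and language.
for question in (
    "What monument is shown in this image?",      # English, no context
    "Quale monumento di Roma è mostrato qui?",    # Italian, with city context
):
    messages = make_vlm_messages("colosseum_wikipedia.jpg", question)
    # With a loaded model and processor, generation would continue roughly as:
    #   text = processor.apply_chat_template(messages, add_generation_prompt=True)
    #   inputs = processor(text=[text], images=[...], return_tensors="pt")
    #   output_ids = model.generate(**inputs, max_new_tokens=128)
    print(messages[0]["content"][1]["text"])
```

Keeping the query construction separate from model invocation makes it straightforward to run the same bilingual prompt set against several VLMs and compare response completeness, as the study does.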
File: Vitaloni et ali_ a comparative study of vlm_heritage-08-00095.pdf
Open access
Description: Main article
Type: Published version
Size: 5.03 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.