ResViT: A Hybrid Model for Robust Deepfake Video Detection

Aria A.; Mirtaheri S.L.; Asghari S.A.; Shahbazian R.; Pugliese A.
2025-01-01

Abstract

This paper presents a novel method for detecting Deepfake videos. The proposed model, ResNet Vision Transformer (ResViT), combines two complementary components: a Convolutional Neural Network (CNN) based on the ResNet50 architecture for effective feature extraction, and a Vision Transformer (ViT) for classification. The CNN captures spatial features from video frames, which the ViT then analyzes with attention mechanisms to distinguish authentic videos from manipulated ones. We evaluated ResViT on two benchmark datasets, the Deepfake Detection Challenge (DFDC) dataset and the FaceForensics++ dataset, with strong results. The model reached an accuracy of 97.1% on the DFDC dataset, demonstrating its effectiveness for Deepfake detection, and accuracies of 86.8%, 75.1%, 75.5%, and 94.9% on the FaceForensics++ subsets (Face2Face, FaceSwap, NeuralTextures, and DeepFakes, respectively), underscoring its robustness across different manipulation methods. These findings highlight the promise of ResViT as a reliable approach to Deepfake video detection.
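
To make the described pipeline concrete, below is a minimal PyTorch sketch of a hybrid model of the kind the abstract outlines: a ResNet50 backbone extracts a spatial feature map from each frame, the spatial cells are treated as tokens, and a Transformer encoder with a [CLS] token classifies the frame as real or fake. The embedding size, encoder depth, head count, input resolution, and the name ResViTSketch are illustrative assumptions, not the paper's reported configuration.

    import torch
    import torch.nn as nn
    from torchvision import models

    class ResViTSketch(nn.Module):
        """Sketch of a CNN + ViT hybrid: ResNet50 features -> Transformer encoder -> real/fake."""

        def __init__(self, embed_dim=768, depth=4, num_heads=8, num_classes=2):
            # All hyperparameters here are assumptions for illustration only.
            super().__init__()
            # ResNet50 without its avgpool/fc head; yields a (2048, 7, 7) map for 224x224 input
            backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            self.cnn = nn.Sequential(*list(backbone.children())[:-2])
            # Project each 2048-dim spatial feature vector to the transformer width
            self.proj = nn.Linear(2048, embed_dim)
            # Learnable [CLS] token and positional embeddings for the 7x7 = 49 feature tokens
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, 49 + 1, embed_dim))
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, x):                      # x: (B, 3, 224, 224) face crops
            f = self.cnn(x)                        # (B, 2048, 7, 7)
            f = f.flatten(2).transpose(1, 2)       # (B, 49, 2048): one token per spatial cell
            tokens = self.proj(f)                  # (B, 49, embed_dim)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
            out = self.encoder(tokens)             # self-attention over feature tokens
            return self.head(out[:, 0])            # classify from the [CLS] token

    # Usage: two random frames in, two real/fake logit pairs out
    logits = ResViTSketch()(torch.randn(2, 3, 224, 224))  # shape (2, 2)

Treating the 49 spatial cells of the ResNet feature map as transformer tokens is one common way to couple a CNN backbone to a ViT-style classifier; the paper's actual tokenization and training details may differ.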
Aria, A., Mirtaheri, S.L., Asghari, S.A., Shahbazian, R., Pugliese, A. (2025). ResViT: A Hybrid Model for Robust Deepfake Video Detection. In Proceedings of the 2025 IEEE International Conference on Cyber Security and Resilience, CSR 2025 (pp. 366-371). Piscataway : Institute of Electrical and Electronics Engineers [10.1109/CSR64739.2025.11130110].
Files in this product:
ResViT_A_Hybrid_Model_for_Robust_Deepfake_Video_Detection.pdf (open access)
Type: Publisher's version (Versione Editoriale)
Size: 3.63 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/696385
Citations
  • PMC: N/A
  • Scopus: 0
  • Web of Science: N/A