Aria, A., Mirtaheri, S. L., Asghari, S. A., Shahbazian, R., & Pugliese, A. (2025). ResViT: A Hybrid Model for Robust Deepfake Video Detection. In Proceedings of the 2025 IEEE International Conference on Cyber Security and Resilience (CSR 2025) (pp. 366–371). Piscataway, NJ: IEEE. https://doi.org/10.1109/CSR64739.2025.11130110
ResViT: A Hybrid Model for Robust Deepfake Video Detection
Aria, A.; Mirtaheri, S. L.; Asghari, S. A.; Shahbazian, R.; Pugliese, A.
2025-01-01
Abstract
This paper presents a novel method for detecting Deepfake videos. The proposed model, ResNet Vision Transformer (ResViT), combines two complementary components: a Convolutional Neural Network (CNN) based on the ResNet50 architecture for feature extraction and a Vision Transformer (ViT) for classification. The CNN extracts spatial features from video frames, which the ViT then processes with attention mechanisms to distinguish authentic videos from manipulated ones. We evaluated ResViT on two benchmark datasets, the Deepfake Detection Challenge (DFDC) dataset and FaceForensics++, with strong results. The model achieved 97.1% accuracy on DFDC, demonstrating its effectiveness for Deepfake detection. On the FaceForensics++ subsets (Face2Face, FaceSwap, NeuralTextures, and DeepFakes), ResViT achieved accuracies of 86.8%, 75.1%, 75.5%, and 94.9%, respectively, underscoring its robustness across manipulation methods. These findings highlight the promise of ResViT as a reliable approach to Deepfake video detection.
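The abstract outlines the hybrid design at a high level. Below is a minimal PyTorch sketch of that idea, assuming a standard torchvision ResNet50 backbone whose final feature map is tokenized and passed to a small transformer encoder with a learnable classification token. The embedding size, encoder depth, head count, and tokenization scheme are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class ResViTSketch(nn.Module):
    """Illustrative ResNet50 + transformer-encoder hybrid (not the authors' exact model)."""

    def __init__(self, embed_dim=768, depth=4, num_heads=8, num_classes=2):
        super().__init__()
        backbone = resnet50(weights=None)  # pretrained weights could be used instead
        # Keep everything up to (but excluding) global pooling and the fc head:
        # for a 224x224 input this yields a (B, 2048, 7, 7) feature map.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)  # channels -> token dim
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + 7 * 7, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)  # real vs. fake logits

    def forward(self, x):
        # x: (B, 3, 224, 224) face crops extracted from video frames
        feats = self.proj(self.cnn(x))             # (B, D, 7, 7)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 49, D) spatial tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        out = self.encoder(tokens)                 # self-attention over CNN tokens
        return self.head(self.norm(out[:, 0]))    # classify from the [CLS] token


if __name__ == "__main__":
    model = ResViTSketch()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 2])
```

Per-frame logits like these would typically be aggregated (e.g., averaged) across sampled frames to produce a video-level decision; the aggregation strategy here is likewise an assumption.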
| File | Access | Type | Size | Format |
|---|---|---|---|---|
| ResViT_A_Hybrid_Model_for_Robust_Deepfake_Video_Detection.pdf | Open access | Published version | 3.63 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


