Deep Learning-Based Deepfake Detection in a Nutshell

A Brief Overview of Deep Learning-Based Deepfake Detection

Published in Towards AI · 6 min read · Oct 22, 2022

Deep Learning (DL) marks a turning point in Artificial Intelligence (AI) and has become one of the most influential fields in computer science, with a direct impact on human life and society today. Like every other technological innovation in history, deep learning has been put to both good and bad use. One application that is notorious for its harmful consequences for the public is the Deepfake. Over the past few years, hundreds of studies have proposed and refined AI-based Deepfake detection methods. This article therefore focuses on Deepfake detection rather than Deepfake creation.

Both deep learning approaches and classical machine learning (non-deep learning, or non-DL) approaches have been developed to detect Deepfakes. Deep learning models have a large number of parameters and therefore require large data sets to train. Given enough training data, this capacity is a main reason why DL methods tend to deliver higher performance and more accurate results than non-DL approaches.

For ease of reference, the article is organized as follows:

What Is Deepfake Detection
Deepfake Detection Pipeline
Data Preprocessing
Feature Extraction
Classification
Summary
References

1. What Is Deepfake Detection

The Deepfake creation pipeline is not a perfectly controlled process. Hence, most Deepfake generators leave fingerprints in their output that are specific to the creation architecture or to the individual generator. These artifacts in Deepfake videos can be classified as spatial inconsistencies, which occur within individual frames of the video, and temporal inconsistencies, which occur across the sequence of frames [1].

Spatial inconsistencies include mismatches between the facial region and the background of the frame, resolution variations, and partially rendered facial features and skin textures (not all human features of the face may be rendered correctly). Most common Deepfake generators fail to render details such as eye blinking and teeth. Sometimes the generator produces a white strip instead of individual teeth, which is visible to the naked eye even on still frames [Figure 2].

Figure 2: Spatial inconsistencies of Deepfakes. Left: incomplete rendering of hair. Right: a white strip rendered instead of individual teeth (Illustration created by Author)

Temporal inconsistencies include abnormal eye blinking, head poses, and facial movements, as well as variations in luminance across the frame sequence of the video.
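To make the idea of a temporal inconsistency concrete, the following minimal sketch tracks the mean luminance of each frame and flags abrupt changes between consecutive frames. It is only an illustration, not a Deepfake detector: the function names, the Rec. 601 luma weights, and the threshold of 10 are assumptions chosen for the example, and the "clip" is a synthetic array rather than real video.

```python
import numpy as np

def mean_luminance(frame: np.ndarray) -> float:
    """Mean luminance of an RGB frame of shape (H, W, 3), using Rec. 601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return float((frame.astype(np.float64) @ weights).mean())

def luminance_jumps(frames, threshold=10.0):
    """Return indices i where luminance changes by more than `threshold`
    between frame i-1 and frame i. The threshold is an arbitrary assumption."""
    lum = np.array([mean_luminance(f) for f in frames])
    diffs = np.abs(np.diff(lum))
    return [int(i) + 1 for i in np.flatnonzero(diffs > threshold)]

# Toy clip: five uniform gray frames, with frame 3 artificially brightened.
frames = [np.full((4, 4, 3), 100, dtype=np.uint8) for _ in range(5)]
frames[3] = np.full((4, 4, 3), 140, dtype=np.uint8)

# Flags the jump into the brightened frame and the jump back out of it.
print(luminance_jumps(frames))
```

A learned detector would pick up far subtler cues than a global brightness jump, but the structure is the same: a per-frame measurement followed by a comparison across the sequence.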

Fortunately, both the spatial and the temporal fingerprints left by Deepfake generators can be identified by detectors built from Deep Neural Networks (DNNs); this is the principle behind the Deepfake detection process. Nevertheless, the extensive use of Generative Adversarial Networks (GANs) in Deepfake generators has challenged the balance between Deepfake detection and creation.

2. Deepfake Detection Pipeline

Deepfake detectors are binary classification systems that decide whether the input digital media is real or fake. Detection is not carried out by a single black-box module; it comprises several modules and steps that work together to deliver the detection result. The common steps in the Deepfake detection pipeline are as follows [2].

Input of the digital media to be checked.
Preprocessing, which includes face detection and augmentation.
Feature extraction from the processed frames.
Classification/detection.
Output of the authenticity of the input media.

Generally, a typical DL-based Deepfake detector comprises three major components to perform the above tasks.

A preprocessing module.
A feature extraction module.
An evaluator module (a deep learning classifier model).

The next three sections explain the major steps in detail: data preprocessing, feature extraction, and detection/classification.
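The three-module structure described above can be sketched as a small Python class that chains a preprocessing step, a feature extractor, and a classifier. This is a structural sketch only: the class name, the type aliases, and the toy stand-in functions are all hypothetical, and in a real detector each stage would be a face detector or a neural network.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[List[float]]   # stand-in type for one image (rows of pixel values)
Features = List[float]      # stand-in type for one feature vector

@dataclass
class DeepfakeDetector:
    """Hypothetical wrapper that chains the three pipeline modules."""
    preprocess: Callable[[Sequence[Frame]], List[Frame]]
    extract_features: Callable[[List[Frame]], List[Features]]
    classify: Callable[[List[Features]], float]  # probability that the input is real

    def predict(self, media: Sequence[Frame], threshold: float = 0.5) -> str:
        faces = self.preprocess(media)            # e.g. face detection and cropping
        features = self.extract_features(faces)   # e.g. a CNN-based extractor
        p_real = self.classify(features)          # e.g. an LSTM or dense layer
        return "real" if p_real >= threshold else "fake"

# Toy stand-ins so the sketch runs end to end; none of them is a real model.
detector = DeepfakeDetector(
    preprocess=lambda media: [f for f in media if f],                     # drop empty frames
    extract_features=lambda faces: [[sum(map(sum, f))] for f in faces],   # one number per frame
    classify=lambda feats: 0.9 if feats else 0.0,                         # dummy score
)

print(detector.predict([[[0.1, 0.2]], []]))
```

Keeping the stages as separate callables mirrors the modular description in the text: any one of them can be swapped without touching the others.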

3. Data Preprocessing

After the data collection phase, the data must be preprocessed before it is used in the training and testing steps of the Deepfake detection pipeline. Preprocessing is automated with available tools such as the OpenCV Python library, Multi-task Cascaded Convolutional Networks (MTCNN), and the You Only Look Once (YOLO) algorithm.

Preparation of the data set used to train the model also plays a vital role in the performance of a Deepfake detector. Augmentation techniques such as rescaling (stretching), shear mapping, zooming, rotation, brightness changes, and horizontal/vertical flipping, applied within suitable ranges, can help a model trained on the data set generalize better [3].
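As a rough illustration of such augmentation, the sketch below implements two of the listed techniques, flipping and brightness change, with plain NumPy. The function names, the 50% flip probability, and the brightness range of 0.8 to 1.2 are assumptions made for the example; the cited work may use different settings, and libraries with ready-made augmentation pipelines would normally be used in practice. Shear, zoom, and rotation are omitted to keep the example short.

```python
import numpy as np

def horizontal_flip(img: np.ndarray) -> np.ndarray:
    """Mirror an (H, W, C) image left to right."""
    return img[:, ::-1]

def vertical_flip(img: np.ndarray) -> np.ndarray:
    """Mirror an (H, W, C) image top to bottom."""
    return img[::-1, :]

def change_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel values by `factor` and clip to the valid 8-bit range."""
    scaled = img.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

def augment(img: np.ndarray, rng: np.random.Generator,
            brightness_range=(0.8, 1.2)) -> np.ndarray:
    """Apply random flips and a random brightness change (ranges are assumptions)."""
    if rng.random() < 0.5:
        img = horizontal_flip(img)
    if rng.random() < 0.5:
        img = vertical_flip(img)
    factor = rng.uniform(brightness_range[0], brightness_range[1])
    return change_brightness(img, factor)

rng = np.random.default_rng(seed=0)
face = np.arange(48, dtype=np.uint8).reshape(4, 4, 3)   # tiny stand-in for a face crop
augmented = [augment(face, rng) for _ in range(3)]
print([a.shape for a in augmented])
```

Each call produces a slightly different variant of the same face crop, which is how augmentation enlarges the effective training set without collecting new videos.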

The first step in data preprocessing is to extract individual frames from the video clips. The next step is to detect faces in the extracted frames. Since the abnormalities frequently occur in facial regions, selecting only those regions lets the feature extraction model focus on the Regions of Interest (ROI) and saves the computation that full-frame scanning would require. Once detected, the facial regions are cropped from the background of the frame and put through a series of steps that make them usable for model training and testing. Another reason to crop the facial regions is to bring all input images to the model to the same size.
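The cropping and resizing step can be sketched as follows, assuming a face detector (such as MTCNN) has already returned a bounding box. The box format (x, y, width, height), the function names, the 64-pixel target size, and the nearest-neighbour resize are all assumptions made for this illustration; the frame is a blank synthetic array rather than a decoded video frame.

```python
import numpy as np

def crop_face(frame: np.ndarray, box) -> np.ndarray:
    """Crop the region given by box = (x, y, w, h), clamped to the frame borders.
    The box is assumed to come from a separate face detector."""
    x, y, w, h = box
    height, width = frame.shape[:2]
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, width), min(y + h, height)
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box lies outside the frame")
    return frame[y0:y1, x0:x1]

def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize to (size, size) so every model input has the same shape."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows[:, None], cols]

frame = np.zeros((100, 120, 3), dtype=np.uint8)   # stand-in for one extracted video frame
face = crop_face(frame, (30, 20, 50, 60))         # hypothetical detector output
model_input = resize_nearest(face, 64)
print(face.shape, model_input.shape)
```

In practice an interpolating resize from an image library would usually replace the nearest-neighbour version, which is used here only to keep the example dependency-free.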

4. Feature Extraction

The preprocessed frames are then fed to the feature extractor. Most feature extractors are based on Convolutional Neural Networks (CNNs). A recent trend in research is the use of Capsule Networks for feature extraction, with the aim of improving effectiveness and efficiency.

The feature extractor extracts the spatial features present in the preprocessed video frames. It can extract visual features, local features and facial landmarks such as the positions of the eyes, nose, and mouth, the dynamics of the mouth shape, and biological features such as eye blinking. The extracted feature vectors are then passed to the classifier network, which outputs the decision.
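To show what a CNN-based extractor does at its simplest, the sketch below applies one convolution layer, a ReLU, and global average pooling to a grayscale image, yielding one feature per kernel. It is a toy under stated assumptions: the two hand-picked edge kernels stand in for filters that a real network would learn, the function names are hypothetical, and the 8x8 "face" is synthetic.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """2D cross-correlation without padding (what DL frameworks call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def extract_features(image: np.ndarray, kernels) -> np.ndarray:
    """One conv layer + ReLU + global average pooling: one feature per kernel."""
    maps = [np.maximum(conv2d_valid(image, k), 0.0) for k in kernels]
    return np.array([m.mean() for m in maps])

# Hand-picked edge kernels stand in for learned filters (an assumption).
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)   # responds to dark-to-bright steps, left to right
horizontal_edge = vertical_edge.T                  # responds to dark-to-bright steps, top to bottom

face = np.zeros((8, 8))
face[:, 4:] = 1.0                                  # left half dark, right half bright

features = extract_features(face, [vertical_edge, horizontal_edge])
print(features)  # the vertical-edge feature is positive, the horizontal-edge feature is zero
```

The resulting vector is what the text calls a feature vector: a compact description of the frame that the classifier consumes instead of raw pixels.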

5. Classification

The deep learning model used for classification is often called the backbone of the Deepfake detector. As the name suggests, the classification network is responsible for the most critical task of the Deepfake detection pipeline: estimating the probability that the input video is a Deepfake and classifying it accordingly. Most classifiers are binary classifiers whose output is (0) for Deepfakes and (1) for pristine frames.

The classifier is either another convolutional network (a CNN) or a different deep learning architecture such as a Long Short-Term Memory (LSTM) network or a Vision Transformer (ViT). The actual functionality of the classification model depends on the DNN used. For example, the eye-blinking features extracted by the feature extractor module can be used by an LSTM in the classification module to detect temporal inconsistencies in the eye-blinking patterns across frames and, on that basis, decide whether the input is a Deepfake [3]. In most cases, Deepfake detector networks also contain a fully connected layer. Since the outputs of the convolution layers represent high-level features of the data, they can be flattened and connected to a single output layer that produces the final decision.
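The final flatten-and-classify step can be sketched as a single fully connected output unit with a sigmoid activation. This is a minimal illustration, assuming the label convention given above (0 for Deepfake, 1 for pristine); the function names, the hand-set weights, and the 0.5 threshold are assumptions, whereas a trained detector would learn the weights from data.

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(feature_maps, weights, bias, threshold=0.5):
    """Flatten the feature maps, apply one fully connected output unit,
    and return (label, probability), with 1 = pristine and 0 = Deepfake."""
    x = np.asarray(feature_maps, dtype=np.float64).ravel()   # flatten
    p_pristine = float(sigmoid(x @ weights + bias))
    label = 1 if p_pristine >= threshold else 0
    return label, p_pristine

# Toy 2x2 feature map and hand-set weights; real weights are learned in training.
feature_maps = np.array([[1.0, 0.0],
                         [0.5, 0.5]])
weights = np.array([2.0, -1.0, 0.0, 0.0])

label, p = classify(feature_maps, weights, bias=0.0)
print(label, round(p, 3))
```

An LSTM- or ViT-based classifier replaces the single dense unit with a more expressive model, but the last stage still reduces to a probability that is thresholded into the binary decision.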

6. Summary

Deepfake creation and detection have both developed significantly, and in competition with each other, over the past few years. Research on Deepfake detection has gravitated toward deep learning techniques because of the accuracy of their results compared to non-DL approaches. Deep Neural Network architectures such as CNNs, RNNs, ViTs, and Capsule Networks are widely employed in the implementation of Deepfake detectors. The common Deepfake detection pipeline consists of a data preprocessing module, a CNN-based feature extractor, and the backbone classification module.

Further, Deepfake detection depends significantly on the fingerprints that Deepfake generators leave in their output. As present GAN-based generators can synthesize more realistic Deepfakes with minimal inconsistencies, new DL-based approaches must be developed to improve Deepfake detection. Approaches based on deep ensemble learning are one such modern and comprehensive way to combat Deepfakes [4]. Nonetheless, the need for an effective and efficient Deepfake detector remains open.

7. References

[1] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, Javier Ortega-Garcia, DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection (2020), Information Fusion.

[2] Thanh Nguyen, Cuong M. Nguyen, Tien Nguyen, Duc Nguyen, Saeid Nahavandi, Deep Learning for Deepfakes Creation and Detection: A Survey (2019).

[3] Sawinder Kaur, Parteek Kumar, Ponnurangam Kumaraguru, Deepfakes: temporal sequential analysis to detect face-swapped video clips using convolutional long short-term memory (2020), Journal of Electronic Imaging, 29(3).

[4] M. S. Rana and A. H. Sung, DeepfakeStack: A Deep Ensemble-based Learning Technique for Deepfake Detection (2020), 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom).