Deep Learning (DL) based applications are growing rapidly in the modern world. Deep learning is used to solve many critical problems, such as big data analysis, computer vision, and human-brain interfacing. The advancement of deep learning, however, also poses national and international threats to privacy, democracy, and national security. Deepfake videos are spreading fast and have an impact on political, social, and personal life. Deepfake videos use artificial intelligence and can appear very convincing, even to a trained eye. Obscene videos are often made using deepfakes, tarnishing people's reputations. Deepfakes are a public concern, so it is important to develop methods to detect them. This paper surveys deepfake creation algorithms and, more crucially, the deepfake detection approaches proposed by researchers to date. We discuss the problems, trends, and future directions of deepfake technology in detail. By studying the background of deepfakes and state-of-the-art detection methods, this paper gives a complete overview of deepfake approaches and supports the development of novel, more reliable methods to cope with increasingly sophisticated deepfakes.
Deepfake videos are manipulated videos that use machine-learning-based algorithms to swap a person in an existing image or video with someone or something else (Adee, 2020). Deepfake videos are divided into three major categories: head puppetry, face swapping, and lip-syncing. Head puppetry entails altering a video of a specific person's head and upper shoulders with the help of a source video so that the modified individual looks exactly like the source. Face swapping is the process of transferring a person's face onto another person's face while maintaining the same facial expression. In lip-syncing, only the lip region of a video is altered, so the target individual appears to say something they never actually said. Although some deepfakes can be made using classic visual effects or computer graphics, deep-learning methods such as autoencoders (Tewari, 2018) and generative adversarial networks (GAN) (Lin, 2021), which have been used extensively in computer vision, are the most popular recent processes for deepfake video creation (Liu, 2021). These models analyze a person's facial expressions and movements and synthesize facial images of someone else with similar expressions and movements (Lyu, 2018). To train a model to generate photo-realistic pictures and videos, deepfake technologies often require a huge volume of image and video data. Politicians and celebrities are the first targets of deepfakes since a massive number of their videos and photographs are available on the internet. Deepfakes have been used to place the faces of politicians and celebrities onto bodies in pornographic photographs and movies. The first deepfake video was released in 2017, in which a celebrity's face was swapped onto a porn actor. Nowadays deepfake videos are becoming a global security threat because they are used to make fake speech videos of international leaders (Hwang, 2020).
Deepfakes can thus be used to incite political or religious tensions between countries, deceive the public and influence election results, or create havoc in financial markets by spreading false information (Zhou, 2020; Guo, 2020). They can also be used to create fictional satellite photos of the globe containing objects that do not exist in the real world to deceive military analysts, such as a fictional bridge across a river, which could mislead troops during combat (Fish, 2019).
Because the democratization of creating convincing virtual humans has beneficial consequences, deepfakes can also be used in positive ways, such as visual effects, digital avatars, Snapchat filters, creating voices for those who have lost theirs, and updating episodes of movies without reshooting them. The number of illegal applications of deepfakes, on the other hand, far outnumbers the beneficial ones. Because of the advancement of deep neural networks and the accessibility of enormous amounts of data, faked photos and videos are nearly indistinguishable to humans and even to powerful computer algorithms. The procedure for making such modified photographs and films is also much easier nowadays, as it only requires an identifying photo or a short video of the target person. Producing astonishingly realistic tampered video requires less and less effort; recent technology can generate a deepfake video from a single picture (Zakharov, 2019).
As a result, deepfakes may pose a danger not only to public figures but to every individual. For example, a voice deepfake was used to scam a CEO out of $243,000 (Damiani, 2019). Recently an application, DeepNude, was released that can turn a person into a non-consensual pornographic video, which is even more troubling (Samuel, 2019). Similarly, the Chinese application Zao has recently created a buzz; it allows even the most non-technical users to swap their faces onto the bodies of famous movie stars and insert themselves into well-known films and TV clips (Guardian, 2019). These types of falsification pose a serious danger to privacy and identity, and they affect many aspects of people's lives.
As a result, discovering the truth in the digital world has become particularly crucial. It is considerably more difficult when dealing with deepfake videos, because they are frequently used for harmful purposes and virtually anyone can now construct deepfakes using current creation tools. Several approaches for detecting deepfake videos have been proposed so far (Lyu, 2020; Jafar, 2020). Because the majority of deepfake creation and detection methods are deep-learning based, a conflict has erupted between malevolent and beneficial uses of deep learning. To combat the problem of deepfakes and face-swapping technologies, the US Defense Advanced Research Projects Agency (DARPA) launched a multimedia forensics research program (called Media Forensics, or MediFor) to speed up the invention of fake digital visual media detection methods (Turek, 2020). Facebook Inc., in collaboration with Microsoft Corp. and the Partnership on AI coalition, has established the Deepfake Detection Challenge to encourage research and innovation toward identifying and stopping the use of deepfakes to confuse viewers (Schroepfer, 2019).
The volume of deepfake papers has increased rapidly in recent years, according to data acquired by dimensions.ai at the end of 2020 (Dimensions, 2021). Fig. 1 shows the growth of deepfake papers, which has accelerated since 2017. Although the number of deepfake papers counted is likely lower than the actual number, the research trend in this area is clearly expanding.
Fig. 1: Growth of research on deepfake video creation and detection.
This survey paper presents methods for creating and detecting deepfake videos. Several survey papers already exist in this field (Verdoliva, 2020), but we conducted our survey from a different point of view and taxonomy. The fundamentals of deepfake algorithms and deep-learning-based deepfake creation methods are presented in Section II. In Section III we discuss the various techniques for identifying deepfake videos as well as their benefits and drawbacks. In the last section, we describe the challenges and future directions for deepfake detection and media forensics.
Deepfake Creation
Deepfakes have grown in popularity as a result of the high quality of manipulated videos and the ease with which their implementations can be used by a wide range of users with varying computing skills, from professional to novice. Deep-learning methods are used to create the majority of these applications. The ability of deep learning to represent complicated and high-dimensional data is well known. Deep autoencoders, a type of deep network with that capability, have been widely used for dimensionality reduction and image compression (Punnappurath, 2019; Cheng, 2019). FakeApp, created by a Reddit user using an autoencoder-decoder pairing structure, was the first attempt at deepfake creation (Reddit, 2015). The encoder extracts latent features from facial images, and the decoder reconstructs the images from them. Two encoder-decoder pairs are required to exchange faces between source and target images, with each pair trained on an image set of one subject while the encoder's parameters are shared between the two network pairs (Guera, 2018).
Fig. 2: Two encoder-decoder pairs are used in this deepfake production strategy.
For the training process, the two networks use the same encoder but distinct decoders (top). A deepfake is made by encoding the image of face A with the common encoder and decoding it with decoder B (bottom) (Guera, 2018). In other words, the encoder networks of the two pairs are identical. This technique allows the common encoder to find and learn the similarity between two sets of face images, which is relatively easy because faces have comparable features such as eye, nose, and mouth positions. Fig. 2 depicts a deepfake production process in which the feature set of face A is fed into decoder B in order to reconstruct face B from face A. This method is used in several implementations, including DeepFaceLab, DFaker, and DeepFake-tf (TensorFlow-based deepfakes). An improved version of deepfakes based on faceswap-GAN (Face, 2015) was proposed by adding adversarial losses and perceptual losses, implemented with VGGFace (Ker, 2014), to the encoder-decoder architecture. The VGGFace perceptual loss is included in order to produce higher-quality output video, made possible by smoothing out artifacts in segmentation masks (Goodfellow, 2014). Outputs can be generated at pixel resolutions of 64x64, 128x128, and 256x256. Additionally, FaceNet (net, 2015) introduces a multi-task convolutional neural network (CNN) (Albawi, 2017) to enhance facial recognition and alignment accuracy, and CycleGAN (Cycle, 2017) is used to implement the generative networks (Zhao, 2016). An overview of the most popular deepfake tools is shown in Table 1.
Table 1: Popular tools for creating deepfake videos and what they do.
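To make the shared-encoder scheme of Fig. 2 concrete, the following is a minimal sketch in PyTorch (assumed available); the layer sizes, latent dimension, and 64x64 input resolution are illustrative choices, not those of any particular tool listed above.

```python
# Minimal sketch of the shared-encoder / per-identity-decoder structure.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a 64x64 RGB face crop to a latent feature vector."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # -> 32x32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # -> 16x16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # -> 8x8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a 64x64 face crop from the shared latent vector."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, z):
        return self.net(self.fc(z).view(-1, 256, 8, 8))

encoder = Encoder()                      # shared between both identities
decoder_a, decoder_b = Decoder(), Decoder()

# Training (not shown): each decoder learns to reconstruct its own subject,
#   loss_a = mse(decoder_a(encoder(faces_a)), faces_a)
#   loss_b = mse(decoder_b(encoder(faces_b)), faces_b)
# Face swap at inference time: encode face A, decode with decoder B.
fake_b_from_a = decoder_b(encoder(torch.rand(1, 3, 64, 64)))
```

Because both identities pass through the same encoder, the latent space captures identity-agnostic facial structure, which is what allows decoder B to re-render face A's pose and expression as face B.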
Deepfake Video Detection
The growing number of deepfakes threatens privacy, social security, and democracy (Chesney, 2018). As soon as the threat of deepfakes was identified, methods for detecting them were proposed. Early approaches relied on handcrafted features derived from glitches and flaws in the deepfake video synthesis process. Recent approaches use deep learning to automatically extract salient and discriminative features for deepfake detection (de Lima, 2020; Amerini, 2020). Deepfake detection is typically treated as a binary classification problem, in which classifiers are employed to distinguish between real and manipulated videos. Training this type of classification model requires a big dataset of real and fake videos. Although the quantity of deepfake videos is growing, there are still limitations in establishing a benchmark for validating multiple detection methods. To address this issue, Korshunov and Marcel (Korshunov, 2019) used the open-source code Faceswap-GAN (Face, 2015) to create a significant GAN-based deepfake dataset consisting of 620 videos. Low- and high-quality deepfake videos, which can convincingly simulate facial movements, lip motions, and eye blinking, were created using videos from the publicly available VidTIMIT dataset (Sanderson, 2002). These videos were then used to test how well numerous deepfake detection methods worked. According to the test results, popular facial recognition algorithms based on VGG (Parkhi, 2015) and FaceNet (Schroff, 2015) are unable to detect deepfakes successfully. Other methods, such as lip-syncing approaches (Chung, 2017; Korshunov, 2018) and image quality measures with a support vector machine (SVM) (Boulkenafet, 2015), also show very high error rates when used to detect deepfake videos from this freshly created dataset. This raises concerns about the crucial need for more powerful approaches to distinguish deepfakes in the future.
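As a minimal illustration of this binary-classification framing, the sketch below (PyTorch assumed) trains a generic frame-level real/fake detector; `FaceFrameDataset` and the detector `model` are hypothetical placeholders for any dataset of labeled face crops and any backbone network, not components of a cited method.

```python
# Sketch of the binary real-vs-fake training loop common to most detectors.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_detector(model: nn.Module, dataset, epochs=5, lr=1e-4):
    # `dataset` yields (face_crop, label) with label 1 = fake, 0 = real.
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()        # binary real/fake objective
    model.train()
    for _ in range(epochs):
        for frames, labels in loader:
            logits = model(frames).squeeze(1)  # one logit per frame
            loss = criterion(logits, labels.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```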
Fig. 3: Categories of deepfake detection methods.
Generally, there are two categories of deepfake detection: fake video detection and fake image detection. Fake video detection is divided further into two subcategories, namely visual artifacts within a frame and temporal features across frames. Fig. 3 shows all the categories of deepfake detection. In this paper we survey only deepfake video detection methods and give researchers future directions to enrich this research field.
Fake Video Detection
Due to the significant loss of frame data following video compression, most image detection techniques cannot be applied to videos (Afchar, 2018). Additionally, videos have temporal features that vary between frames, making them difficult to detect for systems built to recognize merely still fraudulent images. Fake video detection is divided into two categories, namely temporal features across frames and visual artifacts within frames. These two categories are explained in this subsection.
1) Temporal Features Across Frames
Sabir et al. used spatio-temporal properties of video streams to detect deepfakes, based on the finding that temporal coherence is not maintained well in the synthesis process of deepfakes (Sabir, 2019). Because video manipulation is done frame by frame, low-level abnormalities caused by face modifications are expected to manifest as temporal artifacts with inconsistencies between frames. To leverage temporal disparities across frames, a recurrent convolutional network (RCN) was established based on a combination of the convolutional network DenseNet (Huang, 2017) and gated recurrent unit cells (Cho, 2014) (Fig. 4).
Fig. 4: A two-step process for detecting face manipulation: the first step detects, crops, and aligns faces in a sequence of frames, and the second step uses a combination of convolutional neural networks (CNN) and recurrent neural networks (RNN) to distinguish between manipulated and authentic face images (Sabir, 2019).
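A minimal sketch of such a recurrent-convolutional detector is given below, assuming torchvision's DenseNet-121 as the per-frame feature extractor and a GRU over the resulting sequence; the hidden size and clip length are illustrative, not the values used by Sabir et al.

```python
# Sketch of an RCN-style detector: per-frame DenseNet features fed to a GRU.
import torch
import torch.nn as nn
from torchvision.models import densenet121

class RecurrentConvDetector(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.cnn = densenet121(weights=None)
        self.cnn.classifier = nn.Identity()      # expose 1024-d frame features
        self.gru = nn.GRU(1024, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)         # real/fake logit

    def forward(self, clips):                    # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))    # (B*T, 1024)
        feats = feats.view(b, t, -1)             # (B, T, 1024)
        _, h = self.gru(feats)                   # last hidden state
        return self.head(h[-1])                  # (B, 1)

# Example: a batch of two 8-frame clips of aligned 224x224 face crops.
logit = RecurrentConvDetector()(torch.rand(2, 8, 3, 224, 224))
```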
According to Guera and Delp (Guera, 2018), deepfake videos feature intra-frame abnormalities as well as temporal anomalies between frames. They therefore suggested a temporally aware pipeline for detecting deepfake videos that uses a CNN and long short-term memory (LSTM) (Guera, 2018). Frame-level features are extracted using the CNN and then fed into the LSTM to build a temporal sequence descriptor. Finally, based on the sequence descriptor, a fully connected network is used to classify doctored videos from genuine ones, as seen in Fig. 5.
Fig. 5: A deepfake detection method that uses a convolutional neural network (CNN) and long short-term memory (LSTM) to extract temporal information from a video sequence and express it as a sequence descriptor.
The sequence descriptor is used to calculate the probability that the frame sequence belongs to either the authentic or the deepfake class, via a detection network with fully connected layers (Guera, 2018). The use of a physiological signal such as eye blinking to detect deepfakes, on the other hand, was proposed based on the finding that a person in a deepfake blinks far less frequently than a person in an untampered video (Li, 2018). A normal person blinks between 2 and 10 times per minute, with each blink lasting between 0.1 and 0.4 seconds. Deepfake algorithms, however, frequently train on internet face pictures, which typically show people with open eyes (very few pictures online show people with eyes closed). As a result, deepfake algorithms are unable to build fake faces that blink normally without access to photos of individuals blinking (Li, 2018). The method first breaks the videos into frames, then extracts face regions and subsequently eye areas based on six eye landmarks to distinguish between real and fake videos. After a few pre-processing steps, such as aligning faces and extracting and scaling the bounding boxes of the eye landmark points into fresh frame sequences, these cropped eye-region sequences are fed into long-term recurrent convolutional networks (LRCN) (Donahue, 2015) for dynamic state prediction. The LRCN consists of a CNN-based feature extractor, LSTM-based sequence learning, and a fully-connected-layer-based state predictor that forecasts the probability of eye open and closed states. Because eye blinking exhibits strong temporal dependencies, the LSTM helps capture these temporal patterns efficiently. A blink is defined as a peak over the level of 0.5 with a length of fewer than 7 frames, and the blinking rate is measured from the prediction outputs. This method was tested on a web-based dataset consisting of 49 interview and lecture videos, as well as fake versions of those videos produced by deepfake algorithms. The experimental results show that the suggested method has promising detection accuracy for fake videos; it could further take into account dynamic patterns of blinking, such as excessively rapid blinking, which could also be a symptom of video manipulation.
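The blink-counting step described above can be made concrete with a short sketch: given the LRCN's per-frame eye-closure probabilities, short runs above the 0.5 threshold are counted as blinks and converted to a per-minute rate. The frame rate and helper name are illustrative assumptions.

```python
# Sketch of the blink-counting step: count peaks above 0.5 lasting fewer
# than 7 frames and convert to blinks per minute.
import numpy as np

def blink_rate(closure_prob, fps=30.0, thresh=0.5, max_len=7):
    closed = np.asarray(closure_prob) > thresh
    blinks, run = 0, 0
    for c in closed:
        if c:
            run += 1
        else:
            if 0 < run < max_len:    # a short closed run counts as one blink
                blinks += 1
            run = 0
    if 0 < run < max_len:            # handle a blink ending at the last frame
        blinks += 1
    minutes = len(closed) / fps / 60.0
    return blinks / minutes if minutes > 0 else 0.0

# A normal speaker blinks roughly 2-10 times per minute; a much lower rate
# on a talking-head video can flag a possible deepfake.
```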
2) Visual Artifacts within Frames
As explained in the earlier section, techniques for detecting deepfake videos that use temporal patterns across video sequences are generally based on deep recurrent network architectures. In this section, we explore methods that obtain feature maps by decomposing videos into frames and examining visual artifacts within a single frame. To distinguish between fake and real videos, these features are fed into a deep or shallow classification model. Accordingly, we divide the approaches in this section into two categories: deep and shallow classifiers.
1) Deep classifiers
Deepfake videos are typically made at low resolutions, necessitating an affine face-warping strategy (i.e., resizing, rotating, and shearing) to match the original's configuration. This warping produces artifacts, due to the resolution mismatch between the warped face area and the surrounding context, that CNN models such as VGG16 (Simonyan, 2014), ResNet50, ResNet101, and ResNet152 (He, 2016) can identify. On this basis, a deep-learning approach for detecting deepfakes was presented that targets artifacts observed during the face-warping phase of the deepfake generation algorithm (Li, 2018). The suggested approach was assessed on two deepfake datasets, UADFV and Deepfake TIMIT. The UADFV dataset (Li, 2020) includes 49 genuine and 49 faked videos, totaling 32,752 frames. The Deepfake TIMIT dataset (Sanderson, 2002) contains a set of low-quality 64x64 and a set of high-quality 128x128 videos, totaling 10,537 real and 34,023 fake images retrieved from 320 videos for each quality set. The method's results were compared with other widely used methods such as Meso-4 and MesoInception-4 of MesoNet (Afchar, 2018), HeadPose (Yang, 2019), and the two-stream NN face-tampering detection method (Zhou, 2017). The main advantage of the suggested method is that it does not need deepfake videos to be generated in order to train the detector, which saves the considerable time and resources that deepfake generation would otherwise demand.
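The negative-example idea behind this approach can be sketched as follows: rather than generating full deepfakes, warping-style artifacts are simulated by shrinking and re-enlarging the face region of a real image. OpenCV is assumed, and `face_box` stands in for the output of any face detector; this illustrates the general technique, not Li et al.'s exact pipeline.

```python
# Sketch: simulate face-warping artifacts on a real image so a CNN can be
# trained without generating actual deepfake videos.
import cv2
import numpy as np

def simulate_warp_artifacts(image, face_box, scale=0.25):
    x, y, w, h = face_box                     # hypothetical detector output
    face = image[y:y+h, x:x+w]
    # Shrink then enlarge the face region to introduce the resolution
    # mismatch left by the affine warping step of deepfake generation.
    small = cv2.resize(face,
                       (max(1, int(w * scale)), max(1, int(h * scale))),
                       interpolation=cv2.INTER_LINEAR)
    blurred = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
    out = image.copy()
    out[y:y+h, x:x+w] = blurred
    return out                                # usable as a "fake" example
```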
2) Shallow classifiers
The majority of deepfake detection algorithms focus on artifacts or inconsistencies in intrinsic properties between real and fake photos or videos. Yang et al. (Yang, 2019) suggested a detection approach based on observing the differences between 3D head poses, which are computed using 68 facial landmarks of the central face region. Because the deepfake image creation process introduces a flaw here, the 3D head poses are examined to detect it. To obtain higher accuracy, the retrieved features are passed into an SVM classifier. They tested the model on two datasets and showed that the suggested model outperformed the alternatives. The UADFV dataset (Li, 2020) includes 49 genuine and 49 faked videos, totaling 32,752 frames. The second dataset, a subset of data from the DARPA MediFor GAN Image/Video Challenge, contains 241 real photos and 252 deepfake photos (Guan, 2019). Similarly, in (Matern, 2019) a strategy for exploiting deepfake and face-manipulation artifacts based on the visual appearance of eyes, teeth, and facial contours was investigated. These visual artifacts arise from inadequate global consistency, incorrect or imprecise estimation of incident illumination, or imprecise estimation of the underlying geometry. Missing reflections and details in the eye and teeth areas, as well as texture features retrieved from the facial region based on facial landmarks, are used to detect deepfakes: the eye feature vector, the teeth feature vector, and the features retrieved from the full-face crop are employed. After extracting the features, two classifiers are used to distinguish deepfakes from genuine videos: logistic regression and a small neural network. Experiments on a YouTube video dataset yielded a best result of 0.851 in terms of the area under the receiver operating characteristic curve. The proposed approach, on the other hand, has the drawback of requiring images to meet particular conditions, such as open eyes or visible teeth.
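The shallow-classifier pattern shared by these two works can be illustrated with a short scikit-learn sketch: hand-crafted per-image feature vectors (head-pose differences, or eye/teeth descriptors) are fed to an SVM and a logistic regression, and performance is reported as AUC. The random vectors below are placeholders for the real descriptors, not data from either paper.

```python
# Sketch: shallow classifiers (SVM, logistic regression) over hand-crafted
# features, evaluated by area under the ROC curve.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # placeholder feature vectors
y = rng.integers(0, 2, size=200)         # 1 = fake, 0 = real (placeholder)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

svm = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, clf in [("SVM", svm), ("LogReg", logreg)]:
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name} AUC: {auc:.3f}")
```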
People's faith in media content has been eroded by deepfakes, as seeing it no longer equates to believing it. Deepfakes have the potential to cause anguish and harm to those targeted, amplify misinformation and hate speech, and even exacerbate political tensions or incite public outrage, violence, or war. This is particularly important today because deepfake technology is becoming more accessible and social media platforms can swiftly propagate false news (Zubiaga, 2018). Faced with the serious problem of deepfakes, the research community has concentrated on building detection algorithms, with multiple results published. The state-of-the-art methodologies were addressed in this work, and Table 2 presents an overview of common approaches. It is clear that a struggle is brewing between those who use powerful machine learning to build deepfakes and those who seek to recognize them. The quality of deepfakes has been improving, and the performance of detection systems must advance along with it. The underlying idea is that what AI has broken can also be mended by AI (Floridi, 2018). Detection techniques are still in their infancy, and a variety of approaches have been proposed and tested, but on scattered datasets. One way to improve detection performance is to create a continually updated benchmark dataset of deepfakes to validate the ongoing development of detection methods. This will make it simpler to train detectors, especially deep-learning models, which require massive training sets (Dolhansky, 2020). Current detection methods, on the other hand, mostly focus on the flaws of deepfake generation pipelines. In adversarial contexts, where attackers often try not to reveal their creation methods, this kind of knowledge is not always available. The deepfake detection task has become more complex as a result of recent work on adversarial perturbation attacks designed to deceive DNN-based detectors (Hussain, 2021; Yang, 2021). These are real obstacles in the creation of detection systems, and future studies should focus on developing more reliable, adaptable, and generally applicable methods. Another line of inquiry is to integrate detection systems into production platforms such as social media to ensure their effectiveness in grappling with the widespread influence of deepfakes. A screening or filtering mechanism with effective detection methods could be deployed on these platforms to make deepfake detection easier (Citron, 2018). Legal obligations could be imposed on the internet companies that own these platforms, requiring them to promptly delete deepfakes in order to mitigate their effects. Photo-editing tools could also be embedded into the devices people use to create digital content, producing immutable metadata that preserves originality details such as the time and place of audiovisual items, together with attestation that they are untampered (Citron, 2018).
This attestation link is tough to implement, so using disruptive blockchain technology could be a viable solution. Blockchain has been actively employed in a variety of fields, but there has been little research tackling deepfake detection with this technology. It is an excellent tool for digital provenance because it can construct a chain of unique, immutable metadata blocks. Although applying blockchain systems to this challenge has generated some promising findings (Hasan, 2019), this research area is still in its inception. It is necessary to use detection tools to recognize deepfakes, but it is even more critical to grasp the true motivations of those who publish them. Users must appraise a deepfake in the social context in which it is encountered, such as who circulated it and what they said about it. This is important because deepfakes are becoming increasingly lifelike, and detection software is expected to lag behind. It is thus worthwhile to conduct research on the social aspects of deepfakes in order to support users in making such judgments.
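As an illustration of the hash-chained metadata idea (an assumption-laden sketch, not the design of any cited system), each block below stores provenance details of a media item plus the hash of the previous block, so altering any item breaks verification:

```python
# Sketch of hash-chained provenance metadata; purely illustrative.
import hashlib
import json
import time

def make_block(metadata: dict, prev_hash: str) -> dict:
    block = {"metadata": metadata, "prev_hash": prev_hash,
             "timestamp": time.time()}
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def verify_chain(chain) -> bool:
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != block["hash"]:
            return False                       # block contents were altered
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False                       # chain linkage was broken
    return True

genesis = make_block({"item": "video.mp4", "origin": "camera-01"}, "0" * 64)
chain = [genesis,
         make_block({"item": "video.mp4", "edit": "none"}, genesis["hash"])]
print(verify_chain(chain))                     # True until a block is altered
```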
Table 2: Popular deepfake video detection methods.
In police investigations and criminal trials, photographs and videos have been routinely used as evidence. Digital media forensics professionals, with degrees in computer science or law enforcement and skills in collecting, reviewing, and analyzing digital material, may present them as evidence in a court of law. Because even experts are unable to discern manipulated content, machine learning and AI technologies may have been used to modify this digital content, and the experts' personal opinions may not be enough to verify such evidence. Given the development of a wide range of digital manipulation tools, this must be recognized in today's courtrooms when photographs and videos are used as evidence to convict perpetrators (Maras, 2019). Before digital content forensics results can be used in court, they must be shown to be authentic and reliable. This necessitates meticulous documentation of every step of the forensics process as well as the methodology used to obtain the results. AI and machine-learning algorithms can be used to support determinations of the authenticity of digital media and have provided accurate and reliable results, yet most of these algorithms are not explainable. This is a significant obstacle to the use of AI in forensics: not only do forensics experts lack experience with computer algorithms, but computer professionals also lack the ability to properly explain the results, because most of these algorithms are black-box models (Malolan, 2020).
This is especially significant because the most recent models that generate the most accurate results are based on deep-learning methods involving large numbers of neural network parameters. Consequently, explainable AI in computer vision is a research direction required to promote and apply AI and machine-learning advances in digital media forensics.
Technologies based on deep learning, such as deepfakes, have been advancing at an unprecedented rate in recent years. The global pervasiveness of the internet makes it possible for malicious face-manipulated videos to be distributed rapidly, posing a threat to social order and personal safety. To mitigate the negative effects of deepfake videos on people, research groups and commercial companies around the world are conducting relevant studies. In this paper, we first presented deepfake video generation technology, followed by the existing detection technology, and finally future research directions. Particular emphasis was placed on the problems of current detection algorithms and on promising research, especially generalization and robustness. We hope this article proves useful for researchers who are interested in deepfake detection and in limiting the negative impact of deepfake videos.
We would like to thank our parents and all our teachers for supporting us mentally and financially.
Researchers may use this work for research purposes only. The authors declare no conflicts of interest with respect to the research community.