Cinnamon and our interpolation solution in the domain of Japanese animated movies
Introduction
I remember that back when I hadn’t gotten so busy with life and still had time for anime (Japanese animated films), I used to wait a whole week for a new episode of my favorite show, Naruto. One week passes so quickly now that I am a working adult, but back then, waiting that long just to get a 20-minute episode was frustrating for the carefree young lad I used to be. To make things worse, each episode would dedicate 5 minutes to the opening and ending songs, and another 5 to a recap of the previous episode, leaving just 10 minutes of new content. Not to mention that its manga (Japanese comic) counterpart was always a year ahead in terms of story progression, which meant the authors didn’t have to work on the plot and only needed to focus on drawing.
Little did I know then that things are not that simple. Let’s (roughly) calculate the amount of drawing the authors have to do to produce one episode. As a common standard, animated films run at 24 fps (frames per second), so 10 minutes of screen time equals 24 × 60 × 10 = 14,400 frames. Dividing that by 7 days results in about 2,057 frames per day, or about 1.43 frames every minute, around the clock. Considering that only 1 or 2 artists would work on them to keep the drawing consistent, that is an insane amount of work.
However, there are some observations worth pointing out here:
- Most of the frames are very similar, since not much happens within the span of two consecutive frames (1/24 of a second).
- For the frames that do contain a lot of motion, I noticed the artists would often simplify some of them, making them less detailed and sketchier, thus reducing the effort needed. A normal viewer would most likely not notice those frames, but if you ever randomly pause an anime during a high-speed action scene, chances are you will encounter weird and funny-looking moments like the ones shown below.
This opens an opportunity to optimize and eliminate repetitive work, and the technology for doing so is referred to as frame interpolation. For the rest of this article, I will talk about the technical aspects of this problem and briefly present our solution.
Frame interpolation
Formally speaking, frame interpolation is the process of synthesizing in-between frames from a pair of input frames. As you might have guessed after reading the introduction, the theme of this article is frame interpolation in Japanese animated films, commonly known as anime. This is part of our effort to apply artificial intelligence to assist Japanese animators in the anime industry, as previously introduced in the article here.
By now, I hope you understand what frame interpolation is and why it is important. But as simple as the problem definition may seem, frame interpolation is hard. This paragraph discusses some challenges in anime interpolation. Occlusion is a common obstacle in interpolation problems: when occlusion happens, new details of the character show up in the next frame, or existing details get hidden during the movement. As a result, the algorithm needs to model which parts of the character move into and out of view. Another challenge that is unique to anime interpolation is that the output must not contain color bleeding or visual artifacts. Each segment in a frame must have a homogeneous color; otherwise, the quality of the film drops drastically. Lastly, the animation artists wish to edit the animation speed and refine each segment’s shape. To do so, we need to sample points on the segment’s contour and use them as anchors in a morphing operation.
As frame interpolation is a popular topic, various methods have been proposed to increase the frame rate of a video. Although these models are usually trained on generic real-world videos, we can collect a large dataset of anime-styled frames and fine-tune them. Most of these methods are based on convolutional neural networks, such as Super-Slomo [2]. A few models introduce additional modules to estimate supplementary information; for example, DAIN [3] proposes a depth estimation network to explicitly detect occlusions between frames. A recent work called FILM [1] presents a simple single-stage architecture that works well on frames with large in-between motion.
The existing frame interpolation methods are solid, but before fine-tuning them on Japanese animation data, we need to consider a few caveats. Firstly, due to the scarcity of annotated anime data (for depth, optical flow, etc.), training auxiliary modules is difficult, so we prefer a single-stage architecture. Still, generative methods based on convolutional neural networks tend to produce artifacts such as blurred regions, motion jittering, and color leaks. What’s worse, when these artifacts show up, there is no way for users to check, investigate, and find the root causes of the generation mistakes. Since convolutional neural network-based approaches are prone to being black boxes, we use a mesh deformation method instead. The deformation-based method is the default, and our users can inspect each step and even edit the output. In cases where there is complex motion between frames and the deformation fails, we fall back to a CNN-based method.
The overall architecture of our interpolation method is described in the following figure. The inputs to our pipeline are two color images, the source image at t=0.0 and the target image at t=1.0, along with a time value t where 0.0 < t < 1.0. The mesh deformation pipeline has three main modules: mesh creation, mesh matching, and mesh interpolation. Additionally, we may also use a FILM model fine-tuned on a dataset of Japanese animated films.
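To make the data flow concrete before diving into each module, here is a minimal sketch of how such a pipeline could be wired together. All function names (create_mesh, match_meshes, interpolate_mesh, render_mesh, film_model) are illustrative placeholders, not our actual API.

```python
import numpy as np

# Placeholder signatures for the three mesh-deformation modules and the
# optional FILM fallback. These names are illustrative, not our actual API;
# the bodies are intentionally left as stubs.
def create_mesh(image: np.ndarray): ...
def match_meshes(source_mesh, target_mesh): ...
def interpolate_mesh(source_mesh, target_mesh, matches, t: float): ...
def render_mesh(mesh, source_image: np.ndarray) -> np.ndarray: ...
def film_model(source_image: np.ndarray, target_image: np.ndarray, t: float) -> np.ndarray: ...


def interpolate_frame(source_img, target_img, t, use_film_fallback=False):
    """Synthesize the in-between frame at time t, with 0.0 < t < 1.0."""
    assert 0.0 < t < 1.0, "t must lie strictly between the two input frames"

    if use_film_fallback:
        # Complex motion or occlusion: use the fine-tuned FILM model instead.
        return film_model(source_img, target_img, t)

    source_mesh = create_mesh(source_img)                # 1. mesh creation
    target_mesh = create_mesh(target_img)
    matches = match_meshes(source_mesh, target_mesh)     # 2. mesh matching
    mesh_t = interpolate_mesh(source_mesh, target_mesh,  # 3. mesh interpolation
                              matches, t)
    return render_mesh(mesh_t, source_img)
```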
Mesh Deformation
As seen in the above figure, the mesh deformation model consists of three steps: mesh creation, mesh matching, and mesh interpolation. Given two input images, this model will create two corresponding 2D triangular meshes that represent the source and target pose of the character and sample a boundary point set for each. In the mesh matching step, we match two sets of boundary points to build point correspondences between meshes. Finally, based on the correspondences, the mesh interpolation module will estimate the interpolated mesh, with which the in-between image can be generated.
Firstly, we will discuss the mesh creation module. This module takes one image as input, and we run it on both the source and target images. As our input image contains a character or an object on a white background, we can extract the foreground by separating out the white region. After that, we use the Douglas-Peucker algorithm to select a number of extreme points on the foreground contour, which we will use as part of our sampling points. These points lie at the extreme positions of the curve, so they are the most representative points for describing the character’s shape. In addition to these extreme points, we also sample a few more points between each pair of consecutive extreme points to build a denser boundary representation. Apart from the boundary set, we build an inner point set to take the inner structure of the character into account: we extract the contour of every uniquely colored segment within the character, then sample points on each contour in the same way as on the boundary. Inner points that are too close to points in the boundary set are considered redundant, so we filter them out. Lastly, based on the boundary and inner point sets, we apply Delaunay triangulation to divide the foreground into many small triangles. This set of triangular pieces is the triangular mesh that we will use as the basis of our interpolation.
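As an illustration (not our production code), the sketch below shows what this module could look like with OpenCV and SciPy; the white-background threshold, the Douglas-Peucker tolerance, and the densification factor are placeholder values, and the inner-point sampling is only hinted at in a comment.

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay


def create_mesh(image_bgr, white_thresh=250, dp_epsilon=2.0, points_per_edge=3):
    """Rough sketch of mesh creation: foreground extraction, Douglas-Peucker
    simplification, boundary densification, and Delaunay triangulation.
    Parameter values are illustrative."""
    # Foreground = everything that is not (near-)white background.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mask = (gray < white_thresh).astype(np.uint8) * 255

    # Outer boundary of the character (largest external contour).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea)

    # Douglas-Peucker keeps the "extreme" points that best describe the shape.
    extreme = cv2.approxPolyDP(boundary, dp_epsilon, True).reshape(-1, 2).astype(float)

    # Densify: add evenly spaced points between consecutive extreme points.
    samples = []
    for p, q in zip(extreme, np.roll(extreme, -1, axis=0)):
        for s in np.linspace(0.0, 1.0, points_per_edge, endpoint=False):
            samples.append((1.0 - s) * p + s * q)
    samples = np.array(samples)

    # Inner points sampled from the contours of same-color segments would be
    # appended here (and filtered against the boundary set); omitted for brevity.

    # Delaunay triangulation over the sampled points gives the triangular mesh.
    # In practice, triangles whose centroids fall outside the foreground mask
    # are discarded so that the mesh covers only the character.
    tri = Delaunay(samples)
    return samples, tri.simplices
```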
When we run the two input images through the mesh creation module, we get two triangular meshes, one for the source and one for the target. Given these two meshes, we perform mesh matching, which means creating correspondences between the mesh vertices. The algorithm we use is coherent point drift (CPD) registration. In short, CPD treats the matching of two point sets as a probability density estimation problem, where the source set provides the Gaussian mixture model (GMM) centroids and the target set provides the data points.
After optimization with the expectation-maximization algorithm, the two point sets become aligned, and the correspondence of a given target point is obtained from the maximum of the GMM posterior probability over the source points. The transformation model of CPD forces the GMM centroids to move coherently as a group, which preserves the topological structure of the mesh. In our method, we choose the deformable (non-rigid) transformation.
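As a small illustration, the snippet below uses the open-source pycpd package to run deformable CPD between the two boundary point sets and read correspondences from the posterior matrix; the attribute names (register, P) follow pycpd’s implementation and may differ in other CPD libraries.

```python
import numpy as np
from pycpd import DeformableRegistration  # open-source CPD implementation


def match_boundary_points(source_pts, target_pts):
    """Align source boundary points to target ones with non-rigid CPD and
    return a correspondence index per target point. Illustrative sketch."""
    # Source points act as GMM centroids, target points as the observed data.
    reg = DeformableRegistration(X=np.asarray(target_pts, dtype=float),
                                 Y=np.asarray(source_pts, dtype=float))
    aligned_source, _ = reg.register()  # EM optimization

    # reg.P[m, n] is the posterior probability that source point m explains
    # target point n; take the argmax over sources for each target point.
    matches = np.argmax(reg.P, axis=0)
    return aligned_source, matches
```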
The final step of our pipeline is the mesh interpolation module, which is based on the As-Rigid-As-Possible (ARAP) algorithm [4]. The core idea of ARAP is to preserve isometry as much as possible during the interpolation from source to target. In other words, at ideal isometry, the mesh’s triangles only translate and rotate during interpolation, while their scaling and shearing remain constant. We can represent this idea as:
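$$A_i \;=\; R_{\gamma_i} S_i, \qquad A_i(t) \;=\; R_{t\gamma_i}\bigl((1-t)\,I + t\,S_i\bigr),$$

where $A_i$ is the affine map taking triangle $i$ of the source mesh to its counterpart in the target mesh, factored into a rotation $R_{\gamma_i}$ and a symmetric scale/shear part $S_i$, and $A_i(t)$ is the ideal transformation at time $t$. (This is the standard ARAP interpolation formulation that [4] builds on; the notation here is ours.) At perfect isometry, $S_i = I$ and $A_i(t)$ reduces to the pure rotation $R_{t\gamma_i}$.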
ARAP will try to minimize the divergence from ideal isometry at a specific time t:
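$$E(t) \;=\; \sum_{i} \bigl\lVert B_i\bigl(V(t)\bigr) - A_i(t) \bigr\rVert_F^2,$$

where $V(t)$ denotes the unknown in-between vertex positions, $B_i(V(t))$ is the affine transformation that those vertices actually induce on triangle $i$, and $A_i(t)$ is the ideal transformation defined above (again following the standard ARAP formulation, with our own notation). Because $E(t)$ is quadratic in $V(t)$, minimizing it amounts to setting its gradient to zero.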
We can solve for the optimally interpolated vertices as a sparse linear system of equations. The interpolated vertices define the output mesh faces, which carry the corresponding colors from the source image. As a result, we can construct the in-between frame from the mesh and return it to our users.
Generative Network
Although the mesh deformation approach works well whether the motion between frames is small or large, it does not account for 3D transformations or occlusions. As a result, we believe a CNN-based generative model is necessary for these cases. One such candidate is FILM [1], which shows solid interpolation results even without fine-tuning while being easy to train thanks to its single-stage architecture. FILM is especially useful for tackling large motion because it uses a feature extractor that shares weights across scales; the intuition is that large motion at finer scales should look similar to small motion at coarser scales. Furthermore, FILM is trained with a Gram matrix loss to increase robustness against large motion and improve crispness.
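To make the weight-sharing idea concrete, here is a minimal PyTorch sketch (our own illustration, not FILM’s actual architecture) of a feature extractor that applies one small convolutional encoder, unchanged, to every level of an image pyramid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedScaleExtractor(nn.Module):
    """One small convolutional encoder reused on every pyramid level, so the
    same weights see large motion at fine scales the way they see small motion
    at coarse scales. Illustrative sketch, not FILM's actual feature extractor."""

    def __init__(self, in_channels=3, feat_channels=32, num_levels=4):
        super().__init__()
        self.num_levels = num_levels
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, image):
        features, level = [], image
        for _ in range(self.num_levels):
            features.append(self.encoder(level))          # shared weights at every scale
            level = F.avg_pool2d(level, kernel_size=2)     # next (coarser) pyramid level
        return features  # feature maps from finest to coarsest


# Example: a 256x256 frame yields feature maps at 256, 128, 64, and 32 px.
pyramid_features = SharedScaleExtractor()(torch.randn(1, 3, 256, 256))
```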
Summary
In this article, we presented our interpolation solution for Japanese animated movies. The pipeline is a mesh deformation process with three steps: mesh creation, mesh matching, and mesh interpolation. This solution generates good results for frame pairs with both small and large motion, but its limitations lie in complex 3D transformations. In the future, we plan to include a FILM generative model in our pipeline for these difficult cases, potentially giving users another strong interpolation method to choose from.
References
[1] Reda, Fitsum, et al. “FILM: Frame interpolation for large motion.” arXiv preprint arXiv:2202.04901 (2022).
[2] Jiang, Huaizu, et al. “Super slomo: High quality estimation of multiple intermediate frames for video interpolation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[3] Bao, Wenbo, et al. “Depth-aware video frame interpolation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[4] Igarashi, Takeo, and Yuki Igarashi. “Implementing as-rigid-as-possible shape manipulation and surface flattening.” Journal of Graphics, GPU, and Game Tools 14.1 (2009): 17–30.