Sonic: Shifting Focus to Global Audio Perception in Portrait Animation

Tencent, Zhejiang University
TL;DR: We propose a novel paradigm, dubbed Sonic, that shifts the focus to the exploration of global audio perception. To leverage global audio knowledge effectively, we disentangle it into intra-clip and inter-clip audio perception and combine both aspects to enhance the overall perception. For intra-clip audio perception, we introduce (1) context-enhanced audio learning, which extracts long-range intra-clip temporal audio knowledge to provide facial expression and lip motion priors implicitly expressed by the tone and speed of speech, and (2) a motion-decoupled controller, which disentangles head motion and expression movement so that each is independently controlled by the intra-clip audio. Most importantly, for inter-clip audio perception, we propose time-aware position shift fusion as a bridge that connects the intra-clips to achieve global perception: global inter-clip audio information is considered and fused for long-audio inference via consecutively time-aware shifted windows.
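The inter-clip fusion can be pictured as a sliding window whose start position shifts with the denoising step, so frames that fall on a clip boundary at one step fall inside a clip at the next. The sketch below is only an illustration of that shifted-window idea under our own assumptions; the `denoise_chunk` callable, the offset schedule, and the chunking layout are hypothetical placeholders, not the released implementation.

```python
# Illustrative sketch of time-aware shifted-window fusion for long-audio
# inference. Assumptions (not the official code): `denoise_chunk` runs one
# denoising step of the video model on a single clip conditioned on its
# audio window, and the offset grows linearly with the step index.
import numpy as np

def shifted_window_inference(latents, audio_feats, num_steps, clip_len, denoise_chunk):
    """latents:     (T, ...) noisy video latents for the whole long sequence.
    audio_feats: (T, ...) per-frame audio features aligned with the latents.
    At every denoising step the chunk boundaries are shifted by a
    time-dependent offset, so inter-clip information is gradually fused."""
    T = latents.shape[0]
    for step in range(num_steps):
        # Time-aware offset: changes every step so that, over the whole
        # schedule, every boundary position ends up inside some clip.
        offset = (step * clip_len // num_steps) % clip_len

        # Roll the sequence so chunking starts at a shifted position,
        # then denoise each fixed-length clip with its own audio window.
        rolled_lat = np.roll(latents, -offset, axis=0)
        rolled_aud = np.roll(audio_feats, -offset, axis=0)
        for start in range(0, T, clip_len):
            end = min(start + clip_len, T)
            rolled_lat[start:end] = denoise_chunk(
                rolled_lat[start:end], rolled_aud[start:end], step
            )

        # Undo the shift so the latents stay aligned with the audio timeline.
        latents = np.roll(rolled_lat, offset, axis=0)
    return latents
```

Under this reading, every frame is processed under several different window alignments across the denoising schedule, which is how context propagates across clip boundaries without relying on motion frames.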

Generated Videos

We propose a novel unified paradigm, dubbed Sonic, that focuses on the exploration of global audio perception without guidance from visual motion such as motion frames. Sonic produces vivid portrait animation videos given images of different styles and various types of audio inputs. The images and audio clips are collected from recent works or sourced from the Internet.

Stable Long Video Generation

We show stable long-video generation ranging from 1 to 10 minutes, highlighting the effectiveness of our time-aware position shift fusion technique.

More Video Results

We present additional cases featuring stylized, non-real-human subjects and diverse aspect ratios, demonstrating that Sonic generalizes well to non-real-human inputs (e.g., stylized or cartoon characters) and to multiple aspect ratios.

Comparison with open-source methods

We compare Sonic with recent state-of-the-art methods. Sonic produces a wider range of expressions consistent with the audio and more natural head movement.

Comparison with closed-source methods

We compare Sonic with closed-source methods such as EMO and Jimeng. For EMO, we use the reference image and audio from their public demos.

EMO occasionally exhibits sudden unnatural facial expressions around 7s, and the reflections in the glasses lack realism.


In singing scenarios, Sonic demonstrates more precise articulation and a greater diversity of movements.



For Jimeng, we use their commercial API (jimeng.com), which may be a performance-accelerated version of Loopy.

In anime case 1, Sonic's results exhibit lip movements and lip appearance that align more closely with the original input, accompanied by blinking.


In case 2, Jimeng's head poses are monotonous, whereas Sonic's are more dynamic.


In case 3, Jimeng exhibits a decrease in character similarity around 5~8s, and the lip-sync with the audio is inadequate.


In case 4, during long-term generation, Sonic is not affected by motion frames and therefore avoids artifacts at the end of the video.