A "match cut" is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find and create "audio match cuts" within videos and movies. We create a self-supervised audio representation for audio match cutting and develop a coarse-to-fine audio match pipeline that recommends matching shots and creates the blended audio. We further annotate a dataset for the proposed audio match cut task and compare the ability of multiple audio representations to find audio match cut candidates. Finally, we evaluate multiple methods to blend two matching audio candidates with the goal of creating a smooth transition.
In the movie "The Chronicles of Narnia: The Lion, the Witch and the Wardrobe", we see a great example of an audio match cut, where the sword clinking within its sheath is matched to the strike of a hammer in the next scene. Our work seeks to enable the automatic creation of audio match cuts like this!
Here are example audio match cuts using our retrieval and transition method described in our paper. Each example cuts between two entirely separate movies/videos! Try to spot where the transition occurs!
In our paper, we explore multiple transition methods that blend the audio of two retrieved clips. Here are examples of the different transition methods applied to the same scenes.
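For readers unfamiliar with the two simplest transition methods compared in these examples, concatenation and crossfading, here is a minimal sketch. It assumes mono audio stored as NumPy float arrays at the same sample rate. The function names and the linear fade shape are illustrative assumptions, not the exact implementation from the paper.

```python
import numpy as np

def concatenate(a, b):
    """Hard cut: play all of `a`, then all of `b`."""
    return np.concatenate([a, b])

def crossfade(a, b, fade_len):
    """Linear crossfade (assumed fade shape): overlap the last `fade_len`
    samples of `a` with the first `fade_len` samples of `b`."""
    fade_len = min(fade_len, len(a), len(b))
    if fade_len <= 0:
        return concatenate(a, b)
    ramp = np.linspace(0.0, 1.0, fade_len)  # weight of `b` rises from 0 to 1
    overlap = a[-fade_len:] * (1.0 - ramp) + b[:fade_len] * ramp
    return np.concatenate([a[:-fade_len], overlap, b[fade_len:]])
```

Note that the crossfaded result is `fade_len` samples shorter than the concatenated one, because the two clips overlap during the fade.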
In this example, the click of a gun is matched to the closing of a lighter. Because there is little background noise, crossfading the audio between scenes is not essential, so concatenation and crossfading produce similar transitions. With Max-SS + Adaptive Crossfading, the clicks of the two scenes are precisely aligned on the sound event. Depending on what a video editor is looking for, either result may be useful (matching on the sound event, or a deliberate delay). Our transition strategy allows sound events to be matched deliberately, while simple crossfading provides a more generic but still high-quality transition.
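To make "matching on the sound event" concrete, the sketch below searches for the pair of short spectrogram windows, one from each clip, that are most similar, and returns their frame positions as a candidate transition point. This is an illustrative stand-in and not necessarily the Max-SS procedure from the paper. The function names, the window size `win`, and the use of cosine similarity are all assumptions. Both clips are assumed to be mono NumPy arrays longer than `win` STFT frames.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=256, hop=128):
    """Magnitude STFT with a Hann window; shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[k * hop : k * hop + n_fft] * window
                       for k in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def best_transition_frames(a, b, win=4, n_fft=256, hop=128):
    """Hypothetical helper: return (i, j), the start frames of the most
    similar `win`-frame spectrogram windows in clips `a` and `b`."""
    def unit_windows(spec):
        # Flatten each sliding window and scale it to (roughly) unit length.
        w = np.stack([spec[k : k + win].ravel()
                      for k in range(len(spec) - win + 1)])
        return w / (np.linalg.norm(w, axis=1, keepdims=True) + 1e-9)

    wa = unit_windows(magnitude_spectrogram(a, n_fft, hop))
    wb = unit_windows(magnitude_spectrogram(b, n_fft, hop))
    sim = wa @ wb.T  # cosine similarity for every window pair
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return int(i), int(j)
```

Multiplying each returned frame index by `hop` gives the sample offset in the corresponding clip, so the first clip can be cut at `i * hop` and the second clip started at `j * hop`.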
In this example, the noise of a vacuum cleaner is matched to the ambient noise inside a racecar. Because both clips contain loud noise at slightly different pitches, simple concatenation results in a low-quality transition that is noticeable and not fluid. Crossfading blends the slight differences in pitch, which improves the quality of the transition and makes the exact transition point less noticeable. Max-SS + Adaptive Crossfade produces a transition similar to simple crossfading: since the clips contain a high amount of noise without any specific sound events, the exact transition point is not as important. In addition, since the noise in each clip is consistent, the adaptive crossfade selects a longer crossfade that effectively blends the differences in pitch between the noisy clips.
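The idea that consistent noise calls for a longer crossfade, while a sharp sound event calls for a shorter one, can be illustrated with a simple heuristic. The sketch below is a hypothetical rule, not the adaptive crossfade from the paper: it measures how much the short-time loudness varies around the cut and maps steadier audio to a longer fade. The function names, the `min_len` and `max_len` bounds, and the steadiness formula are all assumptions.

```python
import numpy as np

def frame_rms(x, frame=256):
    """Root-mean-square level of consecutive non-overlapping frames."""
    n = len(x) // frame
    return np.sqrt(np.mean(x[: n * frame].reshape(n, frame) ** 2, axis=1))

def adaptive_fade_len(a_tail, b_head, min_len=64, max_len=4096, frame=256):
    """Hypothetical heuristic: steadier audio around the cut gets a longer fade.

    `a_tail` is the end of the first clip, `b_head` the start of the second.
    """
    rms = np.concatenate([frame_rms(a_tail, frame), frame_rms(b_head, frame)])
    if rms.size == 0:
        return min_len  # not enough audio to measure; fall back to shortest fade
    variation = np.std(rms) / (np.mean(rms) + 1e-9)  # near 0 for steady noise
    steadiness = 1.0 / (1.0 + variation)             # in (0, 1]
    return int(round(min_len + steadiness * (max_len - min_len)))
```

Under this rule, two clips of steady noise yield a fade length close to `max_len`, while clips that are mostly silent apart from a brief click yield a much shorter one. The returned length can then be passed to any crossfade routine.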