Traditionally, people watched videos on TVs with a 16:9 or 4:3 aspect ratio. Today, however, videos are created and viewed on devices with a wide range of aspect ratios, and cropping each video to fit these screens is a tedious task for video curators. To address this, Google has built AutoFlip, a tool that reframes videos automatically.

Google created AutoFlip to replace the conventional static cropping method. Static cropping means specifying a fixed camera viewport for the video and cropping everything outside that area. Because the viewport stays fixed regardless of where the content moves, this method often produces unsatisfactory results.

Shot (Scene) Detection

A shot, or scene, in a video is a continuous sequence of frames without any cuts. AutoFlip detects a change of shot by comparing the colour histogram of each new frame with those of the previous frames. A shot change is detected when the distribution of frame colours changes at a different rate than it does over a sliding historical window. To optimise the reframing for the entire scene, the tool buffers the video until the scene is complete before making any reframing decisions.
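The histogram comparison described above can be illustrated with a minimal sketch. This is not AutoFlip's actual implementation: the function names, the bin count, the window length and the thresholds (`factor`, `min_diff`) are all assumptions chosen for illustration, and frames are assumed to be H x W x 3 `uint8` NumPy arrays.

```python
import numpy as np

def colour_histogram(frame, bins=8):
    """Return a normalised per-channel colour histogram of an HxWx3 uint8 frame."""
    per_channel = [
        np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(per_channel).astype(float)
    return hist / hist.sum()

def detect_shot_changes(frames, window=5, factor=3.0, min_diff=0.2):
    """Return the indices of frames that start a new shot.

    A cut is flagged when the histogram difference to the previous frame is
    both above min_diff and well above the average difference seen over a
    sliding window of recent frames (hypothetical thresholds).
    """
    cuts = []
    recent = []  # recent frame-to-frame differences within the current shot
    prev = None
    for i, frame in enumerate(frames):
        hist = colour_histogram(frame)
        if prev is not None:
            # Half the L1 distance between two distributions lies in [0, 1].
            diff = 0.5 * np.abs(hist - prev).sum()
            baseline = np.mean(recent) if recent else 0.0
            if diff > min_diff and diff > factor * baseline:
                cuts.append(i)
                recent = []  # start a fresh history for the new shot
            else:
                recent.append(diff)
                recent = recent[-window:]
        prev = hist
    return cuts
```

Comparing against a sliding baseline rather than a fixed threshold alone means a shot with steadily changing colours (for example, a slow fade in lighting) is less likely to be mistaken for a cut.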

Next, the tool detects the important objects and people in the video, using deep learning-based object detection models. With these models, the tool can also detect text overlays, brand logos and other elements such as motion or the ball in sports videos. The face and object detection models are integrated into the tool through MediaPipe, a framework for building pipelines that process multimodal data. MediaPipe runs these models with Google’s TensorFlow Lite on CPUs.

After identifying people and objects in the video, the tool decides how to reframe it. AutoFlip chooses one of three reframing strategies to crop the content: stationary, panning or tracking. The tool picks the optimal strategy based on the content of the video. In stationary mode, the reframed camera viewport stays fixed at a position where most of the important content is visible. For videos that contain motion, panning mode moves the reframed camera viewport at a constant velocity. Tracking mode follows interesting subjects as they move around within the frame.
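The article does not say how AutoFlip decides between the three modes, so the following is only a hypothetical heuristic to make the distinction concrete. It assumes the horizontal centre of the main subject is known for each frame (normalised to 0..1); the function name `choose_strategy` and the thresholds `slack` and `max_residual` are invented for this sketch.

```python
import numpy as np

def choose_strategy(centres, slack=0.05, max_residual=0.02):
    """Pick a reframing mode from per-frame subject centres (normalised 0..1).

    Illustrative heuristic only, not AutoFlip's actual decision logic:
    - stationary: the subject barely moves, so one fixed viewport suffices
    - panning:    the motion is close to a straight line (constant velocity)
    - tracking:   anything else, so the viewport must follow the subject
    """
    centres = np.asarray(centres, dtype=float)
    if centres.max() - centres.min() <= slack:
        return "stationary"
    t = np.arange(len(centres))
    # Fit a constant-velocity path and measure how far the subject strays from it.
    slope, intercept = np.polyfit(t, centres, 1)
    residual = np.abs(centres - (slope * t + intercept)).max()
    if residual <= max_residual:
        return "panning"
    return "tracking"
```

For example, a subject drifting steadily from left to right would be classified as panning, while one darting back and forth would fall through to tracking.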

Based on the chosen reframing strategy, AutoFlip sets an optimised cropping window for each frame, so that the important content of the video is preserved as well as possible.
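As a rough illustration of what a per-frame cropping window looks like, the sketch below computes a full-height crop with a target aspect ratio, centred on a given horizontal position and clamped so it never leaves the frame. The function `crop_window` and its parameters are assumptions made for this example; AutoFlip's real optimisation is more involved than simple centring.

```python
def crop_window(frame_w, frame_h, target_aspect, centre_x):
    """Return (left, top, width, height) of a full-height crop.

    target_aspect is width / height of the output (e.g. 1.0 for square).
    The window is centred on centre_x (in pixels) and clamped to the frame.
    Assumes the target is no wider than the source; otherwise the crop
    simply spans the full frame width.
    """
    crop_w = min(frame_w, round(frame_h * target_aspect))
    left = round(centre_x - crop_w / 2)
    left = max(0, min(left, frame_w - crop_w))  # keep the window inside the frame
    return left, 0, crop_w, frame_h

# One window per frame, following a hypothetical subject path (pixel x positions).
path = [100, 960, 1900]
windows = [crop_window(1920, 1080, 1.0, cx) for cx in path]
```

In stationary mode the same `centre_x` would be used for every frame, whereas in panning or tracking mode it would change from frame to frame.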