Screen size and display resolution limit the experience of watching videos on mobile devices. The viewing experience can be improved by determining important or interesting regions within the video (called regions of interest, or ROIs) and displaying only the ROIs to the viewer. Previous work focuses on analyzing the video content using visual attention models to infer the ROIs. Such content-based techniques, however, have limitations. In this paper, we propose an alternative paradigm to infer ROIs from a video. We crowdsource from a large number of users through their implicit viewing behavior using a zoom-and-pan interface, and infer the ROIs from their collective wisdom. A retargeted video, consisting of relevant shots determined from historical users' behavior, can be automatically generated and replayed to subsequent users who would prefer a less interactive viewing experience. This paper presents how we collect the user traces, infer the ROIs and their dynamics, group the ROIs into shots, and automatically reframe those shots to improve the aesthetics of the video. A user study with 48 participants shows that our automatically retargeted video is of comparable quality to one handcrafted by an expert user.
The work consists of several steps:
Building Heatmaps. In a given frame, we associate a Gaussian footprint to each viewport (i.e., the rectangular region that a user has visualized through a zoomable video interface) and sum the contributions from all users to obtain a heatmap (see the second row of Figure 1).
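A minimal sketch of this accumulation step (in Python with NumPy; the mapping from viewport size to Gaussian spread is an assumption of this sketch, not taken from the paper):

```python
import numpy as np

def heatmap_from_viewports(viewports, frame_w, frame_h):
    """Sum one Gaussian footprint per user viewport into a heatmap.

    viewports: list of (cx, cy, w, h) rectangles, one per user, given
    by center and size. The Gaussian is centered on the viewport; its
    spread is tied to the viewport size (an assumed convention).
    """
    ys, xs = np.mgrid[0:frame_h, 0:frame_w]
    heat = np.zeros((frame_h, frame_w))
    for cx, cy, w, h in viewports:
        # assumed: sigma = size/4, so ~95% of the mass falls inside the viewport
        sx, sy = w / 4.0, h / 4.0
        heat += np.exp(-((xs - cx) ** 2 / (2 * sx ** 2)
                         + (ys - cy) ** 2 / (2 * sy ** 2)))
    return heat
```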
Estimating GMM. For each frame, we model the associated heatmap as a mixture of Gaussians, whose parameters we estimate using the Mean-Shift algorithm.
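Mean-Shift can be sketched as heat-weighted mode seeking over pixel coordinates, where each recovered mode seeds one component of the per-frame mixture. The bandwidth, the seeding grid, and the mode-merging rule below are illustrative assumptions:

```python
import numpy as np

def mean_shift_modes(heat, bandwidth=10.0, tol=0.5, max_iter=50):
    """Heat-weighted Mean-Shift over pixel coordinates.

    Returns the modes of `heat`; each mode would seed one Gaussian of
    the per-frame mixture. Parameter choices are illustrative.
    """
    h, w = heat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    wts = heat.ravel()
    # start Mean-Shift from a coarse 4x4 grid of seed positions
    seeds = [np.array([x, y]) for x in np.linspace(0, w - 1, 4)
                              for y in np.linspace(0, h - 1, 4)]
    modes = []
    for m in seeds:
        for _ in range(max_iter):
            d2 = ((pts - m) ** 2).sum(axis=1)
            # Gaussian kernel around the current estimate, weighted by heat
            k = wts * np.exp(-d2 / (2 * bandwidth ** 2))
            new = (pts * k[:, None]).sum(axis=0) / k.sum()
            shift = np.linalg.norm(new - m)
            m = new
            if shift < tol:
                break
        # merge seeds that converged to (almost) the same mode
        if not any(np.linalg.norm(m - q) < bandwidth for q in modes):
            modes.append(m)
    return modes
```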
Generating Candidate Shots. We build a graph by defining a node for each Gaussian (as estimated in the previous step) in each frame. Edges in this graph can only connect a Gaussian in one frame to a Gaussian in the following frame. Edges are weighted according to several criteria, including, for example, the Euclidean distance between the Gaussian means associated with the nodes. We then compute the minimum spanning tree of this graph and remove the edges with high weight. At the end of the process, nodes that remain connected form a shot.
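This grouping can be sketched with Kruskal's algorithm and a union-find: keeping only edges below a weight threshold yields the same connected components as building the minimum spanning tree and then cutting its heavy edges. The single distance-based weight and the threshold value are simplifying assumptions (the paper combines several criteria):

```python
from math import hypot

def candidate_shots(frames, threshold=30.0):
    """Group per-frame Gaussian means into candidate shots.

    frames: list over time; frames[t] holds the (x, y) Gaussian means
    of frame t. Edges connect consecutive frames only, weighted here
    by Euclidean distance alone; `threshold` is illustrative.
    """
    nodes = [(t, i) for t, gs in enumerate(frames) for i in range(len(gs))]
    idx = {n: k for k, n in enumerate(nodes)}
    edges = []
    for t in range(len(frames) - 1):
        for i, a in enumerate(frames[t]):
            for j, b in enumerate(frames[t + 1]):
                w = hypot(a[0] - b[0], a[1] - b[1])
                edges.append((w, idx[(t, i)], idx[(t + 1, j)]))
    # Kruskal restricted to edges below the threshold gives the same
    # components as computing the full MST and removing heavy edges.
    parent = list(range(len(nodes)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for w, u, v in sorted(edges):
        if w > threshold:
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    shots = {}
    for n, k in idx.items():
        shots.setdefault(find(k), []).append(n)
    return list(shots.values())
```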
Editing the Video. We start by categorizing shots: fixed shots (the virtual camera is fixed), dolly shots (the virtual camera moves within the frame plane), and zooming shots (the virtual camera moves perpendicular to the frame plane). We then select the most popular shots (those visualized by the largest number of users), stabilize them, and add transitions between them to produce the final video.
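A hypothetical classifier along these lines, labeling a shot from its virtual-camera trajectory (the thresholds and the `(cx, cy, w, h)` viewport encoding are assumptions of this sketch, not taken from the paper):

```python
def categorize_shot(viewports, move_tol=5.0, zoom_tol=0.1):
    """Label a shot from its per-frame virtual-camera viewports.

    viewports: per-frame (cx, cy, w, h) of the shot's virtual camera.
    Thresholds are illustrative.
    """
    cx = [v[0] for v in viewports]
    cy = [v[1] for v in viewports]
    ws = [v[2] for v in viewports]
    pan = max(max(cx) - min(cx), max(cy) - min(cy))
    zoom = (max(ws) - min(ws)) / max(ws)  # relative change in viewport width
    if zoom > zoom_tol:
        return "zooming"  # camera moves perpendicular to the frame plane
    if pan > move_tol:
        return "dolly"    # camera moves within the frame plane
    return "fixed"
```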
A. Carlier, V. Charvillat, W. T. Ooi, R. Grigoras, and G. Morin: Crowdsourced Automatic Zoom and Scroll for Video Retargeting. ACM Multimedia (ACMMM'10), pp. 201–210.