Liqiang Nie, Meng Liu, and Xuemeng Song
The unprecedented growth of portable devices has contributed to the success of micro-video sharing platforms such as Vine, Kuaishou, and TikTok. These platforms enable users to record and share their daily lives in the form of micro-videos lasting only a few seconds, anytime and anywhere. As a new media type, micro-videos have gained tremendous user enthusiasm by virtue of their brevity, authenticity, communicability, and low cost. The proliferation of micro-videos confirms the old saying that good things come in small packages.
Like traditional long videos, micro-videos are a unity of textual, acoustic, and visual modalities. These modalities are correlated rather than independent, and they essentially characterize the same micro-video from distinct angles. Effectively fusing heterogeneous modalities for video understanding has indeed been well studied in the past decade. Yet micro-videos have their own unique characteristics and corresponding research challenges: 1) Information sparseness. Micro-videos are very short, lasting 6–15 seconds, and hence usually convey only a few concepts. In light of this, we need to learn their sparse and conceptual representations for better discrimination. 2) Hierarchical structure. Micro-videos are implicitly organized into a four-layer hierarchical tree structure with respect to their recording venues. We should leverage this structure to guide the organization of micro-videos by categorizing them into the leaf nodes of the tree. 3) Low quality. Most portable devices offer nothing for video stabilization, so some videos can be shaky or bumpy, which greatly hinders their visual expressiveness. Furthermore, the audio track that accompanies the video can suffer from various kinds of distortion and noise, such as buzzing, hums, hisses, and whistling, probably caused by poor microphones or complex surrounding environments. We thus have to harness external visual or acoustic knowledge to compensate for these weakest links. 4) Multimodal sequential data. Beyond the textual, acoustic, and visual modalities, micro-videos carry a new one, namely the social modality, whereby a user can interact with micro-videos and other users via social actions such as clicking, liking, and following. As time goes on, sequential data in multiple forms emerge, and they reflect users' historical preferences. To strengthen micro-video understanding, we have to characterize and model these sequential patterns. And 5) Lack of benchmarks. The last challenge we face is the absence of benchmark datasets to support this research.
In this book, we present several state-of-the-art multimodal learning theories and verify them on three practical tasks of micro-video understanding: popularity prediction, venue category estimation, and micro-video routing. In particular, we first construct three large-scale, real-world micro-video datasets corresponding to these three tasks. We then propose a multimodal transductive learning framework that learns micro-video representations in an optimal latent space by unifying and preserving the information from different modalities. Into this transductive framework we integrate low-rank constraints to alleviate the information-sparseness and low-quality problems, and we verify the framework on the popularity prediction task. We next present a series of multimodal cooperative learning approaches that explicitly model the consistent and complementary correlations among modalities. In these approaches, we make full use of the hierarchical structure via the tree-guided group lasso and further address information sparseness via dictionary learning. Following that, we work towards compensating for the low-quality acoustic modality by harnessing external sound knowledge, accomplished by a deep multimodal transfer learning scheme. Both the multimodal cooperative learning approaches and the multimodal transfer learning scheme are justified on the task of venue category estimation. Thereafter, we develop a multimodal sequential learning model, relying on temporal graph-based LSTM networks, to intelligently route micro-videos to target users in a personalized manner. We ultimately summarize the book and outline future research directions in multimodal learning towards micro-video understanding.
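To give a concrete flavor of how the venue hierarchy can guide learning, the following is a minimal sketch of a tree-guided group lasso objective; the notation here is our own illustration rather than the exact formulation used in the book. Let $\mathbf{W}$ be a feature-by-category weight matrix, $\mathcal{T}$ the set of nodes in the four-layer venue tree, $G_v$ the group of coefficients associated with node $v$ (i.e., the columns of all leaf categories descending from $v$), $\eta_v$ a node weight, and $\lambda$ the regularization strength:
\[
\min_{\mathbf{W}} \; \mathcal{L}(\mathbf{W}) \; + \; \lambda \sum_{v \in \mathcal{T}} \eta_v \, \bigl\| \mathbf{W}_{G_v} \bigr\|_{2},
\]
where $\mathcal{L}(\mathbf{W})$ is any suitable classification loss over the multimodal features. Because every internal node contributes a group penalty covering all of its descendant leaf categories, features tend to be selected or discarded consistently along root-to-leaf paths, so sibling venue categories share discriminative features while the overall solution remains sparse.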
This book represents preliminary research on learning from the multiple correlated modalities of micro-videos, and we anticipate that the Lectures in this series will dramatically influence future thought on these subjects. If we have seen further, it is by standing on the shoulders of giants.