This tutorial will deliver a broad overview of the main technologies that enable the automatic generation of video summaries for re-use in different distribution channels and the optimisation of audience reach and engagement based on these summaries. It will also provide an in-depth analysis of selected state-of-the-art (SoA) methods and tools on these topics. It will comprise two main modules. The first module, on video summary generation, will provide an overview of deep-learning-based video summarization techniques and will then discuss in depth a few selected SoA techniques that are based on Generative Adversarial Networks. Special emphasis will be put on unsupervised learning techniques, whose advantages will also be elaborated. An overview of video summarization datasets, evaluation protocols, and related considerations and limitations will also be presented. The second module, on video summary (re-)use and recommendation, will discuss the use of Web and social media analysis to detect topics in online content and trends in online discussion. It will subsequently examine the application of predictive analytics to suggest future trending topics, in order to guide video summary publication strategies. Besides the underlying technologies, a few complete tools will be demonstrated, to link the research aspects of video summarization, trend detection and predictive analytics with practitioners’ expectations and needs for video summarization and (re-)publication online. The tutorial’s target audience includes researchers in video summarization and deep learning and, more generally, in deep-learning-based multimedia understanding; researchers in web and social media data analysis, topic and trend detection, and predictive analytics; and practitioners in video content creation and (re-)use, including YouTube/Instagram prosumers, TV and film producers, and representatives of broadcasters and online media platforms.
This tutorial addresses the advances in deep Bayesian learning for spatial and temporal data, which are ubiquitous in speech, music, text, image, video, web, communication and networking applications. Multimedia content is analyzed and represented to fulfill a variety of tasks, including classification, synthesis, generation, segmentation, dialogue, search, recommendation, summarization, answering, captioning, mining, translation and adaptation, to name a few. Traditionally, “deep learning” is taken to be a learning process in which inference or optimization is based on a real-valued deterministic model. The “latent semantic structure” in words, sentences, images, actions, documents or videos learned from data may not be well expressed or correctly optimized in mathematical logic or computer programs. The “distribution function” in discrete or continuous latent variable models for spatial and temporal sequences may not be properly decomposed or estimated. This tutorial addresses the fundamentals of statistical models and neural networks, and focuses on a series of advanced Bayesian models and deep models, including Bayesian nonparametrics, recurrent neural networks, sequence-to-sequence models, variational auto-encoders (VAEs), generative adversarial networks, attention mechanisms, memory-augmented neural networks, skip neural networks, temporal difference VAEs, stochastic neural networks, stochastic temporal convolutional networks, predictive state neural networks, and policy neural networks. Enhancing the prior/posterior representation is also addressed. We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in sequence data. Variational inference and sampling methods are formulated to tackle optimization in complicated models. The embeddings, clustering or co-clustering of words, sentences or objects are merged with linguistic and semantic constraints.
A series of case studies is presented to tackle different issues in deep Bayesian modeling and learning. Finally, we will point out a number of directions and outlooks for future studies.
The tutorial provides an overview of the latest emerging video coding standard, VVC (Versatile Video Coding), to be jointly published by ITU-T and ISO/IEC. It has been developed by the Joint Video Experts Team (JVET), consisting of ITU-T Study Group 16 Question 6 (known as VCEG) and ISO/IEC JTC 1/SC 29/WG 11 (known as MPEG). VVC has been designed to achieve significantly improved compression capability compared to previous standards such as HEVC, and at the same time to be highly versatile for effective use in a broadened range of applications. Key application areas for VVC include ultra-high-definition video (e.g., 4K or 8K resolution), video with a high dynamic range and wide colour gamut (e.g., with transfer characteristics specified in Rec. ITU-R BT.2100), and video for immersive media applications such as 360° omnidirectional video, in addition to the applications that have commonly been addressed by prior video coding standards. Important design criteria for VVC have been low computational complexity on the decoder side and friendliness to parallelization at various algorithmic levels. VVC is planned to be finalized by July 2020 and is expected to enter the market very soon. The tutorial details the video layer coding tools specified in VVC and develops the concepts behind the selected design choices. While many tools, or variants thereof, have been available before, the VVC design includes many improvements over previous standards, which result in compression gain and implementation friendliness. Furthermore, new tools such as the Adaptive Loop Filter and Matrix-based Intra Prediction have been adopted, which contribute significantly to the overall performance. The high-level syntax of VVC has been re-designed compared to previous standards such as HEVC, in order to enable dynamic sub-picture access as well as major scalability features already in version 1 of the specification.
Similar to the identification of people through human fingerprint analysis, multimedia forensics and security assurance through device fingerprint analysis have attracted much attention amongst scientists, practitioners and law enforcement agencies around the world in the past decade. Device information stored in the EXIF metadata, such as device models and serial numbers, is useful for identifying the devices responsible for the creation of the images and videos in question. However, because the EXIF metadata is stored separately from the content, it can be removed or manipulated with ease. Device fingerprints deposited in the content by the devices provide a more reliable alternative to aid forensic investigations and multimedia assurance. Various hardware or software components of imaging devices leave model- or device-specific artifacts in the content during the digital image acquisition process. These model- or device-specific artifacts, if properly extracted, can be used as device fingerprints to identify the source devices. This tutorial will start with an introduction to various types of device fingerprints. The presentation will then focus on sensor pattern noise, which is currently the only form of device fingerprint that can differentiate individual devices of the same model. We will also discuss the real-world applications of sensor pattern noise to source device verification, common source inference, source device identification, content authentication (including fake news detection) and source-oriented image clustering. Some real-world use cases in the law enforcement community will also be presented. Finally, we will discuss the limitations of existing device fingerprints and point out a few lines for future investigation, including the use of deep learning to infer device fingerprints.
Recently, 3D visual representation models such as light fields and point clouds have become popular due to their capability to represent the real world in a more complete, realistic and immersive way, paving the way for new and more advanced visual experiences. The point cloud (PC) representation model is able to efficiently represent the surface of objects/scenes by means of a set of 3D points and associated attributes, and is increasingly being used in applications ranging from autonomous cars to augmented reality. Emerging imaging sensors have made it easier to perform richer and denser PC acquisitions, notably with millions of points, making it impractical to store and transmit the resulting very large amounts of data. This bottleneck has raised the need for efficient PC coding solutions that can offer immersive visual experiences and good quality of experience. This tutorial will survey the most relevant PC basics as well as the main PC coding solutions available today. Regarding the content of this tutorial, it is important to highlight: 1) a new classification taxonomy for PC coding solutions to more easily identify and abstract their differences, commonalities and relationships; 2) representative static and dynamic PC coding solutions available in the literature, such as octree-, transform- and graph-based PC coding, among others; 3) MPEG PC standard coding solutions which have been recently developed, notably Video-based Point Cloud Compression (V-PCC), for dynamic content, and Geometry-based Point Cloud Compression (G-PCC), for static and dynamically acquired content; 4) rate-distortion (RD) performance evaluation including the G-PCC and V-PCC standards and other relevant PC coding solutions, using suitable objective quality metrics. The tutorial will end with some discussion on the strengths and weaknesses of the current PC coding solutions as well as on future trends and directions.