Volumetric video (VV) is an emergent digital medium that enables novel forms of interaction and immersion within virtual worlds. VV allows 3D representations of real-world scenes and objects to be visualized from any viewpoint or viewing direction, an interaction paradigm commonly seen in computer games. Building on this innovative media format, it is possible to design new forms of immersive and interactive experiences that can be visualized via head-mounted displays (HMDs) in virtual reality (VR) or augmented reality (AR). The talk will highlight technology for VV content creation developed by the V-SENSE lab and the startup Volograms. It will further showcase a variety of creative experiments applying VV to immersive storytelling in VR and AR.
The COVID-19 pandemic has exposed the weaknesses of our existing healthcare systems at the level of the city, state, country, and global village. Due to the lack of an available vaccine and the fact that the pathogen transmits from human to human, it has affected the whole world. In order to flatten the curve, healthcare providers have resorted to traditional clinical solutions, which do not scale to the mass level. Thanks to recent advancements in multimedia healthcare technologies in areas such as Self-Explainable Artificial Intelligence, Blockchain, IoT, and Beyond 5G, to name a few, researchers have shown that multimedia can play a key role in managing the digital twin of each individual during the pandemic. In this keynote talk, I will present 25 different domains of multimedia-supported healthcare solutions that have contributed to COVID-19 pandemic management. Finally, I will share some recommendations regarding the way forward.
With the rising popularity of virtual and augmented reality applications, 3D visual representation formats such as point clouds (PCs) have become a hot research topic. Since PCs are essentially sets of points in 3D space with associated features, they are naturally suited to facilitating user interaction and offer a high level of immersion. However, as providing realistic, interactive and immersive experiences typically requires PCs with a rather large number of points, efficient coding is critical, as recognized by standardization groups such as MPEG and JPEG, which have been developing PC coding standards. Scalability is often a requirement for PC applications where the access time to a PC is relevant, even if at lower quality or resolution, and is usually achieved by partially decoding a bitstream structured in multiple layers. Although it may come at the cost of reduced compression efficiency, scalable PC coding remains a coding paradigm that is relatively unexplored in the literature.
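To make the layered-bitstream idea concrete, here is a minimal sketch of resolution-scalable decoding from a breadth-first octree bitstream (one occupancy byte per occupied node). The layout and names are illustrative assumptions, not the actual MPEG or JPEG bitstream syntax:

```python
# Minimal sketch of resolution-scalable point cloud decoding from a
# breadth-first octree bitstream (one 8-bit occupancy mask per occupied
# node). Illustrative layout, not any standardized syntax.

def decode_octree_levels(occupancy_bytes, max_level):
    """Decode only the first `max_level` octree levels; each extra level
    doubles the spatial resolution of the reconstructed cloud."""
    nodes = [(0, 0, 0)]          # root node at the coarsest level
    it = iter(occupancy_bytes)
    for level in range(max_level):
        next_nodes = []
        for (x, y, z) in nodes:
            occ = next(it)       # occupancy mask of this node's 8 children
            for child in range(8):
                if occ & (1 << child):
                    dx, dy, dz = child & 1, (child >> 1) & 1, (child >> 2) & 1
                    next_nodes.append((2 * x + dx, 2 * y + dy, 2 * z + dz))
        nodes = next_nodes
    return nodes                 # voxel coordinates at the chosen level
```

Because the occupancy bytes of coarse levels precede those of finer levels, a decoder can stop after any prefix of the layers and still obtain a valid low-resolution reconstruction; this is the essence of resolution scalability, traded against the compression efficiency of a single-layer bitstream.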
The popularity of deep learning in multimedia processing tasks has increased markedly in recent years due to its impressive performance. In terms of coding, recent deep learning-based image coding solutions offer very promising results, even outperforming state-of-the-art image codecs. Part of this success may be attributed to convolutional neural networks, which take advantage of spatial redundancy by hierarchically detecting patterns to obtain a more meaningful latent representation. In this context, it is natural to extend the deep learning-based coding approach to PCs, for example by coding 3D blocks of voxels instead of 2D blocks of pixels as in image and video coding.
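As a rough illustration of this extension, the sketch below applies a 3D convolutional autoencoder to blocks of voxels; the architecture and layer sizes are arbitrary assumptions for illustration, not any particular published codec:

```python
import torch
import torch.nn as nn

# Illustrative 3D convolutional autoencoder for coding 64x64x64 voxel
# occupancy blocks; layer sizes are arbitrary choices, not a published codec.
class VoxelBlockAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Analysis transform: hierarchically detect 3D patterns,
        # shrinking the block into a compact latent representation.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=5, stride=2, padding=2),
        )
        # Synthesis transform: reconstruct per-voxel occupancy probabilities.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 32, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, block):              # block: (N, 1, 64, 64, 64) in {0, 1}
        latent = self.encoder(block)       # (N, 32, 8, 8, 8): what gets entropy-coded
        return torch.sigmoid(self.decoder(latent))

model = VoxelBlockAE()
recon = model(torch.zeros(1, 1, 64, 64, 64))  # reconstructed occupancy probabilities
```

In an actual learning-based codec, the latent tensor would additionally be quantized and entropy-coded, with the rate-distortion trade-off optimized end to end; the sketch omits those stages.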
Building on these developments, this talk will address emerging directions in point cloud coding, notably the recent MPEG and JPEG standardization projects as well as the very recent deep learning-based coding approaches, with a special focus on scalability.
Crying is the infant’s first form of communication. Before learning how to express their emotions or physiological/psychological needs with language, infants usually express how they feel to their parents through crying. According to reports by pediatricians, normal newborns cry two hours a day. However, it is sometimes difficult for parents to figure out why the baby is crying. In 2014, we cooperated with the National Taiwan University Hospital Yunlin Branch to develop the "Infant Crying Translator", which identifies whether a baby is hungry, has a wet diaper, wants to sleep, or is in pain. Furthermore, in order to provide more comprehensive newborn care services, we have also developed a comprehensive intelligent baby monitor based on incremental learning.
In this talk, I will introduce the functional development of the smart baby monitor, including crying detection, crying analysis, vomiting detection, and facial heart-rate and breathing detection. We propose a new deep learning network for cry recognition that breaks through the limitations of traditional machine learning methods, and we introduce an incremental learning mechanism to shorten the adaptation of individual crying models. Features such as infant vomiting detection and facial heart-rate and breathing detection have also been developed to improve the monitoring system for newborns, making it easier for novice parents to care for their babies and reducing accidents.
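As a hypothetical illustration of such an incremental mechanism (the actual system's design is not detailed here), a shared cry-recognition backbone could be frozen and only a small per-baby classifier head fine-tuned on a few new recordings, keeping individual model updates cheap:

```python
import torch
import torch.nn as nn

# Hypothetical illustration: adapt a shared cry-recognition model to one
# baby by fine-tuning only its classifier head on a few new recordings.
def adapt_to_baby(base_model: nn.Module, head: nn.Linear,
                  clips: torch.Tensor, labels: torch.Tensor, steps: int = 50):
    for p in base_model.parameters():        # freeze the shared feature extractor
        p.requires_grad = False
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        feats = base_model(clips)            # spectrogram clips -> embeddings
        loss = loss_fn(head(feats), labels)  # labels: hungry / wet / sleepy / pain
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head                              # per-baby classifier, cheap to update
```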
Just-noticeable differences (JNDs), as perceptual thresholds of visibility, determine the minimal amount of change required for a difference to be sensed by human beings (e.g., by 75% of a population), and they play an important role, both explicitly and implicitly, in many applications. The measurement, formulation and computational modeling of JND are prerequisites for user-centric designs that turn human perceptual limitations into meaningful system advantages. In this talk, a holistic view will be presented of visual JND research and practice: absolute and utility-oriented JNDs; pixel-, subband- and picture-based JNDs; conventional and data-driven JND estimation; and databases and model evaluation. Other factors influencing JND, such as culture and personality, will also be highlighted. JND modeling for visual signals (naturally captured, computer-generated or mixed) has attracted much research interest so far, while JND modeling for audio, haptics, olfaction and gustation is expected to attract increasing interest on the way toward true multimedia. Possible new directions will then be discussed in order to advance the relevant research.
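As a concrete example of a pixel-based JND model, the sketch below computes a luminance-adaptation threshold map in the style of classic formulations such as Chou and Li's; the constants are those commonly cited for that model, while the plain 5x5 mean window is a simplifying assumption:

```python
import numpy as np

# Pixel-domain JND from background-luminance adaptation, in the style of
# classic models (constants as commonly cited; treat them as illustrative).
def luminance_jnd(gray: np.ndarray) -> np.ndarray:
    """gray: 8-bit luminance image. Returns the per-pixel visibility
    threshold: distortions below it should be imperceptible."""
    # Local background luminance via a simple 5x5 mean (a stand-in for the
    # weighted window used in the original formulation).
    k = np.ones((5, 5)) / 25.0
    pad = np.pad(gray.astype(np.float64), 2, mode="edge")
    bg = np.zeros_like(gray, dtype=np.float64)
    for dy in range(5):
        for dx in range(5):
            bg += k[dy, dx] * pad[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    dark = 17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0   # higher threshold in dark areas
    bright = (3.0 / 128.0) * (bg - 127.0) + 3.0       # slowly rising in bright areas
    return np.where(bg <= 127.0, dark, bright)
```

A perceptual coder can then, for instance, discard any residual whose magnitude stays below this per-pixel map, turning the invisibility of small differences into bitrate savings.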
Person re-identification (re-id), which associates persons across non-overlapping camera views, is an important research topic in visual surveillance. While person re-id has developed rapidly over the last ten years, it still suffers from many serious unresolved influences, such as illumination and clothing changes. In addition, the performance of most re-id algorithms at present depends heavily on the annotation of massive amounts of data, and how to deal with large amounts of weakly annotated data, or person re-id with no annotated data at all, remains an urgent challenge. In this talk, we will introduce research on weak person re-identification, including weakly supervised solutions for re-id with weak labels and new models for re-id with weak visual cues.
In surveillance, person re-identification (re-id) has emerged as a fundamental capability that no tracking system aiming to operate over a wide-area network of disjoint cameras can realistically do without. Performing person re-identification is challenging because of the many sources of appearance variability, such as lighting, pose, viewpoint and occlusions, especially in outdoor environments, where they are even less constrained.
In this talk, grounded in real-world scenarios, we address person re-id problems through the following straightforward but effective intuitions:
- Addition. Video-based person re-id deals with the inherent difficulty of matching unregulated sequences of different lengths and with incomplete target pose/viewpoint coverage. To this end, we propose a novel approach that exploits the rich video information more effectively by **addition**. Specifically, we complement the original pose-incomplete information carried by the sequences with synthetic GAN-generated images and fuse their feature vectors into a more discriminative, viewpoint-insensitive embedding [1] (see the first sketch after this list).
- Subtraction. Conversely, video-based person re-id suffers from low-quality frames caused by severe motion blur, occlusion, distractors, etc. To address this, we introduce Class-Aware Attention (CAA) in deep metric learning, which **subtracts** abnormal and trivial samples from video sequences [2] (see the second sketch after this list).
- Geometry. The viewpoint variability across a network of non-overlapping cameras is a challenging problem affecting person re-id performance. We investigate how to mitigate cross-view ambiguity by learning highly discriminative deep features with **geometric** information. The proposed objective is made up of two terms, the Steering Meta Center (SMC) term and the Enhancing Centers Dispersion (ECD) term, which steer the training process toward mining effective intra-class and inter-class relationships in the feature domain of the identities [3] (see the third sketch after this list).
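The sketches below illustrate the three intuitions in simplified form. First, **addition**: features of real frames and of GAN-synthesized, pose-completing frames are extracted by the same backbone and fused into one sequence-level embedding. Average pooling is an assumed fusion rule here, not necessarily the one used in [1]:

```python
import torch

# Illustrative fusion for the "addition" idea: pool features of real frames
# together with features of GAN-synthesized pose-completing frames.
# Average pooling is an assumed fusion rule, not the paper's exact one.
def fused_embedding(backbone, real_frames, synth_frames):
    f_real = backbone(real_frames)         # (T, D) features of observed frames
    f_synth = backbone(synth_frames)       # (S, D) features of GAN frames
    f_all = torch.cat([f_real, f_synth], dim=0)
    emb = f_all.mean(dim=0)                # sequence-level embedding
    return emb / emb.norm()                # unit-normalized for matching
```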
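Second, **subtraction**: a simplified class-aware weighting in the spirit of CAA, where frames whose features disagree with their identity prototype (likely blurred or occluded) receive low attention. This is a simplified reading, not the exact formulation of [2]:

```python
import torch
import torch.nn.functional as F

# Simplified class-aware weighting in the spirit of CAA: frames whose
# features disagree with their class prototype (likely blurred/occluded)
# receive low attention, effectively "subtracting" them from the sequence.
def class_aware_weights(feats, class_center, temperature=0.1):
    feats = F.normalize(feats, dim=1)              # (T, D) frame features
    center = F.normalize(class_center, dim=0)      # (D,) identity prototype
    sims = feats @ center                          # agreement with the class
    return F.softmax(sims / temperature, dim=0)    # (T,) attention weights

def attended_embedding(feats, class_center):
    w = class_aware_weights(feats, class_center)
    return (w.unsqueeze(1) * feats).sum(dim=0)     # weighted sequence embedding
```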
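Third, **geometry**: a two-term objective in the spirit of SMC and ECD, pulling each feature toward its identity center while dispersing distinct centers. Both terms are schematic stand-ins for the exact definitions in [3]:

```python
import torch

# Two-term geometric objective in the spirit of SMC + ECD: pull each
# feature toward its identity center, and push distinct centers apart.
# Both terms are schematic stand-ins for the exact published definitions.
def geometric_loss(feats, labels, centers, margin=1.0, lam=0.5):
    # Intra-class term: steer features toward their own class centers.
    pull = ((feats - centers[labels]) ** 2).sum(dim=1).mean()
    # Inter-class term: disperse the centers by penalizing any pair of
    # centers closer than `margin`.
    d = torch.cdist(centers, centers)              # (C, C) center distances
    off_diag = ~torch.eye(len(centers), dtype=torch.bool)
    push = torch.relu(margin - d[off_diag]).mean()
    return pull + lam * push
```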
[1] Alessandro Borgia, **Yang Hua**, Elyor Kodirov, and Neil Robertson. GAN-based Pose-aware Regulation for Video-based Person Re-identification. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2019 (oral).
[2] Xinshao Wang, **Yang Hua**, Elyor Kodirov, Guosheng Hu, and Neil Robertson. Deep Metric Learning by Online Soft Mining and Class-Aware Attention. In AAAI Conference on Artificial Intelligence (AAAI), 2019 (oral).
[3] Alessandro Borgia, **Yang Hua**, Elyor Kodirov, and Neil Robertson. Cross-view Discriminative Feature Learning for Person Re-Identification. IEEE Transactions on Image Processing, 27(11):5338-5349, 2018.