Welcome to the 3rd Workshop on Image/Video/Audio Quality in Computer Vision and Generative AI!
Many machine learning tasks and computer vision algorithms are sensitive to image/video/audio quality artifacts. Nonetheless, most visual learning and vision systems assume high-quality image/video/audio as input. In reality, noise and distortion are common in the image/video/audio capture and acquisition process. Oftentimes, artifacts are introduced during video compression, transcoding, transmission, decoding, and/or rendering. All of these quality issues play a critical role in the performance of learning algorithms, systems, and applications, and can therefore directly impact the customer experience.
This workshop addresses topics related to image/video/audio quality in machine learning, computer vision, and generative AI. The topics include, but are not limited to:
Alongside the workshop we are hosting a Grand Challenge on the identification of audiovisual synchronization (AV-Sync) errors. An AV-Sync error, the temporal misalignment of audio relative to the video, is a defect that can have a large impact on a viewer’s Quality of Experience. The International Telecommunication Union’s Rec. ITU-T J.248, Requirements for operational monitoring of video-to-audio delay in the distribution of television programs (2008), used subjective experiments to determine that participants could detect an AV-Sync error if the audio leads the video by more than 45 ms or lags the video by more than 125 ms.
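The asymmetric thresholds above can be captured in a few lines. The following is a minimal illustrative sketch (the function name and the sign convention, negative offsets meaning the audio leads the video, are our assumptions, not part of the recommendation):

```python
# Detectability thresholds quoted from Rec. ITU-T J.248 (2008).
AUDIO_LEAD_THRESHOLD_MS = 45   # detectable if audio leads video by more than 45 ms
AUDIO_LAG_THRESHOLD_MS = 125   # detectable if audio lags video by more than 125 ms

def av_sync_error_detectable(offset_ms: float) -> bool:
    """Return True if a typical viewer could detect the AV-Sync error.

    Convention (an assumption for this sketch): offset_ms < 0 means the
    audio leads the video; offset_ms > 0 means the audio lags the video.
    """
    if offset_ms < 0:  # audio leads
        return -offset_ms > AUDIO_LEAD_THRESHOLD_MS
    return offset_ms > AUDIO_LAG_THRESHOLD_MS
```

Note the asymmetry: a 100 ms lag is tolerable, while a 100 ms lead is not, reflecting that viewers are more sensitive to audio arriving early than late.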
The Grand Challenge will consist of two tasks:
The two tasks will have a combined prize fund of $12,000. We invite participants to enter one or both of the tasks via the links above, with the following timelines:
Abstract: Visual content is increasingly being used for more than human viewing. For example, traffic video is automatically analyzed to count vehicles, detect traffic violations, estimate traffic intensity, and recognize license plates; images uploaded to social media are automatically analyzed to detect and recognize people, organize images into thematic collections, and so on; visual sensors on autonomous vehicles analyze captured signals to help the vehicle navigate, avoid obstacles and collisions, and optimize its movement. The above applications require continuous machine-based analysis of visual signals, with only occasional human viewing, which necessitates rethinking the traditional approaches to image and video compression. This talk is about coding visual information in ways that enable efficient usage by machine learning models, in addition to human viewing. We will touch upon recent rate-distortion results in this field, describe several designs for human-machine image and video coding, and briefly review related standardization efforts.
Bio: Ivan V. Bajić is a Professor of Engineering Science and co-director of the Multimedia Lab at Simon Fraser University, Canada. His research interests include signal processing and machine learning with applications to multimedia processing, compression, and collaborative intelligence. His group’s work has received several research awards, including the 2023 TCSVT Best Paper Award, conference paper awards at ICME 2012, ICIP 2019, MMSP 2022, and ISCAS 2023, and other recognitions (e.g., paper award finalist, top n%) at Asilomar, ICIP, ICME, ISBI, and CVPR. Ivan has served on the organizing and/or program committees of the main conferences in his field, and has received several awards in these roles, including Outstanding Reviewer Award (six times), Outstanding Area Chair Award, and Outstanding Service Award. He was on the editorial boards of the IEEE Transactions on Multimedia and IEEE Signal Processing Magazine, and is currently a Senior Area Editor of the IEEE Signal Processing Letters.
Abstract: Informally, color images may be perceived as higher quality than grayscale ones. All major face image datasets used by the research community contain RGB color images, and all deep CNN face matchers process color images. But do CNN face matchers that work with color face images achieve better accuracy than equivalent matchers working with equivalent grayscale images? This talk examines this question in detail, and concludes that grayscale could in fact be better than color for deep CNN face matchers.
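To make the comparison in the abstract concrete: one common way to produce an "equivalent grayscale image" that an unmodified RGB CNN can still consume is to collapse the three channels to luma and replicate it. This is a hedged sketch of that idea (the luma weights are the standard ITU-R BT.601 ones; the function names are ours and are not claimed to be the speaker's method):

```python
# Sketch: converting an RGB pixel to grayscale (BT.601 luma) and
# replicating the result across three channels, so a model trained
# to accept 3-channel input can process a grayscale image unchanged.

def rgb_to_luma(r: float, g: float, b: float) -> float:
    """BT.601 luma: Y = 0.299 R + 0.587 G + 0.114 B."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def gray_as_rgb(r: float, g: float, b: float) -> tuple:
    """Replicate the luma value into three channels for RGB-only matchers."""
    y = rgb_to_luma(r, g, b)
    return (y, y, y)
```

Applied per pixel over a whole face image, this yields a 3-channel input that carries no chroma information, which is exactly the controlled condition the talk's question turns on.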
If you have any questions or inquiries, please contact us at email@example.com.