
VASC Seminar

Adriana Kovashka, Assistant Professor, University of Pittsburgh
Monday, February 3
3:00 pm to 4:00 pm
GHC 6501
Reasoning about complex media from weak multi-modal supervision

Abstract:  In a world of abundant information targeting multiple senses, and of increasingly powerful media, we need new mechanisms to model content. Techniques for representing individual channels, such as visual or textual data, have greatly improved, and some techniques exist to model the relationship between channels that are “mirror images” of each other and contain the same semantics. However, multimodal data in the real world contains little redundancy: the visual and textual channels complement rather than mirror each other. We examine the relationship between multiple channels in complex media in two domains: advertisements and political articles.

We develop a large annotated data set of advertisements and public service announcements, covering almost forty topics (ranging from automobiles and clothing to health and domestic violence). We pose decoding the ads as automatically answering the questions “What should the viewer do, according to the ad?” (the suggested action) and “Why should the viewer do the suggested action, according to the ad?” (the suggested reason). We collect annotations and train a variety of algorithms to choose the appropriate action-reason statement, given the ad image and potentially a slogan embedded in it. The task is challenging because of the great diversity in how different users annotate an ad, even when they draw similar conclusions. One approach mines information from external knowledge bases, but much of the information that can be retrieved is not relevant. We show how to automatically transform the training data in order to focus our approach’s attention on relevant facts, without relevance annotations for training. We also present an approach for learning to recognize new concepts given supervision only in the form of noisy captions.
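As an illustration of the matching task only (not the speaker's actual model), the following minimal sketch scores candidate action-reason statements against an ad image by cosine similarity in a shared projection space; the encoders, feature dimensions, and projection layers are assumed stand-ins.

```python
# Hypothetical sketch: rank candidate action-reason statements for an ad
# by cosine similarity in a shared image-text embedding space.
# The feature extractors and dimensions below are stand-ins, not the
# architecture described in the talk.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D_IMG, D_TXT, D_JOINT = 2048, 768, 256            # assumed feature sizes
img_proj = torch.nn.Linear(D_IMG, D_JOINT)        # projects image features
txt_proj = torch.nn.Linear(D_TXT, D_JOINT)        # projects sentence features

# One ad image feature and K candidate action-reason statement features
# (in practice these would come from pretrained vision/language encoders).
img_feat = torch.randn(1, D_IMG)
cand_feats = torch.randn(15, D_TXT)               # e.g. 3 correct + 12 distractors

img_emb = F.normalize(img_proj(img_feat), dim=-1)
txt_emb = F.normalize(txt_proj(cand_feats), dim=-1)

scores = img_emb @ txt_emb.t()                    # cosine similarities, shape (1, K)
best = scores.argmax(dim=-1)
print("highest-scoring candidate index:", best.item())
```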

Next, we collect a data set of multimodal political articles containing lengthy text and a small number of images. We learn to predict the political bias of each article, and to perform cross-modal retrieval. To better understand political bias, we use generative modeling to show how the face of the same politician appears differently at each end of the political spectrum. To understand how image and text contribute to persuasion and bias, we learn to retrieve sentences for a given image, and vice versa. The task is challenging because, unlike image-text pairs in captioning, the images and text in political articles overlap only in a very abstract sense. To better model the visual domain, we leverage the semantic domain. Specifically, when performing retrieval, we impose a loss requiring that images corresponding to similar text live close by in a projection space, even if they appear very diverse purely visually. We show that our loss significantly improves performance in conjunction with a variety of existing recent losses.
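Below is a minimal, hedged sketch of the idea of such a text-guided loss on the image side, assuming paired, L2-normalized image and sentence embeddings. The exact formulation used in this work is not given in the abstract; the function name and similarity threshold here are illustrative assumptions only.

```python
# Hedged sketch: a loss term that pulls together image embeddings whose
# associated sentences are semantically similar, even when the images
# themselves look very different. One plausible instantiation, not the
# speaker's exact loss.
import torch
import torch.nn.functional as F

def text_guided_image_loss(img_emb, txt_emb, sim_threshold=0.5):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of paired images/sentences."""
    txt_sim = txt_emb @ txt_emb.t()                   # (N, N) text-text similarity
    # Weight image-image distances by how similar the corresponding texts are.
    weights = torch.clamp(txt_sim - sim_threshold, min=0.0)
    weights.fill_diagonal_(0.0)
    img_dist = torch.cdist(img_emb, img_emb) ** 2     # squared Euclidean distances
    return (weights * img_dist).sum() / (weights.sum() + 1e-8)

# Toy usage with random, normalized embeddings.
torch.manual_seed(0)
img_emb = F.normalize(torch.randn(8, 256), dim=-1)
txt_emb = F.normalize(torch.randn(8, 256), dim=-1)
print(text_guided_image_loss(img_emb, txt_emb))
```

In this sketch the penalty is active only for image pairs whose texts are more similar than the threshold, so visually diverse images with related sentences are drawn together in the projection space while unrelated pairs are left untouched.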

Bio:  Adriana Kovashka is an Assistant Professor in Computer Science at the University of Pittsburgh. Her research interests are in computer vision and machine learning. She has authored seventeen publications in top-tier computer vision and artificial intelligence conferences and journals (CVPR, ICCV, ECCV, NeurIPS, TPAMI, IJCV, AAAI, ACL) and ten second-tier conference publications (BMVC, ACCV, WACV). She has served as an Area Chair for CVPR in 2018-2020. She has been on program committees for over twenty conferences or journals. She has co-organized seven workshops at top-tier conferences. Her research is funded by the National Science Foundation, Google, Amazon and Adobe.

Homepage:  http://people.cs.pitt.edu/~kovashka/