Abstract:
Text in images carries essential information for multimodal reasoning tasks such as visual question answering (VQA) and image captioning. To enable machines to perceive and understand scene text and reason about it jointly with other modalities, 1) we collect the TextCaps dataset, which requires models to read and reason over both the text and the visual content in an image to generate image descriptions, and 2) we propose M4C, a model based on a multimodal transformer architecture with iterative answer prediction and rich feature representations for OCR tokens, which incorporates words it reads from the image into its outputs. M4C generates captions incorporating scene text on our TextCaps dataset and outperforms previous work on three VQA datasets that require reading text in images.
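For readers curious what the described architecture might look like in code, below is a minimal, hypothetical PyTorch sketch of one decoding step of an M4C-style pointer-augmented multimodal transformer. This is not the authors' implementation: all names (PointerAugmentedDecoder, VOCAB_SIZE, HIDDEN_DIM, etc.), sizes, and layer choices are illustrative assumptions. The idea it demonstrates is the one in the abstract: a single transformer attends jointly over object features, OCR-token features, and previously decoded words, then scores the next word over a fixed vocabulary plus the image's OCR tokens via a dynamic pointer.

```python
# Minimal, hypothetical sketch of M4C-style iterative decoding with a
# pointer mechanism over OCR tokens (not the authors' implementation;
# names, sizes, and layer choices are illustrative assumptions).
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN_DIM = 5000, 768

class PointerAugmentedDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One joint transformer over object, OCR, and previous-output embeddings.
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN_DIM, nhead=12,
                                           batch_first=True)
        self.mmt = nn.TransformerEncoder(layer, num_layers=4)
        self.vocab_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)  # fixed-vocabulary scores
        self.ptr_query = nn.Linear(HIDDEN_DIM, HIDDEN_DIM)   # dynamic pointer query

    def step(self, obj_feats, ocr_feats, dec_feats):
        """One decoding step: score the next word over vocab + OCR tokens.

        obj_feats: (B, n_obj, H) visual object features
        ocr_feats: (B, n_ocr, H) rich OCR-token features
        dec_feats: (B, t, H)     embeddings of previously predicted words
        (Causal masking over dec_feats is omitted here for brevity.)
        """
        seq = self.mmt(torch.cat([obj_feats, ocr_feats, dec_feats], dim=1))
        n_obj, n_ocr = obj_feats.size(1), ocr_feats.size(1)
        ocr_out = seq[:, n_obj:n_obj + n_ocr]      # transformed OCR positions
        dec_out = seq[:, -1]                       # state at the latest output position
        vocab_scores = self.vocab_head(dec_out)    # (B, VOCAB_SIZE)
        # Pointer: dot product between the decoding state and each OCR output,
        # letting the model copy a scene-text word directly into its output.
        ptr_scores = torch.bmm(
            ocr_out, self.ptr_query(dec_out).unsqueeze(-1)).squeeze(-1)
        return torch.cat([vocab_scores, ptr_scores], dim=-1)  # (B, VOCAB_SIZE + n_ocr)
```

In this sketch, decoding would be iterative: at each step, the argmax over the joint scores either selects a fixed-vocabulary word or copies an OCR token, and the chosen word's embedding is appended to dec_feats for the next step.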
Bio:
Ronghang Hu is a research scientist at Facebook AI Research (FAIR). His research interests include vision-and-language reasoning and visual perception. He obtained his Ph.D. degree in computer science from the University of California, Berkeley in 2020, advised by Prof. Trevor Darrell and Prof. Kate Saenko. In the summers of 2019 and 2017, he was a research intern at FAIR, working with Dr. Marcus Rohrbach and Dr. Ross Girshick, respectively. He obtained his B.Eng. degree from Tsinghua University in 2015.
Homepage: https://ronghanghu.com/
Sponsored in part by: Facebook Reality Labs Pittsburgh