Alignment for Vision-Language Foundation Models

Master's Thesis, Tech. Report CMU-RI-TR-23-83, December 2023

Abstract

Recent advances in vision-language foundation models, exemplified by GPT-4 Vision and DALL-E 3, have significantly transformed both research and practical applications, from professional assistance to content creation. These models excel with minimal downstream data and limited human input, relying primarily on prompt-based interaction. However, aligning them precisely with specific user goals remains a notable challenge. This thesis introduces strategies for improving this alignment. It begins with a novel cross-modal adaptation framework that uses textual data to tailor foundation models such as CLIP more effectively to downstream tasks like visual recognition. It then explores a ChatGPT-based approach for aligning popular proprietary models, such as DALL-E 3, to better meet user needs. Lastly, it addresses challenges in visio-linguistic reasoning, discussing efforts to assess and improve model fidelity on complex tasks that demand advanced compositional reasoning.
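To make the first idea concrete, below is a minimal sketch of cross-modal adaptation in CLIP's shared image-text embedding space: class-name text embeddings act as free training samples alongside a handful of labeled image embeddings, and a single linear classifier is fit over both modalities. The class names, prompt template, and one-image support set are illustrative placeholders, not the setup used in the thesis.

# Minimal sketch of cross-modal adaptation (assumed setup, not the thesis code):
# train one linear classifier on CLIP's shared embedding space, treating
# class-name text embeddings as extra training samples alongside few-shot
# image embeddings.
import numpy as np
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "bird"]  # placeholder label set
# Placeholder few-shot support set: (PIL image, label index) pairs.
few_shot_images = [(Image.new("RGB", (224, 224), "gray"), 0)]

# Text branch: one normalized embedding per class, acting as a free sample.
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Image branch: normalized embeddings of the few-shot support images.
image_feats, image_labels = [], []
with torch.no_grad():
    for img, label in few_shot_images:
        feat = model.encode_image(preprocess(img).unsqueeze(0).to(device))
        image_feats.append((feat / feat.norm(dim=-1, keepdim=True)).cpu())
        image_labels.append(label)

# Cross-modal training set: both modalities live in the same feature space,
# so their embeddings can be stacked and fed to one classifier.
X = torch.cat([text_feats.cpu()] + image_feats).numpy().astype(np.float32)
y = np.array(list(range(len(class_names))) + image_labels)
classifier = LogisticRegression(max_iter=1000).fit(X, y)

def predict(img):
    """Classify a new PIL image with the cross-modally trained classifier."""
    with torch.no_grad():
        feat = model.encode_image(preprocess(img).unsqueeze(0).to(device))
        feat = feat / feat.norm(dim=-1, keepdim=True)
    return class_names[int(classifier.predict(feat.cpu().numpy().astype(np.float32))[0])]

Because text and image embeddings share one space in CLIP, the same linear head can classify either modality; the text samples effectively regularize the few-shot image classifier at zero labeling cost.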

BibTeX

@mastersthesis{Lin-2023-139172,
author = {Zhiqiu Lin},
title = {Alignment for Vision-Language Foundation Models},
year = {2023},
month = {December},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-23-83},
keywords = {Vision-language Models, Computer Vision, Machine Learning},
}