Learning with Diverse Forms of Imperfect and Indirect Supervision
Abstract
Powerful Machine Learning (ML) models trained on large, annotated datasets have driven impressive advances in fields including natural language processing and computer vision. In turn, such developments have led to impactful applications of ML in areas such as healthcare, e-commerce, and predictive maintenance. However, obtaining annotated datasets at the scale required for training high-capacity ML models is frequently a bottleneck for promising applications of ML. In this thesis, I study alternative pathways for acquiring domain knowledge and develop methodologies to enable learning from weak supervision, i.e., imperfect and indirect forms of supervision. I cover three forms of weak supervision: pairwise linkage feedback, programmatic weak supervision, and paired multi-modal data. These forms of information are often easy to obtain at scale, and the methods I develop reduce--and in some cases eliminate--the need for pointillistic ground truth annotations.
I begin by studying the utility of pairwise supervision. I introduce a new constrained clustering method that uses a small number of pairwise constraints to simultaneously learn a kernel and cluster data. The method outperforms related approaches on a large and diverse group of publicly available datasets. Next, I introduce imperfect pairwise supervision to programmatic weak supervision label models. I show empirically that just one source of weak pairwise feedback can lead to significantly improved downstream performance.
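To make the notion of pairwise supervision concrete, the following is a minimal illustrative sketch (not the thesis's algorithm): must-link pairs should share a cluster, cannot-link pairs should not, and a candidate clustering can be scored by the fraction of such constraints it satisfies. All names here are hypothetical.

```python
# Illustrative sketch: scoring a candidate clustering against
# must-link / cannot-link pairwise constraints.

def constraint_satisfaction(assignment, must_link, cannot_link):
    """Fraction of pairwise constraints a cluster assignment satisfies.

    assignment : dict mapping item -> cluster id
    must_link  : list of (i, j) pairs that should share a cluster
    cannot_link: list of (i, j) pairs that should be separated
    """
    satisfied = sum(assignment[i] == assignment[j] for i, j in must_link)
    satisfied += sum(assignment[i] != assignment[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return satisfied / total if total else 1.0

# Toy example: four items split into two clusters.
assignment = {0: 0, 1: 0, 2: 1, 3: 1}
score = constraint_satisfaction(assignment, [(0, 1), (2, 3)], [(0, 2)])
# → 1.0 (all three constraints hold)
```

A constrained clustering method would use such feedback during optimization rather than only for evaluation; the point of the sketch is merely that a handful of pairs already encodes non-trivial structure.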
I then further the study of programmatic data labeling methods by introducing approaches that model the distribution of inputs in concert with weak labels. I first introduce a framework for joint learning of a label and end model on the basis of observed weak labels, showing improvements over prior work in terms of end model performance on downstream test sets. Next, I introduce a method that fuses generative adversarial networks and programmatic weak supervision label models to the benefit of both, measured by label model performance and data generation quality.
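For readers unfamiliar with programmatic weak supervision, the following sketch shows the basic setup: labeling functions (heuristic rules) vote on each example, and a label model aggregates the votes into training labels. This sketch uses the simplest possible aggregator, majority vote; real label models instead estimate the accuracies and correlations of the sources. Rule names and the abstain convention (-1) are illustrative.

```python
import numpy as np

# Hypothetical labeling functions for a toy sentiment-like task.
# Convention (assumed): -1 = abstain, 0/1 = class votes.
def lf_contains_refund(text):
    return 1 if "refund" in text else -1

def lf_contains_thanks(text):
    return 0 if "thanks" in text else -1

def lf_exclamation(text):
    return 1 if "!" in text else -1

def majority_vote(votes):
    """Aggregate labeling-function votes; -1 if all abstain or the vote ties."""
    votes = [v for v in votes if v != -1]
    if not votes:
        return -1
    counts = np.bincount(votes)
    if len(counts) > 1 and counts[0] == counts[1]:
        return -1
    return int(np.argmax(counts))

texts = ["I want a refund!", "thanks for the help", "no opinion here"]
lfs = [lf_contains_refund, lf_contains_thanks, lf_exclamation]
labels = [majority_vote([lf(t) for lf in lfs]) for t in texts]
# → [1, 0, -1]
```

The weak labels produced this way are noisy and incomplete, which is precisely why modeling the label sources jointly with the input distribution and the end model, as described above, can help.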
In the last part of this thesis, I tackle a central challenge in programmatic weak supervision: the need for experts to provide labeling rules. First, I introduce an interactive learning framework that aids users in discovering weak supervision sources, capturing subject matter experts' knowledge of the application domain in an efficient fashion. I then study the possibility of dispensing with labeling functions altogether by learning from unstructured natural language descriptions directly. In particular, I study how biomedical text paired with images can be exploited for self-supervised vision-language processing, yielding data-efficient representations and enabling zero-shot classification, without requiring experts to define rules on the text or images.
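The zero-shot classification mentioned above can be sketched at a high level: once images and text share an embedding space, an image is classified by comparing its embedding to embeddings of textual class prompts via cosine similarity. The encoders below are stand-ins (fixed toy vectors), not a trained model, and all names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def zero_shot_classify(image_emb, prompt_embs, class_names):
    """Pick the class whose text-prompt embedding is closest to the image."""
    sims = [cosine(image_emb, p) for p in prompt_embs]
    return class_names[int(np.argmax(sims))]

# Toy vectors standing in for the outputs of trained image/text encoders.
prompts = {"normal chest X-ray": np.array([1.0, 0.1, 0.0]),
           "pneumonia": np.array([0.1, 1.0, 0.2])}
image_emb = np.array([0.2, 0.9, 0.1])  # hypothetical image embedding
pred = zero_shot_classify(image_emb, list(prompts.values()), list(prompts))
# → "pneumonia"
```

No labeled images are needed at classification time: the class "labels" are just text, which is what makes paired image-text data such an attractive supervision source.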
Together, these works provide novel methodologies and frameworks to encode and use expert domain knowledge more efficiently in ML models, reducing the bottleneck created by the need for manual ground truth annotations.
BibTeX
@phdthesis{Boecking-2023-135132,
author = {Benedikt Boecking},
title = {Learning with Diverse Forms of Imperfect and Indirect Supervision},
year = {2023},
month = {February},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-23-03},
keywords = {weak supervision, multi-modal, self-supervised learning, data programming},
}