Learning with Diverse Forms of Imperfect and Indirect Supervision - Robotics Institute Carnegie Mellon University
PhD Thesis Proposal

Benedikt Boecking
Robotics Institute, Carnegie Mellon University
Wednesday, December 1
4:00 pm to 5:30 pm
Learning with Diverse Forms of Imperfect and Indirect Supervision

Abstract:
High-capacity machine learning (ML) models trained on large, annotated datasets have driven impressive advances in several fields, including natural language processing and computer vision, in turn leading to impactful applications of ML in areas such as healthcare, e-commerce, and predictive maintenance. However, obtaining annotated datasets at the scale required for training such models is costly and often becomes a bottleneck for promising applications of ML. In this thesis, I study imperfect and indirect forms of supervision (weak supervision), such as partial rules and pairwise constraints, as a mechanism to encode domain knowledge. These forms of supervision are frequently easy to obtain at scale and can enable learning without pointillistic ground truth annotations.

I begin by studying the utility of small amounts of pairwise supervision for clustering, using known group-membership constraints to learn a kernel that improves constrained clustering performance. Next, I propose methodology that uses imperfect pairwise labels to augment learning in programmatic data labeling methods, which traditionally learn only from Labeling Functions (LFs), i.e., user-defined functions that directly but imperfectly label subsets of data. The label models in these methods aggregate sources of imperfect supervision to estimate the latent ground truth and act as teachers to end models, thereby playing an essential role in achieving generalization. Preliminary results show promising performance improvements.
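To make the idea of Labeling Functions concrete, the following is a minimal illustrative sketch, not the methodology proposed in the thesis. It assumes a toy binary text task, three hypothetical LFs, a hypothetical ABSTAIN convention of -1, and a simple majority vote standing in for a learned label model.

```python
from collections import Counter

# Hypothetical convention: an LF returns ABSTAIN when it does not apply.
ABSTAIN = -1


def lf_contains_refund(text):
    # Hypothetical rule: mentions of "refund" suggest class 1.
    return 1 if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    # Hypothetical rule: expressions of thanks suggest class 0.
    return 0 if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    # Hypothetical rule: two or more exclamation marks suggest class 1.
    return 1 if text.count("!") >= 2 else ABSTAIN


LFS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]


def apply_lfs(texts, lfs):
    # Build the label matrix: one row per sample, one column per LF.
    return [[lf(t) for lf in lfs] for t in texts]


def majority_vote(votes):
    # Baseline aggregation (an assumption, not a learned label model):
    # ignore abstentions, abstain when no LF fires or when the top vote is tied.
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    top = counted.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]


texts = [
    "I want a refund now!!",
    "Thank you, great service",
    "Where is my order?",
    "Thanks, but I still want a refund",
]
label_matrix = apply_lfs(texts, LFS)
estimated = [majority_vote(row) for row in label_matrix]
print(label_matrix)
print(estimated)
```

Each LF labels only the subset of samples it applies to, so some samples receive conflicting votes or none at all. A learned label model would replace the majority vote with an estimate that accounts for the differing accuracies of the LFs.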

I further the study of programmatic data labeling methods by introducing integrated, end-to-end learning frameworks and novel label models. I first introduce a framework for joint learning of a label model and an end model from LFs, showing improved end model performance on downstream test sets compared to prior work. I then propose new methodology based on discrete latent variable modeling in generative adversarial networks to improve estimates of the unobserved ground truth by uncovering disentangled, discrete structures in the features.

Finally, I study two extremes on the spectrum of domain knowledge acquisition in weak supervision: user interactivity for discovering useful sources of imperfect labels, and learning merely from data paired with unstructured natural language descriptions. I first introduce an interactive learning framework that aids users in discovering weak supervision sources, in order to capture subject matter experts' knowledge of the application domain systematically, proactively, and efficiently. I then propose to study how unstructured natural language descriptions (such as doctors' notes) paired with images can be exploited for image representation learning and zero-shot classification, without requiring experts to define rules on the text or images as in prior related work.

Together, these works provide novel methodologies and frameworks to more efficiently encode expert domain knowledge in ML models, reducing the bottleneck created by the need for pointillistic ground truth annotations.

Thesis Committee Members:
Artur Dubrawski, Chair
Barnabas Poczos
Jeff Schneider
Hoifung Poon, Microsoft Research