Removing the i’s from i.i.d : Testing generalization on hard datasets
Abstract
The last few years have seen the widespread success of over-parameterized deep learning models across applications with massive datasets. However, these models are often critiqued for assuming access to perfect data, that is, a large amount of clean, i.i.d.-sampled data. In real-world scenarios, neither of these assumptions is entirely true. We consider four diverse domains as examples of such scenarios, namely, point cloud completion (with distribution shift), visual dialog (dataset size/bias issues), meta-RL for control (noisy, high-variance, and sparse training signals), and poaching prediction (an unstructured dataset with skew, noise, and distribution shift). Using these datasets, we show that data and priors complement each other in machine learning models, and that it is important to consider them jointly on a task-by-task basis for better generalization.
BibTeX
@mastersthesis{Gurumurthy-2019-118729,
  author   = {Swaminathan Gurumurthy},
  title    = {Removing the i’s from i.i.d : Testing generalization on hard datasets},
  year     = {2019},
  month    = {December},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-19-81},
  keywords = {Machine learning, out-of-distribution, distribution shift, point cloud, dialog, vision, meta reinforcement learning},
}