Abstract:
For intelligent agents (e.g. robots) to be seamlessly integrated into human society, humans must be able to understand their decision making. For example, the decision making of autonomous cars must be clear to the engineers certifying their safety, to the passengers riding in them, and to nearby drivers sharing the road. Because an agent’s decision making depends largely on its reward function, we focus on teaching agent reward functions to humans. Through reasoning that resembles inverse reinforcement learning (IRL), humans naturally infer the reward functions that underlie demonstrations of decision making. Agents can therefore teach their reward functions through demonstrations that are informative for IRL. However, we critically note that IRL does not consider how difficult each demonstration is for a human to learn from. This thesis thus proposes to augment teaching for IRL with principles from the education literature, providing demonstrations that fall within a human’s zone of proximal development (ZPD), or their “Goldilocks” zone, i.e. demonstrations that are neither too easy nor too difficult given their current beliefs. This thesis provides contributions in the following three areas.
We first consider the problem of teaching reward functions through select demonstrations. Drawing on the ZPD, we use scaffolding to convey demonstrations that gradually increase in information gain and difficulty, easing the human into learning. Importantly, we argue that a demonstration’s information gain is not intrinsic to the demonstration itself but must be conditioned on the human’s current beliefs. An informative demonstration is accordingly one that meaningfully differs from the human’s expectations (i.e. counterfactuals) of what the agent will do, given their current understanding of the agent’s decision making.
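As a rough illustration of belief-conditioned information gain, the following minimal sketch (not the thesis’s actual implementation) assumes rewards that are linear in trajectory features and represents the human’s current beliefs as a set of candidate weight vectors; the names human_beliefs and info_gain are hypothetical. A demonstration’s gain is approximated as the fraction of belief particles under which the human’s expected (counterfactual) trajectory would outscore the demonstrated one, and scaffolding orders demonstrations by increasing gain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: reward is linear in trajectory features, R(traj) = w . phi(traj).
# The human's current beliefs are represented as candidate weight vectors ("particles").
human_beliefs = rng.normal(size=(500, 3))
human_beliefs /= np.linalg.norm(human_beliefs, axis=1, keepdims=True)

def info_gain(demo_features, counterfactual_features, beliefs):
    """Fraction of belief particles ruled out by a demonstration.

    A particle w is ruled out if, under w, the counterfactual trajectory the human
    might expect would score higher than the demonstrated trajectory.
    """
    margin = beliefs @ (demo_features - counterfactual_features)
    return np.mean(margin < 0)

# Candidate demonstrations, each summarized by (demonstrated features, counterfactual features).
candidates = [
    (np.array([1.0, 0.2, 0.0]), np.array([0.8, 0.6, 0.0])),    # small difference: easier
    (np.array([1.0, -0.5, 0.3]), np.array([0.2, 0.9, -0.4])),  # large difference: harder
]

# Scaffolding: order demonstrations by increasing belief-conditioned information gain,
# so the human first sees demonstrations that deviate less from their expectations.
curriculum = sorted(candidates, key=lambda d: info_gain(d[0], d[1], human_beliefs))
for demo, counterfactual in curriculum:
    print(info_gain(demo, counterfactual, human_beliefs))
```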
Second, we consider the problem of testing how much the human has learned from the demonstrations, by asking them to predict the agent’s actions in new environments. We demonstrate two ways of measuring how difficult a test is for a human. The first is a gross measure that correlates a test’s difficulty with its answer’s information gain in revealing the agent’s reward function. The second is a more tailored measure that conditions a test’s difficulty on the human’s current beliefs about the reward function, estimating difficulty in terms of the proportion of the human’s beliefs that would yield the correct answer.
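Under the same assumed linear-reward setup as above, the tailored measure can be sketched as follows: each belief particle implies an answer (the test option it scores highest), and difficulty is taken here as one minus the proportion of particles implying the correct answer. The function and variable names are illustrative, not drawn from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical human belief particles over linear reward weights.
human_beliefs = rng.normal(size=(500, 3))
human_beliefs /= np.linalg.norm(human_beliefs, axis=1, keepdims=True)

def test_difficulty(option_features, correct_idx, beliefs):
    """Tailored test difficulty: the share of belief particles implying a wrong answer.

    option_features: (num_options, num_features) feature counts of each candidate
                     answer (trajectory) in the test environment.
    correct_idx:     index of the agent's true optimal trajectory.
    """
    scores = beliefs @ option_features.T   # (num_particles, num_options)
    predicted = scores.argmax(axis=1)      # the answer each particle implies
    return np.mean(predicted != correct_idx)

options = np.array([[1.0, 0.0, 0.2],   # correct trajectory under the agent's true reward
                    [0.9, 0.3, 0.1],
                    [0.2, 1.0, 0.0]])
print(test_difficulty(options, correct_idx=0, beliefs=human_beliefs))
```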
Finally, we introduce a closed-loop teaching framework that brings together teaching and testing. While informative teaching demonstrations may be selected a priori, the student’s learning may deviate from the preselected curriculum in situ. Our teaching framework therefore provides intermittent tests and feedback between groups of related demonstrations to support tailored instruction in two ways. First, we maintain a novel particle filter model of the human’s beliefs and provide demonstrations targeted to their current understanding. Second, we leverage tests not only as a tool for assessment but also for teaching, in line with the testing effect from the education literature.
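A minimal sketch of such a particle filter update follows, assuming a Boltzmann-rational (softmax) model of how a belief particle maps to a test answer; this noise model and all names are assumptions made for illustration rather than the thesis’s exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical particle filter over the human's beliefs about the reward weights.
particles = rng.normal(size=(500, 3))
particles /= np.linalg.norm(particles, axis=1, keepdims=True)
weights = np.full(len(particles), 1.0 / len(particles))

def update_beliefs(particles, weights, option_features, human_answer_idx, beta=5.0):
    """Reweight and resample particles after observing the human's test answer.

    Each particle's likelihood of producing the observed answer is modeled with a
    softmax (Boltzmann-rational) choice over the test's options; this is an
    assumption of the sketch, not necessarily the thesis's observation model.
    """
    scores = beta * (particles @ option_features.T)   # (num_particles, num_options)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    weights = weights * probs[:, human_answer_idx]
    weights /= weights.sum()

    # Systematic resampling keeps the particle set focused on plausible beliefs.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

options = np.array([[1.0, 0.0, 0.2],
                    [0.9, 0.3, 0.1],
                    [0.2, 1.0, 0.0]])
particles, weights = update_beliefs(particles, weights, options, human_answer_idx=1)
```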
Through various user studies, we find that demonstrations targeted to a human’s ZPD improve learning outcomes (e.g. the human’s ability to predict the agent’s actions in new environments). However, we find that learning gains can come with increased mental effort as the human updates their beliefs, again highlighting the importance of selecting demonstrations that differ just enough from the human’s expectations to be informative but not so much as to be too difficult to understand. We also see that such informative demonstrations often illuminate trade-offs inherent in an agent’s reward function that may be subtle and difficult to predict a priori, such as how far an agent is willing to detour around potentially dangerous terrain like mud. Finally, our later user studies reveal interesting interaction effects between the gridworld domains and our results, and we provide further insights on how domains may be characterized, given the observation that the best teaching method is likely domain-dependent.
Thesis Committee Members:
Reid Simmons, Co-chair
Henny Admoni, Co-chair
David Held
Scott Niekum, UMass Amherst