
PhD Thesis Defense

Max Korein, Robotics Institute, Carnegie Mellon University
Friday, July 23
9:15 am to 10:15 am
Planning to Optimize and Learn Reward in Navigation Tasks in Structured Environments with Time Constraints

Abstract:

Planning problems in which an agent must perform tasks for reward by navigating its environment while constrained by time and location have a wide variety of applications in robotics. Many real-world environments in which such planning problems arise, such as office buildings or city streets, are highly structured: they consist of passages, such as streets or highways, with notable locations along them, and those locations may in turn be grouped into clusters, such as the floors of a building.
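For concreteness, such a structured environment can be encoded as a set of locations annotated with the passage and cluster each belongs to, plus travel times between connected locations. The Python sketch below is one minimal way to do this; it is illustrative only and not the representation used in the thesis.

from dataclasses import dataclass, field

@dataclass
class Location:
    name: str
    passage: str          # the passage it lies on, e.g., a street or hallway
    cluster: str          # the cluster it belongs to, e.g., a building floor
    reward: float = 0.0   # reward for visiting this location

@dataclass
class Environment:
    locations: dict = field(default_factory=dict)    # name -> Location
    travel_time: dict = field(default_factory=dict)  # (name, name) -> minutes

    def add(self, loc):
        self.locations[loc.name] = loc

    def connect(self, a, b, minutes):
        # Passages are traversable in both directions.
        self.travel_time[(a, b)] = minutes
        self.travel_time[(b, a)] = minutes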

In this thesis, we introduce algorithms designed for an agent performing tasks for reward while constrained by time and location in a structured environment. Our goal is specifically to exploit the structure of the environment when planning in order to improve the reward received. We present three algorithms that do so in different ways. First, we present the Task Graph algorithm, which groups the actions the agent can perform into larger “tasks” and builds a plan from a sequence of tasks rather than from individual actions. We test the algorithm on a problem we call “Exploration Scheduling,” in which an agent gathers information during the free time between mandatory tasks in order to improve the reward received from those tasks, and find that it yields strong results.
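To illustrate the flavor of planning over tasks rather than actions, the following sketch greedily packs exploration tasks into the free windows between mandatory tasks, highest value per minute first. It is a deliberately simplified, hypothetical stand-in, not the Task Graph algorithm itself.

def schedule_exploration(gaps, tasks):
    """Greedily assign exploration tasks to free windows between
    mandatory tasks. A simplified stand-in that keeps the key idea of
    planning over whole tasks rather than individual actions.

    gaps:  list of free-window lengths, in minutes
    tasks: list of (duration_minutes, info_value) exploration tasks
    """
    remaining = list(gaps)
    plan = []
    for duration, value in sorted(tasks, key=lambda t: t[1] / t[0], reverse=True):
        for i, free in enumerate(remaining):
            if duration <= free:
                plan.append((i, (duration, value)))
                remaining[i] -= duration
                break
    return plan  # list of (gap_index, task) assignments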

Second, we present the RegionPlan algorithm, which divides the environment into regions based on its structure and creates a separate candidate plan for each region to choose from. We test RegionPlan on the orienteering problem in simulated structured environments and demonstrate environment structures and reward distributions for which this approach allows the agent to avoid getting stuck in certain locally optimal solutions.
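The core control flow of a region-based approach can be sketched in a few lines: plan within each region independently and keep the best result. Here solve_region is a hypothetical placeholder for any single-region orienteering solver; the sketch omits how the thesis’s RegionPlan actually divides the environment and compares plans.

def region_plan(regions, budget, solve_region):
    """Plan within each region separately, then keep the best plan.

    regions:      iterable of region descriptions (e.g., the locations
                  on one floor of a building)
    solve_region: hypothetical callback that solves the orienteering
                  problem inside one region, returning (plan, reward)
    """
    best_plan, best_reward = None, float("-inf")
    for region in regions:
        plan, reward = solve_region(region, budget)
        if reward > best_reward:
            best_plan, best_reward = plan, reward
    return best_plan, best_reward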

Third, we present Route-Based Multi-Level Variable Neighborhood Search (Route-MLVNS), a variable neighborhood search algorithm that represents solutions to the problem as routes the agent takes through the environment rather than as sequences of actions. In this way, when evaluating the reward of visiting a location, Route-MLVNS accounts for the locations the agent passes along the way rather than considering only the reward of the single destination. We test Route-MLVNS on simulated orienteering problems in structured environments, comparing it to an MLVNS algorithm from the literature that treats solutions as ordinary task plans, and find that Route-MLVNS yields superior results. We also demonstrate a capability unique to Route-MLVNS: it can solve the orienteering problem for a continuous range of time constraints simultaneously, rather than for only a single time constraint or a discrete set of them.
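The key difference in evaluating a route rather than a task sequence is that every rewarded location the agent passes counts, not only the chosen waypoints. A minimal sketch of such an evaluation follows, with path_between standing in as a hypothetical shortest-path helper; the thesis’s actual route evaluation may differ.

def route_reward(route, rewards, path_between):
    """Reward collected along an entire route, counting every rewarded
    location the agent passes, not only the chosen waypoints.

    route:        ordered list of waypoint names
    rewards:      dict mapping location name -> reward
    path_between: hypothetical helper returning the full list of
                  locations traversed between two waypoints
    """
    visited = set(route)
    for a, b in zip(route, route[1:]):
        visited.update(path_between(a, b))
    return sum(rewards.get(name, 0.0) for name in visited)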

Finally, we consider scenarios in which an agent solves repeated instances of the orienteering problem in a structured environment whose stochastic rewards are initially unknown. We present algorithms that treat locations in the environment as arms of a multi-armed bandit and assign each a value based on both its expected reward and the potential value of the knowledge gained by visiting it, then use planning algorithms based on the environment structure to find a plan that maximizes the total assigned value. We perform tests demonstrating the algorithms’ effectiveness both at receiving reward over time and at learning an effective model. Additionally, we analyze the relationship between the structure of the environment and the distribution of the data the agent receives.
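As a concrete example of bandit-style valuation, the classic UCB1 rule adds to a location’s estimated reward a bonus that shrinks as the location is visited more often. The sketch below uses that standard rule purely for illustration; the thesis’s exact value assignment may differ.

import math

def location_value(mean_reward, visits, total_visits, c=1.0):
    """Value of a location: its estimated reward plus a UCB1-style
    exploration bonus that grows when it has rarely been sampled.
    """
    if visits == 0:
        return float("inf")  # an unvisited location is always worth one try
    bonus = c * math.sqrt(2.0 * math.log(total_visits) / visits)
    return mean_reward + bonus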

Overall, our work demonstrates the benefit an agent seeking to maximize the reward it receives in a structured environment gains by considering and exploiting that structure, and shows how that structure affects the data the agent can gather in environments with unknown stochastic rewards.


Thesis Committee Members:
Manuela Veloso, Chair
Reid Simmons
Illah Nourbakhsh
Peter Stone, University of Texas at Austin