
Enhancing Reinforcement Learning with Error-Prone Language Models
Correctly specifying reward functions is a well-known challenge in reinforcement learning. Hand-crafted reward functions, which are usually sparse, often lead to inefficient or suboptimal policies, misalignment with user values, or difficulty attributing credit or blame in multi-agent systems. Reinforcement learning from human feedback can mitigate these issues by producing dense reward functions, but collecting human feedback is laborious. Recent works instead solicit feedback from pre-trained large language models (LLMs) to reduce or eliminate human effort; however, these approaches perform poorly in the presence of hallucinations and other errors.
To address these challenges, this thesis proposes a simple yet effective method for soliciting and applying noisy LLM feedback via a potential-based reward shaping function in both single-agent and multi-agent settings. We show theoretically that inconsistent rankings, which approximate ranking errors, lead to uninformative rewards under our approach. The method thus mitigates ranking errors while allowing the LLM to evaluate each agent's individual contribution to the task and provide feedback through the reward function at every timestep. Empirically, our method improves convergence speed and policy returns over commonly used baselines across single-agent and multi-agent benchmarks, even in the presence of significant ranking errors.
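
As a rough illustration of the shaping scheme described above, the sketch below shows how a ranking obtained from an LLM might be turned into a potential function and applied as a potential-based shaping term. The helper names (potential_from_ranking, shaped_reward, rank_fn) and the rank-normalization step are hypothetical placeholders, not the thesis's exact formulation.

    # Minimal sketch: potential-based reward shaping from (possibly noisy)
    # LLM rankings. How the potential is derived from ranks is an assumption
    # made for illustration only.
    from typing import Callable, Hashable, Sequence

    GAMMA = 0.99  # discount factor (assumed)

    def potential_from_ranking(states: Sequence[Hashable],
                               rank_fn: Callable[[Sequence[Hashable]], Sequence[int]]
                               ) -> dict:
        """Map each state to a scalar potential derived from an LLM ranking.

        rank_fn stands in for a prompt to a pre-trained LLM that returns a
        ranking over the given states (0 = worst, len(states)-1 = best).
        """
        ranks = rank_fn(states)
        n = max(len(states) - 1, 1)
        # Normalize ranks to [0, 1] so noisy or inconsistent rankings translate
        # into small potential differences rather than arbitrarily large rewards.
        return {s: r / n for s, r in zip(states, ranks)}

    def shaped_reward(env_reward: float, phi: dict, s, s_next) -> float:
        """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
        return env_reward + GAMMA * phi.get(s_next, 0.0) - phi.get(s, 0.0)

Because the shaping term telescopes along any trajectory, potential-based shaping of this form preserves the optimal policy of the underlying task (Ng, Harada, and Russell, 1999); a noisy potential therefore affects how quickly a policy is learned rather than which policy is optimal.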
Committee:
Katia Sycara (advisor)
Zackory Erickson
Renos Zabounidis