Behavioral Dataset Compression for Efficient Reinforcement Learning

Lei, Shiye

Access status:

USyd Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Lei, Shiye

Abstract

Offline reinforcement learning (offline RL) provides a principled framework for learning decision-making policies from fixed datasets without environment interaction, enabling applications in safety-critical, privacy-sensitive, and resource-constrained settings. However, modern offline RL systems often rely on large-scale datasets collected from suboptimal policies, leading to substantial computational overhead and limited scalability. Improving data efficiency is therefore critical for making offline RL practically viable. In this thesis, we develop dataset compression algorithms for offline RL that explicitly account for the intrinsic properties of offline behavioral data. From the action perspective, we establish a theoretical equivalence between the policy performance gap and an action-value-weighted decision discrepancy. This insight motivates an action-value-weighted objective for offline behavior distillation (OBD), which distills large offline RL datasets into compact synthetic training sets. From the state perspective, we identify state diversity as a key factor governing the effectiveness of offline behavior distillation. We show that insufficient state coverage in the original dataset limits policy performance after compression. To address this issue, we propose state-weighted OBD, which explicitly incorporates state diversity into the distillation objective and significantly improves robustness to dataset compression. Finally, by jointly considering action-value information, state density, and trajectory-level sequential structure, we propose stepwise dual ranking (SDR), a simple and scalable coreset selection algorithm that constructs compact yet informative subsets from large offline behavioral datasets without additional training overhead.

Date

2026

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Computer Science

Awarding institution

The University of Sydney

Subjects

Data Compression
Dataset Distillation
Offline Reinforcement Learning