The truth is that RL training is somewhat different from general deep learning training (or general ML training). The key difference is that we optimize directly against the reward measure rather than a proxy loss function when judging performance, so there are fewer guarantees behind the typical early-stopping criteria. We also aren't dealing with the usual bias/variance trade-off between training and testing datasets: in RL we typically care about task performance and, so to speak, train on the test data.
What is typically done instead is to decay the learning rate on a fixed schedule so that it eventually approaches 0.
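As a minimal sketch of such a fixed schedule, here is a linear anneal from a base learning rate toward 0 over a set number of updates. The function name and default values are illustrative, not taken from any particular library:

```python
def linear_lr(step: int, total_steps: int,
              base_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Linearly anneal the learning rate from base_lr toward min_lr.

    After total_steps updates the rate stays clamped at min_lr,
    so with min_lr=0.0 the schedule approaches 0 as described.
    """
    frac = min(step, total_steps) / total_steps  # fraction of schedule elapsed
    return base_lr + frac * (min_lr - base_lr)


# Example: the rate shrinks steadily across training.
for step in (0, 500, 1000):
    print(f"step {step}: lr = {linear_lr(step, total_steps=1000):.6f}")
```

Most deep learning frameworks ship equivalent schedulers (e.g. a lambda- or linear-decay scheduler wrapped around the optimizer); the closed-form version above just makes the behavior explicit.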