Hi Fabio,

If you are using an optimal policy, and both the policy and the environment are deterministic, then your estimates will have zero variance and zero bias (unless you are using a function approximator with limited representational capacity).

During the actual learning process, however, this is almost never the case, since some amount of exploration is required in order to improve the policy.
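To make the contrast concrete, here is a minimal sketch. It assumes a made-up toy problem (a deterministic chain of states where the optimal action is always "move right") and uses epsilon-greedy as a stand-in for whatever exploration scheme you are using. The names `GOAL`, `run_episode`, and `return_variance` are hypothetical and only there for illustration.

```python
import random
import statistics

GOAL = 5  # hypothetical chain length: states 0..5, episode ends at state 5


def run_episode(epsilon, rng, max_steps=1000):
    """Return the undiscounted return of one episode on a deterministic chain."""
    state, total = 0, 0.0
    for _ in range(max_steps):
        if state == GOAL:
            break
        # The greedy (optimal) action is +1; with probability epsilon we explore.
        if rng.random() < epsilon:
            action = rng.choice([-1, 1])
        else:
            action = 1
        state = max(0, state + action)  # deterministic transition, wall at 0
        total -= 1.0  # reward of -1 per step
    return total


def return_variance(epsilon, episodes=500, seed=0):
    """Population variance of the sampled returns from the start state."""
    rng = random.Random(seed)
    returns = [run_episode(epsilon, rng) for _ in range(episodes)]
    return statistics.pvariance(returns)


print("variance, deterministic policy:", return_variance(0.0))
print("variance, epsilon = 0.3:       ", return_variance(0.3))
```

With `epsilon = 0.0` every episode follows exactly the same trajectory, so the sampled returns are all identical and their variance is zero. As soon as `epsilon` is above zero, the exploratory actions change the episode length from run to run, and the variance of the returns becomes positive.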

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.

