Hi Simon,

These are good questions. Because of the nature of the Q-update, I would expect the values closer to the goal to be more accurate, and the values further away to be less so: the values are updated recursively backward from the reward-providing state, so accurate estimates spread outward from the goal one backup at a time.
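To illustrate what I mean, here is a minimal sketch of a tabular Q-update (the grid size, learning rate, and discount factor are just placeholders, not the values from the post):

```python
import numpy as np

n_states, n_actions = 16, 4      # hypothetical 4x4 gridworld
alpha, gamma = 0.1, 0.99         # learning rate, discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """One Q-learning backup: move Q[s, a] toward r + gamma * max_a' Q[s_next, a']."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

The state right next to the goal gets an accurate value the first time it observes the terminal reward; states further away only improve after their successors' estimates have already improved, which is why the far-away values lag behind.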

As far as the path not being “optimal”: the stochasticity of the gridworld, as well as the danger of the pits, both change what the optimal path actually is somewhat. It is also the case that, depending on how much exploration the agent is performing, it may not arrive at the absolute optimal path, but rather at a local optimum that achieves “good enough” reward.
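For example, with epsilon-greedy action selection (just one possible exploration scheme, used here for illustration; the post may use something different), the trade-off is explicit:

```python
import numpy as np

def choose_action(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: try something new
    return int(np.argmax(q_values))               # exploit: current best estimate
```

With a small epsilon the agent mostly exploits its current estimates, so once it has found a route that pays reasonably well it will rarely test the alternatives that might be strictly better.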
