Hi Sung,

Thanks for your comment, and for reading my blog.

  1. If you think of the Q-learning update in the context of gradient descent, then `r + γ·max(Q(s′, a′))` is the target we would like to approach, but we know it is a noisy estimate of the true Q-value for that state-action pair. So instead of updating all the way to it, we take a small step (scaled by the learning rate) in the direction that brings the current Q-value closer to the target (see the first sketch below this list).
  2. You are right. I am simply adding noise to the Q-values as a means of encouraging exploration. ε-greedy works as well, but this method takes the magnitude of the Q-values into account, and in practice it is a little more robust (see the second sketch below this list).
  3. A non-stochastic environment would definitely be easier to learn in. Q-learning is robust to stochasticity, however, so it still works, which is a nice property of the algorithm.
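
On point 1, here is a minimal sketch of what I mean, assuming a small tabular setup (the table size, learning rate, and the particular transition values are just illustrative, not taken from the post):

```python
import numpy as np

# Hypothetical toy setup: 16 states, 4 actions.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

# One observed transition (s, a, r, s_next) from the environment.
s, a, r, s_next = 0, 2, 1.0, 5

# The target r + gamma * max_a' Q(s', a') is a noisy estimate of the true
# Q-value, so we only take a small step toward it rather than overwriting.
target = r + gamma * np.max(Q[s_next, :])
Q[s, a] += alpha * (target - Q[s, a])
```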

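And on point 2, a sketch of the noise-based action selection, assuming the noise scale decays as training progresses (the 1/(episode+1) schedule and the example Q-values here are just one illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.5, 0.1, 0.4])   # Q(s, ·) for the current state
episode = 10                                  # current episode index

# Add noise on top of the Q-values and pick the argmax; as the Q-values grow
# and separate (and the noise decays), exploration naturally fades out.
noisy_q = q_values + rng.standard_normal(q_values.shape) * (1.0 / (episode + 1))
action = int(np.argmax(noisy_q))
```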
I hope those responses are helpful.
