Thanks for your comment, and for reading my blog.
- If you think of the Q update in the context of gradient descent, then `r + γ max(Q(s', a'))` is what we would like to approach, but we know that it is a noisy estimate of the true Q-value for that state-action pair. So instead of updating directly to it, we take a small step in its direction, which gradually moves the Q-value closer to the desired one.
- You are right. I am simply adding noise as a means of encouraging exploration. ε-greedy works as well, but the noise-based method takes the magnitude of the Q-values into account, and in practice it is a little more robust.
- A non-stochastic environment would definitely be easier to learn. However, since the Q-learning algorithm is robust to stochasticity, it still works in the stochastic case, which is a nice property of the system.
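To make the first two points concrete, here is a minimal tabular sketch of both ideas: the small step toward the noisy target, and noise-based action selection. All names (`Q`, `alpha`, `choose_action`, `update`, the table sizes) are illustrative assumptions, not code from the post:

```python
import numpy as np

# Illustrative tabular setup (sizes and names are assumptions).
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # small learning rate; discount factor

def choose_action(s, t):
    # Noise-based exploration: perturb the Q-values with decaying random
    # noise and take the argmax. Unlike epsilon-greedy, the perturbation is
    # applied on top of the Q-values themselves, so their relative
    # magnitudes still influence which action gets picked.
    noisy = Q[s] + np.random.randn(n_actions) / (t + 1)
    return int(np.argmax(noisy))

def update(s, a, r, s_next):
    # The target r + gamma * max_a' Q(s', a') is a noisy estimate of the
    # true Q-value, so we move only a small step (alpha) toward it rather
    # than jumping all the way.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

With `alpha = 0.1`, a single `update(0, 0, 1.0, 1)` on an all-zero table moves `Q[0, 0]` from 0 to 0.1, one tenth of the way toward the target of 1.0.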
I hope those responses are helpful.