Hi Pavan,

This is actually a result of the distribution the random result is drawn from, not the method of optimization. The random value that determines success is drawn from a normal distribution with a mean of 0, so every integer above or below 0 represents one standard deviation from the mean. With values of -2 and -5, both arms provide the +1 reward almost every time, so the difference between them in expected reward over time is extremely small. The signal is just far too weak for the algorithm to distinguish between them.

I would encourage you to play with values between -1 and 1 to get a better sense of the algorithm’s sensitivity to different probabilities of reward.
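
To put rough numbers on this, here is a minimal sketch. It assumes an arm pays +1 when a standard normal sample exceeds the arm's value and -1 otherwise (the -1 on failure is my assumption, and the function names are just for illustration):

```python
import math


def success_probability(arm_value):
    """P(sample > arm_value) for a sample from a standard normal distribution."""
    return 0.5 * math.erfc(arm_value / math.sqrt(2))


def expected_reward(arm_value):
    """Expected reward, ASSUMING +1 on success and -1 on failure."""
    p = success_probability(arm_value)
    return p * 1 + (1 - p) * -1


# Compare the two arms from the question with values closer to 0.
for arm_value in (-5, -2, -1, 0, 1):
    p = success_probability(arm_value)
    print(arm_value, round(p, 4), round(expected_reward(arm_value), 4))
```

Under that assumption, the -2 arm succeeds roughly 97.7% of the time and the -5 arm practically always, so their expected rewards differ by less than 0.05. Arms at -1 and 1 differ by more than 1.3, which is a far easier signal to learn from.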

One other thing you might want to try is to adjust the magnitude of the reward for each arm of the bandit, rather than the probability of the reward. In that case you should definitely expect an arm giving a reward of 5 to be preferred over an arm giving a reward of 2.
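
As a rough illustration of the magnitude idea, here is a small sketch. Note that it uses a simple epsilon-greedy agent with sample-average estimates rather than the optimization method you are using, and the reward rule (the arm's magnitude plus unit Gaussian noise) is an assumption of mine, as are all the names:

```python
import random


def run_bandit(reward_magnitudes, episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on arms that differ in reward magnitude."""
    rng = random.Random(seed)
    n_arms = len(reward_magnitudes)
    estimates = [0.0] * n_arms  # running average reward per arm
    pulls = [0] * n_arms        # how often each arm was chosen

    for _ in range(episodes):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit

        # ASSUMED reward rule: noisy payout centred on the arm's magnitude.
        reward = reward_magnitudes[arm] + rng.gauss(0, 1)

        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]

    return estimates, pulls


estimates, pulls = run_bandit([2, 5])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("pull counts:", pulls)
```

With a gap of 3 between the arms and noise of about 1, the agent should end up pulling the reward-5 arm far more often than the reward-2 arm.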

Hope that helps!