Hi Wei,

The reason for summing instead of taking the mean is that summing preserves the magnitude of each update. Taking the mean would also give an update in the correct direction, but it would be scaled down by the batch size, so for a fixed learning rate each step is smaller than it needs to be and training will likely take longer to converge. Also, when using batches of variable size, we want the gradients collected from a larger batch to have a larger magnitude than the gradients collected from a smaller batch. Summing gives us that, whereas the mean normalizes the difference away.
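To make this concrete, here is a minimal sketch in plain Python. The toy setup (a single scalar weight `w` with a squared-error loss) and the helper names are my own assumptions for illustration, not the code from the article. It only shows how the summed gradient grows with batch size while the mean gradient does not.

```python
def per_sample_grads(w, xs, ys):
    # Gradient of (w * x - y) ** 2 with respect to w, one value per sample.
    return [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]


def sum_grad(w, xs, ys):
    # Summed gradient: magnitude scales with the number of samples.
    return sum(per_sample_grads(w, xs, ys))


def mean_grad(w, xs, ys):
    # Mean gradient: same direction as the sum, divided by the batch size.
    grads = per_sample_grads(w, xs, ys)
    return sum(grads) / len(grads)


w = 0.0
small_xs, small_ys = [1.0, 2.0], [2.0, 4.0]          # batch of 2
large_xs, large_ys = small_xs * 4, small_ys * 4      # same data repeated, batch of 8

small_sum = sum_grad(w, small_xs, small_ys)
large_sum = sum_grad(w, large_xs, large_ys)
small_mean = mean_grad(w, small_xs, small_ys)
large_mean = mean_grad(w, large_xs, large_ys)

print("sum: ", small_sum, large_sum)    # larger batch gives a larger gradient
print("mean:", small_mean, large_mean)  # batch size is normalized away
```

With the sum, the batch of 8 produces a gradient four times larger than the batch of 2. With the mean, both batches produce the same gradient, so the extra experience in the larger batch carries no extra weight.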

Hope that helps.

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.
