Hi Gabriel,
What is important about the reverse accumulation is that the reward discounting happens in reverse (which the discount function takes care of): each step's return is its immediate reward plus the discounted return of the step after it, so a single backward pass over the episode gives you every return. The gradients are then all applied at the same time, so it doesn't matter what order the examples are processed in.
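To make that concrete, here is a minimal sketch of what a discount function typically looks like. The name discount and its signature here are just illustrative, not necessarily the exact code you're looking at:

import numpy as np

def discount(rewards, gamma=0.99):
    # Discounted returns by reverse accumulation:
    # G[t] = rewards[t] + gamma * G[t+1], with G beyond the
    # last step taken to be 0.
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backward so each step picks up the already-discounted future.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: discount([1.0, 1.0, 1.0], gamma=0.9) -> [2.71, 1.9, 1.0]
# The last step's return is just its own reward; earlier steps fold in
# the discounted future.

Notice the reversal only matters inside this function; once the returns are computed, each (state, action, return) example stands on its own for the gradient update.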
Hope that clears it up for you.