I'm glad you were able to get the network working with Pong! The reason policy collapse happens with small batch sizes is that V(s') has to be estimated (bootstrapped) whenever the rollout doesn't end in a terminal state. That estimate is then used to compute the advantages for every step in the rollout. The estimate matters less as the batch grows, since a rollout of, say, 100 steps contains far more real reward signal than one of 30, giving the network a clearer picture of the environment.
Conversely, with short rollouts the errors in the estimated V(s') can accumulate, especially when a large, unpredicted reward is obtained. This leads to a 'bad' update that pushes the network away from a viable policy. In a perfect situation the batch size would be the full episode length, but for most environments that's infeasible, so we normally choose something big enough to be robust yet small enough to pass through the network quickly.
Hope that helps!