Good question. I haven’t tried it myself, but I would assume that training takes exponentially longer as the number of arms grows. If you give a reward of 1.0 for a single arm and 0.0 for all the others, the reward becomes increasingly sparse, so the network gets less meaningful signal per episode. Having independent arms may provide a richer signal, since more than one arm can pay out in a given episode, which would allow for faster learning in that experiment.
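To make the sparsity point a bit more concrete, here is a rough sketch (not something I ran against the actual experiment) that counts how often an agent pulling arms uniformly at random sees a non-zero reward in each setup. The function names, the uniform-random policy, and the randomly drawn payout probabilities for the independent arms are all my own assumptions for illustration.

```python
import random


def single_winner_rate(n_arms, n_pulls, rng):
    """Share of uniformly random pulls that return 1.0 when only one arm pays."""
    winning_arm = rng.randrange(n_arms)
    hits = sum(rng.randrange(n_arms) == winning_arm for _ in range(n_pulls))
    return hits / n_pulls


def independent_arms_rate(n_arms, n_pulls, rng):
    """Share of uniformly random pulls that return 1.0 when every arm pays
    independently with its own (assumed, randomly drawn) probability."""
    payout_probs = [rng.random() for _ in range(n_arms)]
    hits = 0
    for _ in range(n_pulls):
        arm = rng.randrange(n_arms)
        if rng.random() < payout_probs[arm]:
            hits += 1
    return hits / n_pulls


if __name__ == "__main__":
    rng = random.Random(0)
    for n_arms in (2, 10, 50):
        sparse = single_winner_rate(n_arms, 20_000, rng)
        rich = independent_arms_rate(n_arms, 20_000, rng)
        print(f"{n_arms:>3} arms: single winner {sparse:.3f}, independent {rich:.3f}")
```

With a single winning arm, the share of rewarded pulls under random exploration should sit near 1 / n_arms, so it shrinks as arms are added, while the independent setup does not thin out the same way. Note that this only measures how often a reward shows up, not how long a network actually takes to learn.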
Just some thoughts.