This is a really good question. The truth is that with asynchronous updates there is a real danger of pushing the policy in a bad direction. This is alleviated by having a number of agents each contribute small updates to the policy, so that, on the whole, the updates stably improve it. Of course in practice this doesn't always happen, and I have found methods like DQN to be more reliable, although they often take longer to train and reach lower overall performance.
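As a toy sketch (not my actual implementation), here is the core idea with plain Python threads and NumPy: several workers each push small, noisy gradient steps toward the minimum of a simple loss f(theta) = ||theta||², without any locking. The loss function, step counts, and noise scale are all invented for illustration.

```python
import threading
import numpy as np

# Toy illustration: many small, noisy, unsynchronised updates from several
# workers still drive the shared parameters toward the optimum.
theta = np.array([5.0, -3.0])          # shared "policy" parameters
lr = 0.01

def worker(seed, n_steps):
    global theta
    rng = np.random.default_rng(seed)  # per-worker noise source
    for _ in range(n_steps):
        # Imperfect gradient estimate: true gradient of ||theta||^2 plus noise.
        grad = 2 * theta + rng.normal(scale=0.5, size=2)
        theta = theta - lr * grad      # unsynchronised, Hogwild-style update

threads = [threading.Thread(target=worker, args=(s, 500)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite individual bad steps (and even races losing some updates),
# theta ends up close to the optimum at the origin.
```

Any single worker's step can be bad, but because each step is small and the noise is independent across workers, the aggregate motion tracks the true gradient.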
In my implementation TensorFlow handles the multiple calls automatically, applying them in the order they arrive from each of the workers. It may be that more manual control over the synchronization process would lead to better performance as well.
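By "manual control" I mean something like the following hypothetical sketch: guard the shared parameters with a lock so that each worker's gradient is applied atomically, instead of relying on the framework to serialise the calls. The gradient values and worker count here are made up for illustration.

```python
import threading

# Hypothetical sketch: serialise updates to shared parameters explicitly.
theta = [5.0, -3.0]    # shared policy parameters (toy values)
lr = 0.01
lock = threading.Lock()

def apply_update(grad):
    with lock:                          # only one worker writes at a time
        for i, g in enumerate(grad):
            theta[i] -= lr * g

# Each worker pre-computes a gradient of ||theta||^2 at the starting values.
grads = [[2 * t for t in theta] for _ in range(4)]
threads = [threading.Thread(target=apply_update, args=(g,)) for g in grads]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All four updates were applied without interleaving partial writes,
# moving theta toward zero: roughly [4.6, -2.76].
```

The trade-off is that the lock makes updates deterministic and loss-free, but workers now block one another, which can erase some of the throughput advantage of going asynchronous in the first place.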