The approach to sampling multiple times would go something like this: You would run the network multiple times with the sample input, and different dropout masks each time. This would give you a set of Q estimations for each action. You would then look at the variance in these estimations to give you a sense of the uncertainty of the model. You can then take this uncertainty into account when taking actions. For example, you could decide to be “optimistic under uncertainty” and take actions that are uncertain, but have a large upper bound of Q, rather than a more certain action with a higher average Q, but lower max Q, if that makes sense. This approach introduces more hyperparameters, but once again takes more advantage of the uncertainty information to guide exploration.