Hi Reira,

This is a good question. In such a case there would still be a fixed set of action probabilities, but they would vary contextually given the state. In a card game, for example, the cards may change, but the maximum number of cards available to a single player at a given time is often limited. If a player could have at most 10 cards in their hand, then each action could correspond to playing card 1, 2, 3, … 10. The network would be provided an input tensor conveying what the actual content of each of those "card positions" was, and the agent could act contextually given this information.
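To make that concrete, here is a minimal sketch of the fixed-slot idea, assuming a hypothetical game with at most 10 card slots and 52 possible card identities (both sizes are illustrative, not from any specific game). Each slot gets a one-hot encoding of its card, empty slots stay all-zero, and invalid actions are masked out before the softmax:

```python
import numpy as np

NUM_SLOTS = 10   # fixed maximum hand size (assumed, per the example above)
CARD_DIM = 52    # one-hot size per card; a standard deck size, assumed here

def encode_hand(hand):
    """Encode a variable-length hand into a fixed (NUM_SLOTS, CARD_DIM) tensor.
    Empty slots remain all-zero, so the network can tell they hold no card."""
    x = np.zeros((NUM_SLOTS, CARD_DIM), dtype=np.float32)
    for slot, card_id in enumerate(hand):
        x[slot, card_id] = 1.0
    return x

def masked_action_probs(logits, hand):
    """Softmax over the fixed NUM_SLOTS actions, masking empty slots."""
    mask = np.full(NUM_SLOTS, -np.inf, dtype=np.float32)
    mask[:len(hand)] = 0.0            # only occupied slots are playable
    z = logits + mask
    z = z - z.max()                   # numerical stability
    p = np.exp(z)
    return p / p.sum()

# A 3-card hand still uses the same 10-slot action space:
hand = [12, 25, 40]
obs = encode_hand(hand)                               # shape (10, 52)
probs = masked_action_probs(np.zeros(NUM_SLOTS), hand)
# probs[3:] are zero: the agent can only "play card 1, 2, or 3".
```

The key point is that the action space (10 slots) never changes; only the observation tensor does, so a standard discrete-action policy network works unmodified.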

Also, while large action spaces of hundreds or thousands of actions are difficult to work with in discrete settings, policy gradient methods, if trained properly, can support dozens of potential actions, which may be enough for some of the changing scenarios you have in mind.
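As a rough illustration of a policy gradient handling dozens of discrete actions, here is a toy REINFORCE sketch on a 40-armed bandit (the action count, learning rate, and reward function are all hypothetical stand-ins for a real environment):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 40                  # "dozens" of discrete actions (assumed size)
theta = np.zeros(NUM_ACTIONS)     # logits of a simple softmax policy

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reward(a):
    """Toy bandit: only action 7 pays off; stands in for a real environment."""
    return 1.0 if a == 7 else 0.0

ALPHA = 0.5                       # learning rate (assumed)
for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(NUM_ACTIONS, p=p)
    r = reward(a)
    # REINFORCE update: grad of log pi(a) is one_hot(a) - p
    grad = -p
    grad[a] += 1.0
    theta += ALPHA * r * grad

# The policy concentrates its probability mass on the rewarding action.
```

Even with 40 actions the softmax policy finds the rewarding one quickly; the same structure scales to contextual policies where the logits come from a network conditioned on the state.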

Hope that helps answer your question!

PhD. Interests include Deep (Reinforcement) Learning, Computational Neuroscience, and Phenomenology.