Hi Andy,
You are right. In my implementation I use a single frame as input, and an LSTM layer to capture temporal dependencies. For environments in which all the temporal dynamics can be captured within a few frames, then frame stacking may be a better option, since it is more stable.