I really appreciate you going through my code and attempting to rework it in a cleaner way. This was another one of my earlier tutorials, and definitely could be more compact.
As for the training procedure, however, it is actually essential that the model and policy are trained simultaneously. The issue is that a model trained on a random policy won’t get exposure to the whole state space. Think of a game like Breakout: a random policy may never hit the ball even once, and so the model would learn nothing about the dynamics of the bricks at the top of the screen. The essential point about these models is that they need to be able to predict all the dynamics of the environment.
You are right that balancing the relative importance of the different parts of the model loss can be an issue. One approach is to normalize the state/reward/done losses so that each contributes equally. Alternatively, if the done loss turns out to be the most important, you could simply give it a greater weight than the other losses.
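A minimal sketch of the normalization idea (the class, names, and decay constant are my own, not from the tutorial): divide each loss by a running average of its own magnitude, so all three terms contribute on roughly the same scale, with an optional extra weight on the done term:

```python
import numpy as np

class LossNormalizer:
    """Keeps a running mean of each loss's magnitude and divides by it,
    so the state/reward/done terms contribute roughly equally to the
    total model loss. One possible scheme, not the tutorial's code."""

    def __init__(self, decay=0.99, eps=1e-8):
        self.decay = decay
        self.eps = eps
        self.running = {}  # per-loss running magnitude

    def __call__(self, name, loss):
        # Initialize the running magnitude on first sight of each loss.
        m = self.running.get(name, abs(loss))
        m = self.decay * m + (1 - self.decay) * abs(loss)
        self.running[name] = m
        return loss / (m + self.eps)

norm = LossNormalizer()
# Raw losses on wildly different scales; done gets an extra weight of 2.0.
total = norm("state", 4.2) + norm("reward", 0.03) + 2.0 * norm("done", 0.5)
```

On the first step each normalized term is close to 1, so `total` is near 4.0 here; over training, the running averages keep the terms balanced as the raw magnitudes drift.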
Another approach to model-based RL entirely would be to train a single network that jointly learns both a policy and a model of the dynamics. Such a network would not only be more compact, but each aspect would also benefit from a shared low-level representation.
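To make the joint-network idea concrete, here is a minimal plain-NumPy forward pass (layer sizes, names, and initialization are illustrative, not the tutorial's code): one shared hidden layer feeds both a policy head and a dynamics head that predicts next state, reward, and done.

```python
import numpy as np

rng = np.random.default_rng(0)

class JointPolicyModel:
    """Single network with a shared hidden layer (the common low-level
    representation) and two heads: a policy head producing action logits,
    and a dynamics head predicting next state, reward, and done."""

    def __init__(self, state_dim, n_actions, hidden=64):
        self.W1 = rng.normal(0, 0.1, (state_dim, hidden))
        self.W_pi = rng.normal(0, 0.1, (hidden, n_actions))
        # The dynamics head also conditions on the chosen action (one-hot),
        # and outputs state_dim values plus reward and done logits.
        self.W_dyn = rng.normal(0, 0.1, (hidden + n_actions, state_dim + 2))
        self.n_actions = n_actions

    def forward(self, s, a):
        h = np.tanh(s @ self.W1)                      # shared representation
        logits = h @ self.W_pi                        # policy head
        a_onehot = np.eye(self.n_actions)[a]
        dyn = np.concatenate([h, a_onehot]) @ self.W_dyn
        next_state, reward, done_logit = dyn[:-2], dyn[-2], dyn[-1]
        return logits, next_state, reward, done_logit

net = JointPolicyModel(state_dim=4, n_actions=2)
logits, ns, r, d = net.forward(np.zeros(4), a=1)
```

In practice both heads would be trained together, with the gradients from the policy loss and the model loss flowing through the same shared layer.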
Hope this helps, and please let me know if you get your approach to work successfully!