Hi Akhan,
Generally speaking, RL methods see little improvement from bias terms. Bias terms tend to help in situations where generalization is beneficial, but in control settings, specifically deterministic ones, we don’t need to “generalize” to anything but the training environment.
In this case, with a simple one-layer network, bias terms would actually cause problems: our weights stand in as a direct measure of the Q-values, and a bias term would shift those values away from what the weights have learned.
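
To make that concrete, here is a minimal NumPy sketch. It assumes the states are fed in as one-hot vectors; the sizes, the `q_values` helper, and the bias values are made up for illustration and are not taken from the original code. With a one-hot input, the network output for a state is just the matching row of the weight matrix, so the weights can be read like a Q-table. Adding a bias shifts every state's Q-values by the same per-action offset.

```python
import numpy as np

n_states, n_actions = 4, 2  # hypothetical sizes, for illustration only

rng = np.random.default_rng(0)
W = rng.uniform(0.0, 0.01, size=(n_states, n_actions))  # one-layer weights
b = np.array([0.5, -0.5])  # hypothetical bias, one entry per action


def q_values(state, weights, bias=None):
    """Q-values for one state from a one-layer network with one-hot input."""
    one_hot = np.eye(n_states)[state]  # assumed input encoding
    q = one_hot @ weights              # selects row `state` of the weights
    if bias is not None:
        q = q + bias                   # same offset added for every state
    return q


state = 2
q_no_bias = q_values(state, W)
q_with_bias = q_values(state, W, b)

# Without a bias, the output is exactly the weight row for that state.
reads_as_table = np.allclose(q_no_bias, W[state])

# With a bias, the output differs from the weight row by exactly the bias.
shifted_by_bias = np.allclose(q_with_bias - W[state], b)

print(reads_as_table, shifted_by_bias)
```

Because the bias is shared across all states, it cannot encode anything state-specific, so it only moves the outputs away from the per-state values stored in the weights.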
Hope that helps!