Are you trying to have the network identify two sounds simultaneously? If so, you will need a somewhat different network architecture, one that can assign multiple labels to a single sample. For example, you could use multiple output layers. Alternatively, instead of a softmax output layer, use a sigmoid layer, which allows more than one output unit to be fully activated at the same time. You would also have to change your loss function accordingly, for example from categorical cross-entropy to binary cross-entropy computed per output unit.
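To make the softmax vs. sigmoid difference concrete, here is a minimal sketch in plain Python. The logits, the three sound classes, and the 0.5 decision threshold are all made-up assumptions for illustration, not values from your network. It shows that softmax forces the outputs to compete for a total of 1, while independent sigmoids let two units be near 1 at once, and it pairs the sigmoid outputs with a per-unit binary cross-entropy loss.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; outputs always sum to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    # Each unit is squashed independently into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(probs, targets, eps=1e-12):
    # Mean of the per-unit binary cross-entropy terms.
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Hypothetical raw outputs for three sound classes; the first two sounds
# are both present in the sample, the third is absent.
logits = [4.0, 3.5, -3.0]
targets = [1, 1, 0]

softmax_probs = softmax(logits)               # classes compete with each other
sigmoid_probs = [sigmoid(z) for z in logits]  # classes are scored independently

# Assumed decision rule: a sound counts as present if its probability >= 0.5.
predicted = [int(p >= 0.5) for p in sigmoid_probs]
loss = binary_cross_entropy(sigmoid_probs, targets)

print("softmax:", softmax_probs)
print("sigmoid:", sigmoid_probs)
print("predicted labels:", predicted)
print("binary cross-entropy:", loss)
```

In a real framework you would normally use its built-in sigmoid output and binary cross-entropy loss rather than hand-written versions like these; the sketch is only meant to show why the output layer and the loss have to change together.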
Best of luck.