Thanks for the interesting question. I think the basic approach is workable. You would need to collect a large enough labeled training dataset of the kinds of sounds you are interested in. You could then segment the audio into short windows and classify the dominant sound within each window.
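To make the windowing idea concrete, here is a rough sketch in plain Python. Everything in it is my own illustration rather than a recommended pipeline: the function names, the 0.5 second window, and especially `classify_window`, which is only a stand-in that thresholds the zero-crossing rate. In practice you would replace it with a model trained on your labeled recordings.

```python
import math


def make_tone(freq_hz, duration_s, sample_rate):
    """Synthetic sine tone, used here only as stand-in audio data."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def segment(samples, window_size, hop_size):
    """Split samples into fixed-length windows; a trailing partial window is dropped."""
    last_start = len(samples) - window_size
    return [samples[s:s + window_size] for s in range(0, last_start + 1, hop_size)]


def zero_crossing_rate(window):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(window) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))
    return crossings / (len(window) - 1)


def classify_window(window, threshold=0.1):
    """Hypothetical placeholder classifier (assumption, not a trained model).

    Labels a window "high" or "low" by its zero-crossing rate. Swap this
    for a model fitted on your own labeled training data.
    """
    return "high" if zero_crossing_rate(window) > threshold else "low"


def label_windows(samples, sample_rate, window_seconds=0.5, hop_seconds=0.5):
    """Return (start_time_in_seconds, label) for each window of the signal."""
    window_size = int(window_seconds * sample_rate)
    hop_size = int(hop_seconds * sample_rate)
    windows = segment(samples, window_size, hop_size)
    return [(i * hop_size / sample_rate, classify_window(w))
            for i, w in enumerate(windows)]


if __name__ == "__main__":
    sr = 8000
    # One second of a low tone followed by one second of a higher tone.
    audio = make_tone(100, 1.0, sr) + make_tone(1000, 1.0, sr)
    for start, label in label_windows(audio, sr):
        print(start, label)
```

The window length is a trade-off: shorter windows give finer time resolution but leave the classifier less audio to work with, and overlapping windows (a hop smaller than the window) can smooth out label changes at the boundaries.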
Hope something like that makes sense.