Recognizing Sounds (A Deep Learning Case Study)


In recent years, image classification has become an increasingly popular machine learning task, used in large-scale applications such as Google Photos and Facebook tagging. With the advent of fast and reliable convolutional neural networks, sets of thousands of images with hundreds of classes can be classified with high accuracy (Krizhevsky et al., 2012). The success of these networks in image classification raises the question of whether they apply to other domains where discrete objects exist. One such domain is audition, where discrete sounds occur over time. This is analogous to image recognition, where discrete objects exist across space. As such, it is an ideal domain to explore.

For image recognition, there exist large curated databases such as ImageNet and MS-COCO, which contain millions of labeled examples to use for training a network. For audition, there is no such expansive database. As such, I had to build a new dataset from scratch. When thinking about the kinds of sounds we typically treat as discrete objects, instruments immediately come to mind. A note played by an individual instrument exists for a fixed period of time and has an auditory signature that most listeners can distinguish from that of other instruments. This is at least true at the inter-instrument level, where the sound of a drum can be told apart from that of a guitar, for example. It was also a fun dataset to collect, as I was able to spend time working with my musician friends.


Sounds were recorded by two friends of mine playing a drum kit and a guitar. Within the drum kit, there were eight individual percussive sub-instruments: Snare, Rim, Hi-hat, Ride, Kick, Small-tom, Mid-tom, and Floor-tom. Three notes (3rd fret E string, 7th fret E string, and 9th fret D string) and two chords (A# D F and B B D#) were recorded from the guitar, for a total of 13 sound object classes. Each sound object was recorded at 240 bpm for two minutes, yielding 480 samples per class. The musician kept time using a metronome, and the data was checked to ensure each sound occurred within its designated 0.25-second window. The samples were recorded with a single-channel microphone at 44,100 Hz.
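The sample counts above follow directly from the tempo and the sample rate. A minimal sketch of that arithmetic, and of cutting a recording into one window per beat, might look like the following. The function names, and the assumption that the recording starts exactly on a beat, are illustrative and not taken from the original pipeline.

```python
SAMPLE_RATE_HZ = 44_100   # from the article
TEMPO_BPM = 240           # from the article
DURATION_S = 120          # two minutes per class


def window_length(sample_rate_hz, tempo_bpm):
    """Number of audio samples per beat (one beat lasts 60 / tempo seconds)."""
    return sample_rate_hz * 60 // tempo_bpm


def split_into_windows(signal, window):
    """Cut a sequence into consecutive, non-overlapping windows.

    Assumes the first beat starts at index 0; any trailing partial
    window is dropped.
    """
    n_windows = len(signal) // window
    return [signal[i * window:(i + 1) * window] for i in range(n_windows)]


samples_per_window = window_length(SAMPLE_RATE_HZ, TEMPO_BPM)
windows_per_class = SAMPLE_RATE_HZ * DURATION_S // samples_per_window
```

With the numbers from the article, each window holds 11,025 samples (0.25 s) and each two-minute recording yields 480 windows.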

The data was then shuffled and split into a 70% training set, a 20% validation set, and a 10% test set. Raw audio can be represented in a number of ways. For the experiments in this study, I chose two: the sound-pressure value at each sample time, and a spectrogram of each sample, which represents the data in frequency space. When displayed visually, both representations produce patterns that an ordinary observer can tell apart. Furthermore, the spectrogram contains frequency and amplitude information over time, something considered essential in most analyses of acoustic information. I had a hunch this would be the better representation, but I wanted to see whether a neural network was capable of learning from the raw sound-pressure information as well. The sound-pressure data was initially 11,025-dimensional (0.25 seconds at 44,100 Hz); however, it was reduced to 1,024 dimensions to match the size of the frequency-space data, which was represented as a 32 x 32 matrix. This allowed the models to be compared directly.
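The article does not say how the 11,025 samples were reduced to 1,024 values or which spectrogram settings produced the 32 x 32 matrix, so the sketch below is only one plausible reading. It assumes evenly spaced subsampling for the reduction, and 32 non-overlapping Hann-windowed frames with the lowest 32 DFT magnitudes per frame for the spectrogram. All function names and parameters are hypothetical.

```python
import cmath
import math
import random


def split_dataset(items, seed=0):
    """Shuffle, then split 70/20/10 into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_val = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def downsample(signal, target_len=1024):
    """Keep target_len evenly spaced samples (assumed method, no anti-aliasing)."""
    n = len(signal)
    return [signal[i * n // target_len] for i in range(target_len)]


def spectrogram(signal, n_frames=32, n_bins=32):
    """Magnitude spectrogram as a list of n_frames rows of n_bins values.

    Each row is one time frame; each column is one of the lowest
    n_bins DFT frequencies of that frame (assumed layout).
    """
    frame_len = len(signal) // n_frames
    rows = []
    for f in range(n_frames):
        frame = signal[f * frame_len:(f + 1) * frame_len]
        # Periodic Hann window to reduce spectral leakage.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len))
                    for t, x in enumerate(frame)]
        row = []
        for k in range(n_bins):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / frame_len)
                        for t, x in enumerate(windowed))
            row.append(abs(coeff))
        rows.append(row)
    return rows
```

The naive DFT here is slow but dependency-free; an FFT routine from a numerical library would be the usual choice in practice.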

Figure 1. Left: a single sound object displayed in sound-pressure space. Right: a single sound object displayed in frequency space. In both cases the x-axis represents time.

I chose to compare three model architectures of increasing complexity in order to learn which model features are needed to distinguish between the different sound objects in the dataset. The first model was the simplest neural network possible: a softmax regression, which consists of a single linear layer. This architecture was chosen for its simplicity and its generally impressive performance on a number of tasks, including image recognition in certain contexts, such as MNIST handwritten digit recognition. The second architecture was a neural network with a single hidden layer. The size of the hidden layer was treated as a hyperparameter, and the optimal setting is presented in the results below. This network was chosen as an intermediate step because it introduces nonlinearity into the model in a controlled way while keeping the weights interpretable.
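To make the difference between the first two models concrete, here is a minimal forward-pass sketch in plain Python. The input size (1,024), class count (13), and hidden size (50) come from the article; the ReLU nonlinearity, the initialization scheme, and all function names are assumptions, since the article does not specify them.

```python
import math
import random

N_INPUTS, N_CLASSES, N_HIDDEN = 1024, 13, 50


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Affine layer; weights holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def softmax_regression(x, layer):
    """Model 1: a single linear layer followed by softmax."""
    return softmax(linear(x, *layer))


def one_hidden_layer(x, hidden, output):
    """Model 2: linear -> ReLU (assumed nonlinearity) -> linear -> softmax."""
    activations = [max(0.0, z) for z in linear(x, *hidden)]
    return softmax(linear(activations, *output))
```

The only structural difference is the extra layer and the nonlinearity between the two affine maps; without the nonlinearity, the second model would collapse back into a single linear map.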

A neural network with a single hidden layer.

The final architecture was a convolutional neural network with a design similar to that of LeNet (LeCun et al., 1998). This is where the deep learning comes in, since this model contains multiple hidden layers. Specifically, I used a model with the following feed-forward architecture: (32x32x1) -> CONV (32x32x32) -> POOL (16x16x32) -> CONV (16x16x64) -> POOL (8x8x64) -> FC (4096x1024) -> FC (1024x13) -> OUT. A convolutional network was chosen as the third model for its success in image recognition tasks, as well as its similarity to biological models of the nervous system. Convolutional layers in particular capture the property of receptive fields, which are essential to the operation of both the human visual and auditory systems (Kandel et al., 2000). All model architectures were implemented in TensorFlow, and training was performed on an NVIDIA GeForce GTX 970. The Adam optimizer was used for all training regimens, and the loss was always a softmax loss. Adam was chosen for its improved performance compared to traditional gradient descent methods (Kingma & Ba, 2014).
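The layer sizes listed above can be checked with a small shape trace. The sketch below assumes "same" padding for the convolutions and 2x2 pooling with stride 2, which is what the listed shapes imply. The 5x5 kernel used for the parameter count is an assumption borrowed from the original LeNet; the article does not state the kernel size.

```python
def conv_same(shape, out_channels):
    """'Same'-padded convolution: height and width unchanged, channels replaced."""
    height, width, _ = shape
    return (height, width, out_channels)


def pool_2x2(shape):
    """2x2 pooling with stride 2: halves height and width."""
    height, width, channels = shape
    return (height // 2, width // 2, channels)


def trace_shapes(input_shape=(32, 32, 1)):
    """Activation shapes after each CONV and POOL stage."""
    shapes = [input_shape]
    for out_channels in (32, 64):
        shapes.append(conv_same(shapes[-1], out_channels))
        shapes.append(pool_2x2(shapes[-1]))
    return shapes


def count_parameters(kernel=5, n_hidden=1024, n_classes=13):
    """Weights plus biases for the whole network (kernel size is assumed)."""
    shapes = trace_shapes()
    total = 0
    in_channels = shapes[0][2]
    for out_channels in (32, 64):
        total += kernel * kernel * in_channels * out_channels + out_channels
        in_channels = out_channels
    height, width, channels = shapes[-1]
    flat = height * width * channels          # 8 * 8 * 64 = 4096
    total += flat * n_hidden + n_hidden       # first fully connected layer
    total += n_hidden * n_classes + n_classes  # output layer
    return total
```

The trace confirms that the final pooled volume flattens to 4,096 values, which matches the input size of the first fully connected layer. Under the 5x5 assumption, almost all parameters sit in that layer.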

LeNet-5. The architecture I used was slightly modified, but the general design remained the same.
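For readers unfamiliar with the training setup, the following is a rough sketch of a softmax cross-entropy loss and a single Adam update as described by Kingma & Ba (2014). The hyperparameter defaults are those suggested in that paper, not values reported in this study, and the function names and the flat-list parameter layout are illustrative only. TensorFlow provides its own implementations of both.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-probability of the correct class under a softmax."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over a flat list of parameters.

    state holds the step count "t" and running moment estimates "m" and "v".
    """
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


def new_adam_state(n_params):
    """Fresh optimizer state with zeroed moment estimates."""
    return {"t": 0, "m": [0.0] * n_params, "v": [0.0] * n_params}
```

On the first step the bias-corrected update has magnitude close to the learning rate for every parameter, regardless of the gradient scale, which is part of why Adam is less sensitive to gradient magnitude than plain gradient descent.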


All models were trained for 500 iterations regardless of convergence. I first compared learning accuracy between the two data representations. The three model architectures were trained on each representation, and the results show uniformly better accuracy for networks trained on the frequency-space representation than on the sound-pressure representation. After 500 iterations, none of the three architectures achieved greater than 20% accuracy on the latter. In contrast, both the neural network and the convolutional network achieved greater than 80% accuracy on the frequency-space data.

Figure 2. Comparison of the model architectures when trained on the sound-pressure representation of the data.
Figure 3. Comparison of the model architectures when trained on the frequency-space representation of the data.

Within the neural network architecture, I examined the effect of hidden layer size on accuracy. Given the much worse accuracy of models trained on the sound-pressure representation, I conducted this analysis only on the frequency-space data. Accuracy scaled roughly linearly with layer size up to 50 units. At that point, the training set accuracy of the 40-unit and 50-unit models was not appreciably different after 500 iterations. The validation set accuracy was higher for the 50-unit model, so it was chosen as the model used in the overall comparisons. Given the likely diminishing returns from more than 50 hidden units, models with larger hidden layers were not constructed.

Figure 4. Five different hidden layer sizes were used to build models on the frequency-space representation. The 50-unit model showed the highest combined training and validation set accuracy.

Comparing the model architectures, a clear ordering of training accuracy appears. The softmax architecture fails to classify the training data and performs at chance regardless of the number of iterations. The neural network is comparable to the convolutional network on the sound-pressure representation, but worse on the frequency-space representation. After roughly half of the total iterations, the convolutional neural network achieves an accuracy of over 97% on the test dataset, making it the most successful of the models compared. For all architecture and data pairings there was no evidence of overfitting, as the validation accuracy tracked the training accuracy for the duration of the training regimen.
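As a point of reference for these numbers, chance level on a balanced 13-class problem is 1/13, or about 7.7%. A small sketch of how accuracy and that baseline could be computed follows; the helper names are illustrative, and the balanced-class assumption rests on each class having 480 samples.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing on balanced classes."""
    return 1.0 / n_classes


def generalization_gap(train_accuracy, validation_accuracy):
    """Training minus validation accuracy; a large positive gap suggests overfitting."""
    return train_accuracy - validation_accuracy
```

Seen this way, the sub-20% results on the sound-pressure data sit only modestly above the chance baseline, while the 80% and 97% results on the frequency-space data are far above it.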


These results show that it is indeed possible to use machine learning methods to classify subtle differences in sound objects both between and within instruments. This was made possible by the reasonably large training set and the general internal consistency of the sound objects. When comparing representations, frequency space was overwhelmingly the more successful choice. This was expected, given that most acoustic analyses are performed in frequency space rather than on raw sound-pressure information. However, given the ability of convolutional networks to learn higher-level representations from raw pixels in image classification, it is disappointing that a convolutional neural network was unable to approximate such a transformation in the auditory domain. Of course, the network would have to learn the equivalent of a Fourier transform, and the LeNet architecture may not be suited to such a task (though a more complex network might be).

When comparing the architectures themselves, the convolutional network was the most successful at achieving high classification accuracy. With enough training iterations, however, the neural network performed surprisingly well, considering how much simpler a representation it learned. The softmax architecture failed to learn either data representation. This is likely due to the inherent nonlinearity of the problem domain. Within a 0.25-second window, the sound will likely not appear in the same place every time, which prevents a simple linear model from learning to represent the sounds in an invariant way. The other important difference between the neural network and the convolutional network was training time: the convolutional network took more than twice as long to train. In a production environment the convolutional network would likely take longer to classify new sounds as well, given its reliance on computationally more expensive convolutional layers. Taking all of this into account, however, the high accuracy would likely outweigh the overhead of the larger network.
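The point about position within the window can be illustrated with a toy example. A fixed linear template (as in softmax regression) responds strongly only when the sound sits where the template expects it, whereas sliding a small template across the signal and taking the maximum response, which is roughly what a convolutional layer followed by pooling does, is insensitive to the shift. The signals and names below are made up for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def shift(signal, k):
    """Circularly shift a list to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def max_correlation(signal, template):
    """Slide a short template across the signal and keep the best response."""
    n, m = len(signal), len(template)
    return max(dot(signal[i:i + m], template) for i in range(n - m + 1))


pulse = [0, 0, 1, 2, 1, 0, 0, 0]   # toy "sound" early in the window
late_pulse = shift(pulse, 3)       # same sound, arriving later

fixed_template = pulse             # linear model tuned to the early position
local_template = [1, 2, 1]         # small filter, position-independent

early_fixed = dot(pulse, fixed_template)
late_fixed = dot(late_pulse, fixed_template)
early_slid = max_correlation(pulse, local_template)
late_slid = max_correlation(late_pulse, local_template)
```

Here the fixed template's response drops from 6 to 0 when the pulse moves, while the sliding template's maximum response stays at 6 in both cases.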


Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (Eds.). (2000). Principles of neural science (Vol. 4, pp. 1227–1246). New York: McGraw-Hill.

Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

