If you are a metal fan, the chances are that you also play a musical instrument, or perhaps several. However, for readers without formal musical training, the word timbre might still not be familiar, although musicians are likely to have a well-developed sense of timbre without realising it. So, what is timbre, and why does it matter in metal music? When asked to describe a sound, a musician might use a combination of direct and metaphorical language: ‘the bass is woolly’, ‘the kick drum needs more punch’, ‘the guitar is very harsh’.1 This kind of descriptive vocabulary can present an obstacle for people who want to measure or manipulate sonic perceptual characteristics, such as musicians looking for a certain sound, producers responding to a client’s description, or engineers designing tools like equalisation or distortion algorithms. Essentially, the language of timbre is a way to communicate with others about the sound we hear. This extends beyond performance or recording and includes describing our experience of listening to music: production critiques, for example, often fall instinctively under the purview of timbral analysis.
This chapter first introduces timbre from a formal point of view and reviews some metal-centred timbral studies from the literature. Next, we look at the use of technology to meter timbral attributes. Finally, the chapter reflects on the potential of machine learning, a type of data-driven learning that falls under the umbrella of artificial intelligence, for music production tasks, where it has borne fruit in automated mixing and mastering in recent years.2 The chapter gives a brief overview of how these ideas work and how they might speculatively be implemented in the context of metal music and timbre, particularly with examples looking at convolution and tone matching.
Defining Timbre (‘Then’)
Listeners often struggle to describe their experience of music to others in a meaningful way. What do adjectives like ‘warm’, ‘punchy’ or ‘heavy’ really mean, and how can listeners, musicians and producers exploit this knowledge? To get a handle on timbre, we need to set a few definitions. The American Standards Association (ASA) defines timbre as ‘that attribute of sensation, in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar’.3 This definition can be considered a difficult starting point, as it does not define what timbre is but rather what it is not. To illustrate this difficulty, we can consider unpitched or environmental sounds that would, according to the ASA definition, have no timbre. A more satisfactory definition is commonly given as ‘the sensation whereby a listener can judge that two sounds are dissimilar using criteria other than pitch, loudness or duration’.4
Tone colour and sound quality are terms sometimes used synonymously with timbre. However, tone colour can imply that the spectral properties of a sound are solely responsible for its timbre, which contradicts research demonstrating the importance of temporal acoustic correlates in the perception of timbre.5 What this tells us is that timbre is a psycho-acoustic attribute: a perceptual parameter, something slightly different for each of us, but with underlying acoustic contributing factors that can be quantified.
Fans of metal music are, in fact, particularly well trained for timbral analysis, as they will have a sense of what ‘heavy’ means and be familiar with at least three categories of instrumental timbral attributes: distorted guitar tones, screaming vocals, and polished, closely microphoned and/or triggered drum sounds.
Metal Specific Timbral Attributes (‘Now’)
Let us consider our metal-specific timbral attributes in more detail. We might imagine an ontological pyramid containing our descriptors, with ‘heavy’ at the top, as it pertains to individual instrumentation, the overall mix, lyrical content, and indeed the semantic whole of a performance. Below this, we might place our flavours of distortion, both for bass and guitar, perhaps broken into smaller descriptors like ‘bright’, ‘dirty’ or ‘chuggy’. Similarly, the second layer might feature other instruments, with a category for drums having sub-categories that include descriptors like ‘punchy’. Perhaps the most distinctive and challenging category in this second layer is vocals: there are very few genres of music with as much timbral variety in their vocal delivery. To illustrate this, imagine a line of lyrics being delivered at the same pitch, loudness and spatial placement, but by different vocalists. Let us take Chris Barnes, the original vocalist of the death metal band Cannibal Corpse. The third layer of the pyramid might now include timbral descriptors like ‘growly’ or ‘death grunt’. We might imagine a rather different vocal timbre if Barnes were replaced by, for example, Till Lindemann of the German industrial metal band Rammstein. Lindemann has a larger range than Barnes, and in German musicology there is even a word for the type of spoken-word singing style he employs with Rammstein: Sprechgesang, literally ‘spoken singing’, and not limited to the world of metal. In the operatic idiom, for example, it would be perfectly acceptable to mark a passage as Sprechgesang for singers. But in the metal world, we might fill our third layer of the timbral pyramid for Lindemann with descriptors like ‘breathy’ or ‘raspy’. The vocal sound we hear in this case is a combination of several factors, including Lindemann’s own performance, but it is also very much reliant on sound engineering and music production techniques: closely microphoned vocals lend particular timbral properties, as does the use of dynamic range reduction, also known as compression. We also have, and often use colloquially, a generic timbral descriptor when considering metal vocals: ‘clean’. In genres like metalcore, for example, vocalists might switch between a clean and a dirty style. These examples illustrate how psycho-acousticians borrow descriptors analogously from other domains, with ‘clean’ sitting semantically at the opposite end of a bipolar scale from ‘dirty’.
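To make the structure concrete, here is a minimal sketch of how such a descriptor pyramid might be represented in code. All descriptor names are illustrative, drawn from the discussion above; a real ontology would be empirically derived and far richer.

```python
# Illustrative sketch only: the timbral descriptor pyramid as a nested
# Python dictionary. Descriptor names are taken from the text above;
# a working ontology would be derived from listener data.
timbre_pyramid = {
    "heavy": {                       # apex: pertains to the whole performance
        "guitar distortion": ["bright", "dirty", "chuggy"],
        "bass distortion":   ["dirty", "chuggy"],
        "drums":             ["punchy", "clicky", "clean"],
        "vocals": {
            "harsh": ["growly", "death grunt", "breathy", "raspy"],
            "clean": ["sung", "Sprechgesang"],
        },
    }
}

# Reading the pyramid top-down mirrors the ontology: 'heavy' sits above
# per-instrument categories, which in turn contain finer descriptors.
print(timbre_pyramid["heavy"]["vocals"]["harsh"])
```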
‘Heaviness’ sits at the top of our pyramid of metal timbre. We all have an idea of what it means, and it pervades each of the lower levels of timbral attributes and their constituent descriptors. In the last fifteen years, we have started to see serious attempts by scholars and practitioners at qualifying and quantifying what we mean by this. Across the literature, we find qualities related to denser or distorted guitar timbres, perceived rhythmic difficulty,6 and multiple combinations of perceptual and acoustic correlates.7 The mechanism of distortion itself is quite well known: additional harmonic overtones (perhaps with some inharmonic or noisy content) are added in particular ratios, producing something akin to the physical sensation of distortion a listener might experience if their auditory mechanism were overloaded by a very loud sound in the real world. By rights, distortion should be a bad thing in sound engineering, and many technical ear-training tools exist that seek to teach the listener to identify and remove unwanted distortion. But in metal timbres, the key distinction is between wanted and unwanted distortion. Because of the sound engineering community’s technical focus on the latter, there are established acoustic methods for describing distortion, both linear and non-linear types. Distortion, then, is first and foremost an acoustic parameter, but one with timbral descriptors that we might anchor to it, for example ‘clean’ or ‘dirty’ as mentioned above, but also ‘crunch’. The difference between the acoustic shape and type of distortion is of particular interest to metal, as the non-linear distortion found in overdriven tube amplifiers produces a markedly different perceived response in listeners from linear distortion.8 There are also studies suggesting that musicians and non-musicians have different responses to distorted timbres.9
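As a concrete illustration of this mechanism, the following sketch applies a simple non-linear waveshaper to a pure tone and inspects the harmonics it adds. The tanh soft clip is used here only as a common stand-in for tube-style saturation, and the drive value is arbitrary.

```python
import numpy as np

fs = 44100                       # sample rate (Hz)
t = np.arange(fs) / fs           # one second of audio
f0 = 110.0                       # A2, a typical low guitar note
x = np.sin(2 * np.pi * f0 * t)   # pure tone: a single spectral line at f0

drive = 8.0
y = np.tanh(drive * x)           # odd-symmetric soft clipping (tube-like)

# Inspect the magnitude spectrum: the clipped tone now contains odd
# harmonics (3*f0, 5*f0, ...) that the pure tone lacked.
spectrum = np.abs(np.fft.rfft(y)) / len(y)
freqs = np.fft.rfftfreq(len(y), 1 / fs)
for k in (1, 3, 5, 7):
    idx = np.argmin(np.abs(freqs - k * f0))
    print(f"harmonic {k} ({k * f0:.0f} Hz): {spectrum[idx]:.4f}")
```

An asymmetric waveshaper would add even harmonics as well, which is one acoustic handle on the different ‘flavours’ of distortion described above.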
In contrast to this favoured type of distortion, the drums in a metal performance are typically preferred ‘clean’, or described with other expected descriptors such as ‘punchy’ or ‘clicky’. There are always exceptions: industrial metal, for example, has made creative use of distorted drums over the years. In modern metal drum sounds, then, a different lens is passed over the timbre: that of realism and performance augmentation, whether through close microphone techniques or, more commonly, drum replacement and sample-triggering strategies. For a detailed overview of the process of fine-tuning metal drums at the point of production, the interested reader might visit Mark Mynett’s articles for Sound on Sound magazine,10 which lift the curtain on these processes.
Applying Timbre in Performance and Production (‘Next’)
There are times when the lens of history allows us to see how timbre can be as fickle as fashion. In the mid-1990s, technology had evolved sufficiently to shape the production aesthetics of metal, giving rise to a new sub-genre: nu metal. Ross Robinson was arguably its most successful producer and had a distinctive sound characterised by a series of timbral attributes: detuned guitars and a ‘ticky’ bass sound. The first Korn record, Korn (1994), for all these trappings, is still a remarkably dynamic record, so much so that it features on renowned mastering engineer Bob Katz’s shortlist of records deserving praise in the face of the loudness war. Thus, we might consider a place somewhere in our timbral pyramid for attribute sets that correspond to specific metal sub-genres. This has useful applications for listeners in terms of classifying their own choices of music and thus finding or recommending new music they might like. But beyond this, many examples are possible, taking advantage of recent advances in processing power and the availability of digital signal processing, for example digital convolution, and of machine learning techniques.
Proposing an AI Approach
One such example would be to propose tools that allow us to harness what we know about metal-specific timbral features in our own music-making activities – whether as a performer, producer or simply a listener – using an artificial intelligence (AI) or machine learning approach. One example of a machine learning model would be a supervised learning algorithm. In our metal example, we could extract acoustic features from a dataset, for example spectral centroid, which is correlated with bright guitar tones. We then annotate each example tone with the relevant timbral descriptors. The algorithm seeks to discover how much each acoustic feature contributes to, say, the brightest tones and applies this as a weighting to each feature in the dataset. We now have an annotated dataset of guitar tones with a series of weighted features contributing to specific timbral descriptors. Imagine now that instead of having traditional equalisation controls on your guitar amp (e.g., bass, middle, treble), you have ‘brightness’, ‘sharpness’, ‘heaviness’ and other timbral descriptors of your choosing, trained on a dataset of metal tones that you enjoy.
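A minimal sketch of this supervised step follows, assuming we already have a feature matrix and listener ‘brightness’ ratings. Here the data is synthetic stand-in material and the model is a simple ridge regression; a real system would use features extracted from audio and almost certainly a richer model.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-in data: rows are guitar tones, columns are acoustic
# features (e.g., spectral centroid, spread, crest). In a real system
# these would be extracted from recordings; here they are random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # 200 tones, 3 features
true_weights = np.array([0.9, 0.2, -0.1])   # centroid dominates 'brightness'
y = X @ true_weights + rng.normal(scale=0.1, size=200)  # listener ratings

model = Ridge(alpha=1.0).fit(X, y)
print("learned feature weights:", model.coef_)

# The learned weights tell us how much each acoustic feature contributes
# to the 'brightness' descriptor; a new tone can then be scored with
# model.predict(new_features), giving us a 'brightness' knob in effect.
```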
Over a recorded audio mix, this tool might allow you to meter timbral qualities in your mix against those of a reference mix that you, or a client, particularly enjoy. In fact, automatic mixing is one area in which machine learning has already made great strides, and the LANDR platform uses exactly this technology in the world of music mastering.11 Historically, an engineer or producer might spend many tedious hours automating the volume fader of a particular source – typically the lead vocal – to make sure that it sat correctly in the balance throughout a song. Done by hand, this is time-consuming; automating it frees the engineer for more creatively challenging, and ultimately enjoyable, work on the rest of the mix.
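As a crude sketch of what such fader automation might look like, the following function measures the short-term level of a vocal and nudges the gain toward a target. All thresholds here are arbitrary, and commercial systems are far more sophisticated (and smooth the gain per sample rather than per frame to avoid zipper noise).

```python
import numpy as np

def ride_fader(vocal, fs, target_rms_db=-18.0, frame_ms=50, max_step_db=0.5):
    """Crude automatic vocal rider: per-frame gain is nudged toward a
    target RMS level, with a step limit so the fader moves smoothly."""
    frame = int(fs * frame_ms / 1000)
    out = np.copy(vocal)
    gain_db = 0.0
    for start in range(0, len(vocal), frame):
        seg = vocal[start:start + frame]
        rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
        # How far is the gained signal from the target level?
        error = target_rms_db - (rms_db + gain_db)
        gain_db += np.clip(error, -max_step_db, max_step_db)
        out[start:start + frame] = seg * 10 ** (gain_db / 20)
    return out
```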
Beyond this, we might imagine future work combining machine learning for parameter control with the power of digital convolution, going beyond tone matching and amplifier profiling and instead facilitating the creation of entirely new timbres. This process owes a lot to morphing,12 in which a new timbre is created by combining particular timbral attributes from two source sounds. It is, of course, impossible to predict what creative people will do with new tools once they become available. We can see historical examples of the creative use of misused technology: electric guitar distortion originally arose from misuse of the amplification chain; equalisation, now used to carefully craft mixes and individual instrumental and vocal timbres, was originally a tool for correcting frequency-response problems in telephony; and sampling, which in the metal world has given us consistent – perhaps overly consistent – kick drums, more or less gave rise to an entire genre in the shape of hip-hop music.
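The digital convolution underpinning tone matching and amplifier profiling is itself straightforward to demonstrate. The sketch below imposes the captured response of a cabinet and microphone chain on a dry guitar signal; the filenames are hypothetical, and both files are assumed to be mono WAVs at the same sample rate.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Hypothetical files: a dry guitar DI track and a captured impulse
# response of a favourite cabinet/microphone chain.
fs_di, di = wavfile.read("guitar_di.wav")
fs_ir, ir = wavfile.read("cab_ir.wav")
assert fs_di == fs_ir, "resample one signal so the rates match"

di = di.astype(np.float64)
ir = ir.astype(np.float64)

# Convolution imposes the cabinet's frequency and time response on the DI.
wet = fftconvolve(di, ir)
wet /= np.max(np.abs(wet))   # normalise to avoid clipping

wavfile.write("guitar_through_cab.wav", fs_di, (wet * 32767).astype(np.int16))
```

Morphing, by contrast, would require interpolating between analysed representations of two sounds rather than simply filtering one through the other; this remains the more speculative end of the idea.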
Feature-based Comparison as a Machine Learning Classifier
The following section provides a walk-through, or ‘thought exercise’, as to how the ideas and techniques discussed earlier in this chapter might manifest in novel audio signal processing algorithms. It is important to emphasise that, at the time of writing, the idea illustrated here sits somewhere between what we know is now possible and science fiction.
First, let us reiterate our goal: we imagine a technology that applies acoustic measurement of specific features to describe the timbral attributes of a particular mix, perhaps in a sub-genre of metal, or with a desirable production characteristic – essentially, a well-mixed piece of music. Heaviness might be the goal, or clarity, or energy, or any number of timbral attributes that we enjoy in a song, album, artist or genre. We consider these as a hierarchical pyramid of idiom, with artists resting on a foundation of genre or sub-genre, and albums and songs on top of, or as divisions of, the artists’ sound.
Why do we need this technology? To assist the artist in realising a sonic goal for an album or song in the context of the genre. There are, of course, sonic characteristics that we might ascribe to any of the layers of the pyramid described previously, but that are difficult for anyone without highly developed production or engineering skills to achieve. To put this in the simplest layman’s terms: would it not be great if you could make your own recording sound exactly the way you want? This need not mean exactly like someone else’s recording – although matching equalisation, loudness or distorted guitar tones is a common task for the working recording engineer – but rather a platform that allows a vocalisable production quality to be achieved by an artist. Those of us who have experimented with recording our own music will know the frustration of a ‘bad’ recording or poor production quality more generally. Indeed, metal demands excellent production in comparison to many other genres, such as garage rock, which prizes a ‘rough and ready’ production style. Metal requires careful manipulation of the frequency spectrum, an understanding of harmonic distortion, triggering and phase in multi-microphone drum kit situations, and a host of other technical and aesthetic production decisions. In short, producing good-sounding metal is difficult.
Technology might now assist us, as many of the elements of production mentioned above are nowadays computer processes that operate on specific acoustic parameters. However, such technology would only be useful to our wider audience if it could first perceive the necessary processing differences. Thus, our goal lies initially in metering, or machine listening. This gives us three questions:
(1) What does our machine hear (source)?
(2) What is its goal (target or training material)?
(3) What processing might then be required to bridge the gap (action)?
Here, we have stepped into science fiction, although only just, as the world of machine learning is advancing so quickly that, even at the time of writing, we see this paradigm in deepfake news items.
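Purely as a signpost for the rest of this section, the three questions might map onto a program skeleton like the one below. Every function here is a placeholder: only the first stage is fleshed out later in the chapter, and the other two remain speculative.

```python
import numpy as np

def hear(audio: np.ndarray, fs: int) -> np.ndarray:
    """(1) Source: extract acoustic features from the audio under analysis.
    This stage is fleshed out in the feature-extraction sketch below."""
    raise NotImplementedError

def goal(reference: np.ndarray, fs: int) -> np.ndarray:
    """(2) Target: extract the same features from training or reference material."""
    raise NotImplementedError

def action(source_feats: np.ndarray, target_feats: np.ndarray) -> dict:
    """(3) Action: map the feature gap to processing suggestions
    (EQ moves, drive amount, compression settings, ...)."""
    raise NotImplementedError
```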
For the purposes of this chapter, let us illustrate this with a real-world example of the first stage – what does our machine hear – which is entirely possible already. Indeed, we could train a classifier, in machine learning terms, to deal with endless examples in the same manner, essentially creating the first stage of our machine listener: a meter. For this, we use audio stems of the main instruments of ‘A Secret Kiss’ from British doom metal band My Dying Bride’s 2020 album Macabre Cabaret.13 We are particularly interested in the elements that our production aid might help with: the relationship between the timbre of the guitars, the bass,14 and the kick and remaining drum balance. We therefore analyse four stems: guitar, bass, kick drum and remaining drum mix.
Acoustic Feature Extraction
We extract the following features from each audio stem:
Spectral centroid: A spectral ‘centre of gravity’ and commonly used psycho-acoustic parameter in music analysis (e.g., hi-hats will have a higher spectral centroid than a kick drum), especially as a correlate for ‘brightness’.
Spectral spread: The ‘instantaneous bandwidth’ of a spectrum, used as a metric for tonality, where if a pair of tones converge, the spread decreases (e.g., we might expect a smaller spectral spread for harmonic distortion than for inharmonic or partial distortion on a guitar tone).
Spectral skewness: A degree of symmetry around the spectral centre, also known as ‘spectral tilt’ in speech analysis, indicating the relative strength of harmonic and fundamental content (e.g., useful in analysing the amount of distortion on both guitar and bass stems). A positive skew indicates that the fundamental is more dominant than the upper harmonics or overtones; in our case, power chords are the most relevant example.
Spectral decrease: A measure of the decrease in amplitude across a magnitude spectrum. This parameter is not often seen in speech analysis but is common in musical instrument recognition (e.g., if we want to discriminate between the drums in a stereo mix and guitar-driven features, we might expect a minimal decrease in the guitar spectrum, with a much more dynamic result in the spectra of the drums).
Spectral flux: A simple metric of the amount of change in the spectrum over time (e.g., how consistent the spectrum is across the stems or the finished master we are listening to; especially interesting if we are considering emulating multiband compression parameters).
Spectral roll-off point: Like the spectral decrease, a marker of the bandwidth of a signal: the frequency below which a specified proportion of the total spectral energy lies. Helpful for instrument separation and also common in music genre classification.
Spectral crest: A ratio of the spectral peak to the mean (e.g., to reveal the amount of creative, production-informed compression).
Mel spectrogram: A time–frequency representation on the perceptually motivated mel pitch scale.
Mel-frequency cepstral coefficient delta-delta: A speech-derived correlate showing the rate of change of the ‘spectrum of the spectrum’ (e.g., useful for reducing complex spectra to their most relevant components).
We might consider training our model on many more acoustic parameters, but the list above gives features that might be most useful in the analysis of metal production timbres.
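By way of illustration, the following sketch shows how most of these features could be extracted with the librosa Python library. The stem filenames are hypothetical; spectral decrease, which has no librosa helper, is omitted, and spectral flux, crest and skewness are computed directly from the magnitude spectrogram.

```python
import numpy as np
import librosa
from scipy.stats import skew

def extract_features(path):
    """Extract a subset of the features listed above from one stem."""
    y, sr = librosa.load(path, sr=None, mono=True)
    S = np.abs(librosa.stft(y))   # magnitude spectrogram: bins x frames

    feats = {
        "centroid": librosa.feature.spectral_centroid(S=S, sr=sr).mean(),
        "spread":   librosa.feature.spectral_bandwidth(S=S, sr=sr).mean(),
        "rolloff":  librosa.feature.spectral_rolloff(S=S, sr=sr).mean(),
        # Flux: Euclidean frame-to-frame spectral change, averaged.
        "flux":     np.sqrt(np.sum(np.diff(S, axis=1) ** 2, axis=0)).mean(),
        # Crest: per-frame ratio of spectral peak to spectral mean.
        "crest":    (S.max(axis=0) / (S.mean(axis=0) + 1e-12)).mean(),
        # Skewness of each frame's magnitude spectrum, averaged.
        "skewness": skew(S, axis=0).mean(),
    }

    # The mel spectrogram and MFCC delta-deltas are matrices rather than
    # single numbers; keep them whole for plotting or model training.
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats["mfcc_delta2"] = librosa.feature.delta(mfcc, order=2).mean()
    return feats, mel

# Hypothetical files standing in for the four stems analysed here.
for stem in ["guitar.wav", "bass.wav", "kick.wav", "drums_rest.wav"]:
    features, _ = extract_features(stem)
    print(stem, {k: round(float(v), 2) for k, v in features.items()})
```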
It is beyond the scope of this chapter to provide every figure, but let us examine some types of visualisation and consider a simple example from our sample material.
Figures 6.1–6.5 show a few of the types of visualisation we can produce to aid our understanding of the numerical properties that acoustic feature extraction produces. Sound engineers will likely be familiar with two of these types of presentation: (1) visualisation in the time domain, with time on the X-axis (as we see in common digital audio workstations when the waveform of a sound signal is presented); (2) visualisation in the frequency domain, as in the mel spectrograms, which show us amplitude (via colour), time on the X-axis, and frequency on the Y-axis. As such a representation is three-dimensional, we can rotate it, as in the spectral skewness plot. In these figures, we can start to see immediate differences, just as we would hear them if we listened to the stems.
The next step of the AI learning process would be to train our machine listener to recognise these feature changes in order to provide novel metering or even suggest parameter adjustments based on a set of training material, for example, any set of stems the artist enjoyed and wanted to emulate. Note that we are not suggesting that sound engineers should be made redundant or that the art of record production might be reduced to a sequence of algorithms. New music will always require creativity and artistic interpretation. However, we might envision tools to take away some of the tedious learning curve in music production tasks and facilitate more creativity.
To spare the reader a complete overview of machine learning in the analysis of a training set, we will provide the shortest possible example here. Let us say we want to emulate a guitar sound from a stem and therefore train a meter on the parameters listed above. First, we train our meter by linking the acoustic features, in various ratios, to a dependent output variable: the timbral descriptor. We then try a new input, which is classified according to the same features. The difference between source (our new input) and target (our training material) gives a suggestion for changes in a number of acoustic features and their ratios. For example, to raise the spectral centroid, we might see a suggestion for some boost in the upper-frequency EQ of the guitar. Our fully realised system could then analyse the performance of this output in comparison to the input target (in machine learning terminology, this step is called validation) and adjust the suggestion according to ‘how far out’ it was (in machine learning terminology, the error function). We can see this type of classifier-optimisation problem in almost any example of machine learning. We end with a system that makes recommendations in terms of audio production character, including EQ, dynamic range, harmonicity of distortion, amount of distortion, balance between harmonic distortion and fundamental frequency, and spectral rate of change, all measured from our acoustic analysis parameters. However, we are not bound by the suggestions. This is a jumping-off point, which might take the next generation of musicians in directions that are, at the time of writing, rather unimaginable, but no less exciting for it.
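A compressed sketch of this train–validate–suggest loop follows. The feature matrix and descriptor ratings here are synthetic stand-ins, and the ‘suggestion’ is simply the learned-weight-scaled gap between source and target features; a real system would translate that gap into concrete processor settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Toy data standing in for extracted features: rows are guitar stems,
# columns the acoustic features listed earlier; y is a listener rating
# for one timbral descriptor (say, 'brightness').
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = X @ np.array([0.8, 0.3, 0.0, -0.2, 0.1, 0.05]) + rng.normal(scale=0.1, size=300)

# Train, then validate: the held-out error is our 'how far out' measure.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = Ridge().fit(X_train, y_train)
print("validation error:", mean_squared_error(y_val, model.predict(X_val)))

# Suggestion step: the feature gap between a new source stem and the
# target (training) material, weighted by learned importance, points at
# which features to move, e.g., raising the spectral centroid via an
# upper-frequency EQ boost on the guitar.
source = rng.normal(size=6)       # features of our new input
target = X_train.mean(axis=0)     # features of the material we emulate
suggestion = model.coef_ * (target - source)
print("weighted feature adjustments:", np.round(suggestion, 3))
```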
Conclusion
Psycho-acoustic research generally regards timbre as one of three perceptual attributes of sound, along with loudness and pitch. This list of perceptual attributes can be expanded to include perceived duration, location and reverberant environment, the latter two being essentially a series of cues about the spatial properties of the sound source or, in the case of a mix, the placement of a sound source in one, two or three dimensions: width, depth and height. The range of perceptual descriptors included in timbre studies is reflected in their acoustic correlates and in the lack of a unidimensional scale suitable for timbre. Subsequent research has endeavoured to quantify these interrelationships in an effort to move towards a robust measure of timbre, usually through a combination of acoustic analysis and perceptual testing. The acoustic correlates of timbral descriptors determined by these approaches include harmonicity, which is particularly relevant for the distorted guitar, and various combinations of amplitude envelope and spectral or spectro-temporal variation.
Metal has its own specific timbral descriptors, which are a combination of performance and technology, all of which can fall under a pyramid with ‘heavy’ encompassing each subsequent timbral attribute. Vocal performance styles like growling and screaming are almost uniquely found in metal, and, much like the distortion used on guitar or bass and the triggering and layering used in drum production, technology can be used to enact large metal-specific timbral variation at the point of recording, mixing or live reproduction. Some work has been done by researchers looking at these metal-specific descriptors in the same vein as the work of earlier psycho-acousticians, who looked at correlates for other perceptual attributes like loudness and pitch. In this work, we see that timbral attributes can be either acoustic or descriptive, and some descriptive terms have been shown to overlap or agree in terms of their acoustic correlates. The nomenclature is mostly universal, although a large number of labels and descriptors have not yet been acoustically quantified.
Work to reduce the range of descriptors, perhaps down to a core set that is acoustically independent and a subsequent set with acoustic overlap, would be useful for our ‘next’ ideas. Attributes with acoustic overlap, or indeed attributes that appear to have contradictory acoustic correlates, would also require the ratios between their acoustic correlates to be quantified (for timbre metering, matching and designing). In the ‘next’ ideas, we can imagine combining these approaches with the now readily available machine learning techniques to provide tools for musicians and producers that help with timbre: to match existing timbres (streamlining the search for a good guitar tone or emulating favourite tones) and, perhaps most excitingly, to craft new timbres.