
Machine Listening as Sonification

Published online by Cambridge University Press:  19 December 2024

András Blazsek*
Affiliation:
University at Buffalo, The State University of New York, Buffalo, NY, USA

Abstract

Machine listening takes place through sonification. Sound is treated as data by a computer that listens by deconstructing and reconstructing sound. To better explore the aesthetic, relational and ontological aspects of machine listening, this article reflects upon Fourier analysis, which is vital for machine learning algorithms. It then outlines the listening modes articulated by French composer Pierre Schaeffer and updates them for the new material conditions of contemporary listening. It proposes that a new mode, identified while working with sonification, be added to Schaeffer’s classic array. It explores non-human listening among machines that listen with other concerns beyond the human need to interpret content. Thus, this article makes a particular strategic move: it centres on machine listening, which enables computers to perform analyses of a soundscape, and on resynthesis, a mode of sonification that treats sound as data, in order to reintroduce the ‘sound object’ nature of resynthesised sounds. It looks at sonification through discourse analysis and media archaeology, and gives importance to experiments in art that privilege sensorial and affective dimensions often ignored by scientific approaches. It proposes that thinking about machine listening through sonification can assist in developing sensibilities that are more responsive to the present sonic ecologies between human and non-human listeners.

Type
Article
Creative Commons
CC BY-NC-SA
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. INTRODUCTION

Machine listening takes place through sonification. It transforms sound into data that it analyses and rewrites again as sound, translating inaudible dimensions of the audible into the sonic realm. Conventional discourses around machine listening, however, do not discuss it as such; instead, they focus on it as a metaphor for human listening. By asking about the degrees by which they are separated or connected, this article analyses the relationship between human and machine listening.

French composer Pierre Schaeffer’s explanation of listening reveals a practice that shifts around different intuitive modalities; he says, ‘I understand [je comprends] what I was aiming to listen to [mon écoute], thanks to what I chose to hear [entendre]’ (Schaeffer 1966: 74). He also defined the raw perception of sound as ‘to perceive aurally’ (ouïr) (83). The separate but causal modes are also interchangeable: we first hear, then aim to listen by understanding the sound. For Schaeffer, the practice of listening is bound up with a series of relations: ‘I can understand the exact cause of what I have heard by connecting it with other perceptions or by means of a more or less complex series of deductions’; sometimes it is a lack, or a disruption, that defines this relationship (78). Machine learning is typically presented as a technical process that mimics human behaviour; the question, therefore, is to what extent machine listening reproduces our hearing process, and what the differences and similarities reveal.

The association between machine and human listening distracts attention from the non-human kinds of relations among variables that develop in the machine as it listens in the way that only a machine can. These kinds of relationality are typically reduced in standard machine listening discourse to ‘interpretation’ and ‘understanding’, both of which separate objects to create mutually limiting relations of meaning-making between them. Thinking about machine listening as sonification permits an analysis that examines it as a non-human process that humans are using to create epistemological modes that are more-than-human, and aesthetic modes that are more-than-music.

Calling the process ‘listening’ already limits how we define the engagement that takes place between data and the analysis process. The discourse of sonification asks the human listener to understand the relations between values and sound. I argue that machine listening is a similar relational engagement between sonic data – samples, sonograms and scalograms – in multidirectional translation processes that operate hand in hand with the intuition of the human listener to understand (comprendre) it as different. Hearing a perfect mimicry of someone’s voice, and not recognising it as different, proves our deafness to the more-than-human epistemic turn.

Indeed, among the most emblematic examples of machine listening – that which is perhaps most appealing to the general public – is its use in making human voices into simulacra of themselves: the artificial voice that we perceive as sounding like our voices can be made to say anything we want. With speech reproduction, anyone can record and reproduce voices to the point of absolute authenticity. The stakes are high because one of the affordances of sonification that applies to tools for speech reproduction, speaker recognition or voice anonymisation is scalability. Networked computer systems allow us to train models on publicly available voice data exponentially faster. Voice data are also abundant: the Voxceleb2 voice dataset is constructed from 7,000 speakers gathered from YouTube videos (Nagrani et al. 2020: 4). This has led to privacy concerns and to the recognition of ‘audio recordings’ as data that reveal ‘personal information through acoustic patterns’, which, for example, are now also included in the European Union’s General Data Protection Regulation (GDPR) (Weitzenboeck et al. 2022: 185).

Meanwhile, the study of sonification has fallen behind the discourses that took a new turn with this popularisation of machine learning and advanced signal processing. To explore machine listening beyond the technicality of the practice, and the audible as a non-human listening mode transforming human listening modes, this article examines machine listening through sonification, framing it both through a media archaeological study of Fourier analysis – vital for machine learning algorithms, like those used in speech reproduction – and through the modalities of acousmatic listening introduced by Schaeffer in order to think about sounds whose sources are unseen.

2. WHAT IS SONIFICATION?

A representative mode of sonification is the deep-fake voice. A computer program that listens, deconstructs, learns and reconstructs voices treats sound as data using machine learning algorithms. The only characteristic that cannot be erased from speech reproduction is the sonic trace of the microphone that recorded the training samples. However, with some patience and knowledge of sound editing, even this can be cancelled out, and a convincing artificial voice can be produced.

Scholars who study sonification define it as ‘the transformation of data relations into perceived relations in an acoustic signal for the purposes of facilitating communication or interpretation’ (Hunt and Hermann 2011: 274). In addition to the ends of ‘communication’ and ‘interpretation’, what tends to be stressed in such definitions is the ‘transformation’ part of the process. Less emphasis is placed on ‘relations’ which, almost hidden, are the way sonification works with the material conditions of listening. Physicist and new materialist Karen Barad would call this sonification’s ‘ethico-onto-epistemology’ (Barad 2007: 185): the inseparability of theories of knowledge and theories of being from ethical questions concerning responsibility to the world. The attention to sonification’s relationality is, therefore, a critical concern.
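
In its simplest, parameter-mapping form, this ‘transformation of data relations into perceived relations’ can be sketched in a few lines of code. The Python example below is a minimal illustration only: the data series, pitch range and output file name are assumptions introduced for the sketch, not drawn from the sources discussed here.

```python
import numpy as np
from scipy.io import wavfile

# Hypothetical data series (e.g. daily temperature readings); the values are illustrative.
data = np.array([12.0, 14.5, 13.2, 17.8, 21.1, 19.4, 15.0])

sr = 44100                 # sample rate in Hz
note_dur = 0.3             # seconds of sound per data point
lo, hi = 220.0, 880.0      # pitch range onto which the data range is mapped

# Linear mapping: relations between data values become relations between frequencies.
norm = (data - data.min()) / (data.max() - data.min())
freqs = lo + norm * (hi - lo)

t = np.linspace(0, note_dur, int(sr * note_dur), endpoint=False)
tones = [0.3 * np.sin(2 * np.pi * f * t) for f in freqs]
signal = np.concatenate(tones).astype(np.float32)

wavfile.write('sonification.wav', sr, signal)
```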

The definition of sonification as a relational practice – the ‘transformation of data relations into perceived relations’ – for the purposes of facilitating human-associated relations called ‘communication or interpretation’ foregrounds relationality between people, values and measurement. The translation between data and sound, for example, already implies that sonification will occur by crossing epistemological boundaries, moving between scientific domains such as medicine, physics and statistics, and between artistic domains such as music and sound art.

Philosopher Robin James argues that when social scientists describe social behaviour through words such as ‘flow’, ‘rhythms’ and other acoustic analogies, they depoliticise the question of how human relations are described through mechanics and statistics (James 2019: 161). The idea of using abstraction to break down barriers between people and scientific depictions of people resonates with the discourse of sonification, which works with multilayered representations when it translates real, live data into sound. James, who, like Barad, is influenced by new materialist approaches, takes the discussion further, writing that ‘new materialism is a constituent of the sonic episteme to the extent that it uses acoustic resonance as an idealised ontological model that translates the mathematical relationship behind neoliberal market logic’ (ibid.: 89). She is among the few who engage with the ‘political baggage’ of abstraction that discourses around sonification often do not account for.

On a broad level, sonification makes connections between fields of knowledge, practice and politics; on a granular level, it operates on relations between intensities.1 When it makes data into sound intensities, it operates between scales where data relations can express themselves; when sonification crosses disciplinary boundaries, it also operates in scale. Relationality is subordinate to scale, or scale defines the spectrum of micro and macro relationships. Intensity would be the focus in any effort to engage with the non-human dimensions of relationality in sonification.

Artist and theorist Brandon LaBelle speaks of the ability of sound to be a conduit that transmits and builds relationalities. According to him, the force of matter (bodies and things) collaboratively produces linkages between the different actors in the network, and in the process ‘subjectivity is defined according to a state of interruption … I am, as an intensity, and interference to others’ (LaBelle 2018: 67). As LaBelle indicates, the notion of subjectivity does not only include human actors. In the larger ecological milieu, forces are produced by inorganic things, such as data sensors, algorithms and adaptive systems, that cohabit with living layers also made up of different scales.

Just as the multi-relational dimension of sonification and the scalability of relationships are usually obscured by its transformational capacity (and, when they are explored, the focus falls on human actors and socialities rather than more-than-human implications), standard definitions of sonification make similar assumptions about listening, which is presumed to be relatively static. Ethics and ontology disappear where sonification is only understood as helping the human interact with the world through information exchange rather than as a process that acts on human beings in a way that fundamentally changes them proprioceptively. Sonification upsets the abstractions of inside and outside that constitute knowledge, thus allowing what is known, who knows and how they know to be challenged.

In data sonification, a listener interpreting data as data relationships implies a different intention from a machine learning model recognising or reproducing a voice. Machines interpret sound data not because a human wants to interpret the data, but in order to use the performative power of machines to recognise nodes of connection or, as LaBelle suggests, interferences. LaBelle explores interference through the concept of the ‘overheard’ as a force that captures our attention through listening. He writes that our ‘body [is] always already defined by the logic of vibratility,2 which makes one available to an array of controls’: listening to sonic vibrations and interruptions makes us both connected and, to some extent, restrained (LaBelle 2018: 87).

Media scholar Wendy Hui Kyong Chun talks about how machine ‘recognition’ – speaker recognition as the mechanism of recognising one voice over another, as in, for example, Amazon’s virtual assistant technology Alexa – ‘implies a historical relation and response – and power’ (Chun and Barnett 2021: 228). Recognition for a computer is always preceded by classification; the computer selects things from labelled archives and, as Chun stresses, this process is not only a technical means of sorting, but also a discriminative factor embedded into algorithms through historic frameworks of power.

In summary, the ‘figure’ of relationality is an integral part of the sonification process, connecting it on broader levels to data-based sound-modelling practices, including machine listening. Scalability provides open access to transformations of speech and data into new vibratory patterns. These patterns call for a mode of listening to relationalities. Listening to the machine and listening with the machine describe different relationships. Hearing with the machine makes up the more-than-human listening mode to which we are not accustomed.

3. A MEDIA ARCHAEOLOGICAL READING OF FOURIER ANALYSIS

Machine listening is a data sonification technique not traditionally meant for interpretation. Its investigation through media archaeological research makes visible the material transformation of sound intensities into a non-human realm and its relationship to human listening. What role do data and history play in thinking about machine listening as the relational practice of sonification?

When the computer processes sound it manipulates time and frequency. Therefore, the archival processes that it performs are multilayered. Historicity emerges through the growing archive of voices collected in personal recordings, public microphones, underwater surveillance and sonar systems submitted for scientific and ethical analysis. Algorithms such as the fast Fourier transform (FFT) and joint time-frequency scattering (JTFS) sort through digitally recorded sound at a speed inaccessible to humans. Despite the increasing speed, neither of these processes is true real-time analysis. German media theorist Friedrich Kittler argued that ‘Real time analysis simply means that deferral or delay, dead time or history are processed fast enough to move on to the storage of the next time window’ (Kittler 2017: 14).3 Computer time is a delayed analysis window in connection to micro archives, and not the actuality of the present.
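
Kittler’s point can be made concrete with a short sketch of window-by-window analysis, in which every spectrum describes a block of the just-stored past rather than the present instant. The Python example below is illustrative only; the window length, hop size and the noise ‘stream’ standing in for live input are assumptions.

```python
import numpy as np

sr = 44100
n_fft = 2048    # analysis window length in samples
hop = 512       # hop size between successive windows

# A stand-in for an incoming audio stream: one second of noise.
stream = np.random.randn(sr).astype(np.float32)
window = np.hanning(n_fft)

spectra = []
for start in range(0, len(stream) - n_fft, hop):
    frame = stream[start:start + n_fft] * window
    spectra.append(np.fft.rfft(frame))   # one 'snapshot' of the recent past

# Each spectrum covers a window that has already been stored: the analysis is of a
# deferred past, never of the present instant.
print(f'minimum delay per analysis window: {1000 * n_fft / sr:.1f} ms')
```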

A media archaeological reading of Fourier analysis, the theory behind the FFT – the algorithm that breaks a signal down into frequencies – can help illustrate the proposition that machine listening involves sonification. Fourier analysis is made up of multiple steps, the general nature of which would be expected in any kind of computer processing: the first makes components of a soundscape addressable, deconstructing it into numbers that are written into a dynamic archive; sometimes resynthesis then takes this archive and turns it back into sound. Though the process is not unique within computation, it is notable for the argument being made here that, with resynthesis, the FFT becomes a mode of sonification that points towards an understanding of sound as data and archives. This is significant because using sonification as a method to think about Fourier analysis allows cross-over between the acknowledgement of relationalities in sonification and ways of rethinking how machine listening works.

The practice of media archaeology is, in one sense, historical: performativity, re-enactment, and artistic sensualities play an important role, in addition to discourses around tools, techniques and technology (Fickers and van den Oever 2019: 54). Media archaeological experimentation, as German media theorist Wolfgang Ernst describes it, is a dual system that studies media communication as a tool while at the same time recognising that it operates on the level of signals. Therefore, every time we go back to a media object, we are, in one sense, interfacing with a cultural artefact that is epistemological, and in another sense, with a ‘time-invariant event’ that only operates as it does in the present (Ernst 2013: 185).

Kittler argued that ‘media facilitate all possible manipulations: acoustic signals control the optical signals of light consoles, electronic language controls the acoustic output of computer music’ (Kittler 1999: 50). He also studied technical instruments to understand more about the operation of media, not as a sign but as a signal, and found that the interchangeability between sound and visual media is an experimental technique that artists had already been exploring; hence, the technicality of transformation balances between creative and scientific modalities. For example, the Hungarian American artist László Moholy-Nagy already had the idea, in 1923, to produce a vinyl disc and transcribe graphical information onto it. The result of such an experiment would have produced a sound that had never otherwise existed and opened the gate between mediums that could be ‘transposed’ onto each other (Kittler 1999: 46).

Kittler wrote about Fourier analysis providing a mathematical formula for breaking down complex waves into the summands of their uniform wave components; moreover, it similarly established modularity in signals that could then be added, divided and subtracted. The discourse of frequencies started with the emblematic paper of French mathematician Joseph Fourier, the ‘Analytical Theory of Heat’ (1822), and was followed by the establishment of the modularity of physical waves. While Fourier’s study of heat propagation does not speak about acoustic waves, his innovation described heat conduction with a universal formula that can be applied to other types of vibrations (Lostanlen et al. 2019a: 4). This contributed to research on sound synthesis and, during World War II, helped the military in efforts to build the precursor technologies of Fourier analysis. The vocoder transfers the spectral envelope of one signal onto another; in the military world, this allowed safe communication between allies, while in the music industry, the same technology initiated the transfer of graphical shapes onto the signal, as suggested by Moholy-Nagy (Kittler 1999: 48).

Shortly before World War II, in 1939, Russian painter and acoustician Boris Yankovsky also experimented with a ‘sort of Periodic table of Sound Elements’ (Smirnov 2013: 201). He defined an analogue method for audio computing that analyses sound structure and character, and classifies and resynthesises sound from a library of drawn waveforms. In an unpublished manuscript, Yankovsky wrote, ‘graphical representation of the sound wave could be analysed and represented as the Fourier series of periodic function (sine waves)’ (Smirnov 2013: 209). With these experiments and inventions, and because of the modularity of electromagnetic systems and later digitisation, sonification became an everyday practice.

With the advent of finite-state machines and digital signal processing – which works by discretising voltage flows into countable integers – the method also enabled the mapping of signals onto the time-frequency axes. In 1965, mathematicians James Cooley and John Tukey developed a more advanced version for transforming soundscapes into numbers: this was the FFT (Lostanlen et al. 2019a: 8). Kittler saw this method as related to human perception when he wrote ‘Fourier analysis hears’: it listens to everything a computer processes, with the delay needed to compute the process (Kittler 2017: 12). At the same time, he differentiated between the FFT and the hearing process according to what periodicity the two measure. He wrote that the ears ‘do not determine possible periodicities of acoustic events [FFT] but only whether there is any periodicity at all … [the ears] focus on whether after a measurable delay the received signal repeats itself’ (ibid.: 14). Every FFT process is a remediation of the past; however, the degree to which the past folds back, and the extent to which it is perceptible to people, makes up the relationalities of more-than-human perception.
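
Kittler’s distinction can also be sketched computationally. In the hedged Python example below, the FFT names which periodicities are present, while a simple autocorrelation – used here only as an analogy for the delay-based check Kittler attributes to the ears, not as a model of hearing – asks whether the signal repeats at all after a measurable delay; the test tone and noise level are arbitrary.

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.random.randn(sr)  # a 200 Hz tone plus noise

# FFT view: which periodicities are present, and at which frequencies.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / sr)
print('strongest frequency:', freqs[np.argmax(spectrum[1:]) + 1], 'Hz')

# Delay view: does the received signal repeat itself after some lag at all?
ac = np.correlate(x, x, mode='full')[len(x) - 1:]   # autocorrelation, non-negative lags
lag = np.argmax(ac[20:]) + 20                       # skip the zero-lag peak
print(f'signal repeats after roughly {1000 * lag / sr:.1f} ms')
```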

Human sound perception prioritises how things are heard; according to the principles of psychoacoustics – which, historically, has studied the perception of sound events as they interplay with the physical properties of the ear and the primary auditory cortex – it is always open to receive any signal (Clarke 2005: 12). Therefore, as Kittler pointed out, the physical properties of sound such as length and frequency, which are most of the time referred to as spectro-temporal properties, are as important for human hearing as for the computer. The human ear is extremely sensitive to chronometric measurements: according to Ernst, ‘two auditory signals can already be differentiated after 2 to 5 milliseconds, while an interval of 20 to 30 milliseconds is needed for visual stimuli’, yet this still cannot compete with the speed of current computing devices (Ernst 2016: 138).

German composer and sound artist Florian Hecker’s work is an example that further illuminates machine listening as a kind of sonification. Hecker uses sonification to audify listening through the observation of the organ of the ear. Hecker’s Halluzination, Perspektive, Synthese (Hallucination, Perspective, Synthesis, 2017) exhibition explores joint time-frequency scattering as developed by researchers Joakim Andén and Vincent Lostanlen (Blom 2021: 101). Hecker reverse-engineered the complex process, which involved the training of neural networks to understand and model the mammalian primary auditory cortex and map out neuron responses to specific time-frequency components of sounds.

Hecker used Andén and Lostanlen’s formula for sonification, proposing to attend to a sound that is described by the listening process, not of a mammal, but of a machine that learned it after studying mammals (Blom 2021: 102). Hecker thus used sonification to present listening as both the material of Resynthese FAVN (2017) – installed for the exhibition in Vienna – and the way that the piece can be accessed. It is not typical that sound, or hearing itself, is what sonification uses as its raw material. Hecker’s work is a quintessential case of sonification not being used to represent data from another domain but being used to reflect on the features of sound itself as a medium, treating sound as an archive.

The artistic nature of performing and activating listening in the work is also emblematic; the piece re-enacts the listening practices of machines and activates human listening for potentially more-than-human listening. This example furthers the argument that listening experiments with resynthesised sounds can become essential for learning about how the different modes of listening are layered physiologically and technologically, as well as the degrees to which they differ from and resemble one another. These layers now fold on top of each other and are only separated by the speed and contact time between the sound and the analysis process.

4. TECHNICAL DETAILS OF FREQUENCY-DOMAIN PROCESSING

A better understanding of the technical details of sound processing in the FFT further clarifies the way a computer performs sonification as its distinct way of processing sound. Most of the sounds humans hear, on the audible spectrum between 20 and 20,000 Hz, are made up of combinations of single sinusoidal frequencies. Tools such as spectrograms visualise the audible frequency spectrum by representing individual sound frequencies. Canadian psychologist Albert Bregman explains that, physiologically, the human auditory system separates the different frequency components of complex sounds according to intensity and phase, and then distributes them through different neural pathways; he calls this the neural spectrogram (Bregman 1990: 733). Comparably, the FFT is a method that allows the computer to break down complex harmonic motions into a set of simple sine waves that can be further simplified on the basis of their frequency, amplitude and phase information (ibid.: 731).4
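
A short Python sketch shows what such a decomposition yields in practice. Assuming a synthetic mixture of two sine components (the frequencies, amplitudes and phase offset below are arbitrary), the FFT returns, for each bin, precisely the frequency, amplitude and phase information described above.

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr
# A complex sound built from two sinusoidal components.
x = 0.8 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1320 * t + 0.5)

spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / sr)

# The strongest bins recover the frequency, amplitude and phase of each partial.
peaks = np.argsort(np.abs(spectrum))[-2:]
for k in sorted(peaks):
    amp = 2 * np.abs(spectrum[k]) / len(x)   # rescale bin magnitude to sine amplitude
    phase = np.angle(spectrum[k])
    print(f'{freqs[k]:7.1f} Hz  amplitude {amp:.2f}  phase {phase:+.2f} rad')
```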

Technically, the FFT in computers periodically creates a snapshot of the sound stream in which separate frequency components become addressable. American electronic music composer Curtis Roads writes that the FFT represents the ‘input sound as a sum of harmonically related sinusoids – which it may or may not be’ (Roads 2002: 240). Through the decomposition of complex waves into sine waveforms, each partial will hold the frequency, amplitude and phase characteristics of the original sound. However, after the inversion of this process, the new signal, as Roads suggests, ‘may or may not’ be fully identical to the original input signal (ibid.: 240).

What happens to the information in these components? There are numerous ways of using the extracted information: cross-synthesis applies the spectral envelope or the amplitude of one signal onto another, and an especially useful aspect of FFT-based time-stretching is that it allows for manipulating the temporality of sound without changing its original pitch (Settel and Lippe 1995: 1). FFTs create an open flow of numbers in which these component frequencies are cross-synthesised, stretched, and then either recomposed into the original time-domain format with the inverse fast Fourier transform (IFFT), or used to influence something else. For example, the presence of low-amplitude values among these component frequencies is a good indicator of unwanted noise; using the FFT, they can be separated from the more important, higher-amplitude frequencies before the result is turned back into a signal with an IFFT.
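
That noise-separation step can be sketched as a simple spectral gate. The Python example below is a hedged illustration rather than a production denoiser; it uses SciPy’s STFT/ISTFT pair to stand in for the FFT/IFFT cycle, and the test tone, noise level and threshold are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

sr = 22050
t = np.arange(sr) / sr
clean = 0.6 * np.sin(2 * np.pi * 330 * t)
noisy = clean + 0.05 * np.random.randn(len(t))   # broadband noise spread thinly across many bins

f, times, Z = stft(noisy, fs=sr, nperseg=1024)   # forward transform, window by window

# Low-amplitude bins are treated as unwanted noise and silenced before resynthesis.
threshold = 0.05 * np.abs(Z).max()
Z_gated = np.where(np.abs(Z) > threshold, Z, 0)

_, denoised = istft(Z_gated, fs=sr, nperseg=1024)  # inverse transform back to the time domain
```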

This is also the information that is used to create the spectrogram image, by mapping the frequency amplitudes onto a pixel matrix as colour values. Oscillator-based resynthesis is not always the best option for remapping the signal, but to better illustrate the connection between sonification and spectral processing, it can be explained through an analogy: resynthesis is a sculpting mechanism, something like the creation of a virtual mould of the different frequencies existing inside a sound. Resynthesis then uses this descriptive information about the shape of sounds to recreate those shapes virtually from uniform oscillator sounds.
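
The mapping from frequency amplitudes to pixels can be seen directly in code. In the sketch below – illustrative only, with an arbitrary test sweep and output file name – the magnitude matrix returned by the analysis already is the image: rows are frequency bins, columns are time windows, and each value becomes a colour.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

sr = 22050
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * (200 + 400 * t) * t)   # an upward sweep, so there is something to see

f, times, Sxx = spectrogram(x, fs=sr, nperseg=1024)

# Sxx is the pixel matrix: frequency bins by time windows, plotted here in decibels.
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), shading='auto')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.savefig('spectrogram.png')
```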

Resynthesis can also be understood as data sonification, where digitally recorded sounds are treated as data. Such is the case with parameter-based sonification as described by American composer Mark Ballora, who writes about it as a method of mapping data to frequency, or to other parameters that influence the sound by changing its characteristics (Ballora 2014: 32). In the case of oscillator-based resynthesis, data such as frequency, amplitude and phase information are represented as numbers mapped directly to sinusoidal oscillators. Because of its extensive central processing unit (CPU) usage, this resynthesis mode is less popular than the IFFT in music production.
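
A hedged sketch of oscillator-based resynthesis follows: an analysis frame is treated as data, and its loudest bins are mapped directly onto a small bank of sinusoidal oscillators. The source mixture and the number of oscillators are arbitrary and, echoing Roads, the result approximates rather than duplicates the original.

```python
import numpy as np

sr = 44100
n = 4096
t = np.arange(n) / sr
original = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.2 * np.sin(2 * np.pi * 660 * t)

# Analysis: the FFT frame is treated as data describing the sound.
spectrum = np.fft.rfft(original)
freqs = np.fft.rfftfreq(n, 1 / sr)
mags, phases = np.abs(spectrum), np.angle(spectrum)

# Resynthesis: the loudest bins drive a bank of sinusoidal oscillators.
strongest = np.argsort(mags)[-8:]
resynth = np.zeros(n)
for k in strongest:
    amp = 2 * mags[k] / n
    resynth += amp * np.cos(2 * np.pi * freqs[k] * t + phases[k])
```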

Since the advent of visual programming languages such as Max/MSP and Pure Data (an open-source environment closely related to Max), and the code-based language SuperCollider, the FFT has been part of libraries for accessing spectral-domain information to affect the frequency spectrum’s density or temporality. Externals in Max/MSP, such as bonk~, offer onset- and note-following features, which allow live performers to trigger other processes based on the musical notes they play on their instruments. The Fluid Corpus Manipulation toolkit (FluCoMa), one of the most recent additions to Max/MSP with machine listening features, also uses the ‘audio descriptors [FFT] algorithm as the basis of their computation’ (Moore n.d.).

To take a step further in defining the degrees by which the machine processes the outside sonic world, it is helpful to return to the work of researchers Andén and Lostanlen, with whom Hecker collaborated. They consider JTFS (joint time-frequency scattering) a more appropriate model to describe the neurophysiological processes taking place inside the primary auditory cortex (Andén et al. 2015: 1). JTFS comprises multiple processes: its ‘representation characterises time-varying filters and frequency modulation’; the frequencies for capturing the sound features are calculated by a specific kind of spectrogram called the scalogram (ibid.: 6). The scalogram works with continuously computed wavelets – short samples of wave-like oscillations – to plot the analysed sound on the time-frequency axes (MathWorks n.d.). As a result of the continuous wavelet transform (CWT), the resolution in time and frequency becomes variable. According to Lostanlen et al., ‘neurophysiological experiments have demonstrated that … the wavelet scalogram … can be regarded as computationally analogous to the cochlea’ (2019a: 10). Therefore JTFS – which incorporates the wavelet scalogram – provides a better representation of the conditions of hearing, also serving as a better basis for machine learning tools to understand and listen to sound.
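
A scalogram can be sketched in a few lines, assuming the PyWavelets library and a Morlet wavelet; the test signal (a tone joined halfway through by a higher one) and the scale range are arbitrary. Unlike the fixed grid of the short-time FFT, each scale stretches or compresses the wavelet, so the time-frequency resolution varies across the plot.

```python
import numpy as np
import pywt  # PyWavelets

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 400 * t) * (t > 0.5)

# Each scale probes a different frequency band with a stretched or compressed wavelet.
scales = np.arange(2, 128)
coeffs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1 / sr)

# Rows are scales (frequencies), columns are time: once again an image of the sound.
scalogram = np.abs(coeffs)
```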

Scientists employ formulas like JTFS to train neural networks to focus on specific sounds present in a recording and separate them from background noise, echo or other voices in the soundscape (Lostanlen et al. 2019b: 3). An interesting prospect of these findings is that audio becomes addressable through snapshot images such as spectrograms and scalograms, where visual patterns indicate pitch and rhythm, and colours represent varying sound energy. Hence, for the computer, the translation between pictorial shapes and the audio signal is a simultaneous activity, while for the human, the interaction between visual and aural stimuli can never really be quantified.
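
How audio is handed to such a network as an image can be sketched briefly. The example below assumes the librosa library, a mel-scaled spectrogram rather than JTFS, and a hypothetical file name; it stops at the point where the array would be passed to a convolutional model.

```python
import numpy as np
import librosa

# Hypothetical field recording; the file name is illustrative.
y, sr = librosa.load('soundscape.wav', sr=22050, mono=True)

# The audio becomes an 'image': a mel-scaled spectrogram whose pixel values are decibels.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, n_fft=2048, hop_length=512)
S_db = librosa.power_to_db(S, ref=np.max)

# A convolutional network would treat this array like a one-channel picture, learning
# visual patterns that stand for pitch, rhythm and varying sound energy.
image = S_db[np.newaxis, np.newaxis, :, :]   # shape: (batch, channel, mel bins, time frames)
print(image.shape)
```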

5. NON-HUMAN LISTENING

Studying the media archaeological dimension of Fourier analysis also introduces the complicated connection between machine listening and what Ernst theorised as media archaeological listening. Kittler saw Fourier analysis as a mathematical model that helped in the quantification of information and, therefore, acted as a translator of cultural products from the symbolic to the signal (Parikka 2012: 35). Similarly, Ernst writes that ‘media-archaeological analysis, by computer-aided fast Fourier computations, of speech below the elementary units of what can be expressed by letters (vowels, consonants) gives access to the material dimension (the physical world) of a cultural moment’ (Ernst 2013: 59).

According to Ernst, recorded sounds only become part of the semiotic system of signs after the human ear decodes the ‘psychological sensual data with cognitive cultural knowledge’ (Ernst 2013: 61). However, there is also another process taking place when the data as a signal is activated as part of a computational process in media devices. As with Kittler, for Ernst, recording devices are not only tools for archiving or manipulating culture; rather, they can occupy the position of the media archaeologist through their technological ability to listen to sound frequencies in a non-semiotic fashion. Ernst wrote that when a computer translates a voice event by non-semiotic means, ‘the machine is the better media archaeologist of culture, better than any human’ (Ernst 2013: 62).

The idea of media archaeological listening, as theorised by Ernst, is based on his long-term research into the micro-temporal operation of technical media. Read through Kittler’s ideas, which expand on the operational function of devices, media archaeological listening can be further theorised as a listening mode that intervenes on the discursive level of knowledge production. For Ernst, the layers of techno-epistemology are organised through temporal layers, as in, for example, computer architectures operating within the time and frequency domains. Frequency is essentially measured time. Ernst’s non-human mode of listening defines a mode concerned not with the effect of sound, the interpretation of information, or culture per se, but with listening to the machine that listens its way out of the tendency to relate to sound predominantly as signs. This method of listening adds agency to the computer that reorganises the past into snapshots of sound bursts: a ‘picture gallery’ of tiny segments – phonemes, a glissando, a whisper – stored as already labelled images in archives.

6. A TECHNO-EPISTEMOLOGICAL READING OF SPECTROGRAMS

The FFT creates a spectrogram that gives visual feedback on the frequencies present in a sound. In psychology, Bregman has argued that while a spectrogram of a complex tone represents all components of a sound event, the visual information of the individual frequencies provides no information about the possible source, or sources, of each frequency. Some could have been triggered by the same sound event, or could all have been part of different sounds that accidentally occurred at the same time (Bregman 1990: 9). Following this argument, it could be said – paraphrasing Kittler – that if Moholy-Nagy had realised his project of transcribing graphical sound onto a vinyl disc, and this recording was played back and analysed with a spectrogram, the observer of the spectrogram would not be able to tell from the image that the sound was the result of a graphical mark on a vinyl disc (Kittler 1999: 46).

Using Bregman’s argument that a visual spectrogram does not reveal the sources of the represented sounds, and applying it to the machine’s analysis of spectrograms and scalograms to recreate sound, shows that this analysis is, in part, a masking mechanism that conceals the origin of the sound source. The spectrogram of resynthesised sounds conceals not just the physical sources of the sounds, but also their ontological sources and the process of remediation.

This relates to the often-repeated argument of Canadian media theoretician Marshall McLuhan, who wrote that ‘“the content” of any medium is always another medium. The content of writing is speech, just as the written word is the content of print’ (McLuhan 2003: 8). Thinking through McLuhan’s idea of remediation, media scholars Sarah Kember and Johanna Zylinska found it essential to describe the vitality of media today, when the process of remediation is continuous; they write, ‘we need to do more to combine our knowledge of media objects with our sense of the mediation process that is continually reinventing them’ (Kember and Zylinska 2012: 19). The result of the machine’s analysis and remediation of a sound – in its nature an ambiguous sound object as we interact and engage with it – is that we start to know the mediation process better.

Computer applications such as Descript employ the products of Lyrebird, a Montreal-based company that developed machine learning tools to regenerate the human voice for text-to-speech synthesis from 10-minute-long samples. In a Wired interview, Lyrebird’s co-founder Jose Sotelo said, ‘rather than having a human categorise voices based on accent, pitch, cadence, or speed to figure out various factors that make you sound like you, deep learning allows us to teach the machines to do the sorting. And as they sort, they learn on their own’ (Wired 2018). The machine that listens is characterised in this interview as a sorter; the process of listening engages the machine in an activity that is primarily concerned with arranging and making connections using relations and scale. What does the machine’s parallel dimension – its operation in its own time domain and according to its own logic of repetitive sorting and arranging – signify for the cultural moment?

The fact that the European Union’s GDPR now includes ‘audio recording’ as data and calls for efforts to anonymise unstructured (unprotected) data is telling (Weitzenboeck et al. 2022: 185). Speaker anonymisation offers a solution for voice privacy because it ‘aims to remove speaker information from a speech utterance while leaving the other acoustic attributes unaltered’ (Matassoni et al. 2024: 14). We are steering between two efforts: generating more voices with speech reproduction while also trying to make the human voice anonymous. The popularisation of algorithms such as JTFS faces narrative complexities; one effort should be to describe it as more-than-human listening, which consists of the relationships between the ontologically different practices of processing sound in humans and machines. To better operate within this cultural moment, we can listen with the finite-state machine – the mathematical abstraction used to design computational algorithms – by thinking through it historically and critically.

7. MODES OF LISTENING EXTENDED

How can humans become better listeners with machines to achieve the more-than-human sensibilities needed to confront ever more complex ethical, ontological and epistemological problems? In 1966, Schaeffer, the inventor of musique concrète, examined listening practices of acousmatic sounds, which, as he defines them, are sounds with unseen sources. His modes of acousmatic listening are helpful in thinking through this pressing question. These modes, however, must be reassessed to take into consideration a high degree of new interferences, such as the deep-fake voice and the anonymised voice.

Understanding these processes as the mere reproduction of human listening makes it difficult to access the new modes of listening that result from the interaction of speech reproduction and human listeners. Similarly, understanding human listening as a simple process of receiving audio signals, limited to sorting between components such as frequency and phase relations, neglects questions about the fundamental changes that human listening is going through now. The idea of knowing by interacting with a sound object helped sonification to expand beyond a traditional framework. While sonification and machine listening differ in their intentions, the discourse recognises key aspects of the epistemologies of sound. For the purposes of this article and its conclusion, the speculative practice of sound archaeology helps to respond to the proprioceptive changes induced by media technologies in interacting with sensorial media, resynthesised sounds and deep-fake voices.

Listening is never an isolated interaction between objects and subjects. It is what, in Barad’s words, can be called a complex ‘intra-action’ (Barad 2007: 33). To fully engage with the way the relationality of sonification opens other ethico-onto-epistemologies (Barad 2007: 185) of listening, modes of listening must be further differentiated. Modes such as Ernst’s media archaeological listening must also be considered in response to technological changes. As Kember and Zylinska argue, ‘our intellects tend to divide the object world’: this fosters an illusion of a media landscape made from separate entities (Kember and Zylinska 2012: 25). In contrast, listening exists through entangled connections between people and reproduced voices – what the EU’s GDPR calls unstructured data – all part of a mediation process (ibid.: xvi).

For sonification, it has historically been important to understand the process of interpretation in listening. In their article ‘Interactive Sonification’, computer scientists Florian Grond and Thomas Hermann exemplify this, pointing out that sonification is contingent on the ‘act of interpretation’ and highlighting the fact that the listener has an equal part in the process of making sound (Grond and Hermann 2014: 41). They also argue that listening is not a static activity, but rather a process that bounces between different listening modes. They borrow from Schaeffer, who distinguished between listening (écouter), perceiving aurally (ouïr), hearing (entendre) and understanding (comprendre). ‘Reduced listening’ (écoute réduite), an observation method using the four modes (see Grond and Hermann 2014: 43), is essential to thinking about the dualism of accessing sound for its own sake, or as music. Schaeffer explored this dualism in reduced listening as being ‘caught between sound object and musical structure’ (Schaeffer 1966: 279).

Using Schaeffer’s framework, Grond and Hermann situate sonification practices in the study of listening, arguing that information about data emerges from feedback-loop cycles of attention and interpretation when a sound source and listener interact. They write that ‘an awareness of the variety of listening modes can help to enunciate how sound might refer to other modes of perception and action and how they influence the connection between data and sound’ (Grond and Hermann 2014: 42).

At this point, it should be noted that one of the most crucial challenges of data sonification is to disassociate the listener from misconceptions that lead them to listen to sonification as music. Grond and Hermann point out that ‘if a sonification consists of several audible percepts, this reduction is inevitable and makes us perceive a sonification as more or less musical’; while this fundamental problem may relate to challenges of artificially generated music, which is important to consider, it is beyond the scope of this article (ibid.: 42). It must be left for consideration elsewhere.

By describing listening modes, Schaeffer’s methodologies also aimed to work with sound in relation to the absence of visible sources for passing auditory events. Such acousmatic situations require a higher degree of interaction between the subject and the sound signal, because the act of listening itself comes under analysis and must be interpreted by the listener. By equating what they call the ‘interactive’ dimension with interpretation, however, Grond and Hermann foreground meaning over relationality. For Schaeffer, the performativity of listening modes is a relational practice; as he explains, the ‘act of understanding [semantically but also by differentiating] precisely coincides with the activity of listening: the whole work of deduction, comparison, and abstraction is part of the process and goes far beyond the immediate content, “what can be heard”’ (Schaeffer 1966: 79).

Schaeffer outlined four distinct acousmatic situations through which it is possible to attend to sounds with unseen sources; he named them with the Greek-inspired term acousmatic, used to describe the disciples of Pythagoras who listened to the philosopher’s lectures from behind a curtain (Schaeffer 1966: 64). First is ‘pure listening’, the recognition of sound in relation to visual stimuli, whether or not the source is present for the listener. Second is ‘listening to effects’, which occurs when the listener engages with the sound as if it were a completely independent entity; it treats the sound event as something that can exist even when the physical source does not, being only imagined.

To disassociate the subject from the sound source, Schaeffer suggests performing this mode repetitively by listening to recorded materials. The ‘listening to effects’ mode situates the subject’s perception at the centre of the process and proposes that, in acousmatic listening, there is no difference between natural and reproduced sound; they are essentially all ‘sound objects’ (Schaeffer 1966: 66). Third is ‘variation in listening’, which takes place when the subject listens to natural and pre-recorded sounds through reproduction devices, for example, a tape recorder and loudspeakers. This mode raises the listener’s attention to the many variations of listening that can occur. Fourth is ‘variations in signal’, which become evident while listening to pre-recorded and modified sounds modulated on an analogue or, in today’s world, a digital device.

A possible fifth mode, inspired by machine listening, could be added to Schaeffer’s categories: it is a situation in which the sounds themselves are digitally produced or modelled signals that appear to simulate the recorded sound so perfectly that the two variations – the original and the recreated – are indistinguishable by a listener. In this case, the listener has no space for differentiation between the two sounds. The possible fifth mode is the outcome of the aforementioned four; it dwells in the infinite variety of signals that a computer can use to model a sound, mimicking other recorded sounds. It is added to perception, thereby changing the dynamics of listening, and it takes place between humans and machines.

Schaeffer situated acousmatic listening in an active space where the subject performs and studies the phenomena of its own perception process; a similar situation was staged by the artist Hecker, who sonified the data of a machine listening process for the audience. The creation of a space where the performance of listening can raise awareness of the process of listening – in some cases of the richness, complexity or other characteristics of sound objects – is important. If a sound becomes indistinguishable from an ontologically different sound, it becomes impossible for perception to recognise it as a machine-generated sound. This is where media archaeology is useful and where, through performativity and re-enactment, such as Hecker’s listening sonification, more-than-human listening is better addressed.

A potential recognition of the difference between an original and a deep-fake might not change the interpretation or the meaning of the sound sign, but media practices that perform Schaeffer’s acousmatic situations give space – an essential space – to see these manipulations as interference with the materiality of listening practices. Schaeffer suggests thinking about the tape-recorded sound as Pythagoras’s curtain that ‘may create a new phenomena to be observed, but above all it creates new conditions for observation’ (Schaeffer 1966: 69); we have to extend our repositories of listening techniques to new manipulations of digital machine listening.

Perhaps it is by thinking through sonification that a specific mode of listening – in which the listener spins through the aforementioned acousmatic situations while sonification brings speed and spatiality into the perception process – could be integrated further into the discourse of machine listening. More than just a techno-positivist attitude to technology, more than just a disruptive force, it could generate work with technology that understands its philosophical, epistemological, ontological and ethical grounds and implications. This would be a design for sonic interaction conceived to reconfigure the proprioceptive dimensions of sensory perception for the more-than-human.

8. CONCLUSION

When the term ‘machine listening’ is used to anthropomorphise the machine, it obfuscates ontological differences between humans and machines. The discourse of sonification addresses the relationality and scalability between data and audible percepts; among its main concerns, as Grond and Hermann suggest, is therefore the phenomenology of listening to evanescent auditory events (Grond and Hermann 2014: 50). This article has sought to establish the connection between sonification and machine listening through relationality: between data and sound, voice and deep-fake voice, privacy concerns around data and, most importantly, listening practices. More-than-human listening is the relationship between machine listening – speech reproduction, speaker recognition, voice anonymisation – and human listening.

Media archaeological connections between historical terms, such as Fourier analysis, and the materiality of sound processing show how machines become media archaeologists when humans try to navigate the ethico-onto-epistemological boundaries of audible perception. Art can play the part of the performer, the re-enactor of mediation, asking informed questions about epistemological boundaries. The remediation process relates to Schaeffer’s acousmatic situations, reactivated by the listener who observes their own listening. In the remediation process of a voice sample, how can a listener identify differences? The difference is not about the ability to hear the voice as different, but about being able to hear that a new epistemology of sound objects is present.

In one of his very last articles, published in 2003, Polish science-fiction writer Stanisław Lem reflected on his past research in artificial intelligence. He observed that the early experiments with non-human conversational tools were deceptive: they obfuscate not only because such is the nature of reproductions that aim for authenticity, but also because humans tend to assume that if someone speaks to them, this speaking entity must be another human. Lem surmised that for a human, the aim of conversation is cognitive; for a machine, it is disruptive. The machine looks for new, never-existing connections between words and collocations that, during the interaction, humans desperately try to interpret and understand as tiny fragments of the world (Lem 2006: 320).

Likewise, algorithmic analysis establishes new relations between subjects and objects, and though these new relations often do not make human sense, they make the world fuller, with more words and voices than would have ever been heard or encountered. Machine listening is the spatial-sonic realisation in sonification of these new networks: this listening process transcribes sounds for the machine as integers and images of sounds that become intertwined with other numbers and are then transformed back to sound. For artificial intelligence, as Lem said, every call for a new action can mean getting somewhere no one has ever been: an intelligence that is profoundly ‘other’, not a tiny fragment of the world as we think we already know it.

Footnotes

1 For example, temperature values indicate the intensity of heat energy, and sonification turns these measured values into sound energy. I also use the term ‘intensity’ to reflect on LaBelle’s use of the term. According to him, the intensities of sonic events form relationships between material behaviour and sociopolitical dimensions of sound (LaBelle 2018: 60).

2 The potential for exchanging vibration energy. LaBelle says, ‘sound is fundamentally a vibrant matter, one that conducts any number of contacts and conversations’ (LaBelle 2018: 61).

3 Kittler’s original essay was written in German but appeared with its English title: ‘Real Time Analysis, Time Axis Manipulation’ (Kittler 2017: 16). In his translation, Geoffrey Winthrop-Young wrote ‘real time’ consistently without the hyphen to align with Kittler’s use.

4 Not every sound can be translated with the FFT. For example, it is hard to separate noise into distinct waveforms.

REFERENCES

Andén, J., Lostanlen, V. and Mallat, S. 2015. Joint Time-Frequency Scattering for Audio Classification. 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), 1–6. https://doi.org/10.1109/MLSP.2015.7324385.
Ballora, M. 2014. Sonification, Science and Popular Music: In Search of the ‘Wow’. Organised Sound 19(1): 30–40. https://doi.org/10.1017/S1355771813000381.
Barad, K. 2007. Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Durham, NC: Duke University Press.
Blom, I. 2021. Sound Effect. Artforum 59(6): 101–7.
Bregman, A. S. 1990. Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press.
Chun, W. H. K. and Barnett, A. 2021. Discriminating Data: Correlation, Neighborhoods, and the New Politics of Recognition. Cambridge, MA: MIT Press.
Clarke, E. F. 2005. Ways of Listening: An Ecological Approach to the Perception of Musical Meaning. New York: Oxford University Press.
Ernst, W. 2013. Digital Memory and the Archive, ed. Parikka, J. Minneapolis, MN: University of Minnesota Press.
Ernst, W. 2016. Chronopoetics: The Temporal Being and Operativity of Technological Media. London: Rowman & Littlefield.
Fickers, A. and van den Oever, A. 2019. Doing Experimental Media Archaeology: Epistemological and Methodological Reflections on Experiments with Historical Objects of Media Technologies. In Roberts, B. and Goodall, M. (eds.) New Media Archaeologies. Amsterdam: Amsterdam University Press, 45–68.
Grond, F. and Hermann, T. 2014. Interactive Sonification for Data Exploration: How Listening Modes and Display Purposes Define Design Guidelines. Organised Sound 19(1): 41–51. https://doi.org/10.1017/S1355771813000393.
Hunt, A. and Hermann, T. 2011. Interactive Sonification. In Hermann, T., Hunt, A. and Neuhoff, J. G. (eds.) The Sonification Handbook. Berlin: Logos Verlag, 273–98.
James, R. 2019. The Sonic Episteme: Acoustic Resonance, Neoliberalism, and Biopolitics. Durham, NC: Duke University Press.
Kember, S. and Zylinska, J. 2012. Life after New Media: Mediation as a Vital Process. Cambridge, MA: MIT Press.
Kittler, F. A. 1999. Gramophone, Film, Typewriter. Stanford, CA: Stanford University Press.
Kittler, F. A. 2017. Real Time Analysis, Time Axis Manipulation. Cultural Politics 13(1): 1–18. https://doi.org/10.1215/17432197-3755144.
LaBelle, B. 2018. Sonic Agency: Sound and Emergent Forms of Resistance. London: Goldsmiths Press.
Lem, S. 2006. DiLEMmák. Budapest: Typotex.
Lostanlen, V., Andén, J. and Lagrange, M. 2019a. Fourier at the Heart of Computer Music: From Harmonic Sounds to Texture. Comptes Rendus Physique 20(5). https://doi.org/10.1016/j.crhy.2019.07.005.
Lostanlen, V., Salamon, J., Farnsworth, A., Kelling, S. and Bello, J. P. 2019b. Robust Sound Event Detection in Bioacoustic Sensor Networks. PLoS One 14(10): e0214168. https://doi.org/10.1371/journal.pone.0214168.
Matassoni, M., Fong, S. and Brutti, A. 2024. Speaker Anonymization: Disentangling Speaker Features from Pre-Trained Speech Embeddings for Voice Conversion. Applied Sciences 14(9). https://doi.org/10.3390/app14093876.
MathWorks. n.d. What Are Wavelet Transforms? www.mathworks.com/discovery/wavelet-transforms.html (accessed 18 July 2024).
McLuhan, M. 2003. Understanding Media: The Extensions of Man. New York: McGraw-Hill Book Company.
Moore, T. n.d. Fourier Transform. https://learn.flucoma.org/learn/fourier-transform/ (accessed 25 May 2024).
Nagrani, A., Chung, J. S., Xie, W. and Zisserman, A. 2020. Voxceleb: Large-Scale Speaker Verification in the Wild. Computer Speech & Language 60: 101027. https://doi.org/10.1016/j.csl.2019.101027.
Parikka, J. 2012. What is Media Archaeology? Cambridge: Polity Press.
Roads, C. 2002. Microsound. Cambridge, MA: MIT Press.
Schaeffer, P. 1966. Treatise on Musical Objects: An Essay Across Disciplines, trans. North, C. Berkeley, CA: University of California Press.
Settel, Z. and Lippe, C. 1995. Real-Time Musical Applications Using Frequency Domain Signal Processing. IEEE, 230–3. https://doi.org/10.1109/ASPAA.1995.482997.
Smirnov, A. 2013. Sound in Z: Experiments in Sound and Electronic Music in Early 20th Century Russia. Cologne: Koenig Books.
Weitzenboeck, E., Lison, P., Cyndecka, M. and Langford, M. 2022. The GDPR and Unstructured Data: Is Anonymization Possible? International Data Privacy Law 12(3): 184–206. https://doi.org/10.1093/idpl/ipac008.
Wired. 2018. How Lyrebird Uses AI to Find Its (Artificial) Voice. www.wired.com/brandlab/2018/10/lyrebird-uses-ai-find-artificial-voice/ (accessed 15 January 2023).