Publications
Controllable neural audio synthesis of sound effects is a challenging task due to the potential scarcity and spectro-temporal variance of the data. Differentiable digital signal processing (DDSP) synthesisers have been successfully employed to model and control musical and harmonic signals using relatively limited data and computational resources. Here we propose NoiseBandNet, an architecture capable of synthesising and controlling sound effects by filtering white noise through a filterbank, thus going further than previous systems that make assumptions about the harmonic nature of sounds. We evaluate our approach via a series of experiments, modelling footsteps, thunderstorm, pottery, knocking, and metal sound effects. Comparing the audio reconstruction capabilities of NoiseBandNet to those of four variants of the DDSP filtered-noise synthesiser, NoiseBandNet scores higher in nine out of ten evaluation categories, establishing a flexible DDSP method for generating time-varying, inharmonic sound effects of arbitrary length with good time and frequency resolution. Finally, we illustrate some potential creative uses of NoiseBandNet by generating variations, performing loudness transfer, and training it on user-defined control curves.
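The core idea, filtering white noise through a bank of bandpass filters whose per-band gains vary over time, can be illustrated with a short sketch. This is a minimal illustration rather than the NoiseBandNet implementation: the band edges, filter design, and random gain envelopes below are assumptions standing in for the amplitudes a trained network would predict from control curves.

```python
# Minimal sketch of filtered-noise synthesis: white noise is split into frequency
# bands by a filterbank and each band is scaled by a time-varying gain envelope
# (which a neural network would predict in practice). Band edges, filter order,
# and the random envelopes are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

sr = 16000                      # sample rate (Hz)
dur = 2.0                       # seconds of audio to generate
n = int(sr * dur)
noise = np.random.randn(n)      # white-noise excitation

# Illustrative logarithmically spaced band edges (Hz)
edges = np.geomspace(40, 7800, num=17)

bands = []
for lo, hi in zip(edges[:-1], edges[1:]):
    b, a = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype="band")
    bands.append(lfilter(b, a, noise))
bands = np.stack(bands)         # shape: (n_bands, n_samples)

# Stand-in for learned control: slowly varying random amplitude per band,
# upsampled from a coarse frame rate to audio rate.
frames = np.abs(np.random.randn(len(bands), 50))
t_frames = np.linspace(0, dur, frames.shape[1])
t_audio = np.linspace(0, dur, n)
envelopes = np.stack([np.interp(t_audio, t_frames, f) for f in frames])

audio = (bands * envelopes).sum(axis=0)
audio /= np.abs(audio).max()    # normalise to [-1, 1]
```

In the setting described in the abstract, the per-band envelopes would come from the trained model rather than from random noise, which is what allows control via loudness transfer or user-defined control curves.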
In media production, the sound design process often involves the use of pre-recorded sound samples as the source of the audio assets. However, the increasing size and complexity of interactive media such as video games may render this process very time-consuming and memory-demanding. In contrast, the use of sound synthesis for sound effects can expand the sound palette of media, tackling the challenges arising from current workflows. These synthesised sound effects are usually generated using digital signal processing (DSP) methods. Nonetheless, creating sound effects with DSP methods can be challenging and can produce unsatisfactory results, which hampers their adoption among audio professionals. Recent data-driven approaches offer an alternative to these DSP methods for the synthesis of audio, surpassing them and establishing the state of the art in sound generation. This thesis explores the suitability of DSP systems, generative deep learning architectures, and a combination of both for the synthesis of sound effects, with a special focus on game audio. The results show that some DSP methods, under constraints, can be perceptually effective for this task. Furthermore, it is shown how generative deep learning methods, not necessarily bound by those constraints, are not far from achieving a plausibility comparable to that of pre-recorded samples; how they can also be trained in data-scarce scenarios, outperforming DSP approaches in the plausibility and variation of the synthesised sounds; and how a combination of deep learning and DSP processes can be used to build expressive models, linking human-interpretable controls to the output audio. The findings suggest that generative deep learning methods, both on their own and in combination with DSP approaches, contribute to addressing the challenges hampering the adoption of synthesised sound effects. This work could lead to the establishment of novel data-driven workflows tailored to the preferences of audio professionals, in line with current industry demands.
Single-image generative adversarial networks learn from the internal distribution of a single training example to generate variations of it, removing the need for a large dataset. In this paper we introduce SpecSinGAN, an unconditional generative architecture that takes a single one-shot sound effect (e.g., a footstep or a character jump) and produces novel variations of it, as if they were different takes from the same recording session. We explore the use of multi-channel spectrograms to train the model on the various layers that comprise a single sound effect. A listening study comparing our model to real recordings and to digital signal processing procedural audio models in terms of sound plausibility and variation revealed that, when using multi-channel spectrograms, SpecSinGAN is more plausible and varied than the procedural audio models considered. Sound examples can be found on the project website.
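As a rough illustration of the multi-channel spectrogram input, the sketch below stacks the log-magnitude spectrograms of the individual layers of a one-shot effect into a single multi-channel array, the kind of representation a 2D convolutional GAN could be trained on. The file names, STFT settings, and two-layer split are hypothetical placeholders, not values taken from the SpecSinGAN implementation.

```python
# Minimal sketch: each layer of a one-shot sound effect (e.g. the heel and toe
# components of a footstep) becomes one channel of the training example.
# Layer file names and STFT parameters are illustrative assumptions.
import numpy as np
import librosa

layer_files = ["footstep_heel.wav", "footstep_toe.wav"]   # hypothetical layer stems
n_fft, hop = 512, 128

channels = []
for path in layer_files:
    y, sr = librosa.load(path, sr=22050, mono=True)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    channels.append(np.log1p(mag))                         # log-magnitude spectrogram

# Pad layers to the same number of frames, then stack them as channels:
# shape (n_layers, freq_bins, frames), one 2D "image" per layer.
max_frames = max(c.shape[1] for c in channels)
channels = [np.pad(c, ((0, 0), (0, max_frames - c.shape[1]))) for c in channels]
multi_channel_spec = np.stack(channels)
```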
In this paper we explore the synthesis of sound effects using conditional generative adversarial networks (cGANs). We commissioned Foley artist Ulf Olausson to record a dataset of knocking sound effects with different emotions and trained a cGAN on it. We analysed the resulting synthesised sound effects by comparing their temporal acoustic features to those of the original dataset and by performing an online listening test. Results show that the acoustic features of the synthesised sounds are similar to those of the recorded dataset. Additionally, the listening test shows that the synthesised sounds can be identified by people with experience in sound design, but the model is not far from fooling non-experts. Moreover, on average, most emotions can be recognised correctly in both recorded and synthesised sounds. Given that the temporal acoustic features of the two datasets are highly similar, we hypothesise that they strongly contribute to the perception of the intended emotions in both the recorded and the synthesised knocking sounds.
The use of real-time sound synthesis for sound effects can improve the sound design of interactive experiences such as video games. However, synthesized sound effects can often be perceived as synthetic, which hampers their adoption. This paper aims to determine whether sounds synthesized using filter-based modal synthesis are perceptually comparable to directly recorded sounds. Modes are the individual sinusoidal frequencies at which an object vibrates when excited. Sounds from four different materials that exhibited clear modes were recorded and synthesized using filter-based modal synthesis. A listening test was conducted in which participants were asked to identify, in isolation, whether a sample was recorded or synthesized. Results show that recorded and synthesized samples are indistinguishable from each other. The study outcome indicates that, for the analysed materials, filter-based modal synthesis is a suitable technique for synthesizing hit sounds in real time without perceptual compromises.
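For reference, a compact sketch of the filter-based approach is given below: each mode is realised as a two-pole resonant filter excited by a unit impulse, and the filter outputs are summed. The mode frequencies, decay times, and gains are illustrative placeholders, not values measured in the study.

```python
# Minimal sketch of filter-based modal synthesis: one two-pole resonator per mode,
# all excited by the same impulse (standing in for the hit). Mode data are
# hypothetical examples, not measurements from the paper.
import numpy as np
from scipy.signal import lfilter

sr = 44100
dur = 1.5
n = int(sr * dur)

# Hypothetical modes for a struck object: (frequency in Hz, T60 decay in s, gain)
modes = [(520.0, 0.9, 1.0), (1310.0, 0.6, 0.5), (2770.0, 0.3, 0.25)]

excitation = np.zeros(n)
excitation[0] = 1.0                       # unit impulse as the excitation signal

out = np.zeros(n)
for freq, t60, gain in modes:
    r = 10 ** (-3.0 / (t60 * sr))         # pole radius giving a 60 dB decay over t60 s
    w = 2 * np.pi * freq / sr
    b = [gain * np.sin(w)]                # feed-forward gain for the resonator
    a = [1.0, -2.0 * r * np.cos(w), r * r]
    out += lfilter(b, a, excitation)      # ring the mode and add it to the mix

out /= np.abs(out).max()                  # normalise to [-1, 1]
```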
In this paper we present a system for the sonification of the electricity drawn by different household appliances. The system uses SpecSinGAN as the basis for the sound design: an unconditional generative architecture that takes a single one-shot sound effect (e.g., a fire crackle) and produces novel variations of it. SpecSinGAN is based on single-image generative adversarial networks, which learn from the internal distribution of a single training example (in this case the spectrogram of the sound file) to generate novel variations of it, removing the need for a large dataset. In our system, a Python script on a Raspberry Pi receives the data on the electricity drawn by an appliance via a Smart Plug. The data is then sent to a Pure Data patch via Open Sound Control. The electricity drawn is mapped to the sound of fire, which is generated in real time in Pure Data by mixing different variations of four fire sounds (a fire crackle, a low-end fire rumble, a mid-level rumble, and a hiss) that were synthesised offline by SpecSinGAN. The result is a dynamic fire sound that is never the same and that grows in intensity with the electricity data: the density of the crackles and the level of the rumbles increase as more electricity is drawn. Our testing of the system in two households with different appliances confirms that the sonification works well and intuitively in increasing awareness of the energy consumed by different appliances, and is particularly useful in drawing attention to "invisible" energy consumption. Finally, we discuss results and future work.
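The data path can be sketched as follows, assuming the python-osc library on the Pi side. This is a minimal illustration: read_plug_watts(), the OSC address, and the port are hypothetical placeholders, and the actual Smart Plug query and the Pure Data mapping to crackle density and rumble level are not shown.

```python
# Minimal sketch of the forwarding script: read the power drawn by an appliance
# and send it to a Pure Data patch over Open Sound Control.
# read_plug_watts(), the address "/energy/watts", and port 9000 are assumptions.
import time
import random
from pythonosc.udp_client import SimpleUDPClient   # pip install python-osc

client = SimpleUDPClient("127.0.0.1", 9000)        # Pd patch listening for OSC on this port

def read_plug_watts() -> float:
    """Placeholder for querying the Smart Plug; a random value stands in here."""
    return random.uniform(0.0, 2000.0)

while True:
    watts = read_plug_watts()
    # The Pd patch maps this value to crackle density and rumble levels of the fire sound.
    client.send_message("/energy/watts", watts)
    time.sleep(1.0)
```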
In this paper we demonstrate the use of Multidimensional Archive of Phenotypic Elites (MAP-Elites), a divergent search algorithm, as a game design assisting tool. MAP-Elites illuminates the game space instead of merely determining a single game setting via objective-based optimization. We show how the game space can be explored by generating a diverse set of game settings, allowing designers to explore what range of behaviours is possible in their games. The proposed method was applied to the 2D game Cave Swing, where we discovered different game settings under which a Rolling Horizon Evolutionary Algorithm (RHEA) agent behaved differently depending on the selected parameters. The agent's performance was plotted against its behaviour for further exploration, which allowed us to visualize how the agent performed with the selected behaviour traits.
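A minimal, self-contained sketch of the MAP-Elites loop is given below. It keeps one elite game setting per cell of a discretised behaviour space; evaluate_with_agent() is a hypothetical stand-in for running the agent on a parameterised game and returning a fitness score and a behaviour descriptor, and the parameter ranges and archive resolution are illustrative choices.

```python
# Minimal MAP-Elites sketch: mutate candidate game settings, evaluate them, and keep
# the best candidate per cell of a discretised behaviour space ("illumination").
# evaluate_with_agent() is a placeholder, not the Cave Swing / RHEA setup.
import random

def evaluate_with_agent(params):
    """Placeholder: run the agent on the game configured by `params`.
    Returns (fitness, (behaviour_x, behaviour_y)) with descriptors in [0, 1]."""
    fitness = -sum((p - 0.5) ** 2 for p in params)
    behaviour = (params[0], params[1])
    return fitness, behaviour

BINS = 10                                   # archive resolution per behaviour dimension
archive = {}                                # (bin_x, bin_y) -> (fitness, params)

def to_cell(behaviour):
    return tuple(min(int(b * BINS), BINS - 1) for b in behaviour)

for it in range(5000):
    if it < 100 or not archive:
        # Random initialisation phase
        params = [random.random() for _ in range(4)]
    else:
        # Mutate a randomly chosen elite
        _, parent = random.choice(list(archive.values()))
        params = [min(1.0, max(0.0, p + random.gauss(0, 0.1))) for p in parent]
    fitness, behaviour = evaluate_with_agent(params)
    cell = to_cell(behaviour)
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, params)   # keep the best solution per niche

# 'archive' now holds one elite game setting per behaviour cell, ready for inspection.
```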
Knocking sounds are highly meaningful everyday sounds. There are many ways of knocking, expressing important information about the state of the person knocking and their relationship with the person on the other side of the door. In media production, knocking sounds are important storytelling devices: they allow transitions to new scenes and create expectations in the audience. Despite this important role, knocking sounds have rarely been the focus of research. In this study, we create a dataset of knocking actions performed with different emotional intentions. We then verify, through a listening test, whether these emotional intentions are perceived through sound alone. Finally, we perform an acoustic analysis of the experimental dataset to identify whether emotion-specific acoustic patterns emerge. The results show that emotional intentions are correctly perceived for some emotions. Additionally, the emerging emotion-specific acoustic patterns confirm, at least in part, findings from previous research on speech and music performance.