FB-DDSP: Controllable Neural Audio Synthesis of Sound Effects Using Multi-Rate Filter Banks
[Code (zip)]

Sound examples

Reconstruction

We evaluate FB-DDSP resynthesis capabilities on four sound categories against three variants of the original DDSP filtered noise synthesiser [1], with a configuration of FIR filter lengths (F) and block sizes (B) of DDSPF65B256, DDSPF255B64 and DDSPF255B512. We use the same training configuration (including conditioning) for all models.
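
To make the roles of the two baseline parameters concrete, here is a minimal sketch of block-wise filtered noise. It assumes that F is the number of FIR taps and B is the number of noise samples filtered by each FIR before overlap-adding the results. The function name `filtered_noise` and the moving-average taps are hypothetical, chosen for illustration only. This is not the implementation evaluated here: the original DDSP synthesiser [1] derives its FIR filters from network-predicted frequency responses, which this sketch skips.

```python
import random


def filtered_noise(firs, block_size, seed=0):
    """Filter white noise block by block, one FIR per block, then overlap-add.

    firs: list of FIR tap lists, all of the same length F (one per block).
    block_size: number of noise samples B filtered by each FIR.
    """
    rng = random.Random(seed)
    fir_len = len(firs[0])
    n_blocks = len(firs)
    # Each filtered block is B + F - 1 samples long, so blocks overlap by F - 1.
    out = [0.0] * (n_blocks * block_size + fir_len - 1)
    for k, fir in enumerate(firs):
        noise = [rng.uniform(-1.0, 1.0) for _ in range(block_size)]
        start = k * block_size
        # Direct-form convolution of this block with its own FIR.
        for n, x in enumerate(noise):
            for m, h in enumerate(fir):
                out[start + n + m] += x * h
    # Drop the tail of the last block so the output is n_blocks * B samples.
    return out[: n_blocks * block_size]


if __name__ == "__main__":
    F, B = 65, 256  # values taken from the DDSPF65B256 configuration name
    # Hypothetical taps: a moving average (crude low-pass), same for every block.
    firs = [[1.0 / F] * F for _ in range(4)]
    audio = filtered_noise(firs, B)
    print(len(audio), "samples from", len(firs), "blocks")
```

Under these assumptions, a larger F gives each filter finer frequency resolution, while a smaller B lets the filter change more often over time.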

Footsteps | Thunderstorm | Pottery | Knocking
Original recording (training data)
FB-DDSP (ours)
DDSPF65B256
DDSPF255B64
DDSPF255B512
We test whether the differences in multi-resolution STFT (MRSTFT) reconstruction loss between the systems are statistically significant. A two-way ANOVA on the loss data, with factors for model (four levels: FB-DDSP and the three variants of the original DDSP model) and sound effect (four levels: knocking, thunderstorm, pottery, and footsteps), reveals significant main effects of model ($F(3, 96) = 555.95, p < .001$) and sound effect ($F(3, 96) = 96.82, p < .001$), and a significant interaction ($F(9, 96) = 13.05, p < .001$). This suggests that both the type of model and the type of sound effect drive differences in loss, and that certain combinations of model and sound effect lead to particularly low or high loss values.

We then ran multiple pairwise comparisons (Tukey's Honest Significant Difference method) to investigate which groups differ. FB-DDSP significantly outperforms DDSPF65B256 ($\textrm{mean diff} = 0.325, p < .001$), DDSPF255B64 ($\textrm{mean diff} = 0.416, p < .001$), and DDSPF255B512 ($\textrm{mean diff} = 0.303, p < .001$). In addition, the knocking sound effect has significantly higher loss values for DDSPF65B256 and DDSPF255B64 than the other sound effect categories (e.g., $\textrm{mean diff} = 0.147, p < .001$).
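
As an illustration of how the F statistics in a balanced two-way ANOVA with interaction are formed, the sketch below computes them from sums of squares using only the Python standard library. The loss values are synthetic and made up for the example; they are not the data behind the results reported here. The choice of 7 observations per cell is an assumption inferred from the reported degrees of freedom (16 cells and 96 residual degrees of freedom imply 112 observations if the design is balanced). The function name `two_way_anova` is hypothetical, and the sketch does not compute p-values or the Tukey HSD comparisons.

```python
import random
from statistics import mean


def two_way_anova(data):
    """Balanced two-way ANOVA with interaction.

    data[i][j] is a list of n observations for level i of factor A
    and level j of factor B. Returns (df_effect, df_error, F) per effect.
    """
    a, b, n = len(data), len(data[0]), len(data[0][0])
    grand = mean([x for row in data for cell in row for x in cell])
    mean_a = [mean([x for cell in row for x in cell]) for row in data]
    mean_b = [mean([x for row in data for x in row[j]]) for j in range(b)]
    cell = [[mean(c) for c in row] for row in data]

    # Sums of squares for the two main effects, the interaction, and the error.
    ss_a = b * n * sum((m - grand) ** 2 for m in mean_a)
    ss_b = a * n * sum((m - grand) ** 2 for m in mean_b)
    ss_ab = n * sum(
        (cell[i][j] - mean_a[i] - mean_b[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_err = sum(
        (x - cell[i][j]) ** 2
        for i in range(a) for j in range(b) for x in data[i][j]
    )

    df_a, df_b = a - 1, b - 1
    df_ab, df_err = df_a * df_b, a * b * (n - 1)
    ms_err = ss_err / df_err
    return {
        "factor_a": (df_a, df_err, ss_a / df_a / ms_err),
        "factor_b": (df_b, df_err, ss_b / df_b / ms_err),
        "interaction": (df_ab, df_err, ss_ab / df_ab / ms_err),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic losses: 4 models x 4 sound effects x 7 runs (assumed, not real data).
    data = [
        [[1.0 + 0.3 * i + 0.1 * j + rng.gauss(0, 0.05) for _ in range(7)]
         for j in range(4)]
        for i in range(4)
    ]
    for name, (df1, df2, f) in two_way_anova(data).items():
        print(f"{name}: F({df1}, {df2}) = {f:.2f}")
```

With a 4 x 4 design and 7 observations per cell, the degrees of freedom come out as (3, 96) for each main effect and (9, 96) for the interaction, matching the form of the statistics quoted above.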

Note - Correction

After submitting the paper, we found that the implementation of the original DDSP FIR synthesiser we used to evaluate our model may be flawed and may not represent the potential reconstruction capabilities of the baseline system. Experiments with a different implementation yield comparable reconstruction capabilities between the baseline DDSP FIR synthesiser and the current configuration of FB-DDSP. We are currently working on increasing the frequency resolution of FB-DDSP (as noted in the paper's discussion) to demonstrate how our system can be further improved.

Creative experiments (coming soon)

Training sounds attribution
Footsteps on metal sounds by: Freesound user "Eelke", licensed under CC BY 4.0: https://freesound.org/people/Eelke/sounds/462599/
Thunderstorm sounds by: Freesound user "RossJuterbock", licensed under CC0 1.0: https://freesound.org/people/RossJuterbock/sounds/543481/
Pottery sounds by: Freesound user "Tumbleweed3288", licensed under CC0 1.0: https://freesound.org/people/Tumbleweed3288/sounds/381638/ and https://freesound.org/people/Tumbleweed3288/sounds/381548/
Knocking sounds by: Adrián Barahona-Ríos & Sandra Pauletto [2], licensed under CC BY 4.0: https://zenodo.org/record/3668503

References
[1] Engel, Jesse, et al. "DDSP: Differentiable Digital Signal Processing." arXiv preprint arXiv:2001.04643 (2020).
[2] Barahona-Ríos, Adrián and Sandra Pauletto. "Synthesising Knocking Sound Effects Using Conditional WaveGAN." In: Proceedings of the 17th Sound & Music Computing Conference, pp. 450-456, 2020.