Authors: Gwendal Le Vaillant and Thierry Dutoit (ISIA Lab, University of Mons)
This page contains supplemental material for the paper “Synthesizer Preset Interpolation using Transformer Auto-Encoders” accepted to IEEE ICASSP 2023 (preprint: https://arxiv.org/abs/2210.16984).
For the best audio listening experience, please use Chrome (preferred) or Safari.
Source code is available on GitHub
Interpolation between presets
Interpolation example 1
Start preset: "WindEns2Ed" | End preset: "HARD ROADS"
[Media grid: interpolation steps 1/9 to 9/9 (columns) for SPINVAE and for the reference naive linear interpolation (rows)]
Interpolation example 2
Start preset: "BRASS 15" | End preset: "Kharma'HM"
[Media grid: interpolation steps 1/9 to 9/9 (columns) for SPINVAE and for the reference naive linear interpolation (rows)]
Interpolation example 3
Start preset: "VIBESYN 2" | End preset: "SYNTHEKLA4"
[Media grid: interpolation steps 1/9 to 9/9 (columns) for SPINVAE and for the reference naive linear interpolation (rows)]
Interpolation example 4
Start preset: "Synbass 4" | End preset: "K.CLAV. 3"
[Media grid: interpolation steps 1/9 to 9/9 (columns) for SPINVAE and for the reference naive linear interpolation (rows)]
Extrapolation
SPINVAE extrapolation example 1
[Media row, 10 items from left to right: extrapolation before preset "PnoCk Ep9", interpolation from preset "PnoCk Ep9" to preset "SYNTH 7", extrapolation after preset "SYNTH 7"]
SPINVAE extrapolation example 2
[Media row, 10 items from left to right: extrapolation before preset "BOUM", interpolation from preset "BOUM" to preset "fuzzerro", extrapolation after preset "fuzzerro"]
Audio features and interpolation performance results
Timbre audio features
As indicated in the paper, the Timbre Toolbox [1] is used to compute audio features for each rendered audio file. These features have been engineered by experts and span a wide range of timbre characteristics. They can be categorized into three main groups as shown in the table below.
Temporal feature | Description | Spectral feature | Description | Harmonic feature | Description |
---|---|---|---|---|---|
Att | Attack | SpecCent | Spectral Centroid | HarmErg | Harmonic Energy |
Dec | Decay | SpecSpread | Spectral Spread | NoiseErg | Noise Energy |
Rel | Release | SpecSkew | Spectral Skewness | F0 | Fundamental Frequency |
LAT | Log-Attack Time | SpecKurt | Spectral Kurtosis | InHarm | Inharmonicity |
AttSlope | Attack Slope | SpecSlope | Spectral Slope | HarmDev | Harmonic Spectral Deviation |
DecSlope | Decrease Slope | SpecDecr | Spectral Decrease | OddEvenRatio | Odd to even harmonic ratio |
TempCent | Temporal Centroid | SpecRollOff | Spectral Rolloff | | |
EffDur | Effective Duration | SpecVar | Spectro-temporal variation | | |
FreqMod | Frequency of Energy Mod. | FrameErg | Frame Energy | | |
AmpMod | Amplitude of Energy Mod. | SpecFlat | Spectral Flatness | | |
RMSEnv | RMS of Energy Envelope | SpecCrest | Spectral Crest | | |
Some features are computed for each time frame; only their median value and Inter-Quartile Range (IQR) are used [1]. This results in 46 features available to evaluate the interpolation.
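The reduction of a per-frame feature to its median and IQR can be sketched as follows. This is a minimal illustration and not the Timbre Toolbox implementation: the function name `summarize_frames` and the input values are hypothetical, and the `_med` / `_IQR` suffixes simply follow the feature names used on this page.

```python
import numpy as np

def summarize_frames(name, frame_values):
    """Reduce one per-frame feature to two scalars: median and IQR.

    `name` is the feature abbreviation (e.g. "SpecCent") and
    `frame_values` holds one value per time frame (hypothetical input).
    """
    values = np.asarray(frame_values, dtype=float)
    # IQR = 75th percentile minus 25th percentile
    q25, q75 = np.percentile(values, [25, 75])
    return {
        f"{name}_med": float(np.median(values)),
        f"{name}_IQR": float(q75 - q25),
    }

# Made-up frame values, for illustration only
summary = summarize_frames("SpecCent", [1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```

Each per-frame feature therefore contributes two entries to the final feature set, whereas a global feature such as EffDur contributes one.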
Figure A below shows the correlation between all timbre features; most of them are uncorrelated or only weakly correlated. Nonetheless, some pairs of features tend to show similar variations (e.g. RMSEnv_med and EffDur, SpecCent_med and SpecSpread_med) and present moderate to high levels of correlation. However, such pairs often describe timbre characteristics that the human ear easily distinguishes. For example, SpecCent_med and SpecSpread_med are highly correlated (0.72) but describe discernibly distinct characteristics: a high-pitched sound would have a low Spectral Spread but a high Spectral Centroid. Thus, features that present only a moderate correlation should all be kept to analyze results.
Figure A. Absolute value of the correlation between timbre features.
Figure B. Features with a very high (> 0.9) correlation.
However, we observe from Figure B that a few pairs of features present a very high correlation and might be redundant. In order to keep only non-highly-correlated features, eight features could be removed from the analysis: SpecRollOff_med, SpecKurt_med, SpecKurt_IQR, HarmErg_IQR, NoiseErg_med, NoiseErg_IQR, OddEvenRatio_IQR and Rel. This is discussed in the next sub-section.
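The screening for redundant features described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes Pearson correlation (this page does not state which coefficient is used), and the function name `highly_correlated_pairs`, the input layout (one row per rendered sound, one column per feature) and the synthetic data are all hypothetical.

```python
import numpy as np

def highly_correlated_pairs(features, names, threshold=0.9):
    """Return (name_i, name_j, |corr|) for pairs whose absolute
    correlation exceeds `threshold`.

    `features` has shape (n_sounds, n_features); Pearson correlation
    is an assumption made for this sketch.
    """
    corr = np.abs(np.corrcoef(features, rowvar=False))
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only
            if corr[i, j] > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic data: "b" is almost a scaled copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 2.0 * a + 0.01 * rng.normal(size=500)
c = rng.normal(size=500)
data = np.column_stack([a, b, c])
print(highly_correlated_pairs(data, ["a", "b", "c"]))
```

Dropping one feature from each reported pair is one simple way to obtain a reduced, non-highly-correlated set; which member of a pair to drop remains a judgment call.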
Detailed interpolation results
Interpolation quality is evaluated using the smoothness and non-linearity of audio features along an interpolation sequence. More details about these interpolation metrics are available in the paper.
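To make the two notions concrete, here is a hedged sketch of how smoothness and non-linearity could be measured for one audio feature along one interpolation sequence. The exact definitions are given in the paper and are not reproduced on this page. The formulas below are assumptions chosen for illustration: RMS of the second-order difference for smoothness, and RMS deviation from the straight line joining the first and last values for non-linearity. The function names and the feature values are hypothetical.

```python
import numpy as np

def smoothness(values):
    """RMS of the discrete second-order difference (assumed definition).

    Lower values mean the feature evolves more smoothly across steps.
    """
    v = np.asarray(values, dtype=float)
    return float(np.sqrt(np.mean(np.diff(v, n=2) ** 2)))

def nonlinearity(values):
    """RMS deviation from the line joining the first and last values
    (assumed definition). Lower values mean a more linear evolution.
    """
    v = np.asarray(values, dtype=float)
    line = np.linspace(v[0], v[-1], len(v))
    return float(np.sqrt(np.mean((v - line) ** 2)))

# Made-up values of one feature over a 9-step interpolation
steady = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
abrupt = [0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 8.0, 8.0, 8.0]
print(smoothness(steady), nonlinearity(steady))
print(smoothness(abrupt), nonlinearity(abrupt))
```

Under these assumed definitions, lower values are better for both metrics, which is consistent with negative variations indicating an improvement in the results below.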
Results from Table 1 of the paper are reproduced on the left-hand side of the table below. They use the full set of 46 audio features, and include the number of improved features and average performance variation.
Additional results are presented on the right-hand side of the table below. They show the median performance variation for 46 features (not presented in the paper due to space constraints). They also include complete results (number of improved features, average and median performance variation) obtained using the reduced set of 46 - 8 = 38 features.
SPINVAE remains the best overall model, with no significant difference in results between using the full or reduced set of audio features.
Results included in the paper: number of improved features and average variation for 46 features. Additional results: median variation for 46 features, and all columns for 38 features (highly correlated excluded). Smooth. = smoothness, Nonlin. = non-linearity.
Model | 46 features: num. improved (out of 46), Smooth. | 46 features: num. improved (out of 46), Nonlin. | 46 features: average variation (%), Smooth. | 46 features: average variation (%), Nonlin. | 46 features: median variation (%), Smooth. | 46 features: median variation (%), Nonlin. | 38 features: num. improved (out of 38), Smooth. | 38 features: num. improved (out of 38), Nonlin. | 38 features: average variation (%), Smooth. | 38 features: average variation (%), Nonlin. | 38 features: median variation (%), Smooth. | 38 features: median variation (%), Nonlin. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SPINVAE (DLM 3) | 35 | 38 | -12.6 | -12.3 | -13.1 | -15.5 | 29 | 30 | -12.1 | -11.3 | -12.0 | -14.2 |
Preset-only | 25 | 30 | -4.6 | -6.4 | -4.6 | -9.0 | 20 | 22 | -3.9 | -5.8 | -4.1 | -7.5 |
Sound matching | 8 | 7 | +66.8 | +29.7 | +156 | +109 | 6 | 5 | +62.4 | +28.8 | +172 | +126 |
DLM 2 | 31 | 37 | -8.2 | -10.4 | -6.7 | -11.0 | 26 | 29 | -7.4 | -9.3 | -5.6 | -9.5 |
DLM 4 | 30 | 40 | -9.2 | -14.5 | -7.5 | -16.1 | 25 | 32 | -9.2 | -13.7 | -7.0 | -15.0 |
Softmax | 23 | 40 | -1.2 | -15.6 | +0.9 | -15.9 | 18 | 32 | -1.4 | -14.6 | +2.5 | -13.5 |
MLP | 18 | 27 | +21.0 | -1.8 | +23.9 | +0.3 | 15 | 22 | +18.2 | -1.5 | +20.7 | +3.4 |
LSTM | 15 | 1 | +123 | +93.5 | +349 | +394 | 12 | 1 | +109 | +89.5 | +403 | +462 |