Authors: Gwendal Le Vaillant and Thierry Dutoit (ISIA Lab, University of Mons)

This page contains supplemental material for the paper “Synthesizer Preset Interpolation using Transformer Auto-Encoders” accepted to IEEE ICASSP 2023 (preprint: https://arxiv.org/abs/2210.16984).

For the best audio listening experience, please use Chrome (preferred) or Safari.

Source code is available on GitHub.


Interpolation between presets

Interpolation example 1

Start preset: "WindEns2Ed"
End preset: "HARD ROADS"
Step 1/9 Step 2/9 Step 3/9 Step 4/9 Step 5/9 Step 6/9 Step 7/9 Step 8/9 Step 9/9
SPINVAE
Reference (naive linear)









Interpolation example 2

Start preset: "BRASS 15"
End preset: "Kharma'HM"
Step 1/9 Step 2/9 Step 3/9 Step 4/9 Step 5/9 Step 6/9 Step 7/9 Step 8/9 Step 9/9
SPINVAE
Reference (naive linear)









Interpolation example 3

Start preset: "VIBESYN 2"
End preset: "SYNTHEKLA4"
Step 1/9 Step 2/9 Step 3/9 Step 4/9 Step 5/9 Step 6/9 Step 7/9 Step 8/9 Step 9/9
SPINVAE
Reference (naive linear)









Interpolation example 4

Start preset: "Synbass 4"
End preset: "K.CLAV. 3"
Step 1/9 Step 2/9 Step 3/9 Step 4/9 Step 5/9 Step 6/9 Step 7/9 Step 8/9 Step 9/9
SPINVAE
Reference (naive linear)
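
The "Reference (naive linear)" rows above can be pictured with the following minimal sketch. It assumes that a preset is a vector of normalized numerical parameters and that Step 1/9 and Step 9/9 correspond to the start and end presets; the function name and the toy presets are hypothetical, and categorical parameters are not handled here.

```python
import numpy as np

def linear_preset_interpolation(start, end, n_steps=9):
    """Linearly interpolate between two parameter vectors.

    Assumption: each preset is a 1-D array of numerical parameters.
    Returns an array of shape (n_steps, n_params) whose first and last
    rows are the start and end presets.
    """
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    # Interpolation coefficient for each step, as a column for broadcasting
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * start + t * end

# Hypothetical 3-parameter presets (values are illustrative only)
start_preset = [0.0, 0.5, 1.0]
end_preset = [1.0, 0.5, 0.0]
steps = linear_preset_interpolation(start_preset, end_preset)
```

Each row of `steps` would then be sent to the synthesizer to render one audio step.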










Extrapolation

SPINVAE extrapolation example 1

<--- Extrapolation Preset Preset Extrapolation --->
"PnoCk Ep9" <---------- Interpolation ----------> "SYNTH 7"










SPINVAE extrapolation example 2

<--- Extrapolation Preset Preset Extrapolation --->
"BOUM" <---------- Interpolation ----------> "fuzzerro"











Audio features and interpolation performance results

Timbre audio features

As indicated in the paper, the Timbre Toolbox [1] is used to compute audio features for each rendered audio file. These features have been engineered by experts and span a wide range of timbre characteristics. They can be categorized into three main groups as shown in the table below.

Group      Abbreviation   Feature
Temporal   Att            Attack
           Dec            Decay
           Rel            Release
           LAT            Log-Attack Time
           AttSlope       Attack Slope
           DecSlope       Decrease Slope
           TempCent       Temporal Centroid
           EffDur         Effective Duration
           FreqMod        Frequency of Energy Mod.
           AmpMod         Amplitude of Energy Mod.
           RMSEnv         RMS of Energy Envelope
Spectral   SpecCent       Spectral Centroid
           SpecSpread     Spectral Spread
           SpecSkew       Spectral Skewness
           SpecKurt       Spectral Kurtosis
           SpecSlope      Spectral Slope
           SpecDecr       Spectral Decrease
           SpecRollOff    Spectral Rolloff
           SpecVar        Spectro-temporal variation
           FrameErg       Frame Energy
           SpecFlat       Spectral Flatness
           SpecCrest      Spectral Crest
Harmonic   HarmErg        Harmonic Energy
           NoiseErg       Noise Energy
           F0             Fundamental Frequency
           InHarm         Inharmonicity
           HarmDev        Harmonic Spectral Deviation
           OddEvenRatio   Odd to even harmonic ratio

Some features are computed for each time frame; for these, only the median value and the Inter-Quartile Range (IQR) are used [1]. This results in 46 features available to evaluate the interpolation.
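
As a minimal sketch of this aggregation, the snippet below reduces a per-frame feature to its median and IQR, then recounts the total number of features. The per-frame values are made up, and the split between global and per-frame features is an assumption inferred from the `_med` / `_IQR` suffixes used on this page (RMSEnv plus all spectral and harmonic features taken as per-frame, the ten other temporal features as global).

```python
import numpy as np

def summarize_frames(values):
    """Return (median, IQR) of a per-frame feature trajectory."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return med, q3 - q1

# Hypothetical per-frame Spectral Centroid values (Hz) for one audio file
spec_cent_frames = [1000.0, 1200.0, 1100.0, 1500.0, 1300.0]
spec_cent_med, spec_cent_iqr = summarize_frames(spec_cent_frames)

# Feature count under the assumption stated above
n_global = 10               # Att, Dec, Rel, LAT, AttSlope, DecSlope, TempCent, EffDur, FreqMod, AmpMod
n_per_frame = 1 + 11 + 6    # RMSEnv + spectral features + harmonic features
n_features = n_global + 2 * n_per_frame  # per-frame features yield a median and an IQR
```

Under this assumption the count comes out at 10 + 2 × 18 = 46, which matches the number stated above.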

Figure A below shows the correlation between all timbre features; most pairs are uncorrelated or only weakly correlated. Nonetheless, some pairs of features tend to vary together (e.g. RMSEnv_med and EffDur, SpecCent_med and SpecSpread_med) and present moderate to high levels of correlation. However, such pairs often describe timbre characteristics that are easily distinguishable to the human ear. For example, SpecCent_med and SpecSpread_med are highly correlated (0.72) but remain perceptually distinct: a high-pitched sound would have a high Spectral Centroid but a low Spectral Spread. Thus, all features that present only a moderate correlation should be kept when analyzing results.

Figure A. Absolute value of the correlation between timbre features.
Figure B. Features with a very high (> 0.9) correlation.

However, Figure B shows that a few pairs of features present a very high correlation (> 0.9) and might be redundant. To keep only features that are not highly correlated, eight features could be removed from the analysis: SpecRollOff_med, SpecKurt_med, SpecKurt_IQR, HarmErg_IQR, NoiseErg_med, NoiseErg_IQR, OddEvenRatio_IQR and Rel. The effect of removing them is discussed in the next sub-section.
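
The kind of screening described above could be sketched as follows. This is an illustration only: it assumes a Pearson correlation computed over a feature matrix with one row per audio file, and the function name, feature names and random data are hypothetical.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.9):
    """List feature pairs whose absolute correlation exceeds `threshold`.

    X: array of shape (n_samples, n_features), one column per feature.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    pairs = []
    n = len(names)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, no self-pairs
            if corr[i, j] > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Toy data: feat_b is almost a copy of feat_a, feat_c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])
pairs = highly_correlated_pairs(X, ["feat_a", "feat_b", "feat_c"])
```

One feature of each returned pair could then be dropped to obtain a reduced feature set; which one to drop is a separate choice that this sketch does not make.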

Detailed interpolation results

Interpolation quality is evaluated using the smoothness and non-linearity of audio features along an interpolation sequence. More details about these interpolation metrics are available in the paper.
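
To give an intuition for these two metrics, here is one possible formulation, which is an assumption and not necessarily the definition used in the paper: smoothness as the RMS of second-order differences of a feature along the sequence, and non-linearity as the RMS deviation from the straight line joining the first and last values. With this formulation, lower values are better for both.

```python
import numpy as np

def smoothness(feature_seq):
    """RMS of second-order finite differences (assumed formulation)."""
    seq = np.asarray(feature_seq, dtype=float)
    second_diff = np.diff(seq, n=2)
    return float(np.sqrt(np.mean(second_diff ** 2)))

def nonlinearity(feature_seq):
    """RMS deviation from the line joining the end points (assumed formulation)."""
    seq = np.asarray(feature_seq, dtype=float)
    line = np.linspace(seq[0], seq[-1], len(seq))
    return float(np.sqrt(np.mean((seq - line) ** 2)))

# Hypothetical values of one audio feature along two 9-step sequences
linear_seq = np.linspace(0.0, 1.0, 9)                              # ideal linear evolution
jumpy_seq = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)     # abrupt change mid-sequence
```

Under this formulation, `linear_seq` scores (numerically) zero on both metrics, whereas `jumpy_seq` is penalized by both.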

Results from Table 1 of the paper are reproduced on the left-hand side of the table below. They use the full set of 46 audio features, and include the number of improved features and the average performance variation.

Additional results are presented on the right-hand side of the table below. They show the median performance variation for 46 features (not presented in the paper due to space constraints). They also include complete results (number of improved features, average and median performance variation) obtained using the reduced set of 46 − 8 = 38 features.
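
The three summary statistics could be computed along the following lines. This is a sketch under assumptions: the variation of each feature is taken as the relative change (in %) of a metric with respect to the reference interpolation, and a feature counts as improved when its variation is negative. The function name and the toy values are hypothetical.

```python
import numpy as np

def summarize_variation(model_metric, reference_metric):
    """Summarize per-feature metric variation of a model against the reference.

    Both inputs hold one value per audio feature (e.g. smoothness).
    """
    model = np.asarray(model_metric, dtype=float)
    ref = np.asarray(reference_metric, dtype=float)
    variation = 100.0 * (model - ref) / ref  # relative variation in %
    return {
        "num_improved": int(np.sum(variation < 0)),
        "average_variation": float(np.mean(variation)),
        "median_variation": float(np.median(variation)),
    }

# Toy example with four features (values are illustrative only)
reference = [1.0, 2.0, 4.0, 5.0]
model = [0.8, 1.5, 4.4, 5.0]
summary = summarize_variation(model, reference)
```

Reporting the median alongside the average is useful because a few features with very large variations can dominate the average.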

SPINVAE remains the best overall model, with no significant difference in results between using the full or reduced set of audio features.

Results included in the paper: number of improved features and average variation for 46 features. Additional results: all other columns.

                 46 features                                     | 38 features (highly correlated excluded)
                 Num. improved   Average         Median          | Num. improved   Average         Median
                 (out of 46)     variation (%)   variation (%)   | (out of 38)     variation (%)   variation (%)
Model            Smooth. Nonlin. Smooth. Nonlin. Smooth. Nonlin. | Smooth. Nonlin. Smooth. Nonlin. Smooth. Nonlin.
SPINVAE (DLM 3)  35      38      -12.6   -12.3   -13.1   -15.5   | 29      30      -12.1   -11.3   -12.0   -14.2
Preset-only      25      30      -4.6    -6.4    -4.6    -9.0    | 20      22      -3.9    -5.8    -4.1    -7.5
Sound matching   8       7       +66.8   +29.7   +156    +109    | 6       5       +62.4   +28.8   +172    +126
DLM 2            31      37      -8.2    -10.4   -6.7    -11.0   | 26      29      -7.4    -9.3    -5.6    -9.5
DLM 4            30      40      -9.2    -14.5   -7.5    -16.1   | 25      32      -9.2    -13.7   -7.0    -15.0
Softmax          23      40      -1.2    -15.6   +0.9    -15.9   | 18      32      -1.4    -14.6   +2.5    -13.5
MLP              18      27      +21.0   -1.8    +23.9   +0.3    | 15      22      +18.2   -1.5    +20.7   +3.4
LSTM             15      1       +123    +93.5   +349    +394    | 12      1       +109    +89.5   +403    +462

[1] Geoffroy Peeters, Bruno Giordano, Patrick Susini, and Nicolas Misdariis, “The timbre toolbox: Extracting audio descriptors from musical signals,” The Journal of the Acoustical Society of America, vol. 130, 2011.