Authors: Gwendal Le Vaillant and Thierry Dutoit (ISIA Lab, University of Mons)

This page contains supplemental material for the paper “Synthesizer Preset Interpolation using Transformer Auto-Encoders” accepted to IEEE ICASSP 2023 (preprint: https://arxiv.org/abs/2210.16984).

For the best audio listening experience, please use Chrome (preferred) or Safari.

Contents:

Interpolation between presets
Extrapolation
Audio features and interpolation performance results

Source code is available on GitHub.

Interpolation between presets

Interpolation example 1

	Start preset "WindEns2Ed"								End preset "HARD ROADS"
	Step 1/9	Step 2/9	Step 3/9	Step 4/9	Step 5/9	Step 6/9	Step 7/9	Step 8/9	Step 9/9
SPINVAE
Reference (naive linear)

Interpolation example 2

	Start preset "BRASS 15"								End preset "Kharma'HM"
	Step 1/9	Step 2/9	Step 3/9	Step 4/9	Step 5/9	Step 6/9	Step 7/9	Step 8/9	Step 9/9
SPINVAE
Reference (naive linear)

Interpolation example 3

	Start preset "VIBESYN 2"								End preset "SYNTHEKLA4"
	Step 1/9	Step 2/9	Step 3/9	Step 4/9	Step 5/9	Step 6/9	Step 7/9	Step 8/9	Step 9/9
SPINVAE
Reference (naive linear)

Interpolation example 4

	Start preset "Synbass 4"								End preset "K.CLAV. 3"
	Step 1/9	Step 2/9	Step 3/9	Step 4/9	Step 5/9	Step 6/9	Step 7/9	Step 8/9	Step 9/9
SPINVAE
Reference (naive linear)
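
The "Reference (naive linear)" rows above can be pictured with a minimal sketch, given here for illustration only. It assumes that a preset is a plain vector of normalized numerical parameters, and that steps 1/9 and 9/9 are the start and end presets. The function name `linear_interpolation` and the toy values are hypothetical; categorical synthesizer parameters would need separate handling, and this is not the SPINVAE method.

```python
def linear_interpolation(start, end, n_steps=9):
    """Return n_steps parameter vectors, linearly spaced from start to end.

    Assumption: presets are lists of numerical parameters of equal length.
    """
    steps = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # t goes from 0.0 (start) to 1.0 (end)
        steps.append([(1.0 - t) * s + t * e for s, e in zip(start, end)])
    return steps


# Hypothetical 3-parameter presets (values made up for the example)
start_preset = [0.0, 0.5, 1.0]
end_preset = [1.0, 0.5, 0.0]
sequence = linear_interpolation(start_preset, end_preset)
```

Each of the nine vectors in `sequence` would then be rendered to audio by the synthesizer, giving one file per step.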

Extrapolation

SPINVAE extrapolation example 1

<--- Extrapolation		Preset					Preset	Extrapolation --->
		"PnoCk Ep9"	<---------- Interpolation ---------->				"SYNTH 7"

SPINVAE extrapolation example 2

<--- Extrapolation		Preset					Preset	Extrapolation --->
		"BOUM"	<---------- Interpolation ---------->				"fuzzerro"

Audio features and interpolation performance results

Timbre audio features

As indicated in the paper, the Timbre Toolbox ¹ is used to compute audio features for each rendered audio file. These features have been engineered by experts and span a wide range of timbre characteristics. They can be categorized into three main groups as shown in the table below.

Temporal		Spectral		Harmonic
Att	Attack	SpecCent	Spectral Centroid	HarmErg	Harmonic Energy
Dec	Decay	SpecSpread	Spectral Spread	NoiseErg	Noise Energy
Rel	Release	SpecSkew	Spectral Skewness	F0	Fundamental Frequency
LAT	Log-Attack Time	SpecKurt	Spectral Kurtosis	InHarm	Inharmonicity
AttSlope	Attack Slope	SpecSlope	Spectral Slope	HarmDev	Harmonic Spectral Deviation
DecSlope	Decrease Slope	SpecDecr	Spectral Decrease	OddEvenRatio	Odd to even harmonic ratio
TempCent	Temporal Centroid	SpecRollOff	Spectral Rolloff		
EffDur	Effective Duration	SpecVar	Spectro-temporal variation		
FreqMod	Frequency of Energy Mod.	FrameErg	Frame Energy		
AmpMod	Amplitude of Energy Mod.	SpecFlat	Spectral Flatness		
RMSEnv	RMS of Energy Envelope	SpecCrest	Spectral Crest		

Some features are computed for each time frame; for these, only the median value and the Inter-Quartile Range (IQR) are used¹. This results in 46 features available to evaluate the interpolation.
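
As an illustration, the median/IQR summary of a per-frame feature can be sketched as follows. The helper `summarize_frames` and the frame values are hypothetical. The feature count at the end rests on an assumption inferred from the feature names used on this page (e.g. `Rel` and `EffDur` without suffix, `RMSEnv_med` with one): ten temporal features are taken to be single global values, while RMSEnv and all spectral and harmonic features are taken to be per-frame.

```python
import statistics


def summarize_frames(values):
    """Return (median, IQR) of a sequence of per-frame feature values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


# Hypothetical per-frame values of one feature for one audio file
frames = [800.0, 1000.0, 1200.0, 1400.0, 1600.0]
feat_med, feat_iqr = summarize_frames(frames)

# Feature count under the assumed global / per-frame split
n_global = 10               # Att, Dec, Rel, LAT, AttSlope, DecSlope, TempCent, EffDur, FreqMod, AmpMod
n_per_frame = 1 + 11 + 6    # RMSEnv + spectral features + harmonic features
n_features = n_global + 2 * n_per_frame  # each per-frame feature gives _med and _IQR
```

Under that assumed split, the count is 10 + 2 × 18 = 46, which matches the number of features stated above.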

Figure A below shows the correlation between all timbre features. Most pairs of features are uncorrelated or only weakly correlated. Nonetheless, some pairs tend to show similar variations (e.g. RMSEnv_med and EffDur, SpecCent_med and SpecSpread_med) and present moderate to high levels of correlation. However, such pairs of features often describe timbre characteristics that are easily distinguishable to the human ear. For example, SpecCent_med and SpecSpread_med are highly correlated (0.72) but are discernibly distinct from each other: a high-pitched sound would have a low Spectral Spread but a high Spectral Centroid. Thus, all features that present only a moderate correlation should be used to analyze results.

Figure A. Absolute value of the correlation between timbre features.	Figure B. Features with a very high (> 0.9) correlation.

However, we observe from Figure B that a few pairs of features present a very high correlation and might be redundant. In order to keep only non-highly-correlated features, eight features could be removed from the analysis: SpecRollOff_med, SpecKurt_med, SpecKurt_IQR, HarmErg_IQR, NoiseErg_med, NoiseErg_IQR, OddEvenRatio_IQR and Rel. This is discussed in the next sub-section.
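
One way to build such a reduced feature set is sketched below. This is an assumed greedy procedure, not necessarily how the eight features listed above were selected. The function names and the toy data are hypothetical, and the feature names `feat_a`, `feat_b` and `feat_c` are placeholders that do not correspond to the pairs shown in Figure B.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_highly_correlated(features, threshold=0.9):
    """Keep features in order; skip one if |corr| with a kept feature exceeds threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature values measured over five audio files
features = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # almost proportional to feat_a
    "feat_c": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept = drop_highly_correlated(features)
```

With a greedy pass like this, the result depends on the order in which features are visited, so which member of a highly correlated pair is removed remains a design choice.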

Detailed interpolation results

Interpolation quality is evaluated using the smoothness and non-linearity of audio features along an interpolation sequence. More details about these interpolation metrics are available in the paper.
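
To give an intuition of what such metrics measure, here is a minimal sketch for a single audio feature along one interpolation sequence. The formulas are assumptions chosen for illustration (RMS of the second-order difference for smoothness, RMS deviation from a straight line between the end points for non-linearity); the exact definitions are those given in the paper and may differ. Lower values are better for both quantities in this sketch.

```python
import math


def smoothness(values):
    """RMS of the second-order finite difference (assumed formulation)."""
    second_diff = [values[i - 1] - 2.0 * values[i] + values[i + 1]
                   for i in range(1, len(values) - 1)]
    return math.sqrt(sum(d * d for d in second_diff) / len(second_diff))


def nonlinearity(values):
    """RMS deviation from the line joining first and last values (assumed formulation)."""
    n = len(values)
    line = [values[0] + (values[-1] - values[0]) * i / (n - 1) for i in range(n)]
    return math.sqrt(sum((v - l) ** 2 for v, l in zip(values, line)) / n)


# Hypothetical values of one feature along two 5-step sequences
linear_seq = [0.0, 1.0, 2.0, 3.0, 4.0]
bumpy_seq = [0.0, 3.0, 2.0, 1.0, 4.0]
```

A perfectly linear feature trajectory such as `linear_seq` scores zero on both functions, while `bumpy_seq` scores higher on both.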

Results from Table 1 of the paper are reproduced on the left-hand side of the table below. They use the full set of 46 audio features, and include the number of improved features and the average performance variation.

Additional results are presented on the right-hand side of the table below. They show the median performance variation for 46 features (not presented in the paper due to space constraints). They also include complete results (number of improved features, average and median performance variation) obtained using the reduced set of 46 - 8 = 38 features.

SPINVAE remains the best overall model, with no significant difference in results between using the full or reduced set of audio features.

	Results included in the paper				Additional results
Model	46 features						\|	38 features (highly correlated excluded)
	Num. improved features (out of 46)		Average variation (%)		Median variation (%)		\|	Num. improved features (out of 38)		Average variation (%)		Median variation (%)
	Smooth.	Nonlin.	Smooth.	Nonlin.	Smooth.	Nonlin.	\|	Smooth.	Nonlin.	Smooth.	Nonlin.	Smooth.	Nonlin.
SPINVAE (DLM 3)	35	38	-12.6	-12.3	-13.1	-15.5	\|	29	30	-12.1	-11.3	-12.0	-14.2
Preset-only	25	30	-4.6	-6.4	-4.6	-9.0	\|	20	22	-3.9	-5.8	-4.1	-7.5
Sound matching	8	7	+66.8	+29.7	+156	+109	\|	6	5	+62.4	+28.8	+172	+126
DLM 2	31	37	-8.2	-10.4	-6.7	-11.0	\|	26	29	-7.4	-9.3	-5.6	-9.5
DLM 4	30	40	-9.2	-14.5	-7.5	-16.1	\|	25	32	-9.2	-13.7	-7.0	-15.0
Softmax	23	40	-1.2	-15.6	+0.9	-15.9	\|	18	32	-1.4	-14.6	+2.5	-13.5
MLP	18	27	+21.0	-1.8	+23.9	+0.3	\|	15	22	+18.2	-1.5	+20.7	+3.4
LSTM	15	1	+123	+93.5	+349	+394	\|	12	1	+109	+89.5	+403	+462
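
The columns of the table can be read with the following sketch, which assumes that each variation is the relative change of a per-feature metric with respect to the reference (naive linear) interpolation, and that a feature counts as improved when its metric decreases. This aggregation is an assumption made for illustration; the paper may define the averages differently or require statistical significance. All names and values are hypothetical.

```python
import statistics


def compare_to_reference(model_scores, reference_scores):
    """Summarize per-feature metric changes relative to the reference (in %)."""
    variations = [
        100.0 * (model_scores[name] - ref) / ref
        for name, ref in reference_scores.items()
    ]
    return {
        "num_improved": sum(v < 0 for v in variations),  # lower metric = better
        "average_variation": statistics.mean(variations),
        "median_variation": statistics.median(variations),
    }


# Hypothetical smoothness values for three features
reference = {"feat_a": 2.0, "feat_b": 4.0, "feat_c": 10.0}
model = {"feat_a": 1.0, "feat_b": 3.0, "feat_c": 11.0}
summary = compare_to_reference(model, reference)
```

In this toy case two of three features improve, and the median variation (-25 %) differs from the average (about -21.7 %). The table shows the same effect at a much larger scale for the Sound matching and LSTM rows.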

¹ Geoffroy Peeters, Bruno Giordano, Patrick Susini, and Nicolas Misdariis, “The timbre toolbox: Extracting audio descriptors from musical signals,” in The Journal of the Acoustical Society of America, 2011, vol. 130.