Authors: Gwendal Le Vaillant and Thierry Dutoit (ISIA Lab, University of Mons)

Supplemental material for the article:

G. Le Vaillant and T. Dutoit, “Latent Space Interpolation of Synthesizer Parameters Using Timbre-Regularized Auto-Encoders,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 32, pp. 3379-3392, 2024, https://ieeexplore.ieee.org/document/10596701.

Contents:

Introduction - Live Demonstration
Interpolation Between Presets
Extrapolation
Presets Modulation

Introduction - Live Demonstration

The following video shows how the SPINVAE-2 model can morph between DX7 presets.

Interpolation Between Presets

Preset interpolation is usually performed by linearly interpolating each synthesis parameter independently. This baseline method is referred to as Reference (linear).
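As an illustration of this baseline, here is a minimal sketch that assumes each preset is stored as a list of numerical parameter values of equal length. The function name `linear_preset_interpolation` and the toy three-parameter presets are hypothetical and do not come from the article; how discrete or categorical synthesizer parameters should be handled is not covered here.

```python
def linear_preset_interpolation(start, end, num_steps=7):
    """Return num_steps presets going linearly from start to end (both included).

    start, end: lists of numerical parameter values (assumed representation).
    """
    if len(start) != len(end):
        raise ValueError("Both presets must have the same number of parameters")
    if num_steps < 2:
        raise ValueError("At least two steps are needed (start and end)")
    presets = []
    for step in range(num_steps):
        alpha = step / (num_steps - 1)  # 0.0 at the start preset, 1.0 at the end preset
        presets.append([(1 - alpha) * a + alpha * b for a, b in zip(start, end)])
    return presets


# Toy presets with three parameters each (hypothetical values)
steps = linear_preset_interpolation([0.0, 1.0, 0.5], [1.0, 0.0, 0.5], num_steps=7)
```

Each parameter is processed independently of the others, which is what distinguishes this baseline from an interpolation computed in a learned latent space.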

Our work introduces a preset morphing method based on the SPINVAE-2 model. First, the start and end presets, $\mathbf{u}^{(n)}$ and $\mathbf{u}^{(m)}$, are encoded into latent vectors $\mathbf{z}^{(n)}$ and $\mathbf{z}^{(m)}$. Second, a linear interpolation produces a series of intermediate latent vectors $\lbrace \mathbf{z}[t], t \in [1, T] \rbrace$, where $\mathbf{z}[1] = \mathbf{z}^{(n)}$ and $\mathbf{z}[T] = \mathbf{z}^{(m)}$. Third, the preset morphing is obtained by decoding each $\mathbf{z}[t]$ into an intermediate preset.
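The three steps above can be sketched as follows. This is not the SPINVAE-2 implementation: `encode` and `decode` are hypothetical callables standing in for the trained encoder and decoder, and the evenly spaced weights $\alpha_t = (t - 1) / (T - 1)$ are an assumption that is consistent with $\mathbf{z}[1] = \mathbf{z}^{(n)}$ and $\mathbf{z}[T] = \mathbf{z}^{(m)}$.

```python
def latent_morphing(u_start, u_end, encode, decode, num_steps=7):
    """Encode two presets, interpolate linearly in latent space, decode each step."""
    z_start = encode(u_start)  # z^(n)
    z_end = encode(u_end)      # z^(m)
    morphing = []
    for t in range(1, num_steps + 1):
        alpha = (t - 1) / (num_steps - 1)  # assumed evenly spaced steps
        z_t = [(1 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]
        morphing.append(decode(z_t))
    return morphing


# Toy stand-ins for the trained model (hypothetical, for demonstration only).
# They are linear, so the result matches a plain parameter interpolation;
# a trained non-linear auto-encoder would generally give different
# intermediate presets.
def toy_encode(u):
    return [2.0 * x - 1.0 for x in u]


def toy_decode(z):
    return [(x + 1.0) / 2.0 for x in z]


morphing = latent_morphing([0.0, 1.0], [1.0, 0.0], toy_encode, toy_decode, num_steps=7)
```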

Examples below use $T = 7$ interpolation steps for both methods.

Interpolation example 1

	Start preset "AnlgSyn.45"						End preset "ClinkieBel"
	Step 1/7	Step 2/7	Step 3/7	Step 4/7	Step 5/7	Step 6/7	Step 7/7
Reference (linear)
SPINVAE-2

Interpolation example 2

	Start preset "E.Piano 23"						End preset "B3 Organ 3"
	Step 1/7	Step 2/7	Step 3/7	Step 4/7	Step 5/7	Step 6/7	Step 7/7
Reference (linear)
SPINVAE-2

Interpolation example 3

	Start preset "WindEns2Ed"						End preset "HARD ROADS"
	Step 1/7	Step 2/7	Step 3/7	Step 4/7	Step 5/7	Step 6/7	Step 7/7
Reference (linear)
SPINVAE-2

Interpolation example 4

	Start preset "ABU ASHRAM"						End preset "STARRY 1"
	Step 1/7	Step 2/7	Step 3/7	Step 4/7	Step 5/7	Step 6/7	Step 7/7
Reference (linear)
SPINVAE-2

Interpolation example 5

	Start preset "Revers1"						End preset "SuperGrand"
	Step 1/7	Step 2/7	Step 3/7	Step 4/7	Step 5/7	Step 6/7	Step 7/7
Reference (linear)
SPINVAE-2

Other examples

Additional examples are available on a separate webpage.

SPINVAE-2 Extrapolations

The latent interpolation method can also be used to perform extrapolations: latent codes $\mathbf{z}[t]$ can be computed for $t \leq 0$ or $t > T$. These extrapolated latent vectors can be decoded into extrapolated presets, whose timbre characteristics go beyond those of the two original presets. The examples below display interpolations made of $T = 7$ steps, with $2$ extrapolation steps on each side.
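A minimal sketch of how such latent codes could be computed, assuming that the extrapolated codes continue the straight line through $\mathbf{z}[1]$ and $\mathbf{z}[T]$ with the same step size as the interpolation. The function name `latent_path` and the two-dimensional toy vectors are hypothetical; decoding each code into a preset would require the trained model and is not shown.

```python
def latent_path(z_start, z_end, num_steps=7, num_extra=2):
    """Return {t: z[t]} for t = 1 - num_extra, ..., num_steps + num_extra.

    z[1] equals z_start and z[num_steps] equals z_end; indices outside
    [1, num_steps] are extrapolations along the same line (assumption).
    """
    path = {}
    for t in range(1 - num_extra, num_steps + num_extra + 1):
        alpha = (t - 1) / (num_steps - 1)  # negative before the start, above 1 after the end
        path[t] = [a + alpha * (b - a) for a, b in zip(z_start, z_end)]
    return path


# Toy latent vectors (hypothetical values)
path = latent_path([0.0, 3.0], [6.0, 0.0], num_steps=7, num_extra=2)
```

With $T = 7$ and $2$ extra steps on each side, the indices run from $t = -1$ to $t = 9$, which gives 11 latent codes in total.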

Extrapolation example 1

	Extrapolation		Start preset "BOUM"	Interpolation					End preset "fuzzerro"	Extrapolation
SPINVAE-2

Extrapolation example 2

	Extrapolation		Start preset "INDIA 1"	Interpolation					End preset "Tonewheel2"	Extrapolation
SPINVAE-2

Presets Modulation

Under the Variational Auto-Encoder (VAE) framework, a preset $\mathbf{u}$ can be encoded as a Gaussian distribution $q \left( \mathbf{z} \mid \mathbf{u} \right)$ in the latent space, where $\mathbf{z}$ denotes a latent vector. The distribution is defined as $q \left( \mathbf{z} \mid \mathbf{u} \right) = \mathcal{N} \left( \mathbf{z} ; \mu, \sigma^2 \right)$, where the mean and standard deviation vectors $\mu$ and $\sigma$ are output by the Transformer encoder.

New presets, similar to the original $\mathbf{u}$, can be obtained by sampling latent vectors $\mathbf{z} \sim q \left( \mathbf{z} \mid \mathbf{u} \right)$ and decoding them into presets. This is a form of modulation, which can be used to make a preset evolve slightly over time and sound more dynamic.

In order to obtain more creative presets, the standard deviations of the Gaussian distribution can be artificially increased. The examples presented below use standard deviations of $2 \sigma$ and $3 \sigma$, where $\sigma$ is computed from the original preset $\mathbf{u}$. That is, latent vectors are sampled from $\mathcal{N} \left( \mathbf{z} ; \mu, \left(2\sigma\right)^2 \right)$ and $\mathcal{N} \left( \mathbf{z} ; \mu, \left(3\sigma\right)^2 \right)$, and are then decoded into modulated presets.
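The sampling step can be sketched as below, assuming independent latent dimensions so that each coordinate is drawn as $\mu_i + k \, \sigma_i \, \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, 1)$ and a scale factor $k$ of 2 or 3. The function name `sample_modulated_latent` and the values of `mu` and `sigma` are hypothetical; in practice they would come from the encoder, and each sampled vector would then be decoded into a preset.

```python
import random


def sample_modulated_latent(mu, sigma, scale=1.0, rng=None):
    """Draw one latent vector from N(mu, (scale * sigma)^2), one dimension at a time."""
    if rng is None:
        rng = random.Random()
    return [m + scale * s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Hypothetical encoder outputs for a three-dimensional latent space
mu = [0.2, -1.0, 0.5]
sigma = [0.1, 0.05, 0.2]

# Using the same seed for both draws makes the two samples share the same noise,
# so the 3-sigma sample deviates from mu exactly 1.5 times as much as the 2-sigma one.
z_2sigma = sample_modulated_latent(mu, sigma, scale=2.0, rng=random.Random(0))
z_3sigma = sample_modulated_latent(mu, sigma, scale=3.0, rng=random.Random(0))
```

Calling the function repeatedly with a single random generator yields a sequence of different latent vectors around $\mu$, which is one way to obtain a preset that varies over time.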

Modulation example 1

Original preset "CP-70" (no modulation)

Modulation, 2σ standard deviation

Modulation, 3σ standard deviation

Modulation example 2

Original preset "CHEAPO" (no modulation)

Modulation, 2σ standard deviation

Modulation, 3σ standard deviation