Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval,
enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds.
We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods.
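The paper does not reproduce its loss here, but the general shape of a contrastive objective over such positive/negative pairs can be sketched with a generic InfoNCE-style loss (this is an illustrative sketch, not the paper's exact formulation; the temperature value and similarity scores are made up):

```python
import math

def info_nce(sim_pos, sims_neg, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor:
    pull the positive pair together, push negatives apart.
    sim_pos / sims_neg are similarity scores (e.g. cosine)."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[0] / sum(exps))

# An anchor whose positive is already close yields a small loss;
# an anchor whose negatives dominate yields a large one.
low = info_nce(0.9, [0.1, 0.0, -0.2])
high = info_nce(0.1, [0.8, 0.7, 0.6])
print(low, high)
```

Realistic positive pairs here would be two renderings of the same virtual instrument (e.g. different notes or synth patches), while negatives come from other instruments.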
The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training.
The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input.
In this case, the proposed contrastive framework outperforms related works, achieving 84.2% top-1 and 96.4% top-5 accuracies for three-instrument mixtures.
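At retrieval time, such a framework reduces to nearest-neighbor search in the learned embedding space: the query audio (single instrument or mixture) is embedded once, and database instruments are ranked by similarity. A minimal sketch, assuming cosine similarity and toy 3-D embeddings (the instrument names and vectors below are purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k(query, database, k=5):
    """Rank instrument IDs in `database` by similarity to `query`."""
    ranked = sorted(database.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# toy database of instrument embeddings (illustrative only)
db = {
    "bass_01": [0.9, 0.1, 0.0],
    "lead_07": [0.1, 0.9, 0.1],
    "perc_03": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of a query sound
print(top_k(query, db, k=2))
```

Top-1 and top-5 accuracy then simply measure how often the ground-truth instrument appears at rank 1 or within the first 5 results.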
The following tables extend the results presented in Table 2 of the paper.
The dataset for three-instrument mixtures is reduced to 1,463 instruments:
615 basses, 480 synth leads, and 368 percussion instruments.