A Dataset of String Ensemble Recordings and Onset Annotations for Timing Analysis
Maciek Tomczak, Susan Min Li, and Massimiliano Di Luca
Extended Abstracts for the Late-Breaking Demo Session of the International Society of Music Information Retrieval Conference (ISMIR), Milan, Italy, 2023
In this paper, we present Virtuoso Strings, a dataset for timing analysis and automatic music transcription (AMT) tasks requiring note onset annotations. This dataset takes advantage of real-world recordings in multitrack format and is curated as a component of the Augmented Reality Music Ensemble (ARME) project, which investigates musician synchronisation and multimodal music analysis. The dataset comprises repeated recordings of quartet, trio, duet and solo ensemble performances. Each performance showcases varying temporal expressions and leadership role assignments, providing new possibilities for developing and evaluating AMT models across diverse musical styles. To reduce the cost of the labour-intensive manual annotation, a semi-automatic method was utilised for both annotation and quality control. The dataset features 746 tracks, totalling 68,728 onsets. Each track includes onset annotations for a single string instrument. This design facilitates the generation of audio files with varied instrument combinations for use in the AMT evaluation process.
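The per-instrument tracks and onset lists lend themselves to programmatic construction of evaluation mixes. Below is a minimal Python sketch of that idea; the file names, the mono stems, and the one-onset-time-per-line annotation format are illustrative assumptions rather than the released dataset's actual layout.

```python
import numpy as np
import soundfile as sf

def load_onsets(path):
    # Onset times in seconds, one value per line (assumed annotation format).
    with open(path) as f:
        return np.array([float(line.split()[0]) for line in f if line.strip()])

def mix_stems(stem_paths):
    # Sum time-aligned mono stems into one ensemble mix, padding to the longest stem.
    stems = [sf.read(p) for p in stem_paths]
    sr = stems[0][1]
    n = max(len(audio) for audio, _ in stems)
    mix = sum(np.pad(audio, (0, n - len(audio))) for audio, _ in stems)
    return mix / max(1e-9, np.max(np.abs(mix))), sr

# Hypothetical duo mix: violin and cello stems plus their per-instrument onset lists.
stem_files = ["violin_1.wav", "cello.wav"]
mix, sr = mix_stems(stem_files)
onsets = np.sort(np.concatenate([load_onsets(p.replace(".wav", "_onsets.txt"))
                                 for p in stem_files]))
sf.write("duo_mix.wav", mix, sr)
```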
AudioMostly
Onset Detection for String Instruments Using Bidirectional Temporal and Convolutional Recurrent Networks
Maciek Tomczak and Jason Hockman
Proceedings of the Audio Mostly Conference, Edinburgh, United Kingdom, ACM, New York, NY, USA, 2023
Recent work in note onset detection has centred on deep learning models such as recurrent neural networks (RNNs), convolutional neural networks (CNNs) and, more recently, temporal convolutional networks (TCNs), which achieve high evaluation accuracies for onsets characterised by clear, well-defined transients, as found in percussive instruments. However, onsets with less pronounced transients, as found in string instrument recordings, remain a difficult challenge for state-of-the-art algorithms. This challenge is further exacerbated by a paucity of string instrument data containing expert annotations. In this paper, we propose two new models for onset detection using bidirectional temporal and convolutional recurrent networks, which generalise to polyphonic signals and string instruments. We evaluate the proposed methods alongside state-of-the-art onset detection algorithms on a benchmark dataset from the MIR community, as well as on a test set from a newly proposed dataset of string instrument recordings with note onset annotations, comprising approximately 40 minutes of audio and over 8,000 annotated onsets with varied expressive playing styles. The results demonstrate the effectiveness of both presented models, which outperform the state-of-the-art algorithms on string recordings while maintaining comparable performance on other types of music.
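For readers unfamiliar with this family of models, the sketch below shows a generic convolutional recurrent onset detector in PyTorch: convolutional layers over a log-mel spectrogram, a bidirectional GRU over time, a frame-wise sigmoid onset activation, and naive peak picking. The layer sizes and the peak-picking rule are illustrative assumptions, not the architectures proposed in the paper.

```python
import torch
import torch.nn as nn

class ConvRecurrentOnsetNet(nn.Module):
    """Frame-wise onset activation from a log-mel spectrogram (batch, 1, n_mels, n_frames)."""
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),            # pool frequency only, keep time resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        self.rnn = nn.GRU(32 * (n_mels // 4), hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, spec):
        x = self.conv(spec)                               # (B, C, F', T)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)    # time-major feature sequence
        x, _ = self.rnn(x)
        return torch.sigmoid(self.out(x)).squeeze(-1)     # (B, T) onset activation in [0, 1]

def pick_onsets(activation, threshold=0.5):
    # Naive peak picking: frames above threshold that are local maxima.
    a = activation.detach().cpu().numpy()
    return [i for i in range(1, len(a) - 1)
            if a[i] > threshold and a[i] >= a[i - 1] and a[i] >= a[i + 1]]

# Example: a random spectrogram stands in for a real log-mel input.
net = ConvRecurrentOnsetNet()
act = net(torch.randn(1, 1, 80, 200))
print(pick_onsets(act[0]))
```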
RPPW
Adaptive Metronome: A MIDI Plug-in for Modelling Cooperative Timing in Music Ensembles
Sean Enderby, Ryan Stables, Jason Hockman, Maciek Tomczak, Alan Wing, Mark Elliott, and Massimiliano Di Luca
Rhythm Production and Perception Workshop (RPPW), Birmingham, UK, 2023
Creative rhythmic transformations of musical audio refer to automated methods for the manipulation of temporally relevant sounds in time. This paper presents a method for joint synthesis and rhythm transformation of drum sounds through the use of adversarial autoencoders (AAE). Users may navigate both the timbre and rhythm of drum patterns in audio recordings through expressive control over a low-dimensional latent space. The model is based on an AAE with Gaussian mixture latent distributions that introduce rhythmic pattern conditioning to represent a wide variety of drum performances. The AAE is trained on a dataset of bar-length segments of percussion recordings, along with their clustered rhythmic pattern labels. The decoder is conditioned during adversarial training to mix data-driven rhythmic and timbral properties. The system is trained on over 500,000 bars from 5,418 tracks in popular datasets covering various musical genres. In an evaluation using real percussion recordings, reconstruction accuracy and latent space interpolation between drum performances are investigated for audio generation conditioned on target rhythmic patterns.
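The conditioning idea can be illustrated with a toy adversarial autoencoder in PyTorch, where the decoder receives the latent code concatenated with a one-hot rhythmic-pattern label. The dense layers, dimensions, random placeholder inputs and single-Gaussian prior below are simplifying assumptions (the paper describes a Gaussian mixture prior), so this is a sketch of the training signal rather than the published model.

```python
import torch
import torch.nn as nn

feat_dim, latent_dim, n_patterns = 512, 16, 8

encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim + n_patterns, 256), nn.ReLU(),
                        nn.Linear(256, feat_dim))
discriminator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

x = torch.randn(32, feat_dim)                     # bar-length segment features (placeholder)
pattern = torch.eye(n_patterns)[torch.randint(0, n_patterns, (32,))]  # one-hot rhythm labels

# 1) Reconstruction: the decoder sees the latent code plus the rhythmic-pattern condition.
z = encoder(x)
x_hat = decoder(torch.cat([z, pattern], dim=1))
rec_loss = nn.functional.mse_loss(x_hat, x)

# 2) Adversarial regularisation: push encoder outputs towards the prior
#    (a single Gaussian here for simplicity).
z_prior = torch.randn_like(z)
d_loss = (bce(discriminator(z_prior), torch.ones(32, 1)) +
          bce(discriminator(z.detach()), torch.zeros(32, 1)))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

g_loss = rec_loss + 0.1 * bce(discriminator(z), torch.ones(32, 1))
opt_ae.zero_grad()
g_loss.backward()
opt_ae.step()
```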
DAFx
Adversarial Synthesis of Drum Sounds
Jake Drysdale, Maciek Tomczak, and Jason Hockman
Proceedings of the International Conference on Digital Audio Effects (DAFx), Vienna, Austria, 2020
Many recent approaches to creative transformations of musical audio have been motivated by the success of raw audio generation models such as WaveNet, in which audio samples are modelled by generative neural networks. This paper describes a generative audio synthesis model for multi-drum translation based on a WaveNet denoising autoencoder architecture. The timbre of an arbitrary source audio input is transformed to sound as if it were played by various percussive instruments while preserving its rhythmic structure. Two evaluations of the transformations are conducted, based on the capacity of the model to preserve the rhythmic patterns of the input and on the audio quality as it relates to the timbre of the target drum domain. The first evaluation measures the rhythmic similarities between the source audio and the corresponding drum translations, and the second provides a numerical analysis of the quality of the synthesised audio. Additionally, a semi- and fully-automatic audio effect is proposed, in which the user may assist the system by manually labelling source audio segments or by using a state-of-the-art automatic drum transcription system prior to drum translation.
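The first evaluation idea can be approximated with a short script that compares onset strength envelopes of the source audio and a translated rendering. The use of librosa, the cosine-similarity measure and the file names below are assumptions for illustration, not the paper's exact metric.

```python
import numpy as np
import librosa

def onset_envelope(path, sr=22050, hop_length=512):
    # Normalised onset strength envelope of a recording.
    y, _ = librosa.load(path, sr=sr, mono=True)
    env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    return env / (np.max(env) + 1e-9)

def rhythmic_similarity(source_path, translated_path):
    a, b = onset_envelope(source_path), onset_envelope(translated_path)
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    # Cosine similarity between the two envelopes (1.0 = identical rhythmic profile).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Hypothetical file names for a source loop and its snare-domain translation.
print(rhythmic_similarity("source_loop.wav", "translated_snare.wav"))
```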
2018
DAFx
Audio Style Transfer with Rhythmic Constraints
Maciek Tomczak, Carl Southall, and Jason Hockman
Proceedings of the International Conference on Digital Audio Effects (DAFx), Aveiro, Portugal, 2018
This project describes an approach to semantic recognition using Mel-frequency cepstral coefficients (MFCCs) extracted from equalised electric guitar recordings. Feature scaling is applied prior to training and testing on semantically processed samples with k-nearest neighbour (kNN) and support vector machine (SVM) classifiers. Using a dataset of 400 semantic trials collected from 20 experiment participants, the kNN and SVM classifiers were successfully trained to distinguish between warm and bright features. The results presented in this study show that a kNN model with k = 5 classifies the warm and bright descriptors most accurately, achieving 0.04% error on the test set.
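A minimal version of this pipeline can be sketched as follows: MFCC extraction, feature scaling, then kNN (k = 5) and SVM classifiers. The mean-MFCC summary statistic, the placeholder random features standing in for the 400 labelled trials, and the warm/bright label encoding are assumptions for illustration, not the study's exact setup.

```python
import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def mfcc_vector(path, n_mfcc=13):
    # Mean MFCC vector per recording; the summary statistic is an assumption.
    y, sr = librosa.load(path, sr=None, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def evaluate(X, y):
    # Feature scaling followed by kNN (k = 5) and SVM classification.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=0, stratify=y)
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    for name, clf in [("kNN (k=5)", KNeighborsClassifier(n_neighbors=5)),
                      ("SVM", SVC(kernel="rbf"))]:
        clf.fit(X_tr, y_tr)
        print(name, "test accuracy:", clf.score(X_te, y_te))

if __name__ == "__main__":
    # Placeholder random features stand in for MFCCs of the 400 labelled trials.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 13))
    y = rng.integers(0, 2, size=400)  # 0 = warm, 1 = bright
    evaluate(X, y)
```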