Music Information Retrieval Datasets

Georg Holzmann
grh _at_ mur _dot_ at

last update: 24.6.2009

Introduction

The availability of common datasets is very important for the progress of the music information retrieval (MIR) community. Whereas standard benchmark tasks are widely used in similar research areas (e.g. speech or handwriting recognition), music data is difficult to distribute freely because of very restrictive copyright laws. Several groups work around this problem by using music under a free license (e.g. Creative Commons) or by distributing only feature vectors instead of the audio data.

This page is an attempt to list the datasets that are already available. Similar resources for MIR tools, papers and conferences can be found on the music-ir.org web page. Furthermore, there is an annual contest held during the ISMIR Conference, the Music Information Retrieval Evaluation eXchange (MIREX), where groups can evaluate and compare the performance of their algorithms.

Note

Please contact me by mail (address above) if you know of or have any other free dataset, or if you have other comments!

Audio Datasets

Manifold Collections

  • ISMIR2004 Audio Description Contest Dataset

    Link:

    http://ismir2004.ismir.net/ISMIR_Contest.html

    Description:

    Datasets of the Audio Description Contest of the ISMIR 2004 Conference:

    • Genre Classification/Artist Identification
    • Melody Extraction
    • Tempo Induction
    • Rhythm Classification

    Format:

    WAV or MP3

    Availability:

    online

  • RWC Music Database

    Link:

    http://staff.aist.go.jp/m.goto/RWC-MDB/

    Description:

    Large-scale corpus containing six original collections: the Popular Music Database (100 songs), Royalty-Free Music Database (15 songs), Classical Music Database (50 pieces), Jazz Music Database (50 pieces), Music Genre Database (100 pieces), and Musical Instrument Sound Database (50 instruments). All musical pieces are provided as WAV files together with the corresponding MIDI files and, for songs, text files of the lyrics.

    Additional annotations to the RWC database (beat, melody, ...) are available at http://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/

    Format:

    WAV, MIDI, text files

    Availability:

    distributed on many CDs, contract necessary

  • Holzapfel's Greek Music Dataset

    Link:

    http://www.csd.uoc.gr/~hannover/Datasets.html

    Description:

    Three datasets of Greek music are available:

    • Rembetiko dataset: 21 singers, 80 files, with labels marking where singing voice is present
    • Traditional Cretan dances: for dance music classification, 6 classes, 30 files per class
    • Beat tracking dataset: 20 samples of traditional Cretan music, 30 seconds each, with beat annotations

    Format:

    MP3

    Availability:

    by email

Audio from various Genres/Artists with Tags

  • Magnatagatune Dataset

    Link:

    http://tagatune.org/Datasets.html

    Description:

    More than 25,000 music clips of 29 s each (Creative Commons licensed), each annotated with a combination of 188 tags (collected through Edith Law's "TagATune" game, http://www.gwap.com/gwap/gamesPreview/tagatune/).

    An "analysis" XML file with features related to timbre, rhythm and harmonic content is also included.

    Format:

    MP3

    Availability:

    online

  • Uni Dortmund Music Audio Benchmark Dataset

    Link:

    http://www-ai.cs.uni-dortmund.de/audio.html

    Description:

    1886 songs from different genres, with tags, downloaded from www.garageband.com.

    An example set with already extracted features is also available.

    Format:

    MP3, 44100 Hz, 128 kbps

    Availability:

    online

  • Latin Music Database

    Link: http://www.ppgia.pucpr.br/~silla/lmd/
    Description: 3,160 music pieces, classified into 10 different musical genres: Tango, Bolero, Bachata, Salsa, Merengue, Axé, Forró, Sertaneja, Gaúcha and Pagode
    Format: Feature Vectors, extracted with Marsyas
    Availability: online
  • USPOP2002 Pop Music Data Set

    Link: http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html
    Description: Large corpus with MFCC features from 706 albums and 8764 tracks (400 artists), with additional style tags
    Format: MFCC Feature Vectors
    Availability: 3 DVDs, by mail
  • Artist20 Dataset

    Link: http://labrosa.ee.columbia.edu/projects/artistid/
    Description: Artist20 is a database of six albums by each of 20 artists, making a total of 1413 tracks. There is a defined training set (three albums per artist), validation set (one album), and test set (two albums), together with a Matlab baseline classifier.
    Format: low quality MP3 (32 kbps, mono) and Feature Vectors (MFCCs and beat-chroma matrices)
    Availability: by email
  • CAL-500 Dataset

    Link: http://cosmal.ucsd.edu/cal/projects/AnnRet/AnnRet.php
    Description: Set of human tags for 500 popular songs, each tagged by at least 3 humans. The songs are in low-quality MP3 format, with additional feature vectors.
    Format: MP3 (mono, 32 kbps) and Feature Vectors (MFCC, delta MFCC, delta delta MFCC, dynamic MFCC, chroma and auditory filterbank temporal envelope)
    Availability: by email
  • Multi-Label Classification Dataset

    Link: http://mlkd.csd.auth.gr/multilabel.html
    Description: Multi-label dataset from Thessaloniki University with several mood tags for 593 songs. (Look for "Files and Sources / emotions")
    Format: Feature Vectors (MFCC, centroid, rolloff, flux) in Weka ARFF format
    Availability: online
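
The Multi-Label Classification Dataset above is distributed in Weka ARFF format. As a rough illustration of how such a file is laid out, the following Python sketch reads a small subset of ARFF (dense rows, unquoted attribute names) and separates feature columns from label columns. The attribute names and values in the sample are made up for the example and are not taken from the actual dataset.

```python
def read_arff(text):
    """Parse a small subset of ARFF: attribute names plus dense data rows.

    Assumption: no quoted attribute names and no sparse rows.
    """
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


# Hypothetical toy file: two numeric features followed by two binary labels.
sample = """\
% toy example, not real data
@relation toy_emotions
@attribute mean_centroid numeric
@attribute mean_rolloff numeric
@attribute happy {0,1}
@attribute sad {0,1}
@data
0.12,0.34,1,0
0.56,0.78,0,1
"""

attributes, rows = read_arff(sample)
n_features = 2  # assumption: the first two columns are features
features = [[float(v) for v in row[:n_features]] for row in rows]
labels = [[int(v) for v in row[n_features:]] for row in rows]
print(attributes)
print(features, labels)
```

For real files, an established reader such as `scipy.io.arff.loadarff` is probably the better choice; the sketch only shows the header/data structure of the format.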

Instrument and Sound Samples

  • University of Iowa Musical Instruments Samples

    Link: http://theremin.music.uiowa.edu/MIS.html
    Description: Chromatic scales of many instruments, recorded at three non-normalized dynamic levels
    Format: 16 bit, 44.1 kHz, AIFF
    Availability: online
  • OLPC Samples Collection

    Link:

    http://wiki.laptop.org/go/Sound_samples

    Description:

    Free sound samples from the One Laptop per Child project, including 8458 uncompressed samples (over 8 GB) of

    • Instruments
    • Voice
    • Noise and Everyday Sounds
    • Synthesizers

    Format:

    WAV

    Availability:

    online

  • McGill University Master Samples

    Link: http://www.music.mcgill.ca/resources/mums/html/
    Description: Library of musical instrument sound samples
    Format: WAV
    Availability: 3 DVDs, by mail
  • ENST-Drums

    Link: http://perso.telecom-paristech.fr/~gillet/ENST-drums/
    Description: A large and varied audio-visual research database for automatic drum transcription and processing
    Format: 8 individual audio channels, plus video filmed from two angles
    Availability: by mail, contract necessary
  • Freesound

    Link: http://www.freesound.org/
    Description: Big collaborative database of Creative Commons licensed sounds. Freesound focusses only on sound, not songs.
    Format: many
    Availability: online
  • UrbanSync Dataset

    Link:

    http://urbansync.wordpress.com/download/

    Description:

    This Creative Commons licensed data consists of 4 different data streams, recorded in different cities:

    • GPS
    • Urban Sound as digital audio (mp3)
    • Physiological data and context (CSV)
    • Sonified Esmog data as digital audio (mp3)

    Format:

    MP3, GPS, CSV

    Availability:

    online

Rest

Symbolic Datasets

Tags and Metadata for Audio

  • Lastfm-ArtistTags2007 Dataset

    Link: http://blogs.sun.com/plamere/entry/open_research_the_data_lastfm
    Description: The data consists of the raw tag counts for the 100 most frequently occurring tags that Last.fm listeners have applied to over 20,000 artists.
    Format: text files
    Availability: online
  • ISMIR08 Co-Occurrences of Artists in Playlists Dataset

    Link: http://labs.strands.com/music/affinity/
    Description: 1,030,068 user-compiled playlists with artists and tags, from http://www.mystrands.com
    Format: text files
    Availability: online
  • Song Segmentations

    Link: http://www.elec.qmul.ac.uk/digitalmusic/downloads/index.html#segment
    Description: Annotations of musical form (verses, choruses etc.) for 60 songs by The Beatles, Britney Spears, Michael Jackson, Madonna, Prince, The Clash, etc.
    Format: text files
    Availability: online
  • Genre-based chord transition data

    Link: http://research.microsoft.com/en-us/um/people/dan/chords/
    Description: Data with transition probabilities between different chords (such as Dm->G) computed from a database of popular music. Data is divided into four different human-labeled categories: Pop, Rock, Country and Beatles.
    Format: text files
    Availability: online
  • MusicBrainz

    Link: http://musicbrainz.org/
    Description: A community music metadatabase that attempts to create a comprehensive music information site. It provides tags and information about CDs, artists or albums.
    Format: text files
    Availability: online
  • DBTune

    Link: http://dbtune.org/
    Description: As part of the Linking Open Data on the Semantic Web community project, DBTune hosts a number of servers that provide access to music-related structured data in a Linked Data fashion. It now provides access to more than 14 billion RDF triples, with data from Jamendo, Magnatune, AudioScrobbler, MySpace, MusicBrainz, BBC playcount data, Echonest and more.
    Format: RDF
    Availability: online
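
The genre-based chord transition data above consists of transition probabilities between chords (such as Dm->G). How that dataset was actually computed is not described here, but the basic idea can be sketched in Python as a simple first-order estimate: count how often each chord is followed by each other chord and normalize per starting chord. The chord sequences below are invented for illustration only.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next chord | current chord) from lists of chord labels."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair every chord with its immediate successor.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    probabilities = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        probabilities[current] = {
            chord: n / total for chord, n in successors.items()
        }
    return probabilities


# Made-up chord sequences, not taken from the dataset.
songs = [
    ["Dm", "G", "C"],
    ["Dm", "G", "Am"],
    ["Dm", "F"],
]
probs = transition_probabilities(songs)
print(probs["Dm"])  # Dm is followed by G twice and by F once
print(probs["G"])   # G is followed by C once and by Am once
```

Computing such a table separately for each genre category gives one transition matrix per genre, which is the kind of structure the dataset description suggests.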

Music Scores, Themes and Chords

  • Musipedia

    Link:

    http://www.musipedia.org/

    Description:

    Musipedia, inspired by Wikipedia, is building a searchable, editable, and expandable collection of tunes, melodies, and musical themes. Every entry can be edited by anybody. An entry can contain a bit of sheet music, a MIDI file, textual information about the work and the composer, and last but not least the Parsons Code, a rough description of the melodic contour.

    30,000 symbolically encoded melodies (Lilypond, MIDI) and 100,000 MIDI files, searchable with a SOAP interface or with a web interface.

    Format:

    Lilypond, MIDI

    Availability:

    online

  • Wikifonia

    Link: http://www.wikifonia.org
    Description: Wikifonia is a free MusicXML dataset of lead sheets, containing melodies and, optionally, chord names and lyrics. It has been widely used in research, e.g. http://www.spectrum.ieee.org/jul08/6442
    Format: MusicXML (xml, mxl)
    Availability: online
  • MuseData

    Link: http://www.musedata.org/
    Description: An electronic library of classical music scores (European music from roughly 1700-1825)
    Format: MuseData, Humdrum and MIDI
    Availability: online
  • Kern Scores

    Link: http://kern.ccarh.org/
    Description: A library of virtual musical scores in the Humdrum data format (107,087 files), mainly European classical music
    Format: Humdrum
    Availability: online
  • Themefinder

    Link: http://www.themefinder.org/
    Description: Provides a web-based interface to the Humdrum thema command, which in turn allows searching of databases containing musical themes or incipits. Currently there are three databases: Classical Instrumental Music, European Folksongs, and Latin Motets from the sixteenth century.
    Format: Humdrum
    Availability: online
  • Beatles Chord Transcriptions

    Link: http://mir-research.blogspot.com/2007/09/beatles-chord-transcriptions.html
    Description: Transcription of the chords for all songs on the 12 studio albums of the Beatles
    Format: text files
    Availability: by email
  • zweiecktranscriptions

    Link: http://code.google.com/p/zweiecktranscriptions/
    Description: Chord transcriptions of the album "Zwielicht" by the band Zweieck, transcribed using Sonic Visualiser. More bands should follow.
    Format: text files and mp3
    Availability: online
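
The Musipedia entry above mentions the Parsons Code, a rough description of the melodic contour. In its usual form, the first note is written as `*` and every following note as `U` (up), `D` (down) or `R` (repeat), depending on how its pitch compares with the previous note. A minimal Python sketch, assuming the melody is given as a list of MIDI note numbers (this input representation is a choice made for the example, not Musipedia's interface):

```python
def parsons_code(pitches):
    """Return the Parsons code for a sequence of pitches (e.g. MIDI numbers)."""
    if not pitches:
        return ""
    code = ["*"]  # the first note has no predecessor to compare with
    for previous, current in zip(pitches, pitches[1:]):
        if current > previous:
            code.append("U")  # pitch goes up
        elif current < previous:
            code.append("D")  # pitch goes down
        else:
            code.append("R")  # pitch is repeated
    return "".join(code)


# Opening of "Twinkle, Twinkle, Little Star": C C G G A A G
melody = [60, 60, 67, 67, 69, 69, 67]
print(parsons_code(melody))  # *RURURD
```

Because only the direction of each step is kept, the code is independent of key and exact interval size, which is what makes it usable for rough melody queries.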

Free Online Music Platforms

Miscellaneous

Credits

This list was developed after a discussion on the MIR mailing list and MIREX mailing list. Thanks for the feedback from Kris West, Paul Lamere, Jay LeBoeuf, Luke Barrington, Dan Ellis, Andre Holzapfel, Richard Ranft, Claudio Baccigalupo, Eleanor Selfridge-Field, Fabien Gouyon, Yves Raimond, Thomas Bonte, Tristan Jehan, Rainer Typke, Juan Jose Burred, Matthias Mauch, Olivier Gillet and Edith Lok Man Law.