last update: 24.6.2009
The availability of common datasets is very important for the progress of the music information retrieval (MIR) community. Whereas standard benchmark tasks are widely used in similar research areas (e.g. speech or handwriting recognition), it is difficult to distribute music data freely because of very restrictive copyright laws. However, several groups try to overcome this problem by using music with a free license (e.g. Creative Commons) or by distributing only feature vectors and not the audio data.
This is an attempt to list the datasets that are already available. Similar resources for MIR tools, papers and conferences can be found at the music-ir.org web page. Furthermore, there is an annual contest held during the ISMIR Conference, the Music Information Retrieval Evaluation eXchange (MIREX), where groups can evaluate and compare the performance of their algorithms.
Note
Please contact me (mail) if you know of or have any other free dataset, or if you have other comments!
ISMIR2004 Audio Description Contest Dataset
Link: | |
---|---|
Description: | Datasets of the Audio Description Contest of the ISMIR 2004 Conference:
|
Format: | WAV or MP3 |
Availability: | online |
RWC Music Database
Link: | |
---|---|
Description: | Large-scale corpus which contains six original collections: the Popular Music Database (100 songs), Royalty-Free Music Database (15 songs), Classical Music Database (50 pieces), Jazz Music Database (50 pieces), Music Genre Database (100 pieces), and Musical Instrument Sound Database (50 instruments). All musical pieces are presented in WAV, the corresponding MIDI and text files of lyrics (for songs). Additional annotations to the RWC database (beat, melody, ...) are available at http://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/ |
Format: | WAV, MIDI, text files |
Availability: | distributed on many CDs, contract necessary |
Holzapfel's Greek Music Dataset
Link: | |
---|---|
Description: | Three available datasets with Greek music:
|
Format: | MP3 |
Availability: | by email |
Magnatagatune Dataset
Link: | |
---|---|
Description: | More than 25000 music clips, each 29 s long (Creative Commons licensed) and annotated with a combination of 188 tags (collected through Edith Law's "TagATune" game, http://www.gwap.com/gwap/gamesPreview/tagatune/). Additionally, an "analysis" XML file containing timbre, rhythm and harmonic-content related features is included. |
Format: | MP3 |
Availability: | online |
Uni Dortmund Music Audio Benchmark Dataset
Link: | |
---|---|
Description: | 1886 songs from different genres, downloaded from www.garageband.com, with tags. An example set with already extracted features is also available. |
Format: | MP3, 44100 Hz, 128 kbps |
Availability: | online |
Latin Music Database
Link: | http://www.ppgia.pucpr.br/~silla/lmd/ |
---|---|
Description: | 3,160 music pieces, classified into 10 different musical genres: Tango, Bolero, Bachata, Salsa, Merengue, Axé, Forró, Sertaneja, Gaúcha and Pagode |
Format: | Feature Vectors, extracted with Marsyas |
Availability: | online |
USPOP2002 Pop Music Data Set
Link: | http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html |
---|---|
Description: | Large corpus with MFCC features from 706 albums and 8764 tracks (400 artists), with additional style tags |
Format: | MFCC Feature Vectors |
Availability: | 3 DVDs, by mail |
Artist20 Dataset
Link: | http://labrosa.ee.columbia.edu/projects/artistid/ |
---|---|
Description: | Artist20 is a database of six albums by each of 20 artists, making a total of 1413 tracks. There is a defined training set (three albums per artist), validation set (one album), and test set (two albums) with a Matlab baseline classifier. |
Format: | low quality MP3 (32 kbps, mono) and Feature Vectors (MFCCs and beat-chroma matrices) |
Availability: | by email |
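The album-level partition described for Artist20 (three training albums, one validation album and two test albums per artist) can be sketched as follows. This is only an illustration of the idea of keeping whole albums together: the dataset ships its own fixed split, and the `(artist, album, track_id)` tuples, the function name and the alphabetical album ordering used here are all assumptions made for the example.

```python
from collections import defaultdict

def split_by_album(tracks, n_train=3, n_valid=1):
    """Assign whole albums to train/valid/test, per artist.

    tracks: iterable of (artist, album, track_id) tuples (hypothetical layout).
    Albums are ordered alphabetically, which is an arbitrary choice here;
    the real Artist20 split is defined by the dataset itself.
    """
    albums = defaultdict(lambda: defaultdict(list))
    for artist, album, track_id in tracks:
        albums[artist][album].append(track_id)

    splits = {"train": [], "valid": [], "test": []}
    for artist in sorted(albums):
        for index, album in enumerate(sorted(albums[artist])):
            if index < n_train:
                name = "train"
            elif index < n_train + n_valid:
                name = "valid"
            else:
                name = "test"
            # All tracks of one album end up in the same subset.
            splits[name].extend(albums[artist][album])
    return splits

# Toy data: one artist, six albums, two tracks per album.
toy_tracks = [
    ("artist_a", f"album_{a}", f"a{a}_t{t}")
    for a in range(6)
    for t in range(2)
]
toy_splits = split_by_album(toy_tracks)
print({name: len(ids) for name, ids in toy_splits.items()})
```

Splitting by album rather than by track keeps recordings from the same production session out of both the training and the test data.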
CAL-500 Dataset
Link: | http://cosmal.ucsd.edu/cal/projects/AnnRet/AnnRet.php |
---|---|
Description: | Set of human tags for 500 popular songs, tagged by at least 3 humans. The songs are in low-quality MP3 format, with additional feature vectors. |
Format: | MP3 (mono, 32 kbps) and Feature Vectors (MFCC, delta MFCC, delta delta MFCC, dynamic MFCC, chroma and auditory filterbank temporal envelope) |
Availability: | by email |
Multi-Label Classification Dataset
Link: | http://mlkd.csd.auth.gr/multilabel.html |
---|---|
Description: | Multi-label dataset from Thessaloniki University with several mood tags of 593 songs. (Look for "Files and Sources / emotions") |
Format: | Feature Vectors (MFCC, centroid, rolloff, flux) in Weka ARFF format |
Availability: | online |
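For readers unfamiliar with Weka's ARFF format, the following is a minimal sketch of reading a dense ARFF file with the Python standard library. It ignores attribute types, quoted attribute names and sparse rows, so it is not a full ARFF parser. The attribute names and values in `sample` are invented for illustration and are not taken from the actual dataset.

```python
import csv
import io

def read_simple_arff(text):
    """Parse a dense ARFF document into (attribute_names, rows).

    Simplified on purpose: attribute types are ignored, and quoted
    attribute names or sparse {...} rows are not supported.
    """
    names, data_lines, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            names.append(line.split(None, 2)[1])
        elif line.lower().startswith("@data"):
            in_data = True
    rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
    return names, rows

# Hypothetical miniature file: two features followed by two binary mood labels.
sample = """\
% invented example, not the real data
@relation emotions_demo
@attribute mean_centroid numeric
@attribute mean_rolloff numeric
@attribute happy {0,1}
@attribute sad {0,1}
@data
0.12,0.34,1,0
0.56,0.78,0,1
"""
names, rows = read_simple_arff(sample)
print(names)
print(rows[0])
```

In a multi-label file of this kind, the label columns are ordinary attributes, so the caller has to know which trailing columns are labels and which are features.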
University of Iowa Musical Instruments Samples
Link: | http://theremin.music.uiowa.edu/MIS.html |
---|---|
Description: | Chromatic scales of many instruments, recorded at three non-normalized dynamic levels |
Format: | 16 bit, 44.1 kHz, AIFF |
Availability: | online |
OLPC Samples Collection
Link: | |
---|---|
Description: | Free sound samples from the one laptop per child project, including 8458 uncompressed samples (over 8GB) of
|
Format: | WAV |
Availability: | online |
McGill University Master Samples
Link: | http://www.music.mcgill.ca/resources/mums/html/ |
---|---|
Description: | Library of musical instrument sound samples |
Format: | WAV |
Availability: | 3 DVDs, by mail |
ENST-Drums
Link: | http://perso.telecom-paristech.fr/~gillet/ENST-drums/ |
---|---|
Description: | A large and varied research audio-visual database for automatic drum transcription and processing |
Format: | 8 individual audio channels and filmed from two angles |
Availability: | by mail, contract necessary |
Freesound
Link: | http://www.freesound.org/ |
---|---|
Description: | Big collaborative database of Creative Commons licensed sounds. Freesound focusses only on sound, not songs. |
Format: | many |
Availability: | online |
UrbanSync Dataset
Link: | |
---|---|
Description: | This Creative Commons licensed data consists of 4 different data streams, recorded in different cities:
|
Format: | MP3, GPS, CSV |
Availability: | online |
QBSH: A Corpus for Designing QBSH (Query by Singing/Humming) Systems
Link: | http://neural.cs.nthu.edu.tw/jang2/dataSet/childSong4public/QBSH-corpus/ |
---|---|
Description: | The Query By Singing/Humming Corpus of the MIR Lab at CS Dept. of NTHU, Taiwan, including 48 MIDI files of the songs in the database and 2797 singing/humming clips from about 118 persons |
Format: | WAV and MIDI |
Availability: | online |
Covers80 Cover Song Dataset
Link: | http://labrosa.ee.columbia.edu/projects/coversongs/covers80/ |
---|---|
Description: | A collection of 80 songs, each performed by 2 artists, for automatic detection of "cover songs" (i.e. alternative performances of the same basic musical piece by different artists, typically with large stylistic and/or harmonic changes) |
Format: | low quality MP3 (32 kbps, mono) and Feature Vectors |
Availability: | online |
Graham's Melody Extraction Dataset
Link: | http://www.ee.columbia.edu/~graham/mirex_melody/ and http://labrosa.ee.columbia.edu/projects/melody/ |
---|---|
Description: | Audio files with corresponding pitch data for transcribing real music recordings into scores |
Format: | WAV |
Availability: | online |
MIREX06 Audio Tempo Extraction and Beat Tracking Datasets
Link: | http://www.music-ir.org/mirex/2006/index.php/Audio_Tempo_Extraction#Practice_Data |
---|---|
Description: | Practice data for the MIREX06 tempo extraction and beat tracking contest (20 examples) |
Format: | WAV |
Availability: | online |
Lastfm-ArtistTags2007 Dataset
Link: | http://blogs.sun.com/plamere/entry/open_research_the_data_lastfm |
---|---|
Description: | The data consists of the raw tag counts for the 100 most frequently occurring tags that Last.fm listeners have applied to over 20,000 artists. |
Format: | text files |
Availability: | online |
ISMIR08 Co-Occurrences of Artists in Playlists Dataset
Link: | http://labs.strands.com/music/affinity/ |
---|---|
Description: | 1,030,068 user-compiled playlists with artists and tags, from http://www.mystrands.com |
Format: | text files |
Availability: | online |
Song Segmentations
Link: | http://www.elec.qmul.ac.uk/digitalmusic/downloads/index.html#segment |
---|---|
Description: | Annotations of musical form (verses, choruses etc.) for 60 songs by The Beatles, Britney Spears, Michael Jackson, Madonna, Prince, The Clash, etc. |
Format: | text files |
Availability: | online |
Genre-based chord transition data
Link: | http://research.microsoft.com/en-us/um/people/dan/chords/ |
---|---|
Description: | Data with transition probabilities between different chords (such as Dm->G) computed from a database of popular music. Data is divided into four different human-labeled categories: Pop, Rock, Country and Beatles. |
Format: | text files |
Availability: | online |
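To make the notion of a chord transition probability (such as Dm->G) concrete, here is a small sketch that estimates first-order transition probabilities from chord sequences by counting. It is not the method used to produce the dataset above, whose file layout is not described here. The function name and the toy chord sequences are assumptions for illustration.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next chord | current chord) by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for chords in sequences:
        for current, following in zip(chords, chords[1:]):
            counts[current][following] += 1

    probabilities = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {
            chord: n / total for chord, n in followers.items()
        }
    return probabilities

# Invented toy progressions, one list per song.
songs = [
    ["Dm", "G", "C", "Dm", "G", "C"],
    ["C", "Am", "Dm", "G"],
]
probs = transition_probabilities(songs)
print(probs["Dm"]["G"])  # Dm is always followed by G in the toy data
print(probs["C"])
```

Running the same counting separately on the songs of each category would give one transition table per category.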
MusicBrainz
Link: | http://musicbrainz.org/ |
---|---|
Description: | A community music metadatabase that attempts to create a comprehensive music information site. It provides tags and information about CDs, artists or albums. |
Format: | text files |
Availability: | online |
DBTune
Link: | http://dbtune.org/ |
---|---|
Description: | As part of the Linking Open Data on the Semantic Web community project, DBTune hosts a number of servers, providing access to music-related structured data, in a Linked Data fashion. It now provides access to more than 14 billion RDF triples, with data from Jamendo, Magnatune, AudioScrobbler, MySpace, Musicbrainz, BBC playcount data, Echonest and more. |
Format: | RDF |
Availability: | online |
Musipedia
Link: | |
---|---|
Description: | Musipedia, inspired by Wikipedia, is building a searchable, editable, and expandable collection of tunes, melodies, and musical themes. Every entry can be edited by anybody. An entry can contain a bit of sheet music, a MIDI file, textual information about the work and the composer, and last but not least the Parsons Code, a rough description of the melodic contour. 30,000 symbolically encoded melodies (Lilypond, MIDI) and 100,000 MIDI files, searchable with a SOAP interface or with a web interface. |
Format: | Lilypond, MIDI |
Availability: | online |
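The Parsons Code mentioned above describes a melody only by its contour: each note is marked as going up (U), down (D) or repeating (R) relative to the previous note, with `*` standing for the first note. A minimal sketch, assuming the melody is given as a list of MIDI note numbers (the function name is made up for this example and is not part of Musipedia's interface):

```python
def parsons_code(pitches):
    """Return the Parsons code for a sequence of pitches (e.g. MIDI note numbers)."""
    if not pitches:
        return ""
    symbols = ["*"]  # the first note has no predecessor
    for previous, current in zip(pitches, pitches[1:]):
        if current > previous:
            symbols.append("U")  # up
        elif current < previous:
            symbols.append("D")  # down
        else:
            symbols.append("R")  # repeat
    return "".join(symbols)

# Opening of "Twinkle, Twinkle, Little Star": C C G G A A G
twinkle = [60, 60, 67, 67, 69, 69, 67]
print(parsons_code(twinkle))  # prints "*RURURD"
```

Because interval sizes and rhythm are discarded, many different melodies share the same code, which is why it is only a rough description.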
Wikifonia
Link: | http://www.wikifonia.org |
---|---|
Description: | Wikifonia is a free MusicXML dataset of lead sheets, containing melodies and, optionally, chord names and lyrics. It has been used quite a lot in research, e.g. http://www.spectrum.ieee.org/jul08/6442 |
Format: | MusicXML (xml, mxl) |
Availability: | online |
MuseData
Link: | http://www.musedata.org/ |
---|---|
Description: | An electronic library of classical music scores (European music from roughly 1700-1825) |
Format: | MuseData, Humdrum and MIDI |
Availability: | online |
Kern Scores
Link: | http://kern.ccarh.org/ |
---|---|
Description: | A library of virtual musical scores in the Humdrum data format (107,087 files), mainly European classical music |
Format: | Humdrum |
Availability: | online |
Themefinder
Link: | http://www.themefinder.org/ |
---|---|
Description: | Provides a web-based interface to the Humdrum thema command, which in turn allows searching of databases containing musical themes or incipits. Currently there are three databases: Classical Instrumental Music, European Folksongs, and Latin Motets from the sixteenth century. |
Format: | Humdrum |
Availability: | online |
Beatles Chord Transcriptions
Link: | http://mir-research.blogspot.com/2007/09/beatles-chord-transcriptions.html |
---|---|
Description: | Transcription of the chords for all songs on the 12 studio albums of the Beatles |
Format: | text files |
Availability: | by email |
zweiecktranscriptions
Link: | http://code.google.com/p/zweiecktranscriptions/ |
---|---|
Description: | Chord transcriptions of the album "Zwielicht" by the band Zweieck, transcribed using Sonic Visualiser. More bands should follow. |
Format: | text files and mp3 |
Availability: | online |
Jamendo
Link: | http://www.jamendo.com/ |
---|---|
Description: | Creative Commons licensed music |
Garage Band
Link: | http://www.garageband.com/ |
---|---|
Description: | Public domain recordings |
Magnatune Creative Commons Music
Link: | http://magnatune.com or http://magnatune.com/info/press/coverage/ccblog |
---|---|
Description: | Magnatune releases every song on the label under a Creative Commons license; listeners can buy the music if they like it |
Epitonic
Link: | http://epitonic.com/ |
---|---|
Description: | High quality free and legal mp3 music |
Networked Environment for Music Analysis
Link: | |
---|---|
Description: | A webservices system for submitting code and running it against virtual collections: "The NEMA team aims to create an open and extensible webservice-based resource framework that facilitates the integration of music data and analytic/evaluative tools that can be used by the global MIR and CM research and education communities on a basis independent of time or location." (full use in 2010) |
British Library Sound Archive
Link: | http://sounds.bl.uk/ |
---|---|
Description: | 32000 free and rare archival recordings of music, spoken word, and human and natural environments. Unfortunately, downloads are only available to UK universities; some recordings can be streamed from anywhere. |
Format: | unknown |
Availability: | online, but only for UK universities |
Echo Nest
Link: | http://.echonest.com and http://developer.echonest.com |
---|---|
Description: | Echo Nest provides APIs for building your own MIR datasets. It extracts tempo, beats, time signature, song sections, timbre, key, and other musical attributes from an uploaded song, and it can generate similar recommendations and return feeds. |
Format: | unknown |
Availability: | online |
This list was developed after a discussion on the MIR mailing list and MIREX mailing list. Thanks for the feedback from Kris West, Paul Lamere, Jay LeBoeuf, Luke Barrington, Dan Ellis, Andre Holzapfel, Richard Ranft, Claudio Baccigalupo, Eleanor Selfridge-Field, Fabien Gouyon, Yves Raimond, Thomas Bonte, Tristan Jehan, Rainer Typke, Juan Jose Burred, Matthias Mauch, Olivier Gillet and Edith Lok Man Law.