CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval [ISMIR 2023, Best Student Paper Award]
This project is initiated and owned by the Central Conservatory of Music.
In CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval, we introduce a solution for cross-modal symbolic MIR that utilizes contrastive learning and pre-training. The proposed approach, CLaMP (Contrastive Language-Music Pre-training), learns cross-modal representations between natural language and symbolic music using a music encoder and a text encoder trained jointly with a contrastive loss. To pre-train CLaMP, we collected a large dataset of 1.4 million music-text pairs. We employed text dropout as a data augmentation technique and bar patching to represent music data efficiently, which reduces the sequence length to less than 10% of the original. In addition, we developed a masked music model pre-training objective to enhance the music encoder's comprehension of musical context and structure. CLaMP integrates textual information to enable semantic search and zero-shot classification for symbolic music, surpassing the capabilities of previous models. To support the evaluation of semantic search and music classification, we publicly release WikiMusicText (WikiMT), a dataset of 1010 lead sheets in ABC notation, each accompanied by a title, artist, genre, and description. In comparison to state-of-the-art models that require fine-tuning, zero-shot CLaMP demonstrated comparable or superior performance on score-oriented datasets.
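To give a feel for why bar patching shortens the input, here is a minimal sketch. It assumes that each bar of an ABC tune body becomes one fixed-length character patch; the function name bar_patches, the patch length of 64 characters, and the simple bar-line splitting rule are illustrative assumptions, not the actual CLaMP preprocessing code.

```python
import re

def bar_patches(abc_body, patch_len=64):
    """Split an ABC tune body into one fixed-length character patch per bar.

    patch_len=64 is an assumed value chosen for illustration.
    """
    # Each match is the content of one bar plus its closing bar line, if any.
    bars = re.findall(r"[^|]+\|?", abc_body.strip())
    # Truncate or right-pad every bar to exactly patch_len characters.
    return [bar[:patch_len].ljust(patch_len) for bar in bars]

# Toy four-bar tune body (not taken from any dataset).
tune_body = "C2 D2 E2 F2|G2 A2 B2 c2|c2 B2 A2 G2|F2 E2 D2 C2|"
patches = bar_patches(tune_body)

# Compare the character-level length with the patch-level length.
ratio = len(patches) / len(tune_body)
print(f"{len(tune_body)} characters -> {len(patches)} patches ({ratio:.1%})")
```

In this toy case 48 characters become 4 patches, so the encoder would see a sequence roughly one twelfth as long. Real ABC files contain repeat signs, double bar lines, and header fields that this sketch does not handle.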
The architecture of CLaMP comprises two encoders, one for music and one for text, trained jointly with a contrastive loss to learn cross-modal representations.
Two variants of CLaMP are introduced: CLaMP-S/512 and CLaMP-S/1024. Both models consist of a 6-layer music encoder and a 6-layer text encoder with a hidden size of 768. While CLaMP-S/512 accepts input music sequences of up to 512 tokens in length, CLaMP-S/1024 allows for up to 1024 tokens. The maximum input length for the text encoder in both models is 128 tokens. These models are part of Muzic, a research initiative on AI music that leverages deep learning and artificial intelligence to enhance music comprehension and generation.
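The joint training of the two encoders with a contrastive loss can be sketched as follows. This is a generic symmetric, CLIP-style formulation written in plain Python for readability; the temperature of 0.07 and all function names are assumptions, and the actual CLaMP training code may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(music_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired features.

    music_feats[i] and text_feats[i] are assumed to come from the same
    music-text pair; every other combination is treated as a negative.
    """
    music = [l2_normalize(v) for v in music_feats]
    text = [l2_normalize(v) for v in text_feats]
    # Scaled cosine similarity between every music and every text feature.
    logits = [[sum(a * b for a, b in zip(m, t)) / temperature for t in text]
              for m in music]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the music-to-text and text-to-music directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy batch: matched pairs give a low loss, mismatched pairs a high one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing this loss pulls each music feature toward the feature of its paired text and pushes it away from the other texts in the batch, which is what makes the later similarity-based retrieval possible.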
As part of our effort to make CLaMP more accessible to researchers and developers, we have created three Hugging Face spaces that showcase its abilities. The first space, CLaMP - Semantic Music Search, enables users to search for musical pieces using natural language queries, such as "a happy jazz song." The second space, CLaMP - Zero-Shot Music Classification, allows users to classify musical pieces based on their textual descriptions, without the need for any fine-tuning. Finally, the third space, CLaMP - Similar Music Recommendation, allows users to input a musical piece in MusicXML (.mxl) and receive recommendations for similar pieces based on their textual descriptions.
These spaces leverage the power of CLaMP's pre-trained models to provide users with state-of-the-art cross-modal symbolic music information retrieval capabilities. We hope that these spaces will inspire researchers and developers to explore the possibilities of CLaMP and contribute to the advancement of the field of AI music.
CLaMP is capable of aligning symbolic music and natural language, which can be used for various cross-modal retrieval tasks, including semantic search and zero-shot classification for symbolic music.
CLaMP performs cross-modal symbolic MIR tasks, including semantic search and zero-shot classification for symbolic music, without requiring task-specific training data.
Semantic search is a technique for retrieving music by open-domain queries, which differs from traditional keyword-based searches that depend on exact matches or meta-information. It involves two steps: 1) extracting music features from all scores in the library, and 2) transforming the query into a text feature. By calculating the similarities between the text feature and the music features, CLaMP can efficiently locate the score in the library that best matches the user's query.
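The two steps above reduce to a nearest-neighbour ranking once the features exist. The sketch below assumes cosine similarity as the similarity measure and uses hand-written toy vectors in place of real encoder outputs; the function names and file names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(text_feature, music_features, top_n=5):
    """Rank library scores by similarity to the query's text feature.

    music_features maps a score name to its precomputed music feature (step 1);
    text_feature is the encoded query (step 2).
    """
    scored = [(name, cosine(text_feature, feat))
              for name, feat in music_features.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy features standing in for music encoder outputs (assumed values).
library = {
    "jazz_tune.mxl": [0.9, 0.1],
    "sad_waltz.mxl": [0.1, 0.9],
    "march.mxl": [0.6, 0.6],
}
# Toy feature standing in for the text encoder output of a query.
query_feature = [1.0, 0.2]

results = semantic_search(query_feature, library, top_n=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because the music features do not depend on the query, they can be computed once for the whole library and reused for every search.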
Zero-shot classification refers to the classification of new items into any desired label without the need for training data. It involves using a prompt template to provide context for the text encoder. For example, a prompt such as "This piece of music is composed by {composer}." is utilized to form input texts based on the names of candidate composers. The text encoder then outputs text features based on these input texts. Meanwhile, the music encoder extracts the music feature from the unlabelled target symbolic music. By calculating the similarity between each candidate text feature and the target music feature, the label with the highest similarity is chosen as the predicted one.
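The procedure just described can be sketched in a few lines. The prompt template is the one quoted above; everything else, including the cosine similarity measure, the function names, and the lookup table that stands in for the text encoder, is an assumption made so the example runs on its own.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(music_feature, labels, encode_text,
                       template="This piece of music is composed by {composer}."):
    """Return the label whose prompt feature is most similar to the music feature."""
    # Build one input text per candidate label from the prompt template.
    prompts = [template.format(composer=label) for label in labels]
    scores = [cosine(music_feature, encode_text(prompt)) for prompt in prompts]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], dict(zip(labels, scores))

# Hypothetical stand-in for the text encoder: a fixed prompt-to-feature table.
TOY_TEXT_FEATURES = {
    "This piece of music is composed by Bach.": [0.9, 0.1, 0.0],
    "This piece of music is composed by Chopin.": [0.1, 0.9, 0.2],
}

def toy_encode_text(prompt):
    return TOY_TEXT_FEATURES[prompt]

# Toy feature standing in for the music encoder output of an unlabelled piece.
target_feature = [0.2, 0.8, 0.1]

predicted, scores = zero_shot_classify(target_feature, ["Bach", "Chopin"], toy_encode_text)
print(predicted)
```

Changing the classification task only requires a different template and label list, since no classifier weights are trained.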
- Semantic search and zero-shot classification for score-oriented symbolic music datasets.
- Cross-modal representation learning between natural language and symbolic music.
- Enabling research in music analysis, retrieval, and generation.
- Building innovative systems and applications that integrate music and language.
- CLaMP's current version has limited comprehension of performance MIDI.
- The model may not perform well on tasks outside its pre-training scope.
- It may require fine-tuning for some specific tasks.
To use CLaMP, you can follow these steps:
- Clone the CLaMP repository by running the following command in your terminal:
git clone https://github.com/microsoft/muzic.git
This will create a local copy of the repository on your computer.
- Navigate to the CLaMP directory by running the following command:
cd muzic/clamp
- Install the required dependencies by running the following command:
pip install -r requirements.txt
- If you are performing a music query, save your query as inference/music_query.mxl. For music keys, ensure that all the music files are in the MusicXML (.mxl) format and are saved in the inference/music_keys folder.
- If you are performing a text query, save your query as inference/text_query.txt. For text keys, save all the keys in the inference/text_keys.txt file, where each line corresponds to a key.
- Run the following command to perform the query:
python clamp.py -clamp_model_name [MODEL NAME] -query_modal [QUERY MODAL] -key_modal [KEY MODAL] -top_n [NUMBER OF RESULTS]
Replace [MODEL NAME] with the name of the CLaMP model you want to use (either sander-wood/clamp-small-512 or sander-wood/clamp-small-1024), [QUERY MODAL] with either music or text to indicate the type of query you want to perform, [KEY MODAL] with either music or text to indicate the type of key modal you want to use, and [NUMBER OF RESULTS] with the number of top results you want to return.
For example, to perform semantic music search with the sander-wood/clamp-small-512 model and return the top 5 results, run:
python clamp.py -clamp_model_name sander-wood/clamp-small-512 -query_modal text -key_modal music -top_n 5
Note that the first time you run the CLaMP script, it will automatically download the model checkpoint from Hugging Face. This may take a few minutes, depending on your internet speed.
- After running the command, the script will generate a list of the top results for the given query. Each result corresponds to a music file in the music_keys folder or a line in the text_keys.txt file, depending on the type of key modal you used.
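The file preparation and command steps above can also be scripted. The sketch below is a convenience wrapper under stated assumptions: the helper names prepare_text_files and build_command are hypothetical and not part of the repository, while the file paths and command-line flags are the ones listed in the steps above.

```python
from pathlib import Path

VALID_MODALS = ("music", "text")

def prepare_text_files(clamp_dir, query, keys=None):
    """Write the text query, and optionally the text keys, into inference/."""
    inference = Path(clamp_dir) / "inference"
    inference.mkdir(parents=True, exist_ok=True)
    (inference / "text_query.txt").write_text(query, encoding="utf-8")
    if keys is not None:
        # One key per line, as the usage steps require.
        (inference / "text_keys.txt").write_text("\n".join(keys), encoding="utf-8")
    return inference

def build_command(model_name, query_modal, key_modal, top_n):
    """Assemble the clamp.py command line as an argument list."""
    for modal in (query_modal, key_modal):
        if modal not in VALID_MODALS:
            raise ValueError(f"modal must be one of {VALID_MODALS}, got {modal!r}")
    return [
        "python", "clamp.py",
        "-clamp_model_name", model_name,
        "-query_modal", query_modal,
        "-key_modal", key_modal,
        "-top_n", str(top_n),
    ]

# Reproduces the semantic music search example from the steps above.
command = build_command("sander-wood/clamp-small-512", "text", "music", 5)
print(" ".join(command))
```

From inside the muzic/clamp directory, calling prepare_text_files(".", "a happy jazz song") writes the query file, and the returned argument list can then be passed to subprocess.run to launch the search.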
@misc{wu2023clamp,
title={CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval},
author={Shangda Wu and Dingyao Yu and Xu Tan and Maosong Sun},
year={2023},
eprint={2304.11029},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
This project uses xml2abc.py (source: https://wim.vree.org/svgParse/xml2abc.html) for converting XML music notation to ABC format. We would like to acknowledge and thank the author, Wim Vree, for developing this helpful tool.