This codebase adds local conditioning support for music generation, replacing the text-to-speech application of the original paper. We implement a MIDI reader that parses a MIDI file and upsamples its notes to generate the embeddings that condition the generated audio. See the upstream repo's README for more information on the overall architecture and approach.
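The upsampling step can be illustrated with a minimal sketch. It is not this repository's reader: the names `upsample_notes`, `one_hot`, and `SILENCE` are hypothetical, the notes are assumed to be already parsed into `(start_seconds, end_seconds, pitch)` tuples, and the music is assumed to be monophonic, so a later note simply overwrites an earlier one where they overlap. The idea is to stretch each note over every audio sample it covers, giving one conditioning label per sample.

```python
from typing import List, Tuple

# Hypothetical note representation: (start_seconds, end_seconds, MIDI pitch 0-127).
Note = Tuple[float, float, int]

# Assumed extra class id meaning "no note is sounding".
SILENCE = 128


def upsample_notes(notes: List[Note], sample_rate: int, duration: float) -> List[int]:
    """Expand note events into one pitch label per audio sample."""
    n_samples = int(round(duration * sample_rate))
    labels = [SILENCE] * n_samples
    for start, end, pitch in notes:
        # Convert note boundaries from seconds to sample indices, clipped to the clip length.
        first = max(0, int(round(start * sample_rate)))
        last = min(n_samples, int(round(end * sample_rate)))
        for i in range(first, last):
            labels[i] = pitch
    return labels


def one_hot(label: int, num_classes: int = 129) -> List[int]:
    """Turn a pitch label into a one-hot vector (128 pitches plus silence)."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


if __name__ == "__main__":
    # Two half-second notes (C4 then E4) at a toy sample rate of 8 Hz.
    labels = upsample_notes([(0.0, 0.5, 60), (0.5, 1.0, 64)], sample_rate=8, duration=1.0)
    print(labels)
```

In a real pipeline the sample rate would match the audio (for example 16000 Hz), and the per-sample labels or their one-hot vectors would be fed to the model as the local conditioning input alongside the waveform.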