Generating Music with GPT

Noufal Samsudin
3 min read · Jul 1, 2023

Implementing a GPT transformer from scratch and training it to generate MIDI music

With the advent of ChatGPT and its impressive language generation capabilities, GPT (Generative Pre-trained Transformer) models have captured the imagination of researchers and enthusiasts alike. These models have revolutionized natural language processing and opened up new possibilities for AI-generated text.

In this article, I detail what I learned implementing a GPT model from scratch and training it for a custom use case: generating symbolic music.

Let’s take a look at the results before we discuss the approach (full code on GitHub).

Results

GPTs are essentially sequence-to-sequence models trained to complete a given sequence. The model here takes an input sequence of musical notes and continues it, like having a musical copilot.
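
Concretely, generation is just repeated next-token prediction: feed the model a window of tokens, sample one more, append it, and repeat. Below is a minimal sketch of that sampling loop (the `model` object, tensor shapes, and default `block_size` are illustrative placeholders, not the repo’s exact code):

```python
import torch

@torch.no_grad()
def generate(model, tokens, max_new_tokens, block_size=256, temperature=1.0):
    """Extend a (1, T) tensor of token ids, one sampled token at a time."""
    model.eval()
    for _ in range(max_new_tokens):
        context = tokens[:, -block_size:]        # crop to the context window
        logits = model(context)                  # (1, T, vocab_size)
        logits = logits[:, -1, :] / temperature  # keep only the last position
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token
        tokens = torch.cat([tokens, next_token], dim=1)       # append and repeat
    return tokens
```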

I served the GPT model behind a Flask API and modified a cool MIDI player project I found on GitHub (https://github.com/ryohey/signal) to complete input music prompts. As the video shows, not every generation was a winner 😜; some were shockingly dreadful. But in most cases, the model does a decent job.
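
The Flask side can be as small as a single endpoint that accepts a token prompt and returns the continuation. Here is an illustrative sketch, reusing the `generate` helper above (the route name, JSON fields, and globals are my assumptions, not the repo’s exact code):

```python
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)
# `model` and `generate` as in the sampling sketch above (assumed in scope)

@app.route("/generate", methods=["POST"])  # hypothetical route name
def complete_sequence():
    # Expects a JSON body like {"tokens": [12, 87, ...], "max_new_tokens": 128}
    body = request.get_json()
    prompt = torch.tensor([body["tokens"]], dtype=torch.long)
    out = generate(model, prompt, body.get("max_new_tokens", 128))
    return jsonify({"tokens": out[0].tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```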

Some examples:

Building GPT from Scratch

Inspired heavily by Andrej Karpathy’s video walkthrough and the Illustrated Transformer blog.

Below is a diagram of the Transformer-decoder architecture I implemented in PyTorch (full code on GitHub).

GPT (Transformer-decoder architecture): 2 heads, 1 layer
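
The diagram boils down to token and position embeddings feeding one or more masked self-attention blocks, with a linear head projecting back to the vocabulary. Below is a condensed PyTorch sketch of that architecture; it uses the built-in `nn.MultiheadAttention` with a causal mask rather than hand-rolled attention heads, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, emb_dim, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(emb_dim), nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim), nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True entries are blocked, so position t sees only <= t
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                  # residual around attention
        x = x + self.ff(self.ln2(x))      # residual around the feed-forward
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, emb_dim, n_heads, n_layers, block_size):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(block_size, emb_dim)
        self.blocks = nn.Sequential(
            *[DecoderBlock(emb_dim, n_heads) for _ in range(n_layers)])
        self.head = nn.Linear(emb_dim, vocab_size)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)  # (B, T, emb_dim)
        return self.head(self.blocks(x))           # (B, T, vocab_size)
```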

Training the Model to Generate Short Stories:

To check that my implementation was correct, I initially trained the model on the TinyStories dataset. I built a small network and trained it for just one epoch on a small subset of the data. Below are some sample results:

Blue text is the prompt; red is the model output.

Here I just wanted to see whether the model could generate some semblance of natural language, and it does.

Adapting the Model for Music Generation:

Dataset: the ADL Piano MIDI dataset (link), comprising 2,000+ MIDI files.

Data preprocessing: the files were tokenized with miditok’s REMIPlus tokenizer, a quick way to tokenize and de-tokenize MIDI. Sequence length: 256 tokens.

There is some loss in tokenizing and de-tokenizing the files, but it’s acceptable.
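
For reference, the tokenization round trip looks roughly like this under the miditok 2.x-era API (class and method names have shifted across miditok versions, and the file names here are hypothetical):

```python
from miditok import REMIPlus        # REMI+ tokenizer, miditok 2.x API
from miditoolkit import MidiFile

tokenizer = REMIPlus()              # default tokenizer parameters

# MIDI -> tokens: a token sequence to train on
midi = MidiFile("some_piano_piece.mid")      # hypothetical input file
tokens = tokenizer.midi_to_tokens(midi)

# Tokens -> MIDI: the round trip is lossy, as noted above
reconstructed = tokenizer.tokens_to_midi(tokens)
reconstructed.dump("reconstructed.mid")      # hypothetical output file
```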

Model config

  • Attention heads: 8
  • Transformer layers: 1
  • Embedding dimension: 769
  • Token vocabulary: 633
  • Epochs: 5

Training ran for about 12 hours per epoch on my small GPU (a GTX 1080 Ti).
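
Tying it together, here is a condensed sketch of the next-token training loop under roughly this config, reusing the `GPT` sketch above. The DataLoader, learning rate, and other details are my assumptions, and I use an embedding dimension of 768 rather than the 769 listed above, since standard multi-head attention needs the dimension to split evenly across the 8 heads:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT(vocab_size=633, emb_dim=768, n_heads=8, n_layers=1,
            block_size=256).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is a guess

for epoch in range(5):
    for x, y in train_loader:  # assumed DataLoader; y is x shifted by one token
        x, y = x.to(device), y.to(device)
        logits = model(x)                          # (B, 256, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```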

There is still plenty of room for improvement here. My intention was to learn, get more hands-on experience with foundation models, and build something cool in the process.

Shoulders of Giants

  1. Andrej Karpathy’s video walkthrough: https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy
  2. The Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/
  3. GPT, Theory & Code: https://habr.com/en/companies/ods/articles/708672/

About The Author

I work at Dubai Holding, UAE, as a Principal Data Scientist. You can reach me at kvsnoufal@gmail.com or https://www.linkedin.com/in/kvsnoufal/.

Demo video: https://youtu.be/kLn-hvynM3I
