
Introduction

In this paper, we propose MuseTok, a tokenization method for symbolic music, and investigate its effectiveness in both music generation and understanding tasks. MuseTok applies an RQ-VAE to bar-wise music segments within a transformer-based encoder-decoder framework, producing discrete music codes that achieve high-fidelity music reconstruction and accurate understanding of music theory. We further apply the learned MuseTok codes to music generation and semantic understanding tasks to comprehensively evaluate the quality of the representations.

Fig.1 Overview of MuseTok (left) and its downstream generation (middle) and understanding (right) tasks.
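To make the bar-wise tokenization concrete, below is a minimal residual-quantization sketch in PyTorch. It is not the released MuseTok code: class and argument names are our own, and the transformer encoder and decoder around it are omitted. The idea is that the encoder maps each bar's event sequence to a latent vector, the quantizer turns that latent into D discrete codes (one per codebook), and the decoder reconstructs the bar from the quantized latent.

```python
# Minimal sketch of residual quantization over bar-wise latents (illustrative only).
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    def __init__(self, dim: int, depth: int, codebook_size: int):
        super().__init__()
        # one codebook of `codebook_size` entries per quantization depth
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(depth)
        )

    def forward(self, z):
        # z: (batch, dim) latent of one bar, produced by the transformer encoder
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        for codebook in self.codebooks:
            # pick the nearest codebook entry for the current residual
            dists = torch.cdist(residual, codebook.weight)   # (batch, codebook_size)
            idx = dists.argmin(dim=-1)                       # (batch,)
            chosen = codebook(idx)                           # (batch, dim)
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        # straight-through estimator: gradients bypass the discrete argmin
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(codes, dim=-1)         # codes: (batch, depth)
```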

We provide audio demonstrations and additional experimental details to supplement the experiment sections of the paper, organized into the sections below.

Dataset Statistics & Examples

We utilize public-domain music data for model training, with representative samples provided below to show the characteristics of our datasets. The # bars column reports the average number of bars per piece in each dataset, and # events the average number of events per bar (a sketch of how these averages could be computed follows the table).

| Dataset | # pieces | # bars (avg per piece) | # events (avg per bar) | Genres | Sample 1 | Sample 2 |
| --- | --- | --- | --- | --- | --- | --- |
| PDMX-monophonic [1] | 163,366 | 27.36 | 19.33 | mixed | (audio) | (audio) |
| PDMX-chorale [1] | 25,597 | 35.24 | 44.77 | mixed | (audio) | (audio) |
| POP909 [2] | 886 | 83.52 | 62.31 | pop | (audio) | (audio) |
| EMOPIA [3] | 1,071 | 17.79 | 46.73 | pop | (audio) | (audio) |
| Pop1k7 [4] | 1,747 | 104.59 | 39.39 | pop | (audio) | (audio) |
| Hymnal [5] | 1,606 | 18.47 | 47.05 | folk | (audio) | (audio) |
| Multipianomide [6] | 457 | 95.64 | 47.89 | classical | (audio) | (audio) |
| Ragtime [7] | 457 | 139.16 | 49.32 | jazz | (audio) | (audio) |
| Summary | 195,187 | 29.21 | 24.30 | - | - | - |
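As a rough illustration of how these averages could be computed, the sketch below assumes a REMI-like event representation in which each bar opens with a "Bar" event; this is our assumption for the example, not necessarily the paper's exact preprocessing.

```python
def dataset_stats(pieces):
    """pieces: list of event lists, e.g. [["Bar", "Position_1/16", "Pitch_60", ...], ...]
    Returns (# bars averaged per piece, # events averaged per bar)."""
    bars_per_piece, events_per_bar = [], []
    for events in pieces:
        n_bars = sum(1 for e in events if e == "Bar")   # count bar-line events
        if n_bars == 0:
            continue
        bars_per_piece.append(n_bars)
        events_per_bar.append(len(events) / n_bars)
    avg_bars = sum(bars_per_piece) / len(bars_per_piece)
    avg_events = sum(events_per_bar) / len(events_per_bar)
    return avg_bars, avg_events
```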

Reference:

[1] P. Long, Z. Novack, T. Berg-Kirkpatrick, and J. J. McAuley, “PDMX: A large-scale public domain musicxml dataset for symbolic music processing,” in Proc. ICASSP, 2025.

[2] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai, and G. Xia, “POP909: A pop-song dataset for music arrangement generation,” in Proc. ISMIR, 2020.

[3] H. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y.-H. Yang, “EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation,” in Proc. ISMIR, 2021.

[4] W. Hsiao, J. Liu, Y. Yeh, and Y.-H. Yang, “Compound Word Transformer: Learning to compose full song music over dynamic directed hypergraphs,” in Proc. AAAI, 2021.

[5] https://www.hymnal.net/en/home

[6] http://piano-midi.de/

[7] https://www.ragsrag.com/pr/pr.html

Music Reconstruction

We present samples below to demonstrate how well three models (VAE, MuseTok-Small, and MuseTok-Large) reconstruct the original pieces. Here, d denotes the hidden dimension, D the quantization depth, and K the codebook size.

(Audio samples) Original | VAE | MuseTok-Small | MuseTok-Large
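To connect this notation to the quantizer sketched in the introduction, d, D, and K map directly onto its constructor arguments. The values below are placeholders for illustration only; the actual MuseTok-Small and MuseTok-Large configurations are reported in the paper.

```python
# Hypothetical values, shown only to illustrate what d, D and K configure.
quantizer = ResidualQuantizer(dim=512, depth=8, codebook_size=1024)  # d=512, D=8, K=1024
```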

Music Continuation

To assess the performance of our model (MuseTok-Large) and two baselines (REMI and AMT) on the music continuation task, we conduct online listening tests to collect human assessments of generation quality. (Please refer to the paper for more details on the objective evaluation.)

After listening to the primer, participants are asked to rate each generated continuation on a 5-point Likert scale across several evaluation aspects.

A total of 24 responses are collected, with each response containing 8 groups of generated samples randomly selected from 32 primers in the test set, resulting in 192 ratings per model for each aspect.
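As a quick sanity check on these counts, and a sketch (with toy values and placeholder model/aspect names) of how mean opinion scores per model and aspect could be aggregated:

```python
from statistics import mean

n_responses, groups_per_response = 24, 8
assert n_responses * groups_per_response == 192          # ratings per model per aspect

# ratings[model][aspect] would hold the 192 Likert scores (1-5); toy data below.
ratings = {"MuseTok": {"overall": [4, 5, 3, 4]}, "REMI": {"overall": [3, 4, 2, 4]}}
mos = {m: {a: mean(v) for a, v in aspects.items()} for m, aspects in ratings.items()}
```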

The generation performance on objective and subjective metrics is presented below:

Fig.2 Objective and subjective evaluations on music generation.

Below are some example music primers and their corresponding continuations that were presented to participants during the listening test:

(Audio samples) Primer | MuseTok | REMI | AMT

When handling long-context primers, MuseTok demonstrates greater robustness compared to REMI-like generation frameworks:

(Audio samples) Primer | MuseTok | REMI | AMT

We also present generation samples from MuseTok without primers to show its effectiveness at generating content from scratch:
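For readers curious how generation in MuseTok's code space could be wired up, here is a hedged sketch. It is not the authors' released pipeline: `encoder`, `decoder`, and `code_lm` (an autoregressive model over code ids with an assumed generate() API) are hypothetical stand-ins, and `quantizer` refers to the ResidualQuantizer sketched in the introduction.

```python
import torch

def continue_piece(primer_bars, encoder, quantizer, code_lm, decoder, n_new_bars=16):
    """Encode the primer bar by bar into codes, extend the code sequence with a
    language model, then decode each generated code group back into one bar."""
    with torch.no_grad():
        # 1) tokenize the primer: one row of D code ids per bar
        primer_codes = torch.cat([quantizer(encoder(bar))[1] for bar in primer_bars])
        # 2) extend the code sequence autoregressively (assumed generate() API)
        new_codes = code_lm.generate(primer_codes, max_new_bars=n_new_bars)
        new_bars = []
        for codes in new_codes:                       # codes: (D,) ids for one new bar
            # 3) sum the codebook embeddings to rebuild the bar latent, then decode
            latent = sum(cb(codes[d]) for d, cb in enumerate(quantizer.codebooks))
            new_bars.append(decoder(latent.unsqueeze(0)))
        return new_bars
```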

How Does MuseTok Learn Music?

To further investigate the musical concepts learned by MuseTok, we examine the usage frequency of codes across different music textures and time signatures. Here we show the frequency of the 50 most used codes in each texture group or time signature group across all eight codebooks.
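A small sketch of this analysis (our own illustration; variable names and data layout are assumptions): given the code ids assigned to each bar and that bar's texture or time-signature label, count code usage per codebook and keep the 50 most frequent codes for each group.

```python
from collections import Counter

def top_codes(bar_codes, bar_groups, depth=8, top_k=50):
    """bar_codes: per-bar tuples of `depth` code ids; bar_groups: texture or
    time-signature label per bar. Returns the `top_k` most frequent code ids
    for every (group, quantization depth) pair."""
    counts = {}                                           # (group, depth) -> Counter
    for codes, group in zip(bar_codes, bar_groups):
        for d, code_id in enumerate(codes[:depth]):
            counts.setdefault((group, d), Counter())[code_id] += 1
    return {key: counter.most_common(top_k) for key, counter in counts.items()}
```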

Music Textures

The 50 most frequently used codes in different texture groups are largely distinct across all codebooks.

Fig.3 Top-50 used codes across three texture groups in all quantization depths.

Time Signatures

MuseTok largely ignores time signature differences in the first codebook, but its code usage gradually diverges in deeper codebooks.

Fig.4 Top-50 used codes across six time signatures in all quantization depths.