
Introduction

In this paper, we propose MuseTok, a tokenization method for symbolic music, and investigate its effectiveness in both music generation and understanding tasks. MuseTok applies an RQ-VAE to bar-wise music segments within a transformer-based encoder-decoder framework, producing discrete music codes that achieve high-fidelity music reconstruction and accurate understanding of music theory. We further apply the learned MuseTok codes to music generation and semantic understanding tasks to comprehensively evaluate the quality of the representations.

Fig.1 Overview of MuseTok (left) and its downstream generation (middle) and understanding (right) tasks.
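To make the bar-wise tokenization concrete, below is a minimal residual-quantization sketch in PyTorch. It is not the released MuseTok code: class and argument names are our own, and the transformer encoder and decoder around it are omitted. The idea is that the encoder maps each bar's event sequence to a latent vector, the quantizer turns that latent into D discrete codes (one per codebook), and the decoder reconstructs the bar from the quantized latent.

```python
# Minimal sketch of residual quantization over bar-wise latents (illustrative only).
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    def __init__(self, dim: int, depth: int, codebook_size: int):
        super().__init__()
        # one codebook of `codebook_size` entries per quantization depth
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(depth)
        )

    def forward(self, z):
        # z: (batch, dim) latent of one bar, produced by the transformer encoder
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        for codebook in self.codebooks:
            # pick the nearest codebook entry for the current residual
            dists = torch.cdist(residual, codebook.weight)   # (batch, codebook_size)
            idx = dists.argmin(dim=-1)                       # (batch,)
            chosen = codebook(idx)                           # (batch, dim)
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        # straight-through estimator: gradients bypass the discrete argmin
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(codes, dim=-1)         # codes: (batch, depth)
```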

We provide audio demonstrations and additional experimental details to supplement the experiment sections of the paper, organized into the sections below.

Dataset Statistics & Examples

We utilize public-domain music data for model training, with representative samples provided below to show the characteristics of our datasets. The # bars column reports the average number of bars per piece in each dataset, and # events the average number of events per bar (a sketch of how these averages could be computed follows the table).

| Dataset | # pieces | # bars (avg per piece) | # events (avg per bar) | Genres | Sample 1 | Sample 2 |
| --- | --- | --- | --- | --- | --- | --- |
| PDMX-monophonic [1] | 163,366 | 27.36 | 19.33 | mixed | (audio) | (audio) |
| PDMX-chorale [1] | 25,597 | 35.24 | 44.77 | mixed | (audio) | (audio) |
| POP909 [2] | 886 | 83.52 | 62.31 | pop | (audio) | (audio) |
| EMOPIA [3] | 1,071 | 17.79 | 46.73 | pop | (audio) | (audio) |
| Pop1k7 [4] | 1,747 | 104.59 | 39.39 | pop | (audio) | (audio) |
| Hymnal [5] | 1,606 | 18.47 | 47.05 | folk | (audio) | (audio) |
| Multipianomide [6] | 457 | 95.64 | 47.89 | classical | (audio) | (audio) |
| Ragtime [7] | 457 | 139.16 | 49.32 | jazz | (audio) | (audio) |
| Summary | 195,187 | 29.21 | 24.30 | - | - | - |
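As a rough illustration of how these averages could be computed, the sketch below assumes a REMI-like event representation in which each bar opens with a "Bar" event; this is our assumption for the example, not necessarily the paper's exact preprocessing.

```python
def dataset_stats(pieces):
    """pieces: list of event lists, e.g. [["Bar", "Position_1/16", "Pitch_60", ...], ...]
    Returns (# bars averaged per piece, # events averaged per bar)."""
    bars_per_piece, events_per_bar = [], []
    for events in pieces:
        n_bars = sum(1 for e in events if e == "Bar")   # count bar-line events
        if n_bars == 0:
            continue
        bars_per_piece.append(n_bars)
        events_per_bar.append(len(events) / n_bars)
    avg_bars = sum(bars_per_piece) / len(bars_per_piece)
    avg_events = sum(events_per_bar) / len(events_per_bar)
    return avg_bars, avg_events
```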

Reference:

[1] P. Long, Z. Novack, T. Berg-Kirkpatrick, and J. J. McAuley, “PDMX: A large-scale public domain musicxml dataset for symbolic music processing,” in Proc. ICASSP, 2025.

[2] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai, and G. Xia, “POP909: A pop-song dataset for music arrangement generation,” in Proc. ISMIR, 2020.

[3] H. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y.-H. Yang, “EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation,” in Proc. ISMIR, 2021.

[4] W. Hsiao, J. Liu, Y. Yeh, and Y.-H. Yang, “Compound Word Transformer: Learning to compose full song music over dynamic directed hypergraphs,” in Proc. AAAI, 2021.

[5] https://www.hymnal.net/en/home

[6] http://piano-midi.de/

[7] https://www.ragsrag.com/pr/pr.html

Music Reconstruction

We present samples below to demonstrate how well three models (VAE, MuseTok-Small, and MuseTok-Large) reconstruct the original pieces. Here, d denotes the hidden dimension, D the quantization depth, and K the codebook size.

(Audio samples) Original | VAE | MuseTok-Small | MuseTok-Large
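To connect this notation to the quantizer sketched in the introduction, d, D, and K map directly onto its constructor arguments. The values below are placeholders for illustration only; the actual MuseTok-Small and MuseTok-Large configurations are reported in the paper.

```python
# Hypothetical values, shown only to illustrate what d, D and K configure.
quantizer = ResidualQuantizer(dim=512, depth=8, codebook_size=1024)  # d=512, D=8, K=1024
```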

Music Continuation

To assess the performance of our model (MuseTok-Large) and two baselines (REMI and AMT) on the music continuation task, we conduct online listening tests to collect human assessments of generation quality. (Please refer to the paper for more details on the objective evaluation.)

After listening to the primer, participants are asked to rate each generated continuation on a 5-point Likert scale across several evaluation aspects.

A total of 24 responses are collected, with each response containing 8 groups of generated samples randomly selected from 32 primers in the test set, resulting in 192 ratings per model for each aspect.
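As a quick sanity check on these counts, and a sketch (with toy values and placeholder model/aspect names) of how mean opinion scores per model and aspect could be aggregated:

```python
from statistics import mean

n_responses, groups_per_response = 24, 8
assert n_responses * groups_per_response == 192          # ratings per model per aspect

# ratings[model][aspect] would hold the 192 Likert scores (1-5); toy data below.
ratings = {"MuseTok": {"overall": [4, 5, 3, 4]}, "REMI": {"overall": [3, 4, 2, 4]}}
mos = {m: {a: mean(v) for a, v in aspects.items()} for m, aspects in ratings.items()}
```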

The generation performance on objective and subjective metrics is presented below:

Fig.2 Objective and subjective evaluations on music generation.

Below are some example music primers and their corresponding continuations that were presented to participants during the listening test:

(Audio samples) Primer | MuseTok | REMI | AMT

When handling long-context primers, MuseTok demonstrates greater robustness compared to REMI-like generation frameworks:

(Audio samples) Primer | MuseTok | REMI | AMT

We also present generation samples from MuseTok without primers to show its effectiveness at generating content from scratch:
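For readers curious how generation in MuseTok's code space could be wired up, here is a hedged sketch. It is not the authors' released pipeline: `encoder`, `decoder`, and `code_lm` (an autoregressive model over code ids with an assumed generate() API) are hypothetical stand-ins, and `quantizer` refers to the ResidualQuantizer sketched in the introduction.

```python
import torch

def continue_piece(primer_bars, encoder, quantizer, code_lm, decoder, n_new_bars=16):
    """Encode the primer bar by bar into codes, extend the code sequence with a
    language model, then decode each generated code group back into one bar."""
    with torch.no_grad():
        # 1) tokenize the primer: one row of D code ids per bar
        primer_codes = torch.cat([quantizer(encoder(bar))[1] for bar in primer_bars])
        # 2) extend the code sequence autoregressively (assumed generate() API)
        new_codes = code_lm.generate(primer_codes, max_new_bars=n_new_bars)
        new_bars = []
        for codes in new_codes:                       # codes: (D,) ids for one new bar
            # 3) sum the codebook embeddings to rebuild the bar latent, then decode
            latent = sum(cb(codes[d]) for d, cb in enumerate(quantizer.codebooks))
            new_bars.append(decoder(latent.unsqueeze(0)))
        return new_bars
```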

How Does MuseTok Learn Music?

To further investigate the musical concepts learned by MuseTok, we examine the usage frequency of codes across different music textures and time signatures. Here we show the frequency of the 50 most used codes in each texture group or time signature group across all eight codebooks.
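A small sketch of this analysis (our own illustration; variable names and data layout are assumptions): given the code ids assigned to each bar and that bar's texture or time-signature label, count code usage per codebook and keep the 50 most frequent codes for each group.

```python
from collections import Counter

def top_codes(bar_codes, bar_groups, depth=8, top_k=50):
    """bar_codes: per-bar tuples of `depth` code ids; bar_groups: texture or
    time-signature label per bar. Returns the `top_k` most frequent code ids
    for every (group, quantization depth) pair."""
    counts = {}                                           # (group, depth) -> Counter
    for codes, group in zip(bar_codes, bar_groups):
        for d, code_id in enumerate(codes[:depth]):
            counts.setdefault((group, d), Counter())[code_id] += 1
    return {key: counter.most_common(top_k) for key, counter in counts.items()}
```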

Music Textures

The 50 most frequently used codes in different texture groups are largely distinct across all codebooks.

Fig.3 Top-50 used codes across three texture groups in all quantization depths.

Time Signatures

MuseTok largely ignores time signature differences in the first codebook, but its code usage gradually diverges in deeper codebooks.

Fig.4 Top-50 used codes across six time signatures in all quantization depths.