Whole-song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models [ICLR24, #69]
- cascaded diffusion models for hierarchical composition, high-level(whole song form) to low-level(notes, chords)
- use not only background(high-level) conditions, but also former relevant music segments and external controls
BUTTER: A representation learning framework for bi-directional music-sentence retrieval and generation [NLP4MusA20, #67]
- use VAE to encode music, GRU to encode text description, and linear transformation to transform these representations into the same embedding space
- in cross-modal alignment (by linear transform), the loss is calculated separately for each attribute (key, meter, style)
Motif-Centric Representation Learning for Symbolic Music [arXiv23/9, #66]
- Use VICReg, the training method in which no negative samples are used, for pretraining the model and use contrastive learning for finetuning
- to reduce the influence of randomly selected negative samples that have similar segments with positive samples
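The negative-free VICReg objective mentioned above combines an invariance, a variance, and a covariance term. The following is a minimal pure-Python sketch of that loss, assuming the default weights (25, 25, 1) and hinge target (1.0) of the original VICReg paper; the function and argument names are illustrative and are not taken from this paper's code.

```python
import math

def _var_cov_terms(z, gamma=1.0, eps=1e-4):
    """Variance hinge and off-diagonal covariance penalty for one batch."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    # Unbiased covariance matrix of the embedding dimensions.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Push the std of every dimension up to at least gamma (prevents collapse).
    variance = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    # Decorrelate dimensions by penalizing off-diagonal covariance.
    covariance = sum(cov[a][b] ** 2 for a in range(d) for b in range(d) if a != b) / d
    return variance, covariance

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """z_a, z_b: embeddings of two views of the same batch (lists of vectors)."""
    n = len(z_a)
    # Invariance: squared distance between paired embeddings, averaged over the batch.
    invariance = sum(
        (a - b) ** 2 for row_a, row_b in zip(z_a, z_b) for a, b in zip(row_a, row_b)
    ) / n
    var_a, cov_a = _var_cov_terms(z_a)
    var_b, cov_b = _var_cov_terms(z_b)
    return sim_w * invariance + var_w * (var_a + var_b) + cov_w * (cov_a + cov_b)
```

Only positive pairs (two views of the same segment) appear in the loss, so no randomly drawn negatives are needed at this stage.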
MusIAC: An extensible generative framework for Music Infilling Applications with multi-level Control [EvoMUSART22, #65]
- Add 6 types of control tokens for user control to REMI representation
Structure-Enhanced Pop Music Generation via Harmony-Aware Learning [ACM-MM22, #62]
- Harmony-Aware Hierarchical Music Transformer (HAT) models harmony, texture, and form by updating token representations
- the updated representations can be used for both understanding and generation tasks, and the model performs better on form and texture
Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer [IEEE Transactions on Multimedia, #59]
- Explicitly train the transformer to treat the conditioning sequence as thematic material that manifests itself multiple times.
- Contrastive representation learning and clustering for retrieving thematic materials
- Gated parallel attention module to more effectively account for the given conditioning thematic material
CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval [ISMIR23, #58]
- train text and symbolic music (ABC notation) encoders with a contrastive loss
- text dropout(data augmentation), bar patching(compound characters), masked music model pretraining(mask, shuffle, unchanged).
TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method [EMNLP22, #56]
- two-stage system, lyrics-to-template and template-to-melody
- generated melody can be controlled by adjusting the template
Compose & Embellish: Well-Structured Piano Performance Generation via A Two-Stage Approach [ICASSP23, #54]
- a transformer model generates a lead sheet, and then another transformer model generates the full piano performance conditioned on it
- a lead sheet consists of melody, chords, and structure information(like ABAB) of each bar
Melody Infilling with User-Provided Structural Context [ISMIR22, #52]
- encoder-decoder transformer model for structure-aware conditioning infilling
- use the bar-count-down technique and order embeddings to control the length, and an attention-selecting module that lets the model access multiple structural contexts while infilling
Variable-Length Music Score Infilling via XLNet and Musically Specialized Positional Encoding [ISMIR21, #51]
- can infill a variable number of notes (up to 128) for different time spans
- use CP, XLNet, relative bar encoding, and look-ahead onset prediction
Anticipation-RNN: Enforcing unary constraints in sequence generation, with application to interactive music generation [NCAA18, #50]
- anticipation-RNN can enforce user-defined unary constraints
- one RNN is used to encode the constraints and the other one is used to generate the sequence with the constraint information
- the constraints are fed in reverse order into the constraint RNN, so the generating RNN can use future constraint information
The Piano Inpainting Application [Sony21, #49]
- Structured MIDI Encoding is proposed and used to train Linear Transformer for infilling(inpainting)
- use time-shift tokens instead of note-on/off or duration tokens
Composer's Assistant: An Interactive Transformer for Multi-Track MIDI Infilling [ISMIR23, #44]
- Train T5-like model to infill multi-track MIDI whose arbitrary track-measures have been deleted
- The model can be used in the REAPER digital audio workstation(DAW)
Infilling Piano Performances [NeurIPS workshop18, #40]
- infill deleted section of MIDI by using the transformer
- give {left context + special token + right context} to the model and generate the blanks
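The input layout described above can be sketched as follows. This is only an illustration of the {left context + special token + right context} idea: the token name `<BLANK>` and the helper `build_infilling_example` are hypothetical, and the event strings are made up.

```python
BLANK = "<BLANK>"  # hypothetical special token marking the deleted section

def build_infilling_example(tokens, start, end, blank_token=BLANK):
    """Split a token sequence into a model input and the target to generate.

    tokens[start:end] is the deleted section; the model sees the left and
    right contexts joined by the special token and must produce the target.
    """
    left, target, right = tokens[:start], tokens[start:end], tokens[end:]
    model_input = left + [blank_token] + right
    return model_input, target

events = ["n60", "n62", "n64", "n65", "n67", "n69"]
model_input, target = build_infilling_example(events, 2, 4)
```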
FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control [ICLR23, #39]
- FIne-grained music Generation via Attention-based, RObust control (FIGARO) by applying description-to-sequence modelling
- combine learned high-level features with domain knowledge which acts as a strong inductive bias
Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [AAAI23 workshop, #38]
- given text, generate symbolic music using pre-trained language models like BERT, GPT-2, and BART
- the model that initializes the parameters by BART outperformed the model that initializes parameters randomly
Mubert [github]
- generate music from a free text prompt
- this is not the actual generation, but just combines the pre-composed music according to the rule
Vector Quantized Contrastive Predictive Coding for Template-based Music Generation [20, #36]
- given a template sequence, generate novel sequences sharing perceptible similarities with the original template
- encode and quantize the template sequence followed by decoding to generate variations
MMM : Exploring Conditional Multi-Track Music Generation [ArXiv20, #34]
- Multitrack inpainting, using event-level token representation and transformer
- replace the token sequence of each bar to be predicted with a special token, and append a special token at the end of the sequence
- no quantitative results, only method and demo
Multitrack Music Transformer [ICASSP23][#29]
- propose MMT with a multitrack music representation for symbolic music generation that reduces memory usage
- multitrack symbolic music generation by using MMT
- Continuation, scratch generation, instrument informed generation
Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs [AAAI21][#27]
- group consecutive and related tokens into compound words to capture the co-occurrence relationship
- 5-10 times faster at training with comparable quality
- generation with the CP Transformer can be seen as hyperedge prediction
PopMAG: Pop Music Accompaniment Generation [ACM20][#26]
- Multitrack-MIDI representation (MuMIDI) enables simultaneous multi-track generation in a single sequence
- model multiple attributes of a musical note in one step and use the Transformer-XL architecture to capture long-term dependencies
Controllable deep melody generation via hierarchical music structure representation [ISMIR21][#24]
- music framework generates rhythm and basic melody using two separate transformer-based networks
- then, generate the melody conditioned on the basic melody, rhythm, and chords
A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music [ICML18][#23]
- hierarchical decoder which first outputs embeddings for sub-sequences and then uses these embeddings to generate each subsequence
- propose MusicVAE which uses the hierarchical latent vector model
PopMNet: Generating structured pop music melodies using neural networks [Wu+, 19]
- CNN generates melody structure which is defined by pairwise relations, specifically the sequence between all bars in a melody
- RNN generates melodies conditioned on the structure and chord progression
MELONS: generating melody with long-term structure using transformers and structure graph [ICASSP22][#22]
- factor melody generation into 2 sub-problems: structure generation and structure conditional melody generation
- these sub-problems are solved by the linear transformer
Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions [ACM20][#20]
- propose REMI, a new event representation of beat-based music
- use automatic transcription to obtain enough training data
Museformer [Yu+ NeurIPS22][#11]
- Use fine- and coarse-grained attention for music generation
- fine-grained attention captures the tokens in the most relevant measures (the 1st, 2nd, 4th, 8th, ... previous measures)
- coarse-grained attention captures summaries of the other measures, which reduces the computational cost
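The split between the two attention types can be illustrated with a small helper that labels each earlier measure for a given current measure. This is a sketch only: `attention_granularity` is a hypothetical name, the offsets (1, 2, 4, 8) are just the ones listed above (the note elides the rest), and attention within the current measure is left out.

```python
def attention_granularity(current_bar, fine_offsets=(1, 2, 4, 8)):
    """Label every earlier bar as 'fine' or 'coarse' for the current bar.

    'fine'   : the bar's tokens are attended to directly.
    'coarse' : only a summary of the bar is attended to.
    """
    fine_bars = {current_bar - off for off in fine_offsets if current_bar - off >= 0}
    return ["fine" if bar in fine_bars else "coarse" for bar in range(current_bar)]

# For bar 5, the previous 1st, 2nd and 4th bars are bars 4, 3 and 1.
labels = attention_granularity(5)
```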
Music Transformer [Huang+, ICLR19][#10]
- Generate symbolic music by transformers with relative position-based attention
- reduce the memory requirements in relative position-based attention by "skewing"
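The "skewing" step can be sketched in pure Python as pad, reshape, and slice. The sketch assumes the relative logits are stored so that the last column is relative distance 0 and the first is distance -(L - 1); the function name `skew` and this layout are assumptions for illustration. Only the lower triangle (keys at or before the query) is meaningful after skewing, which matches causal masking.

```python
def skew(rel_logits):
    """Turn (L x L) logits indexed by relative distance into (query, key) order.

    rel_logits[i][r] is the logit of query i for relative distance r - (L - 1).
    After skewing, out[i][j] for j <= i equals rel_logits[i][L - 1 + j - i].
    """
    length = len(rel_logits)
    # 1) Pad one dummy column on the left: shape (L, L + 1).
    padded = [[0.0] + list(row) for row in rel_logits]
    # 2) Flatten and re-read with rows of width L: shape (L + 1, L).
    flat = [x for row in padded for x in row]
    reshaped = [flat[r * length:(r + 1) * length] for r in range(length + 1)]
    # 3) Drop the first row: shape (L, L).
    return reshaped[1:]

raw = [[10 * i + r for r in range(3)] for i in range(3)]
skewed = skew(raw)
```

No intermediate tensor with one relative embedding per (query, key) pair is ever built, which is where the memory saving comes from.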
Graph-based Polyphonic Multitrack Music Generation [Cosenza+, 23/7][#1]
- Use a graph to represent the multitrack music score
- Train GCN and VAE to generate graph (music)
- Not good performance
Impact of time and note duration tokenizations on deep learning symbolic music modeling [ISMIR23, #48]
- analyze the common tokenization methods and especially experiment with time and note duration representations
- demonstrate that explicit information leads to better results depending on the tasks
Multimodal Multifaceted Music Emotion Recognition Based on Self-Attentive Fusion of Psychology-Inspired Symbolic and Acoustic Features [APSIPA23, #43]
- Multimodal multifaceted MER method that uses features from MIDI and audio data based on musical psychology.
- Self-attention mechanism can learn the complicated relationships between different features and fuse the
PiRhDy: Learning Pitch-, Rhythm-, and Dynamics-aware Embeddings for Symbolic Music [ACM20][#30]
- generate music note embeddings
- (1) token modeling: separately represents pitch, rhythm, and dynamics and integrates them into a single token embedding
- (2) context modeling: use melodic and harmonic embedding to train the token embedding
MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training [ACL findings21][#28]
- Pre-train BERT with 1.5M MIDI files which are private and created by Microsoft Research Asia
- use OctupleMIDI encoding and bar-level masking strategy to enhance symbolic music data
Graph Neural Network for Music Score Data and Modeling Expressive Piano Performance [Jeong+, ICML19][#6]
- Use GNN and LSTM with Hierarchical Attention Network to generate expressive piano performance
- GNN captures the node information in a measure and LSTM w/HAN captures the measure information
- Iterative updates let each node incorporate information from nodes in other measures
Mustango: Toward Controllable Text-to-Music Generation [23/11, #55]
- Mustango extends the Tango text-to-audio model and is controlled by general text as well as music-specific text (chords, beats, tempo, key)
- MusicBench datasets comprise 5479 audio clips (each has 10sec)
ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models [AACLdemo23, #37]
- use the diffusion model to generate music conditioned by free-form text
- outperform the previous models in terms of text-music relevance and music quality, as judged by humans
LP-MusicCaps: LLM-Based Pseudo Music Captioning [ISMIR23, #71]
- Use LLM to generate the pseudo captions for MusicCaps dataset
- 4 different instructions (Writing, Summary, Paraphrase, Attribute Predictions)
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [ICLR24, #68]
- SSL for acoustic music understanding by HuBERT-based model with AcousticMLM and MusicalMLM
- AcousticMLM uses k-means on the log-Mel spectrum and Chroma features and EnCodec (VQ-VAE)
- MusicalMLM uses Constant-Q transform (CQT) spectrogram
Data Collection in Music Generation Training Sets: A Critical Analysis [ISMIR23, #45]
- Analysis of all datasets used to train Automatic Music Generation (AMG) models presented at the last 10 editions of ISMIR
- Discussed ethics and suggested the way to collect or use the dataset for AMG training
A Survey on Deep Learning for Symbolic Music Generation [ACM23/8][#25]
A Survey of AI Music Generation Tools and Models [Zhu+, 23/8][#12]
Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation [Chuan+, AAAI18]
- music generation, convolution
Counterpoint by Convolution [Huang+, ISMIR17]
- music generation, convolution, inpainting, Gibbs sampling
Musicaiz: A Python library for symbolic music generation, analysis, and visualization [Olivan+, 23]
- a Python library for symbolic music generation
Cadence Detection in Symbolic Classical Music using Graph Neural Networks [Karystinaios+, ISMIR22]
VirtuosoTune: Hierarchical Melody Language Model [Jeong, 23]
Noise2Music: Text-conditioned Music Generation with Diffusion Models [Huang+, 23/2]
Transformer vae: A hierarchical model for structure-aware and interpretable music representation learning [Jiang+, ICASSP20]
- Music VAE
- NSynth
- MuseNet
- SDMuse
Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [ICASSP22, #53]
- improve the ASR robustness by contrastive pretraining of RoBERTa
- fine-tuning with self-distillation to reduce the label noises due to ASR errors
Transformer ASR with Contextual Block Processing [ASRU19, #47]
- to capture global information (long dependency), introduce a context-aware inheritance mechanism into the transformer encoder
- also introduce a novel mask technique to implement the above mechanism
BEAST: Online Joint Beat and Downbeat Tracking Based on Streaming Transformer [ICASSP24, #46]
- Beat tracking streaming transformer (BEAST) is for online beat tracking and has a transformer-encoder with relative positional encoding
- improvement of 5% in beat and 13% in downbeat over the SOTA model
ChipSong: A Controllable Lyric Generation System for Chinese Popular Song [ACL22, #64]
- the model generates lyrics based on the given word-level length format, sentence-level length format, trigger word, and rhyme
- BIE Word-granularity embedding/attention for the length control and reverse order for rhyme control
QiuNiu: A Chinese Lyrics Generation System with Passage-Level Input [ACL22, #63]
- given passage-level text as input, QiuNiu generates lyrics that reflect the nuances of the user's need
- A two-step process (3 types of loss) to fine-tune the UMT model of QiuNiu and a post-process (score classifier and n-gram-based reranking) are used
Locally Typical Sampling [TACL22, #42]
- at each time step t, build the locally typical set: the words whose information content (negative log-probability) is close to the entropy of the predicted distribution
- sample randomly from the locally typical set
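One decoding step of this procedure can be sketched in pure Python. The sketch assumes the usual formulation in which tokens are ranked by the distance between their negative log-probability and the entropy, and the smallest set reaching a probability mass `tau` is kept; the names `locally_typical_set`, `typical_sample`, and `tau` are illustrative.

```python
import math
import random

def locally_typical_set(probs, tau=0.95):
    """Return the token indices of the locally typical set for one step."""
    # Entropy of the predicted next-token distribution (in nats).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank tokens by how close their information content is to the entropy.
    ranked = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= tau:  # smallest prefix that covers the target mass
            break
    return kept

def typical_sample(probs, tau=0.95, rng=random):
    """Sample one token index from the renormalized locally typical set."""
    kept = locally_typical_set(probs, tau)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

typical = locally_typical_set([0.5, 0.25, 0.125, 0.125], tau=0.7)
```

In this example the token with probability 0.25 is ranked ahead of the most probable token, because its information content is closer to the entropy.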
Contrastive Decoding: Open-ended Text Generation as Optimization [ACL23, #41]
- decoding with maximum probability often results in short and repetitive text and sampling can often produce incoherent text
- Contrastive Decoding (CD) is a reliable approach that optimizes a contrastive objective subject to a plausibility constraint
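A per-step sketch of the CD objective may help. It assumes the formulation in which the contrastive objective is the log-probability under a large "expert" LM minus that under a small "amateur" LM, and the plausibility constraint keeps only tokens whose expert probability is at least `alpha` times the maximum. The function names and the value `alpha=0.1` are assumptions for illustration, and the amateur probabilities are assumed to be strictly positive.

```python
import math

def contrastive_scores(expert_probs, amateur_probs, alpha=0.1):
    """Score each candidate next token for one step of contrastive decoding."""
    # Plausibility constraint: drop tokens the expert itself finds unlikely.
    cutoff = alpha * max(expert_probs)
    scores = []
    for p_exp, p_ama in zip(expert_probs, amateur_probs):
        if p_exp > 0 and p_exp >= cutoff:
            # Contrastive objective: reward what the expert likes more than the amateur.
            scores.append(math.log(p_exp) - math.log(p_ama))
        else:
            scores.append(float("-inf"))
    return scores

def pick_token(expert_probs, amateur_probs, alpha=0.1):
    """Greedy choice of the token with the highest contrastive score."""
    scores = contrastive_scores(expert_probs, amateur_probs, alpha)
    return max(range(len(scores)), key=scores.__getitem__)

expert = [0.6, 0.3, 0.09, 0.01]
amateur = [0.7, 0.1, 0.1, 0.1]
choice = pick_token(expert, amateur)
```

Without the plausibility constraint, the last token would be a candidate even though the expert assigns it almost no probability.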
InCoder: A Generative Model for Code Infilling and Synthesis [ICLR23][#32]
- InCoder can infill the program via left-to-right generation
- train by maximizing logP([left; ; right; ; span; ]) and inference by sampling tokens autoregressively from the distributions P(・| [left; ; right; ])
Hierarchical Attention Networks for Document Classification [Yang+, 16][#8]
- HAN can capture the insights of the hierarchical structure (words form sentences, sentences form a document) and the difference in importance of each word and sentence
WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings [Gao+, ACL23][#5]
- SimCSE + whitening
- whitening transforms the data to have zero mean and an identity covariance matrix
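The definition of whitening can be checked numerically. The sketch below uses Cholesky whitening, which is only one of several valid whitening transforms and is not claimed to be the variant used in WhitenedCSE; all function names are illustrative, and the covariance is assumed to be positive definite.

```python
import math

def mean_and_covariance(rows):
    """Mean vector and (population) covariance matrix of a list of vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / n
            for b in range(d)] for a in range(d)]
    return mean, cov

def cholesky(m):
    """Lower-triangular factor low such that low @ low^T == m."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def whiten(rows):
    """Center the data and map it so that its covariance becomes the identity."""
    mean, cov = mean_and_covariance(rows)
    low = cholesky(cov)
    d = len(mean)
    out = []
    for r in rows:
        x = [r[j] - mean[j] for j in range(d)]
        y = []
        for i in range(d):  # forward substitution: solve low @ y = x
            y.append((x[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
        out.append(y)
    return out

data = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [6.0, 7.0]]
white_mean, white_cov = mean_and_covariance(whiten(data))
```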
On Positional embeddings in BERT [ICLR21][#31]
- analyze Positional Embeddings (PEs) based on 3 properties, translational invariance, monotonicity, and symmetry
- breaking translational invariance and monotonicity degrades downstream task performance while breaking symmetry improves downstream task performance
- fully learnable absolute position embedding generally improves performance on the classification task, while relative position embedding improves performance on the span prediction task
Jointly Learning to Align and Translate with Transformer Models [EMNLP19, #57]
- train the transformer for both translation and word alignment tasks.
- transformer models produce translations and interpretable alignments simultaneously by a multi-task framework
Linearity of Relation Decoding in Transformer Language Models [ICLR24, #74]
- Relation mappings between subjects and objects can be approximated by a single linear transformation
- E.g., s = intermediate rep of "Miles Davis", R = linear transformation for "plays the instrument", o = last rep of "trumpet", then o ≈ Rs
Crawling The Internal Knowledge-Base of Language Models [ACL23 findings, #70]
- given a seed entity, generate the knowledge that is related to the seed by in-context learning
- divide into sub-tasks where the relation, object, and paraphrasing are generated separately by few-shot
Locating and Editing Factual Associations in GPT [Meng+, NeurIPS22]
Mass-Editing Memory in a Transformer [Meng+, ICLR23]
- follow-up work to "Locating and Editing Factual Associations in GPT"
Learning Interpretable Low-dimensional Representation via Physical Symmetry [NeurIPS23, #72]
- use physical symmetry as a self-consistency constraint for the latent space of time-series data
- the constraints lead the model to learn interpretable representation
A Survey on Deep Graph Generation: Methods and Applications [PRML22, #73]
- formulation of the problem of deep graph generation, discussion of its difference with several related graph learning tasks, 3 categories of the methods, applications, challenges
Do Transformers Really Perform Bad for Graph Representation? [NeurIPS21][#21]
- propose Graphormer, which is based on the standard Transformer and can utilize the structural information of a graph
- similar to #18
GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs [Zhang+, 18] [#19]
- use a convolutional sub-network to control each attention head's weight
Global Self-Attention as a Replacement for Graph Convolution [KDD22][#18]
- add edge information to the self-attention calculation
Graph Transformer Networks [Yun+, NeurIPS19][#17]
- generate new meta-paths that represent multi-hop relations and new graph structures by multiplying different adjacency matrices
- can be applied to heterogeneous graphs
Graph Transformer [Li+, 18][#16]
- evolve the target graph by recurrently applying source-attention from the source graph and self-attention from the target graph
- the source and target graphs can have different structures
Attention Guided Graph Convolutional Networks for Relation Extraction [Guo+, ACL19][#15]
- transform the original graph to a fully connected edge-weighted graph by self-attention
Multi-hop Attention Graph Neural Networks [Wang+, IJCAI21][#14]
- compute attention weights on edges, then compute self-attention weight between disconnected nodes
- can capture the long-range dependencies
On the Global Self-attention Mechanism for Graph Convolutional Networks [Wang+, IEEE20][#13]
- Apply Global self-attention (GSA) to GCNs
- GSA allows GCNs to capture feature-based vertex relations regardless of edge connections
Self-Attention with Relative Position Representations [Shaw, NAACL18][#9]
- extend self-attention to consider the pairwise relationships between each element
- introduce trainable relative position representations and add them to the key and value vectors
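A single-head sketch of self-attention with relative position representations is given below. It assumes that the learned relative embeddings are added to the keys and to the values, and that relative distances are clipped to `[-max_dist, max_dist]`; the function name and the table layout (row `r + max_dist` holds distance `r`) are illustrative assumptions.

```python
import math

def relative_self_attention(queries, keys, values, rel_k, rel_v, max_dist):
    """queries/keys/values: lists of n vectors; rel_k/rel_v: 2*max_dist+1 rows."""
    n, d = len(queries), len(queries[0])
    outputs = []
    for i in range(n):
        # Table row for each key position, after clipping the distance j - i.
        rows = [max(-max_dist, min(max_dist, j - i)) + max_dist for j in range(n)]
        logits = []
        for j in range(n):
            key = [k + a for k, a in zip(keys[j], rel_k[rows[j]])]
            logits.append(sum(q * k for q, k in zip(queries[i], key)) / math.sqrt(d))
        # Numerically stable softmax over all key positions.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        out = [0.0] * len(values[0])
        for j in range(n):
            for c in range(len(out)):
                out[c] += (exps[j] / total) * (values[j][c] + rel_v[rows[j]][c])
        outputs.append(out)
    return outputs

zeros = [[0.0, 0.0] for _ in range(3)]
same_pos_bonus = [[0.0, 0.0], [10.0, 10.0], [0.0, 0.0]]  # only distance 0 is non-zero
out = relative_self_attention(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [3.0, 2.0]], zeros, same_pos_bonus, max_dist=1,
)
```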
Graph Attention Networks [Velickovic+, ICLR18][#4]
- train weight matrix which represents the relation between nodes
- it can be seen as self-attention with an artificially created mask
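The "self-attention with a mask" view can be sketched as a single attention head whose softmax runs only over graph neighbors. The sketch assumes node features that are already linearly projected, an attention vector split into a source and a destination half, a LeakyReLU slope of 0.2, and an adjacency matrix that includes self-loops; the output non-linearity and multi-head aggregation are left out, and all names are illustrative.

```python
import math

def gat_head(wh, adj, a_src, a_dst, slope=0.2):
    """wh: projected node features; adj[i][j] truthy if j is a neighbor of i."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    outputs = []
    for i in range(len(wh)):
        # The mask: attention is computed over neighbors only.
        neighbors = [j for j in range(len(wh)) if adj[i][j]]
        logits = []
        for j in neighbors:
            e = dot(a_src, wh[i]) + dot(a_dst, wh[j])
            logits.append(e if e > 0 else slope * e)  # LeakyReLU
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        out = [0.0] * len(wh[0])
        for j, e in zip(neighbors, exps):
            for c in range(len(out)):
                out[c] += (e / total) * wh[j][c]
        outputs.append(out)
    return outputs

features = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
adjacency = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # node 2 only sees itself
uniform = gat_head(features, adjacency, [0.0, 0.0], [0.0, 0.0])
```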
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration [ICML23, #35]
- solve linear inverse problems using denoising diffusion restoration
- it can be used in cases where the linear operator is unknown
PeerNets: Exploiting Peer Wisdom Against Adversarial Attacks [Svoboda+, 18]