The official implementation of "A Singing Melody Extraction Network Via Self-Distillation and Multi-Level Supervision." Our paper has been accepted to ICASSP 2025.
We propose a singing melody extraction network consisting of five stacked multi-scale feature time-frequency aggregation (MF-TFA) modules. Within the same network, deeper layers generally carry more contextual information than shallower ones. To strengthen the shallower layers' ability to extract task-relevant features, we propose a self-distillation and multi-level supervision (SD-MS) method, which distills features from the deepest layer to the shallower ones and applies multi-level supervision to guide network training.
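To make the training objective concrete, below is a minimal sketch of how an SD-MS-style loss could be assembled. This is illustrative, not the paper's exact formulation: the function names (`sd_ms_loss`, `bce`), the choice of binary cross-entropy for the per-level supervision, the MSE feature-distillation term, and the `distill_weight` balance factor are all assumptions for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy averaged over all time-frequency bins.
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def sd_ms_loss(level_logits, level_feats, target, distill_weight=0.5):
    """Illustrative SD-MS-style objective (names and weights are assumptions):
    - multi-level supervision: every level's melody prediction is supervised
      against the same ground-truth salience map;
    - self-distillation: shallow-layer features are pulled toward the
      deepest layer's features (the deepest layer acts as the teacher).
    """
    teacher = level_feats[-1]  # deepest features as teacher (detached in practice)
    supervision = sum(bce(sigmoid(z), target) for z in level_logits)
    distillation = sum(float(np.mean((f - teacher) ** 2))
                       for f in level_feats[:-1])
    return supervision + distill_weight * distillation

# Toy usage with random tensors standing in for five levels' outputs.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 8)) for _ in range(5)]
logits = [rng.normal(size=(4, 8)) for _ in range(5)]
target = (rng.random((4, 8)) > 0.5).astype(float)
loss = sd_ms_loss(logits, feats, target)
```

In a real implementation the teacher features would be detached from the gradient graph so that distillation only updates the shallow layers, and the supervision target would be the frame-level melody annotation.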
The visualization illustrates that our proposed method reduces octave errors and melody detection errors.
Bold values indicate the best performance on each metric.
Results of ablation experiments that introduce the self-distillation and multi-level supervision method into several existing singing melody extraction models. SD-MS indicates that self-distillation and multi-level supervision is used.
Ablation study of the loss function on three datasets.
The complete code will be released publicly once licensing is finalized.