MelBandRoFormer Audio Separation

This ComfyUI workflow performs high-quality voice isolation using the MelBandRoFormer model. It takes a single mixed audio track and separates it into two stems: isolated vocals and instrumental accompaniment. The pipeline starts with LoadAudio to ingest your source file, then MelBandRoFormerModelLoader loads the pretrained weights (MelBandRoformer_fp16.safetensors) from models/diffusion_models. The core separation happens in MelBandRoFormerSampler, which runs inference and outputs two audio tensors—one for vocals and one for background music—ready for export via paired SaveAudioMP3 nodes.
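For orientation, the four-node pipeline above can be sketched as a graph in the JSON format that ComfyUI's `/prompt` API accepts. This is a hypothetical reconstruction: the input field names (`audio`, `model`, `filename_prefix`) and output indices are assumptions and may differ in your node version, so check the actual workflow JSON before using it programmatically.

```python
import json

# Hypothetical node graph mirroring the described workflow. Each entry maps a
# node id to its class and inputs; [node_id, output_index] pairs wire outputs
# to downstream inputs. Field names here are assumptions.
graph = {
    "1": {"class_type": "LoadAudio",
          "inputs": {"audio": "mix.mp3"}},
    "2": {"class_type": "MelBandRoFormerModelLoader",
          "inputs": {"model": "MelBandRoformer_fp16.safetensors"}},
    "3": {"class_type": "MelBandRoFormerSampler",
          "inputs": {"model": ["2", 0], "audio": ["1", 0]}},
    # Two savers, one per stem: output 0 assumed vocals, output 1 instrumental.
    "4": {"class_type": "SaveAudioMP3",
          "inputs": {"audio": ["3", 0], "filename_prefix": "vocals"}},
    "5": {"class_type": "SaveAudioMP3",
          "inputs": {"audio": ["3", 1], "filename_prefix": "instrumental"}},
}

payload = json.dumps({"prompt": graph})
# POST `payload` to http://127.0.0.1:8188/prompt on a running ComfyUI instance
# to queue the workflow headlessly.
```

This is the same graph you build on the canvas, just serialized, which is how you would batch-process many files without opening the UI.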

Technically, MelBandRoFormer is a transformer-based architecture that operates on band-split time–frequency representations to estimate source components. In practice, it learns masks that suppress the accompaniment when extracting vocals and vice versa, yielding cleaner stems with fewer musical artifacts. The included MarkdownNote provides inline guidance directly on the canvas. Because the workflow is node-driven, you can easily swap SaveAudioMP3 for a WAV saver, change output filenames, or tweak Sampler options such as chunking/overlap (if exposed by your node version) to balance speed, memory use, and separation quality.
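To make the masking idea concrete, here is a minimal NumPy sketch of time–frequency mask separation on toy magnitude spectrograms. It uses an oracle ratio mask computed from the known sources; a model like MelBandRoFormer instead learns to predict such a mask (or the stems directly) from the mixture alone, and the array shapes here are illustrative, not the model's actual representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrograms (freq_bins x frames) standing in for STFTs:
# "vocals" energy weighted toward low-mid bins, "accompaniment" toward high.
vocals = rng.random((64, 100)) * np.linspace(1.0, 0.2, 64)[:, None]
accomp = rng.random((64, 100)) * np.linspace(0.2, 1.0, 64)[:, None]
mixture = vocals + accomp

# Oracle soft ratio mask in [0, 1]: the ideal target a separation network
# learns to approximate from the mixture alone.
mask = vocals / (mixture + 1e-8)

est_vocals = mask * mixture          # mask passes vocal energy
est_accomp = (1.0 - mask) * mixture  # complementary mask passes the rest

# The two masks are complementary, so the stems sum back to the mixture.
assert np.allclose(est_vocals + est_accomp, mixture)
```

The complementarity is why a two-stem separator only needs to estimate one mask well: the instrumental stem falls out as the residual.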