ACE-Step 1.5XL SFT: Text to Music

This ComfyUI workflow turns text prompts into finished music using the ACE-Step 1.5XL SFT (4B) model. It follows a familiar diffusion pipeline adapted for audio. DualCLIPLoader and TextEncodeAceStepAudio1.5 convert your prompt (and optional negative prompt) into conditioning. EmptyAceStep1.5LatentAudio creates a latent audio canvas for your chosen duration. UNETLoader loads the ACE-Step 1.5XL SFT model, ModelSamplingAuraFlow adjusts the model's sampling settings for ACE-Step audio, and KSampler performs iterative denoising guided by CFG. Finally, VAELoader and VAEDecodeAudio convert the final latent back into a waveform that SaveAudioMP3 writes to disk.
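The data flow described above can be sketched as a small dependency graph. This is only an illustration: the node names come from the workflow, but the wiring between them is inferred from the description (for example, that ConditioningZeroOut takes the encoded prompt as its input), and the dictionary is not ComfyUI's own workflow or API format.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a node; each value lists the nodes it is ASSUMED to depend on.
# The wiring is inferred from the prose description, not read from the workflow file.
graph = {
    "DualCLIPLoader": [],
    "UNETLoader": [],
    "VAELoader": [],
    "EmptyAceStep1.5LatentAudio": [],
    "TextEncodeAceStepAudio1.5": ["DualCLIPLoader"],
    "ConditioningZeroOut": ["TextEncodeAceStepAudio1.5"],
    "ModelSamplingAuraFlow": ["UNETLoader"],
    "KSampler": [
        "ModelSamplingAuraFlow",       # patched model
        "TextEncodeAceStepAudio1.5",   # positive conditioning
        "ConditioningZeroOut",         # unconditional branch
        "EmptyAceStep1.5LatentAudio",  # latent audio canvas
    ],
    "VAEDecodeAudio": ["KSampler", "VAELoader"],
    "SaveAudioMP3": ["VAEDecodeAudio"],
}

# One valid execution order: every node appears after the nodes it depends on.
order = list(TopologicalSorter(graph).static_order())
for step, node in enumerate(order, start=1):
    print(f"{step:2d}. {node}")
```

Whatever order the sorter picks for the independent loaders, KSampler always runs after its four inputs and SaveAudioMP3 always runs last.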

CFG (classifier-free guidance) lets you balance creativity and prompt adherence: lower CFG values explore more broadly, while higher values follow your text more strictly. If you don't specify a negative prompt, ConditioningZeroOut provides a clean unconditional branch, which keeps guidance stable. Workflow groups (Model, Duration, Prompt) make it easy to swap checkpoints, set the duration in seconds, and iterate on prompts. The result is a practical, reproducible text-to-music pipeline for quickly iterating across genres, moods, instruments, and structure, which makes it well suited to ideation, temp scoring, and background tracks.
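To make the CFG trade-off concrete, here is a minimal sketch of the standard classifier-free guidance combination, applied to plain Python lists. The function name `cfg_combine` and the toy numbers are hypothetical, and this is the textbook formula rather than a copy of what KSampler does internally.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    cond   -- model prediction using the prompt conditioning
    uncond -- model prediction using the unconditional (e.g. zeroed) conditioning
    scale  -- the CFG value; 1.0 reproduces the conditional prediction
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions, for illustration only.
cond = [1.0, 2.0]
uncond = [0.0, 0.0]

print(cfg_combine(cond, uncond, 1.0))  # same as the conditional prediction
print(cfg_combine(cond, uncond, 2.0))  # pushed further away from uncond
```

With a scale of 1.0 the result equals the conditional prediction; larger scales push the result further from the unconditional prediction, which is why high CFG values track the prompt more strictly.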