This ComfyUI workflow, built around the LTX-2.3 video model, turns a single reference image into a short, coherent video clip with controllable motion and style. It uses your uploaded first frame (LoadImage in the First Frame group) as a strong visual anchor, optionally blends in an ending reference from the Last Frame group, and lets you steer the scene with natural-language prompts. In the Conditioning group, LTXAVTextEncoderLoader loads the text encoder and CLIPTextEncode turns your prompts into conditioning, while LTXVAddGuide fuses your reference frame(s) into the model’s conditioning stream.
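To make the first-frame and optional last-frame anchoring concrete, here is a minimal Python sketch of how such guides might be laid out on a clip's timeline. It is not ComfyUI code: `Guide`, `resolve_index`, `plan_guides`, the `strength` values, and the convention that a negative index counts back from the end of the clip are all illustrative assumptions, not the workflow's actual node settings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Guide:
    """Hypothetical record for one reference image pinned to a frame."""
    image: str        # file name of the reference image
    frame_idx: int    # resolved, zero-based frame position
    strength: float   # assumed 0..1 weight of the guide


def resolve_index(frame_idx: int, num_frames: int) -> int:
    """Map a possibly negative index onto the range [0, num_frames - 1]."""
    resolved = frame_idx + num_frames if frame_idx < 0 else frame_idx
    if not 0 <= resolved < num_frames:
        raise ValueError(
            f"frame index {frame_idx} is outside a {num_frames}-frame clip"
        )
    return resolved


def plan_guides(first: str, num_frames: int,
                last: Optional[str] = None) -> List[Guide]:
    """Pin the first image to frame 0 and, if given, the last image to the final frame."""
    guides = [Guide(first, resolve_index(0, num_frames), 1.0)]
    if last is not None:
        guides.append(Guide(last, resolve_index(-1, num_frames), 1.0))
    return guides
```

With only a first image the plan holds a single guide at frame 0; supplying a last image adds a second guide at the final frame, which mirrors the optional Last Frame group described above.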
Under the hood, EmptyLTXVLatentVideo creates the video latent from your chosen resolution, frame count, and FPS. RandomNoise seeds the generation, and sampling runs through SamplerEulerAncestral, guided by CFGGuider and a ManualSigmas schedule that fixes the noise level of each step for stable, consistent motion. LTXVConcatAVLatent pairs the video latent with an audio latent (LTXVEmptyLatentAudio by default, which yields a silent track), so the saved file carries both a video and an audio stream. VAEDecodeTiled decodes the final frames tile by tile to keep memory use down, and SaveVideo writes the result to disk. Among the utility nodes, GetImageSize reads the reference image’s dimensions, LTXVCropGuides strips the guide frames back out of the latent once sampling is done, and LTXVPreprocess prepares your input image in the form the model expects. The result is a practical pipeline for turning a single image into a smooth, on-style video clip.
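As a rough illustration of the sizing choices above, the following Python sketch snaps a requested resolution and frame count to values a video latent can represent and reports the resulting clip duration. The constants are assumptions carried over from earlier LTX-Video releases (width and height in multiples of 32, frame counts of the form 8n + 1); check the LTX-2.3 documentation before relying on them. The function names are hypothetical and do not correspond to any ComfyUI node.

```python
from typing import Tuple

SPATIAL_STEP = 32   # assumed: width and height must be multiples of 32
TEMPORAL_STEP = 8   # assumed: frame count must have the form 8 * n + 1


def snap_resolution(width: int, height: int) -> Tuple[int, int]:
    """Round each side down to the nearest multiple of SPATIAL_STEP."""
    def snap(value: int) -> int:
        return max(SPATIAL_STEP, (value // SPATIAL_STEP) * SPATIAL_STEP)
    return snap(width), snap(height)


def snap_frame_count(frames: int) -> int:
    """Round down to the nearest count of the form 8 * n + 1 (minimum 1)."""
    return max(1, ((frames - 1) // TEMPORAL_STEP) * TEMPORAL_STEP + 1)


def clip_seconds(frames: int, fps: float) -> float:
    """Playback duration of the clip in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps
```

Under these assumptions, a 1280x720 request at 100 frames would become 1280x704 at 97 frames, which plays for roughly 4 seconds at 24 FPS.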