
This ComfyUI workflow builds personalized, lip-synced video clips by conditioning LTX-2.3 on three inputs: a reference image (appearance), a short audio clip (timing and vocal style), and a structured prompt (scene, transcript, and sound context). The core logic is encapsulated in a subgraph (the custom node with ID 98ee9e5b-467b-40aa-a534-36033f27d0b4), which orchestrates LTX-2.3 generation with ID-LoRA identity conditioning. LoadImage provides the face/pose reference, LoadAudio or RecordAudio supplies your speech, and SaveVideo writes the final video. A MarkdownNote in the canvas summarizes the structured prompt format and links to the LTX-2.3 and ID-LoRA resources.
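The three-tag prompt can be assembled programmatically before it is pasted into the subgraph's text input. This is a minimal sketch: the tag names come from the workflow description above, but the exact whitespace and ordering rules are an assumption, so check the MarkdownNote on the canvas for the authoritative format.

```python
def build_prompt(visual: str, speech: str, sounds: str) -> str:
    """Assemble the structured prompt the subgraph expects.

    visual -- scene and appearance description
    speech -- the exact words the subject should say
    sounds -- speaker style and ambient audio cues

    Tag layout (single line, space-separated) is an assumption;
    verify against the MarkdownNote in the workflow canvas.
    """
    return f"[VISUAL] {visual} [SPEECH] {speech} [SOUNDS] {sounds}"
```

Keeping [SPEECH] limited to the literal transcript of the audio clip matters most, since the subgraph uses it alongside the waveform for lip articulation.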

Technically, ID-LoRA injects a lightweight identity adapter into the LTX-2.3 video diffusion process so the generated person matches the reference image without a full fine-tune. The prompt is split into three tags the subgraph expects: [VISUAL] (scene and appearance), [SPEECH] (the exact words to say), and [SOUNDS] (speaker style and ambient cues). The audio drives mouth shapes and timing, while [SPEECH] helps the model resolve phoneme-level detail for clearer lip articulation. The result is a short video of your subject speaking the provided line, visually aligned to the reference image and temporally synced to the audio.
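The adapter mechanism follows the standard LoRA pattern: a frozen base weight plus a trainable low-rank update, y = Wx + (α/r)·B(Ax). Below is a minimal numpy sketch of that generic forward pass; the actual ID-LoRA rank, scaling, and target layers inside LTX-2.3 are not specified by the workflow, so the numbers here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # feature width, LoRA rank (r << d keeps the adapter light)
alpha = 1.0                        # LoRA scaling factor

W = rng.normal(size=(d, d))        # frozen base weight (stands in for an LTX-2.3 layer)
A = rng.normal(size=(r, d))        # trainable down-projection, random init
B = np.zeros((d, r))               # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Base path plus low-rank adapter path, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
y = lora_forward(x)
```

Because B starts at zero, the adapter is a no-op before training, so generation quality is unchanged until the identity weights are learned; only A and B (2·r·d parameters per layer) are trained, which is why no full fine-tune of the video model is needed.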