
This workflow demonstrates how to use Qwen3.5 inside ComfyUI to analyze an image and generate descriptive text that doubles as ready-to-use prompts. It performs both image captioning and reverse prompt engineering: you provide an image via LoadImage, and the TextGenerate node, powered by Qwen3.5, returns structured descriptions or prompt candidates you can paste directly into your image-generation pipelines.
Technically, the CLIPLoader node is pointed at the Qwen3.5 weights (qwen3.5_4b_bf16.safetensors) stored under models/text_encoders/. That model handle feeds into the TextGenerate node, which accepts the loaded image and an instruction prompt (for example, "Produce 3 concise, diffusion-ready prompts"). The node then runs inference and returns text, which you can view with PreviewAny. A MarkdownNote in the graph provides inline guidance and prompt tips, making it easy to iterate on instruction wording, temperature, and token length to dial in results.
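The node wiring described above can be sketched in ComfyUI's API (prompt) JSON format, which represents the graph as a dict of numbered nodes whose inputs either hold literal values or reference another node's output as `[node_id, output_index]`. This is a minimal sketch, not the exact exported workflow: the node IDs, the `TextGenerate` input field names, and the CLIPLoader `type` value are assumptions for illustration.

```python
import json

def build_workflow(image_name: str, instruction: str) -> dict:
    """Assemble the captioning graph in ComfyUI's API JSON format.

    Node class names follow the workflow described above; the exact
    input field names of TextGenerate and the CLIPLoader "type" value
    are assumptions, not verified against the node definitions.
    """
    return {
        "1": {
            "class_type": "CLIPLoader",
            # Points at the Qwen3.5 weights under models/text_encoders/
            "inputs": {
                "clip_name": "qwen3.5_4b_bf16.safetensors",
                "type": "qwen",  # assumed value for this loader
            },
        },
        "2": {
            "class_type": "LoadImage",
            "inputs": {"image": image_name},
        },
        "3": {
            "class_type": "TextGenerate",
            "inputs": {
                "clip": ["1", 0],    # model handle from CLIPLoader
                "image": ["2", 0],   # pixels from LoadImage
                "prompt": instruction,
            },
        },
        "4": {
            "class_type": "PreviewAny",
            "inputs": {"source": ["3", 0]},  # display the generated text
        },
    }

workflow = build_workflow(
    "photo.png",
    "Produce 3 concise, diffusion-ready prompts",
)
payload = json.dumps({"prompt": workflow})
# POST this payload to a running ComfyUI server's /prompt endpoint
# (default http://127.0.0.1:8188/prompt) to queue the graph.
```

Building the graph programmatically like this makes it easy to sweep instruction wording or sampling settings from a script instead of re-editing the canvas by hand.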