Alibaba’s Wan 2.2 model is a solid step up for local video generation in ComfyUI. It handles temporal coherence better than Wan 2.1, with less flicker between frames, and reads motion prompts with more nuance.
This guide walks you through the complete image-to-video pipeline in ComfyUI, breaks down the hardware trade-offs between model sizes, and shows the motion prompt techniques that produce fluid motion.
Wan 2.2 vs. Wan 2.1: What Improved
| Aspect | Wan 2.1 | Wan 2.2 |
|---|---|---|
| Temporal coherence | Noticeable flicker between frames | Smooth transitions, minimal jitter |
| Motion prompt understanding | Basic directional prompts | Complex, layered descriptions work reliably |
| Frame artifacts | Color banding at object edges | Reduced distortion and color shifts |
| Node architecture | WanVideoModelLoader, WanVideoSampler, etc. | Same nodes, improved model weights |
| Migration effort | N/A | Drop-in replacement; no workflow rebuild needed |
The node names stay the same: WanVideoModelLoader, WanVideoTextEncode, WanVideoImageEncode, WanVideoSampler, and WanVideoVAEDecode. Upgrading from Wan 2.1? Just download the new model weights, drop them in your models folder, and select the new file in WanVideoModelLoader. Your existing graphs don’t need touching.
👉 Quick takeaway: Wan 2.2 is a direct upgrade with better temporal smoothness and prompt interpretation. If you already use Wan 2.1, migration is seamless—just swap the model file.
Model Sizes: 1.3B vs. 14B
The Wan 2.2 ComfyUI implementation comes in two flavors. Your choice hinges on GPU memory, how much time you’re willing to wait, and what quality threshold you need.
| Aspect | 1.3B Model | 14B Model |
|---|---|---|
| File size | 2.7GB | ~30GB (single merged file) |
| VRAM required (no offload) | 12GB | 24GB |
| VRAM with sequential_cpu_offload | N/A | 14–16GB |
| Render time (33 frames, 16fps) | 2–4 minutes | 3–6 minutes (no offload) |
| Render time with offload | N/A | 8–15 minutes |
| Motion quality | Good, straightforward scenes | Noticeably more fluid, detailed motion |
| Prompt interpretation | Reliable for simple/medium prompts | Handles complex, multi-layered descriptions |
| Best for | RTX 3060/4060 Ti, rapid iteration | RTX 4090, quality-first workflows |
The 1.3B Model: Speed and Accessibility
Grab this if you’re running 12GB VRAM or less, or if you need to test motion prompts without waiting around. Quality is solid for the size: it handles basic camera movements, walking, object motion, and simple scene dynamics reliably. The 2–4 minute render window makes it practical to test three or four different prompts in a single session.
The catch: complex, multi-layered prompts (like “camera tracks smoothly while the subject walks and wind blows leaves across the frame”) tend to simplify into less nuanced motion. For straightforward effects and rapid testing, it’s excellent.
💡 Tip: Use the 1.3B model to validate motion prompts at 480p before scaling up to higher resolutions. A quick test here saves you 10+ minutes on the 14B model if the prompt doesn’t work.
The 14B Model: Quality and Complexity
This is the full-featured version. Motion is noticeably more fluid, intricate prompts land better, and spatial coherence holds up over longer sequences. The trade-off is real: you need either 24GB VRAM without offload or 14–16GB with sequential_cpu_offload turned on.
With offload enabled, the model shuffles data between GPU and CPU as needed. Render time climbs to 8–15 minutes for a 33-frame sequence, but it becomes doable on 16–20GB systems. Without offload and with 24GB VRAM, you’re looking at 3–6 minute renders, which is practical for actual production work.
👉 Quick takeaway: The 14B model delivers superior motion fluidity and prompt interpretation. Enable sequential_cpu_offload if you have 14–20GB VRAM; skip it if you have 24GB+.
Installation and Setup
Step 1: Install the ComfyUI-WanVideo Custom Node
- Open ComfyUI Manager in your ComfyUI interface.
- Search for “WanVideo” in the custom nodes browser.
- Click Install on the ComfyUI-WanVideo node pack.
- Restart ComfyUI.
New nodes appear under the Video category. You’ll see WanVideoModelLoader, WanVideoTextEncode, WanVideoImageEncode, WanVideoSampler, WanVideoVAEDecode, and related utilities.
Step 2: Download and Place Model Weights
- Search Hugging Face for the official Alibaba Wan 2.2 repository and confirm you’re on the verified org page before downloading anything.
- Download the `.safetensors` file for your chosen model size: the 1.3B variant is around 2.7GB, the 14B variant is a single merged file around 30GB. Exact filenames vary by upload, so match against the file sizes and VRAM figures in the table above rather than a specific filename.
- Drop the file directly in `ComfyUI/models/diffusion_models/` (no nested folders).
- ComfyUI auto-detects it on the next startup.
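To confirm the weights landed where the loader looks, a quick check like the one below can help. It is a minimal sketch: `find_model_files` is a hypothetical helper, not part of ComfyUI, and it assumes the install root is a folder named `ComfyUI` relative to where you run it.

```python
from pathlib import Path


def find_model_files(comfy_root: Path) -> list[Path]:
    """List .safetensors files placed directly in models/diffusion_models/.

    Files in nested subfolders are deliberately ignored, mirroring the
    "no nested folders" advice in this guide.
    """
    models_dir = comfy_root / "models" / "diffusion_models"
    if not models_dir.is_dir():
        return []
    return sorted(
        p for p in models_dir.iterdir()
        if p.is_file() and p.suffix == ".safetensors"
    )


if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; change it to your own path.
    for path in find_model_files(Path("ComfyUI")):
        size_gb = path.stat().st_size / 1024**3
        print(f"{path.name}: {size_gb:.1f} GB")
```

If the list comes back empty, the file is probably in a subfolder or a different models directory.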
Building the Image-to-Video Workflow
Here’s the complete node graph for an image to video ComfyUI pipeline:
Node 1: Load Image
- Input your starting frame (PNG, JPG, or WebP).
- Resolution requirements:
- 1.3B model: 480×832 (vertical) or 832×480 (horizontal)
- 14B model: up to 720×1280
- All dimensions must be divisible by 16 (e.g., 480, 496, 512, 528 are valid; 500 is not).
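The divisible-by-16 rule is easy to check before loading an image. The sketch below is illustrative only; `is_valid_dimension` and `snap_to_multiple` are hypothetical helper names, and rounding to the nearest multiple is just one reasonable choice (you could also always round down).

```python
def is_valid_dimension(value: int, multiple: int = 16) -> bool:
    """Return True if value is a positive multiple of `multiple`."""
    return value > 0 and value % multiple == 0


def snap_to_multiple(value: int, multiple: int = 16) -> int:
    """Round value to the nearest multiple, never going below one multiple."""
    if value <= 0:
        raise ValueError("value must be positive")
    snapped = ((value + multiple // 2) // multiple) * multiple
    return max(multiple, snapped)


print(is_valid_dimension(500))  # False
print(snap_to_multiple(500))    # 496
```

Resize or crop the source image to the snapped dimensions before it reaches the Load Image node.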
Node 2: WanVideoModelLoader
model_name: [select your .safetensors file]
sequential_cpu_offload: [enabled if VRAM < 24GB, disabled for 1.3B or if you have 24GB+]
- Using the 14B model on 16–20GB VRAM? Enable `sequential_cpu_offload`.
- Leave it off for the 1.3B model or if you have 24GB+ VRAM.
Node 3: WanVideoImageEncode
- Connect the Load Image output here.
- This node preps your starting frame as the video’s anchor point. No parameters to tweak.
Node 4: WanVideoTextEncode
The motion prompt goes here. Write it in English; the model was trained primarily on English descriptions.
Prompts that actually work:
- "the person walks slowly forward, camera pans right"
- "ocean waves crash gently on the shore, soft foam movement"
- "clouds drift slowly across the sky, wind-blown motion"
- "the camera follows from behind as the person walks forward, smooth tracking shot"
The key rule: Specify both subject motion and camera motion. Generic stuff like "movement" or "action" produces either static frames or incoherent results. Specific, directional language is what gets you fluid motion.
Wan 2.2 doesn’t use a motion_bucket_id parameter (that’s Stable Video Diffusion territory). Motion intensity comes from the prompt wording and sampler settings.
Node 5: WanVideoSampler
This is where Wan 2.2 video generation happens. The settings that matter:
- num_frames: 33 (sweet spot: 25–65). More frames aren’t automatically better; 33 balances quality and speed nicely.
- fps: 16–24
- 16 fps: slow motion, cinematic feel
- 20–24 fps: standard action, smooth playback
- 33 frames at 16 fps = ~2 seconds of video
- steps: 20–30 for quality/speed balance. Go to 40+ only if you’re chasing maximum quality.
- seed: Fixed value for reproducibility, or -1 for random variation.
- cfg_scale: 7–9 works well. Push to 12 if you need stronger prompt adherence; avoid anything above 13 (causes artifacts).
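The frame and fps arithmetic, plus the ranges suggested above, can be sanity-checked in a few lines. This is a sketch under the assumption that the limits in this guide are the ones you want to enforce; `clip_duration_seconds` and `check_sampler_settings` are hypothetical helpers, not part of any node pack.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Length of the exported clip: frames divided by frames per second."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


def check_sampler_settings(
    num_frames: int, fps: int, steps: int, cfg_scale: float
) -> list[str]:
    """Return notes for settings outside the ranges suggested in this guide."""
    notes = []
    if not 25 <= num_frames <= 65:
        notes.append("num_frames outside the 25-65 sweet spot")
    if not 16 <= fps <= 24:
        notes.append("fps outside the 16-24 range")
    if steps < 20:
        notes.append("steps below 20 may reduce quality")
    if cfg_scale > 13:
        notes.append("cfg_scale above 13 tends to cause artifacts")
    return notes


print(clip_duration_seconds(33, 16))  # 2.0625
print(check_sampler_settings(num_frames=16, fps=16, steps=25, cfg_scale=14))
```

An empty list means the settings sit inside the suggested ranges; it says nothing about whether the prompt itself will work.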
Node 6: WanVideoVAEDecode
- No parameters. Converts the latent frames from the sampler into actual pixel data.
- Connect directly from `WanVideoSampler`.
Node 7: VHS_VideoCombine
Exports your final video:
- format: MP4 (recommended) or WebM
- fps: Must match the sampler’s fps setting
- quality: 95 for maximum quality
📌 Keep in mind: The Wan 2.2 I2V workflow is linear and straightforward: Load Image → Encode Image → Encode Motion Prompt → Sample → Decode → Export. Each node does one thing well.
Motion Prompt Techniques That Work
Motion prompts are what separate a static frame from genuinely fluid movement. Here’s what actually makes a difference:
Be Directional
Instead of: "the person moves"
Try: "the person walks forward slowly, camera follows from behind"
Direction words (forward, backward, left, right, up, down, toward, away) trigger coherent motion. Vague terms just produce random jitter.
Combine Subject and Camera
The best prompts describe both what moves in the scene and how the camera moves:
- "the car drives down the road, camera pans left to follow"
- "the dancer spins, camera circles around them"
- "the waves crash, camera slowly zooms in on the foam"
Use Speed Modifiers
Words like “slowly,” “gently,” “rapidly,” “fast,” and “dynamic” shift motion intensity:
- Slow prompts → subtle, controlled movement
- Fast/rapid prompts → more aggressive frame-to-frame change
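If you test many prompts, a small helper can enforce the subject-plus-camera habit and flag prompts that lack direction or speed words. This is a rough sketch: the function names are hypothetical, the word lists are simply the ones quoted in this section, and a keyword match is only a crude proxy for how the model reads a prompt.

```python
import re

# Word lists taken from this section; extend them to suit your own prompts.
DIRECTION_WORDS = {"forward", "backward", "left", "right", "up", "down", "toward", "away"}
SPEED_WORDS = {"slowly", "gently", "rapidly", "fast", "dynamic"}


def _words(prompt: str) -> set[str]:
    """Lowercase the prompt and split it into alphabetic words."""
    return set(re.findall(r"[a-z]+", prompt.lower()))


def has_direction(prompt: str) -> bool:
    return bool(_words(prompt) & DIRECTION_WORDS)


def has_speed_modifier(prompt: str) -> bool:
    return bool(_words(prompt) & SPEED_WORDS)


def build_motion_prompt(subject_motion: str, camera_motion: str) -> str:
    """Join a subject-motion clause and a camera-motion clause with a comma."""
    parts = [p.strip().rstrip(",.") for p in (subject_motion, camera_motion)]
    if not all(parts):
        raise ValueError("both subject and camera motion are required")
    return ", ".join(parts)


prompt = build_motion_prompt("the person walks forward slowly", "camera follows from behind")
print(prompt)
print(has_direction(prompt), has_speed_modifier(prompt))
```

A prompt that fails both checks, such as "the person moves", is a good candidate for a rewrite before you spend render time on it.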
Test Early, Scale Late
Always start with 33 frames at 480p to validate your motion prompt before jumping to higher resolutions or longer sequences. A failed test at 720p wastes far more time than a quick validation at lower resolution.
⚠️ Important: This single practice—testing at low resolution first—saves hours of wasted renders.
Troubleshooting Common Issues
Out of Memory with the 14B Model
Solution: Enable sequential_cpu_offload in WanVideoModelLoader. Renders take 8–15 minutes for 33 frames, but the model becomes usable on 14–16GB VRAM.
Video Looks Too Static
Cause: Motion prompt is too generic or contradictory.
Fix:
- Rewrite with specific direction and speed: "rapidly spinning, dynamic camera rotation" instead of "moving".
- Check that `num_frames` is at least 33. With 16 frames, motion is barely visible.
- Bump `cfg_scale` to 9–10 to strengthen prompt adherence.
Flickering or Color Banding
Cause: Model struggling with the prompt or resolution.
Fix:
- Reduce `num_frames` to 25–30.
- Lower `cfg_scale` to 7.
- Simplify the motion prompt.
Slow Renders on the 14B Model
Expected behavior with sequential_cpu_offload enabled. If you need speed, either:
- Upgrade to 24GB+ VRAM
- Use the 1.3B model
- Drop `steps` to 20 (quality loss is minimal)
GPU Strategy by VRAM
- 12GB VRAM: Use the 1.3B model without offload. Expect 2–4 minutes per 33-frame video.
- 16–20GB VRAM: Use the 14B model with `sequential_cpu_offload` enabled. Roughly 8–12 minutes per video; best quality/speed balance for your hardware.
- 24GB+ VRAM: Use the 14B model without offload. Roughly 3–6 minutes per video; maximum speed and quality.
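The tiers above reduce to a simple lookup. The sketch below encodes them as written in this guide; `Recommendation` and `recommend` are hypothetical names, the render-time ranges come from the model-size table, and the 16GB and 24GB cut-offs are this guide's rules of thumb rather than hard limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    model: str
    sequential_cpu_offload: bool
    est_minutes: tuple[int, int]  # rough range for a 33-frame render


def recommend(vram_gb: float) -> Recommendation:
    """Map available VRAM to the model choice suggested in this guide."""
    if vram_gb >= 24:
        return Recommendation("14B", False, (3, 6))
    if vram_gb >= 16:
        # Offload trades render time for a smaller VRAM footprint.
        return Recommendation("14B", True, (8, 15))
    # The guide lists 12GB as the requirement for the 1.3B model.
    return Recommendation("1.3B", False, (2, 4))


for vram in (12, 16, 24):
    print(vram, recommend(vram))
```

Treat the output as a starting point; actual headroom depends on resolution, frame count, and whatever else is using the GPU.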
Don’t have the VRAM locally? Cloud GPU rentals (Vast.ai, RunPod) are realistic here — the 14B model is one of the most VRAM-intensive workloads covered on this site, so renting a 24GB card for a single test session is a cheap way to try it before committing to hardware. Check current hourly rates on either platform.
FAQ
Q: What’s the difference between Wan 2.1 and Wan 2.2 for image-to-video?
A: Wan 2.2 improves temporal coherence (less flicker between frames) and motion-prompt understanding. The nodes stay the same (WanVideoModelLoader, etc.) but the model weights differ. Already running Wan 2.1? You can keep using it; 2.2 is an incremental upgrade, not a complete architecture overhaul.
Q: How many frames should I generate to start?
A: Always start with 33 frames at 480p (832×480, so both dimensions stay divisible by 16). This validates your motion prompt and overall behavior in under 5 minutes. Only scale up to 49+ frames and higher resolution once the motion is working correctly. Change one parameter at a time.
Q: Can I animate images generated with Flux or SDXL?
A: Yes. Wan 2.2 accepts any image as input regardless of how it was generated. The input image defines the first frame; the motion prompt describes how it should move. Images with clear composition and a simple background perform better.
Q: Why is the generated video frozen or barely moving?
A: The motion prompt is probably too generic or contradictory. Be specific: instead of ‘moving’, write ‘the person raises their right hand slowly, camera stays fixed’. Also verify num_frames is at least 33—with 16 frames, motion is barely perceptible.
Keep Reading
Don’t have 24GB of local VRAM for the 14B model? See our RunPod vs Vast.ai cloud GPU guide for renting one by the hour. And if you’re running the 1.3B model on limited VRAM, our complete guide to reducing VRAM usage covers offloading techniques that apply to video workflows too.
🏆 Our recommendation
If you have 12GB VRAM or less, use the 1.3B model for fast iteration and accessible performance. If you have 16–20GB VRAM, use the 14B model with sequential_cpu_offload enabled for the best quality-to-speed balance. If you have 24GB+ VRAM, use the 14B model without offload to maximize quality and minimize render time. Start every workflow at 480p and 33 frames, validate your motion prompt, then scale up once you’re confident in the result.