ComfyLab
How to Reduce VRAM Usage in ComfyUI (Low VRAM / 6GB-8GB Guide)


4GB VRAM · VRAM · Beginner · 11 min · Any model
Savien

Running ComfyUI on a budget GPU is entirely possible, but only if you know which optimizations actually work. Most users discover this the hard way: they launch ComfyUI with an unoptimized SDXL model on a 6GB card and watch it crash within seconds. If that’s you, start with our CUDA out of memory fix guide for the immediate rescue, then come back here for the full picture. Modern diffusion models are memory-hungry: SDXL alone demands 7–8GB, Flux Dev reaches 24GB, and loading a VAE decoder on top of that can trigger out-of-memory errors even on hardware that looks sufficient on paper.

The good news: there are eight concrete techniques to reduce VRAM usage in ComfyUI, ranging from trivial (one-line flag) to advanced (model quantization). Each trades speed for memory in different ways. This guide walks through every option, shows real VRAM savings with numbers, and gives you a clear roadmap based on your specific GPU tier.

At a Glance: VRAM Reduction Techniques

| Technique | VRAM Saved | Speed Impact | Difficulty | Best For |
| --- | --- | --- | --- | --- |
| --lowvram flag | 40–50% | 20–50% slower | Trivial | 6–8GB GPUs, default choice |
| GGUF quantization | 50–73% | 10–20% slower | Easy | Large models (Flux, Wan) |
| VAE tiling | 40–60% (decode only) | 5–10% slower | Easy | 1024×1024+ resolutions |
| --fp16-vae | 1–1.5GB | Negligible | Trivial | Quick extra savings |
| Resolution reduction | Up to 75% | None | Trivial | Invisible optimization |
| --novram flag | 80%+ | 500–1000% slower | Trivial | 4GB GPUs, last resort |
| Attention slicing | 15–25% | 30–50% slower | Trivial | Extreme constraints |
| Sequential CPU offload | Enables 14B models | 20–40% slower | Moderate | Video models (Wan, Hunyuan) |

Understanding ComfyUI’s Memory Footprint

ComfyUI loads three major components into VRAM simultaneously: the diffusion model (the core network), the VAE (encoder/decoder for image conversion), and any control models like ControlNets or IP-Adapters. They don’t unload between operations—they sit in memory waiting for the next generation.

Here’s what unoptimized models consume:

  • SDXL: 6.9–7GB (full precision, 1024×1024)
  • Flux Dev: 24GB (full precision, 1024×1024)
  • SD 1.5: 4–4.5GB (full precision, 512×512)
  • Wan (14B video model): 30GB+ (full precision)

Memory usage scales mainly with two factors: model architecture size and image resolution (quadratically with side length). The number of generation steps affects how long a run takes rather than how much VRAM it needs. A 1024×1024 image uses roughly 4× the VRAM of a 512×512 image. Add LoRAs, ControlNets, or IP-Adapters and you’re stacking additional overhead on top.

💡 Tip: The VAE decode step is the overlooked culprit. During decoding, VRAM usage can spike 40–60% above the model’s baseline, especially at high resolutions. Many users think their GPU is too weak when really the bottleneck is the VAE decoder, not the diffusion model itself.

👉 Quick takeaway: ComfyUI keeps models in VRAM between generations, and the VAE decoder is often the real memory bottleneck—not the diffusion model itself.


Technique 1: The --lowvram Flag (Best Speed/Memory Balance)

The --lowvram flag is the single easiest win for users with 6–8GB GPUs. It keeps model components in system RAM during generation and moves them into VRAM only when they are actively needed. Think of it as letting VRAM overflow into system RAM.

Real-world impact: An SDXL model that normally consumes 7GB drops to 3.5–4.5GB. Generation time increases by 20–50% depending on your CPU and system RAM speed.

How to enable:

On Windows, edit run_nvidia_gpu.bat and change:

python main.py

to:

python main.py --lowvram

On Linux/Mac, run:

python main.py --lowvram

Using ComfyUI Desktop? Navigate to Settings > Performance > VRAM Options and check the “Reduce VRAM” box—no command-line editing required.

When to use: This is the default choice for anyone with 6GB or 8GB cards running SDXL or smaller models. The speed penalty is acceptable for most workflows.

👉 Quick takeaway: --lowvram is the easiest optimization for 6–8GB GPUs, cutting VRAM usage by 40–50% with a manageable 20–50% slowdown.


Technique 2: The --novram Flag (Extreme Last Resort)

If --lowvram isn’t enough and you’re stuck with 4GB VRAM, --novram moves the entire model to system RAM except during active computation. VRAM usage drops to 1–2GB.

The trade-off is brutal: generation becomes 5–10× slower. A 512×512 image can take 10–30 minutes. Only viable if you have abundant system RAM (16GB+) and patience.

Enable with:

python main.py --novram

When to use: Only if GGUF quantization (see below) isn’t available for your model, or as a temporary workaround while you source a quantized version.


Technique 3: GGUF Quantization (The Game-Changer for Large Models)

GGUF quantization is the only practical way to run Flux Dev or Wan on 4–8GB GPUs. It converts full-precision models to compressed versions with minimal quality loss. The compression is aggressive but the visual impact is surprisingly small.

Real VRAM savings:

| Model | Original | Q4_K_M | Reduction |
| --- | --- | --- | --- |
| Flux Dev | 24GB | 7GB | 71% |
| Juggernaut XL | 7GB | 4GB | 43% |
| SDXL | 6.9GB | 2.5GB | 64% |
| Wan (14B) | 30GB | 8–10GB | 67–73% |

How to set up:

  1. Install the ComfyUI-GGUF custom node. Clone it into your custom_nodes folder:

    cd custom_nodes
    git clone https://github.com/city96/ComfyUI-GGUF
  2. Restart ComfyUI.

  3. Download a GGUF/GGML quantized model from HuggingFace. Search for “Flux GGUF” or “SDXL GGUF” to find pre-quantized versions.

  4. In your workflow, replace the standard model loader node with the GGUF loader the node pack provides (listed as Unet Loader (GGUF) in current versions of ComfyUI-GGUF) and point it to your GGUF file.

Quantization levels explained:

  • Q8_0: 95% of original size, identical quality, normal speed. Use for critical production work.
  • Q5_K_M: 60% of original size, practically identical quality. Best overall balance.
  • Q4_K_M: 50% of original size, slight detail loss in fine textures. Recommended for 4–8GB GPUs.
  • Q3: 35% of original size, noticeable quality loss. Only for extreme memory constraints.

📌 Keep in mind: Q4_K_M is the sweet spot for most users: it saves enough VRAM to make large models feasible while keeping quality degradation imperceptible in most use cases.
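To turn the levels above into a quick planning aid, here is a minimal Python sketch. It assumes the approximate size ratios quoted in the list; the function names are hypothetical and nothing here is part of ComfyUI or ComfyUI-GGUF. The savings table earlier in this section reports smaller files for some models than these ratios imply, so treat the estimates as conservative.

```python
# Approximate size ratios quoted in this guide (fraction of the original size).
# These are rough planning figures, not measurements.
SIZE_RATIO = {
    "Q8_0": 0.95,
    "Q5_K_M": 0.60,
    "Q4_K_M": 0.50,
    "Q3": 0.35,
}


def estimate_size_gb(original_gb, level):
    """Estimated quantized size in GB for the given level."""
    return original_gb * SIZE_RATIO[level]


def best_level_for_budget(original_gb, budget_gb):
    """Return the highest-quality level whose estimate fits the budget, or None."""
    # The dict is ordered from highest to lowest quality.
    for level in SIZE_RATIO:
        if estimate_size_gb(original_gb, level) <= budget_gb:
            return level
    return None


# Example: a 7GB SDXL-class checkpoint with about 4GB to spare.
print(best_level_for_budget(7, 4))
```

If the function returns None, no level fits by these ratios alone, and the offloading flags covered elsewhere in this guide would have to make up the difference.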

👉 Quick takeaway: GGUF quantization cuts model size by 50–73%, making Flux and other large models viable on 4–8GB GPUs with minimal quality loss at Q4_K_M or Q5_K_M settings.


Technique 4: VAE Tiling (Mandatory for 1024×1024+)

The VAE decoder is often the real VRAM bottleneck. At high resolutions, the decode step can spike memory usage 40–60% above the model baseline, causing crashes even when other optimizations are active.

VAE tiling splits the image into smaller tiles, decodes each separately, and reassembles them. The quality impact is zero—the tiles are invisible in the final output.

How to enable:

  1. Install the TileVAE node via ComfyUI Manager (search “TileVAE”).

  2. Place the TileVAE node between your VAE Loader and VAE Decode node.

  3. Set tile_size:

    • 8GB GPU: 512
    • 6GB GPU: 384
    • 4GB GPU: 256

Real impact: Reduces peak VRAM during decode by 40–60%. Essentially mandatory for 1024×1024 images on 8GB cards.

Speed penalty: Negligible (5–10% slower decode, not worth worrying about).
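To see why smaller tiles lower the decode peak, the sketch below splits an image into a plain grid and reports how much of the image the largest tile covers. It is a geometric illustration only: the function names are hypothetical, real tiled decoders typically overlap and blend tiles, and actual VRAM does not scale exactly with tile area.

```python
def tile_grid(width, height, tile_size):
    """Split an image into non-overlapping tiles; returns (x, y, w, h) boxes."""
    boxes = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            w = min(tile_size, width - x)   # edge tiles may be narrower
            h = min(tile_size, height - y)  # edge tiles may be shorter
            boxes.append((x, y, w, h))
    return boxes


def peak_tile_fraction(width, height, tile_size):
    """Area of the largest tile as a fraction of the whole image."""
    boxes = tile_grid(width, height, tile_size)
    largest = max(w * h for _, _, w, h in boxes)
    return largest / (width * height)


for size in (512, 384, 256):
    tiles = tile_grid(1024, 1024, size)
    print(size, len(tiles), peak_tile_fraction(1024, 1024, size))
```

The decoder only has to hold one tile’s worth of work at a time, which is why a smaller tile_size suits a smaller card at the cost of more decode passes.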

👉 Quick takeaway: VAE tiling is non-negotiable for high-resolution generation (1024×1024+) and solves crashes that appear even when the main model fits in VRAM.


Technique 5: VAE Precision (--fp16-vae)

Running the VAE at half precision (FP16) instead of full precision (FP32) saves ~1–1.5GB with virtually no speed impact.

Enable with:

python main.py --fp16-vae

Or in ComfyUI Desktop: Settings > Performance > VRAM Options > fp16 VAE.

Caveat: Some older models don’t tolerate FP16 well and produce black images or artifacts. If that happens, switch back to --fp32-vae. Test with a single generation first.


Technique 6: Attention Slicing (Last Resort)

Attention slicing splits the attention computation into smaller steps, drastically reducing VRAM peaks. You’ll recover 15–25% VRAM but sacrifice 30–50% of your speed.

Enable with:

python main.py --use-split-cross-attention

⚠️ Important: This is one of the slower optimization options. Try --lowvram + VAE tiling + GGUF first. Only use attention slicing if nothing else works.
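The reason slicing lowers the peak can be shown with simple arithmetic. The attention score matrix has one row per query token and one column per key token, so processing the queries in chunks shrinks the largest matrix held at once. This is a back-of-the-envelope sketch: the function name, the token count, and the 2-byte value size are illustrative assumptions, not figures taken from ComfyUI.

```python
def attention_scores_bytes(tokens, heads=1, bytes_per_value=2, slices=1):
    """Peak bytes for the score matrix when queries are processed in `slices` chunks."""
    rows_per_slice = -(-tokens // slices)  # ceiling division
    return rows_per_slice * tokens * heads * bytes_per_value


full = attention_scores_bytes(4096)
sliced = attention_scores_bytes(4096, slices=4)
print(full // 2**20, "MiB unsliced")
print(sliced // 2**20, "MiB with 4 slices")
```

Each slice still has to be computed, so the total work is unchanged and the extra passes are where the speed penalty comes from.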


Technique 7: Resolution Reduction (The Invisible Optimization)

VRAM grows quadratically with resolution. Dropping from 1024×1024 to 768×768 saves ~43% VRAM with minimal visible quality loss.

SDXL VRAM by resolution:

| Resolution | VRAM Used | Savings vs 1024×1024 |
| --- | --- | --- |
| 512×512 | 2GB | 75% |
| 768×768 | 4.5GB | 43% |
| 1024×1024 | 8GB | Baseline |

For most use cases, 768×768 is indistinguishable from 1024×1024 unless you’re printing large-format. 512×512 is usable but noticeably lower detail.
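As a rough check on the figures above, the saving from a lower resolution can be estimated from pixel counts alone. The sketch assumes VRAM scales in direct proportion to pixel count, which is a simplification because the model weights take a fixed amount of memory regardless of resolution. The function name is hypothetical.

```python
def pixel_savings(width, height, base_width=1024, base_height=1024):
    """Fraction of pixels saved relative to the baseline resolution."""
    return 1 - (width * height) / (base_width * base_height)


for side in (768, 512):
    print(f"{side}x{side}: {pixel_savings(side, side):.1%} fewer pixels")
```

The results line up with the savings column in the table, which suggests pixel count is a reasonable first estimate before you test on your own hardware.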


Technique 8: Sequential CPU Offload (For Video Models)

Video models like Wan and Hunyuan are memory monsters. Sequential CPU offload moves model layers to CPU sequentially during generation, allowing 14B models to run on 8GB VRAM using 14–16GB system RAM.

Enable it in the ModelLoader node settings or add sequential_cpu_offload: true to your config.
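The idea behind sequential offload can be sketched as a toy simulation: if only one layer sits on the GPU at a time, peak VRAM is set by the largest layer instead of the whole model, while system RAM has to hold everything. The layer sizes and function names below are made up for illustration, and the sketch ignores activations and transfer overhead.

```python
def peak_vram_all_resident(layer_sizes_gb):
    """Peak VRAM when every layer stays on the GPU."""
    return sum(layer_sizes_gb)


def peak_vram_sequential(layer_sizes_gb):
    """Peak VRAM when layers visit the GPU one at a time."""
    peak = 0.0
    for size in layer_sizes_gb:
        resident = size             # move this layer to the GPU and run it
        peak = max(peak, resident)
        resident = 0.0              # move it back to system RAM before the next one
    return peak


layers = [2.0, 3.0, 1.0]  # hypothetical layer sizes in GB
print(peak_vram_all_resident(layers), peak_vram_sequential(layers))
```

The trade is the same one described above: less VRAM in exchange for system RAM and the time spent moving layers back and forth.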


If you’re hitting these limits constantly and considering new hardware instead of more optimization, see our best GPU for ComfyUI guide for what’s actually worth buying at each budget.

4GB GPU

  • Mandatory: GGUF Q4_K_M + --lowvram + VAE tiling
  • Max resolution: 512×512–768×768
  • Fallback: --novram if GGUF isn’t enough
  • Model choice: SDXL or smaller; Flux impossible without extreme compromises

6GB GPU

  • Default: --lowvram for medium models + VAE tiling above 768px
  • For large models: GGUF (Flux becomes viable)
  • Comfortable resolution: 768×768, 1024×1024 with optimizations
  • Model choice: SDXL native, Flux with GGUF

8GB GPU

  • Default: --lowvram only for Flux or other very large models
  • VAE tiling: Required for 1024×1024
  • Model choice: Native SDXL, unquantized Flux, GGUF for multi-model experimentation
  • Comfortable resolution: 1024×1024 without compromise

Combining Techniques

--lowvram and VAE tiling are not mutually exclusive—they target different bottlenecks. Use them together:

python main.py --lowvram
# Then add VAE tiling node in your workflow

This combination is the baseline for 6–8GB users. Add GGUF if you need to run Flux or other large models. Add --fp16-vae if you need to save roughly another 1GB.
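If you switch between setups often, the flag choices can be captured in a small helper. This is a sketch of one way to read the tier guidance in this guide: the function name and thresholds are assumptions, not ComfyUI defaults, and VAE tiling and GGUF still have to be set up in the workflow itself because they are nodes rather than launch flags.

```python
def build_launch_command(vram_gb, large_model=False, fp16_vae=False):
    """Assemble a ComfyUI launch command from the tier guidance in this guide."""
    cmd = ["python", "main.py"]
    # 6GB and below: --lowvram by default.
    # Up to 8GB: only for Flux or other very large models.
    if vram_gb <= 6 or (vram_gb <= 8 and large_model):
        cmd.append("--lowvram")
    if fp16_vae:
        cmd.append("--fp16-vae")
    return cmd


print(" ".join(build_launch_command(6, fp16_vae=True)))
```

Returning a list instead of a string keeps the result easy to pass to subprocess.run if you later want to launch ComfyUI from a script.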


FAQ: Common Questions About VRAM Optimization

Q: What’s the difference between --lowvram and --novram?

A: --lowvram offloads parts of the model to CPU RAM during generation, recovering them when needed. Works well with 6–8GB. --novram puts the entire model in CPU RAM; generation can take 10–30 minutes but works even on 4GB. For regular use, --lowvram is the sweet spot.

Q: Does GGUF reduce image quality?

A: Q8_0: quality virtually identical to the original FP16. Q5_K_M: barely noticeable difference. Q4_K_M: slight loss in fine detail, acceptable for most uses. Q3 and Q2: noticeable loss. For production work, Q5_K_M offers the best quality/VRAM balance.

Q: How do I enable Tile VAE in ComfyUI?

A: Install the TileVAE node from ComfyUI Manager, or use the ‘Enable VAE Tiling’ node from the advanced nodes pack. Connect it between the VAE Loader and VAE Decode. Set tile_size to 512 for 8GB, 384 for 6GB, or 256 for 4GB. It matters most for large images (1024×1024 and above).

Q: Can I run 24GB models on an 8GB GPU?

A: With GGUF Q4_K_M, yes. A 24GB full Flux Dev compresses to ~7GB at Q4_K_M. A 30GB Wan 2.2 14B compresses to ~8–10GB at Q4_K_M. Quality is good but not identical to the original. For video with Wan, enabling sequential_cpu_offload in the ModelLoader lets you run the full 14B model on 8GB of VRAM by using 14–16GB of system RAM.


Keep Reading

GGUF quantization is one of the most effective single changes you can make — see our dedicated GGUF guide for running Flux on 8GB VRAM. If video generation is on your roadmap, our Wan 2.2 image-to-video guide covers sequential offloading for that specific workload. Still not enough VRAM locally? Renting a cloud GPU is a realistic bridge before buying new hardware.


🏆 Our Recommendation

If you have a 6–8GB GPU and want the easiest setup: Start with --lowvram + VAE tiling. This is the baseline that works for most users without touching advanced features.

If you need to run Flux or other 24GB+ models: Add GGUF quantization at Q4_K_M or Q5_K_M. The quality/VRAM trade-off is unbeatable.

If you’re on 4GB: GGUF is non-negotiable. Combine it with --lowvram and VAE tiling. If that still fails, --novram is your fallback, but expect 10–30 minute generation times.

If speed is your priority: Stick with --lowvram and skip GGUF unless you absolutely need Flux. The native model will always be faster than a quantized version.

Test each optimization in isolation first to understand the trade-offs in your specific setup—VRAM behavior varies with CPU speed, system RAM, and model architecture. The ComfyUI community regularly shares optimized workflows and quantized models on Discord and GitHub; leverage those resources rather than starting from scratch.
