Introducing NVFP4: 4x Faster Inference at Half the Cost
Today, we’re excited to share how Weyl achieves industry-leading inference performance by leveraging NVIDIA’s latest Blackwell architecture and FP4 (4-bit floating point) precision.
The Challenge with Traditional Inference
Most inference platforms run diffusion models at FP16 or FP32 precision. While this maintains maximum quality, it comes with significant trade-offs:
- Memory bandwidth bottlenecks: Moving 16 or 32 bits per parameter limits throughput (see the quick calculation after this list)
- Higher power consumption: More bits mean more energy per operation
- Increased costs: Larger memory footprints require more expensive hardware
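To make the bandwidth point concrete, here is a quick back-of-the-envelope calculation of how much data just streaming the weights costs at each precision. The parameter count and bandwidth figure are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: weight traffic for one full pass over the model.
# The parameter count and bandwidth below are illustrative assumptions.
params = 2.6e9            # assumed diffusion UNet parameter count
bandwidth_bytes_s = 8e12  # assumed ~8 TB/s of HBM bandwidth

for name, bits in [("FP32", 32), ("FP16", 16), ("FP4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    millis = gigabytes * 1e9 / bandwidth_bytes_s * 1e3
    print(f"{name}: {gigabytes:.2f} GB of weights, ~{millis:.2f} ms just to stream them once")
```

Multiply that per-pass cost by the 25-50 denoising steps of a typical diffusion run and the weight-streaming difference quickly dominates end-to-end latency.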
Enter FP4 Quantization
NVIDIA’s Blackwell GB200 introduces hardware-accelerated FP4 Tensor Cores and the NVFP4 data format. By quantizing model weights to 4 bits, we achieve:
- 4x memory bandwidth improvement: Same bandwidth moves 4x more weights
- 2x energy efficiency: Fewer bits per operation reduce power draw
- Minimal quality loss: Careful quantization preserves 98%+ output fidelity
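For intuition about what 4-bit quantization actually does to the weights, here is a minimal NumPy sketch of block-scaled FP4 (E2M1) rounding. This is not our production quantizer; the block size and scaling scheme are simplified assumptions for illustration:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1); negatives mirror these.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(weights, block_size=16):
    """Round weights to the nearest FP4 value using one scale per block.

    Minimal sketch: real FP4 kernels store packed 4-bit codes plus per-block
    scales; here we simply return the dequantized result for comparison.
    """
    w = weights.reshape(-1, block_size)
    # One scale per block so the largest magnitude lands on the top of the FP4 grid.
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0
    scaled = w / scales
    # Nearest-neighbour rounding onto the signed FP4 grid.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scales).reshape(weights.shape)

w = np.random.randn(1024).astype(np.float32)
w_fp4 = quantize_fp4_blockwise(w)
print("mean absolute quantization error:", np.abs(w - w_fp4).mean())
```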
Real-World Performance
On SDXL inference workloads, we’re seeing remarkable improvements:
| Metric | FP16 Baseline | Weyl FP4 | Improvement |
|---|---|---|---|
| P50 Latency | 180ms | 47ms | 3.8x faster |
| P99 Latency | 320ms | 92ms | 3.5x faster |
| Throughput | 12 img/s | 51 img/s | 4.25x higher |
| Cost per 1K images | $2.40 | $0.58 | 76% reduction |
Quality Preservation
The critical question: Does FP4 hurt quality?
Through extensive A/B testing with 50,000+ generations, we found:
- Visual similarity score: 0.982 (1.0 = identical)
- Perceptual hash distance: Less than 5% degradation
- Human preference: No statistically significant difference in blind tests
The key is our quantization-aware training pipeline and careful calibration per model.
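As one example of how such a comparison can be scripted, the sketch below computes a normalized perceptual-hash distance between an FP16 render and an FP4 render of the same prompt. It uses the open-source Pillow and imagehash packages with placeholder file names, and mirrors the kind of check described above rather than our exact evaluation harness:

```python
from PIL import Image
import imagehash

def phash_distance(path_fp16, path_fp4, hash_size=16):
    """Normalized perceptual-hash distance between two renders of the same prompt.

    0.0 means the hashes match exactly; small values indicate near-identical images.
    """
    h16 = imagehash.phash(Image.open(path_fp16), hash_size=hash_size)
    h4 = imagehash.phash(Image.open(path_fp4), hash_size=hash_size)
    # Hamming distance between hashes, divided by the number of hash bits.
    return (h16 - h4) / len(h16.hash.flatten())

# Example usage with placeholder paths:
# print(phash_distance("baseline_fp16.png", "weyl_fp4.png"))
```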
Architecture Deep Dive
Our FP4 inference stack consists of three layers:
1. Model Quantization
We use a hybrid approach:
- Diffusion UNet weights → FP4
- Text encoder → FP8 (more sensitive to precision)
- VAE decoder → FP16 (final quality-critical step)
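In code, this split can be expressed as a simple per-module precision map. The module names and helper below are hypothetical and only meant to make the layering concrete:

```python
# Illustrative per-module precision map for the hybrid scheme above.
# Module names are hypothetical; real pipelines name their components differently.
PRECISION_MAP = {
    "diffusion_unet": "fp4",   # bulk of the weights and memory traffic
    "text_encoder": "fp8",     # more sensitive to quantization error
    "vae_decoder": "fp16",     # final, quality-critical decoding step
}

def precision_for(module_name: str) -> str:
    """Return the target precision for a module, defaulting to FP16 if unknown."""
    return PRECISION_MAP.get(module_name, "fp16")

assert precision_for("diffusion_unet") == "fp4"
```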
2. Runtime Optimization
Custom CUDA kernels fused with TensorRT-LLM:
```python
from numba import cuda

# Simplified example of our fused kernel (illustrative sketch, not the production kernel)
@cuda.jit
def fused_attention_fp4(Q, K, V, out):
    # Dequantize FP4 → FP16 on-the-fly
    # Compute attention in FP16
    # Write results without intermediate storage
    pass
```

3. Memory Management
Zero-copy architecture:
- Weights stored in FP4 in GPU memory
- Dequantization happens in tensor cores during compute
- No separate FP16 copies needed
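To illustrate the storage side of this, here is a small NumPy sketch of packing two 4-bit codes into each byte and unpacking them again. On the GPU, the decode happens inside the Tensor Core matmul rather than in a separate pass, which is what makes the design zero-copy:

```python
import numpy as np

def pack_fp4_codes(codes):
    """Pack 4-bit integer codes (values 0-15) two per byte, halving storage."""
    codes = codes.astype(np.uint8)
    return ((codes[0::2] << 4) | codes[1::2]).astype(np.uint8)

def unpack_fp4_codes(packed):
    """Recover the 4-bit codes; on GPU this decode happens inside the matmul."""
    high = packed >> 4
    low = packed & 0x0F
    return np.stack([high, low], axis=1).reshape(-1)

codes = np.random.randint(0, 16, size=64)
packed = pack_fp4_codes(codes)
assert packed.nbytes == codes.size // 2          # half the bytes in memory
assert np.array_equal(unpack_fp4_codes(packed), codes)
```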
Developer Experience
Using FP4 inference is transparent:
```bash
curl -X POST https://api.weyl.ai/v1/generate \
  -H "Authorization: Bearer $WEYL_API_KEY" \
  -d '{
    "prompt": "a hypermodern datacenter at sunset",
    "model": "sdxl-fp4",
    "steps": 25
  }'
```

Same API. Same quality. 4x faster.
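For Python users, a minimal equivalent of the curl call might look like the following. The endpoint, model name, and request fields come from the example above; the response handling is an assumption:

```python
import os
import requests

# Mirrors the curl example above; the shape of the response body is an assumption.
resp = requests.post(
    "https://api.weyl.ai/v1/generate",
    headers={"Authorization": f"Bearer {os.environ['WEYL_API_KEY']}"},
    json={
        "prompt": "a hypermodern datacenter at sunset",
        "model": "sdxl-fp4",
        "steps": 25,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```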
What’s Next
We’re actively working on:
- Video models: Extending FP4 to WAN and other video diffusion models
- Multi-modal: FP4 optimization for vision-language models
- Mixed precision: Automatic per-layer precision selection
Try It Today
All Weyl users automatically benefit from FP4 optimization. No code changes required.
Get started: https://docs.weyl.ai/getting-started