Introducing NVFP4: 4x Faster Inference at Half the Cost
Today, we’re excited to share how Weyl achieves industry-leading inference performance by leveraging NVIDIA’s latest Blackwell architecture and FP4 (4-bit floating point) precision.
The Challenge with Traditional Inference
Most inference platforms run diffusion models at FP16 or FP32 precision. While this maintains maximum quality, it comes with significant trade-offs:
- Memory bandwidth bottlenecks: Moving 16 or 32 bits per parameter limits throughput (see the quick calculation after this list)
- Higher power consumption: More bits mean more energy per operation
- Increased costs: Larger memory footprints require more expensive hardware
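To make the bandwidth point concrete, here is a quick back-of-the-envelope calculation of how much data just streaming the weights costs at each precision. The parameter count and bandwidth figure are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: weight traffic for one full pass over the model.
# The parameter count and bandwidth below are illustrative assumptions.
params = 2.6e9            # assumed diffusion UNet parameter count
bandwidth_bytes_s = 8e12  # assumed ~8 TB/s of HBM bandwidth

for name, bits in [("FP32", 32), ("FP16", 16), ("FP4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    millis = gigabytes * 1e9 / bandwidth_bytes_s * 1e3
    print(f"{name}: {gigabytes:.2f} GB of weights, ~{millis:.2f} ms just to stream them once")
```

Multiply that per-pass cost by the 25-50 denoising steps of a typical diffusion run and the weight-streaming difference quickly dominates end-to-end latency.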
Enter FP4 Quantization
NVIDIA’s Blackwell GB200 introduces hardware-accelerated FP4 Tensor Cores and the NVFP4 data format. By quantizing model weights to 4 bits, we achieve:
- 4x memory bandwidth improvement: Same bandwidth moves 4x more weights
- 2x energy efficiency: Fewer bits per operation reduce power draw
- Minimal quality loss: Careful quantization preserves 98%+ output fidelity
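For intuition about what 4-bit quantization actually does to the weights, here is a minimal NumPy sketch of block-scaled FP4 (E2M1) rounding. This is not our production quantizer; the block size and scaling scheme are simplified assumptions for illustration:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1); negatives mirror these.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(weights, block_size=16):
    """Round weights to the nearest FP4 value using one scale per block.

    Minimal sketch: real FP4 kernels store packed 4-bit codes plus per-block
    scales; here we simply return the dequantized result for comparison.
    """
    w = weights.reshape(-1, block_size)
    # One scale per block so the largest magnitude lands on the top of the FP4 grid.
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0
    scaled = w / scales
    # Nearest-neighbour rounding onto the signed FP4 grid.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scales).reshape(weights.shape)

w = np.random.randn(1024).astype(np.float32)
w_fp4 = quantize_fp4_blockwise(w)
print("mean absolute quantization error:", np.abs(w - w_fp4).mean())
```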
Real-World Performance
On SDXL inference workloads, we’re seeing remarkable improvements:
| Metric | FP16 Baseline | Weyl FP4 | Improvement |
|---|---|---|---|
| P50 Latency | 180ms | 47ms | 3.8x faster |
| P99 Latency | 320ms | 92ms | 3.5x faster |
| Throughput | 12 img/s | 51 img/s | 4.25x higher |
| Cost per 1K images | $2.40 | $0.58 | 76% reduction |
Quality Preservation
The critical question: Does FP4 hurt quality?
Through extensive A/B testing with 50,000+ generations, we found:
- Visual similarity score: 0.982 (1.0 = identical)
- Perceptual hash distance: Less than 5% degradation
- Human preference: No statistically significant difference in blind tests
The key is our quantization-aware training pipeline and careful calibration per model.
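As one example of how such a comparison can be scripted, the sketch below computes a normalized perceptual-hash distance between an FP16 render and an FP4 render of the same prompt. It uses the open-source Pillow and imagehash packages with placeholder file names, and mirrors the kind of check described above rather than our exact evaluation harness:

```python
from PIL import Image
import imagehash

def phash_distance(path_fp16, path_fp4, hash_size=16):
    """Normalized perceptual-hash distance between two renders of the same prompt.

    0.0 means the hashes match exactly; small values indicate near-identical images.
    """
    h16 = imagehash.phash(Image.open(path_fp16), hash_size=hash_size)
    h4 = imagehash.phash(Image.open(path_fp4), hash_size=hash_size)
    # Hamming distance between hashes, divided by the number of hash bits.
    return (h16 - h4) / len(h16.hash.flatten())

# Example usage with placeholder paths:
# print(phash_distance("baseline_fp16.png", "weyl_fp4.png"))
```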
Architecture Deep Dive
Our FP4 inference stack consists of three layers:
1. Model Quantization
We use a hybrid approach:
- Diffusion UNet weights → FP4
- Text encoder → FP8 (more sensitive to precision)
- VAE decoder → FP16 (final quality-critical step)
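In code, this split can be expressed as a simple per-module precision map. The module names and helper below are hypothetical and only meant to make the layering concrete:

```python
# Illustrative per-module precision map for the hybrid scheme above.
# Module names are hypothetical; real pipelines name their components differently.
PRECISION_MAP = {
    "diffusion_unet": "fp4",   # bulk of the weights and memory traffic
    "text_encoder": "fp8",     # more sensitive to quantization error
    "vae_decoder": "fp16",     # final, quality-critical decoding step
}

def precision_for(module_name: str) -> str:
    """Return the target precision for a module, defaulting to FP16 if unknown."""
    return PRECISION_MAP.get(module_name, "fp16")

assert precision_for("diffusion_unet") == "fp4"
```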
2. Runtime Optimization
Custom CUDA kernels fused with TensorRT-LLM:
```python
from numba import cuda

# Simplified example of our fused kernel (illustrative sketch, not the production kernel)
@cuda.jit
def fused_attention_fp4(Q, K, V, out):
    # Dequantize FP4 → FP16 on-the-fly
    # Compute attention in FP16
    # Write results without intermediate storage
    pass
```

3. Memory Management
Zero-copy architecture:
- Weights stored in FP4 in GPU memory
- Dequantization happens in tensor cores during compute
- No separate FP16 copies needed
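To illustrate the storage side of this, here is a small NumPy sketch of packing two 4-bit codes into each byte and unpacking them again. On the GPU, the decode happens inside the Tensor Core matmul rather than in a separate pass, which is what makes the design zero-copy:

```python
import numpy as np

def pack_fp4_codes(codes):
    """Pack 4-bit integer codes (values 0-15) two per byte, halving storage."""
    codes = codes.astype(np.uint8)
    return ((codes[0::2] << 4) | codes[1::2]).astype(np.uint8)

def unpack_fp4_codes(packed):
    """Recover the 4-bit codes; on GPU this decode happens inside the matmul."""
    high = packed >> 4
    low = packed & 0x0F
    return np.stack([high, low], axis=1).reshape(-1)

codes = np.random.randint(0, 16, size=64)
packed = pack_fp4_codes(codes)
assert packed.nbytes == codes.size // 2          # half the bytes in memory
assert np.array_equal(unpack_fp4_codes(packed), codes)
```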
Developer Experience
Using FP4 inference is transparent:
```bash
curl -X POST https://api.weyl.ai/v1/generate \
  -H "Authorization: Bearer $WEYL_API_KEY" \
  -d '{
    "prompt": "a hypermodern datacenter at sunset",
    "model": "sdxl-fp4",
    "steps": 25
  }'
```

Same API. Same quality. 4x faster.
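For Python users, a minimal equivalent of the curl call might look like the following. The endpoint, model name, and request fields come from the example above; the response handling is an assumption:

```python
import os
import requests

# Mirrors the curl example above; the shape of the response body is an assumption.
resp = requests.post(
    "https://api.weyl.ai/v1/generate",
    headers={"Authorization": f"Bearer {os.environ['WEYL_API_KEY']}"},
    json={
        "prompt": "a hypermodern datacenter at sunset",
        "model": "sdxl-fp4",
        "steps": 25,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```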
What’s Next
We’re actively working on:
- Video models: Extending FP4 to WAN and other video diffusion models
- Multi-modal: FP4 optimization for vision-language models
- Mixed precision: Automatic per-layer precision selection
Try It Today
All Weyl users automatically benefit from FP4 optimization. No code changes required.
Get started: https://docs.weyl.ai/getting-started