
Introducing NVFP4: 4x Faster Inference at Half the Cost

Weyl Engineering Team
Tags: performance, infrastructure, blackwell


Today, we’re excited to share how Weyl achieves industry-leading inference performance by leveraging NVIDIA’s latest Blackwell architecture and FP4 (4-bit floating point) precision.

The Challenge with Traditional Inference

Most inference platforms run diffusion models at FP16 or FP32 precision. While this preserves maximum numerical fidelity, it comes with significant trade-offs in memory footprint, latency, and serving cost.

Enter FP4 Quantization

NVIDIA’s Blackwell GB200 introduces hardware-accelerated FP4 tensor cores. By quantizing model weights to 4 bits, we cut memory traffic substantially and unlock higher throughput at lower cost.
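To make the idea concrete, here is a minimal sketch (not our production code) of round-to-nearest quantization onto the FP4 (E2M1) value grid, which represents sign × {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a single per-tensor scale:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1);
# with a sign bit this gives 16 codes total.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w: np.ndarray):
    """Round-to-nearest FP4 quantization with a per-tensor scale."""
    scale = np.abs(w).max() / FP4_E2M1[-1]          # map largest weight to ±6.0
    grid = np.concatenate([-FP4_E2M1[:0:-1], FP4_E2M1])  # full signed value grid
    idx = np.abs((w / scale)[..., None] - grid).argmin(axis=-1)
    return grid[idx], scale                          # quantized values and scale

def dequantize_fp4(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

w = np.array([0.12, -0.5, 0.33, 0.05])
q, s = quantize_fp4(w)        # q = [1.5, -6.0, 4.0, 0.5]
w_hat = dequantize_fp4(q, s)  # close to w, within FP4 rounding error
```

Production quantizers add per-channel or per-block scales on top of this, but the rounding step is the same.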

Real-World Performance

On SDXL inference workloads, we’re seeing remarkable improvements:

| Metric | FP16 Baseline | Weyl FP4 | Improvement |
|---|---|---|---|
| P50 Latency | 180ms | 47ms | 3.8x faster |
| P99 Latency | 320ms | 92ms | 3.5x faster |
| Throughput | 12 img/s | 51 img/s | 4.25x higher |
| Cost per 1K images | $2.40 | $0.58 | 76% reduction |
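The improvement column follows directly from the raw measurements; a quick check:

```python
# Sanity-check the improvement column from the raw numbers in the table.
p50_fp16, p50_fp4 = 180, 47                # ms
throughput_fp16, throughput_fp4 = 12, 51   # img/s
cost_fp16, cost_fp4 = 2.40, 0.58           # $ per 1K images

print(f"P50 speedup:    {p50_fp16 / p50_fp4:.1f}x")                # 3.8x
print(f"Throughput:     {throughput_fp4 / throughput_fp16:.2f}x")  # 4.25x
print(f"Cost reduction: {1 - cost_fp4 / cost_fp16:.0%}")           # 76%
```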

Quality Preservation

The critical question: Does FP4 hurt quality?

Through extensive A/B testing across 50,000+ generations, we found no perceptible quality degradation relative to the FP16 baseline.

The key is our quantization-aware training pipeline and careful calibration per model.
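What per-model calibration might look like in its simplest form: derive one scale per channel from calibration activations via abs-max. This is a hypothetical sketch; the function name and the (batch, channels) layout are illustrative assumptions, not our actual pipeline.

```python
import numpy as np

FP4_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)

def calibrate_scales(calib_batches):
    """Abs-max calibration: one FP4 scale per channel (illustrative sketch)."""
    max_abs = None
    for x in calib_batches:            # x: (batch, channels)
        batch_max = np.abs(x).max(axis=0)
        max_abs = batch_max if max_abs is None else np.maximum(max_abs, batch_max)
    return max_abs / FP4_MAX           # one scale per channel

np.random.seed(0)
batches = [np.random.randn(32, 4) for _ in range(10)]  # stand-in calibration data
scales = calibrate_scales(batches)                     # shape (4,), all positive
```

Real pipelines refine this with percentile clipping or quantization-aware fine-tuning, but abs-max is the usual starting point.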

Architecture Deep Dive

Our FP4 inference stack consists of three layers:

1. Model Quantization

We use a hybrid approach that combines quantization-aware training with careful per-model calibration.

2. Runtime Optimization

Custom CUDA kernels fused with TensorRT-LLM:

```python
# Simplified example of our fused kernel
from numba import cuda

@cuda.jit
def fused_attention_fp4(Q, K, V, out):
    # Dequantize FP4 → FP16 on-the-fly,
    # compute attention in FP16, and
    # write results without intermediate storage.
    pass
```
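
The dequantize-on-the-fly step above can be sketched on the CPU as a table lookup. This assumes the standard E2M1 bit layout (high bit is sign, low three bits select the magnitude); it illustrates the decode a fused kernel performs per element:

```python
import numpy as np

# FP4 (E2M1) decode table: codes 0..7 map to these magnitudes.
FP4_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float16)

def dequantize_fp4_codes(codes: np.ndarray, scale: float) -> np.ndarray:
    """CPU sketch of the per-element FP4 → FP16 decode a fused kernel does."""
    sign = np.where(codes & 0b1000, -1.0, 1.0).astype(np.float16)
    return sign * FP4_LUT[codes & 0b0111] * np.float16(scale)

codes = np.array([0b0001, 0b1001, 0b0111], dtype=np.uint8)  # +0.5, -0.5, +6.0
vals = dequantize_fp4_codes(codes, scale=1.0)               # [0.5, -0.5, 6.0]
```

On the GPU, the same lookup lives in registers or shared memory, so no FP16 copy of the weights ever touches global memory.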

3. Memory Management

Our zero-copy architecture keeps tensors resident on the GPU end to end, avoiding intermediate transfers and buffer duplication between pipeline stages.
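The principle can be shown in miniature: `np.frombuffer` reinterprets an existing byte buffer as an array without duplicating it, the same idea a zero-copy runtime applies at every stage. The two-codes-per-byte FP4 packing shown is an illustrative assumption:

```python
import numpy as np

raw = bytes([0x12, 0x34])
packed = np.frombuffer(raw, dtype=np.uint8)  # a view over raw, not a copy

low_nibbles = packed & 0x0F   # 4-bit codes in the low half of each byte
high_nibbles = packed >> 4    # 4-bit codes in the high half
```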

Developer Experience

Using FP4 inference is transparent:

```shell
curl -X POST https://api.weyl.ai/v1/generate \
  -H "Authorization: Bearer $WEYL_API_KEY" \
  -d '{
    "prompt": "a hypermodern datacenter at sunset",
    "model": "sdxl-fp4",
    "steps": 25
  }'
```
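For reference, the same call can be expressed in Python with only the standard library; the endpoint and fields are taken from the curl example above:

```python
import json
import os

payload = {
    "prompt": "a hypermodern datacenter at sunset",
    "model": "sdxl-fp4",
    "steps": 25,
}
headers = {
    "Authorization": f"Bearer {os.environ.get('WEYL_API_KEY', '')}",
    "Content-Type": "application/json",
}
body = json.dumps(payload)

# To actually send it (requires a valid key):
# import urllib.request
# req = urllib.request.Request("https://api.weyl.ai/v1/generate",
#                              data=body.encode(), headers=headers)
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```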

Same API. Same quality. 4x faster.

What’s Next

We’re actively working on extending FP4 support to more models and workloads.

Try It Today

All Weyl users automatically benefit from FP4 optimization. No code changes required.

Get started: https://docs.weyl.ai/getting-started


Questions? Reach out on Twitter or GitHub.