Weyl Standard Python
Production Python for GPU inference and ML orchestration, emphasizing type safety, structured logging, and disambiguation over brevity.
// weyl standard // production python
The Gap
Production Python lives between “just use numpy” and “C++ and cigarettes.” The GPU does the work; Python orchestrates it correctly.
No notebooks. No global variables. Type hints, structured logging, proper error boundaries, reproducible seeds. We’re not exploring ideas—we’re deploying inference at scale.
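A minimal sketch of the "reproducible seeds" point, assuming the usual torch + numpy stack (the helper name is mine, not part of any standard):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 1234) -> None:
  """Seed every RNG a training or inference run can touch."""
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)
  # For bitwise reproducibility, also consider torch.use_deterministic_algorithms(True).
```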
Core: Optimize for Disambiguation
Agents write code in seconds. Humans debug it at 3am. Every ambiguity compounds.
```python
# costs 0.1s to write, 10min to debug
def process(x): return model(x) if x.shape[0] > 0 else None
```

```python
# costs 0.2s to write, saves hours
def process_inference_batch(
  input_batch: torch.Tensor,
  model: InferenceEngine,
  device: torch.device,
) -> InferenceBatchResult:
  if input_batch.shape[0] == 0:
    return InferenceBatchResult.empty()
  return model.forward(input_batch, device=device)
```

Python 3.12+
Exception groups, TypeVarTuple, Self type, pattern matching, better errors. If you’re on 3.10, you’re missing table stakes.
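A small sketch of two of these in play, with hypothetical names (EngineBuilder, run_batch), showing Self for fluent builders and exception groups for batched failures:

```python
from collections.abc import Callable
from typing import Self

class EngineBuilder:
  device: str = "cpu"

  # Self (3.11+): subclasses keep the precise return type.
  def with_device(self, device: str) -> Self:
    self.device = device
    return self

def run_batch(handlers: list[Callable[[], object]]) -> None:
  errors: list[Exception] = []
  for handler in handlers:
    try:
      handler()
    except Exception as ex:
      errors.append(ex)
  if errors:
    # Exception groups (3.11+): surface every failure, not just the first.
    raise ExceptionGroup("batch failed", errors)

try:
  run_batch([lambda: 1 / 0])
except* ZeroDivisionError as group:
  # except* (3.11+) selects the matching members of the group.
  print(f"caught {len(group.exceptions)} division errors")
```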
Style: Weyl Standard
- 2-space indent — matches C++, fits more on screen
- Double quotes — "string"
- Always ex for exceptions — except Exception as ex:
- Lowercase types — list[str], dict[str, int]
- Union as pipe — str | None, not Optional[str]
- f-strings only — never % or .format() (all six rules appear in the sketch below)
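A minimal sketch pulling the six rules into one function (the function and its types are made up for illustration):

```python
def summarize_batch(labels: list[str], counts: dict[str, int]) -> str | None:
  # 2-space indent, lowercase generics, str | None instead of Optional[str]
  if not labels:
    return None
  try:
    total = sum(counts[label] for label in labels)
  except KeyError as ex:
    # always ex for exceptions
    raise ValueError(f"unknown label: {ex}") from ex
  return f"{len(labels)} labels, {total} items"
```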
Naming: Three-Character Rule
If it’s ≤3 chars, it’s probably wrong for production.
```python
# BAD
cfg = load_cfg()
res = proc(req)
```

```python
# GOOD
configuration = load_model_configuration()
result = process_inference_request(request)
```

Exceptions (local scope only): idx/jdx, lhs/rhs, key/value, row/col
Type Hints: Non-Negotiable
Every function. Use ty in CI.
```python
def load_inference_model(
  checkpoint_path: Path,
  device: torch.device,
  dtype: torch.dtype = torch.float16,
) -> nn.Module:
  """Load model for inference.

  Raises:
    FileNotFoundError: Checkpoint missing
    RuntimeError: Architecture mismatch
  """
  if not checkpoint_path.exists():
    raise FileNotFoundError(f"checkpoint not found: {checkpoint_path}")

  model = torch.load(checkpoint_path, map_location="cpu")
  return model.to(device=device, dtype=dtype)
```

Type Aliases
```python
from typing import NewType

UserId = NewType("UserId", int)
ModelId = NewType("ModelId", str)

# Tensor shapes
BatchTensor = torch.Tensor     # [batch, ...]
ImageTensor = torch.Tensor     # [B, C, H, W]
SequenceTensor = torch.Tensor  # [B, S, D]
```
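A quick sketch of why NewType earns its keep: the checker rejects a plain int where a UserId is expected, at zero runtime cost. fetch_user_model and the ID values are hypothetical:

```python
def fetch_user_model(user_id: UserId, model_id: ModelId) -> None: ...

user_id = UserId(42)
model_id = ModelId("llama-70b-awq")

fetch_user_model(user_id, model_id)  # ok
fetch_user_model(7, model_id)        # flagged by the type checker: int is not UserId
```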
Configuration: Parse Once, Validate Completely

```python
from pydantic import BaseModel, Field, field_validator

class InferenceServerConfig(BaseModel):
  model_checkpoint: Path
  device_id: int = Field(ge=0, le=7)
  batch_size: int = Field(ge=1, le=1024)
  quantization_bits: int = Field(default=16)

  @field_validator("model_checkpoint")
  @classmethod
  def checkpoint_must_exist(cls, path: Path) -> Path:
    if not path.exists():
      raise ValueError(f"checkpoint not found: {path}")
    return path

  @field_validator("quantization_bits")
  @classmethod
  def validate_quantization(cls, bits: int) -> int:
    if bits not in {4, 8, 16}:
      raise ValueError(f"unsupported: {bits} bits (must be 4, 8, 16)")
    return bits
```
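A hedged sketch of the "parse once" half: validate at process start, before any GPU work. The config path and serve() entry point are placeholders:

```python
import json

def main() -> None:
  raw = json.loads(Path("/etc/weyl/inference.json").read_text())
  config = InferenceServerConfig.model_validate(raw)  # fails fast, before touching the GPU
  serve(config)  # hypothetical entry point; downstream code gets a validated object, not a dict
```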
Control Flow: Flat Over Nested

Early returns. Guard clauses. No pyramids.
```python
def process_training_batch(
  batch: DataBatch,
  model: nn.Module,
  optimizer: torch.optim.Optimizer,
) -> TrainingMetrics:
  if not batch.validate():
    return TrainingMetrics.empty()

  if not model.training:
    raise RuntimeError("model must be in training mode")

  optimizer.zero_grad()
  output = model(batch.input_tensor)

  if output is None:
    return TrainingMetrics.empty()

  loss = compute_loss(output, batch.target_tensor)

  if torch.isnan(loss):
    raise RuntimeError(f"nan loss: {loss}")

  loss.backward()
  optimizer.step()

  return TrainingMetrics(loss=loss.item())
```

Error Handling: Result Types
```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")
E = TypeVar("E")

@dataclass(frozen=True)
class Ok(Generic[T]):
  value: T

@dataclass(frozen=True)
class Err(Generic[E]):
  error: E

Result = Ok[T] | Err[E]

def load_checkpoint(path: Path) -> Result[nn.Module, str]:
  if not path.exists():
    return Err(f"not found: {path}")
  try:
    return Ok(torch.load(path))
  except Exception as ex:
    return Err(f"load failed: {ex}")

# Pattern match it
match load_checkpoint(path):
  case Ok(model):
    run_inference(model)
  case Err(error):
    log.error("checkpoint_failed", error=error)
```

ML/GPU: ATen Operations
Skip the Python overhead. Hit the metal.
```python
import torch
from torch import Tensor

def fused_gelu_forward(input_tensor: Tensor) -> Tensor:
  """GELU via aten ops. No Python dispatch overhead."""
  return torch.ops.aten.gelu(input_tensor)

def quantize_symmetric_int8(
  tensor: Tensor,
  scale: Tensor,
) -> Tensor:
  """Symmetric INT8 quantization via aten."""
  scaled = torch.ops.aten.div(tensor, scale)
  rounded = torch.ops.aten.round(scaled)
  return torch.ops.aten.clamp(rounded, -128, 127).to(torch.int8)

def dequantize_symmetric_int8(
  quantized: Tensor,
  scale: Tensor,
) -> Tensor:
  """Dequantize INT8 back to float."""
  return torch.ops.aten.mul(quantized.float(), scale)
```

Real NVFP4 Quantization
NVFP4 on Blackwell: 4-bit floating point with a shared scale factor per block. E2M1 format—1 sign bit, 2 exponent bits, 1 mantissa bit—dynamic range over fixed precision.
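As a sanity check on that claim, here is a small sketch (mine, not from any spec text) that enumerates the eight E2M1 magnitudes: 2 exponent bits with bias 1 (exponent field 0 is subnormal) and 1 mantissa bit:

```python
def e2m1_magnitudes() -> list[float]:
  # Exponent 0 is subnormal (0 and 0.5); exponents 1-3 give
  # (1 + 0.5 * mantissa) * 2 ** (exponent - 1).
  values: list[float] = []
  for exponent in range(4):
    for mantissa in range(2):
      if exponent == 0:
        values.append(0.5 * mantissa)
      else:
        values.append((1 + 0.5 * mantissa) * 2 ** (exponent - 1))
  return values

assert e2m1_magnitudes() == [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```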
```python
import torch
from torch import Tensor

# NVFP4 E2M1 magnitude table (8 values; the sign lives in a separate bit)
NVFP4_E2M1_TABLE = torch.tensor(
  [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
  dtype=torch.float32,
)

def quantize_nvfp4_block(
  tensor: Tensor,
  block_size: int = 32,
) -> tuple[Tensor, Tensor]:
  """Quantize to NVFP4 with per-block scaling.

  Args:
    tensor: Input tensor, any shape (last dim divisible by block_size)
    block_size: Elements sharing one scale factor (32 for Blackwell)

  Returns:
    (packed_uint8, scales) - two 4-bit codes per byte, one scale per block
  """
  original_shape = tensor.shape
  assert tensor.shape[-1] % block_size == 0
  assert block_size % 2 == 0  # needed for nibble packing

  # Reshape to [num_blocks, block_size]
  flat = tensor.view(-1, block_size)

  # Per-block absmax scaling: map the largest magnitude onto 6.0, the E2M1 max
  absmax = torch.ops.aten.abs(flat).amax(dim=-1, keepdim=True)
  scales = absmax / 6.0
  scales = torch.where(scales == 0, torch.ones_like(scales), scales)

  # Scale, then split sign and magnitude
  scaled = flat / scales
  signs = torch.sign(scaled)
  abs_scaled = torch.abs(scaled)

  # Quantize: nearest magnitude in the lookup table (index 0-7)
  table = NVFP4_E2M1_TABLE.to(tensor.device)
  distances = (abs_scaled.unsqueeze(-1) - table).abs()
  indices = distances.argmin(dim=-1).to(torch.uint8)

  # Sign goes in bit 3: codes 0-7 are positive, 8-15 negative
  codes = torch.where(signs < 0, indices + 8, indices).to(torch.uint8)

  # Pack two 4-bit codes per byte
  even = codes[..., 0::2]
  odd = codes[..., 1::2]
  packed = (even << 4) | odd

  return packed.reshape(*original_shape[:-1], -1), scales.view(-1)

def dequantize_nvfp4_block(
  packed: Tensor,
  scales: Tensor,
  block_size: int = 32,
) -> Tensor:
  """Dequantize NVFP4 back to float."""
  # Unpack the two 4-bit codes in each byte
  even = (packed >> 4) & 0x0F
  odd = packed & 0x0F

  # Bit 3 is the sign, bits 0-2 index the magnitude table
  even_sign = torch.where(even >= 8, -1.0, 1.0)
  odd_sign = torch.where(odd >= 8, -1.0, 1.0)
  even_idx = even % 8
  odd_idx = odd % 8

  # Lookup and apply signs
  table = NVFP4_E2M1_TABLE.to(packed.device)
  even_vals = table[even_idx.long()] * even_sign
  odd_vals = table[odd_idx.long()] * odd_sign

  # Interleave back to the original element order
  unpacked = torch.stack([even_vals, odd_vals], dim=-1).flatten(-2)

  # Apply per-block scales
  scaled = unpacked.reshape(-1, block_size) * scales.view(-1, 1)
  return scaled.reshape(unpacked.shape)
```

CUDA Stream Management
```python
from contextlib import contextmanager
from typing import Iterator

@contextmanager
def cuda_stream_context(device: torch.device) -> Iterator[torch.cuda.Stream]:
  """CUDA stream with automatic sync on exit."""
  stream = torch.cuda.Stream(device=device)
  try:
    with torch.cuda.stream(stream):
      yield stream
  finally:
    stream.synchronize()
```
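A sketch of one way to use it, staging work off the default stream; model and host_batch are placeholders, and non_blocking only pays off with pinned host memory:

```python
def upload_and_run(model: nn.Module, host_batch: Tensor) -> Tensor:
  device = torch.device("cuda:0")
  with cuda_stream_context(device):
    # The copy and forward pass run on the side stream.
    device_batch = host_batch.pin_memory().to(device, non_blocking=True)
    output = model(device_batch)
  # cuda_stream_context synchronized on exit, so output is safe to use here.
  return output.cpu()
```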
Batch Processing with OOM Prevention

```python
def process_inference_batches(
  batches: list[Tensor],
  model: nn.Module,
  device: torch.device,
  max_batch: int = 32,
) -> list[Tensor]:
  """Process batches, split large ones, clear cache periodically."""
  results: list[Tensor] = []
  model.eval()

  with torch.inference_mode():
    for idx, batch in enumerate(batches):
      splits = torch.split(batch, max_batch) if batch.shape[0] > max_batch else [batch]

      for split in splits:
        output = model(split.to(device))
        results.append(output.cpu())

      if idx % 10 == 0:
        torch.cuda.empty_cache()

  return results
```

Hypothesis: Front-Line Correctness
Property-based testing catches edge cases you’d never write by hand.
```python
import pytest
from hypothesis import given, strategies as st, settings
from hypothesis.extra.numpy import arrays
import numpy as np

@settings(deadline=None)  # large tensors exceed the default per-example deadline
@given(
  batch_size=st.integers(1, 128),
  seq_len=st.integers(1, 512),
  hidden=st.sampled_from([256, 512, 768, 1024]),
)
def test_quantize_dequantize_roundtrip(
  batch_size: int,
  seq_len: int,
  hidden: int,
) -> None:
  """Quantization roundtrip error bounded."""
  # Generate random tensor
  tensor = torch.randn(batch_size, seq_len, hidden)

  # Roundtrip
  packed, scales = quantize_nvfp4_block(tensor)
  recovered = dequantize_nvfp4_block(packed, scales)

  # NVFP4 is lossy but bounded: the widest E2M1 gap is 2 units (between 4 and 6),
  # so each element is off by at most one scale unit of its block (default block_size 32).
  block_scales = scales.view(-1, 1).expand(-1, 32).reshape(tensor.shape)
  assert ((tensor - recovered).abs() <= block_scales + 1e-6).all()

@given(
  shape=st.tuples(
    st.integers(1, 64),
    st.integers(32, 256).filter(lambda x: x % 32 == 0),
  ),
)
def test_nvfp4_preserves_zeros(shape: tuple[int, int]) -> None:
  """Zero tensor quantizes to zero."""
  tensor = torch.zeros(shape)
  packed, scales = quantize_nvfp4_block(tensor)
  recovered = dequantize_nvfp4_block(packed, scales)
  assert torch.allclose(recovered, tensor, atol=1e-6)

@given(
  arrays(
    dtype=np.float32,
    shape=st.tuples(st.integers(1, 32), st.just(64)),
    elements=st.floats(-100, 100, allow_nan=False, allow_infinity=False, width=32),
  )
)
def test_quantization_never_nan(arr: np.ndarray) -> None:
  """Quantization never produces NaN."""
  tensor = torch.from_numpy(arr)
  packed, scales = quantize_nvfp4_block(tensor)
  recovered = dequantize_nvfp4_block(packed, scales)
  assert not torch.isnan(recovered).any()
  assert not torch.isinf(recovered).any()

@settings(max_examples=500)
@given(
  scale=st.floats(1e-6, 1e6, allow_nan=False),
  offset=st.floats(-1e3, 1e3, allow_nan=False),
)
def test_symmetric_int8_invertible(scale: float, offset: float) -> None:
  """INT8 quantization is invertible within precision."""
  tensor = torch.randn(32, 64) * scale + offset
  scale_tensor = tensor.abs().max() / 127.0

  quantized = quantize_symmetric_int8(tensor, scale_tensor)
  recovered = dequantize_symmetric_int8(quantized, scale_tensor)

  # INT8 precision: max error is half a quantization step (0.5 * scale_tensor)
  max_error = 0.5 * scale_tensor
  assert (tensor - recovered).abs().max() <= max_error + 1e-6
```

Stateful Testing for Models
```python
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

class InferenceServerStateMachine(RuleBasedStateMachine):
  """Test inference server state transitions."""

  def __init__(self) -> None:
    super().__init__()
    self.loaded = False
    self.request_count = 0

  @rule()
  def load_model(self) -> None:
    self.loaded = True

  @rule()
  def unload_model(self) -> None:
    self.loaded = False
    self.request_count = 0

  @rule(batch_size=st.integers(1, 64))
  def process_request(self, batch_size: int) -> None:
    if self.loaded:
      self.request_count += 1

  @invariant()
  def requests_only_when_loaded(self) -> None:
    if not self.loaded:
      # Would check actual server state here
      pass

TestInferenceServer = InferenceServerStateMachine.TestCase
```

Async APIs
```python
import asyncio
import time

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class InferenceRequest(BaseModel):
  prompt: str
  max_tokens: int = 512
  temperature: float = 0.7

class InferenceResponse(BaseModel):
  text: str
  tokens: int
  latency_ms: float

inference_model: nn.Module | None = None

@app.on_event("startup")
async def load_model() -> None:
  global inference_model
  inference_model = await asyncio.to_thread(
    torch.load,
    Path("/models/checkpoint.pt"),
    map_location="cuda:0",
  )

@app.post("/v1/inference")
async def run_inference(request: InferenceRequest) -> InferenceResponse:
  if inference_model is None:
    raise HTTPException(503, "model not loaded")

  start = time.perf_counter()

  # generate_text is the blocking generation entry point; run it off the event loop.
  text = await asyncio.to_thread(
    generate_text,
    inference_model,
    request.prompt,
    request.max_tokens,
  )

  return InferenceResponse(
    text=text,
    tokens=len(text.split()),
    latency_ms=(time.perf_counter() - start) * 1000,
  )
```

Structured Logging
```python
import structlog

def configure_logging() -> None:
  structlog.configure(
    processors=[
      structlog.stdlib.add_log_level,
      structlog.processors.TimeStamper(fmt="iso"),
      structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
  )

log = structlog.get_logger()

# Usage
log.info("batch_processed", batch_idx=42, loss=0.023, tokens_per_sec=15420)
```
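One structlog pattern worth showing here is bind(), which attaches request-scoped context to every subsequent line; the IDs below are illustrative:

```python
request_log = log.bind(request_id="req-7f3a", model_id="llama-70b-awq")
request_log.info("inference_started", prompt_tokens=128)
request_log.info("inference_finished", output_tokens=512, latency_ms=843.2)
# Both lines carry request_id and model_id without repeating them.
```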
Performance: Measure

```python
from contextlib import contextmanager

@contextmanager
def timer(name: str) -> Iterator[None]:
  start = time.perf_counter()
  try:
    yield
  finally:
    log.info("timed", op=name, ms=(time.perf_counter() - start) * 1000)
```
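Usage is one line. One caveat (my note, not from the original): CUDA launches are asynchronous, so synchronize before the timer closes if you want GPU execution time rather than launch time:

```python
with timer("forward_pass"):
  output = model(batch)
  torch.cuda.synchronize()  # otherwise you measure kernel launch, not execution
```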
```python
# CUDA profiling
def profile_step(model: nn.Module, batch: Tensor) -> None:
  from torch.profiler import profile, ProfilerActivity

  with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
  ) as prof:
    output = model(batch)
    output.mean().backward()

  prof.export_chrome_trace("profile.json")
```

Package Management: uv
```toml
[project]
name = "weyl-inference"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
  "torch>=2.1.0",
  "pydantic>=2.0.0",
  "structlog>=23.0.0",
  "fastapi>=0.100.0",
]

[project.optional-dependencies]
dev = ["pytest>=7.4.0", "hypothesis>=6.82.0", "mypy>=1.5.0", "ruff>=0.0.285"]

[tool.ruff]
line-length = 88
indent-width = 2

[tool.ruff.lint]
select = ["E", "F", "I", "N", "UP", "B", "C4", "PT", "Q"]

[tool.ruff.format]
quote-style = "double"

[tool.mypy]
strict = true
```

The Vibe Test
- Debug it during an incident without a REPL?
- Agent extends it without breaking types?
- Types prevent tomorrow’s OOM?
- Every abbreviation worth the confusion?
- Handles CUDA errors gracefully?
- Makes sense after 10 contributors touch it?
Summary
- Disambiguate — ambiguity compounds
- Type everything — runtime → type errors
- Parse config once — config errors multiply
- Keep it flat — nesting is debt
- Result types — explicit failure paths
- Hypothesis — property-based correctness
- ATen ops — skip Python dispatch
- Measure — data-driven optimization
Write code like a hundred agents will extend it tomorrow and you’ll debug it during an incident next month.
the list
papers that changed how i think
FlashAttention (Dao et al.) — Not for the algorithm. For the lesson: memory bandwidth is the bottleneck, compute is free. Everything since is a footnote. https://arxiv.org/abs/2205.14135
FP8 Formats for Deep Learning (Micikevicius et al., NVIDIA) — The actual spec for how reduced precision works. E4M3 vs E5M2 tradeoffs. Required reading before touching quantization. https://arxiv.org/abs/2209.05433
LLM.int8() (Dettmers et al.) — Emergent features break naive quantization. The outlier problem. Why you can’t just round everything. https://arxiv.org/abs/2208.07339
GPTQ (Frantar et al.) — One-shot weight quantization that actually works. The Hessian trick. https://arxiv.org/abs/2210.17323
AWQ (Lin et al.) — Activation-aware quantization. Protecting salient weights. Cleaner than GPTQ for deployment. https://arxiv.org/abs/2306.00978
QMoE (Frantar et al.) — Trillion parameter models in memory. Extreme quantization for MoE. The future. https://arxiv.org/abs/2310.16795
cuda / gpu architecture
CUDA C++ Programming Guide — Not a suggestion. The actual manual. Read the memory hierarchy section until you dream about L2 cache. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
Parallel Thread Execution ISA — When you need to know what mma.sync actually does. https://docs.nvidia.com/cuda/parallel-thread-execution/
CUTLASS — Not documentation, the source code. cute/ directory specifically. This is how NVIDIA thinks about tensor cores. https://github.com/NVIDIA/cutlass
Scott Gray’s GPU writings — The OpenAI guy who wrote the fast kernels. Scattered but invaluable.
Hopper Architecture Whitepaper — TMA, warp specialization, cluster-level execution. The mental model for H100. https://resources.nvidia.com/en-us-tensor-core
Blackwell Architecture Whitepaper — When it drops. FP4, the new memory hierarchy, whatever they’re hiding.
systems programming
“What Every Programmer Should Know About Memory” (Drepper) — Old. Still true. Cache lines, NUMA, TLB. The physics of computing. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
“Mechanical Sympathy” (Martin Thompson blog) — Java guy but the principles transfer. Know your hardware. https://mechanical-sympathy.blogspot.com/
“Gallery of Processor Cache Effects” (Igoro) — Visual intuition for cache behavior. https://igoro.com/archive/gallery-of-processor-cache-effects/
python that doesn’t suck
“Fluent Python” (Ramalho) — 2nd edition. The actual language, not the tutorial version.
“High Performance Python” (Gorelick & Ozsvald) — Profiling, Cython, the escape hatches.
“Architecture Patterns with Python” (Percival & Gregory) — Domain-driven design, repository pattern. How to structure code that lasts.
videos worth the time
“CUDA MODE” lecture series — Mark Saroufim and friends. Applied GPU programming for ML people. https://www.youtube.com/@CUDAMODE
“From Python to CUDA” (Jeremy Howard, fast.ai) — The bridge most people need.
“How GPU Computing Works” (GTC talks by Stephen Jones) — NVIDIA architect explaining the actual execution model.
Andrej Karpathy’s “Let’s build GPT” — Not for the transformer. For the style. How to think about implementations.
reference implementations to read
llama.cpp (ggerganov) — C++ inference done right. The quantization formats, the threading model, the simplicity. https://github.com/ggerganov/llama.cpp
vLLM — PagedAttention, continuous batching. Production serving architecture. https://github.com/vllm-project/vllm
TensorRT-LLM — NVIDIA’s answer. Over-engineered but shows what’s possible. https://github.com/NVIDIA/TensorRT-LLM
SGLang — RadixAttention, prefix caching. The new ideas. https://github.com/sgl-project/sglang
Triton tutorials — The language, but more importantly, the optimization patterns in the examples. https://triton-lang.org/main/getting-started/tutorials/
math you actually need
“Linear Algebra Done Right” (Axler) — No determinants until the end. The right way to think about vector spaces.
“Numerical Linear Algebra” (Trefethen & Bau) — SVD, condition numbers, stability. Why your gradients explode.
3Blue1Brown’s linear algebra series — Visual intuition. Even if you know it, recalibrate.
the vibe
“Zen and the Art of Motorcycle Maintenance” (Pirsig) — Quality. The thing that can’t be defined but you know when it’s missing.
“The Mythical Man-Month” (Brooks) — Still true. Conceptual integrity. The surgical team.
Gwern’s writings — The rigor. How to actually think about ML experiments. https://gwern.net/
Karpathy’s blog — “The Unreasonable Effectiveness of RNNs”, “A Recipe for Training Neural Networks”. The craft. https://karpathy.github.io/
skip: most ML courses, most python tutorials, anything that starts with “in this video we’ll learn”, any book published by a FAANG employee about their FAANG job, anything with “10x” in the title.
the gap between reading and doing is infinite. but reading the right things tells you what to try when you’re stuck at 3am.