📰 Dev.to · Tech_Nuggets

8 articles

KV cache and PagedAttention: what they do and why they matter

Dev.to · Tech_Nuggets 9h ago

An explanation of the KV cache memory problem in production LLM serving and how PagedAttention (the technique behind vLLM) solves it with OS-inspired virtual me

The Model Context Protocol (MCP): what it is and how to build a server

Dev.to · Tech_Nuggets 5d ago

A practical walkthrough of the Model Context Protocol — what it is, how JSON-RPC 2.0 transports work, and how to build an MCP server in Python with FastMCP.

Structured output from LLMs: JSON mode, function calling, and grammar-constrained decoding

Dev.to · Tech_Nuggets 6d ago

A practical comparison of three approaches to getting structured data from LLMs: prompt-only JSON, API-level JSON mode and function calling, and grammar-constra

Mixture of Experts (MoE): what it actually does under the hood, and when it pays off

Dev.to · Tech_Nuggets 📐 ML Fundamentals ⚡ AI Lesson 1w ago

MoE explained for practitioners: how the router works, load-balancing loss, why Mixtral has 45B params but activates 13B, and when not to use it. Practical, no

Sampling strategies compared: temperature, top-p, top-k, min-p, and what actually works in production

Dev.to · Tech_Nuggets 1w ago

A production-oriented comparison of LLM sampling parameters -- how temperature, top-p, top-k, and min-p reshape the output distribution, what combos actually wo

Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4

Dev.to · Tech_Nuggets 1w ago

A practical comparison of the four major LLM weight quantization formats — which one to use for CPU, GPU serving, and fine-tuning, with current version numbers

LoRA and QLoRA fine-tuning: what they actually do under the hood

Dev.to · Tech_Nuggets 1w ago

A practical walkthrough of LoRA and QLoRA -- how low-rank adaptation works, what NF4 quantization brings, and when to use each.

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

Dev.to · Tech_Nuggets 2w ago

FP8 and INT8 KV caches cut attention state ~50%, but they shift the target model's logit distribution — and that can quietly halve the gains from speculative de