How LLMs Shrink from 28GB to 3.5GB | Quantization Explained in Tamil | QLoRA

Adi Explains · Beginner · 🧠 Large Language Models · 23h ago
In this video, I explain Quantization in Tamil in a simple, intuitive, and practical way for students, software engineers, data scientists, and working professionals who are learning Generative AI, Large Language Models (LLMs), and modern AI engineering in Tamil. If you have ever wondered how massive AI models like Llama 3, Mistral, Gemma, Qwen, and DeepSeek can be compressed from tens of gigabytes to just a few gigabytes while still maintaining excellent performance, this video will give you a deep understanding of one of the most important model optimization techniques used in the AI industry today.

In my previous video, I explained LoRA Fine-Tuning (Low-Rank Adaptation) and showed how we can efficiently fine-tune large language models by training only a small number of parameters. In this video, we take the next step with Quantization, a technique that reduces the size of a model by storing weights in lower-precision formats such as FP16, INT8, and 4-bit representations. Quantization is the key technology that allows developers to run powerful LLMs on laptops, low-memory GPUs, cloud instances, and even edge devices.

This Tamil tutorial starts from the basics and explains why AI models consume so much memory. We discuss how billions of model parameters are stored as floating-point numbers, and how reducing the number of bits per parameter can dramatically decrease storage and GPU memory usage. You will learn the differences between FP32, FP16, INT8, and 4-bit quantization, along with practical examples of how a 7B or 8B model can shrink from more than 14 GB to just 3–4 GB. I also explain the trade-off between model size and accuracy, helping you understand why quantization is so effective in real-world applications.

The video covers important quantization concepts such as Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), scale and zero point, and how quantized values are represented. We also discuss popular quantization techniques…
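To make the scale and zero-point idea concrete, here is a minimal NumPy sketch of asymmetric INT8 quantization of a single weight tensor. The array and function names are illustrative (not from any specific library such as bitsandbytes); real LLM quantizers work per-channel or per-block, but the arithmetic is the same: one FP32 byte count, one INT8 byte count, a scale, and a zero point.

```python
import numpy as np

# A stand-in for one FP32 weight tensor of an LLM layer (values are synthetic).
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

def quantize_int8(w):
    """Asymmetric INT8 quantization: map [w_min, w_max] onto [-128, 127]."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0            # real-value step per integer level
    zero_point = round(-128 - w_min / scale)   # integer code that represents 0.0
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate FP32 values from the INT8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

q, scale, zp = quantize_int8(weights_fp32)
w_hat = dequantize_int8(q, scale, zp)

print("FP32 bytes:", weights_fp32.nbytes)  # 4096 params * 4 bytes = 16384
print("INT8 bytes:", q.nbytes)             # 4096 params * 1 byte  = 4096 (4x smaller)
print("max abs error:", np.abs(weights_fp32 - w_hat).max())  # bounded by ~scale/2
```

The same per-parameter arithmetic explains the headline numbers: a 7B-parameter model at 4 bytes/param (FP32) is roughly 28 GB, at 2 bytes/param (FP16) roughly 14 GB, and at 4 bits (0.5 byte/param) roughly 3.5 GB, before accounting for the small overhead of storing scales and zero points.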
