- **Autoregressive GPT**: Token-by-token generation
- **RNN vs. GRU**: Vanishing gradients and gating
- **LSTM**: 4-gate memory highway
- **BPE Tokenizer**: Iterative pair merging → vocabulary
- **Word Embeddings**: Contrastive learning → semantic clusters
- **RAG Pipeline**: Retrieve → augment → generate
- **BERT**: Bidirectional attention + [MASK] prediction
- **Convolutional Net**: Sliding kernels → feature maps
- **ResNet**: F(x) + x = gradient highway
- **Vision Transformer**: Image patches as tokens
- **Diffusion**: Noise → data via iterative denoising
- **VAE**: Encode → sample z → decode
- **GAN**: Generator vs. discriminator minimax
- **Optimizers**: SGD vs. Momentum vs. Adam convergence
- **LoRA Fine-tuning**: Low-rank weight injection
- **QLoRA**: 4-bit base + full-precision adapters
- **DPO Alignment**: Preferred vs. rejected → policy update
- **PPO (RLHF)**: Clipped policy gradient for alignment
- **GRPO**: Group-relative rewards, no critic
- **REINFORCE**: log P(a) × reward = gradient
- **Mixture of Experts**: Sparse routing to specialist MLPs
- **Batch Normalization**: Normalize activations → stable training
- **Dropout**: Kill neurons → prevent overfitting
- **Attention Mechanism**: Q·Kᵀ → softmax → weighted V
- **Flash Attention**: Tiled, O(N)-memory computation
- **RoPE**: Position via rotation matrices
- **KV-Cache**: Memoize keys/values; stop recomputing
- **PagedAttention**: OS-style paged KV-cache memory
- **Quantization**: Float32 → Int8 = 4x compression
- **Beam Search**: Tree search with top-k pruning
- **Checkpointing**: O(n) → O(√n) memory via recompute
- **Model Parallelism**: Tensor + pipeline across devices
- **State Space Models**: Linear-time selective state transitions
- **Vector Search**: Exact vs. LSH approximate search
- **BM25**: TF → TF-IDF → BM25 evolution
- **Speculative Decoding**: Draft fast, verify once
- **Monte Carlo Tree Search**: UCB1 tree search + random rollouts
- **ReAct Agent**: Thought → Action → Observation
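The "iterative pair merging" in the BPE Tokenizer entry can be sketched in a few lines. This is a minimal toy, not a production tokenizer: the corpus, its frequencies, and the three-merge budget are all made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus; return the most frequent."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the winning pair with its concatenation."""
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq for word, freq in words.items()}

# Hypothetical toy corpus: words pre-split into characters, with counts.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):  # each iteration grows the vocabulary by one merged symbol
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    merges.append(pair)
print(merges)  # learned merge rules, most frequent first
```

Each learned merge becomes a new vocabulary entry, which is why BPE vocabularies grow one symbol per iteration.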
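The REINFORCE entry's "log P(a) × reward = gradient" fits in a few lines on a two-armed bandit. A minimal sketch under made-up assumptions (two arms, fixed rewards, learning rate 0.5); the closed-form softmax gradient `one-hot(a) - probs` stands in for autodiff.

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]      # logits for two actions
rewards = [0.0, 1.0]    # arm 1 always pays, arm 0 never does
lr = 0.5

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

for _ in range(200):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample an action from the policy
    r = rewards[a]
    # gradient of log pi(a) w.r.t. the logits is one-hot(a) - probs;
    # scale it by the reward, per the REINFORCE update
    for i in range(2):
        theta[i] += lr * ((1.0 if i == a else 0.0) - probs[i]) * r

print(softmax(theta))  # probability mass shifts to the rewarding arm
```

Because the reward for arm 0 is zero, only arm-1 samples produce updates, so the policy steadily concentrates on the paying arm.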
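The Dropout entry's "kill neurons → prevent overfitting" is usually implemented as *inverted* dropout, which rescales the survivors so the expected activation is unchanged. A small sketch; the drop probability and array size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout(x, p=0.5, train=True):
    """Inverted dropout: zero each activation with prob p, rescale the rest."""
    if not train:
        return x  # at inference time the layer is a no-op
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
y = dropout(x, p=0.5)
print(y.mean())  # stays near 1.0 in expectation thanks to the 1/(1-p) rescale
```

The 1/(1-p) rescale is what lets the same weights be used at train and test time without adjustment.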
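The Attention Mechanism entry's "Q·Kᵀ → softmax → weighted V" pipeline is short enough to write out directly. A NumPy sketch of scaled dot-product attention; the shapes are arbitrary and there is no masking or multi-head logic.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 4))   # 3 values
out = attention(Q, K, V)
print(out.shape)  # one d-dimensional output per query: (2, 4)
```

Each output row is a convex combination of the value rows, with weights given by how well the query matches each key.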
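The KV-Cache entry ("memoize keys/values; stop recomputing") comes down to appending one projected row per generated token instead of re-projecting the whole prefix. A toy single-head sketch; the random projections and 5-token loop are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention of one query over all cached keys/values."""
    scores = q @ K.T
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):
    x = rng.normal(size=d)                     # embedding of the newest token
    # project only the new token; past K/V rows are memoized, never recomputed
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attend(x @ Wq, K_cache, V_cache)

print(K_cache.shape)  # grows to (5, 4): one cached row per generated token
```

This is what turns per-step decoding cost from quadratic in the prefix length into linear, at the price of the cache's memory footprint.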
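The Quantization entry's "Float32 → Int8 = 4x compression" can be demonstrated with symmetric per-tensor quantization. A minimal sketch; real schemes add per-channel scales, zero-points, and calibration.

```python
import numpy as np

def quantize(x):
    """Symmetric int8 quantization: map the float range onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(2).normal(size=1000).astype(np.float32)
q, scale = quantize(x)
err = np.abs(dequantize(q, scale) - x).max()
print(q.nbytes, x.nbytes)  # 1000 vs 4000 bytes: the 4x compression
```

The worst-case round-trip error is half a quantization step, i.e. about `scale / 2`, which is the accuracy traded for the smaller footprint.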
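The Beam Search entry's "tree search with top-k pruning" keeps only the k best partial sequences by cumulative log-probability at each step. A sketch over a hypothetical toy model whose next-token distribution is fixed; real decoders would query an actual language model here.

```python
import math

def step_probs(seq):
    """Hypothetical stand-in for a model's next-token distribution."""
    return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

def beam_search(k=2, max_len=3):
    beams = [([], 0.0)]                            # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, lp))       # finished beams carry over
                continue
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], lp + math.log(p)))
        # top-k pruning: keep only the k highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

best, best_lp = beam_search()[0]
print(best)  # the highest-probability length-3 sequence under this toy model
```

Scoring by *cumulative* log-probability is what distinguishes beam search from greedy decoding: a token that looks second-best now can still win once later steps are accounted for.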