{"cells":[{"cell_type":"markdown","metadata":{"id":"A2kvRfGnp_kt"},"source":["# PyTorch CUDA 101: GPU Optimization Mastery\n","\n","**From First Principles to Tensor Cores**\n","\n","This notebook demonstrates essential CUDA patterns in PyTorch, based on performance principles revealed by GPU microbenchmarking.\n","\n","## Key Principles:\n","1. Minimize GPU-CPU data transfers\n","2. Choose appropriate data types (float32 vs float64)\n","3. Batch operations to increase arithmetic intensity\n","4. Use in-place operations when possible\n","5. Leverage tensor cores for matrix operations\n","6. Understand memory access patterns\n","7. Profile to identify bottlenecks\n","\n","---"]},{"cell_type":"code","source":["!nvidia-smi"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DD3gMGbPqN4P","executionInfo":{"status":"ok","timestamp":1755176308057,"user_tz":-420,"elapsed":119,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"875df861-8118-417f-f597-371b12037fa3"},"execution_count":1,"outputs":[{"output_type":"stream","name":"stdout","text":["Thu Aug 14 12:58:27 2025 \n","+-----------------------------------------------------------------------------------------+\n","| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |\n","|-----------------------------------------+------------------------+----------------------+\n","| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n","| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n","| | | MIG M. |\n","|=========================================+========================+======================|\n","| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |\n","| N/A 53C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |\n","| | | N/A |\n","+-----------------------------------------+------------------------+----------------------+\n"," \n","+-----------------------------------------------------------------------------------------+\n","| Processes: |\n","| GPU GI CI PID Type Process name GPU Memory |\n","| ID ID Usage |\n","|=========================================================================================|\n","| No running processes found |\n","+-----------------------------------------------------------------------------------------+\n"]}]},{"cell_type":"markdown","metadata":{"id":"vltvInI_p_ku"},"source":["## Setup and Imports"]},{"cell_type":"code","execution_count":2,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"9d38y-0tp_ku","executionInfo":{"status":"ok","timestamp":1755176315357,"user_tz":-420,"elapsed":6345,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"f4000e08-da34-423c-a035-994bd4911170"},"outputs":[{"output_type":"stream","name":"stdout","text":["✅ Using GPU: Tesla T4\n","✅ CUDA Version: 12.4\n","✅ PyTorch Version: 2.6.0+cu124\n"]}],"source":["import torch\n","import time\n","import math\n","from typing import Tuple, Optional\n","\n","def benchmark_operation(func, *args, num_iters=1000, warmup=100):\n"," \"\"\"Benchmark a PyTorch operation with proper CUDA synchronization.\"\"\"\n"," # Warmup to eliminate kernel compilation overhead\n"," for _ in range(warmup):\n"," func(*args)\n"," torch.cuda.synchronize()\n","\n"," # Actual timing\n"," start = time.perf_counter()\n"," for _ in range(num_iters):\n"," result = func(*args)\n"," torch.cuda.synchronize()\n","\n"," elapsed = time.perf_counter() - start\n"," return (elapsed / num_iters) * 1000 # Convert to milliseconds\n","\n","# Check CUDA availability\n","if not 
torch.cuda.is_available():\n"," raise RuntimeError(\"CUDA not available - GPU required for tutorial\")\n","\n","device = torch.device('cuda')\n","print(f'✅ Using GPU: {torch.cuda.get_device_name()}')\n","print(f'✅ CUDA Version: {torch.version.cuda}')\n","print(f'✅ PyTorch Version: {torch.__version__}')"]},{"cell_type":"markdown","metadata":{"id":"H-Ip68DFp_kv"},"source":["# Lesson 0: GPU Memory Baseline - Understanding CUDA Overhead\n","\n","**Key Reality Check:** CUDA kernels consume 1-2 GB regardless of your model size!\n","\n","You might think you could compute memory requirements exactly, but CUDA kernels require substantial overhead that makes precise calculations challenging."]},{"cell_type":"code","execution_count":3,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"MzC_ZSalp_kv","executionInfo":{"status":"ok","timestamp":1755176322060,"user_tz":-420,"elapsed":324,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"b367dacd-246e-4005-fee0-a34b6f574546"},"outputs":[{"output_type":"stream","name":"stdout","text":["📊 Memory before CUDA initialization: 0.0 MB\n","📊 Memory after CUDA initialization: 0.0 MB\n","🎯 CUDA kernel overhead: 0.0 MB\n","\n","💡 This overhead is constant and unavoidable for any GPU computation!\n"," Additional memory used for buffers, intermediate results, and fragmentation\n"," makes precise memory calculations challenging - focus on relative improvements.\n"]}],"source":["# Demonstrate CUDA kernel memory overhead\n","torch.cuda.empty_cache()\n","torch.cuda.reset_peak_memory_stats()\n","baseline_memory = torch.cuda.memory_allocated() / 1024**2\n","\n","# Create minimal tensor to initialize CUDA context\n","minimal_tensor = torch.ones((1, 1), device='cuda')\n","cuda_overhead = torch.cuda.memory_allocated() / 1024**2\n","\n","print(f'📊 Memory before CUDA initialization: {baseline_memory:.1f} MB')\n","print(f'📊 Memory after CUDA initialization: {cuda_overhead:.1f} MB')\n","print(f'🎯 CUDA kernel overhead: {cuda_overhead - baseline_memory:.1f} MB')\n","print(f'\\n💡 This overhead is constant and unavoidable for any GPU computation!')\n","print(f' Additional memory used for buffers, intermediate results, and fragmentation')\n","print(f' makes precise memory calculations challenging - focus on relative improvements.')"]},{"cell_type":"markdown","metadata":{"id":"jCCwZc1Yp_kv"},"source":["# Lesson 1: Device Management & Tensor Creation\n","\n","**Principle:** Memory allocation location is immutable post-creation. 
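\n","\n","A minimal sketch of what that means in practice (illustrative only): moving a tensor returns a new copy rather than relocating the original.\n","\n","```python\n","x = torch.randn(1024, 1024)                 # lives in CPU RAM\n","y = x.to('cuda')                            # returns a new GPU copy; x is unchanged\n","z = torch.randn(1024, 1024, device='cuda')  # allocated directly on the GPU\n","print(x.device, y.device, z.device)         # cpu cuda:0 cuda:0\n","```\n","\n","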
CPU→GPU transfer involves PCIe bandwidth (~16GB/s) vs GPU memory bandwidth (~1500GB/s)."]},{"cell_type":"code","execution_count":4,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"vJ0s8NWtp_kw","executionInfo":{"status":"ok","timestamp":1755176335403,"user_tz":-420,"elapsed":1863,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"5645c438-8e8d-4f4f-93bb-d10da266bde9"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Creating on CPU then moving to GPU\n","✅ GOOD: Creating directly on GPU\n","Bad approach: 7.84 ms\n","Good approach: 0.04 ms\n","Speedup: 206.32x\n","\n","🎯 Key Takeaway: Always create tensors directly on the target device\n"," Use device=\"cuda\" parameter in tensor creation functions\n"]}],"source":["print('❌ BAD: Creating on CPU then moving to GPU')\n","def bad_tensor_creation(size):\n"," x = torch.randn(size, size) # Created on CPU\n"," x = x.cuda() # Expensive CPU->GPU transfer\n"," return x\n","\n","print('✅ GOOD: Creating directly on GPU')\n","def good_tensor_creation(size):\n"," x = torch.randn(size, size, device='cuda') # Created directly on GPU\n"," return x\n","\n","size = 1024\n","bad_time = benchmark_operation(bad_tensor_creation, size, num_iters=100)\n","good_time = benchmark_operation(good_tensor_creation, size, num_iters=100)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","\n","print('\\n🎯 Key Takeaway: Always create tensors directly on the target device')\n","print(' Use device=\"cuda\" parameter in tensor creation functions')"]},{"cell_type":"markdown","metadata":{"id":"CWEYcjqap_kw"},"source":["# Lesson 2: Data Type Optimization\n","\n","**Surprising Implication:** Float16 isn't just 2x faster—it enables Tensor Cores (312 TFLOPS vs 19.5 TFLOPS). 
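(Those peak numbers are A100 figures; the Tesla T4 used in this notebook tops out around 65 TFLOPS of FP16 Tensor Core throughput versus roughly 8 TFLOPS of FP32, so the exact gap depends on the GPU.) 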
This demonstrates a 16x performance cliff, not gradual degradation."]},{"cell_type":"code","execution_count":5,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"C2TBbxWbp_kw","executionInfo":{"status":"ok","timestamp":1755176398588,"user_tz":-420,"elapsed":12064,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"b708001f-89ce-45dc-80b0-d2991d744a90"},"outputs":[{"output_type":"stream","name":"stdout","text":["Float64: 71.54 ms\n","Float32: 4.58 ms (15.63x faster)\n","Float16: 1.15 ms (62.09x faster)\n","\n","🎯 Key Takeaway: Use float32 unless you need float64 precision\n"," Float16 is even faster but may have numerical stability issues\n","\n","Memory usage:\n","Float64: 7.6 MB\n","Float32: 3.8 MB\n"]}],"source":["size = 2048\n","\n","def matmul_float64():\n"," A = torch.randn(size, size, dtype=torch.float64, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float64, device='cuda')\n"," return torch.mm(A, B)\n","\n","def matmul_float32():\n"," A = torch.randn(size, size, dtype=torch.float32, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float32, device='cuda')\n"," return torch.mm(A, B)\n","\n","def matmul_float16():\n"," A = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," return torch.mm(A, B)\n","\n","time_f64 = benchmark_operation(matmul_float64, num_iters=50)\n","time_f32 = benchmark_operation(matmul_float32, num_iters=50)\n","time_f16 = benchmark_operation(matmul_float16, num_iters=50)\n","\n","print(f'Float64: {time_f64:.2f} ms')\n","print(f'Float32: {time_f32:.2f} ms ({time_f64/time_f32:.2f}x faster)')\n","print(f'Float16: {time_f16:.2f} ms ({time_f64/time_f16:.2f}x faster)')\n","\n","print('\\n🎯 Key Takeaway: Use float32 unless you need float64 precision')\n","print(' Float16 is even faster but may have numerical stability issues')\n","\n","# Memory usage comparison\n","f64_tensor = torch.randn(1000, 1000, dtype=torch.float64, device='cuda')\n","f32_tensor = torch.randn(1000, 1000, dtype=torch.float32, device='cuda')\n","\n","print(f'\\nMemory usage:')\n","print(f'Float64: {f64_tensor.element_size() * f64_tensor.numel() / 1024**2:.1f} MB')\n","print(f'Float32: {f32_tensor.element_size() * f32_tensor.numel() / 1024**2:.1f} MB')"]},{"cell_type":"markdown","metadata":{"id":"nMSyOSqnp_kw"},"source":["# Lesson 3: CPU-GPU Transfer Optimization\n","\n","**Hidden Cost:** Each transfer incurs ~10μs latency + bandwidth cost. 
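Transfers also hide in innocuous-looking calls: `.item()`, `bool()`, and `print()` on CUDA tensors each trigger a device-to-host copy plus an implicit synchronization (a short sketch):\n","\n","```python\n","x = torch.randn(1000, 1000, device='cuda')\n","total = x.sum()             # still on the GPU, queued asynchronously\n","val = total.item()          # device-to-host copy + implicit synchronize\n","ok = bool((x > 0).any())    # Python bool() of a CUDA tensor also syncs\n","print(x[0, 0])              # printing copies the element to the CPU first\n","```\n","\n","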
For small operations, latency dominates—you're paying milliseconds to save microseconds."]},{"cell_type":"code","execution_count":6,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"dJ_6DtHzp_kw","executionInfo":{"status":"ok","timestamp":1755176416931,"user_tz":-420,"elapsed":350,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"0aaf1e5f-c83c-40a3-edb8-a5cf271b71dd"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Frequent CPU-GPU transfers\n","✅ GOOD: Keep operations on GPU\n","Bad approach: 1.64 ms\n","Good approach: 0.02 ms\n","Speedup: 75.2x\n","\n","🎯 Key Takeaway: Keep data on GPU as long as possible\n"," Use PyTorch operations instead of numpy when possible\n"]}],"source":["x = torch.randn(1000, 1000, device='cuda')\n","\n","print('❌ BAD: Frequent CPU-GPU transfers')\n","def bad_cpu_gpu_pattern():\n"," # Convert to CPU, do numpy operation, back to GPU\n"," x_cpu = x.cpu().numpy() # GPU -> CPU\n"," result_cpu = x_cpu.sum() # CPU operation\n"," result_gpu = torch.tensor(result_cpu, device='cuda') # CPU -> GPU\n"," return result_gpu\n","\n","print('✅ GOOD: Keep operations on GPU')\n","def good_gpu_pattern():\n"," result = x.sum() # All on GPU\n"," return result\n","\n","bad_time = benchmark_operation(bad_cpu_gpu_pattern, num_iters=100)\n","good_time = benchmark_operation(good_gpu_pattern, num_iters=100)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.1f}x')\n","\n","print('\\n🎯 Key Takeaway: Keep data on GPU as long as possible')\n","print(' Use PyTorch operations instead of numpy when possible')"]},{"cell_type":"markdown","metadata":{"id":"WsLFJphvp_kw"},"source":["# Lesson 4: Batching for Arithmetic Intensity\n","\n","**First Principles:** Single operations have low arithmetic intensity (FLOPS/memory_access). 
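For example, adding two n-element float32 vectors performs n FLOPs while touching roughly 12n bytes (two reads and one write), an intensity of about 0.08 FLOP/byte, whereas an n×n matmul performs 2n³ FLOPs over roughly 12n² bytes, an intensity of about n/6 FLOP/byte. 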
Batching increases intensity from O(n²) to O(n³) for matrix operations."]},{"cell_type":"code","execution_count":7,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"UkzNkIjip_kx","executionInfo":{"status":"ok","timestamp":1755176426734,"user_tz":-420,"elapsed":253,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"6c305f14-5f06-4df4-8968-c3554fa9abce"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Processing one sample at a time\n","✅ GOOD: Batch processing\n","Bad approach: 1.71 ms\n","Good approach: 0.34 ms\n","Speedup: 5.03x\n","\n","🎯 Key Takeaway: Batch operations whenever possible\n"," Use bmm(), batch matrix operations, and higher-dimensional tensors\n"]}],"source":["print('❌ BAD: Processing one sample at a time')\n","def bad_sequential_processing():\n"," samples = [torch.randn(256, 256, device='cuda') for _ in range(32)]\n"," results = []\n"," for sample in samples:\n"," result = torch.mm(sample, sample.T) # Individual matrix multiply\n"," results.append(result)\n"," return torch.stack(results)\n","\n","print('✅ GOOD: Batch processing')\n","def good_batch_processing():\n"," # Create batched tensor directly\n"," batch = torch.randn(32, 256, 256, device='cuda')\n"," # Batched matrix multiply - much more efficient\n"," result = torch.bmm(batch, batch.transpose(-2, -1))\n"," return result\n","\n","bad_time = benchmark_operation(bad_sequential_processing, num_iters=10)\n","good_time = benchmark_operation(good_batch_processing, num_iters=10)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","\n","print('\\n🎯 Key Takeaway: Batch operations whenever possible')\n","print(' Use bmm(), batch matrix operations, and higher-dimensional tensors')"]},{"cell_type":"markdown","metadata":{"id":"oI9DFq-0p_kx"},"source":["# Lesson 5: In-place Operations\n","\n","**Memory Allocator Tax:** Each allocation involves GPU memory manager overhead. 
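PyTorch's caching allocator avoids a cudaMalloc call on most requests, but every fresh tensor still pays allocator bookkeeping and pushes up peak memory and fragmentation. Besides the trailing-underscore methods benchmarked below, many ops also accept an `out=` buffer that can be reused across iterations (a sketch):\n","\n","```python\n","a = torch.randn(2048, 2048, device='cuda')\n","b = torch.randn(2048, 2048, device='cuda')\n","buf = torch.empty_like(a)   # allocate once, outside the hot loop\n","torch.add(a, b, out=buf)    # writes the result into buf, no new allocation\n","```\n","\n","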
In-place operations eliminate allocation/deallocation cycles entirely."]},{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3Jl88InLp_kx","executionInfo":{"status":"ok","timestamp":1755176436411,"user_tz":-420,"elapsed":256,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"f3ae7d30-bb88-467f-b401-11711dd1f3a0"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Creating new tensors\n","✅ GOOD: In-place operations\n","Bad approach: 0.59 ms, 103.8 MB peak\n","Good approach: 0.52 ms, 71.8 MB peak\n","Speedup: 1.13x\n","Memory reduction: 1.45x\n","\n","🎯 Key Takeaway: Use in-place operations (add_, mul_, etc.)\n"," Reduces memory allocation and garbage collection overhead\n"]}],"source":["size = (2048, 2048)\n","\n","print('❌ BAD: Creating new tensors')\n","def bad_memory_allocation():\n"," x = torch.randn(*size, device='cuda')\n"," y = torch.randn(*size, device='cuda')\n"," z = x + y # Creates new tensor\n"," w = z * 2 # Creates another new tensor\n"," return w\n","\n","print('✅ GOOD: In-place operations')\n","def good_inplace_operations():\n"," x = torch.randn(*size, device='cuda')\n"," y = torch.randn(*size, device='cuda')\n"," x.add_(y) # In-place addition\n"," x.mul_(2) # In-place multiplication\n"," return x\n","\n","# Monitor memory usage\n","torch.cuda.empty_cache()\n","torch.cuda.reset_peak_memory_stats()\n","\n","bad_time = benchmark_operation(bad_memory_allocation, num_iters=50)\n","bad_memory = torch.cuda.max_memory_allocated() / 1024**2\n","\n","torch.cuda.empty_cache()\n","torch.cuda.reset_peak_memory_stats()\n","\n","good_time = benchmark_operation(good_inplace_operations, num_iters=50)\n","good_memory = torch.cuda.max_memory_allocated() / 1024**2\n","\n","print(f'Bad approach: {bad_time:.2f} ms, {bad_memory:.1f} MB peak')\n","print(f'Good approach: {good_time:.2f} ms, {good_memory:.1f} MB peak')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","print(f'Memory reduction: {bad_memory/good_memory:.2f}x')\n","\n","print('\\n🎯 Key Takeaway: Use in-place operations (add_, mul_, etc.)')\n","print(' Reduces memory allocation and garbage collection overhead')"]},{"cell_type":"markdown","metadata":{"id":"7ISx7KIGp_kx"},"source":["# Lesson 6: Tensor Core Optimization\n","\n","**Hardware Constraint:** Tensor Cores operate on 4×4 matrices of float16. 
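(More precisely, each Tensor Core executes a small fixed-size matrix multiply-accumulate per clock, so cuBLAS tiles the problem most efficiently when every dimension is a multiple of 8 or 16.) A common workaround is to round awkward layer sizes up to the next multiple of 16; `pad_to_multiple` below is only an illustrative helper, not a PyTorch API.\n","\n","```python\n","def pad_to_multiple(dim: int, multiple: int = 16) -> int:\n","    # round dim up to the next multiple, e.g. 1000 -> 1008\n","    return ((dim + multiple - 1) // multiple) * multiple\n","```\n","\n","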
Misaligned dimensions force fallback to CUDA cores—a 16x performance penalty."]},{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3oQ53k8pp_kx","executionInfo":{"status":"ok","timestamp":1755176445833,"user_tz":-420,"elapsed":246,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"2dd83185-735b-4129-9e85-771846cb998e"},"outputs":[{"output_type":"stream","name":"stdout","text":["Matrix multiply performance depends on tensor core compatibility:\n","Size Time (ms) TFLOPS Notes\n","--------------------------------------------------\n","512 0.08 3.36 ✅ TC-friendly\n","768 0.11 7.90 ✅ TC-friendly\n","1024 0.22 9.67 ✅ TC-friendly\n","1536 0.55 13.14 ✅ TC-friendly\n","2048 0.92 18.70 ✅ TC-friendly\n","\n","🎯 Key Takeaway: Use float16 and dimensions divisible by 16\n"," This maximizes tensor core utilization on modern GPUs\n"]}],"source":["print('Matrix multiply performance depends on tensor core compatibility:')\n","\n","# Test different matrix sizes - tensor cores prefer certain dimensions\n","sizes = [512, 768, 1024, 1536, 2048]\n","\n","print(f'{\"Size\":<8} {\"Time (ms)\":<10} {\"TFLOPS\":<10} {\"Notes\"}')\n","print('-' * 50)\n","\n","for size in sizes:\n"," def matmul_test():\n"," A = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," return torch.mm(A, B)\n","\n"," time_ms = benchmark_operation(matmul_test, num_iters=20)\n"," flops = 2 * size**3 # Matrix multiply FLOPS\n"," tflops = (flops / (time_ms * 1e-3)) / 1e12\n","\n"," # Tensor cores work best with dimensions divisible by 8/16\n"," tc_friendly = '✅ TC-friendly' if size % 16 == 0 else '⚠️ Sub-optimal'\n","\n"," print(f'{size:<8} {time_ms:<10.2f} {tflops:<10.2f} {tc_friendly}')\n","\n","print('\\n🎯 Key Takeaway: Use float16 and dimensions divisible by 16')\n","print(' This maximizes tensor core utilization on modern GPUs')"]},{"cell_type":"markdown","metadata":{"id":"Iz9G6I4fp_kx"},"source":["# Lesson 7: Memory Access Patterns\n","\n","**Memory Layout Principle:** GPU threads access memory in coalesced patterns. 
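Whether an access pattern can coalesce depends on the tensor's strides (a quick sketch):\n","\n","```python\n","x = torch.randn(4096, 4096, device='cuda')\n","print(x.stride())    # (4096, 1): elements of a row sit next to each other\n","print(x.T.stride())  # (1, 4096): walking a row of the view jumps 4096 elements\n","```\n","\n","Keep in mind that `.contiguous()` materializes a copy, so it only pays off when the result is reused by several later ops; for a one-off reduction like the benchmark below, the copy can cost more than it saves. 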
Non-contiguous access forces multiple memory transactions instead of single wide loads."]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DRaeNq3Gp_kx","executionInfo":{"status":"ok","timestamp":1755176454772,"user_tz":-420,"elapsed":406,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"32d4ea35-0ffe-450a-d3f2-b2e297cd77e0"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Non-contiguous memory access\n","✅ GOOD: Contiguous memory access\n","Bad approach: 0.27 ms\n","Good approach: 1.32 ms\n","Speedup: 0.21x\n","\n","Memory layout check:\n","Original tensor is_contiguous: True\n","Transposed tensor is_contiguous: False\n","After .contiguous(): True\n","\n","🎯 Key Takeaway: Use .contiguous() after shape operations\n"," Check .is_contiguous() and call .contiguous() when needed\n"]}],"source":["size = (4096, 4096)\n","x = torch.randn(*size, device='cuda')\n","\n","print('❌ BAD: Non-contiguous memory access')\n","def bad_memory_pattern():\n"," # Transpose creates a view with different strides\n"," x_t = x.T\n"," return torch.sum(x_t, dim=0) # Non-contiguous access\n","\n","print('✅ GOOD: Contiguous memory access')\n","def good_memory_pattern():\n"," # Make contiguous first\n"," x_t = x.T.contiguous()\n"," return torch.sum(x_t, dim=0) # Contiguous access\n","\n","bad_time = benchmark_operation(bad_memory_pattern, num_iters=100)\n","good_time = benchmark_operation(good_memory_pattern, num_iters=100)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","\n","print(f'\\nMemory layout check:')\n","print(f'Original tensor is_contiguous: {x.is_contiguous()}')\n","print(f'Transposed tensor is_contiguous: {x.T.is_contiguous()}')\n","print(f'After .contiguous(): {x.T.contiguous().is_contiguous()}')\n","\n","print('\\n🎯 Key Takeaway: Use .contiguous() after shape operations')\n","print(' Check .is_contiguous() and call .contiguous() when needed')"]},{"cell_type":"markdown","metadata":{"id":"V1E-TpW1p_kx"},"source":["# Lesson 8: Performance Profiling\n","\n","**Measurement Principle:** You cannot optimize what you cannot measure. 
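One gotcha to watch for in the profiling cell below: `loss.backward()` raises a RuntimeError unless at least one leaf tensor in the graph has `requires_grad=True`; creating the weights like this is the minimal fix (sketch):\n","\n","```python\n","W1 = torch.randn(512, 256, device='cuda', requires_grad=True)\n","W2 = torch.randn(256, 10, device='cuda', requires_grad=True)\n","```\n","\n","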
The profiler reveals the actual bottleneck—often surprising compared to intuition."]},{"cell_type":"code","execution_count":11,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":382},"id":"ZQYGvFQdp_kx","executionInfo":{"status":"error","timestamp":1755176465409,"user_tz":-420,"elapsed":247,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"0134a7c8-94ef-4cfc-f62a-d0cba546cd64"},"outputs":[{"output_type":"stream","name":"stdout","text":["Running profiler example...\n"]},{"output_type":"error","ename":"RuntimeError","evalue":"element 0 of tensors does not require grad and does not have a grad_fn","traceback":["\u001b[0;31m---------------------------------------------------------------------------\u001b[0m","\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)","\u001b[0;32m/tmp/ipython-input-1548132624.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 25\u001b[0m ) as prof:\n\u001b[1;32m 26\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mexample_neural_network\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 28\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 29\u001b[0m \u001b[0;31m# Print profiling results\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/tmp/ipython-input-1548132624.py\u001b[0m in \u001b[0;36mexample_neural_network\u001b[0;34m()\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0;31m# Backward pass\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 14\u001b[0;31m \u001b[0mloss\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mloss\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/torch/_tensor.py\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(self, gradient, retain_graph, create_graph, inputs)\u001b[0m\n\u001b[1;32m 624\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 625\u001b[0m )\n\u001b[0;32m--> 626\u001b[0;31m torch.autograd.backward(\n\u001b[0m\u001b[1;32m 627\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgradient\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mretain_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcreate_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 628\u001b[0m )\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)\u001b[0m\n\u001b[1;32m 345\u001b[0m \u001b[0;31m# some Python versions print out the first line of a multi-line function\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 346\u001b[0m \u001b[0;31m# calls in the traceback and some print out 
the last line\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 347\u001b[0;31m _engine_run_backward(\n\u001b[0m\u001b[1;32m 348\u001b[0m \u001b[0mtensors\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 349\u001b[0m \u001b[0mgrad_tensors_\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py\u001b[0m in \u001b[0;36m_engine_run_backward\u001b[0;34m(t_outputs, *args, **kwargs)\u001b[0m\n\u001b[1;32m 821\u001b[0m \u001b[0munregister_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_register_logging_hooks_on_whole_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mt_outputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 822\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 823\u001b[0;31m return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass\n\u001b[0m\u001b[1;32m 824\u001b[0m \u001b[0mt_outputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 825\u001b[0m ) # Calls into the C++ engine to run the backward pass\n","\u001b[0;31mRuntimeError\u001b[0m: element 0 of tensors does not require grad and does not have a grad_fn"]}],"source":["def example_neural_network():\n"," # Simple neural network operations\n"," x = torch.randn(1024, 512, device='cuda')\n"," W1 = torch.randn(512, 256, device='cuda')\n"," W2 = torch.randn(256, 10, device='cuda')\n","\n"," # Forward pass\n"," h1 = torch.mm(x, W1)\n"," h1 = torch.relu(h1)\n"," output = torch.mm(h1, W2)\n"," loss = torch.sum(output**2)\n","\n"," # Backward pass\n"," loss.backward()\n","\n"," return loss\n","\n","print('Running profiler example...')\n","\n","# Profile the neural network\n","with torch.profiler.profile(\n"," activities=[torch.profiler.ProfilerActivity.CPU,\n"," torch.profiler.ProfilerActivity.CUDA],\n"," record_shapes=True,\n",") as prof:\n"," for _ in range(10):\n"," example_neural_network()\n","\n","# Print profiling results\n","print('\\nTop 5 GPU operations by time:')\n","print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=5))\n","\n","print('\\n🎯 Key Takeaway: Use torch.profiler to identify bottlenecks')\n","print(' Focus optimization efforts on the most time-consuming operations')"]},{"cell_type":"markdown","metadata":{"id":"dtm6fq10p_kx"},"source":["# Summary: PyTorch CUDA Best Practices\n","\n","## The Systematic Optimization Framework\n","\n","**P(optimization_success|measurement) >> P(optimization_success|intuition)**\n","\n","### Core Practices:\n","\n","1. **📱 Create tensors directly on GPU** with `device='cuda'`\n","2. **🔢 Use float32** unless float64 precision is required\n","3. **🚫 Minimize CPU-GPU transfers** (`.cpu()`, `.cuda()`)\n","4. **📦 Batch operations** using `bmm()`, 3D+ tensors\n","5. **⚡ Use in-place operations** (`add_`, `mul_`, etc.) to save memory\n","6. **🎯 Leverage tensor cores** with float16 + dims divisible by 16\n","7. **🧠 Ensure memory contiguity** with `.contiguous()`\n","8. **📊 Profile code** to identify actual bottlenecks\n","9. **🔄 Always use `torch.cuda.synchronize()`** for accurate timing\n","10. 
**🎮 Understand hardware limits** (memory vs compute bound)\n","\n","### The Three Performance Regimes:\n","\n","| **Regime** | **Characteristics** | **Solutions** |\n","|------------|--------------------|--------------|\n","| **Overhead-Bound** | Runtime doesn't scale with data size | Tracing, operator fusion, JIT compilation |\n","| **Memory-Bound** | Low FLOPS utilization, high bandwidth | Operator fusion, increase arithmetic intensity |\n","| **Compute-Bound** | High FLOPS utilization | Use Tensor Cores, upgrade hardware |\n","\n","### Key Formulas:\n","\n","- **Arithmetic Intensity** = `FLOPS / Bytes_Accessed`\n","- **Memory Usage** = `batch_size × seq_len × hidden_dim × bytes_per_element`\n","- **P(tensor_core_usage|float16 + aligned_dims) ≈ 1.0**\n","\n","**Remember:** The microbenchmarking results show that performance depends on arithmetic intensity. Optimize based on whether your operations are memory-bound or compute-bound!"]}],"metadata":{"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.0"},"colab":{"provenance":[],"gpuType":"T4"},"accelerator":"GPU"},"nbformat":4,"nbformat_minor":0}