{"cells":[{"cell_type":"markdown","metadata":{"id":"A2kvRfGnp_kt"},"source":["# PyTorch CUDA 101: GPU Optimization Mastery\n","\n","**From First Principles to Tensor Cores**\n","\n","This notebook demonstrates essential CUDA patterns in PyTorch, based on performance principles revealed by GPU microbenchmarking.\n","\n","## Key Principles:\n","1. Minimize GPU-CPU data transfers\n","2. Choose appropriate data types (float32 vs float64)\n","3. Batch operations to increase arithmetic intensity\n","4. Use in-place operations when possible\n","5. Leverage tensor cores for matrix operations\n","6. Understand memory access patterns\n","7. Profile to identify bottlenecks\n","\n","---"]},{"cell_type":"code","source":["!nvidia-smi"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DD3gMGbPqN4P","executionInfo":{"status":"ok","timestamp":1755176308057,"user_tz":-420,"elapsed":119,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"875df861-8118-417f-f597-371b12037fa3"},"execution_count":1,"outputs":[{"output_type":"stream","name":"stdout","text":["Thu Aug 14 12:58:27 2025 \n","+-----------------------------------------------------------------------------------------+\n","| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |\n","|-----------------------------------------+------------------------+----------------------+\n","| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n","| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n","| | | MIG M. |\n","|=========================================+========================+======================|\n","| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |\n","| N/A 53C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |\n","| | | N/A |\n","+-----------------------------------------+------------------------+----------------------+\n"," \n","+-----------------------------------------------------------------------------------------+\n","| Processes: |\n","| GPU GI CI PID Type Process name GPU Memory |\n","| ID ID Usage |\n","|=========================================================================================|\n","| No running processes found |\n","+-----------------------------------------------------------------------------------------+\n"]}]},{"cell_type":"markdown","metadata":{"id":"vltvInI_p_ku"},"source":["## Setup and Imports"]},{"cell_type":"code","execution_count":2,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"9d38y-0tp_ku","executionInfo":{"status":"ok","timestamp":1755176315357,"user_tz":-420,"elapsed":6345,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"f4000e08-da34-423c-a035-994bd4911170"},"outputs":[{"output_type":"stream","name":"stdout","text":["✅ Using GPU: Tesla T4\n","✅ CUDA Version: 12.4\n","✅ PyTorch Version: 2.6.0+cu124\n"]}],"source":["import torch\n","import time\n","import math\n","from typing import Tuple, Optional\n","\n","def benchmark_operation(func, *args, num_iters=1000, warmup=100):\n"," \"\"\"Benchmark a PyTorch operation with proper CUDA synchronization.\"\"\"\n"," # Warmup to eliminate kernel compilation overhead\n"," for _ in range(warmup):\n"," func(*args)\n"," torch.cuda.synchronize()\n","\n"," # Actual timing\n"," start = time.perf_counter()\n"," for _ in range(num_iters):\n"," result = func(*args)\n"," torch.cuda.synchronize()\n","\n"," elapsed = time.perf_counter() - start\n"," return (elapsed / num_iters) * 1000 # Convert to milliseconds\n","\n","# Check CUDA availability\n","if not 
torch.cuda.is_available():\n"," raise RuntimeError(\"CUDA not available - GPU required for tutorial\")\n","\n","device = torch.device('cuda')\n","print(f'✅ Using GPU: {torch.cuda.get_device_name()}')\n","print(f'✅ CUDA Version: {torch.version.cuda}')\n","print(f'✅ PyTorch Version: {torch.__version__}')"]},{"cell_type":"markdown","metadata":{"id":"H-Ip68DFp_kv"},"source":["# Lesson 0: GPU Memory Baseline - Understanding CUDA Overhead\n","\n","**Key Reality Check:** CUDA kernels consume 1-2 GB regardless of your model size!\n","\n","You might think you could compute memory requirements exactly, but CUDA kernels require substantial overhead that makes precise calculations challenging."]},{"cell_type":"code","execution_count":3,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"MzC_ZSalp_kv","executionInfo":{"status":"ok","timestamp":1755176322060,"user_tz":-420,"elapsed":324,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"b367dacd-246e-4005-fee0-a34b6f574546"},"outputs":[{"output_type":"stream","name":"stdout","text":["📊 Memory before CUDA initialization: 0.0 MB\n","📊 Memory after CUDA initialization: 0.0 MB\n","🎯 CUDA kernel overhead: 0.0 MB\n","\n","💡 This overhead is constant and unavoidable for any GPU computation!\n"," Additional memory used for buffers, intermediate results, and fragmentation\n"," makes precise memory calculations challenging - focus on relative improvements.\n"]}],"source":["# Demonstrate CUDA kernel memory overhead\n","torch.cuda.empty_cache()\n","torch.cuda.reset_peak_memory_stats()\n","baseline_memory = torch.cuda.memory_allocated() / 1024**2\n","\n","# Create minimal tensor to initialize CUDA context\n","minimal_tensor = torch.ones((1, 1), device='cuda')\n","cuda_overhead = torch.cuda.memory_allocated() / 1024**2\n","\n","print(f'📊 Memory before CUDA initialization: {baseline_memory:.1f} MB')\n","print(f'📊 Memory after CUDA initialization: {cuda_overhead:.1f} MB')\n","print(f'🎯 CUDA kernel overhead: {cuda_overhead - baseline_memory:.1f} MB')\n","print(f'\\n💡 This overhead is constant and unavoidable for any GPU computation!')\n","print(f' Additional memory used for buffers, intermediate results, and fragmentation')\n","print(f' makes precise memory calculations challenging - focus on relative improvements.')"]},{"cell_type":"markdown","metadata":{"id":"jCCwZc1Yp_kv"},"source":["# Lesson 1: Device Management & Tensor Creation\n","\n","**Principle:** Memory allocation location is immutable post-creation. 
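\n","\n","A minimal sketch of what that means in practice (illustrative only): moving a tensor returns a new copy rather than relocating the original.\n","\n","```python\n","x = torch.randn(1024, 1024)                 # lives in CPU RAM\n","y = x.to('cuda')                            # returns a new GPU copy; x is unchanged\n","z = torch.randn(1024, 1024, device='cuda')  # allocated directly on the GPU\n","print(x.device, y.device, z.device)         # cpu cuda:0 cuda:0\n","```\n","\n","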
CPU→GPU transfer involves PCIe bandwidth (~16GB/s) vs GPU memory bandwidth (~1500GB/s)."]},{"cell_type":"code","execution_count":4,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"vJ0s8NWtp_kw","executionInfo":{"status":"ok","timestamp":1755176335403,"user_tz":-420,"elapsed":1863,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"5645c438-8e8d-4f4f-93bb-d10da266bde9"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Creating on CPU then moving to GPU\n","✅ GOOD: Creating directly on GPU\n","Bad approach: 7.84 ms\n","Good approach: 0.04 ms\n","Speedup: 206.32x\n","\n","🎯 Key Takeaway: Always create tensors directly on the target device\n"," Use device=\"cuda\" parameter in tensor creation functions\n"]}],"source":["print('❌ BAD: Creating on CPU then moving to GPU')\n","def bad_tensor_creation(size):\n"," x = torch.randn(size, size) # Created on CPU\n"," x = x.cuda() # Expensive CPU->GPU transfer\n"," return x\n","\n","print('✅ GOOD: Creating directly on GPU')\n","def good_tensor_creation(size):\n"," x = torch.randn(size, size, device='cuda') # Created directly on GPU\n"," return x\n","\n","size = 1024\n","bad_time = benchmark_operation(bad_tensor_creation, size, num_iters=100)\n","good_time = benchmark_operation(good_tensor_creation, size, num_iters=100)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","\n","print('\\n🎯 Key Takeaway: Always create tensors directly on the target device')\n","print(' Use device=\"cuda\" parameter in tensor creation functions')"]},{"cell_type":"markdown","metadata":{"id":"CWEYcjqap_kw"},"source":["# Lesson 2: Data Type Optimization\n","\n","**Surprising Implication:** Float16 isn't just 2x faster—it enables Tensor Cores (312 TFLOPS vs 19.5 TFLOPS). 
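(Those peak numbers are A100 figures; the Tesla T4 used in this notebook tops out around 65 TFLOPS of FP16 Tensor Core throughput versus roughly 8 TFLOPS of FP32, so the exact gap depends on the GPU.) 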
This demonstrates a 16x performance cliff, not gradual degradation."]},{"cell_type":"code","execution_count":5,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"C2TBbxWbp_kw","executionInfo":{"status":"ok","timestamp":1755176398588,"user_tz":-420,"elapsed":12064,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"b708001f-89ce-45dc-80b0-d2991d744a90"},"outputs":[{"output_type":"stream","name":"stdout","text":["Float64: 71.54 ms\n","Float32: 4.58 ms (15.63x faster)\n","Float16: 1.15 ms (62.09x faster)\n","\n","🎯 Key Takeaway: Use float32 unless you need float64 precision\n"," Float16 is even faster but may have numerical stability issues\n","\n","Memory usage:\n","Float64: 7.6 MB\n","Float32: 3.8 MB\n"]}],"source":["size = 2048\n","\n","def matmul_float64():\n"," A = torch.randn(size, size, dtype=torch.float64, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float64, device='cuda')\n"," return torch.mm(A, B)\n","\n","def matmul_float32():\n"," A = torch.randn(size, size, dtype=torch.float32, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float32, device='cuda')\n"," return torch.mm(A, B)\n","\n","def matmul_float16():\n"," A = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," return torch.mm(A, B)\n","\n","time_f64 = benchmark_operation(matmul_float64, num_iters=50)\n","time_f32 = benchmark_operation(matmul_float32, num_iters=50)\n","time_f16 = benchmark_operation(matmul_float16, num_iters=50)\n","\n","print(f'Float64: {time_f64:.2f} ms')\n","print(f'Float32: {time_f32:.2f} ms ({time_f64/time_f32:.2f}x faster)')\n","print(f'Float16: {time_f16:.2f} ms ({time_f64/time_f16:.2f}x faster)')\n","\n","print('\\n🎯 Key Takeaway: Use float32 unless you need float64 precision')\n","print(' Float16 is even faster but may have numerical stability issues')\n","\n","# Memory usage comparison\n","f64_tensor = torch.randn(1000, 1000, dtype=torch.float64, device='cuda')\n","f32_tensor = torch.randn(1000, 1000, dtype=torch.float32, device='cuda')\n","\n","print(f'\\nMemory usage:')\n","print(f'Float64: {f64_tensor.element_size() * f64_tensor.numel() / 1024**2:.1f} MB')\n","print(f'Float32: {f32_tensor.element_size() * f32_tensor.numel() / 1024**2:.1f} MB')"]},{"cell_type":"markdown","metadata":{"id":"nMSyOSqnp_kw"},"source":["# Lesson 3: CPU-GPU Transfer Optimization\n","\n","**Hidden Cost:** Each transfer incurs ~10μs latency + bandwidth cost. 
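Transfers also hide in innocuous-looking calls: `.item()`, `bool()`, and `print()` on CUDA tensors each trigger a device-to-host copy plus an implicit synchronization (a short sketch):\n","\n","```python\n","x = torch.randn(1000, 1000, device='cuda')\n","total = x.sum()             # still on the GPU, queued asynchronously\n","val = total.item()          # device-to-host copy + implicit synchronize\n","ok = bool((x > 0).any())    # Python bool() of a CUDA tensor also syncs\n","print(x[0, 0])              # printing copies the element to the CPU first\n","```\n","\n","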
For small operations, latency dominates—you're paying milliseconds to save microseconds."]},{"cell_type":"code","execution_count":6,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"dJ_6DtHzp_kw","executionInfo":{"status":"ok","timestamp":1755176416931,"user_tz":-420,"elapsed":350,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"0aaf1e5f-c83c-40a3-edb8-a5cf271b71dd"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Frequent CPU-GPU transfers\n","✅ GOOD: Keep operations on GPU\n","Bad approach: 1.64 ms\n","Good approach: 0.02 ms\n","Speedup: 75.2x\n","\n","🎯 Key Takeaway: Keep data on GPU as long as possible\n"," Use PyTorch operations instead of numpy when possible\n"]}],"source":["x = torch.randn(1000, 1000, device='cuda')\n","\n","print('❌ BAD: Frequent CPU-GPU transfers')\n","def bad_cpu_gpu_pattern():\n"," # Convert to CPU, do numpy operation, back to GPU\n"," x_cpu = x.cpu().numpy() # GPU -> CPU\n"," result_cpu = x_cpu.sum() # CPU operation\n"," result_gpu = torch.tensor(result_cpu, device='cuda') # CPU -> GPU\n"," return result_gpu\n","\n","print('✅ GOOD: Keep operations on GPU')\n","def good_gpu_pattern():\n"," result = x.sum() # All on GPU\n"," return result\n","\n","bad_time = benchmark_operation(bad_cpu_gpu_pattern, num_iters=100)\n","good_time = benchmark_operation(good_gpu_pattern, num_iters=100)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.1f}x')\n","\n","print('\\n🎯 Key Takeaway: Keep data on GPU as long as possible')\n","print(' Use PyTorch operations instead of numpy when possible')"]},{"cell_type":"markdown","metadata":{"id":"WsLFJphvp_kw"},"source":["# Lesson 4: Batching for Arithmetic Intensity\n","\n","**First Principles:** Single operations have low arithmetic intensity (FLOPS/memory_access). 
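For example, adding two n-element float32 vectors performs n FLOPs while touching roughly 12n bytes (two reads and one write), an intensity of about 0.08 FLOP/byte, whereas an n×n matmul performs 2n³ FLOPs over roughly 12n² bytes, an intensity of about n/6 FLOP/byte. 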
Batching increases intensity from O(n²) to O(n³) for matrix operations."]},{"cell_type":"code","execution_count":7,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"UkzNkIjip_kx","executionInfo":{"status":"ok","timestamp":1755176426734,"user_tz":-420,"elapsed":253,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"6c305f14-5f06-4df4-8968-c3554fa9abce"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Processing one sample at a time\n","✅ GOOD: Batch processing\n","Bad approach: 1.71 ms\n","Good approach: 0.34 ms\n","Speedup: 5.03x\n","\n","🎯 Key Takeaway: Batch operations whenever possible\n"," Use bmm(), batch matrix operations, and higher-dimensional tensors\n"]}],"source":["print('❌ BAD: Processing one sample at a time')\n","def bad_sequential_processing():\n"," samples = [torch.randn(256, 256, device='cuda') for _ in range(32)]\n"," results = []\n"," for sample in samples:\n"," result = torch.mm(sample, sample.T) # Individual matrix multiply\n"," results.append(result)\n"," return torch.stack(results)\n","\n","print('✅ GOOD: Batch processing')\n","def good_batch_processing():\n"," # Create batched tensor directly\n"," batch = torch.randn(32, 256, 256, device='cuda')\n"," # Batched matrix multiply - much more efficient\n"," result = torch.bmm(batch, batch.transpose(-2, -1))\n"," return result\n","\n","bad_time = benchmark_operation(bad_sequential_processing, num_iters=10)\n","good_time = benchmark_operation(good_batch_processing, num_iters=10)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","\n","print('\\n🎯 Key Takeaway: Batch operations whenever possible')\n","print(' Use bmm(), batch matrix operations, and higher-dimensional tensors')"]},{"cell_type":"markdown","metadata":{"id":"oI9DFq-0p_kx"},"source":["# Lesson 5: In-place Operations\n","\n","**Memory Allocator Tax:** Each allocation involves GPU memory manager overhead. 
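PyTorch's caching allocator avoids a cudaMalloc call on most requests, but every fresh tensor still pays allocator bookkeeping and pushes up peak memory and fragmentation. Besides the trailing-underscore methods benchmarked below, many ops also accept an `out=` buffer that can be reused across iterations (a sketch):\n","\n","```python\n","a = torch.randn(2048, 2048, device='cuda')\n","b = torch.randn(2048, 2048, device='cuda')\n","buf = torch.empty_like(a)   # allocate once, outside the hot loop\n","torch.add(a, b, out=buf)    # writes the result into buf, no new allocation\n","```\n","\n","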
In-place operations eliminate allocation/deallocation cycles entirely."]},{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3Jl88InLp_kx","executionInfo":{"status":"ok","timestamp":1755176436411,"user_tz":-420,"elapsed":256,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"f3ae7d30-bb88-467f-b401-11711dd1f3a0"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Creating new tensors\n","✅ GOOD: In-place operations\n","Bad approach: 0.59 ms, 103.8 MB peak\n","Good approach: 0.52 ms, 71.8 MB peak\n","Speedup: 1.13x\n","Memory reduction: 1.45x\n","\n","🎯 Key Takeaway: Use in-place operations (add_, mul_, etc.)\n"," Reduces memory allocation and garbage collection overhead\n"]}],"source":["size = (2048, 2048)\n","\n","print('❌ BAD: Creating new tensors')\n","def bad_memory_allocation():\n"," x = torch.randn(*size, device='cuda')\n"," y = torch.randn(*size, device='cuda')\n"," z = x + y # Creates new tensor\n"," w = z * 2 # Creates another new tensor\n"," return w\n","\n","print('✅ GOOD: In-place operations')\n","def good_inplace_operations():\n"," x = torch.randn(*size, device='cuda')\n"," y = torch.randn(*size, device='cuda')\n"," x.add_(y) # In-place addition\n"," x.mul_(2) # In-place multiplication\n"," return x\n","\n","# Monitor memory usage\n","torch.cuda.empty_cache()\n","torch.cuda.reset_peak_memory_stats()\n","\n","bad_time = benchmark_operation(bad_memory_allocation, num_iters=50)\n","bad_memory = torch.cuda.max_memory_allocated() / 1024**2\n","\n","torch.cuda.empty_cache()\n","torch.cuda.reset_peak_memory_stats()\n","\n","good_time = benchmark_operation(good_inplace_operations, num_iters=50)\n","good_memory = torch.cuda.max_memory_allocated() / 1024**2\n","\n","print(f'Bad approach: {bad_time:.2f} ms, {bad_memory:.1f} MB peak')\n","print(f'Good approach: {good_time:.2f} ms, {good_memory:.1f} MB peak')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","print(f'Memory reduction: {bad_memory/good_memory:.2f}x')\n","\n","print('\\n🎯 Key Takeaway: Use in-place operations (add_, mul_, etc.)')\n","print(' Reduces memory allocation and garbage collection overhead')"]},{"cell_type":"markdown","metadata":{"id":"7ISx7KIGp_kx"},"source":["# Lesson 6: Tensor Core Optimization\n","\n","**Hardware Constraint:** Tensor Cores operate on 4×4 matrices of float16. 
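(More precisely, each Tensor Core executes a small fixed-size matrix multiply-accumulate per clock, so cuBLAS tiles the problem most efficiently when every dimension is a multiple of 8 or 16.) A common workaround is to round awkward layer sizes up to the next multiple of 16; `pad_to_multiple` below is only an illustrative helper, not a PyTorch API.\n","\n","```python\n","def pad_to_multiple(dim: int, multiple: int = 16) -> int:\n","    # round dim up to the next multiple, e.g. 1000 -> 1008\n","    return ((dim + multiple - 1) // multiple) * multiple\n","```\n","\n","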
Misaligned dimensions force fallback to CUDA cores—a 16x performance penalty."]},{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3oQ53k8pp_kx","executionInfo":{"status":"ok","timestamp":1755176445833,"user_tz":-420,"elapsed":246,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"2dd83185-735b-4129-9e85-771846cb998e"},"outputs":[{"output_type":"stream","name":"stdout","text":["Matrix multiply performance depends on tensor core compatibility:\n","Size Time (ms) TFLOPS Notes\n","--------------------------------------------------\n","512 0.08 3.36 ✅ TC-friendly\n","768 0.11 7.90 ✅ TC-friendly\n","1024 0.22 9.67 ✅ TC-friendly\n","1536 0.55 13.14 ✅ TC-friendly\n","2048 0.92 18.70 ✅ TC-friendly\n","\n","🎯 Key Takeaway: Use float16 and dimensions divisible by 16\n"," This maximizes tensor core utilization on modern GPUs\n"]}],"source":["print('Matrix multiply performance depends on tensor core compatibility:')\n","\n","# Test different matrix sizes - tensor cores prefer certain dimensions\n","sizes = [512, 768, 1024, 1536, 2048]\n","\n","print(f'{\"Size\":<8} {\"Time (ms)\":<10} {\"TFLOPS\":<10} {\"Notes\"}')\n","print('-' * 50)\n","\n","for size in sizes:\n"," def matmul_test():\n"," A = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," B = torch.randn(size, size, dtype=torch.float16, device='cuda')\n"," return torch.mm(A, B)\n","\n"," time_ms = benchmark_operation(matmul_test, num_iters=20)\n"," flops = 2 * size**3 # Matrix multiply FLOPS\n"," tflops = (flops / (time_ms * 1e-3)) / 1e12\n","\n"," # Tensor cores work best with dimensions divisible by 8/16\n"," tc_friendly = '✅ TC-friendly' if size % 16 == 0 else '⚠️ Sub-optimal'\n","\n"," print(f'{size:<8} {time_ms:<10.2f} {tflops:<10.2f} {tc_friendly}')\n","\n","print('\\n🎯 Key Takeaway: Use float16 and dimensions divisible by 16')\n","print(' This maximizes tensor core utilization on modern GPUs')"]},{"cell_type":"markdown","metadata":{"id":"Iz9G6I4fp_kx"},"source":["# Lesson 7: Memory Access Patterns\n","\n","**Memory Layout Principle:** GPU threads access memory in coalesced patterns. 
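Whether an access pattern can coalesce depends on the tensor's strides (a quick sketch):\n","\n","```python\n","x = torch.randn(4096, 4096, device='cuda')\n","print(x.stride())    # (4096, 1): elements of a row sit next to each other\n","print(x.T.stride())  # (1, 4096): walking a row of the view jumps 4096 elements\n","```\n","\n","Keep in mind that `.contiguous()` materializes a copy, so it only pays off when the result is reused by several later ops; for a one-off reduction like the benchmark below, the copy can cost more than it saves. 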
Non-contiguous access forces multiple memory transactions instead of single wide loads."]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"DRaeNq3Gp_kx","executionInfo":{"status":"ok","timestamp":1755176454772,"user_tz":-420,"elapsed":406,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"32d4ea35-0ffe-450a-d3f2-b2e297cd77e0"},"outputs":[{"output_type":"stream","name":"stdout","text":["❌ BAD: Non-contiguous memory access\n","✅ GOOD: Contiguous memory access\n","Bad approach: 0.27 ms\n","Good approach: 1.32 ms\n","Speedup: 0.21x\n","\n","Memory layout check:\n","Original tensor is_contiguous: True\n","Transposed tensor is_contiguous: False\n","After .contiguous(): True\n","\n","🎯 Key Takeaway: Use .contiguous() after shape operations\n"," Check .is_contiguous() and call .contiguous() when needed\n"]}],"source":["size = (4096, 4096)\n","x = torch.randn(*size, device='cuda')\n","\n","print('❌ BAD: Non-contiguous memory access')\n","def bad_memory_pattern():\n"," # Transpose creates a view with different strides\n"," x_t = x.T\n"," return torch.sum(x_t, dim=0) # Non-contiguous access\n","\n","print('✅ GOOD: Contiguous memory access')\n","def good_memory_pattern():\n"," # Make contiguous first\n"," x_t = x.T.contiguous()\n"," return torch.sum(x_t, dim=0) # Contiguous access\n","\n","bad_time = benchmark_operation(bad_memory_pattern, num_iters=100)\n","good_time = benchmark_operation(good_memory_pattern, num_iters=100)\n","\n","print(f'Bad approach: {bad_time:.2f} ms')\n","print(f'Good approach: {good_time:.2f} ms')\n","print(f'Speedup: {bad_time/good_time:.2f}x')\n","\n","print(f'\\nMemory layout check:')\n","print(f'Original tensor is_contiguous: {x.is_contiguous()}')\n","print(f'Transposed tensor is_contiguous: {x.T.is_contiguous()}')\n","print(f'After .contiguous(): {x.T.contiguous().is_contiguous()}')\n","\n","print('\\n🎯 Key Takeaway: Use .contiguous() after shape operations')\n","print(' Check .is_contiguous() and call .contiguous() when needed')"]},{"cell_type":"markdown","metadata":{"id":"V1E-TpW1p_kx"},"source":["# Lesson 8: Performance Profiling\n","\n","**Measurement Principle:** You cannot optimize what you cannot measure. 
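One gotcha to watch for in the profiling cell below: `loss.backward()` raises a RuntimeError unless at least one leaf tensor in the graph has `requires_grad=True`; creating the weights like this is the minimal fix (sketch):\n","\n","```python\n","W1 = torch.randn(512, 256, device='cuda', requires_grad=True)\n","W2 = torch.randn(256, 10, device='cuda', requires_grad=True)\n","```\n","\n","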
The profiler reveals the actual bottleneck—often surprising compared to intuition."]},{"cell_type":"code","execution_count":11,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":382},"id":"ZQYGvFQdp_kx","executionInfo":{"status":"error","timestamp":1755176465409,"user_tz":-420,"elapsed":247,"user":{"displayName":"Laam Pham","userId":"04566654796696849937"}},"outputId":"0134a7c8-94ef-4cfc-f62a-d0cba546cd64"},"outputs":[{"output_type":"stream","name":"stdout","text":["Running profiler example...\n"]},{"output_type":"error","ename":"RuntimeError","evalue":"element 0 of tensors does not require grad and does not have a grad_fn","traceback":["\u001b[0;31m---------------------------------------------------------------------------\u001b[0m","\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)","\u001b[0;32m/tmp/ipython-input-1548132624.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 25\u001b[0m ) as prof:\n\u001b[1;32m 26\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mexample_neural_network\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 28\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 29\u001b[0m \u001b[0;31m# Print profiling results\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/tmp/ipython-input-1548132624.py\u001b[0m in \u001b[0;36mexample_neural_network\u001b[0;34m()\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0;31m# Backward pass\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 14\u001b[0;31m \u001b[0mloss\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mloss\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/torch/_tensor.py\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(self, gradient, retain_graph, create_graph, inputs)\u001b[0m\n\u001b[1;32m 624\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 625\u001b[0m )\n\u001b[0;32m--> 626\u001b[0;31m torch.autograd.backward(\n\u001b[0m\u001b[1;32m 627\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgradient\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mretain_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcreate_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 628\u001b[0m )\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)\u001b[0m\n\u001b[1;32m 345\u001b[0m \u001b[0;31m# some Python versions print out the first line of a multi-line function\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 346\u001b[0m \u001b[0;31m# calls in the traceback and some print out 
the last line\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 347\u001b[0;31m _engine_run_backward(\n\u001b[0m\u001b[1;32m 348\u001b[0m \u001b[0mtensors\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 349\u001b[0m \u001b[0mgrad_tensors_\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py\u001b[0m in \u001b[0;36m_engine_run_backward\u001b[0;34m(t_outputs, *args, **kwargs)\u001b[0m\n\u001b[1;32m 821\u001b[0m \u001b[0munregister_hooks\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_register_logging_hooks_on_whole_graph\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mt_outputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 822\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 823\u001b[0;31m return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass\n\u001b[0m\u001b[1;32m 824\u001b[0m \u001b[0mt_outputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 825\u001b[0m ) # Calls into the C++ engine to run the backward pass\n","\u001b[0;31mRuntimeError\u001b[0m: element 0 of tensors does not require grad and does not have a grad_fn"]}],"source":["def example_neural_network():\n"," # Simple neural network operations\n"," x = torch.randn(1024, 512, device='cuda')\n"," W1 = torch.randn(512, 256, device='cuda')\n"," W2 = torch.randn(256, 10, device='cuda')\n","\n"," # Forward pass\n"," h1 = torch.mm(x, W1)\n"," h1 = torch.relu(h1)\n"," output = torch.mm(h1, W2)\n"," loss = torch.sum(output**2)\n","\n"," # Backward pass\n"," loss.backward()\n","\n"," return loss\n","\n","print('Running profiler example...')\n","\n","# Profile the neural network\n","with torch.profiler.profile(\n"," activities=[torch.profiler.ProfilerActivity.CPU,\n"," torch.profiler.ProfilerActivity.CUDA],\n"," record_shapes=True,\n",") as prof:\n"," for _ in range(10):\n"," example_neural_network()\n","\n","# Print profiling results\n","print('\\nTop 5 GPU operations by time:')\n","print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=5))\n","\n","print('\\n🎯 Key Takeaway: Use torch.profiler to identify bottlenecks')\n","print(' Focus optimization efforts on the most time-consuming operations')"]},{"cell_type":"markdown","metadata":{"id":"dtm6fq10p_kx"},"source":["# Summary: PyTorch CUDA Best Practices\n","\n","## The Systematic Optimization Framework\n","\n","**P(optimization_success|measurement) >> P(optimization_success|intuition)**\n","\n","### Core Practices:\n","\n","1. **📱 Create tensors directly on GPU** with `device='cuda'`\n","2. **🔢 Use float32** unless float64 precision is required\n","3. **🚫 Minimize CPU-GPU transfers** (`.cpu()`, `.cuda()`)\n","4. **📦 Batch operations** using `bmm()`, 3D+ tensors\n","5. **⚡ Use in-place operations** (`add_`, `mul_`, etc.) to save memory\n","6. **🎯 Leverage tensor cores** with float16 + dims divisible by 16\n","7. **🧠 Ensure memory contiguity** with `.contiguous()`\n","8. **📊 Profile code** to identify actual bottlenecks\n","9. **🔄 Always use `torch.cuda.synchronize()`** for accurate timing\n","10. 
**🎮 Understand hardware limits** (memory vs compute bound)\n","\n","### The Three Performance Regimes:\n","\n","| **Regime** | **Characteristics** | **Solutions** |\n","|------------|--------------------|--------------|\n","| **Overhead-Bound** | Runtime doesn't scale with data size | Tracing, operator fusion, JIT compilation |\n","| **Memory-Bound** | Low FLOPS utilization, high bandwidth | Operator fusion, increase arithmetic intensity |\n","| **Compute-Bound** | High FLOPS utilization | Use Tensor Cores, upgrade hardware |\n","\n","### Key Formulas:\n","\n","- **Arithmetic Intensity** = `FLOPS / Bytes_Accessed`\n","- **Memory Usage** = `batch_size × seq_len × hidden_dim × bytes_per_element`\n","- **P(tensor_core_usage|float16 + aligned_dims) ≈ 1.0**\n","\n","**Remember:** The microbenchmarking results show that performance depends on arithmetic intensity. Optimize based on whether your operations are memory-bound or compute-bound!"]}],"metadata":{"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.0"},"colab":{"provenance":[],"gpuType":"T4"},"accelerator":"GPU"},"nbformat":4,"nbformat_minor":0}