# PyTorch CUDA 101: GPU Optimization Mastery

**From First Principles to Tensor Cores**

This notebook demonstrates essential CUDA patterns in PyTorch, based on performance principles revealed by GPU microbenchmarking.

## Key Principles:
1. Minimize GPU-CPU data transfers
2. Choose appropriate data types (float32 vs float64)
3. Batch operations to increase arithmetic intensity
4. Use in-place operations when possible
5. Leverage tensor cores for matrix operations
6. Understand memory access patterns
7. Profile to identify bottlenecks

---

In [1]:
!nvidia-smi

Thu Aug 14 12:58:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   53C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Setup and Imports

In [2]:
import torch
import time
import math
from typing import Tuple, Optional

def benchmark_operation(func, *args, num_iters=1000, warmup=100):
    """Benchmark a PyTorch operation with proper CUDA synchronization."""
    # Warmup: absorb one-time costs (lazy CUDA/library initialization, allocator growth)
    for _ in range(warmup):
        func(*args)
    torch.cuda.synchronize()

    # Actual timing
    start = time.perf_counter()
    for _ in range(num_iters):
        result = func(*args)
    torch.cuda.synchronize()

    elapsed = time.perf_counter() - start
    return (elapsed / num_iters) * 1000  # Convert to milliseconds

# Check CUDA availability
if not torch.cuda.is_available():
    raise RuntimeError("CUDA not available - GPU required for tutorial")

device = torch.device('cuda')
print(f'✅ Using GPU: {torch.cuda.get_device_name()}')
print(f'✅ CUDA Version: {torch.version.cuda}')
print(f'✅ PyTorch Version: {torch.__version__}')

✅ Using GPU: Tesla T4
✅ CUDA Version: 12.4
✅ PyTorch Version: 2.6.0+cu124


# Lesson 0: GPU Memory Baseline - Understanding CUDA Overhead

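**Key Reality Check:** The CUDA context (driver state plus loaded kernels) occupies GPU memory regardless of your model size. The amount depends on the GPU, driver, and PyTorch build, and can range from hundreds of megabytes up to the 1-2 GB range.

You might think you could compute memory requirements exactly, but this overhead makes precise calculations challenging. Note that `torch.cuda.memory_allocated()` counts only tensor memory held by PyTorch's caching allocator, so the context overhead does not appear in it; `nvidia-smi` shows the device-level total.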

In [3]:
# memory_allocated() reports only tensor memory held by PyTorch's caching
# allocator. The CUDA context itself is not included, so this cell cannot
# show the context overhead directly (compare with nvidia-smi instead).
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
baseline_memory = torch.cuda.memory_allocated() / 1024**2

# Create a minimal tensor; this also initializes the CUDA context
minimal_tensor = torch.ones((1, 1), device='cuda')
tensor_memory = torch.cuda.memory_allocated() / 1024**2

print(f'📊 Allocator memory before first tensor: {baseline_memory:.1f} MB')
print(f'📊 Allocator memory after first tensor: {tensor_memory:.1f} MB')
print(f'🎯 Change tracked by allocator: {tensor_memory - baseline_memory:.1f} MB')
print(f'\n💡 The CUDA context overhead is not counted here; check nvidia-smi for it.')
print(f'   Buffers, intermediate results, and fragmentation also add memory,')
print(f'   so precise calculations are hard - focus on relative improvements.')

📊 Allocator memory before first tensor: 0.0 MB
📊 Allocator memory after first tensor: 0.0 MB
🎯 Change tracked by allocator: 0.0 MB

💡 The CUDA context overhead is not counted here; check nvidia-smi for it.
   Buffers, intermediate results, and fragmentation also add memory,
   so precise calculations are hard - focus on relative improvements.


# Lesson 1: Device Management & Tensor Creation

**Principle:** A tensor's device is fixed at creation; moving it means copying it. A CPU→GPU copy is limited by PCIe bandwidth (~16 GB/s for PCIe 3.0 x16), far below on-device memory bandwidth (~320 GB/s on the T4 used here, ~1500 GB/s on A100-class GPUs).

In [4]:
print('❌ BAD: Creating on CPU then moving to GPU')
def bad_tensor_creation(size):
    x = torch.randn(size, size)  # Created on CPU (CPU random generation is timed too)
    x = x.cuda()  # CPU->GPU copy over PCIe
    return x

print('✅ GOOD: Creating directly on GPU')
def good_tensor_creation(size):
    x = torch.randn(size, size, device='cuda')  # Created directly on GPU
    return x

size = 1024
bad_time = benchmark_operation(bad_tensor_creation, size, num_iters=100)
good_time = benchmark_operation(good_tensor_creation, size, num_iters=100)

print(f'Bad approach:  {bad_time:.2f} ms')
print(f'Good approach: {good_time:.2f} ms')
print(f'Speedup: {bad_time/good_time:.2f}x')

print('\n🎯 Key Takeaway: Always create tensors directly on the target device')
print('   Use device="cuda" parameter in tensor creation functions')

❌ BAD: Creating on CPU then moving to GPU
✅ GOOD: Creating directly on GPU
Bad approach:  7.84 ms
Good approach: 0.04 ms
Speedup: 206.32x

🎯 Key Takeaway: Always create tensors directly on the target device
   Use device="cuda" parameter in tensor creation functions
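
To put the 7.84 ms figure in context, the sketch below estimates how long the PCIe copy alone should take for a 1024 × 1024 float32 tensor. It assumes the ~16 GB/s PCIe figure quoted in the principle above; `transfer_time_ms` is a hypothetical helper, not a PyTorch function. Under that assumption the copy accounts for only a few percent of the measured time, which suggests that most of the gap comes from elsewhere in the "bad" path (plausibly CPU-side random number generation and the blocking copy).

```python
# Assumed nominal figure, not a measurement from this notebook.
PCIE_BANDWIDTH_BYTES_PER_S = 16e9  # ~16 GB/s, the PCIe figure quoted above

def transfer_time_ms(num_elements: int, bytes_per_element: int,
                     bandwidth_bytes_per_s: float = PCIE_BANDWIDTH_BYTES_PER_S) -> float:
    """Lower bound on copy time: payload size divided by link bandwidth."""
    payload_bytes = num_elements * bytes_per_element
    return payload_bytes / bandwidth_bytes_per_s * 1000.0

# A 1024 x 1024 float32 tensor, as in the benchmark above.
copy_ms = transfer_time_ms(1024 * 1024, 4)
measured_bad_ms = 7.84  # "Bad approach" timing reported above

print(f"Estimated PCIe copy time: {copy_ms:.2f} ms")
print(f"Share of measured time:   {copy_ms / measured_bad_ms:.1%}")
```

The practical advice is unchanged (create tensors on the target device), but the speedup should not be read as a pure PCIe effect.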


# Lesson 2: Data Type Optimization

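**Surprising Implication:** Float16 does more than halve memory traffic: it enables Tensor Cores. On an A100 the peak specifications are 312 TFLOPS (FP16 Tensor Core) vs 19.5 TFLOPS (FP32), a 16x step rather than a gradual change. The T4 used in this notebook has lower peaks (roughly 65 vs 8.1 TFLOPS), and its float64 throughput is a small fraction of its float32 throughput, which is why float64 is so slow below.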

In [5]:
size = 2048

def matmul_float64():
    A = torch.randn(size, size, dtype=torch.float64, device='cuda')
    B = torch.randn(size, size, dtype=torch.float64, device='cuda')
    return torch.mm(A, B)

def matmul_float32():
    A = torch.randn(size, size, dtype=torch.float32, device='cuda')
    B = torch.randn(size, size, dtype=torch.float32, device='cuda')
    return torch.mm(A, B)

def matmul_float16():
    A = torch.randn(size, size, dtype=torch.float16, device='cuda')
    B = torch.randn(size, size, dtype=torch.float16, device='cuda')
    return torch.mm(A, B)

time_f64 = benchmark_operation(matmul_float64, num_iters=50)
time_f32 = benchmark_operation(matmul_float32, num_iters=50)
time_f16 = benchmark_operation(matmul_float16, num_iters=50)

print(f'Float64: {time_f64:.2f} ms')
print(f'Float32: {time_f32:.2f} ms ({time_f64/time_f32:.2f}x faster)')
print(f'Float16: {time_f16:.2f} ms ({time_f64/time_f16:.2f}x faster)')

print('\n🎯 Key Takeaway: Use float32 unless you need float64 precision')
print('   Float16 is even faster but may have numerical stability issues')

# Memory usage comparison
f64_tensor = torch.randn(1000, 1000, dtype=torch.float64, device='cuda')
f32_tensor = torch.randn(1000, 1000, dtype=torch.float32, device='cuda')

print(f'\nMemory usage:')
print(f'Float64: {f64_tensor.element_size() * f64_tensor.numel() / 1024**2:.1f} MB')
print(f'Float32: {f32_tensor.element_size() * f32_tensor.numel() / 1024**2:.1f} MB')

Float64: 71.54 ms
Float32: 4.58 ms (15.63x faster)
Float16: 1.15 ms (62.09x faster)

🎯 Key Takeaway: Use float32 unless you need float64 precision
   Float16 is even faster but may have numerical stability issues

Memory usage:
Float64: 7.6 MB
Float32: 3.8 MB
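
The timings above can be converted into achieved throughput, which makes them easier to compare against hardware peak numbers. The sketch below uses the standard 2·n³ FLOP count for an n × n matrix multiply; `achieved_tflops` is a hypothetical helper. Because the benchmarked functions also create their input tensors with `torch.randn`, these values should be read as lower bounds on the matmul throughput itself.

```python
def achieved_tflops(n: int, time_ms: float) -> float:
    """Throughput of an (n x n) @ (n x n) matmul: 2 * n^3 FLOPs over the elapsed time."""
    flops = 2 * n ** 3
    return flops / (time_ms * 1e-3) / 1e12

# Timings reported by the cell above for size = 2048.
measured_ms = {"float64": 71.54, "float32": 4.58, "float16": 1.15}

for dtype, ms in measured_ms.items():
    print(f"{dtype}: {achieved_tflops(2048, ms):.2f} TFLOPS")
```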


# Lesson 3: CPU-GPU Transfer Optimization

**Hidden Cost:** Each transfer incurs ~10 μs of latency plus a bandwidth cost, and `.cpu()` also makes the CPU wait for pending GPU work. For small operations latency dominates: the round trip costs far more than the computation it wraps.
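
A minimal cost model makes the latency argument concrete. It assumes the ~10 μs latency quoted above and a ~16 GB/s PCIe link (both nominal figures, not measurements); `transfer_cost_us` and `latency_share` are hypothetical helpers. Real transfers also pay for synchronization and any CPU-side work, so this is a lower bound.

```python
# Assumed figures: the ~10 us latency quoted above and a ~16 GB/s PCIe link.
LATENCY_S = 10e-6
BANDWIDTH_BYTES_PER_S = 16e9

def transfer_cost_us(num_bytes: int) -> float:
    """Simple cost model: fixed latency plus payload size over bandwidth."""
    return (LATENCY_S + num_bytes / BANDWIDTH_BYTES_PER_S) * 1e6

def latency_share(num_bytes: int) -> float:
    """Fraction of the modeled cost that is fixed latency."""
    return LATENCY_S * 1e6 / transfer_cost_us(num_bytes)

# Payload size at which the latency and bandwidth terms are equal.
crossover_bytes = LATENCY_S * BANDWIDTH_BYTES_PER_S

for label, num_bytes in [("one float32 scalar", 4),
                         ("4 KiB buffer", 4 * 1024),
                         ("1000 x 1000 float32", 1000 * 1000 * 4)]:
    print(f"{label:>20}: {transfer_cost_us(num_bytes):8.2f} us, "
          f"latency share {latency_share(num_bytes):.0%}")
print(f"Crossover payload: {crossover_bytes / 1e3:.0f} kB")
```

Below the crossover size the fixed latency is the larger term, so many small transfers are far more expensive than one large transfer of the same total size.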

In [6]:
x = torch.randn(1000, 1000, device='cuda')

print('❌ BAD: Frequent CPU-GPU transfers')
def bad_cpu_gpu_pattern():
    # Convert to CPU, do numpy operation, back to GPU
    x_cpu = x.cpu().numpy()  # GPU -> CPU
    result_cpu = x_cpu.sum()  # CPU operation
    result_gpu = torch.tensor(result_cpu, device='cuda')  # CPU -> GPU
    return result_gpu

print('✅ GOOD: Keep operations on GPU')
def good_gpu_pattern():
    result = x.sum()  # All on GPU
    return result

bad_time = benchmark_operation(bad_cpu_gpu_pattern, num_iters=100)
good_time = benchmark_operation(good_gpu_pattern, num_iters=100)

print(f'Bad approach:  {bad_time:.2f} ms')
print(f'Good approach: {good_time:.2f} ms')
print(f'Speedup: {bad_time/good_time:.1f}x')

print('\n🎯 Key Takeaway: Keep data on GPU as long as possible')
print('   Use PyTorch operations instead of numpy when possible')

❌ BAD: Frequent CPU-GPU transfers
✅ GOOD: Keep operations on GPU
Bad approach:  1.64 ms
Good approach: 0.02 ms
Speedup: 75.2x

🎯 Key Takeaway: Keep data on GPU as long as possible
   Use PyTorch operations instead of numpy when possible


# Lesson 4: Batching for Arithmetic Intensity

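**First Principles:** Arithmetic intensity is the number of FLOPs performed per byte of memory traffic. An n×n matrix multiply does O(n³) FLOPs on O(n²) data, so its intensity grows as O(n); small matrices are therefore dominated by memory traffic and kernel-launch overhead. Batching many small multiplies into a single `bmm` call amortizes the launch overhead and keeps more of the GPU busy.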
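
The scaling claim can be checked with a few lines of arithmetic. The sketch below assumes an idealized matmul in which each matrix crosses the memory bus exactly once (read A, read B, write C); real kernels move more data, so treat the values as upper bounds. Both function names are hypothetical.

```python
def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for an n x n matmul, assuming each matrix moves once."""
    flops = 2 * n ** 3                            # n^3 multiply-add pairs
    bytes_moved = 3 * n ** 2 * bytes_per_element  # read A and B, write C
    return flops / bytes_moved

def elementwise_add_intensity(bytes_per_element: int = 4) -> float:
    """FLOPs per byte for z = x + y: one add per two reads and one write."""
    return 1 / (3 * bytes_per_element)

for n in (64, 256, 1024, 4096):
    print(f"n={n:<5} matmul intensity: {matmul_intensity(n):8.1f} FLOP/byte")
print(f"elementwise add:         {elementwise_add_intensity():8.3f} FLOP/byte")
```

Doubling n doubles the intensity of a matmul, while an elementwise operation stays at a fixed, low intensity no matter how large the tensor is.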

In [7]:
print('❌ BAD: Processing one sample at a time')
def bad_sequential_processing():
    samples = [torch.randn(256, 256, device='cuda') for _ in range(32)]
    results = []
    for sample in samples:
        result = torch.mm(sample, sample.T)  # Individual matrix multiply
        results.append(result)
    return torch.stack(results)

print('✅ GOOD: Batch processing')
def good_batch_processing():
    # Create batched tensor directly
    batch = torch.randn(32, 256, 256, device='cuda')
    # Batched matrix multiply - much more efficient
    result = torch.bmm(batch, batch.transpose(-2, -1))
    return result

bad_time = benchmark_operation(bad_sequential_processing, num_iters=10)
good_time = benchmark_operation(good_batch_processing, num_iters=10)

print(f'Bad approach:  {bad_time:.2f} ms')
print(f'Good approach: {good_time:.2f} ms')
print(f'Speedup: {bad_time/good_time:.2f}x')

print('\n🎯 Key Takeaway: Batch operations whenever possible')
print('   Use bmm(), batch matrix operations, and higher-dimensional tensors')

❌ BAD: Processing one sample at a time
✅ GOOD: Batch processing
Bad approach:  1.71 ms
Good approach: 0.34 ms
Speedup: 5.03x

🎯 Key Takeaway: Batch operations whenever possible
   Use bmm(), batch matrix operations, and higher-dimensional tensors


# Lesson 5: In-place Operations

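**Memory Allocator Tax:** Every out-of-place operation needs a new output buffer. PyTorch's caching allocator makes repeated allocations cheap, so the main benefit of in-place operations is a lower peak memory footprint, with a modest speedup on top. Be aware that in-place operations can break autograd when they overwrite values needed for the backward pass.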

In [8]:
size = (2048, 2048)

print('❌ BAD: Creating new tensors')
def bad_memory_allocation():
    x = torch.randn(*size, device='cuda')
    y = torch.randn(*size, device='cuda')
    z = x + y  # Creates new tensor
    w = z * 2  # Creates another new tensor
    return w

print('✅ GOOD: In-place operations')
def good_inplace_operations():
    x = torch.randn(*size, device='cuda')
    y = torch.randn(*size, device='cuda')
    x.add_(y)  # In-place addition
    x.mul_(2)  # In-place multiplication
    return x

# Monitor memory usage
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

bad_time = benchmark_operation(bad_memory_allocation, num_iters=50)
bad_memory = torch.cuda.max_memory_allocated() / 1024**2

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

good_time = benchmark_operation(good_inplace_operations, num_iters=50)
good_memory = torch.cuda.max_memory_allocated() / 1024**2

print(f'Bad approach:  {bad_time:.2f} ms, {bad_memory:.1f} MB peak')
print(f'Good approach: {good_time:.2f} ms, {good_memory:.1f} MB peak')
print(f'Speedup: {bad_time/good_time:.2f}x')
print(f'Memory reduction: {bad_memory/good_memory:.2f}x')

print('\n🎯 Key Takeaway: Use in-place operations (add_, mul_, etc.)')
print('   Lowers peak memory; the speedup itself is usually modest')

❌ BAD: Creating new tensors
✅ GOOD: In-place operations
Bad approach:  0.59 ms, 103.8 MB peak
Good approach: 0.52 ms, 71.8 MB peak
Speedup: 1.13x
Memory reduction: 1.45x

🎯 Key Takeaway: Use in-place operations (add_, mul_, etc.)
   Lowers peak memory; the speedup itself is usually modest
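
The 32 MB gap between the two peaks can be reproduced by counting live tensors. The out-of-place function holds four 2048 × 2048 float32 tensors at once (`x`, `y`, `z`, `w`), while the in-place version holds two. The sketch below is plain arithmetic with a hypothetical `tensor_mib` helper; the absolute peaks reported above are higher because other tensors from earlier cells are still alive.

```python
def tensor_mib(shape, bytes_per_element: int = 4) -> float:
    """Size of a dense tensor in MiB (float32 by default)."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * bytes_per_element / 1024 ** 2

one_tensor = tensor_mib((2048, 2048))

out_of_place_live = 4 * one_tensor  # x, y, z, w alive at the same time
in_place_live = 2 * one_tensor      # only x and y are ever allocated

predicted_gap = out_of_place_live - in_place_live
measured_gap = 103.8 - 71.8         # peaks reported by the cell above

print(f"One tensor:    {one_tensor:.1f} MiB")
print(f"Predicted gap: {predicted_gap:.1f} MiB")
print(f"Measured gap:  {measured_gap:.1f} MiB")
```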


# Lesson 6: Tensor Core Optimization

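**Hardware Constraint:** Tensor Cores multiply small fixed-size tiles of float16 values (4×4 on the first generation). Libraries use them most efficiently when matrix dimensions are multiples of 8 or 16; other sizes may run at reduced efficiency or fall back to regular CUDA cores, depending on the GPU and library version, and the penalty can be large.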

In [9]:
print('Matrix multiply performance depends on tensor core compatibility:')

# Test different matrix sizes - tensor cores prefer certain dimensions.
# All sizes below are multiples of 16, so this run only shows the aligned case.
sizes = [512, 768, 1024, 1536, 2048]

print(f'{"Size":<8} {"Time (ms)":<10} {"TFLOPS":<10} {"Notes"}')
print('-' * 50)

for size in sizes:
    def matmul_test():
        A = torch.randn(size, size, dtype=torch.float16, device='cuda')
        B = torch.randn(size, size, dtype=torch.float16, device='cuda')
        return torch.mm(A, B)

    time_ms = benchmark_operation(matmul_test, num_iters=20)
    flops = 2 * size**3  # Matrix multiply FLOPS
    tflops = (flops / (time_ms * 1e-3)) / 1e12

    # Tensor cores work best with dimensions divisible by 8/16
    tc_friendly = '✅ TC-friendly' if size % 16 == 0 else '⚠️  Sub-optimal'

    print(f'{size:<8} {time_ms:<10.2f} {tflops:<10.2f} {tc_friendly}')

print('\n🎯 Key Takeaway: Use float16 and dimensions divisible by 16')
print('   This maximizes tensor core utilization on modern GPUs')

Matrix multiply performance depends on tensor core compatibility:
Size     Time (ms)  TFLOPS     Notes
--------------------------------------------------
512      0.08       3.36       ✅ TC-friendly
768      0.11       7.90       ✅ TC-friendly
1024     0.22       9.67       ✅ TC-friendly
1536     0.55       13.14      ✅ TC-friendly
2048     0.92       18.70      ✅ TC-friendly

🎯 Key Takeaway: Use float16 and dimensions divisible by 16
   This maximizes tensor core utilization on modern GPUs
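
When a dimension is not naturally aligned, a common workaround is to pad it up to the next multiple. The sketch below shows only the shape arithmetic, using hypothetical helpers `round_up` and `padded_shape`. The default multiple of 8 is an assumption taken from the "8/16" comment in the code above; pass 16 to follow the stricter takeaway.

```python
from typing import Sequence, Tuple

def round_up(value: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple

def padded_shape(shape: Sequence[int], multiple: int = 8) -> Tuple[int, ...]:
    """Round every dimension up to the alignment boundary."""
    return tuple(round_up(dim, multiple) for dim in shape)

def padding_overhead(shape: Sequence[int], multiple: int = 8) -> float:
    """Fraction of extra elements introduced by padding."""
    original, padded = 1, 1
    for dim in shape:
        original *= dim
        padded *= round_up(dim, multiple)
    return padded / original - 1.0

shape = (1001, 30)  # hypothetical misaligned matrix
print(f"{shape} -> {padded_shape(shape)} "
      f"(+{padding_overhead(shape):.1%} elements)")
print(f"{shape} -> {padded_shape(shape, 16)} with multiple=16")
```

Whether the padding pays off depends on the GPU and library version, so it is worth benchmarking both shapes before committing to it.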


# Lesson 7: Memory Access Patterns

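**Memory Layout Principle:** GPU threads read memory most efficiently when neighboring threads touch adjacent addresses (coalesced access). Strided access can split one wide load into several memory transactions. Note that `.contiguous()` is itself a full copy, so it only pays off when the copy is reused by several later operations; in the benchmark below the copy is paid on every call, and the "good" version ends up slower.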

In [10]:
size = (4096, 4096)
x = torch.randn(*size, device='cuda')

print('❌ BAD: Non-contiguous memory access')
def bad_memory_pattern():
    # Transpose creates a view with swapped strides (no data is copied)
    x_t = x.T
    return torch.sum(x_t, dim=0)  # Reduction over a strided view

print('✅ GOOD: Contiguous memory access')
def good_memory_pattern():
    # .contiguous() copies all 64 MB on every call, and that copy is timed too
    x_t = x.T.contiguous()
    return torch.sum(x_t, dim=0)  # Contiguous access

bad_time = benchmark_operation(bad_memory_pattern, num_iters=100)
good_time = benchmark_operation(good_memory_pattern, num_iters=100)

print(f'Bad approach:  {bad_time:.2f} ms')
print(f'Good approach: {good_time:.2f} ms')
print(f'Speedup: {bad_time/good_time:.2f}x')

print(f'\nMemory layout check:')
print(f'Original tensor is_contiguous: {x.is_contiguous()}')
print(f'Transposed tensor is_contiguous: {x.T.is_contiguous()}')
print(f'After .contiguous(): {x.T.contiguous().is_contiguous()}')

print('\n🎯 Key Takeaway: .contiguous() is a full copy, so use it only when reused')
print('   Check .is_contiguous() and measure before adding a copy')

❌ BAD: Non-contiguous memory access
✅ GOOD: Contiguous memory access
Bad approach:  0.27 ms
Good approach: 1.32 ms
Speedup: 0.21x

Memory layout check:
Original tensor is_contiguous: True
Transposed tensor is_contiguous: False
After .contiguous(): True

🎯 Key Takeaway: .contiguous() is a full copy, so use it only when reused
   Check .is_contiguous() and measure before adding a copy
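
The layout check printed above comes down to strides: how many elements to skip to advance one step along each dimension. The sketch below models this in plain Python with hypothetical helpers. `looks_contiguous` is a simplified version of the rule (it ignores special cases such as size-1 dimensions), not PyTorch's actual implementation.

```python
from typing import Sequence, Tuple

def row_major_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides of a densely packed row-major (C-order) tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def looks_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Simplified check: strides equal the row-major strides for this shape."""
    return tuple(strides) == row_major_strides(shape)

shape = (4096, 4096)
strides = row_major_strides(shape)

# A transpose swaps shape and strides without moving any data.
t_shape, t_strides = shape[::-1], strides[::-1]

print(f"original:   shape={shape}, strides={strides}, "
      f"contiguous={looks_contiguous(shape, strides)}")
print(f"transposed: shape={t_shape}, strides={t_strides}, "
      f"contiguous={looks_contiguous(t_shape, t_strides)}")
```

A transposed view therefore costs nothing to create; only `.contiguous()` rewrites the data into row-major order, which is the copy that dominated the timing above.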


# Lesson 8: Performance Profiling

**Measurement Principle:** You cannot optimize what you cannot measure. The profiler reveals the actual bottleneck, which often differs from what intuition suggests.

In [11]:
def example_neural_network():
    # Simple neural network operations
    x = torch.randn(1024, 512, device='cuda')
    # Weights must require gradients; otherwise loss.backward() raises
    # "element 0 of tensors does not require grad and does not have a grad_fn"
    W1 = torch.randn(512, 256, device='cuda', requires_grad=True)
    W2 = torch.randn(256, 10, device='cuda', requires_grad=True)

    # Forward pass
    h1 = torch.mm(x, W1)
    h1 = torch.relu(h1)
    output = torch.mm(h1, W2)
    loss = torch.sum(output**2)

    # Backward pass
    loss.backward()

    return loss

print('Running profiler example...')

# Profile the neural network
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(10):
        example_neural_network()

# Print profiling results
print('\nTop 5 GPU operations by time:')
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=5))

print('\n🎯 Key Takeaway: Use torch.profiler to identify bottlenecks')
print('   Focus optimization efforts on the most time-consuming operations')

Running profiler example...

# Summary: PyTorch CUDA Best Practices

## The Systematic Optimization Framework

**P(optimization_success|measurement) >> P(optimization_success|intuition)**

### Core Practices:

1. **📱 Create tensors directly on GPU** with `device='cuda'`
2. **🔢 Use float32** unless float64 precision is required
3. **🚫 Minimize CPU-GPU transfers** (`.cpu()`, `.cuda()`)
4. **📦 Batch operations** using `bmm()`, 3D+ tensors
5. **⚡ Use in-place operations** (`add_`, `mul_`, etc.) to save memory
6. **🎯 Leverage tensor cores** with float16 + dims divisible by 16
7. **🧠 Mind memory layout**: `.contiguous()` is a copy, so use it only when the copy is reused
8. **📊 Profile code** to identify actual bottlenecks
9. **🔄 Always use `torch.cuda.synchronize()`** for accurate timing
10. **🎮 Understand hardware limits** (memory vs compute bound)

### The Three Performance Regimes:

| **Regime** | **Characteristics** | **Solutions** |
|------------|--------------------|--------------|
| **Overhead-Bound** | Runtime doesn't scale with data size | Tracing, operator fusion, JIT compilation |
| **Memory-Bound** | Low FLOPS utilization, high bandwidth | Operator fusion, increase arithmetic intensity |
| **Compute-Bound** | High FLOPS utilization | Use Tensor Cores, upgrade hardware |
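
The memory-bound and compute-bound rows of the table can be told apart with a roofline estimate: an operation is memory-bound when its arithmetic intensity falls below the ratio of peak FLOPS to peak bandwidth. The sketch below assumes nominal T4 figures (8.1 TFLOPS FP32, 320 GB/s), which are vendor peak specifications rather than values measured in this notebook. The function names are hypothetical, and the overhead-bound regime is not captured by this model.

```python
# Assumed nominal T4 peak figures (vendor specifications, not measured here).
PEAK_FLOPS = 8.1e12      # FP32 FLOP/s
PEAK_BANDWIDTH = 320e9   # bytes/s

# Intensity (FLOP/byte) at which the compute and bandwidth limits meet.
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH

def attainable_tflops(intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and bandwidth * intensity."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * intensity) / 1e12

def classify(intensity: float) -> str:
    """Label an operation by which roof limits it."""
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"

examples = {
    "float32 elementwise add": 1 / 12,   # 1 FLOP per 12 bytes moved
    "float32 matmul, n=2048": 2048 / 6,  # 2n^3 FLOPs over 12n^2 bytes
}

print(f"Ridge point: {RIDGE_POINT:.1f} FLOP/byte")
for name, intensity in examples.items():
    print(f"{name:<26} {classify(intensity):<14} "
          f"<= {attainable_tflops(intensity):.2f} TFLOPS")
```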

### Key Formulas:

- **Arithmetic Intensity** = `FLOPS / Bytes_Accessed`
- **Memory Usage** = `batch_size × seq_len × hidden_dim × bytes_per_element`
- **P(tensor_core_usage|float16 + aligned_dims) ≈ 1.0**
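
As a worked instance of the memory formula, the sketch below sizes a single activation tensor. The shape (batch 32, sequence length 512, hidden size 768) is a hypothetical example, and `activation_bytes` is an illustrative helper. The formula covers one tensor only; weights, gradients, optimizer state, and the CUDA context come on top.

```python
def activation_bytes(batch_size: int, seq_len: int, hidden_dim: int,
                     bytes_per_element: int) -> int:
    """Bytes needed for one dense (batch, seq, hidden) tensor."""
    return batch_size * seq_len * hidden_dim * bytes_per_element

# Hypothetical shape, chosen only for illustration.
fp32_bytes = activation_bytes(32, 512, 768, 4)
fp16_bytes = activation_bytes(32, 512, 768, 2)

print(f"float32: {fp32_bytes / 1024 ** 2:.1f} MiB")
print(f"float16: {fp16_bytes / 1024 ** 2:.1f} MiB")
```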

**Remember:** The microbenchmarking results show that performance depends on arithmetic intensity. Optimize based on whether your operations are memory-bound or compute-bound!