Aryan V S
a-r-r-o-w
164 followers · 109 following
AI & ML interests
computer vision, reinforcement learning
Recent Activity
liked a model about 14 hours ago: openai/gpt-oss-120b
liked a model about 14 hours ago: openai/gpt-oss-20b
replied to their post about 15 hours ago:
You would've implemented the 3-loop matrix multiplication many times as an ML practitioner, but the naive implementation is terrible for GPU performance. Modern GPUs achieve peak performance through careful memory access patterns and minimizing scheduling overhead.

In naive matmul (MxK . KxN), the computation happens in tiles: both for the output matrix and for how you read chunks from the input matrices. Each thread-block processes one output tile by loading the corresponding tiles from the inputs (for the sum-reduction across the K dimension), performing the computation, then terminating. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it gets assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the cost of launching a thread-block each time a new tile is computed.

Persistent matmul changes this approach. Instead of launching thread-blocks to compute some output tiles, computing the results on SMs in parallel, and repeating until all output tiles are computed, you launch only as many thread-blocks as you have SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, looping through multiple tiles sequentially, so each persistent thread-block may handle multiple output tiles. The key benefit is the reduced thread-block launch latency.

This persistence strategy, combined with other optimizations like coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling and other tricks, helps achieve peak performance. More on this in the future!

Code snippet for testing: https://gist.github.com/a-r-r-o-w/28339b442d164084506c0967029968a8

(Bonus: Since I've wanted to learn Manim for a while, this was a great opportunity to make a visualization of Naive vs. Persistent matmul. Enjoy ✨)
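To make the persistent launch pattern concrete, here is a minimal CUDA sketch. It is independent of the linked gist (whose contents may differ); the kernel name, the `TILE` size, and the plain shared-memory tiling are illustrative assumptions, and a production kernel would add the coalescing, warp-tiling, and double-buffering tricks mentioned in the post. The point it shows is the launch strategy: `num_sms` resident thread-blocks stride over the output tiles instead of one block being launched per tile.

```cuda
// Minimal persistent-matmul sketch (illustrative, not the gist's code).
// C = A (MxK) * B (KxN), row-major floats, shared-memory tiled.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int TILE = 32;  // square output tile handled per iteration of one block

__global__ void persistent_matmul(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    int tiles_m = (M + TILE - 1) / TILE;
    int tiles_n = (N + TILE - 1) / TILE;
    int num_tiles = tiles_m * tiles_n;

    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    // Persistent loop: each resident block walks the output tiles with a
    // grid-sized stride, instead of terminating after a single tile.
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int tile_m = tile / tiles_n;            // output tile row
        int tile_n = tile % tiles_n;            // output tile column
        int row = tile_m * TILE + threadIdx.y;  // element this thread owns
        int col = tile_n * TILE + threadIdx.x;

        float acc = 0.0f;
        // Sum-reduction across K: stream matching tiles of A and B through shared memory.
        for (int k0 = 0; k0 < K; k0 += TILE) {
            As[threadIdx.y][threadIdx.x] =
                (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N)
            C[row * N + col] = acc;
    }
}

int main() {
    int M = 1024, N = 1024, K = 1024;
    float *A, *B, *C;
    // Buffers are left uninitialized: this sketch only exercises the launch pattern.
    cudaMalloc(&A, M * K * sizeof(float));
    cudaMalloc(&B, K * N * sizeof(float));
    cudaMalloc(&C, M * N * sizeof(float));

    // "Persistent" launch: one thread-block per SM rather than one per output tile.
    int device = 0, num_sms = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

    dim3 block(TILE, TILE);
    persistent_matmul<<<num_sms, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("done (launched %d persistent blocks)\n", num_sms);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Swapping the grid size from `num_sms` to `tiles_m * tiles_n` (and dropping the tile loop inside the kernel) recovers the naive one-block-per-tile version, which makes it easy to compare the two launch strategies side by side.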
a-r-r-o-w's datasets (6)
a-r-r-o-w/trojblue-pixelart-videos • Viewer • Updated Mar 9 • 173 • 170
a-r-r-o-w/trojblue-pixelart-images • Viewer • Updated Mar 9 • 140 • 19 • 1
a-r-r-o-w/penguin-video-benchmark • Viewer • Updated Feb 23 • 600 • 22
a-r-r-o-w/flux-retrostyle-dataset-mini • Viewer • Updated Jan 5 • 100 • 15
a-r-r-o-w/randoms • Viewer • Updated Nov 25, 2024 • 693 • 2.14k
a-r-r-o-w/tiny-meme-dataset-captioned • Updated Sep 19, 2024 • 137 • 1