Aryan V S (a-r-r-o-w)
164 followers · 109 following
AI & ML interests
computer vision, reinforcement learning
Recent Activity
liked a model about 11 hours ago: openai/gpt-oss-120b
liked a model about 11 hours ago: openai/gpt-oss-20b
replied to their post about 12 hours ago:
You've probably implemented the 3-loop matrix multiplication many times as an ML practitioner, but the naive implementation is terrible for GPU performance. Modern GPUs achieve peak performance through careful memory access patterns and by minimizing scheduling overhead.

In naive matmul (MxK . KxN), the computation happens in tiles - both for the output matrix and for how you read chunks from the input matrices. Each thread-block processes one output tile by loading the corresponding tiles from the inputs (for the sum-reduction across the K dimension), performing the computation, then terminating. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it gets assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the cost of launching thread-blocks each time a new tile is computed.

Persistent matmul changes this approach. Instead of launching thread-blocks to compute some output tiles, computing the results on the SMs in parallel, and repeating until all output tiles are done, you launch only as many thread-blocks as you have SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, looping through multiple tiles sequentially; each persistent thread-block may handle several output tiles. The key benefit is the reduced thread-block launch latency.

This persistence strategy, combined with other optimizations like coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling and other tricks, helps achieve peak performance. More on this in the future!

Code snippet for testing: https://gist.github.com/a-r-r-o-w/28339b442d164084506c0967029968a8

(Bonus: Since I've wanted to learn Manim for a while, this was a great opportunity to make a visualization of naive vs. persistent matmul. Enjoy ✨)
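As a rough illustration of the persistent pattern (a minimal sketch, not the code from the gist above): each resident thread-block strides over output tiles with step gridDim.x, and the grid is sized to the SM count queried via cudaDeviceGetAttribute. The TILE size, the plain shared-memory inner loop, and the launch configuration are illustrative assumptions, not a tuned kernel.

```cuda
// Minimal persistent tiled matmul sketch (illustrative; not the gist's code).
// C = A @ B with A: MxK, B: KxN, C: MxN, all row-major FP32.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32

__global__ void persistent_matmul(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int tilesM = (M + TILE - 1) / TILE;
    const int tilesN = (N + TILE - 1) / TILE;
    const int numTiles = tilesM * tilesN;

    // Persistent loop: each resident block keeps grabbing output tiles
    // (stride = gridDim.x) until every tile has been computed.
    for (int tile = blockIdx.x; tile < numTiles; tile += gridDim.x) {
        const int row = (tile / tilesN) * TILE + threadIdx.y;
        const int col = (tile % tilesN) * TILE + threadIdx.x;
        float acc = 0.0f;

        // Sum-reduction over K in TILE-wide chunks through shared memory.
        for (int k0 = 0; k0 < K; k0 += TILE) {
            As[threadIdx.y][threadIdx.x] =
                (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N) C[row * N + col] = acc;
    }
}

int main() {
    const int M = 1024, N = 1024, K = 1024;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * M * K);
    cudaMalloc(&B, sizeof(float) * K * N);
    cudaMalloc(&C, sizeof(float) * M * N);
    cudaMemset(A, 0, sizeof(float) * M * K);  // zeroed inputs; launch-shape demo only
    cudaMemset(B, 0, sizeof(float) * K * N);

    // Launch only as many thread-blocks as there are SMs; each stays resident
    // and loops over output tiles instead of exiting after one.
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    dim3 block(TILE, TILE);
    persistent_matmul<<<numSMs, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("ran persistent matmul with %d SM-resident blocks\n", numSMs);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Launching numSMs blocks of 1024 threads keeps one resident block per SM; a tuned persistent kernel would typically launch a small multiple of the SM count and layer on the double-buffering, warp-tiling, and warp-specialization tricks mentioned in the post.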
a-r-r-o-w's models (13), sorted by recently updated:
a-r-r-o-w/Wan-VACE-1.3B-diffusers • Updated May 20 • 1.08k downloads • 1 like
a-r-r-o-w/LTX-Video-0.9.7-Latent-Spatial-Upsampler-diffusers • Updated May 7 • 114 downloads
a-r-r-o-w/LTX-Video-0.9.7-diffusers • Updated May 7 • 303 downloads • 3 likes
a-r-r-o-w/LTX-Video-0.9.1-diffusers • Text-to-Video • Updated Mar 18 • 643 downloads • 7 likes
a-r-r-o-w/HunyuanVideo-tuxemons • Text-to-Video • Updated Jan 6 • 9 downloads • 5 likes
a-r-r-o-w/LTX-Video-diffusers • Text-to-Video • Updated Dec 25, 2024 • 733 downloads • 4 likes
a-r-r-o-w/cogvideox-disney-adamw-4000-0.0003-constant • Text-to-Video • Updated Oct 10, 2024 • 12 downloads • 4 likes
a-r-r-o-w/cogvideox-disney-adamw-3000-0.0003 • Text-to-Video • Updated Oct 10, 2024 • 9 downloads • 8 likes
a-r-r-o-w/ConsistencyTTA • Updated Jul 1, 2024 • 2 downloads
a-r-r-o-w/AnyText • Text-to-Image • Updated Jun 22, 2024 • 5 downloads • 1 like
a-r-r-o-w/animatediff-motion-adapter-sdxl-beta • Updated Apr 30, 2024 • 64 downloads • 3 likes
a-r-r-o-w/motionctrl-svd • Updated Mar 1, 2024 • 5 downloads • 2 likes
a-r-r-o-w/dragnuwa-svd • Updated Feb 24, 2024 • 3 downloads