CUDA Programming for NVIDIA H100s – Comprehensive Course
Learn CUDA programming for NVIDIA Hopper GPUs. The course teaches you to build efficient WGMMA pipelines and to leverage CUTLASS optimizations for the massive matrix multiplications that power modern AI. Beyond single-chip performance, the curriculum covers the multi-GPU scaling techniques and NCCL primitives needed to train trillion-parameter models. To get the most out of these lessons, you should have a foundational grasp of C++ syntax and linear algebra, particularly how matrices are tiled and multiplied. Short illustrative code sketches for several of these topics appear after the table of contents.

- Course website: https://cudacourseh100.github.io
- Course repo: https://github.com/cudacourseh100/H100-Course
- X: https://x.com/_PrateekShukla_
- GitHub Sponsors: https://github.com/sponsors/prateekshukla1108
✏️ Developed by @Prateek_Shukla
❤️ Support for this channel comes from our friends at Scrimba – the coding platform that's reinvented interactive learning: https://scrimba.com/freecodecamp
Contents
- 0:00:00 Course Introduction
- 0:07:27 Table of Contents & Course Overview
- 0:23:30 LESSON 1 — H100 Hopper GPU Architecture
- 0:25:47 H100 Specifications: HBM3, Bandwidth & Power
- 0:26:22 Tensor Cores Overview
- 0:27:18 Tensor Memory Accelerator (TMA)
- 0:34:44 Transformer Engine
- 0:34:58 L2 Cache Architecture
- 0:35:21 GPCs, TPCs & SM Layout
- 0:37:00 Thread Block Clusters
- 0:46:22 Distributed Shared Memory
- 0:52:44 SM Sub-Partitions (SMSPs)
- 0:54:01 Warp Schedulers & Dispatch Units
- 1:02:37 Shared Memory & Data Movement
- 1:12:20 Occupancy
- 1:32:49 LESSON 2 — Clusters, Data Types, Inline PTX & Pointers
- 1:32:57 Thread Block Clusters Programming
- 1:42:11 Configuring Cluster Dimensions
- 1:48:08 Inline PTX Assembly
- 1:59:31 State Spaces
- 2:06:01 Data Types in PTX
- 2:07:16 Generic Pointers
- 2:09:59 Address Space Conversion
- 2:15:14 LESSON 3 — Asynchronicity & Barriers
- 2:15:22 Introduction to Async Operations
- 2:28:06 Proxies
- 2:28:56 Fences & Memory Ordering
- 2:36:17 Fence Ordering & Visibility
- 2:38:58 Fence Scopes
- 2:40:30 Acquire & Release Fences
- 2:45:18 Expected Count & Thread Arrival
- 2:46:01 M-Barrier Arrive Operations
- 2:55:37 M-Barrier PTX Instructions
- 3:07:21 Barrier Wait Operations
- 3:10:03 Phase & Parity
- 3:59:42 Commit Operations
- 4:10:06 LESSON 4 — CuTensorMap Descriptors
- 4:16:18 Tensor Shape, Stride & Data Type
- 4:22:52 Element Stride & Dimensions
- 4:24:13 Box Dimensions (Tile Size)
- 4:30:30 Bank Conflicts
- 4:31:05 Swizzling
- 4:33:02 Swizzle Formula Deep Dive
- 4:52:48 Interleave Layouts
- 5:04:22 Out-of-Bounds Fill (OOB)
- 5:06:01 LESSON 5 — cp.async.bulk (Async Bulk Copies via TMA)
- 5:08:04 Bulk Tensor Operations (1D–5D)
- 5:27:31 Multicast Operations
- 5:47:44 Prefetch
- 5:53:41 LESSON 6 — WGMMA Part 1 (Warp Group Matrix Multiply Accumulate)
- 5:59:18 Warp Groups & Matrix Multiplication
- 6:03:21 WGMMA Descriptors
- 6:07:26 Accumulators & Register Reuse
- 6:30:34 Scale Factors (Scale D, Scale A, Scale B)
- 6:47:05 Core Matrices & 16×16 Tiles
- 7:44:03 LESSON 7 — WGMMA Part 2
- 7:46:02 Commit Groups & Wait Groups
- 8:04:31 WGMMA with FP8 Data Types
- 8:48:46 LESSON 8 — Kernel Design
- 8:50:58 Compute-Bound vs. Memory-Bound Kernels
- 8:54:10 Warp Specialization
- 9:08:56 Cooperative vs. Ping-Pong Pipelines
- 9:09:47 Pipelining Fundamentals
- 9:12:47 Circular Buffering
- 9:36:38 Ping-Pong Pipeline Deep Dive
- 9:37:34 Epilogue Handling in Pipelines
- 9:43:52 Persistent Scheduling
- 10:48:13 Split-K & Stream-K Strategies
- 10:57:20 Data-Parallel Tile Scheduling
- 11:35:23 Epilogue Fusion (Bias, Activation, Scaling)
- 11:41:35 Epilogue Operations Overview
- 12:05:32 CUTLASS SOURCE CODE WALKTHROUGH
- 13:04:17 Main Loop & Scheduling Policies
- 13:51:03 Dispatch Policy
- 15:18:49 SM90 Tile Scheduler
- 17:58:46 SM90 Epilogue (TMA Warp Specialized)
- 19:22:42 SM90 Builder
- 19:44:58 Collective Builder
- 19:49:56 FAST.CU KERNEL WALKTHROUGH
- 19:55:19 Main Loop Implementation
- 20:06:51 Producer Warp Group (Dependence Wall)
- 20:12:08 Consumer Warp Group
- 21:29:30 Prologue
- 21:47:09 MULTI-GPU PROGRAMMING — Part 1
- 21:56:19 NVSwitch
- 22:03:23 Topology & System Architecture
- 22:17:03 NVSwitch, BlueField DPUs & Storage Fabrics
- 22:37:19 CUDA Peer-to-Peer Communication
- 22:37:57 MPI (Message Passing Interface)
- 22:46:29 P2P Limitations & Trade-offs
- 22:49:29 MULTI-GPU PROGRAMMING — Part 2
- 22:52:20 SLURM Resource Allocation
- 22:52:52 PMIx Process Management
- 23:05:27 NCCL (NVIDIA Collective Communications Library)
- 23:15:20 NCCL Internals & Ring Algorithm
- 23:17:43 AllReduce Operations
- 23:34:18 NCCL Collectives: Broadcast, AllGather, ReduceScatter
- 23:36:16 Parallelism Strategies: Data, Tensor, Pipeline & Expert Parallelism
- 24:37:56 Course Conclusion & Next Steps
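Code Sketches
The sketches below illustrate a few of the techniques listed above. They are minimal, hedged outlines written for this summary, not excerpts from the course; all kernel names, sizes, and launch parameters are illustrative assumptions.

Occupancy (Lesson 1). A sketch of querying theoretical occupancy with the CUDA runtime API; the 256-thread block size and the toy kernel are assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void toy_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    int blocks_per_sm = 0;
    // Maximum resident blocks per SM for 256 threads/block, 0 B dynamic smem.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, toy_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (blocks_per_sm * 256.0f) / prop.maxThreadsPerMultiProcessor;
    std::printf("theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}
```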
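Thread block clusters and distributed shared memory (Lessons 1-2). A sketch of the CUDA 12 cooperative-groups cluster API on sm_90: each block writes its own shared memory, then reads a peer block's shared memory through map_shared_rank. Assumes a launch with 256 threads per block and a grid size that is a multiple of the 2-block cluster.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(int* out) {
    __shared__ int smem[256];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();

    smem[threadIdx.x] = (int)rank;
    cluster.sync();  // every block's smem is now written

    // Distributed shared memory: map the peer block's smem into this one.
    int* peer = cluster.map_shared_rank(smem, rank ^ 1);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer[threadIdx.x];

    cluster.sync();  // keep smem alive until all remote reads complete
}
```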
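Inline PTX and address-space conversion (Lesson 2). A sketch of reading a special register with inline PTX and converting a generic pointer to a 32-bit shared-state-space address, which shared-memory PTX instructions (mbarrier.*, cp.async.*) require.

```cuda
#include <cstdint>

__global__ void ptx_demo(unsigned int* out) {
    __shared__ int buf[32];

    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));  // which SM runs this block?

    // Generic 64-bit pointer -> 32-bit .shared address.
    uint32_t smem_addr = (uint32_t)__cvta_generic_to_shared(buf);

    if (threadIdx.x == 0) {
        out[2 * blockIdx.x]     = smid;
        out[2 * blockIdx.x + 1] = smem_addr;
    }
}
```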
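Asynchronous copies and barriers (Lesson 3). A sketch using libcu++ cuda::barrier, which lowers to Hopper mbarrier instructions on sm_90; the 256-element tile, the 256-thread block, and the doubling epilogue are assumptions.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>

__global__ void async_copy_kernel(const float* in, float* out) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0) init(&bar, block.size());
    block.sync();

    // Stage a tile into shared memory asynchronously; completion is
    // tracked by the barrier.
    cuda::memcpy_async(block, tile, in + blockIdx.x * 256,
                       sizeof(float) * 256, bar);

    bar.arrive_and_wait();  // the data has landed in shared memory
    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```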
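CuTensorMap descriptors (Lesson 4). A host-side sketch of encoding a tiled CUtensorMap with the driver API for a row-major rows x cols FP32 matrix and a 32 x 64 box (tile). The sizes and the no-swizzle choice are assumptions (the swizzle lessons cover CU_TENSOR_MAP_SWIZZLE_128B and friends for bank-conflict avoidance); cols is assumed to be a multiple of 4 so the row stride is 16-byte aligned.

```cuda
#include <cuda.h>
#include <cstdint>
#include <cstdio>

CUtensorMap make_tensor_map(void* gmem, uint64_t rows, uint64_t cols) {
    CUtensorMap tmap;
    uint64_t global_dim[2]    = {cols, rows};            // fastest dim first
    uint64_t global_stride[1] = {cols * sizeof(float)};  // bytes, dims 1..rank-1
    uint32_t box_dim[2]       = {32, 64};                // the tile TMA moves
    uint32_t elem_stride[2]   = {1, 1};                  // dense elements

    CUresult res = cuTensorMapEncodeTiled(
        &tmap, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
        /*tensorRank=*/2, gmem, global_dim, global_stride,
        box_dim, elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_NONE,
        CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);  // OOB reads are zero-filled
    if (res != CUDA_SUCCESS) std::printf("encode failed: %d\n", (int)res);
    return tmap;
}
```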
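Swizzle formula (Lesson 4). A hedged sketch of a 128-byte swizzle in the style of CUTLASS's Swizzle<3,4,3>: XOR address bits [7..9] into bits [4..6], permuting 16-byte chunks so that rows which would otherwise hit the same shared-memory banks are spread across banks.

```cuda
#include <cstdint>

__host__ __device__ inline uint32_t swizzle_128B(uint32_t byte_offset) {
    return byte_offset ^ (((byte_offset >> 7) & 0x7u) << 4);
}
```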
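cp.async.bulk.tensor via TMA (Lesson 5). A device-side sketch, in raw inline PTX, of loading the 32 x 64 tile described by the CUtensorMap above through an mbarrier. The mnemonics follow the PTX ISA, but treat this as an outline to check against the ISA manual rather than production code; the launch is assumed to use at least 32 threads.

```cuda
#include <cuda.h>
#include <cstdint>

__global__ void tma_load_kernel(const __grid_constant__ CUtensorMap tmap,
                                float* out) {
    __shared__ alignas(128) float tile[64][32];  // matches box_dim {32, 64}
    __shared__ alignas(8) uint64_t mbar;

    uint32_t bar = (uint32_t)__cvta_generic_to_shared(&mbar);
    uint32_t dst = (uint32_t)__cvta_generic_to_shared(tile);
    constexpr uint32_t bytes = sizeof(tile);

    if (threadIdx.x == 0) {
        // One arriving thread; the barrier also tracks `bytes` of TMA traffic.
        asm volatile("mbarrier.init.shared.b64 [%0], 1;" :: "r"(bar));
        asm volatile("fence.proxy.async.shared::cta;" ::: "memory");  // expose init to TMA
        uint64_t state;
        asm volatile("mbarrier.arrive.expect_tx.shared.b64 %0, [%1], %2;"
                     : "=l"(state) : "r"(bar), "r"(bytes));
        // Bulk-copy the tile at global coordinates (0, 0) into shared memory.
        asm volatile(
            "cp.async.bulk.tensor.2d.shared::cluster.global.tile"
            ".mbarrier::complete_tx::bytes [%0], [%1, {%2, %3}], [%4];"
            :: "r"(dst), "l"(&tmap), "r"(0), "r"(0), "r"(bar) : "memory");
    }
    __syncthreads();

    // Spin until phase 0 of the barrier completes.
    uint32_t phase = 0, done = 0;
    while (!done) {
        asm volatile(
            "{ .reg .pred p;\n"
            "  mbarrier.try_wait.parity.shared.b64 p, [%1], %2;\n"
            "  selp.u32 %0, 1, 0, p; }"
            : "=r"(done) : "r"(bar), "r"(phase));
    }
    if (threadIdx.x < 32) out[threadIdx.x] = tile[0][threadIdx.x];
}
```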
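WGMMA commit and wait groups (Lessons 6-7). A sketch of the synchronization scaffolding only; the wgmma.mma_async instruction itself is elided because it needs matching shared-memory descriptor encodings, which Lessons 6-7 cover in detail.

```cuda
__device__ void wgmma_sync_skeleton() {
    // Order prior register/shared-memory accesses before the async MMAs.
    asm volatile("wgmma.fence.sync.aligned;" ::: "memory");

    // ... issue one or more wgmma.mma_async instructions here ...

    // Batch the issued MMAs into one commit group.
    asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");

    // Wait until at most 0 groups remain in flight, i.e. all issued
    // WGMMAs are done and their accumulators may be read.
    asm volatile("wgmma.wait_group.sync.aligned 0;" ::: "memory");
}
```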
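Warp specialization and circular buffering (Lesson 8). A sketch of a producer/consumer split over a 2-stage circular buffer using libcu++ cuda::pipeline; the 256-thread block, the 128-thread producer warp group, and the per-element copies are illustrative assumptions.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

constexpr int STAGES = 2;
constexpr int TILE   = 256;

__global__ void warp_specialized(const float* in, float* out, int ntiles) {
    __shared__ float buf[STAGES][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;

    auto block = cooperative_groups::this_thread_block();
    bool is_producer = threadIdx.x < 128;  // first warp group produces
    auto pipe = cuda::make_pipeline(block, &state,
                                    is_producer ? cuda::pipeline_role::producer
                                                : cuda::pipeline_role::consumer);

    for (int t = 0; t < ntiles; ++t) {
        int stage = t % STAGES;
        if (is_producer) {
            pipe.producer_acquire();  // wait for a free stage
            for (int i = threadIdx.x; i < TILE; i += 128)
                cuda::memcpy_async(&buf[stage][i], in + t * TILE + i,
                                   sizeof(float), pipe);
            pipe.producer_commit();   // publish the stage
        } else {
            pipe.consumer_wait();     // wait for the stage's data
            for (int i = threadIdx.x - 128; i < TILE; i += 128)
                out[t * TILE + i] = buf[stage][i] * 2.0f;
            pipe.consumer_release();  // recycle the stage
        }
    }
}
```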
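CUDA peer-to-peer (Multi-GPU Part 1). A sketch that enables direct access between GPUs 0 and 1 (e.g. over NVLink/NVSwitch) and copies a buffer device-to-device without staging through host memory; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { std::printf("no P2P path between GPUs 0 and 1\n"); return 1; }

    float *buf0 = nullptr, *buf1 = nullptr;
    size_t bytes = 1 << 20;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);  // GPU1 <- GPU0, direct
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```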
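NCCL AllReduce (Multi-GPU Part 2). A single-process sketch that sum-reduces a buffer across all visible GPUs; the buffer size is an assumption, and error checking is again omitted.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    std::vector<ncclComm_t> comms(n);
    std::vector<int> devs(n);
    for (int i = 0; i < n; ++i) devs[i] = i;
    ncclCommInitAll(comms.data(), n, devs.data());  // one rank per GPU

    size_t count = 1 << 20;
    std::vector<float*> send(n), recv(n);
    std::vector<cudaStream_t> streams(n);
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&send[i], count * sizeof(float));
        cudaMalloc(&recv[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-rank calls so NCCL can launch them as one collective.
    ncclGroupStart();
    for (int i = 0; i < n; ++i)
        ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```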