Stanford CS25: V5 I Transformers in Diffusion Models for Image Generation and Beyond

May 27, 2025
Sayak Paul of Hugging Face

Diffusion models have been all the rage in recent times for generating realistic yet synthetic continuous media content. This talk covers how Transformers are used in diffusion models for image generation and for applications beyond it.
We set the context by briefly discussing some preliminaries around diffusion models and how they are trained. We then cover the UNet-based network architecture that used to be the de facto choice for diffusion models. This helps us motivate the introduction and rise of transformer-based architectures for diffusion.

We cover the fundamental blocks and the degrees of freedom one can ablate in the base architecture in different conditional settings. We then shift our focus to the different flavors of attention and other connected components that the community has been using in some of the SoTA open models for various use cases. We conclude by shedding light on some promising future directions around efficiency.

Speaker: Sayak works on diffusion models at Hugging Face. His day-to-day includes contributing to the diffusers library, training and babysitting diffusion models, and working on applied ideas. He's interested in subject-driven generation, preference alignment, and evaluation of diffusion models. When he is not working, he can be found playing the guitar and binge-watching ICML tutorials and Suits.

More about the course can be found here: https://web.stanford.edu/class/cs25/

View the entire CS25 Transformers United playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM
