Shengshu Technology

Research

TurboDiffusion: a 100-200x Acceleration Framework for Video Generation

December 5, 2025

To improve the inference speed of diffusion models, especially video generation models, Tsinghua University's TSAIL Lab and Shengshu Technology have jointly released TurboDiffusion, a diffusion acceleration framework that combines multiple acceleration techniques and speeds up video generation by 100-200 times.

TurboDiffusion mainly relies on four techniques to accelerate diffusion models. First, it uses SageAttention (specifically the SageAttention2++ version) for low-bit quantized attention. Second, it employs Sparse-Linear Attention (SLA) to exploit attention sparsity; since sparse computation is orthogonal to low-bit Tensor Core acceleration, SLA can be stacked on top of SageAttention for an additional multi-fold speedup during inference. Third, it leverages rCM, currently a state-of-the-art method for step distillation. Finally, it applies W8A8 quantization to accelerate Linear layers, using the INT8 data type with block-wise quantization at a block size of 128×128.

These four core technologies were independently developed by the Tsinghua University TSAIL team and Shengshu Technology. They carry milestone significance and far-reaching impact both for breakthroughs in multimodal AI foundation models and for their industrial-scale deployment. In particular, SageAttention is the first method to enable low-bit attention acceleration and has already been deployed at scale across the industry. For example, SageAttention has been successfully integrated into NVIDIA's inference engine TensorRT, and has also been deployed and productionized on major GPU platforms such as Huawei Ascend and Moore Threads S6000.
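To make the W8A8 setting concrete, here is a minimal NumPy sketch of block-wise INT8 weight quantization with one scale per 128×128 block, as described above. This is an illustration of the general technique only; the function names are hypothetical and TurboDiffusion's actual kernels and APIs are not shown.

```python
import numpy as np

def quantize_blockwise_int8(w, block=128):
    """Quantize a 2-D float weight matrix to INT8, one scale per block x block tile.

    Illustrative sketch of block-wise quantization (block size 128x128, as in
    the W8A8 scheme described in the text); not TurboDiffusion's actual code.
    Assumes the matrix dimensions are multiples of `block`.
    """
    rows, cols = w.shape
    q = np.empty((rows, cols), dtype=np.int8)
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            blk = w[i:i + block, j:j + block]
            # Symmetric quantization: map the block's max magnitude to 127.
            s = max(np.abs(blk).max() / 127.0, 1e-8)
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = np.round(blk / s).astype(np.int8)
    return q, scales

def dequantize_blockwise_int8(q, scales, block=128):
    """Reconstruct an approximate float matrix from INT8 values and block scales."""
    w = q.astype(np.float32)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            w[i * block:(i + 1) * block, j * block:(j + 1) * block] *= scales[i, j]
    return w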
In addition, leading global and domestic technology companies and teams, including Tencent Hunyuan, ByteDance Doubao, Alibaba Tora, Shengshu Vidu, Zhipu Qingying, Baidu PaddlePaddle, Kunlun Wanwei, Google Veo3, SenseTime, and vLLM, have adopted this technology in their core products, generating substantial economic value.

On the open-sourced T2V and I2V models, TurboDiffusion achieves 100× and 200× end-to-end video generation speedups, respectively, on a single RTX 5090. For more details, please refer to TurboDiffusion. Applying the TurboDiffusion techniques to the Vidu model likewise delivers a very large inference speedup without sacrificing video generation quality. For example, when generating an 8-second high-quality video at 1080P resolution, TurboDiffusion reduces the end-to-end generation latency from 900 seconds to 8 seconds, compared with video generation without inference-acceleration optimizations.
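As a quick consistency check on the Vidu numbers above (both latencies are taken from the text), the end-to-end figure lands within the 100-200× range claimed for the framework:

```python
# Latencies from the text: 8-second 1080P video, before and after optimization.
baseline_latency_s = 900
turbo_latency_s = 8

speedup = baseline_latency_s / turbo_latency_s
print(f"End-to-end speedup: {speedup:.1f}x")  # 112.5x, within the stated 100-200x range
```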