mysora/docs/hcae.md

39 lines
2.1 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 10× inference speedup with high-compression autoencoder
The high computational cost of training video generation models arises from the
large number of tokens and the dominance of attention computation. To further reduce training expenses,
we explore training video generation models with high-compression autoencoders (Video DC-AEs). As shown in the comparason below, by switching to the Video DC-AE with a much higher downsample ratio (4 x 32 x 32), we can afford to further reduce the patch size to 1 and still achieve __5.2× speedup in training throughput__ and __10x speedup during inference__:
![opensorav2_speed](https://github.com/hpcaitech/Open-Sora-Demo/blob/main/readme/hcae_opensorav2_speed.png)
Nevertheless, despite the advantanges in drastically lower computation costs, other challenges remain. For instance, larger channels low down convergance. Our generation model adapted with a 128-channel Video DC-AE for 25K iterations achieves a loss level of 0.5, as compared to 0.1 from the initialization model. While the fast video generation model underperforms the original, it still captures spatial-temporal
relationships. We release this model to the research community for further exploration.
Checkout more details in our [report](https://arxiv.org/abs/2503.09642v1).
## Model Download
Download from 🤗 [Huggingface](https://huggingface.co/hpcai-tech/Open-Sora-v2-Video-DC-AE):
```bash
pip install "huggingface_hub[cli]"
huggingface-cli download hpcai-tech/Open-Sora-v2-Video-DC-AE --local-dir ./ckpts
```
## Inference
To inference on our fast video generation model:
```bash
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/high_compression.py --prompt "The story of a robot's life in a cyberpunk setting."
```
## Training
Follow this [guide](./train.md#prepare-dataset) to parepare the __DATASET__ for training.
Then, you may train your own fast generation model with the following command:
```bash
torchrun --nproc_per_node 8 scripts/diffusion/train.py configs/diffusion/train/high_compression.py --dataset.data-path datasets/pexels_45k_necessary.csv
```