# Dataset Management
- [Dataset Management](#dataset-management)
  - [Dataset Format](#dataset-format)
  - [Dataset to CSV](#dataset-to-csv)
  - [Manage datasets](#manage-datasets)
    - [Requirement](#requirement)
    - [Basic Usage](#basic-usage)
    - [Score filtering](#score-filtering)
    - [Documentation](#documentation)
  - [Transform datasets](#transform-datasets)
    - [Resize](#resize)
    - [Frame extraction](#frame-extraction)
    - [Crop Midjourney 4 grid](#crop-midjourney-4-grid)
  - [Analyze datasets](#analyze-datasets)
  - [Data Process Pipeline](#data-process-pipeline)

After preparing the raw dataset according to the [instructions](/docs/datasets.md), you can use the following commands to manage the dataset.
## Dataset Format
All datasets should be provided as a `.csv` file (or `parquet.gzip` to save space), which is used for both training and data preprocessing. The following column names are recognized:
- `path`: the relative/absolute path or url to the image or video file. Required.
- `text`: the caption or description of the image or video. Required for training.
- `num_frames`: the number of frames in the video. Required for training.
- `width`: the width of the video frame. Required for dynamic bucket.
- `height`: the height of the video frame. Required for dynamic bucket.
- `aspect_ratio`: the aspect ratio of the video frame (height / width). Required for dynamic bucket.
- `resolution`: height x width. For analysis.
- `text_len`: the number of tokens in the text. For analysis.
- `aes`: aesthetic score calculated by the [aesthetic scorer](/tools/aesthetic/README.md). For filtering.
- `flow`: optical flow score calculated by [UniMatch](/tools/scoring/README.md). For filtering.
- `match`: matching score of an image-text/video-text pair calculated by [CLIP](/tools/scoring/README.md). For filtering.
- `fps`: the frame rate of the video. Optional.
- `cmotion`: the camera motion.

An example ready for training:
```csv
path, text, num_frames, width, height, aspect_ratio
/absolute/path/to/image1.jpg, caption, 1, 1280, 720, 0.5625
/absolute/path/to/video1.mp4, caption, 120, 1280, 720, 0.5625
/absolute/path/to/video2.mp4, caption, 20, 256, 256, 1
```
We use pandas to manage the `.csv` or `.parquet` files. The following code is for reading and writing files:
```python
import pandas as pd

df = pd.read_csv(input_path)
df.to_csv(output_path, index=False)  # to_csv returns None, so do not reassign df
# or use parquet, which is smaller
df = pd.read_parquet(input_path)
df.to_parquet(output_path, index=False)
```
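The derived columns follow directly from `height` and `width`. The snippet below is a minimal sketch using the column definitions listed in this section; the toy rows are made up for illustration, and this is not necessarily how `datautil` computes them.

```python
import pandas as pd

# Toy metadata; in practice this would come from pd.read_csv(input_path).
df = pd.DataFrame(
    {
        "path": ["video1.mp4", "image1.jpg"],
        "height": [720, 256],
        "width": [1280, 256],
    }
)

# aspect_ratio is defined in this section as height / width.
df["aspect_ratio"] = df["height"] / df["width"]
# resolution is defined in this section as height x width (a pixel count).
df["resolution"] = df["height"] * df["width"]
```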
## Dataset to CSV
As a starting point, use `convert.py` to convert a dataset to a CSV file:
```bash
python -m tools.datasets.convert DATASET_TYPE DATA_FOLDER
# general video folder
python -m tools.datasets.convert video VIDEO_FOLDER --output video.csv
# general image folder
python -m tools.datasets.convert image IMAGE_FOLDER --output image.csv
# imagenet
python -m tools.datasets.convert imagenet IMAGENET_FOLDER --split train
# ucf101
python -m tools.datasets.convert ucf101 UCF101_FOLDER --split videos
# vidprom
python -m tools.datasets.convert vidprom VIDPROM_FOLDER --info VidProM_semantic_unique.csv
```
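Conceptually, converting a general video folder amounts to walking the folder and writing one `path` row per file. The sketch below illustrates that idea with the standard library only; `scan_videos`, `write_path_csv`, and the extension list are hypothetical names chosen for this example, and the actual `convert.py` may behave differently.

```python
import csv
import os

# Hypothetical extension list; the real tool may recognize other formats.
VIDEO_EXTS = (".mp4", ".avi", ".mov", ".mkv")


def scan_videos(folder):
    """Return the sorted absolute paths of video files found under `folder`."""
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(VIDEO_EXTS):
                paths.append(os.path.abspath(os.path.join(root, name)))
    return sorted(paths)


def write_path_csv(paths, output):
    """Write a one-column CSV with the header `path`."""
    with open(output, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path"])
        writer.writerows([p] for p in paths)
```

For example, `write_path_csv(scan_videos("VIDEO_FOLDER"), "video.csv")` would produce a file with the single required column.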
## Manage datasets
Use `datautil` to manage the dataset.
### Requirement
Follow the "Data Dependencies" and "Datasets" sections of our [installation guide](../../docs/installation.md) to install the required packages.
<!-- To accelerate processing speed, you can install [pandarallel](https://github.com/nalepae/pandarallel):
```bash
pip install pandarallel
``` -->
<!-- To get image and video information, you need to install [opencv-python](https://github.com/opencv/opencv-python): -->
<!-- ```bash
pip install opencv-python
# If your videos are in av1 codec instead of h264, you need to
# - install ffmpeg first
# - install via conda to support av1 codec
conda install -c conda-forge opencv
``` -->
<!-- Or to get video information, you can install ffmpeg and ffmpeg-python:
```bash
pip install ffmpeg-python
``` -->
<!-- To filter a specific language, you need to install [lingua](https://github.com/pemistahl/lingua-py):
```bash
pip install lingua-language-detector
``` -->
### Basic Usage
You can use the following commands to process `csv` or `parquet` files. The output file is saved in the same directory as the input, with a suffix indicating each operation that was applied.
```bash
# datautil takes multiple CSV files as input and merges them into one CSV file
# output: DATA1+DATA2.csv
python -m tools.datasets.datautil DATA1.csv DATA2.csv
# shard a CSV file into multiple CSV files
# output: DATA1_0.csv, DATA1_1.csv, ...
python -m tools.datasets.datautil DATA1.csv --shard 10
# keep rows whose number of frames is between 128 and 256
# output: DATA_fmin128_fmax256.csv
python -m tools.datasets.datautil DATA.csv --fmin 128 --fmax 256
# Disable parallel processing
python -m tools.datasets.datautil DATA.csv --fmin 128 --fmax 256 --disable-parallel
# Compute num_frames, height, width, fps, aspect_ratio for videos or images
# output: IMG_DATA+VID_DATA_info.csv
python -m tools.datasets.datautil IMG_DATA.csv VID_DATA.csv --info
# You can run multiple operations at the same time.
python -m tools.datasets.datautil DATA.csv --info --remove-empty-caption --remove-url --lang en
```
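For readers who prefer to work in pandas directly, the merge, frame filter, and shard operations can be approximated as follows. This is a sketch with made-up rows; the inclusive bounds and the contiguous sharding scheme are assumptions, not a description of `datautil` internals.

```python
import pandas as pd

data1 = pd.DataFrame({"path": ["a.mp4", "b.mp4"], "num_frames": [64, 200]})
data2 = pd.DataFrame({"path": ["c.mp4", "d.mp4", "e.mp4"], "num_frames": [130, 300, 256]})

# Merge: stack the rows of several metadata tables into one.
merged = pd.concat([data1, data2], ignore_index=True)

# Frame filter: keep rows with fmin <= num_frames <= fmax (inclusive bounds assumed).
fmin, fmax = 128, 256
filtered = merged[(merged["num_frames"] >= fmin) & (merged["num_frames"] <= fmax)]

# Shard: split the rows into exactly num_shards contiguous pieces of nearly equal size.
num_shards = 2
bounds = [len(merged) * i // num_shards for i in range(num_shards + 1)]
shards = [merged.iloc[bounds[i] : bounds[i + 1]] for i in range(num_shards)]
```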
### Score filtering
To examine and filter the dataset by aesthetic score and CLIP score, use the following commands:
```bash
# sort the dataset by aesthetic score
# output: DATA_sort.csv
python -m tools.datasets.datautil DATA.csv --sort aes
# View examples of high aesthetic score
head -n 10 DATA_sort.csv
# View examples of low aesthetic score
tail -n 10 DATA_sort.csv
# sort the dataset by clip score
# output: DATA_sort.csv
python -m tools.datasets.datautil DATA.csv --sort match
# filter the dataset by aesthetic score
# output: DATA_aesmin0.5.csv
python -m tools.datasets.datautil DATA.csv --aesmin 0.5
# filter the dataset by clip score
# output: DATA_matchmin0.5.csv
python -m tools.datasets.datautil DATA.csv --matchmin 0.5
```
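The same inspection can be done interactively in pandas. The sketch below assumes the scores live in the `aes` and `match` columns described in [Dataset Format](#dataset-format); the rows and the threshold are made up, and treating the threshold as inclusive is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "path": ["a.mp4", "b.mp4", "c.mp4"],
        "aes": [4.2, 6.1, 5.0],
        "match": [0.31, 0.22, 0.27],
    }
)

# Sort so that the highest aesthetic scores come first.
by_aes = df.sort_values(by="aes", ascending=False)

# Keep rows whose aesthetic score reaches the threshold (>= is an assumption).
aesmin = 5.0
kept = df[df["aes"] >= aesmin]
```

Calling `by_aes.head()` and `by_aes.tail()` then shows the best and worst samples, which helps when choosing a threshold.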
### Documentation
You can also run `python -m tools.datasets.datautil --help` to see the full usage.
| Args | File suffix | Description |
| --------------------------- | -------------- | ------------------------------------------------------------- |
| `--output OUTPUT` | | Output path |
| `--format FORMAT` | | Output format (csv, parquet, parquet.gzip) |
| `--disable-parallel` | | Disable `pandarallel` |
| `--seed SEED` | | Random seed |
| `--shard SHARD` | `_0`,`_1`, ... | Shard the dataset |
| `--sort KEY` | `_sort` | Sort the dataset by KEY |
| `--sort-descending KEY` | `_sort` | Sort the dataset by KEY in descending order |
| `--difference DATA.csv` | | Remove the paths in DATA.csv from the dataset |
| `--intersection DATA.csv` | | Keep the paths in DATA.csv from the dataset and merge columns |
| `--info` | `_info` | Get the basic information of each video and image (cv2) |
| `--ext` | `_ext` | Remove rows if the file does not exist |
| `--relpath` | `_relpath` | Convert paths to relative paths with respect to the given root |
| `--abspath` | `_abspath` | Convert paths to absolute paths using the given root |
| `--remove-empty-caption` | `_noempty` | Remove rows with empty caption |
| `--remove-url` | `_nourl` | Remove rows with url in caption |
| `--lang LANG` | `_lang` | Remove rows in languages other than LANG |
| `--remove-path-duplication` | `_noduppath` | Remove rows with duplicated path |
| `--remove-text-duplication` | `_noduptext` | Remove rows with duplicated caption |
| `--refine-llm-caption` | `_llm` | Modify the caption generated by LLM |
| `--clean-caption MODEL` | `_clean` | Modify the caption according to T5 pipeline to suit training |
| `--unescape` | `_unescape` | Unescape the caption |
| `--merge-cmotion` | `_cmotion` | Merge the camera motion to the caption |
| `--count-num-token` | `_ntoken` | Count the number of tokens in the caption |
| `--load-caption EXT` | `_load` | Load the caption from the file |
| `--fmin FMIN` | `_fmin` | Filter the dataset by minimum number of frames |
| `--fmax FMAX` | `_fmax` | Filter the dataset by maximum number of frames |
| `--hwmax HWMAX` | `_hwmax` | Filter the dataset by maximum height x width |
| `--aesmin AESMIN` | `_aesmin` | Filter the dataset by minimum aesthetic score |
| `--matchmin MATCHMIN` | `_matchmin` | Filter the dataset by minimum CLIP score |
| `--flowmin FLOWMIN` | `_flowmin` | Filter the dataset by minimum optical flow score |
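
The `--difference` and `--intersection` rows describe set operations keyed on `path`. A rough pandas equivalent, with made-up data and assuming an inner join for the column merge, could look like this:

```python
import pandas as pd

data = pd.DataFrame({"path": ["a.mp4", "b.mp4", "c.mp4"], "text": ["cat", "dog", "bird"]})
other = pd.DataFrame({"path": ["b.mp4", "c.mp4", "z.mp4"], "aes": [5.5, 4.8, 6.0]})

# Difference: drop rows whose path appears in `other`.
difference = data[~data["path"].isin(other["path"])]

# Intersection: keep the shared paths and bring in the extra columns from `other`.
intersection = data.merge(other, on="path", how="inner")
```

This is only an illustration of the semantics in the table; the tool itself may handle duplicate paths or overlapping column names differently.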
## Transform datasets
The `tools.datasets.transform` module provides a set of tools to transform the dataset. The general usage is as follows:
```bash
python -m tools.datasets.transform TRANSFORM_TYPE META.csv ORIGINAL_DATA_FOLDER DATA_FOLDER_TO_SAVE_RESULTS --additional-args
```
### Resize
Sometimes you may need to resize the images or videos to a specific resolution. You can use the following commands to resize the dataset:
```bash
# replace TRANSFORM_TYPE with the name of the resize transform
python -m tools.datasets.transform TRANSFORM_TYPE meta.csv /path/to/raw/data /path/to/new/data --length 2160
```
### Frame extraction
To extract frames from videos, you can use the following commands:
```bash
python -m tools.datasets.transform vid_frame_extract meta.csv /path/to/raw/data /path/to/new/data --points 0.1 0.5 0.9
```
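If `--points` is read as fractional positions along the video (an assumption; the option's exact semantics are not spelled out here), mapping the values to frame indices could be sketched as below. `points_to_frame_indices` is a hypothetical helper written for this example, not part of the tool.

```python
def points_to_frame_indices(num_frames, points):
    """Map fractional positions in [0, 1] to frame indices in [0, num_frames - 1]."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    indices = []
    for p in points:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"point {p} is outside [0, 1]")
        # Truncate toward zero and clamp so that p == 1.0 maps to the last frame.
        indices.append(min(int(p * num_frames), num_frames - 1))
    return indices


indices = points_to_frame_indices(120, [0.1, 0.5, 0.9])
```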
### Crop Midjourney 4 grid
Randomly select one of the 4 images in the 2x2 grid generated by Midjourney.
```bash
python -m tools.datasets.transform img_rand_crop meta.csv /path/to/raw/data /path/to/new/data
```
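The crop itself reduces to picking one quadrant of the grid image. The following sketch computes the crop box under the assumption that the four images are equally sized quadrants; `random_quadrant_box` is a hypothetical helper, and the real transform may handle borders or odd dimensions differently.

```python
import random


def random_quadrant_box(width, height, rng=random):
    """Return (left, top, right, bottom) for one randomly chosen quadrant."""
    half_w, half_h = width // 2, height // 2
    col = rng.randrange(2)  # 0 = left half, 1 = right half
    row = rng.randrange(2)  # 0 = top half, 1 = bottom half
    left, top = col * half_w, row * half_h
    return (left, top, left + half_w, top + half_h)
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it can be passed to that method directly.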
## Analyze datasets
You can easily get basic information about a `.csv` dataset by using the following commands:
```bash
# examine the first 10 rows of the CSV file
head -n 10 DATA1.csv
# count the number of rows in the CSV file (approximately; the count includes the header line)
wc -l DATA1.csv
```
For the dataset provided in a `.csv` or `.parquet` file, you can easily analyze the dataset using the following commands. Plots will be automatically saved.
```bash
python -m tools.datasets.analyze DATA_info.csv
```
## Data Process Pipeline
```bash
# Suppose the videos and images are stored under ~/dataset/
# 1. Convert dataset to CSV
python -m tools.datasets.convert video ~/dataset --output meta.csv
# 2. Get video information
python -m tools.datasets.datautil meta.csv --info --fmin 1
# 3. Get caption
# 3.1. generate caption
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava meta_info_fmin1.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video
# merge generated results
python -m tools.datasets.datautil meta_info_fmin1_caption_part*.csv --output meta_caption.csv
# merge caption and info
python -m tools.datasets.datautil meta_info_fmin1.csv --intersection meta_caption.csv --output meta_caption_info.csv
# clean caption
python -m tools.datasets.datautil meta_caption_info.csv --clean-caption --refine-llm-caption --remove-empty-caption --output meta_caption_processed.csv
# 3.2. extract caption
python -m tools.datasets.datautil meta_info_fmin1.csv --load-caption json --remove-empty-caption --clean-caption
# 4. Scoring
# aesthetic scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.aesthetic.inference meta_caption_processed.csv
python -m tools.datasets.datautil meta_caption_processed_part*.csv --output meta_caption_processed_aes.csv
# optical flow scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.optical_flow.inference meta_caption_processed.csv
# matching scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.matching.inference meta_caption_processed.csv
# camera motion
python -m tools.caption.camera_motion_detect meta_caption_processed.csv
```