# Dataset Management

- [Dataset Management](#dataset-management)
  - [Dataset Format](#dataset-format)
  - [Dataset to CSV](#dataset-to-csv)
  - [Manage datasets](#manage-datasets)
    - [Requirement](#requirement)
    - [Basic Usage](#basic-usage)
    - [Score filtering](#score-filtering)
    - [Documentation](#documentation)
  - [Transform datasets](#transform-datasets)
    - [Resize](#resize)
    - [Frame extraction](#frame-extraction)
    - [Crop Midjourney 4 grid](#crop-midjourney-4-grid)
  - [Analyze datasets](#analyze-datasets)
  - [Data Process Pipeline](#data-process-pipeline)

After preparing the raw dataset according to the [instructions](/docs/datasets.md), you can use the following commands to manage it.

## Dataset Format

All datasets should be provided as a `.csv` file (or `parquet.gzip` to save space), which is used for both training and data preprocessing. The columns should be named as follows:

- `path`: the relative/absolute path or URL to the image or video file. Required.
- `text`: the caption or description of the image or video. Required for training.
- `num_frames`: the number of frames in the video. Required for training.
- `width`: the width of the video frame. Required for dynamic bucket.
- `height`: the height of the video frame. Required for dynamic bucket.
- `aspect_ratio`: the aspect ratio of the video frame (height / width). Required for dynamic bucket.
- `resolution`: height x width. For analysis.
- `text_len`: the number of tokens in the text. For analysis.
- `aes`: aesthetic score calculated by the [aesthetic scorer](/tools/aesthetic/README.md). For filtering.
- `flow`: optical flow score calculated by [UniMatch](/tools/scoring/README.md). For filtering.
- `match`: matching score of an image-text/video-text pair calculated by [CLIP](/tools/scoring/README.md). For filtering.
- `fps`: the frame rate of the video. Optional.
- `cmotion`: the camera motion.

An example ready for training:

```csv
path, text, num_frames, width, height, aspect_ratio
/absolute/path/to/image1.jpg, caption, 1, 720, 1280, 0.5625
/absolute/path/to/video1.mp4, caption, 120, 720, 1280, 0.5625
/absolute/path/to/video2.mp4, caption, 20, 256, 256, 1
```

We use pandas to manage the `.csv` and `.parquet` files. The following code reads and writes them:

```python
import pandas as pd

df = pd.read_csv(input_path)
df.to_csv(output_path, index=False)

# or use parquet, which is smaller
df = pd.read_parquet(input_path)
df.to_parquet(output_path, index=False)
```

## Dataset to CSV

As a starting point, `convert.py` is used to convert a dataset to a CSV file. You can use the following commands:

```bash
python -m tools.datasets.convert DATASET-TYPE DATA_FOLDER

# general video folder
python -m tools.datasets.convert video VIDEO_FOLDER --output video.csv
# general image folder
python -m tools.datasets.convert image IMAGE_FOLDER --output image.csv
# imagenet
python -m tools.datasets.convert imagenet IMAGENET_FOLDER --split train
# ucf101
python -m tools.datasets.convert ucf101 UCF101_FOLDER --split videos
# vidprom
python -m tools.datasets.convert vidprom VIDPROM_FOLDER --info VidProM_semantic_unique.csv
```
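If your data does not match one of the supported dataset types, you can also build the initial CSV yourself with pandas. Below is a minimal sketch that produces a file in the format described in [Dataset Format](#dataset-format); the paths, values, and output file names are placeholders that mirror the example above and should be replaced with your own data.

```python
import pandas as pd

# rows with the columns required for training (see Dataset Format above);
# the values here are placeholders copied from the example CSV
samples = [
    {"path": "/absolute/path/to/image1.jpg", "text": "caption",
     "num_frames": 1, "width": 720, "height": 1280, "aspect_ratio": 0.5625},
    {"path": "/absolute/path/to/video1.mp4", "text": "caption",
     "num_frames": 120, "width": 720, "height": 1280, "aspect_ratio": 0.5625},
]

df = pd.DataFrame(samples)
df.to_csv("custom_dataset.csv", index=False)
# or, to save space:
df.to_parquet("custom_dataset.parquet.gzip", compression="gzip", index=False)
```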
## Manage datasets

Use `datautil` to manage the dataset.

### Requirement

Follow the "Data Dependencies" and "Datasets" sections of our [installation guide](../../docs/installation.md) to install the required packages.

### Basic Usage

You can use the following commands to process the `csv` or `parquet` files. The output file will be saved in the same directory as the input, with a suffix indicating the processing that was applied.

```bash
# datautil takes multiple CSV files as input and merges them into one CSV file
# output: DATA1+DATA2.csv
python -m tools.datasets.datautil DATA1.csv DATA2.csv

# shard a CSV file into multiple CSV files
# output: DATA1_0.csv, DATA1_1.csv, ...
python -m tools.datasets.datautil DATA1.csv --shard 10

# filter to videos with between 128 and 256 frames
# output: DATA_fmin_128_fmax_256.csv
python -m tools.datasets.datautil DATA.csv --fmin 128 --fmax 256

# disable parallel processing
python -m tools.datasets.datautil DATA.csv --fmin 128 --fmax 256 --disable-parallel

# compute num_frames, height, width, fps, aspect_ratio for videos or images
# output: IMG_DATA+VID_DATA_vinfo.csv
python -m tools.datasets.datautil IMG_DATA.csv VID_DATA.csv --video-info

# you can run multiple operations at the same time
python -m tools.datasets.datautil DATA.csv --video-info --remove-empty-caption --remove-url --lang en
```

### Score filtering

To examine and filter the dataset by aesthetic score and CLIP score, you can use the following commands:

```bash
# sort the dataset by aesthetic score
# output: DATA_sort.csv
python -m tools.datasets.datautil DATA.csv --sort aesthetic_score
# view examples with high aesthetic scores
head -n 10 DATA_sort.csv
# view examples with low aesthetic scores
tail -n 10 DATA_sort.csv

# sort the dataset by clip score
# output: DATA_sort.csv
python -m tools.datasets.datautil DATA.csv --sort clip_score

# filter the dataset by aesthetic score
# output: DATA_aesmin_0.5.csv
python -m tools.datasets.datautil DATA.csv --aesmin 0.5
# filter the dataset by clip score
# output: DATA_matchmin_0.5.csv
python -m tools.datasets.datautil DATA.csv --matchmin 0.5
```
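Before committing to a threshold, it can help to look at how the scores are distributed. Below is a minimal pandas sketch, assuming the scores are stored in the `aes` and `match` columns described in [Dataset Format](#dataset-format); the 0.5 thresholds are simply the values used in the example commands above.

```python
import pandas as pd

df = pd.read_csv("DATA.csv")

# summary statistics of the aesthetic and matching scores
print(df[["aes", "match"]].describe())

# fraction of rows that would survive the example thresholds above
print("aes >= 0.5:", (df["aes"] >= 0.5).mean())
print("match >= 0.5:", (df["match"] >= 0.5).mean())
```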
### Documentation

You can also run `python -m tools.datasets.datautil --help` to see the usage.

| Args                        | File suffix     | Description                                                    |
| --------------------------- | --------------- | -------------------------------------------------------------- |
| `--output OUTPUT`           |                 | Output path                                                     |
| `--format FORMAT`           |                 | Output format (csv, parquet, parquet.gzip)                      |
| `--disable-parallel`        |                 | Disable `pandarallel`                                           |
| `--seed SEED`               |                 | Random seed                                                     |
| `--shard SHARD`             | `_0`, `_1`, ... | Shard the dataset                                               |
| `--sort KEY`                | `_sort`         | Sort the dataset by KEY                                         |
| `--sort-descending KEY`     | `_sort`         | Sort the dataset by KEY in descending order                     |
| `--difference DATA.csv`     |                 | Remove the paths in DATA.csv from the dataset                   |
| `--intersection DATA.csv`   |                 | Keep the paths in DATA.csv from the dataset and merge columns   |
| `--info`                    | `_info`         | Get the basic information of each video and image (cv2)         |
| `--ext`                     | `_ext`          | Remove rows whose file does not exist                           |
| `--relpath`                 | `_relpath`      | Convert paths to relative paths, given a root                   |
| `--abspath`                 | `_abspath`      | Convert paths to absolute paths, given a root                   |
| `--remove-empty-caption`    | `_noempty`      | Remove rows with an empty caption                               |
| `--remove-url`              | `_nourl`        | Remove rows with a URL in the caption                           |
| `--lang LANG`               | `_lang`         | Remove rows in other languages                                  |
| `--remove-path-duplication` | `_noduppath`    | Remove rows with duplicated paths                               |
| `--remove-text-duplication` | `_noduptext`    | Remove rows with duplicated captions                            |
| `--refine-llm-caption`      | `_llm`          | Refine captions generated by an LLM                             |
| `--clean-caption MODEL`     | `_clean`        | Clean the caption with the T5 pipeline to suit training         |
| `--unescape`                | `_unescape`     | Unescape the caption                                            |
| `--merge-cmotion`           | `_cmotion`      | Merge the camera motion into the caption                        |
| `--count-num-token`         | `_ntoken`       | Count the number of tokens in the caption                       |
| `--load-caption EXT`        | `_load`         | Load the caption from a file with the given extension           |
| `--fmin FMIN`               | `_fmin`         | Filter the dataset by minimum number of frames                  |
| `--fmax FMAX`               | `_fmax`         | Filter the dataset by maximum number of frames                  |
| `--hwmax HWMAX`             | `_hwmax`        | Filter the dataset by maximum height x width                    |
| `--aesmin AESMIN`           | `_aesmin`       | Filter the dataset by minimum aesthetic score                   |
| `--matchmin MATCHMIN`       | `_matchmin`     | Filter the dataset by minimum CLIP score                        |
| `--flowmin FLOWMIN`         | `_flowmin`      | Filter the dataset by minimum optical flow score                |

## Transform datasets

The `tools.datasets.transform` module provides a set of tools to transform the dataset. The general usage is as follows:

```bash
python -m tools.datasets.transform TRANSFORM_TYPE META.csv ORIGINAL_DATA_FOLDER DATA_FOLDER_TO_SAVE_RESULTS --additional-args
```

### Resize

Sometimes you may need to resize the images or videos to a specific resolution. You can use the following command to resize the dataset:

```bash
python -m tools.datasets.transform meta.csv /path/to/raw/data /path/to/new/data --length 2160
```

### Frame extraction

To extract frames from videos, you can use the following command:

```bash
python -m tools.datasets.transform vid_frame_extract meta.csv /path/to/raw/data /path/to/new/data --points 0.1 0.5 0.9
```

### Crop Midjourney 4 grid

Randomly select one of the 4 images in the 4-grid generated by Midjourney.

```bash
python -m tools.datasets.transform img_rand_crop meta.csv /path/to/raw/data /path/to/new/data
```

## Analyze datasets

You can get basic information about a `.csv` dataset with the following commands:

```bash
# examine the first 10 rows of the CSV file
head -n 10 DATA1.csv
# count the number of rows in the CSV file (approximately)
wc -l DATA1.csv
```

For a dataset provided in a `.csv` or `.parquet` file, you can analyze it with the following command. Plots will be saved automatically.

```bash
python -m tools.datasets.analyze DATA_info.csv
```
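If you prefer to explore the data interactively, below is a small pandas sketch along the same lines, assuming the video-information columns described in [Dataset Format](#dataset-format) (`num_frames`, `height`, `width`) are present in the file.

```python
import pandas as pd

df = pd.read_csv("DATA_info.csv")

# overall statistics of clip length and frame size
print(df[["num_frames", "height", "width"]].describe())

# the ten most common resolutions
print(df.groupby(["height", "width"]).size().sort_values(ascending=False).head(10))
```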
## Data Process Pipeline

```bash
# Suppose videos and images under ~/dataset/
# 1. Convert dataset to CSV
python -m tools.datasets.convert video ~/dataset --output meta.csv

# 2. Get video information
python -m tools.datasets.datautil meta.csv --info --fmin 1

# 3. Get caption
# 3.1. generate caption
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava meta_info_fmin1.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video
# merge generated results
python -m tools.datasets.datautil meta_info_fmin1_caption_part*.csv --output meta_caption.csv
# merge caption and info
python -m tools.datasets.datautil meta_info_fmin1.csv --intersection meta_caption.csv --output meta_caption_info.csv
# clean caption
python -m tools.datasets.datautil meta_caption_info.csv --clean-caption --refine-llm-caption --remove-empty-caption --output meta_caption_processed.csv

# 3.2. extract caption
python -m tools.datasets.datautil meta_info_fmin1.csv --load-caption json --remove-empty-caption --clean-caption

# 4. Scoring
# aesthetic scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.aesthetic.inference meta_caption_processed.csv
python -m tools.datasets.datautil meta_caption_processed_part*.csv --output meta_caption_processed_aes.csv
# optical flow scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.optical_flow.inference meta_caption_processed.csv
# matching scoring
torchrun --standalone --nproc_per_node 8 -m tools.scoring.matching.inference meta_caption_processed.csv
# camera motion
python -m tools.caption.camera_motion_detect meta_caption_processed.csv
```
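A natural last step after scoring is to filter by the computed scores to obtain the final training file, e.g. with the `--aesmin`, `--matchmin`, and `--flowmin` flags documented above. The sketch below does the equivalent in pandas, assuming the three scores have been merged back into a single file; the input file name, output file name, and thresholds are only illustrative.

```python
import pandas as pd

# hypothetical file in which the aes, match, and flow columns have all been merged
df = pd.read_csv("meta_caption_processed_scores.csv")

# keep rows whose scores pass the (illustrative) thresholds,
# mirroring --aesmin / --matchmin / --flowmin
df = df[(df["aes"] >= 0.5) & (df["match"] >= 0.5) & (df["flow"] >= 0.01)]

df.to_csv("meta_training.csv", index=False)
```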