# Fitting Into Any Shape: A Flexible LLM-Based Re-Ranker With Configurable Depth and Width (Matroyshka Re-Ranker) [paper]
Unlike an embedding model, a reranker takes the query and document together as input and directly outputs a similarity score instead of an embedding. You can obtain a relevance score by feeding a query and a passage to the reranker, and the score can be mapped to a float value in [0, 1] with the sigmoid function.

Matroyshka Re-Ranker is designed to support runtime customization of the number of model layers and the sequence length at each layer according to users' configurations, which enables flexible lightweight deployment.
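For example, a raw reranker score can be squashed into [0, 1] like this (a minimal sketch; the score value below is made up):

```python
import math

def to_probability(raw_score: float) -> float:
    """Map a raw reranker score to a relevance probability in [0, 1] via sigmoid."""
    return 1.0 / (1.0 + math.exp(-raw_score))

print(to_probability(2.3))  # ~0.909 for a hypothetical raw score of 2.3
```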
The training method has the following features:
- cascaded self-distillation
- factorized compensation
## Environment

You can install the environment by:

```shell
conda create -n reranker python=3.10
conda activate reranker
pip install -r requirements.txt
```
## Model List
| Model | Introduction |
|---|---|
| BAAI/Matroyshka-ReRanker-passage | The Matroyshka Re-Ranker fine-tuned on MS MARCO passage |
| BAAI/Matroyshka-ReRanker-document | The Matroyshka Re-Ranker fine-tuned on MS MARCO document |
| BAAI/Matroyshka-ReRanker-beir | The Matroyshka Re-Ranker fine-tuned for general retrieval |
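If you prefer to fetch the weights ahead of time, you can download them from the Hugging Face Hub with `huggingface_hub` (the cache directory below is only an example):

```python
from huggingface_hub import snapshot_download

# Download the passage reranker into a local cache directory (example path).
snapshot_download(repo_id="BAAI/Matroyshka-ReRanker-passage", cache_dir="./model_cache")
```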
## Usage

You can use Matroyshka Re-Ranker with the following code:

```shell
cd ./inference
python
```

And then:
```python
from rank_model import MatroyshkaReranker

compress_ratio = 2  # configure your compression ratio
compress_layers = [8, 16]  # configure the layers at which to compress the sequence
cutoff_layers = [20, 24]  # configure the layers at which to output scores

reranker = MatroyshkaReranker(
    model_name_or_path='BAAI/Matroyshka-ReRanker-passage',
    peft_path=[
        './models/Matroyshka-ReRanker-passage/compensate/layer/full'
    ],
    use_fp16=True,
    cache_dir='./model_cache',
    compress_ratio=compress_ratio,
    compress_layers=compress_layers,
    cutoff_layers=cutoff_layers
)

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)
```
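As a usage sketch, you can rerank a list of retrieved candidates by scoring each (query, passage) pair with the `reranker` created above and sorting by score (the candidate passages below are made up):

```python
query = 'what is panda?'
candidates = [
    'hi',
    'The giant panda (Ailuropoda melanoleuca) is a bear species endemic to China.',
    'Pandas eat mostly bamboo.',
]

# Score every (query, passage) pair, then sort candidates from most to least relevant.
scores = reranker.compute_score([[query, passage] for passage in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in reranked:
    print(f'{score:.4f}\t{passage}')
```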
## Fine-tune

### Cascaded Self-distillation

For cascaded self-distillation, you can use the following script:
```shell
cd finetune/self_distillation

train_data_path="..."
your_huggingface_token="..."

torchrun --nproc_per_node 8 \
run.py \
--output_dir ./result_self_distillation \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--train_data ${train_data_path} \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--query_max_len 32 \
--passage_max_len 192 \
--train_group_size 16 \
--logging_steps 1 \
--save_steps 100 \
--save_total_limit 50 \
--ddp_find_unused_parameters False \
--gradient_checkpointing \
--deepspeed stage1.json \
--warmup_ratio 0.1 \
--bf16 \
--use_lora True \
--lora_rank 32 \
--lora_alpha 64 \
--loss_type 'only logits' \
--use_flash_attn False \
--target_modules q_proj k_proj v_proj o_proj down_proj up_proj gate_proj linear_head \
--token ${your_huggingface_token} \
--cache_dir ../../model_cache \
--cache_path ../../data_cache \
--padding_side right \
--start_layer 4 \
--layer_sep 1 \
--layer_wise True \
--compress_ratios 1 2 4 8 \
--compress_layers 4 8 12 16 20 24 28 \
--train_method distill_fix_layer_teacher
```
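The `--loss_type 'only logits'` and `--train_method distill_fix_layer_teacher` flags suggest that the lightweight sub-networks are supervised by the ranking logits of the full-scale model. The snippet below is only a conceptual sketch of such a logit-level self-distillation objective, not the repository's implementation; the tensor shapes and the KL-based formulation are assumptions:

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between the listwise score distributions of a lightweight
    sub-network (student) and the full-scale model (teacher).

    Both tensors are assumed to have shape [num_queries, train_group_size],
    i.e. one relevance logit per candidate passage in a training group.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)  # teacher kept fixed
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')
```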
### Factorized Compensation

For layer compensation, you can use the following script:
```shell
cd finetune/compensation

train_data_path="..."
your_huggingface_token="..."
raw_peft_path="../self_distillation/result_self_distillation"

torchrun --nproc_per_node 8 \
run.py \
--output_dir ./result_compensation_layer \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--raw_peft ${raw_peft_path} \
--train_data ${train_data_path} \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--query_max_len 32 \
--passage_max_len 192 \
--train_group_size 16 \
--logging_steps 1 \
--save_steps 500 \
--save_total_limit 50 \
--ddp_find_unused_parameters False \
--gradient_checkpointing \
--deepspeed stage1.json \
--warmup_ratio 0.1 \
--bf16 \
--use_lora True \
--lora_rank 32 \
--lora_alpha 64 \
--loss_type 'only logits' \
--use_flash_attn False \
--target_modules q_proj k_proj v_proj o_proj down_proj up_proj gate_proj linear_head \
--token ${your_huggingface_token} \
--cache_dir ../../model_cache \
--cache_path ../../data_cache \
--padding_side right \
--start_layer 4 \
--layer_sep 1 \
--layer_wise True \
--compress_ratios 1 \
--compress_layers 4 8 12 16 20 24 28 \
--train_method normal \
--finetune_type layer
```
For compensation of token compression, you can use the following script:
```shell
cd finetune/compensation

train_data_path="..."
your_huggingface_token="..."
raw_peft_path="../self_distillation/result_self_distillation"
compress_ratio=2

torchrun --nproc_per_node 8 \
run.py \
--output_dir ./result_compensation_token_compress_ratio_${compress_ratio} \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--raw_peft ${raw_peft_path} \
--train_data ${train_data_path} \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--query_max_len 32 \
--passage_max_len 192 \
--train_group_size 16 \
--logging_steps 1 \
--save_steps 500 \
--save_total_limit 50 \
--ddp_find_unused_parameters False \
--gradient_checkpointing \
--deepspeed stage1.json \
--warmup_ratio 0.1 \
--bf16 \
--use_lora True \
--lora_rank 32 \
--lora_alpha 64 \
--loss_type 'only logits' \
--use_flash_attn False \
--target_modules q_proj k_proj v_proj o_proj down_proj up_proj gate_proj linear_head \
--token ${your_huggingface_token} \
--cache_dir ../../model_cache \
--cache_path ../../data_cache \
--padding_side right \
--start_layer 4 \
--layer_sep 1 \
--layer_wise True \
--compress_ratios ${compress_ratio} \
--compress_layers 4 8 12 16 20 24 28 \
--train_method normal \
--finetune_type token
```
## Inference

You can use your own fine-tuned Matroyshka Re-Ranker with the following code. Make sure the compensation adapter in `peft_path` matches the compression configuration you set (here, `result_compensation_token_compress_ratio_2` matches `compress_ratio = 2`):

```shell
cd ./inference
python
```

And then:
```python
from rank_model import MatroyshkaReranker

compress_ratio = 2  # configure your compression ratio
compress_layers = [8, 16]  # configure the layers at which to compress the sequence
cutoff_layers = [20, 24]  # configure the layers at which to output scores

reranker = MatroyshkaReranker(
    model_name_or_path='mistralai/Mistral-7B-v0.1',
    peft_path=[
        './finetune/self_distillation/result_self_distillation',
        './finetune/compensation/result_compensation_token_compress_ratio_2',
    ],
    use_fp16=True,
    cache_dir='./model_cache',
    compress_ratio=compress_ratio,
    compress_layers=compress_layers,
    cutoff_layers=cutoff_layers
)

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)
```