# Activation-Beacon
Activation Beacon is a plug-in module for transformer-based LLMs that enables effective, efficient, and flexible compression of long contexts.

This folder contains the newer code for Activation Beacon. It supports more LLMs, including Mistral, Llama-3, and Qwen-2, and more features, including DeepSpeed ZeRO-3 training, FlashAttention-2, chat templates in training and inference, and evaluation on more tasks. However, the code in this folder is under development and subject to change in the future.
## Environment
```bash
conda create -n beacon python=3.10.14
conda activate beacon

# You may need to adjust the cuda version
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers deepspeed accelerate datasets peft pandas seaborn rouge fuzzywuzzy jieba python-Levenshtein
pip install flash-attn --no-build-isolation
```
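Before loading the model, it can help to verify that PyTorch sees a GPU and that flash-attn imports cleanly. The snippet below is an optional sanity check under these assumptions, not something the repository requires:

```python
# Optional sanity check for the environment (not part of the original setup)
import torch

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"

try:
    import flash_attn  # installed above via `pip install flash-attn --no-build-isolation`
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; attn_implementation='flash_attention_2' will not work")
```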
## Usage
```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "namespace-Pt/beacon-qwen-2-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
model = model.cuda().eval()

with torch.no_grad():
    # short context
    messages = [{"role": "user", "content": "Tell me about yourself."}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Output: {repr(tokenizer.decode(outputs[0], skip_special_tokens=True))}")

    # reset memory before new generation task
    model.memory.reset()

    # long context
    with open("data/toy/infbench.json", encoding="utf-8") as f:
        example = json.load(f)
    messages = [{"role": "user", "content": example["context"]}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
    outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]

    print("*" * 20)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Answers: {example['answer']}")
    print(f"Prediction: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
**NOTE**: It's okay to see warnings like `This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (32768). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.` Just ignore it.
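When running several independent tasks back to back, the main thing to remember is resetting the beacon memory between them, so compressed activations from one context don't leak into the next. The sketch below continues from the snippet above; the `documents` list is a placeholder for your own data:

```python
# Hypothetical loop over several long contexts; `documents` is a placeholder.
documents = ["<long document 1>", "<long document 2>"]  # replace with your own contexts

for doc in documents:
    model.memory.reset()  # clear compressed activations from the previous task
    messages = [{"role": "user", "content": doc}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_tensors="pt", return_dict=True
    ).to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=20)
    prediction = tokenizer.decode(
        outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(prediction)
```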
## Data
You should download the data for fine-tuning & evaluation, then untar the file anywhere you prefer, e.g. `/data`:
```bash
# feel free to change /data to your preferred location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/long-llm.tar.gz?download=true -O /data/long-llm.tar.gz

cd /data
tar -xzvf long-llm.tar.gz
```
**IMPORTANT NOTE**
For any path specified for `train_data` and `eval_data`: if it is prefixed with `long-llm:`, it will be resolved to a path relative to `data_root` (a minimal sketch of this resolution follows below).
- e.g. `long-llm:lm/pg19.json` becomes `${data_root}/lm/pg19.json`
- you can modify the default value of `data_root` so that you don't need to type it for each command
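The prefix handling amounts to simple string substitution. Here is an illustrative sketch; the helper name `resolve_path` and the example `data_root` value are assumptions, not code from this repository:

```python
import os

# Illustrative only: `resolve_path` is not part of the codebase.
def resolve_path(path: str, data_root: str) -> str:
    prefix = "long-llm:"
    if path.startswith(prefix):
        return os.path.join(data_root, path[len(prefix):])
    return path

# assuming the data tarball was extracted to /data/long-llm
print(resolve_path("long-llm:lm/pg19.json", data_root="/data/long-llm"))
# -> /data/long-llm/lm/pg19.json
```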
## Training
See the training section.
## Evaluation
See the evaluation section.
## Citation
If you find this repository useful, please give us a star ⭐.
To cite our work:
```bibtex
@misc{zhang2024soaring,
  title={Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon},
  author={Peitian Zhang and Zheng Liu and Shitao Xiao and Ninglu Shao and Qiwei Ye and Zhicheng Dou},
  year={2024},
  eprint={2401.03462},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```