# Needle in a Haystack

## Description

The "Needle in a Haystack" task involves identifying a specific, often minimal, piece of relevant information hidden in a text filled with a large amount of irrelevant data. The implementation logic of this code is inspired by [LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack).

## Dataset

The dataset ([link](https://modelscope.cn/datasets/AI-ModelScope/Needle-in-a-Haystack-Corpus/summary)) contains texts in both Chinese and English:

- **Chinese Text**: Extracted from "Journey to the West"
- **English Text**: Extracted from articles by Paul Graham

## Usage

### Single-Needle Task

Run the following code to start the Needle in a Haystack task. The example below evaluates the qwen-plus model on context lengths ranging from 1k to 128k for the single-needle task.

**Note: You must set the judge model, otherwise the evaluation will fail.**

```python
import os

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='service',  # Use an API model service
    datasets=['needle_haystack'],
    eval_batch_size=10,
    dataset_args={
        'needle_haystack': {
            'subset_list': ['chinese', 'english'],  # Optional, specify the Chinese and/or English subset
            # Supported configuration parameters
            'extra_params': {
                # Retrieval question
                'retrieval_question': 'What is the best thing to do in San Francisco?',
                # Inserted text (can be multiple)
                'needles': ['\nThe best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.\n'],
                # Minimum length of the corpus
                'context_lengths_min': 1000,
                # Maximum length of the corpus
                'context_lengths_max': 128000,
                # Number of intervals for the corpus length
                'context_lengths_num_intervals': 20,
                # Minimum position of the inserted text (percentage)
                'document_depth_percent_min': 0,
                # Maximum position of the inserted text (percentage)
                'document_depth_percent_max': 100,
                # Number of intervals for the insertion position
                'document_depth_percent_intervals': 10,
                # Path to the tokenizer (a ModelScope ID can be specified)
                'tokenizer_path': 'Qwen/Qwen3-0.6B',
                'show_score': True,  # Whether to display scores on the heatmap
            }
        }
    },
    generation_config={
        'max_tokens': 512,  # Maximum number of tokens to generate
    },
    judge_worker_num=5,
    judge_model_args={
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': os.getenv('DASHSCOPE_API_KEY'),
    }
)

run_task(task_cfg=task_cfg)
```

The output is as follows (truncated):

```text
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| Model     | Dataset         | Metric                   | Subset   |   Num |   Score | Cat.0   |
+===========+=================+==========================+==========+=======+=========+=========+
| qwen-plus | needle_haystack | Context#128000 Depth#33  | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#11  | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#11  | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#121316 Depth#56  | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#121316 Depth#56  | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#44  | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#44  | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#67  | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#67  | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#0   | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#0   | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#22  | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#22  | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#100 | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#100 | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#89  | chinese  |     1 |       1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
| qwen-plus | needle_haystack | Context#128000 Depth#89  | english  |     1 |     0.1 | default |
+-----------+-----------------+--------------------------+----------+-------+---------+---------+
```

Evaluation reports and the corresponding heatmaps will be generated in the `outputs/xxx/reports` directory, as shown below:

![needle_haystack_report](./images/needle_haystack_heatmap_chinese.png)
*Chinese Test*

![needle_haystack_report](./images/needle_haystack_heatmap_english.png)
*English Test*

It can be observed that the model's performance varies with the context length and insertion position, and that it performs better on the Chinese corpus.
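Each `Context#<length> Depth#<percent>` metric above corresponds to one cell of the heatmap, i.e. one combination of corpus length and insertion depth. As a rough illustration of how that grid relates to the interval parameters in `extra_params`, the sketch below assumes evenly spaced, rounded values in the spirit of the referenced LLMTest_NeedleInAHaystack logic; evalscope's internals may differ in detail:

```python
import numpy as np

# Illustrative sketch only: reconstruct the (context length, depth percent) grid
# implied by the extra_params above, assuming evenly spaced, rounded values.
context_lengths = np.round(np.linspace(1000, 128000, num=20, endpoint=True)).astype(int)
depth_percents = np.round(np.linspace(0, 100, num=10, endpoint=True)).astype(int)

print(context_lengths)  # [1000 7684 ... 121316 128000]
print(depth_percents)   # [  0  11  22  33  44  56  67  78  89 100]

# Each (length, depth) pair maps to one metric row / heatmap cell, e.g.
# "Context#121316 Depth#56"; with two subsets this yields 20 * 10 * 2 = 400 samples.
```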
### Multi-Needle Task

The multi-needle task is configured in the same way as the single-needle task; the only difference is that multiple texts are specified under `needles` in `extra_params`. Below is a 3-needle test conducted over context lengths of 1k to 32k.

```python
import os

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='service',  # Use an API model service
    datasets=['needle_haystack'],
    eval_batch_size=10,
    dataset_args={
        'needle_haystack': {
            'subset_list': ['chinese', 'english'],  # Optional, specify the Chinese and/or English subset
            'extra_params': {
                'retrieval_question': 'What is the secret ingredient needed to build the perfect pizza?',
                'needles': [
                    " Figs are one of the secret ingredients needed to build the perfect pizza. ",
                    " Prosciutto is one of the secret ingredients needed to build the perfect pizza. ",
                    " Goat cheese is one of the secret ingredients needed to build the perfect pizza. "
                ],
                'context_lengths_min': 1000,
                'context_lengths_max': 32000,
                'context_lengths_num_intervals': 10,
                'document_depth_percent_min': 0,
                'document_depth_percent_max': 100,
                'document_depth_percent_intervals': 10,
                'tokenizer_path': 'Qwen/Qwen3-0.6B',
                'show_score': True,
            }
        }
    },
    generation_config={
        'max_tokens': 512,  # Maximum number of tokens to generate
    },
    judge_worker_num=5,
    judge_model_args={
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': os.getenv('DASHSCOPE_API_KEY'),
    }
)

run_task(task_cfg=task_cfg)
```

The output example is similar to the one above.
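For intuition on how a needle ends up at a given depth, the sketch below illustrates token-level insertion in the spirit of the referenced LLMTest_NeedleInAHaystack logic. The `insert_needles` helper and the filler haystack are hypothetical; evalscope's actual implementation may trim the corpus and distribute multiple needles differently.

```python
from transformers import AutoTokenizer


def insert_needles(context, needles, depth_percent, context_length, tokenizer):
    """Hypothetical helper: trim the haystack to `context_length` tokens and insert
    the needles back-to-back starting at `depth_percent` of the trimmed document."""
    tokens = tokenizer.encode(context, add_special_tokens=False)[:context_length]
    insert_at = int(len(tokens) * depth_percent / 100)
    for needle in needles:
        needle_tokens = tokenizer.encode(needle, add_special_tokens=False)
        tokens = tokens[:insert_at] + needle_tokens + tokens[insert_at:]
        insert_at += len(needle_tokens)  # keep subsequent needles adjacent
    return tokenizer.decode(tokens)


tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-0.6B')  # same tokenizer as tokenizer_path
haystack = 'Some filler sentence about nothing in particular. ' * 500  # stand-in corpus
prompt_context = insert_needles(
    haystack,
    needles=[' Figs are one of the secret ingredients needed to build the perfect pizza. '],
    depth_percent=56,
    context_length=1000,
    tokenizer=tokenizer,
)
print(prompt_context[:200])
```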