Introduction

EvalScope is a comprehensive model evaluation and benchmarking framework meticulously crafted by the ModelScope community. It offers an all-in-one solution for your model assessment needs, regardless of the type of model you are developing:

  • 🧠 Large Language Models
  • 🎨 Multimodal Models
  • 🔍 Embedding Models
  • 🏆 Reranker Models
  • 🖼️ CLIP Models
  • 🎭 AIGC Models (Text-to-Image/Video)
  • ...and more!

EvalScope is not merely an evaluation tool; it is a valuable ally in your model optimization journey:

  • 🏅 Equipped with multiple industry-recognized benchmarks and evaluation metrics such as MMLU, CMMLU, C-Eval, GSM8K, and others.
  • 📊 Performance stress testing for model inference to ensure your model excels in real-world applications.
  • 🚀 Seamlessly integrates with the ms-swift training framework, enabling one-click evaluations and providing end-to-end support from training to assessment for your model development.
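
For a sense of what "one-click evaluation" looks like in practice, here is a minimal sketch of a built-in benchmark run. The model ID, dataset name, and sample limit are illustrative, and the `TaskConfig`/`run_task` usage shown here should be checked against the current API reference rather than taken as the definitive interface.

```python
# Minimal sketch of a quick benchmark run (illustrative values; verify the
# exact interface against the EvalScope API reference).
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # example ModelScope model ID
    datasets=['gsm8k'],                  # one of the built-in benchmarks
    limit=10,                            # evaluate only a few samples as a smoke test
)

run_task(task_cfg=task_cfg)  # writes results and an evaluation report to the output directory
```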

Overall Architecture

EvalScope Architecture Diagram.

The architecture includes the following modules:

  1. Input Layer
  • Model Sources: API models (OpenAI API), local models (ModelScope); see the sketch after this list
  • Datasets: Standard evaluation benchmarks (MMLU/GSM8K, etc.), custom data (MCQ/QA)
  2. Core Functions
  • Multi-backend Evaluation

    • Native backends: Unified evaluation for LLM/VLM/Embedding/T2I models
    • Integrated frameworks: OpenCompass/MTEB/VLMEvalKit/RAGAS
  • Performance Monitoring

    • Model plugins: Supports various model service APIs
    • Data plugins: Supports multiple data formats
    • Metric tracking: TTFT, TPOT, stability, and other metrics
  • Tool Extensions

    • Integration: Tool-Bench/Needle-in-a-Haystack/BFCL-v3
  3. Output Layer
  • Structured Reports: Supports JSON/Tables/Logs
  • Visualization Platforms: Supports Gradio/Wandb/SwanLab
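
As a rough illustration of how the input layer maps onto a task configuration, the sketch below evaluates a model served behind an OpenAI-compatible endpoint using the native backend. The endpoint URL and the `api_url`/`api_key`/`eval_type` parameter names are assumptions for illustration; the configuration reference is authoritative.

```python
# Hypothetical sketch: evaluating an OpenAI-compatible API service with the
# native backend. Parameter names (api_url, api_key, eval_type) and all values
# are assumptions; see the EvalScope configuration reference for exact fields.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',          # model name exposed by the serving endpoint (example)
    api_url='http://127.0.0.1:8000/v1',   # OpenAI-compatible endpoint (example)
    api_key='EMPTY',                      # placeholder key for a local server
    eval_type='service',                  # evaluate a remote service rather than a local checkpoint
    datasets=['mmlu', 'gsm8k'],           # standard benchmarks from the input layer
)

run_task(task_cfg=task_cfg)
```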

Framework Features

  • Benchmark Datasets: Preloaded with several commonly used test benchmarks, including MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, HumanEval, etc.
  • Evaluation Metrics: Implements various commonly used evaluation metrics.
  • Model Access: A unified model access mechanism that is compatible with the Generate and Chat interfaces of multiple model families.
  • Automated Evaluation: Includes automatic evaluation of objective questions and complex task evaluation using expert models.
  • Evaluation Reports: Automatically generates evaluation reports.
  • Arena Mode: Used for head-to-head comparison and objective evaluation of models, supporting several evaluation modes, including:
    • Single mode: Scoring a single model.
    • Pairwise-baseline mode: Comparing against a baseline model.
    • Pairwise (all) mode: Pairwise comparison among all models.
  • Visualization Tools: Provides intuitive displays of evaluation results.
  • Model Performance Evaluation: Offers a performance testing tool for model inference services with detailed statistics; see the Model Performance Evaluation documentation and the sketch after this list.
  • OpenCompass Integration: Supports OpenCompass as an evaluation backend, providing higher-level encapsulation and simplified task configuration so evaluation tasks are easier to submit.
  • VLMEvalKit Integration: Supports VLMEvalKit as an evaluation backend, making it easy to launch multimodal evaluation tasks across a variety of multimodal models and datasets.
  • Full-Link Support: Through seamless integration with the ms-swift training framework, EvalScope provides a one-stop workflow covering model training, deployment, evaluation, and report viewing, improving development efficiency.
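
To give a feel for the performance testing tool mentioned above, the sketch below stress-tests an OpenAI-compatible inference service. The `run_perf_benchmark` entry point, its module path, and the parameter names are assumptions based on typical usage; the Model Performance Evaluation documentation describes the exact interface.

```python
# Hypothetical sketch of a performance stress test against an OpenAI-compatible
# endpoint. Module path, function name, and parameters are assumptions; refer to
# the Model Performance Evaluation documentation for the exact interface.
from evalscope.perf.main import run_perf_benchmark

task_cfg = {
    'url': 'http://127.0.0.1:8000/v1/chat/completions',  # target inference service (example)
    'model': 'qwen2.5-7b-instruct',                      # model name exposed by the service (example)
    'api': 'openai',                                     # request-format plugin
    'dataset': 'openqa',                                 # prompt source used to generate load
    'number': 100,                                       # total number of requests
    'parallel': 10,                                      # concurrent requests
}

run_perf_benchmark(task_cfg)  # reports TTFT, TPOT, throughput, and other statistics
```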