MedOdyssey
MedOdyssey is a medical‑domain long‑context evaluation benchmark co‑created by East China University of Science and Technology and Shanghai Artificial Intelligence Laboratory, comprising 10 complex datasets covering medical corpora such as books, guidelines, case reports, and knowledge graphs. The datasets are built from open‑source and royalty‑free medical data to assess large language models’ performance on long‑context tasks, particularly in medical applications like electronic health‑record analysis and biomedical terminology standardisation.
Description
MedOdyssey: A Medical Domain Benchmark for Long‑Context Evaluation Up to 200K Tokens
Introduction
MedOdyssey is a medical long‑context benchmark consisting of seven length levels ranging from 4K to 200K tokens. It comprises two main components: a "needle‑in‑a‑haystack" medical context retrieval task and a suite of medical‑specific tasks, together covering 10 datasets.
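The "needle‑in‑a‑haystack" setup can be sketched as follows: a short "needle" fact is inserted at a controlled depth inside a long medical context, and the model is asked a question answerable only from that fact. This is a minimal illustrative sketch, not MedOdyssey's actual construction code; the function and variable names are hypothetical.

```python
def build_niah_example(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth in `haystack`
    (0.0 = start, 1.0 = end), snapping to a sentence boundary."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    # Snap forward to the next ". " so the needle lands between sentences.
    boundary = haystack.find(". ", pos)
    boundary = len(haystack) if boundary == -1 else boundary + 2
    return haystack[:boundary] + needle + " " + haystack[boundary:]

# Example: place a fictional fact at the midpoint of a long context.
context = build_niah_example(
    haystack="Clinical background text. " * 2000,
    needle="The fictional enzyme XYZ-ase is produced only in the pineal gland.",
    depth=0.5,
)
```

Varying `depth` across the context (and the context length across the 4K–200K levels) is what lets a benchmark probe retrieval at every position of the window.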
Dataset Statistics
| Task | Annotations | #Examples | Avg. Length (context/answer) | MIC | NFI | CIR | Evaluation Metric |
|---|---|---|---|---|---|---|---|
| En.NIAH | Auto & Human | 20×7×5 | 179.2k/32 | ✔ | ✔ | ✘ | Accuracy |
| Zh.NIAH | Auto & Human | 20×7×5 | 45.6k/10.2 | ✔ | ✔ | ✘ | Accuracy |
| En.Counting | Auto | 4×7 | 179.0k/13.6 | ✔ | ✘ | ✔ | Accuracy |
| Zh.Counting | Auto | 4×7 | 45.6k/12.3 | ✔ | ✘ | ✔ | Accuracy |
| En.KG | Auto & Human | 100 | 186.4k/68.8 | ✔ | ✘ | ✔ | Precision, Recall, F1 |
| Zh.KG | Auto & Human | 100 | 42.5k/2.0 | ✔ | ✘ | ✔ | Precision, Recall, F1 |
| En.Term | Auto | 100 | 183.1k/11.7 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Term | Auto | 100 | 32.6k/7.0 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Case | Auto & Human | 100 | 47.7k/1.3 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Table | Auto & Human | 100 | 53.6k/1.4 | ✔ | ✘ | ✘ | Precision, Recall, F1 |
MIC: Maximum Identical Context, NFI: New Fact Injection, CIR: Counter‑Intuitive Reasoning.
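For the tasks scored with Precision, Recall, and F1 (En.KG, Zh.KG, Zh.Table), scoring is naturally set‑based: predicted items (e.g. knowledge‑graph triples) are compared against gold items. The sketch below shows the standard metric; MedOdyssey's exact matching rules may differ, and the example triples are invented for illustration.

```python
def prf1(predicted: set, gold: set) -> tuple:
    """Set-based precision, recall, and F1 over extracted items."""
    tp = len(predicted & gold)  # items both predicted and in the gold set
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("aspirin", "treats", "headache"),
        ("aspirin", "interacts_with", "warfarin")}
pred = {("aspirin", "treats", "headache"),
        ("aspirin", "treats", "fever")}
p, r, f = prf1(pred, gold)  # -> (0.5, 0.5, 0.5)
```

One true positive out of two predictions and two gold triples gives 0.5 for all three scores; accuracy‑scored tasks (NIAH, Counting, Term, Case) instead check a single exact answer.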
Baseline Models
We evaluate current state‑of‑the‑art long‑context LLMs on MedOdyssey.
- GPT‑4: Released March 2023 by OpenAI with an original 8,192‑token window; the 128K GPT‑4 Turbo line followed in November 2023. Evaluated version: gpt‑4‑turbo‑2024‑04‑09.
- GPT‑4o: Optimised variant released May 2024, 128K window, knowledge cut‑off October 2023 (gpt‑4o‑2024‑05‑13).
- Claude 3: Anthropic’s March 2024 release; three models (Haiku, Sonnet, Opus) with 200K windows (claude‑3‑haiku‑20240307, claude‑3‑sonnet‑20240229).
- Moonshot‑v1: Moonshot AI, 2023 release, 128K window (moonshot‑v1‑128k).
- ChatGLM3‑6B‑128k: ZHIPU‑AI, 2024 release, 128K context.
- InternLM2: Shanghai AI Lab, 2024 release, supports inference up to 200K tokens.
- Yi‑6B‑200k: 01.AI, 2023 release, 200K window.
- Yarn‑Mistral‑7B‑128k: NousResearch, 2023 release, 128K window achieved via YaRN context extension.
Source
Organization: arXiv
Created: 6/21/2024