MedOdyssey
MedOdyssey is a medical‑domain long‑context evaluation benchmark co‑created by East China University of Science and Technology and the Shanghai Artificial Intelligence Laboratory. It comprises 10 datasets built from open‑source and royalty‑free medical corpora such as books, guidelines, case reports, and knowledge graphs, and assesses large language models' performance on long‑context tasks, particularly medical applications such as electronic health‑record analysis and biomedical terminology standardisation.
MedOdyssey: A Medical Domain Benchmark for Long‑Context Evaluation Up to 200K Tokens
Introduction
MedOdyssey is a medical long‑context benchmark consisting of seven length levels ranging from 4K to 200K tokens. It comprises two main components: a "needle‑in‑a‑haystack" medical context retrieval task and a suite of medical‑specific tasks, together covering 10 datasets.
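The "needle‑in‑a‑haystack" setup can be sketched as follows. This is a minimal illustration of the general probe‑construction technique, not the released MedOdyssey pipeline; the helper name, the example needle, and the character‑based truncation are all assumptions for illustration.

```python
# Minimal sketch of needle-in-a-haystack prompt construction (illustrative,
# not the official MedOdyssey code). A "needle" fact is inserted at a chosen
# relative depth inside a long medical context truncated to a target size.

def build_niah_prompt(haystack: str, needle: str, depth: float, max_chars: int) -> str:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end)
    of a context truncated to `max_chars` characters."""
    context = haystack[:max_chars]
    pos = int(len(context) * depth)
    return context[:pos] + " " + needle + " " + context[pos:]

# Hypothetical example: a fabricated fact (New Fact Injection) placed
# halfway into a 4K-character medical context.
prompt = build_niah_prompt(
    haystack="Background medical text. " * 1000,
    needle="The fictitious enzyme medolase is produced in the pancreas.",
    depth=0.5,
    max_chars=4000,
)
```

The model is then asked to retrieve the injected fact; varying `depth` and `max_chars` yields the depth × length grid that NIAH tasks report.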
Dataset Statistics
| Task | Annotations | #Examples | Avg. Length (context/answer) | MIC | NFI | CIR | Evaluation Metric |
|---|---|---|---|---|---|---|---|
| En.NIAH | Auto & Human | 20×7×5 | 179.2k/32 | ✔ | ✔ | ✘ | Accuracy |
| Zh.NIAH | Auto & Human | 20×7×5 | 45.6k/10.2 | ✔ | ✔ | ✘ | Accuracy |
| En.Counting | Auto | 4×7 | 179.0k/13.6 | ✔ | ✘ | ✔ | Accuracy |
| Zh.Counting | Auto | 4×7 | 45.6k/12.3 | ✔ | ✘ | ✔ | Accuracy |
| En.KG | Auto & Human | 100 | 186.4k/68.8 | ✔ | ✘ | ✔ | Precision, Recall, F1 |
| Zh.KG | Auto & Human | 100 | 42.5k/2.0 | ✔ | ✘ | ✔ | Precision, Recall, F1 |
| En.Term | Auto | 100 | 183.1k/11.7 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Term | Auto | 100 | 32.6k/7.0 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Case | Auto & Human | 100 | 47.7k/1.3 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Table | Auto & Human | 100 | 53.6k/1.4 | ✔ | ✘ | ✘ | Precision, Recall, F1 |
MIC: Maximum Identical Context, NFI: New Fact Injection, CIR: Counter‑Intuitive Reasoning.
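For the tasks scored with Precision, Recall, and F1 (the KG and Table tasks), a set‑based scorer over extracted items is the standard approach. The sketch below assumes exact matching of predicted versus gold triples, which may differ from the benchmark's official matching rules; the function name and example triples are illustrative.

```python
# Sketch of set-based Precision/Recall/F1 scoring over extracted triples,
# assuming exact matching (the official matcher may be more lenient).

def prf1(predicted: set, gold: set) -> tuple:
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted knowledge-graph triples.
gold = {("aspirin", "treats", "headache"), ("insulin", "regulates", "glucose")}
pred = {("aspirin", "treats", "headache"), ("insulin", "causes", "glucose")}
p, r, f = prf1(pred, gold)  # p = 0.5, r = 0.5, f = 0.5
```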
Baseline Models
We evaluate current state‑of‑the‑art long‑context LLMs on MedOdyssey.
- GPT‑4: Released in March 2023 by OpenAI with an original 8,192‑token window; GPT‑4 Turbo extended this to 128K in November 2023. We evaluate the gpt‑4‑turbo‑2024‑04‑09 checkpoint.
- GPT‑4o: Optimised variant released in May 2024; 128K window, knowledge cut‑off October 2023 (gpt‑4o‑2024‑05‑13).
- Claude 3: Anthropic's March 2024 release of three models (Haiku, Sonnet, Opus), each with a 200K window (claude‑3‑haiku‑20240307, claude‑3‑sonnet‑20240229).
- Moonshot‑v1: Moonshot AI's 2023 release; 128K window (moonshot‑v1‑128k).
- ChatGLM3‑6B‑128k: ZHIPU‑AI's 2024 release; 128K context.
- InternLM2: Shanghai AI Lab's 2024 release; supports 200K‑token inference.
- Yi‑6B‑200k: 01.AI's 2023 release; 200K window.
- Yarn‑Mistral‑7B‑128k: NousResearch's 2023 release; 128K window via the YaRN position‑embedding extension.
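When running the longer MedOdyssey length levels, it helps to check up front whether a prompt fits a model's context window. The sketch below maps the checkpoints above to their stated windows and uses a rough heuristic of about 4 characters per token for English text; the mapping dictionary, helper name, and ratio are assumptions, and real tokenizers should be used for exact counts.

```python
# Sketch: per-model context windows (from the baseline list above) and a
# rough fit check. The ~4 chars/token ratio is a heuristic, not a tokenizer.

CONTEXT_WINDOWS = {
    "gpt-4-turbo-2024-04-09": 128_000,
    "gpt-4o-2024-05-13": 128_000,
    "claude-3-haiku-20240307": 200_000,
    "claude-3-sonnet-20240229": 200_000,
    "moonshot-v1-128k": 128_000,
    "chatglm3-6b-128k": 128_000,
    "internlm2": 200_000,
    "yi-6b-200k": 200_000,
    "yarn-mistral-7b-128k": 128_000,
}

def fits(model: str, prompt: str, chars_per_token: float = 4.0) -> bool:
    """Estimate whether `prompt` fits the model's context window."""
    est_tokens = len(prompt) / chars_per_token
    return est_tokens <= CONTEXT_WINDOWS[model]

# ~250K estimated tokens: too long for a 128K model, fine for a 200K one.
fits("gpt-4o-2024-05-13", "x" * 1_000_000)
fits("claude-3-sonnet-20240229", "x" * 400_000)
```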