MedOdyssey
MedOdyssey is a medical‑domain long‑context evaluation benchmark co‑created by East China University of Science and Technology and the Shanghai Artificial Intelligence Laboratory. It comprises 10 datasets built from open‑source and royalty‑free medical corpora such as books, guidelines, case reports, and knowledge graphs, and assesses large language models' performance on long‑context tasks, particularly medical applications such as electronic health‑record analysis and biomedical terminology standardisation.
MedOdyssey: A Medical Domain Benchmark for Long‑Context Evaluation Up to 200K Tokens
Introduction
MedOdyssey is a medical long‑context benchmark consisting of seven length levels ranging from 4K to 200K tokens. It comprises two main components: a "needle‑in‑a‑haystack" medical context retrieval task and a suite of medical‑specific tasks, together covering 10 datasets.
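The "needle‑in‑a‑haystack" setup can be sketched as follows. This is a minimal illustration of the general probe‑construction technique, not the released MedOdyssey pipeline; the helper name, the example needle, and the character‑based truncation are all assumptions for illustration.

```python
# Minimal sketch of needle-in-a-haystack prompt construction (illustrative,
# not the official MedOdyssey code). A "needle" fact is inserted at a chosen
# relative depth inside a long medical context truncated to a target size.

def build_niah_prompt(haystack: str, needle: str, depth: float, max_chars: int) -> str:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end)
    of a context truncated to `max_chars` characters."""
    context = haystack[:max_chars]
    pos = int(len(context) * depth)
    return context[:pos] + " " + needle + " " + context[pos:]

# Hypothetical example: a fabricated fact (New Fact Injection) placed
# halfway into a 4K-character medical context.
prompt = build_niah_prompt(
    haystack="Background medical text. " * 1000,
    needle="The fictitious enzyme medolase is produced in the pancreas.",
    depth=0.5,
    max_chars=4000,
)
```

The model is then asked to retrieve the injected fact; varying `depth` and `max_chars` yields the depth × length grid that NIAH tasks report.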
Dataset Statistics
| Task | Annotations | #Examples | Avg. Length (context/answer) | MIC | NFI | CIR | Evaluation Metric |
|---|---|---|---|---|---|---|---|
| En.NIAH | Auto & Human | 20×7×5 | 179.2k/32 | ✔ | ✔ | ✘ | Accuracy |
| Zh.NIAH | Auto & Human | 20×7×5 | 45.6k/10.2 | ✔ | ✔ | ✘ | Accuracy |
| En.Counting | Auto | 4×7 | 179.0k/13.6 | ✔ | ✘ | ✔ | Accuracy |
| Zh.Counting | Auto | 4×7 | 45.6k/12.3 | ✔ | ✘ | ✔ | Accuracy |
| En.KG | Auto & Human | 100 | 186.4k/68.8 | ✔ | ✘ | ✔ | Precision, Recall, F1 |
| Zh.KG | Auto & Human | 100 | 42.5k/2.0 | ✔ | ✘ | ✔ | Precision, Recall, F1 |
| En.Term | Auto | 100 | 183.1k/11.7 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Term | Auto | 100 | 32.6k/7.0 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Case | Auto & Human | 100 | 47.7k/1.3 | ✔ | ✘ | ✘ | Accuracy |
| Zh.Table | Auto & Human | 100 | 53.6k/1.4 | ✔ | ✘ | ✘ | Precision, Recall, F1 |
MIC: Maximum Identical Context, NFI: New Fact Injection, CIR: Counter‑Intuitive Reasoning.
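For the tasks scored with Precision, Recall, and F1 (the KG and Table tasks), a set‑based scorer over extracted items is the standard approach. The sketch below assumes exact matching of predicted versus gold triples, which may differ from the benchmark's official matching rules; the function name and example triples are illustrative.

```python
# Sketch of set-based Precision/Recall/F1 scoring over extracted triples,
# assuming exact matching (the official matcher may be more lenient).

def prf1(predicted: set, gold: set) -> tuple:
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted knowledge-graph triples.
gold = {("aspirin", "treats", "headache"), ("insulin", "regulates", "glucose")}
pred = {("aspirin", "treats", "headache"), ("insulin", "causes", "glucose")}
p, r, f = prf1(pred, gold)  # p = 0.5, r = 0.5, f = 0.5
```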
Baseline Models
We evaluate current state‑of‑the‑art long‑context LLMs on MedOdyssey.
- GPT‑4: Released in March 2023 by OpenAI with an original 8,192‑token window; GPT‑4 Turbo extended this to 128K in November 2023. We evaluate the gpt‑4‑turbo‑2024‑04‑09 checkpoint.
- GPT‑4o: Optimised variant released in May 2024; 128K window, knowledge cut‑off October 2023 (gpt‑4o‑2024‑05‑13).
- Claude 3: Anthropic's March 2024 release of three models (Haiku, Sonnet, Opus), each with a 200K window (claude‑3‑haiku‑20240307, claude‑3‑sonnet‑20240229).
- Moonshot‑v1: Moonshot AI's 2023 release; 128K window (moonshot‑v1‑128k).
- ChatGLM3‑6B‑128k: ZHIPU‑AI's 2024 release; 128K context.
- InternLM2: Shanghai AI Lab's 2024 release; supports 200K‑token inference.
- Yi‑6B‑200k: 01.AI's 2023 release; 200K window.
- Yarn‑Mistral‑7B‑128k: NousResearch's 2023 release; 128K window via the YaRN position‑embedding extension.
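When running the longer MedOdyssey length levels, it helps to check up front whether a prompt fits a model's context window. The sketch below maps the checkpoints above to their stated windows and uses a rough heuristic of about 4 characters per token for English text; the mapping dictionary, helper name, and ratio are assumptions, and real tokenizers should be used for exact counts.

```python
# Sketch: per-model context windows (from the baseline list above) and a
# rough fit check. The ~4 chars/token ratio is a heuristic, not a tokenizer.

CONTEXT_WINDOWS = {
    "gpt-4-turbo-2024-04-09": 128_000,
    "gpt-4o-2024-05-13": 128_000,
    "claude-3-haiku-20240307": 200_000,
    "claude-3-sonnet-20240229": 200_000,
    "moonshot-v1-128k": 128_000,
    "chatglm3-6b-128k": 128_000,
    "internlm2": 200_000,
    "yi-6b-200k": 200_000,
    "yarn-mistral-7b-128k": 128_000,
}

def fits(model: str, prompt: str, chars_per_token: float = 4.0) -> bool:
    """Estimate whether `prompt` fits the model's context window."""
    est_tokens = len(prompt) / chars_per_token
    return est_tokens <= CONTEXT_WINDOWS[model]

# ~250K estimated tokens: too long for a 128K model, fine for a 200K one.
fits("gpt-4o-2024-05-13", "x" * 1_000_000)
fits("claude-3-sonnet-20240229", "x" * 400_000)
```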