
MedOdyssey

MedOdyssey is a medical‑domain long‑context evaluation benchmark co‑created by East China University of Science and Technology and the Shanghai Artificial Intelligence Laboratory. It comprises 10 datasets spanning medical corpora such as books, guidelines, case reports, and knowledge graphs, all built from open‑source and royalty‑free medical data. The benchmark assesses large language models’ performance on long‑context tasks, particularly in medical applications such as electronic health‑record analysis and biomedical terminology standardisation.

Source
arXiv
Created
Jun 21, 2024
Updated
Jun 21, 2024
Overview


MedOdyssey: A Medical Domain Benchmark for Long‑Context Evaluation Up to 200K Tokens

Introduction

MedOdyssey is a medical long‑context benchmark consisting of seven length levels ranging from 4K to 200K tokens. It comprises two main components: a "needle‑in‑a‑haystack" medical context retrieval task and a suite of medical‑specific tasks, together covering 10 datasets.
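As a rough illustration of how a "needle‑in‑a‑haystack" probe works (this is a hypothetical sketch, not the benchmark's released code; all function names are illustrative), a known fact is inserted at a chosen depth into a long distractor context, the model is asked to retrieve it, and exact‑match accuracy is scored:

```python
def build_niah_prompt(haystack: str, needle: str, question: str, depth: float) -> str:
    """Insert `needle` into `haystack` at fractional `depth`
    (0.0 = start of context, 1.0 = end), then append the question."""
    pos = int(len(haystack) * depth)
    context = haystack[:pos] + "\n" + needle + "\n" + haystack[pos:]
    return f"{context}\n\nQuestion: {question}\nAnswer:"

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of model answers that exactly match the reference answer."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

Sweeping `depth` over several positions and the context over several length levels (4K up to 200K tokens here) yields the familiar accuracy‑by‑depth‑and‑length grid.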

Dataset Statistics

| Task        | Annotations  | #Examples | Avg. Length     | Evaluation Metric     |
|-------------|--------------|-----------|-----------------|-----------------------|
| En.NIAH     | Auto & Human | 20×7×5    | 179.2k / 32     | Accuracy              |
| Zh.NIAH     | Auto & Human | 20×7×5    | 45.6k / 10.2    | Accuracy              |
| En.Counting | Auto         | 4×7       | 179.0k / 13.6   | Accuracy              |
| Zh.Counting | Auto         | 4×7       | 45.6k / 12.3    | Accuracy              |
| En.KG       | Auto & Human | 100       | 186.4k / 68.8   | Precision, Recall, F1 |
| Zh.KG       | Auto & Human | 100       | 42.5k / 2.0     | Precision, Recall, F1 |
| En.Term     | Auto         | 100       | 183.1k / 11.7   | Accuracy              |
| Zh.Term     | Auto         | 100       | 32.6k / 7.0     | Accuracy              |
| Zh.Case     | Auto & Human | 100       | 47.7k / 1.3     | Accuracy              |
| Zh.Table    | Auto & Human | 100       | 53.6k / 1.4     | Precision, Recall, F1 |

Tasks are additionally flagged in the original benchmark for MIC (Maximum Identical Context), NFI (New Fact Injection), and CIR (Counter‑Intuitive Reasoning).
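For the KG and Table tasks scored with precision, recall, and F1, a natural scoring scheme (a minimal sketch assuming set‑valued outputs, e.g. extracted knowledge‑graph triples; not the benchmark's official scorer) compares the predicted set against the gold set:

```python
def set_prf(predicted: set, gold: set) -> tuple[float, float, float]:
    """Precision/recall/F1 over set-valued outputs, e.g.
    (head, relation, tail) triples extracted from a long document."""
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, predicting two triples of which one is in the gold set of one triple gives precision 0.5, recall 1.0, and F1 ≈ 0.67.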

Baseline Models

We evaluate current state‑of‑the‑art long‑context LLMs on MedOdyssey.

  • GPT‑4: Released March 2023 by OpenAI with an original 8,192‑token window, extended to 128K with GPT‑4 Turbo in November 2023 (evaluated as gpt‑4‑turbo‑2024‑04‑09).
  • GPT‑4o: Optimised variant released May 2024; 128K window, knowledge cut‑off October 2023 (gpt‑4o‑2024‑05‑13).
  • Claude 3: Anthropic’s March 2024 release; three models (Haiku, Sonnet, Opus) with 200K windows (claude‑3‑haiku‑20240307, claude‑3‑sonnet‑20240229).
  • Moonshot‑v1: Moonshot AI’s 2023 release; 128K window (moonshot‑v1‑128k).
  • ChatGLM3‑6B‑128k: ZHIPU‑AI’s 2024 release; 128K context.
  • InternLM2: Shanghai AI Laboratory’s 2024 release; supports inference up to 200K tokens.
  • Yi‑6B‑200k: 01.AI’s 2023 release; 200K window.
  • Yarn‑Mistral‑7B‑128k: NousResearch’s 2023 release; 128K window using YaRN.
