
MedOdyssey

MedOdyssey is a medical‑domain long‑context evaluation benchmark co‑created by East China University of Science and Technology and Shanghai Artificial Intelligence Laboratory, comprising 10 complex datasets covering medical corpora such as books, guidelines, case reports, and knowledge graphs. The datasets are built from open‑source and royalty‑free medical data to assess large language models’ performance on long‑context tasks, particularly in medical applications like electronic health‑record analysis and biomedical terminology standardisation.

Updated 6/21/2024
arXiv

Description

MedOdyssey: A Medical Domain Benchmark for Long‑Context Evaluation Up to 200K Tokens

Introduction

MedOdyssey is a medical long‑context benchmark consisting of seven length levels ranging from 4K to 200K tokens. It comprises two main components: a "needle‑in‑a‑haystack" medical context retrieval task and a suite of medical‑specific tasks, together covering 10 datasets.
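As a rough illustration of the needle‑in‑a‑haystack setup, the sketch below trims a background document to a target length and inserts a short fact at a chosen relative depth. It is not the benchmark's construction code: the function name `build_niah_context`, the `depth` parameter, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
def build_niah_context(haystack: str, needle: str, target_len: int, depth: float) -> str:
    """Return a context of about `target_len` words with `needle` inserted.

    `depth` is the relative insertion position: 0.0 places the needle at the
    start of the context, 1.0 at the end. Words approximate tokens here
    (an assumption; a real setup would use the model's tokenizer).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")

    needle_words = needle.split()
    # Reserve room for the needle so the final context stays within target_len.
    budget = max(target_len - len(needle_words), 0)
    words = haystack.split()[:budget]

    position = round(len(words) * depth)
    return " ".join(words[:position] + needle_words + words[position:])


if __name__ == "__main__":
    # Hypothetical filler text and needle, for demonstration only.
    filler = " ".join(f"w{i}" for i in range(100))
    context = build_niah_context(filler, "NEEDLE FACT", target_len=50, depth=0.5)
    print(context)
```

A model under test would then be asked a question that can only be answered from the inserted fact, and the same needle would be repeated across several context lengths and depths.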

Dataset Statistics

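| Task | Annotations | #Examples | Avg. Length | MIC | NFI | CIR | Evaluation Metric |
|---|---|---|---|---|---|---|---|
| En.NIAH | Auto & Human | 20×7×5 | 179.2k/32 | | | | Accuracy |
| Zh.NIAH | Auto & Human | 20×7×5 | 45.6k/10.2 | | | | Accuracy |
| En.Counting | Auto | 4×7 | 179.0k/13.6 | | | | Accuracy |
| Zh.Counting | Auto | 4×7 | 45.6k/12.3 | | | | Accuracy |
| En.KG | Auto & Human | 100 | 186.4k/68.8 | | | | Precision, Recall, F1 |
| Zh.KG | Auto & Human | 100 | 42.5k/2.0 | | | | Precision, Recall, F1 |
| En.Term | Auto | 100 | 183.1k/11.7 | | | | Accuracy |
| Zh.Term | Auto | 100 | 32.6k/7.0 | | | | Accuracy |
| Zh.Case | Auto & Human | 100 | 47.7k/1.3 | | | | Accuracy |
| Zh.Table | Auto & Human | 100 | 53.6k/1.4 | | | | Precision, Recall, F1 |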

MIC: Maximum Identical Context, NFI: New Fact Injection, CIR: Counter‑Intuitive Reasoning.
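The KG and table tasks are scored with precision, recall, and F1. The sketch below shows one common way to compute these over sets of predicted and gold items. It assumes exact-match comparison of items; how MedOdyssey actually normalises and matches answers is not stated here, and the function name `precision_recall_f1` is illustrative only.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Compute set-based precision, recall, and F1 with exact matching.

    Exact matching is an assumption; the benchmark may normalise answers
    differently before comparing them.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical model output and reference answer.
    predicted = {"aspirin", "ibuprofen", "warfarin", "heparin"}
    gold = {"aspirin", "ibuprofen", "metformin"}
    print(precision_recall_f1(predicted, gold))
```

Returning zero when either set is empty avoids division errors on examples where the model produces no answer.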

Baseline Models

We evaluate current state‑of‑the‑art long‑context LLMs on MedOdyssey.

  • GPT‑4: Released by OpenAI in March 2023 with an 8,192‑token window; GPT‑4 Turbo extended the window to 128K in November 2023 (gpt-4-turbo-2024-04-09).
  • GPT‑4o: Optimised variant released in May 2024, with a 128K window and a knowledge cut‑off of October 2023 (gpt-4o-2024-05-13).
  • Claude 3: Anthropic's March 2024 release; three models (Haiku, Sonnet, Opus) with 200K windows (claude-3-haiku-20240307, claude-3-sonnet-20240229).
  • Moonshot‑v1: Released by Moonshot AI in 2023, 128K window (moonshot-v1-128k).
  • ChatGLM3‑6B‑128k: Released by Zhipu AI in 2024, 128K context.
  • InternLM2: Released by Shanghai AI Lab in 2024, supports 200K inference.
  • Yi‑6B‑200k: Released by 01.AI in 2023, 200K window.
  • Yarn‑Mistral‑7B‑128k: Released by NousResearch in 2023, 128K window using YaRN.
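Because the benchmark's longest level is 200K tokens while several baselines stop at 128K, it can help to check which models can take a given context without truncation. The sketch below encodes the window sizes listed above; the dictionary `CONTEXT_WINDOWS`, the helper `models_supporting`, and the reading of "K" as 1,000 tokens are assumptions for illustration, not part of the benchmark.

```python
# Context windows in tokens, taken from the baseline list above.
# "K" is read as 1,000 tokens here (an assumption; some vendors use 1,024).
CONTEXT_WINDOWS = {
    "GPT-4": 128_000,
    "GPT-4o": 128_000,
    "Claude 3": 200_000,
    "Moonshot-v1": 128_000,
    "ChatGLM3-6B-128k": 128_000,
    "InternLM2": 200_000,
    "Yi-6B-200k": 200_000,
    "Yarn-Mistral-7B-128k": 128_000,
}


def models_supporting(context_tokens: int) -> list:
    """Return the models whose window is at least `context_tokens`, sorted by name."""
    return sorted(
        name for name, window in CONTEXT_WINDOWS.items() if window >= context_tokens
    )


if __name__ == "__main__":
    print(models_supporting(200_000))
```

Models below the requested length would need their input truncated or would have to be skipped at that level.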

Topics

Medical Data
Artificial Intelligence

Source

Organization: arXiv

Created: 6/21/2024
