Dataset asset · Open Source Community · Logical Reasoning · Large Language Model Evaluation

LogicGame

LogicGame is a benchmark for evaluating how well large language models (LLMs) understand, execute, and plan around logical rules. It includes a diverse set of games with predefined rules, specifically designed to assess logical reasoning independently of factual knowledge. The benchmark measures model performance across varying difficulty levels, aiming for a comprehensive evaluation of rule-based reasoning and of multi-step execution and planning capabilities.

Source
GitHub
Created
Sep 28, 2024
Updated
Oct 10, 2024
Overview

Dataset description and usage context

LogicGame-Data Dataset Overview

Data Description

The project includes four .jsonl files: en_dev, zh_dev, en_all, and zh_all, representing development and full sets for English and Chinese respectively. The development set contains 10 entries per language, while the full set contains 304 entries per language.

  • zh_all and en_all are used for Codabench submissions; during evaluation, the contexts field can be used directly as the prompt for model responses.
  • The development set is intended to illustrate the data format in detail.
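The four splits described above can be loaded with a small helper. This is a minimal sketch: the exact file names (`en_dev.jsonl`, etc.) are assumptions inferred from the split names in the description, not confirmed paths in the repository.

```python
import json

def load_jsonl(path):
    """Read a .jsonl file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical usage, assuming file names that match the split names above:
# en_dev = load_jsonl("en_dev.jsonl")   # development set, 10 entries
# en_all = load_jsonl("en_all.jsonl")   # full set, 304 entries
```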

Development Set Fields

  • qid: Unique identifier for each entry.
  • contexts: Benchmark question, combining rules, the problem, and output constraints.
  • reference: Reference JSON answer and process.
  • level: Difficulty level ranging from 0 to 3.
  • examples: Few‑shot examples.
  • category: Data type/category.

Full Set Fields

  • qid: Unique identifier.
  • contexts: Benchmark question.
  • level: Difficulty level.
  • category: Data category/task.
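Given the fields above, typical first steps are to tally entries per difficulty level and to turn an entry into a prompt. A minimal sketch, assuming entries are dicts with the field names listed above (the helper names here are ours, not part of the official code):

```python
from collections import Counter

def level_distribution(entries):
    """Tally entries per difficulty level (0-3)."""
    return Counter(entry["level"] for entry in entries)

def to_prompt(entry):
    """The contexts field already bundles rules, problem, and output
    constraints, so it can be passed to a model verbatim."""
    return entry["contexts"]
```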

Leaderboard

The tables below show the performance of 14 models on the LogicGame benchmark, with the best scores highlighted in bold. AP-Acc, A-Acc, and P-Acc report answer-and-process, answer-only, and process-only accuracy, respectively; IF-Error and JS-Error report the rates of instruction-following violations and unparseable JSON output.

Performance on the Chinese Set

Model         AP-Acc%   A-Acc%   P-Acc%   IF-Error%   JS-Error%
o1-preview    54.93     67.11    66.85    0.00        0.00
o1-mini       51.97     63.49    64.97    0.00        0.00
... (remaining rows omitted for brevity)

Performance on the English Set

Model         AP-Acc%   A-Acc%   P-Acc%   IF-Error%   JS-Error%
o1-preview    53.29     65.46    64.82    0.33        0.00
o1-mini       49.67     61.18    63.25    0.66        0.33
... (remaining rows omitted for brevity)
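Since the reference answers are JSON and the JS-Error column tracks unparseable model output, a scoring pipeline typically checks parseability first. A minimal sketch; the function name is ours and not part of the official evaluation code:

```python
import json

def try_parse_json(reply):
    """Return the parsed object, or None when the reply is not valid JSON --
    the failure mode the JS-Error% column measures."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None
```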