Dataset asset · Open Source Community · Logical Reasoning · Large Language Model Evaluation

LogicGame

LogicGame is a benchmark for evaluating how well large language models (LLMs) understand, execute, and plan around logical rules. It includes a diverse set of games with predefined rules, specifically designed to assess logical reasoning independently of factual knowledge. The benchmark measures model performance across varying difficulty levels, aiming for a comprehensive evaluation of rule-based reasoning and of multi-step execution and planning capabilities.

Source
GitHub
Created
Sep 28, 2024
Updated
Oct 10, 2024
Overview

Dataset description and usage context

LogicGame-Data Dataset Overview

Data Description

The project includes four .jsonl files: en_dev, zh_dev, en_all, and zh_all, representing development and full sets for English and Chinese respectively. The development set contains 10 entries per language, while the full set contains 304 entries per language.

  • zh_all and en_all are used for Codabench submissions; during evaluation, the contexts field can be used directly as the prompt for model responses.
  • The development set is intended to illustrate the data format in detail.
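The four splits described above can be loaded with a small helper. This is a minimal sketch: the exact file names (`en_dev.jsonl`, etc.) are assumptions inferred from the split names in the description, not confirmed paths in the repository.

```python
import json

def load_jsonl(path):
    """Read a .jsonl file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical usage, assuming file names that match the split names above:
# en_dev = load_jsonl("en_dev.jsonl")   # development set, 10 entries
# en_all = load_jsonl("en_all.jsonl")   # full set, 304 entries
```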

Development Set Fields

  • qid: Unique identifier for each entry.
  • contexts: Benchmark question, combining rules, the problem, and output constraints.
  • reference: Reference JSON answer and process.
  • level: Difficulty level ranging from 0 to 3.
  • examples: Few‑shot examples.
  • category: Data type/category.

Full Set Fields

  • qid: Unique identifier.
  • contexts: Benchmark question.
  • level: Difficulty level.
  • category: Data category/task.
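Given the fields above, typical first steps are to tally entries per difficulty level and to turn an entry into a prompt. A minimal sketch, assuming entries are dicts with the field names listed above (the helper names here are ours, not part of the official code):

```python
from collections import Counter

def level_distribution(entries):
    """Tally entries per difficulty level (0-3)."""
    return Counter(entry["level"] for entry in entries)

def to_prompt(entry):
    """The contexts field already bundles rules, problem, and output
    constraints, so it can be passed to a model verbatim."""
    return entry["contexts"]
```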

Leaderboard

The tables below show the performance of 14 models on the LogicGame benchmark, with the best scores highlighted in bold. AP-Acc, A-Acc, and P-Acc report answer-and-process, answer-only, and process-only accuracy, respectively; IF-Error and JS-Error report the rates of instruction-following violations and unparseable JSON output.

Performance on the Chinese Set

Model         AP-Acc%   A-Acc%   P-Acc%   IF-Error%   JS-Error%
o1-preview    54.93     67.11    66.85    0.00        0.00
o1-mini       51.97     63.49    64.97    0.00        0.00
... (remaining rows omitted for brevity)

Performance on the English Set

Model         AP-Acc%   A-Acc%   P-Acc%   IF-Error%   JS-Error%
o1-preview    53.29     65.46    64.82    0.33        0.00
o1-mini       49.67     61.18    63.25    0.66        0.33
... (remaining rows omitted for brevity)
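Since the reference answers are JSON and the JS-Error column tracks unparseable model output, a scoring pipeline typically checks parseability first. A minimal sketch; the function name is ours and not part of the official evaluation code:

```python
import json

def try_parse_json(reply):
    """Return the parsed object, or None when the reply is not valid JSON --
    the failure mode the JS-Error% column measures."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None
```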