LogicGame
LogicGame is a benchmark for evaluating large language models' (LLMs) understanding, execution, and planning of logical rules. It includes a diverse set of games with predefined rules, specifically designed to assess logical reasoning independently of factual knowledge. The benchmark measures model performance across varying difficulty levels, aiming for a comprehensive evaluation of rule‑based reasoning and multi‑step execution and planning capabilities.
Description
LogicGame-Data Dataset Overview
Data Description
The project includes four .jsonl files: en_dev, zh_dev, en_all, and zh_all, which are the development and full sets for English and Chinese respectively. Each development set contains 10 entries, and each full set contains 304 entries.
- zh_all and en_all are used for Codabench submissions; the contexts field can serve as the prompt when generating model responses for evaluation.
- The development set is intended for detailed illustration.
Development Set Fields
- qid: Unique identifier for each entry.
- contexts: Benchmark question, combining rules, the problem, and output constraints.
- reference: Reference JSON answer and process.
- level: Difficulty level ranging from 0 to 3.
- examples: Few‑shot examples.
- category: Data type/category.
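The development-set schema above can be exercised with a short parsing sketch. The entry below is invented for illustration (its field values are placeholders, not real benchmark data); only the field names and the 0-3 level range come from the description above.

```python
import json

# Illustrative JSONL line; real entries come from en_dev.jsonl / zh_dev.jsonl.
# All field values here are placeholders.
sample_line = json.dumps({
    "qid": "dev-0001",
    "contexts": "Rules ... Problem ... Output constraints ...",
    "reference": {"answer": "...", "process": ["..."]},
    "level": 1,
    "examples": [],
    "category": "execution",
})

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of entry dicts."""
    return [json.loads(line) for line in lines if line.strip()]

entries = parse_jsonl([sample_line])
entry = entries[0]
assert 0 <= entry["level"] <= 3  # difficulty levels range from 0 to 3
print(entry["qid"], entry["category"])  # prints: dev-0001 execution
```

Reading line by line (rather than parsing the whole file as one JSON document) is the standard way to consume JSONL files.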
Full Set Fields
- qid: Unique identifier.
- contexts: Benchmark question.
- level: Difficulty level.
- category: Data category/task.
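Since the full set drops the reference and examples fields, an evaluation run reduces to iterating entries and collecting one model response per qid. Below is a minimal sketch, assuming a placeholder model function and hypothetical in-memory data; keying outputs by qid is only one plausible shape, so the actual Codabench submission format should be checked on the competition page.

```python
import json

# Placeholder model; substitute a real LLM call here.
def my_model(prompt: str) -> str:
    return '{"answer": "..."}'  # hypothetical response

# Invented full-set lines; real data comes from en_all.jsonl / zh_all.jsonl.
lines = [
    json.dumps({"qid": f"q{i}", "contexts": f"Rules and problem {i}",
                "level": i % 4, "category": "planning"})
    for i in range(4)
]

# Use each entry's contexts field as the prompt and key responses by qid.
responses = {}
for line in lines:
    entry = json.loads(line)
    responses[entry["qid"]] = my_model(entry["contexts"])

print(len(responses))  # one response per entry: 4
```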
Leaderboard
The tables below show the performance of 14 models on the LogicGame benchmark, with the best scores highlighted in bold. AP-Acc is answer-and-process accuracy (both the final answer and the reasoning process must be correct), A-Acc is answer accuracy, and P-Acc is process accuracy; IFError and JSError are the rates of instruction-following errors and JSON-format errors, respectively.
Chinese Models Performance
| Model | AP-Acc% | A-Acc% | P-Acc% | IFError% | JSError% |
|---|---|---|---|---|---|
| o1-preview | 54.93 | 67.11 | 66.85 | 0.00 | 0.00 |
| o1-mini | 51.97 | 63.49 | 64.97 | 0.00 | 0.00 |
(Remaining rows omitted for brevity.)
English Models Performance
| Model | AP-Acc% | A-Acc% | P-Acc% | IFError% | JSError% |
|---|---|---|---|---|---|
| o1-preview | 53.29 | 65.46 | 64.82 | 0.33 | 0.00 |
| o1-mini | 49.67 | 61.18 | 63.25 | 0.66 | 0.33 |
(Remaining rows omitted for brevity.)
Source
Organization: GitHub
Created: 9/28/2024