
LogicGame

LogicGame is a benchmark for evaluating large language models' (LLMs) understanding, execution, and planning of logical rules. It includes a diverse set of games with predefined rules, specifically designed to assess logical reasoning independently of factual knowledge. The benchmark measures model performance across varying difficulty levels, aiming for a comprehensive evaluation of rule‑based reasoning and multi‑step execution and planning capabilities.

Updated 10/10/2024

Description

LogicGame-Data Dataset Overview

Data Description

The project includes four .jsonl files: en_dev, zh_dev, en_all, and zh_all, representing development and full sets for English and Chinese respectively. The development set contains 10 entries per language, while the full set contains 304 entries per language.

  • zh_all and en_all are used for Codabench submissions; the contexts field can serve as prompts for model responses during evaluation.
  • The development set is intended to illustrate the data format in detail.
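Since the files are JSON Lines, each line is an independent JSON object. A minimal loading sketch (the file name and the one-line sample record below are illustrative assumptions; real entries carry much longer rule and problem text in `contexts`):

```python
import json

def load_jsonl(path):
    """Parse one of the .jsonl files (e.g. an en_all file) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical one-line record illustrating the full-set layout.
sample_line = '{"qid": "en_000", "contexts": "Rules: ... Problem: ...", "level": 0, "category": "execution"}'
record = json.loads(sample_line)
print(record["qid"], record["category"])
```

For evaluation, iterating over the loaded records and sending each `contexts` string to the model as its prompt matches the usage described above.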

Development Set Fields

  • qid: Unique identifier for each entry.
  • contexts: Benchmark question, combining rules, the problem, and output constraints.
  • reference: Reference JSON answer and process.
  • level: Difficulty level ranging from 0 to 3.
  • examples: Few‑shot examples.
  • category: Data type/category.
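The field list above can be sanity-checked before running an evaluation. The entry below is a hypothetical illustration; in particular, the internal structure of `reference` (answer/process keys) is an assumption for this sketch, not a documented schema:

```python
import json

# Hypothetical development-set entry mirroring the documented fields.
dev_entry = {
    "qid": "zh_dev_000",
    "contexts": "Rules: ... Problem: ... Output format: JSON.",
    "reference": json.dumps({"answer": "...", "process": ["step 1", "step 2"]}),
    "level": 1,
    "examples": [],
    "category": "planning",
}

REQUIRED = {"qid", "contexts", "reference", "level", "examples", "category"}

def validate(entry):
    """Check that an entry has every documented field and a level in 0..3."""
    missing = REQUIRED - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if entry["level"] not in range(4):
        raise ValueError(f"level out of range: {entry['level']}")
    # `reference` stores the gold answer and process as a JSON string.
    return json.loads(entry["reference"])

reference = validate(dev_entry)
```

A failed check raises before any model calls are made, which is cheaper than discovering a malformed record mid-run.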

Full Set Fields

  • qid: Unique identifier.
  • contexts: Benchmark question.
  • level: Difficulty level.
  • category: Data category/task.

Leaderboard

The tables below report the performance of 14 models on the Chinese and English sets of the LogicGame benchmark. A-Acc and P-Acc denote answer and process accuracy, AP-Acc requires both to be correct, and IFError/JSError are the rates of instruction-following and JSON-format failures.

Performance on the Chinese Set

Model        AP-Acc%  A-Acc%  P-Acc%  IFError%  JSError%
o1-preview   54.93    67.11   66.85   0.00      0.00
o1-mini      51.97    63.49   64.97   0.00      0.00
... (remaining rows omitted for brevity)

Performance on the English Set

Model        AP-Acc%  A-Acc%  P-Acc%  IFError%  JSError%
o1-preview   53.29    65.46   64.82   0.33      0.00
o1-mini      49.67    61.18   63.25   0.66      0.33
... (remaining rows omitted for brevity)


Topics

Logical Reasoning
Large Language Model Evaluation

Source

Organization: github

Created: 9/28/2024
