LogicGame is a benchmark for evaluating how well large language models (LLMs) understand, execute, and plan according to logical rules. It includes a diverse set of games with predefined rules, specifically designed to assess logical reasoning independently of factual knowledge. The benchmark measures model performance across varying difficulty levels, aiming for a comprehensive evaluation of rule‑based reasoning as well as multi‑step execution and planning capabilities.
CTIBench is a comprehensive benchmark suite and dataset designed to evaluate large language models (LLMs) on cyber‑threat intelligence (CTI) tasks. The dataset comprises multiple tasks: multiple‑choice questions (CTI‑MCQ), vulnerability classification (CTI‑RCM), vulnerability scoring (CTI‑VSP), and threat‑report analysis (CTI‑TAA). Each task is provided as a TSV file pairing prompts with their correct answers. The data were curated by Md Tanvirul Alam and Dipkamal Bhusal, drawing on authoritative sources such as NIST, MITRE, and the GDPR.
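Since each CTIBench task ships as a TSV file of prompts and answers, loading it is straightforward with Python's standard `csv` module. The sketch below parses an in-memory excerpt; the column names (`Prompt`, `Option A`–`Option D`, `GT`) are illustrative assumptions for the CTI‑MCQ task, not CTIBench's actual schema, so check the header row of the real files before reusing them.

```python
import csv
import io

# Hypothetical excerpt of a CTI-MCQ task file; real CTIBench column
# names may differ -- inspect the TSV header before relying on these.
sample_tsv = (
    "Prompt\tOption A\tOption B\tOption C\tOption D\tGT\n"
    "Which framework catalogs adversary tactics and techniques?"
    "\tMITRE ATT&CK\tNIST CSF\tGDPR\tCVSS\tA\n"
)

def load_rows(text):
    """Parse a tab-separated task file into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = load_rows(sample_tsv)
for row in rows:
    # Each row carries the prompt plus the ground-truth answer column.
    print(row["Prompt"], "->", row["GT"])
```

For the real files, replace the in-memory string with `open(path, newline="")` and pass the file object to `csv.DictReader` directly; the same pattern applies to the other task TSVs, only the answer column's meaning changes (e.g. a CVSS score for CTI‑VSP).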