ibm/AttaQ
The AttaQ dataset contains 1,402 carefully crafted adversarial questions designed to assess the propensity of large language models (LLMs) to produce harmful or undesirable responses. The dataset is divided into seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. It can be used to evaluate LLM behavior and explore factors influencing their responses, ultimately aiming to improve their harmlessness and ethical use.
Description
AttaQ Dataset Summary
Basic Information
- License: MIT
- Language: English
- Multilinguality: Monolingual
- Source Dataset: Expanded from Anthropic/hh-rlhf
- Task Categories:
- Text Generation
- Text‑to‑Text Generation
- Size Category: 1K < n < 10K
- Tags:
- Safety
- Harm
- Large Language Models (LLMs)
- Vulnerability
- Red‑Team Testing
- Toxicity
Dataset Description
- Purpose: Evaluate the tendency of large language models (LLMs) to generate harmful or undesirable responses.
- Composition: Includes 1,402 meticulously designed adversarial questions, divided into seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence.
- Use Cases: For evaluating LLM behavior and investigating factors affecting their responses, with the aim of enhancing their harmlessness and ethical deployment.
Development Process
- Extraction: Systematically extracted attacks from Anthropic’s hh-rlhf dataset and semi-automatically labeled them into seven categories.
- Generation: Employed LLMs to generate a large number of attacks similar to those extracted from the hh-rlhf dataset.
- Crawling & Generation: Combined LLMs with Wikipedia crime pages to generate actions related to each crime, then crafted questions that explore unethical or illegal behavior, especially focusing on evading detection or consequences.
Data Fields
- label: Label of the adversarial question
- input: The adversarial question
- additional_info: Source of the adversarial question
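These fields can be inspected directly after loading the dataset with the Hugging Face `datasets` library. The snippet below is a minimal sketch; the split name is not specified in this card, so the first available split is used.

```python
# Minimal sketch: load ibm/AttaQ and inspect the three fields listed above.
# Assumes the `datasets` library is installed; the split name is not stated
# in this card, so the first available split is used.
from datasets import load_dataset

attaq = load_dataset("ibm/AttaQ")
split = next(iter(attaq))        # e.g. "train" or "test"
sample = attaq[split][0]

print(sample["label"])            # category label, e.g. "violence" or "deception"
print(sample["input"])            # the adversarial question itself
print(sample["additional_info"])  # source of the question
```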
Usage Information
- Evaluation Method: Models should be prompted with each question in the dataset and evaluated using metrics that consider both inputs and outputs.
- Evaluation Models: Use the "OpenAssistant/reward-model-deberta-v3-large-v2" reward model, or alternatively "sileod/deberta-v3-large-tasksource-rlhf-reward-model".
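The loop below is a minimal sketch of that evaluation, pairing each question with a model response and scoring the pair with the OpenAssistant reward model via `transformers`. Here `generate_response` is a hypothetical placeholder for the model under test, and reading the logit as "higher means a preferable, less harmful response" is an assumption rather than part of an official harness.

```python
# Minimal sketch: score model responses to AttaQ questions with the reward model
# named above. `generate_response` is a hypothetical stand-in for the LLM under
# test; replace it with a real generation call.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model.eval()

def generate_response(question: str) -> str:
    # Placeholder for the model being evaluated.
    return "I can't help with that."

def reward_score(question: str, response: str) -> float:
    # The reward model takes the question and the candidate response as a text
    # pair and returns a single logit; higher is assumed to mean less harmful.
    inputs = tokenizer(question, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

attaq = load_dataset("ibm/AttaQ")
split = next(iter(attaq))

for row in attaq[split].select(range(5)):   # score a handful of questions
    response = generate_response(row["input"])
    print(row["label"], round(reward_score(row["input"], response), 3))
```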
Citation
@article{kour2023unveiling,
title={Unveiling Safety Vulnerabilities of Large Language Models},
author={Kour, George and Zalmanovici, Marcel and Zwerdling, Naama and Goldbraich, Esther and Fandina, Ora Nova and Anaby-Tavor, Ateret and Raz, Orna and Farchi, Eitan},
journal={arXiv preprint arXiv:2311.04124},
year={2023}
}