ibm/AttaQ
The AttaQ dataset contains 1,402 carefully crafted adversarial questions designed to assess the propensity of large language models (LLMs) to produce harmful or undesirable responses. The dataset is divided into seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. It can be used to evaluate LLM behavior and explore factors influencing their responses, ultimately aiming to improve their harmlessness and ethical use.
Description
AttaQ Dataset Summary
Basic Information
- License: MIT
- Language: English
- Multilinguality: Monolingual
- Source Dataset: Expanded from Anthropic/hh-rlhf
- Task Categories:
- Text Generation
- Text‑to‑Text Generation
- Size Category: 1K < n < 10K
- Tags:
- Safety
- Harm
- Large Language Models (LLMs)
- Vulnerability
- Red‑Team Testing
- Toxicity
Dataset Description
- Purpose: Evaluate the tendency of large language models (LLMs) to generate harmful or undesirable responses.
- Composition: Includes 1,402 meticulously designed adversarial questions, divided into seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence.
- Use Cases: For evaluating LLM behavior and investigating factors affecting their responses, with the aim of enhancing their harmlessness and ethical deployment.
Development Process
- Extraction: Systematically extracted attacks from Anthropic’s hh-rlhf dataset and semi-automatically labeled them into seven categories.
- Generation: Employed LLMs to generate a large number of attacks similar to those extracted from the hh-rlhf dataset.
- Crawling & Generation: Combined LLMs with Wikipedia crime pages to generate actions related to each crime, then crafted questions that explore unethical or illegal behavior, especially focusing on evading detection or consequences.
Data Fields
- label: Label of the adversarial question
- input: The adversarial question
- additional_info: Source of the adversarial question
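These fields can be inspected directly after loading the dataset with the Hugging Face `datasets` library. The snippet below is a minimal sketch; the split name is not specified in this card, so the first available split is used.

```python
# Minimal sketch: load ibm/AttaQ and inspect the three fields listed above.
# Assumes the `datasets` library is installed; the split name is not stated
# in this card, so the first available split is used.
from datasets import load_dataset

attaq = load_dataset("ibm/AttaQ")
split = next(iter(attaq))        # e.g. "train" or "test"
sample = attaq[split][0]

print(sample["label"])            # category label, e.g. "violence" or "deception"
print(sample["input"])            # the adversarial question itself
print(sample["additional_info"])  # source of the question
```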
Usage Information
- Evaluation Method: Models should be prompted with each question in the dataset and evaluated using metrics that consider both inputs and outputs.
- Evaluation Models: Use the "OpenAssistant/reward-model-deberta-v3-large-v2" reward model, or alternatively "sileod/deberta-v3-large-tasksource-rlhf-reward-model".
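The loop below is a minimal sketch of that evaluation, pairing each question with a model response and scoring the pair with the OpenAssistant reward model via `transformers`. Here `generate_response` is a hypothetical placeholder for the model under test, and reading the logit as "higher means a preferable, less harmful response" is an assumption rather than part of an official harness.

```python
# Minimal sketch: score model responses to AttaQ questions with the reward model
# named above. `generate_response` is a hypothetical stand-in for the LLM under
# test; replace it with a real generation call.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model.eval()

def generate_response(question: str) -> str:
    # Placeholder for the model being evaluated.
    return "I can't help with that."

def reward_score(question: str, response: str) -> float:
    # The reward model takes the question and the candidate response as a text
    # pair and returns a single logit; higher is assumed to mean less harmful.
    inputs = tokenizer(question, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

attaq = load_dataset("ibm/AttaQ")
split = next(iter(attaq))

for row in attaq[split].select(range(5)):   # score a handful of questions
    response = generate_response(row["input"])
    print(row["label"], round(reward_score(row["input"], response), 3))
```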
Citation
@article{kour2023unveiling,
title={Unveiling Safety Vulnerabilities of Large Language Models},
author={Kour, George and Zalmanovici, Marcel and Zwerdling, Naama and Goldbraich, Esther and Fandina, Ora Nova and Anaby-Tavor, Ateret and Raz, Orna and Farchi, Eitan},
journal={arXiv preprint arXiv:2311.04124},
year={2023}
}