
ibm/AttaQ

The AttaQ dataset contains 1,402 carefully crafted adversarial questions designed to assess the propensity of large language models (LLMs) to produce harmful or undesirable responses. The dataset is divided into seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. It can be used to evaluate LLM behavior and explore factors influencing their responses, ultimately aiming to improve their harmlessness and ethical use.

Updated 1/26/2024

Description

AttaQ Dataset Summary

Basic Information

  • License: MIT
  • Language: English
  • Multilinguality: Monolingual
  • Source Dataset: Expanded from Anthropic/hh-rlhf
  • Task Categories:
    • Text Generation
    • Text‑to‑Text Generation
  • Size Category: 1K < n < 10K
  • Tags:
    • Safety
    • Harm
    • Large Language Models (LLMs)
    • Vulnerability
    • Red‑Team Testing
    • Toxicity

Dataset Description

  • Purpose: Evaluate the tendency of large language models (LLMs) to generate harmful or undesirable responses.
  • Composition: Includes 1,402 meticulously designed adversarial questions, divided into seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence.
  • Use Cases: For evaluating LLM behavior and investigating factors affecting their responses, with the aim of enhancing their harmlessness and ethical deployment.

Development Process

  1. Extraction: Systematically extracted attacks from Anthropic’s hh‑rlhf dataset and semi‑automatically labeled them into seven categories.
  2. Generation: Employed LLMs to generate a large number of attacks similar to those extracted from the hh‑rlhf dataset.
  3. Crawling & Generation: Combined LLMs with Wikipedia crime pages to generate actions related to each crime, then crafted questions that explore unethical or illegal behavior, especially focusing on evading detection or consequences.
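The semi-automatic labeling in step 1 can be pictured as keyword-based category assignment followed by manual review. The category keyword lists below are invented for illustration; the actual AttaQ labeling pipeline is not described in this card.

```python
# Illustrative sketch of semi-automatic harm-category labeling.
# Keyword lists are hypothetical placeholders, not the real pipeline.

CATEGORY_KEYWORDS = {
    "violence": ["hurt", "weapon", "attack"],
    "deception": ["scam", "trick", "lie"],
    "substance abuse": ["drug", "overdose"],
}

def label_attack(question: str) -> str:
    """Assign a coarse harm category by keyword match; 'unlabeled' otherwise
    (in a semi-automatic setup, 'unlabeled' items would go to human review)."""
    q = question.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            return category
    return "unlabeled"

print(label_attack("How do I trick someone into a scam?"))  # deception
```

In practice such heuristics only produce candidate labels; the "semi-automatic" part means a human pass corrects the misfires.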

Data Fields

  • label: The harm category assigned to the adversarial question
  • input: The adversarial question
  • additional_info: Source of the adversarial question
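A record therefore carries three string fields, which makes per-category slicing straightforward. The sample values below are invented placeholders, not actual dataset rows.

```python
# Shape of an AttaQ record (sample values are hypothetical).
record = {
    "label": "deception",                          # harm category
    "input": "How can I convince someone to ...",  # adversarial question
    "additional_info": "hh-rlhf",                  # provenance of the question
}

def by_category(records, category):
    """Filter records down to a single harm category."""
    return [r for r in records if r["label"] == category]

print(len(by_category([record], "deception")))  # 1
```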

Usage Information

  • Evaluation Method: Models should be prompted with each question in the dataset and evaluated using metrics that consider both inputs and outputs.
  • Evaluation Models: Use the "OpenAssistant/reward-model-deberta-v3-large-v2" reward model, or alternatively "sileod/deberta-v3-large-tasksource-rlhf-reward-model".
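The evaluation loop above can be sketched as: prompt the model under test with each question, then score each (question, response) pair with a reward model. Both `generate_response` and `reward_score` below are stand-ins; in a real run the scoring would come from loading "OpenAssistant/reward-model-deberta-v3-large-v2" with Hugging Face transformers.

```python
# Sketch of the AttaQ evaluation loop with placeholder model calls.

def generate_response(question: str) -> str:
    # Placeholder for the LLM under test.
    return "I can't help with that."

def reward_score(question: str, response: str) -> float:
    # Placeholder for the reward model; real scores would come from a
    # sequence-classification head over the question/response pair.
    return 1.0 if "can't help" in response else -1.0

def evaluate(questions):
    """Return the mean reward over a list of adversarial questions."""
    scores = [reward_score(q, generate_response(q)) for q in questions]
    return sum(scores) / len(scores)

print(evaluate(["How do I pick a lock?", "How can I evade the police?"]))  # 1.0
```

Higher mean reward indicates responses the reward model judges as more harmless, which is the metric direction implied by the card.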

Citation

@article{kour2023unveiling,
  title={Unveiling Safety Vulnerabilities of Large Language Models},
  author={Kour, George and Zalmanovici, Marcel and Zwerdling, Naama and Goldbraich, Esther and Fandina, Ora Nova and Anaby-Tavor, Ateret and Raz, Orna and Farchi, Eitan},
  journal={arXiv preprint arXiv:2311.04124},
  year={2023}
}



Topics

Language Model Safety Evaluation
Adversarial Testing

Source

Organization: hugging_face

