CreativeLang/SARC_Sarcasm

This dataset is a large-scale corpus for sarcasm research and for training and evaluating sarcasm detection systems. It contains 1.3 million sarcastic statements, ten times more than any previous dataset, and an even larger number of non-sarcastic statements, which allows learning under both balanced and imbalanced label regimes. Each statement is self-annotated (the sarcasm label is provided by the author rather than by an external annotator) and comes with user, topic, and dialogue context. The corpus has been evaluated for accuracy, a sarcasm detection benchmark has been established, and baseline methods have been assessed on it.

Updated 7/11/2023

hugging_face

Description

Dataset Overview

Dataset Name

Name: SARC_Sarcasm

Dataset Features

Feature List:
- text: string
- author: string
- score: int64
- ups: int64
- downs: int64
- date: string
- created_utc: int64
- subreddit: string
- id: string
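
The feature list above can be written down as a typed record, which is handy for sanity-checking rows before training. This is a minimal sketch, not part of the dataset's tooling: the names SarcRow, parse_row, and created_at are hypothetical, the sample values are made up, and reading created_utc as Unix epoch seconds is an assumption based on the usual Reddit convention. Real rows may also contain missing values that this strict check would reject.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class SarcRow:
    """One record, mirroring the columns in the feature list (hypothetical class)."""
    text: str
    author: str
    score: int
    ups: int
    downs: int
    date: str
    created_utc: int
    subreddit: str
    id: str


def parse_row(raw: Dict[str, Any]) -> SarcRow:
    """Check a raw mapping against the listed columns and types, then build a SarcRow."""
    expected = {f.name: f.type for f in fields(SarcRow)}
    missing = expected.keys() - raw.keys()
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    for name, typ in expected.items():
        if not isinstance(raw[name], typ):
            raise TypeError(f"{name!r} should be {typ.__name__}, got {type(raw[name]).__name__}")
    return SarcRow(**{name: raw[name] for name in expected})


def created_at(row: SarcRow) -> datetime:
    """Interpret created_utc as Unix epoch seconds (an assumption, see lead-in)."""
    return datetime.fromtimestamp(row.created_utc, tz=timezone.utc)


# Made-up example row, for illustration only.
sample = {
    "text": "Oh great, another Monday.",
    "author": "example_user",
    "score": 3,
    "ups": 3,
    "downs": 0,
    "date": "2016-12",
    "created_utc": 1480550400,
    "subreddit": "AskReddit",
    "id": "abc123",
}
row = parse_row(sample)
```

Extra keys in the input mapping are ignored, so the same check still works if the hosted data carries columns beyond those listed here.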

Dataset Splits

Training Set:
- Samples: 12,704,751
- Size: 1,764,500,045 bytes
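
With more than 12 million samples in the single training split, it can help to look at a few rows before downloading everything. The sketch below uses the Hugging Face datasets library in streaming mode. It assumes that the card title "CreativeLang/SARC_Sarcasm" is the repository id on the Hub and that the split is named "train"; neither is confirmed by this page. The helper name peek_rows is hypothetical.

```python
from itertools import islice
from typing import Any, Dict, List

# Assumption: the card title doubles as the Hub repository id.
DATASET_ID = "CreativeLang/SARC_Sarcasm"
# Assumption: the "Training Set" listed above is exposed as the "train" split.
SPLIT = "train"


def peek_rows(limit: int = 3) -> List[Dict[str, Any]]:
    """Return the first `limit` rows without downloading the full split."""
    # Imported lazily so the module can be loaded without the library installed.
    from datasets import load_dataset

    stream = load_dataset(DATASET_ID, split=SPLIT, streaming=True)
    return list(islice(stream, limit))


# Example usage (requires network access and `pip install datasets`):
#     for row in peek_rows():
#         print(row["subreddit"], "|", row["text"][:80])
```

Streaming avoids materializing the whole split on disk, which is the main reason to prefer it for a first look at a corpus of this size.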

Dataset Size

Download Size: 903,559,115 bytes
Total Size: 1,764,500,045 bytes
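
The figures above allow a quick back-of-the-envelope check of storage needs. The sketch below only does arithmetic on the numbers reported on this page; the function names are hypothetical, and treating "Total Size" as the on-disk size after download is an assumption.

```python
# Figures copied from the Dataset Splits and Dataset Size fields.
TRAIN_SAMPLES = 12_704_751
TOTAL_BYTES = 1_764_500_045
DOWNLOAD_BYTES = 903_559_115


def bytes_per_sample() -> float:
    """Average size of one sample in the training split."""
    return TOTAL_BYTES / TRAIN_SAMPLES


def download_ratio() -> float:
    """Download size as a fraction of the total size."""
    return DOWNLOAD_BYTES / TOTAL_BYTES


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


print(f"average sample size: {bytes_per_sample():.1f} bytes")
print(f"download / total:    {download_ratio():.2f}")
print(f"total size:          {to_gib(TOTAL_BYTES):.2f} GiB")
```

On these numbers, the download is roughly half the total size and an average sample takes a little under 140 bytes.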

License

License Type: cc-by-2.0

Dataset Description

Purpose: Sarcasm research; training and evaluating sarcasm detection systems
Scale: 1.3 million sarcastic statements, ten times more than any previous dataset
Annotation Method: Self-annotation by the statement's author
Content: Includes user, topic, and dialogue context
Evaluation & Benchmark: Accuracy evaluated, sarcasm detection benchmark established, baseline methods assessed

Dataset Metadata

Type: Sarcasm
Task Type: Detection
Creation Year: 2018

Topics

Sarcasm Detection

Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
