Explore high-quality datasets for your AI and machine learning projects.
This dataset is a large‑scale corpus for sarcasm research and for training and evaluating sarcasm detection systems. It contains 1.3 million sarcastic statements—ten times larger than any previous dataset—and a larger number of non‑sarcastic statements, enabling learning under both balanced and imbalanced label regimes. Each statement is self‑annotated (the sarcasm label is provided by the author rather than an external annotator) and includes user, topic, and dialogue context. The dataset’s accuracy has been evaluated, a sarcasm detection benchmark established, and baseline methods assessed.