Dataset asset | Open Source Community | Mental Health | Text Classification

tartuNLP/reddit-anhedonia

This dataset provides a re‑annotation of Reddit posts from the PRIMATE dataset, focused on detecting anhedonia (loss of interest or pleasure) in mental‑health contexts. Re‑annotation by mental‑health professionals adds finer‑grained labels and textual evidence, reveals many false‑positive cases, and yields a higher‑quality test set for anhedonia detection. The accompanying study highlights the need to address annotation quality in mental‑health datasets and calls for better methods to make NLP models for mental‑health assessment more reliable. Only the labels are included; the original post content is omitted. To use the dataset, first obtain access to PRIMATE, then use the provided scripts to map the labels onto its posts.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jul 1, 2024
Overview

Dataset description and usage context

PRIMATE Dataset Overview

Dataset Description

Dataset Summary

  • Topic: Addresses annotation quality issues in the PRIMATE dataset, especially in the annotations for the anhedonia symptom.
  • Improvement: Re‑annotation by mental‑health professionals introduces finer‑grained labels and textual spans as evidence, identifying a large number of false positives.
  • Purpose: Provides a higher‑quality test set for detecting anhedonia, emphasizing the need to resolve annotation quality problems in mental‑health datasets.

Using the Dataset

  • Dataset File: Place the primate_dataset.json file in the script directory.
  • Code Example: Use Python code to map dataset labels to PRIMATE posts.
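
The mapping step might look like the following minimal sketch. It assumes that primate_dataset.json is a JSON list of post objects and that primate_id is a zero‑based index into that list; neither assumption is confirmed here, and the helper names load_primate_posts and attach_posts are hypothetical. The scripts provided with the dataset remain the authoritative procedure.

```python
import json
from pathlib import Path


def load_primate_posts(path="primate_dataset.json"):
    """Load the PRIMATE file (assumed to be a JSON list of post objects)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def attach_posts(labels, posts):
    """Pair each label record with its PRIMATE post.

    Assumption: `primate_id` is a zero-based index into the PRIMATE list.
    """
    paired = []
    for record in labels:
        idx = record["primate_id"]
        if not 0 <= idx < len(posts):
            raise IndexError(f"primate_id {idx} is outside the PRIMATE file")
        # Copy the label record and add the matching post under "post".
        paired.append({**record, "post": posts[idx]})
    return paired
```

With the label records loaded as a list of dictionaries, attach_posts(labels, load_primate_posts()) would return each record together with its post under the "post" key.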

Language

  • Language: English

Dataset Structure

Data Instances

  • Example:
    {
      "primate_id": 1394,
      "answerable": 0,
      "mentioned": 1,
      "writer_symptom": 1,
      "quote": [
        [
          1537,
          1710
        ]
      ]
    }
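
The quote field appears to hold pairs of offsets marking the evidence spans in a post. Below is a minimal sketch for recovering the quoted text, assuming each pair is a [start, end] character offset into the post text with an exclusive end (as in Python slicing). The source does not state this, and extract_quotes is a hypothetical helper.

```python
def extract_quotes(text, spans):
    """Return the substrings of `text` marked by [start, end] offset pairs.

    Assumption: offsets are character positions with an exclusive end,
    as in Python slicing; the dataset card does not confirm this.
    """
    quotes = []
    for start, end in spans:
        if not 0 <= start <= end <= len(text):
            raise ValueError(f"span [{start}, {end}] does not fit the text")
        quotes.append(text[start:end])
    return quotes
```

If the extracted strings start or end mid‑word, the offsets probably follow a different convention (for example, an inclusive end or a title prepended to the text) and the slicing should be adjusted accordingly.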
    

Data Fields

  • Information pending

Data Splits

  • Information pending

Dataset Creation

Source Data

  • Source: Extracted Reddit posts from the original PRIMATE dataset.
  • Acquisition: Access to the posts requires agreeing to PRIMATE's terms and conditions and following its acquisition procedure.

Annotation

  • Process: Mental‑health professionals read all posts and label the presence of anhedonia symptoms.
  • Label Definitions:
    • "mentioned": Symptom is mentioned in the text but duration or intensity cannot be inferred.
    • "answerable": Clear evidence of anhedonia is present.
    • "writer_symptom": Indicates whether the symptom discussed is the author's own rather than a third party's.
  • Annotator: The paper's second author, a clinical‑psychology intern.
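
To see how the three flags combine across records, a small tally can help. This is a minimal sketch that assumes each record is a dictionary with the three binary fields shown in the data instance; count_flag_combinations is a hypothetical helper, and how the flags relate to one another (for example, whether "answerable" implies "mentioned") is not stated here.

```python
from collections import Counter

# Field names as they appear in the example data instance.
FLAGS = ("mentioned", "answerable", "writer_symptom")


def count_flag_combinations(records):
    """Count how often each (mentioned, answerable, writer_symptom) triple occurs."""
    return Counter(tuple(int(r[flag]) for flag in FLAGS) for r in records)
```

The keys of the returned Counter are triples in the order given by FLAGS, so the example instance would be counted under (1, 0, 1).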

Personal and Sensitive Information

  • Protection: No original posts are released; only annotation results are published.

Usage Considerations

Bias Discussion

  • Limitation: Manual annotation of Reddit posts is only a proxy for clinical evaluation of depressive symptoms.
  • Single Annotator: Only one mental‑health professional performed the annotation, so inter‑annotator agreement could not be measured.
  • Label Limitation: Binary labels may not suit cases where symptom presence/absence is ambiguous; a Likert scale is recommended.

Citation Information

  • Citation: If used in research, please cite the associated paper.