PsyDTCorpus

PsyDTCorpus is a high-quality multi-turn mental-health dialogue dataset created by a team at South China University of Technology. It is designed to simulate the personalized counseling style of a specific therapist. The dataset is built from 5,000 single-turn long-text counseling dialogues, each synthesized into a multi-turn conversation in a single pass with GPT-4 while modeling the client's Big Five personality traits. Real-world counseling cases are incorporated into the generation process to preserve complexity and diversity. The dataset's main application is psychological counseling: it provides personalized counseling styles to improve LLM performance in mental-health support, addressing the lack of personalization in existing models.

Source: arXiv
Created: Dec 18, 2024
Updated: Dec 18, 2024

Digital Twin of a Psychotherapist Dataset (PsyDTCorpus)

Dataset Overview

  • Dataset Name: PsyDTCorpus
  • Source: Based on real multi‑turn counseling cases of a specific therapist, synthesized via a digital‑twin data generation framework.
  • Scale:
    • Training set: 4,760 dialogues, 86,054 turns in total, about 18 turns per dialogue on average.
    • Test set: 240 dialogues, 4,311 turns in total, about 18 turns per dialogue on average.
  • Format: OpenAI chat format (each dialogue is a messages list of role/content entries).
  • Topic Distribution: The dataset covers various topics; the distribution is shown in the topic distribution chart.
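
The per-split averages can be checked directly from the dialogue and turn counts listed under Scale. The sketch below uses only those four figures; the variable names are illustrative.

```python
# Dialogue and turn counts as listed under "Scale".
splits = {
    "train": {"dialogues": 4760, "turns": 86054},
    "test": {"dialogues": 240, "turns": 4311},
}

# Average turns per dialogue for each split.
averages = {name: s["turns"] / s["dialogues"] for name, s in splits.items()}

# Combined number of dialogues across both splits.
total_dialogues = sum(s["dialogues"] for s in splits.values())

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} turns per dialogue")
print(f"total dialogues: {total_dialogues}")
```

Both averages round to 18, and the two splits together account for 5,000 dialogues, which matches the size of the single-turn counseling database.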

Data Generation Method

  • Framework: A small number of real counseling cases are combined with Big Five personality analysis and LLM summarization to generate multi-turn dialogues that reflect the therapist's language style and counseling techniques.
  • Generation Scale:
    • Single-turn counseling database size: 5,000 dialogues.
    • Number of cases from the specific therapist: 12 (typically no more than 20).

Dataset Download

  • Download Methods:
    1. Using git-lfs:
      cd <project_path>/data
      git lfs install
      git clone https://www.modelscope.cn/datasets/YIRONGCHEN/PsyDTCorpus.git
      
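    2. Using modelscope download (the glob is quoted so the shell does not expand it, and --local_dir points at the directory just created):
      cd <project_path>/data
      mkdir PsyDTCorpus
      modelscope download --dataset YIRONGCHEN/PsyDTCorpus --include '*' --local_dir ./PsyDTCorpus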
      

Sample Entry

{
    "id": 0,
    "normalizedTag": "婚恋",
    "messages": [
        {
            "role": "system",
            "content": "You are a psychotherapist proficient in Rational Emotive Behavior Therapy (REBT), capable of providing professional guidance and support to alleviate clients' negative emotions and behavioral responses, helping them achieve personal growth and mental health. REBT includes several stages, listed below with brief descriptions of each stage..."
        },
        ...
    ]
}
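
A minimal sketch of how records in this shape might be validated and summarized, assuming a data file is a JSON array of entries like the one above. The inline record is hypothetical: its message texts are placeholders, and the user and assistant roles are assumed from the OpenAI chat convention, since the sample shows only the system message.

```python
import json
from collections import Counter

# Hypothetical record mirroring the sample entry's fields; the message
# contents are placeholders, not text from the dataset.
raw = """
[
  {
    "id": 0,
    "normalizedTag": "婚恋",
    "messages": [
      {"role": "system", "content": "<system prompt>"},
      {"role": "user", "content": "<client message>"},
      {"role": "assistant", "content": "<therapist reply>"}
    ]
  }
]
"""

def validate(record):
    """Check the fields visible in the sample entry."""
    assert {"id", "normalizedTag", "messages"} <= record.keys()
    for message in record["messages"]:
        # Assumed role set, following the OpenAI chat convention.
        assert message["role"] in {"system", "user", "assistant"}
        assert isinstance(message["content"], str)

def role_counts(record):
    """Count messages per role in one dialogue record."""
    return Counter(message["role"] for message in record["messages"])

records = json.loads(raw)
for record in records:
    validate(record)

# Number of dialogues per topic tag, and per-role counts for the first one.
tag_counts = Counter(record["normalizedTag"] for record in records)
first_roles = role_counts(records[0])
print(tag_counts)
print(first_roles)
```

To run this on the downloaded data, replace `json.loads(raw)` with `json.load(...)` on an opened data file. The file names inside the repository are not listed here, so check the download directory first.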