GHPR Dataset

The GHPR dataset supports empirical research on, and evaluation of, software defect prediction. It is built from GitHub Pull Requests (PRs) and contains 3,026 defect‑fix records. Each fix is treated as one record, and each record contributes one defective and one non‑defective instance, yielding 6,052 learning instances (3,026 defective and 3,026 non‑defective). The dataset is provided in CSV and SQL formats. Each record has 16 features, including project name, project owner, project description, tags, programming language, pre‑ and post‑fix version IDs, defective code, commit description, commit time, pre‑ and post‑fix file contents, file‑path changes, and PR title and description.

Updated 4/29/2024
github

GHPR Dataset Overview

Dataset Description

  • Name: GHPR Dataset
  • Purpose: Empirical research and evaluation of software defect prediction
  • Record Count: 3,026 defect‑fix records based on GitHub Pull Requests (PRs)
  • Learning Instances: 6,052 total (3,026 defective, 3,026 non‑defective)

Data Format

  • File Formats: Two formats are provided
    • ghprdata.csv: Compatible with Python's NumPy or pandas
    • ghprdata.sql: UTF‑8 encoded, suitable for large‑scale databases
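
As a rough illustration of reading the CSV file, the sketch below uses only Python's standard `csv` module. Because records carry whole file contents, a single field may exceed the module's default field size limit, so the limit is raised first. The helper name `read_records` and the column names in the sample are hypothetical; the real header of ghprdata.csv is not listed here and should be inspected before relying on any column.

```python
import csv
import io
import sys


def read_records(handle):
    """Yield each CSV row as a dict keyed by the header row."""
    # Fields may hold entire source files, which can exceed the csv
    # module's default per-field limit, so raise it before parsing.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    yield from csv.DictReader(handle)


# Tiny in-memory stand-in for the real file; these column names are
# hypothetical and only illustrate the shape of a record.
sample = io.StringIO(
    "project,old_content,new_content\n"
    '"demo","int a = 1 / 0;","int a = 1;"\n'
)
records = list(read_records(sample))
print(len(records), sorted(records[0]))
```

For the real file, the same helper could be given a handle from `open("ghprdata.csv", newline="", encoding="utf-8")`. The UTF‑8 encoding is an assumption for the CSV; the page states it only for the SQL file.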

Data Features

  • Record Features: Each record contains 16 features, including project name, project owner, project description, project tags, programming language, pre‑ and post‑fix version IDs, defective code, commit description, commit time, pre‑ and post‑fix file contents, file‑path changes, and PR title and description.
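
The relation between 3,026 records and 6,052 learning instances can be sketched as follows. This is a minimal sketch under one assumption: the pre‑fix file content of a record is labeled defective and the post‑fix content non‑defective. The function name `to_instances` and the keys `old_content` and `new_content` are hypothetical, not the dataset's actual column names.

```python
DEFECTIVE, NON_DEFECTIVE = 1, 0


def to_instances(records, before_key, after_key):
    """Expand each fix record into two labeled instances."""
    for record in records:
        # Assumption: the pre-fix version is the defective instance.
        yield record[before_key], DEFECTIVE
        # Assumption: the post-fix version is the non-defective instance.
        yield record[after_key], NON_DEFECTIVE


# Two made-up records standing in for real dataset rows.
records = [
    {"old_content": "int a = 1 / 0;", "new_content": "int a = 1;"},
    {"old_content": "return null;", "new_content": "return value;"},
]
instances = list(to_instances(records, "old_content", "new_content"))
labels = [label for _, label in instances]
print(len(instances), labels.count(DEFECTIVE), labels.count(NON_DEFECTIVE))
# prints: 4 2 2
```

Applied to all 3,026 records, the same expansion would give 6,052 instances split evenly between the two labels, matching the counts reported for the dataset.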

Dataset Metrics

  • Static Metrics: 21 static metrics are calculated for the 6,052 instances, such as coupling, method complexity, inheritance depth, response for a class, and method cohesion. They are computed with the open‑source tool mauricioaniche/ck.

Citation Information

  • Citation Requirement: When using this dataset in publications, please cite the following paper:
    • Authors: Jiaxi Xu, Fei Wang, Jun Ai
    • Journal: IEEE Transactions on Reliability
    • Title: Defect Prediction With Semantics and Context Features of Codes Based on Graph Representation Learning
    • Year: 2021

Topics

Software Defect Prediction
GitHub Data Analysis

Source

Organization: github

Created: 10/1/2019
