Dataset asset | Open Source Community | Deep Learning | Source Code Vulnerability Detection

claudios/Draper

The Draper VDISC dataset is a source‑code vulnerability detection dataset containing 1.27 million functions mined from open‑source software, each annotated with potential vulnerabilities found by static analysis. The data are split into training, validation, and test sets in an 80:10:10 ratio and stored in HDF5 format. Each function's source code is stored as a variable‑length UTF‑8 string. Each function also carries five binary vulnerability labels: one for each of four common CWEs (CWE‑120, CWE‑119, CWE‑469, CWE‑476) and one for “other CWE”. The dataset is sponsored by the U.S. Air Force Research Laboratory as part of the DARPA MUSE program.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jan 5, 2024
Signals: 193 views
Availability: Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Information

  • Feature List:

    • functionSource: string type
    • CWE-119: boolean type
    • CWE-120: boolean type
    • CWE-469: boolean type
    • CWE-476: boolean type
    • CWE-other: boolean type
    • combine: integer type
  • Data Splits:

    • train: 832,092,463 bytes, 1,019,471 samples
    • validation: 104,260,416 bytes, 127,476 samples
    • test: 104,097,361 bytes, 127,419 samples
  • Data Size:

    • Download size: 535,360,739 bytes
    • Dataset size: 1,040,450,240 bytes
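
The split figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the dataset's tooling: the dictionary and function names are illustrative, and the numbers are copied from the lists above. It confirms that the sample counts add up to roughly 1.27 million in an approximately 80:10:10 ratio, and that the per-split byte counts sum to the stated dataset size.

```python
# Figures copied from the "Data Splits" and "Data Size" lists above.
SPLIT_SAMPLES = {"train": 1_019_471, "validation": 127_476, "test": 127_419}
SPLIT_BYTES = {"train": 832_092_463, "validation": 104_260_416, "test": 104_097_361}
DATASET_SIZE_BYTES = 1_040_450_240


def split_fractions(counts):
    """Return each split's share of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_samples = sum(SPLIT_SAMPLES.values())
fractions = split_fractions(SPLIT_SAMPLES)

print(f"total samples: {total_samples:,}")
for name, share in fractions.items():
    print(f"{name:>10}: {share:.1%}")

# The per-split byte counts should add up to the reported dataset size.
print("bytes consistent:", sum(SPLIT_BYTES.values()) == DATASET_SIZE_BYTES)
```

The shares are only approximately 80:10:10, since the sample counts do not divide evenly.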

Configuration

  • Default Paths:
    • train: data/train-*
    • validation: data/validation-*
    • test: data/test-*
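
As a rough illustration of how the splits and features listed above might be used, the sketch below loads one split with the Hugging Face `datasets` library and turns the five boolean CWE columns into a 0/1 label vector. It assumes the Hub repository id matches the title `claudios/Draper`. The helper names (`load_split`, `label_vector`, `is_flagged`) and the example row are hypothetical and written only for illustration; the row is not taken from the dataset. The `combine` column is left out because its meaning is not described here.

```python
# Column names as given in the feature list above.
LABEL_COLUMNS = ["CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other"]


def load_split(split="validation"):
    """Download one split from the Hugging Face Hub (needs network access).

    Assumes the repository id is "claudios/Draper"; requires `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily so the helpers below work offline

    return load_dataset("claudios/Draper", split=split)


def label_vector(example):
    """Turn one row's five boolean CWE flags into a list of 0/1 integers."""
    return [int(example[name]) for name in LABEL_COLUMNS]


def is_flagged(example):
    """True if any of the five CWE labels is set for this function."""
    return any(label_vector(example))


# Hypothetical row, hand-written to mirror the schema (not real dataset content).
row = {
    "functionSource": "void f(char *s) { char b[8]; strcpy(b, s); }",
    "CWE-119": True,
    "CWE-120": True,
    "CWE-469": False,
    "CWE-476": False,
    "CWE-other": False,
}

print(label_vector(row))
print(is_flagged(row))
```

With network access, `load_split("validation")` should return a dataset whose rows can be passed to `label_vector` directly; the validation split is the smaller download to try first.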

Task Category

  • Text Classification

Labels

  • Code