claudios/Draper

The Draper VDISC dataset is a source-code vulnerability detection dataset containing 1.27 million functions mined from open-source software, each labeled for potential vulnerabilities by static analysis. The data are split into training, validation, and test sets in an 80:10:10 ratio and stored in HDF5 format. Each function's source code is stored as a variable-length UTF-8 string and is accompanied by five binary vulnerability labels: four for common CWEs (CWE-120, CWE-119, CWE-469, CWE-476) and one for "other CWE". The dataset is sponsored by the U.S. Air Force Research Laboratory as part of the DARPA MUSE program.
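Because each function carries five independent binary labels, the task is naturally multi-label rather than single-class. The sketch below shows one way to turn a record into a 0/1 label vector. It assumes records are exposed as dictionaries keyed by the column names in the feature list; the helper names `label_vector` and `is_flagged` and the sample record are hypothetical and not part of the dataset.

```python
from typing import List, Mapping

# Label column names as they appear in the feature list of this card.
LABEL_COLUMNS = ["CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other"]


def label_vector(record: Mapping[str, object]) -> List[int]:
    """Return the five labels as a 0/1 vector in LABEL_COLUMNS order."""
    return [int(bool(record[name])) for name in LABEL_COLUMNS]


def is_flagged(record: Mapping[str, object]) -> bool:
    """True if static analysis flagged the function under any label."""
    return any(label_vector(record))


# Hypothetical record, made up for illustration; not taken from the dataset.
example = {
    "functionSource": "void f(char *s) { char buf[8]; strcpy(buf, s); }",
    "CWE-119": True,
    "CWE-120": True,
    "CWE-469": False,
    "CWE-476": False,
    "CWE-other": False,
}

print(label_vector(example))  # [1, 1, 0, 0, 0]
print(is_flagged(example))    # True
```

A fixed column order matters here: a model's output units and the label vector must agree on which position means which CWE.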

Updated 1/5/2024
Description

Dataset Overview

Dataset Information

  • Feature List:

    • functionSource: string type
    • CWE-119: boolean type
    • CWE-120: boolean type
    • CWE-469: boolean type
    • CWE-476: boolean type
    • CWE-other: boolean type
    • combine: integer type
  • Data Splits:

    • train: 832,092,463 bytes, 1,019,471 samples
    • validation: 104,260,416 bytes, 127,476 samples
    • test: 104,097,361 bytes, 127,419 samples
  • Data Size:

    • Download size: 535,360,739 bytes
    • Dataset size: 1,040,450,240 bytes
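The split figures above can be checked against the stated 80:10:10 ratio and the reported dataset size with a few lines of arithmetic. This is only a consistency check on the numbers printed in this card; the variable names are illustrative.

```python
# Split statistics copied from the "Data Splits" list above.
splits = {
    "train": {"bytes": 832_092_463, "samples": 1_019_471},
    "validation": {"bytes": 104_260_416, "samples": 127_476},
    "test": {"bytes": 104_097_361, "samples": 127_419},
}

total_samples = sum(s["samples"] for s in splits.values())
total_bytes = sum(s["bytes"] for s in splits.values())

# Share of each split as a percentage of all samples.
shares = {name: 100 * s["samples"] / total_samples for name, s in splits.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
print("total samples:", total_samples)
print("total bytes:", total_bytes)
```

The sample counts sum to 1,274,366 (the "1.27 million functions" in the description), the shares round to 80.0%, 10.0%, and 10.0%, and the byte counts sum to the reported dataset size of 1,040,450,240 bytes.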

Configuration

  • Default Paths:
    • train: data/train-*
    • validation: data/validation-*
    • test: data/test-*
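Given the split names above, a loader might look like the following sketch. It assumes the dataset is published on the Hugging Face Hub under the repository id `claudios/Draper` (the name shown at the top of this card) and that the `datasets` library is installed; neither is confirmed here. The wrapper `load_draper_split` is a hypothetical helper, not part of any library.

```python
VALID_SPLITS = ("train", "validation", "test")


def load_draper_split(split: str = "validation"):
    """Load one split of the dataset (assumed repo id: claudios/Draper)."""
    if split not in VALID_SPLITS:
        raise ValueError(
            f"unknown split {split!r}; expected one of {VALID_SPLITS}"
        )
    # Imported lazily so this module can be imported without `datasets`
    # installed; the call below needs network access on first use.
    from datasets import load_dataset

    return load_dataset("claudios/Draper", split=split)
```

Under those assumptions, `load_draper_split("test")` would return a dataset object whose rows expose `functionSource` and the label columns listed earlier. Starting with the validation or test split keeps the first download and iteration small compared with the roughly 832 MB training split.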

Task Category

  • Text Classification

Tags

  • Code

Topics

Source Code Vulnerability Detection
Deep Learning

Source

Organization: hugging_face

Created: Unknown
