claudios/Draper

The Draper VDISC dataset is a source-code vulnerability detection dataset containing 1.27 million functions mined from open-source software, each labeled by static analysis for potential vulnerabilities. The data are split into training, validation, and test sets in an 80:10:10 ratio and stored in HDF5 format. Each function's source code is stored as a variable-length UTF-8 string and carries five binary vulnerability labels: four for common CWEs (CWE-120, CWE-119, CWE-469, CWE-476) and one for “other CWE”. The dataset is sponsored by the U.S. Air Force Research Laboratory as part of the DARPA MUSE program.

Updated 1/5/2024

hugging_face

Description

Dataset Overview

Dataset Information

Feature List:
- functionSource: string type
- CWE-119: boolean type
- CWE-120: boolean type
- CWE-469: boolean type
- CWE-476: boolean type
- CWE-other: boolean type
- combine: integer type

Data Splits:
- train: 832,092,463 bytes, 1,019,471 samples
- validation: 104,260,416 bytes, 127,476 samples
- test: 104,097,361 bytes, 127,419 samples

Data Size:
- Download size: 535,360,739 bytes
- Dataset size: 1,040,450,240 bytes
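
The split and size figures above can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the listed numbers, not part of the dataset's tooling; the variable names are chosen here for illustration.

```python
# Figures copied from the split and size listings above.
SPLITS = {
    "train": {"bytes": 832_092_463, "samples": 1_019_471},
    "validation": {"bytes": 104_260_416, "samples": 127_476},
    "test": {"bytes": 104_097_361, "samples": 127_419},
}
DOWNLOAD_BYTES = 535_360_739
DATASET_BYTES = 1_040_450_240

# Totals across the three splits.
total_samples = sum(s["samples"] for s in SPLITS.values())
total_bytes = sum(s["bytes"] for s in SPLITS.values())

# Share of samples in each split, and download size relative to dataset size.
fractions = {name: s["samples"] / total_samples for name, s in SPLITS.items()}
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

for name, frac in fractions.items():
    print(f"{name:<10} {frac:.2%}")
print(f"total samples: {total_samples:,}")
print(f"split bytes sum to dataset size: {total_bytes == DATASET_BYTES}")
print(f"download / dataset size: {download_ratio:.1%}")
```

With the listed figures, the three splits add up to 1,274,366 samples, which agrees with the 1.27 million functions in the description, and the per-split byte counts add up exactly to the stated dataset size.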

Configuration

Default Paths:
- train: data/train-*
- validation: data/validation-*
- test: data/test-*
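
The default paths are glob patterns, so each split is made up of whichever files match its pattern. The following is a minimal sketch of that matching using Python's standard `fnmatch` module. The shard file names are hypothetical and only illustrate the idea; the actual file names in the repository are not listed here.

```python
from fnmatch import fnmatch

# Split patterns copied from the configuration above.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for(path):
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical file names, for illustration only.
example_files = [
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
    "data/validation-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

assignment = {path: split_for(path) for path in example_files}
for path, split in assignment.items():
    print(f"{path} -> {split}")
```

Files that match no pattern, such as the README in this example, belong to no split.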

Task Category

Text Classification
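
Because every function carries five independent boolean labels, the classification task is naturally multi-label. The sketch below shows one way to turn a record into a fixed-order label vector. The helper names and the example record are made up for illustration, and the `combine` column is left out because its encoding is not documented here.

```python
# Label columns as named in the feature list; the order is a choice made here.
LABEL_COLUMNS = ["CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other"]


def to_label_vector(record):
    """Map a record's boolean label columns to a list of 0/1 integers."""
    return [int(bool(record[name])) for name in LABEL_COLUMNS]


def is_flagged(record):
    """True if at least one of the five labels is set."""
    return any(record[name] for name in LABEL_COLUMNS)


# Made-up record for illustration; not taken from the dataset.
example = {
    "functionSource": "void f(char *s) { char buf[8]; strcpy(buf, s); }",
    "CWE-119": True,
    "CWE-120": True,
    "CWE-469": False,
    "CWE-476": False,
    "CWE-other": False,
}

print(to_label_vector(example))
print(is_flagged(example))
```

A model for this task would typically predict all five entries of the vector at once rather than a single class.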

Labels

Code

Topics

Source Code Vulnerability Detection

Deep Learning

Source

Organization: hugging_face

Created: Unknown
