claudios/Draper
The Draper VDISC dataset is a source‑code vulnerability detection dataset containing 1.27 million functions mined from open‑source software, each annotated with potential vulnerabilities found by static analysis. The data are split into training, validation, and test sets in an 80:10:10 ratio and stored in HDF5 format. Each function's source code is stored as a variable‑length UTF‑8 string and is accompanied by five binary vulnerability labels: four for common CWEs (CWE‑120, CWE‑119, CWE‑469, CWE‑476) and one for any “other CWE”. The dataset is sponsored by the U.S. Air Force Research Laboratory as part of the DARPA MUSE program.
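As a rough illustration of the record layout described above, the sketch below collapses the five binary labels of one record into a 0/1 vector. The field names follow the feature list of this card; the sample record, `LABEL_FIELDS`, `label_vector`, and `is_flagged` are hypothetical and only serve the example. The `combine` field is left out because the card does not say how it is encoded.

```python
# Label columns as named in the feature list (order chosen for this sketch).
LABEL_FIELDS = ("CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other")


def label_vector(record: dict) -> list:
    """Return the five binary labels as 0/1 integers, in LABEL_FIELDS order."""
    return [int(bool(record[name])) for name in LABEL_FIELDS]


def is_flagged(record: dict) -> bool:
    """True if the record carries at least one positive vulnerability label."""
    return any(record[name] for name in LABEL_FIELDS)


# Hypothetical record, invented for illustration (not taken from the dataset).
example = {
    "functionSource": "void copy(char *dst, const char *src) { strcpy(dst, src); }",
    "CWE-119": False,
    "CWE-120": True,
    "CWE-469": False,
    "CWE-476": False,
    "CWE-other": False,
}

print(label_vector(example))
print(is_flagged(example))
```

Because the labels come from static analysis, a positive entry is best read as a potential vulnerability rather than a confirmed one.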
Dataset Overview
Dataset Information
- Feature List:
  - functionSource: string
  - CWE-119: boolean
  - CWE-120: boolean
  - CWE-469: boolean
  - CWE-476: boolean
  - CWE-other: boolean
  - combine: integer
- Data Splits:
  - train: 832,092,463 bytes, 1,019,471 samples
  - validation: 104,260,416 bytes, 127,476 samples
  - test: 104,097,361 bytes, 127,419 samples
- Data Size:
  - Download size: 535,360,739 bytes
  - Dataset size: 1,040,450,240 bytes
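The split and size figures listed above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the numbers are copied from this card, while the variable names are illustrative.

```python
# Figures copied from the split and size listings of this card.
SPLITS = {
    "train": {"bytes": 832_092_463, "samples": 1_019_471},
    "validation": {"bytes": 104_260_416, "samples": 127_476},
    "test": {"bytes": 104_097_361, "samples": 127_419},
}
DOWNLOAD_BYTES = 535_360_739
DATASET_BYTES = 1_040_450_240

total_samples = sum(s["samples"] for s in SPLITS.values())
total_bytes = sum(s["bytes"] for s in SPLITS.values())

# Share of samples in each split; these should sit close to 0.8 / 0.1 / 0.1.
fractions = {name: s["samples"] / total_samples for name, s in SPLITS.items()}

# Ratio of the download size to the listed dataset size.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"total samples: {total_samples:,}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.4f}")
print(f"split bytes match dataset size: {total_bytes == DATASET_BYTES}")
print(f"download / dataset size: {download_ratio:.3f}")
```

The sample counts add up to 1,274,366 functions, which matches the stated 1.27 million, and the per-split byte counts add up exactly to the listed dataset size.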
Configuration
- Default Paths:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
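The default paths are glob patterns, so each split may span several shard files. The sketch below shows how such patterns would assign files to splits using the standard-library `fnmatch` module. The shard file names and the `.parquet` extension are assumptions made for the example, and the actual loader may resolve the patterns differently.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns as listed under the default configuration.
SPLIT_PATTERNS = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose glob pattern matches the path, or None."""
    for split, pattern in SPLIT_PATTERNS.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical shard names; the real file names are not given in this card.
files = [
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
    "data/validation-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

assignment = {path: split_for(path) for path in files}
for path, split in assignment.items():
    print(f"{path} -> {split}")
```

Files that match none of the patterns, such as the README in this example, are simply not assigned to any split.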
Task Category
- Text Classification
Labels
- Code