claudios/Draper
The Draper VDISC dataset is a source-code vulnerability detection dataset containing 1.27 million functions mined from open-source software, each annotated with potential vulnerabilities via static analysis. The data are split into training, validation, and test sets in an 80:10:10 ratio, stored in HDF5 format. Each function's source code is stored as a variable-length UTF-8 string, and each function carries five binary vulnerability labels: four for common CWEs (CWE-120, CWE-119, CWE-469, CWE-476) and one for "other CWE". The dataset is sponsored by the U.S. Air Force Research Laboratory as part of the DARPA MUSE program.
Description
Dataset Overview
Dataset Information
- Feature List:
  - functionSource: string
  - CWE-119: boolean
  - CWE-120: boolean
  - CWE-469: boolean
  - CWE-476: boolean
  - CWE-other: boolean
  - combine: integer
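As a minimal sketch of how the label columns in the feature list might be read, the helper below collects the names of the boolean columns that are set for one row. The helper names and the example row are hypothetical and made up for illustration; the `combine` column is left uninterpreted because this page does not define it.

```python
from typing import List, Mapping

# Boolean label columns, named as in the feature list.
LABEL_COLUMNS = ("CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other")


def active_labels(row: Mapping[str, object]) -> List[str]:
    """Return the label columns that are truthy for a single row."""
    return [name for name in LABEL_COLUMNS if bool(row.get(name, False))]


def is_flagged(row: Mapping[str, object]) -> bool:
    """True if at least one vulnerability label is set for the row."""
    return len(active_labels(row)) > 0


# Hypothetical row for illustration only; it is not taken from the dataset.
example_row = {
    "functionSource": "void f(char *s) { char buf[8]; strcpy(buf, s); }",
    "CWE-119": True,
    "CWE-120": True,
    "CWE-469": False,
    "CWE-476": False,
    "CWE-other": False,
}

print(active_labels(example_row))
print(is_flagged(example_row))
```

Because the five labels are independent flags, a function can carry several of them at once, so the task is closer to multi-label than single-label classification.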
- Data Splits:
  - train: 832,092,463 bytes, 1,019,471 samples
  - validation: 104,260,416 bytes, 127,476 samples
  - test: 104,097,361 bytes, 127,419 samples
- Data Size:
  - Download size: 535,360,739 bytes
  - Dataset size: 1,040,450,240 bytes
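The split and size figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed on this page; the variable names are illustrative.

```python
# Per-split figures copied from the Data Splits list.
SPLITS = {
    "train": {"bytes": 832_092_463, "samples": 1_019_471},
    "validation": {"bytes": 104_260_416, "samples": 127_476},
    "test": {"bytes": 104_097_361, "samples": 127_419},
}

# Dataset size as reported under Data Size.
REPORTED_DATASET_BYTES = 1_040_450_240

total_samples = sum(info["samples"] for info in SPLITS.values())
total_bytes = sum(info["bytes"] for info in SPLITS.values())

# Share of samples and mean bytes per sample for each split.
shares = {name: info["samples"] / total_samples for name, info in SPLITS.items()}
mean_bytes = {name: info["bytes"] / info["samples"] for name, info in SPLITS.items()}

for name in SPLITS:
    print(f"{name}: {shares[name]:.2%} of samples, "
          f"{mean_bytes[name]:.0f} bytes per sample on average")

print("total samples:", total_samples)
print("split bytes equal reported dataset size:",
      total_bytes == REPORTED_DATASET_BYTES)
```

The three splits sum to 1,274,366 samples, consistent with the "1.27 million functions" in the description, and the per-split byte counts add up exactly to the reported dataset size of 1,040,450,240 bytes.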
Configuration
- Default Paths:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
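A minimal loading sketch, assuming the dataset is hosted on the Hugging Face Hub under the repository id `claudios/Draper` shown in the page title and that the `datasets` library is installed. The function name `load_draper_split` and the `DEFAULT_PATHS` dictionary are illustrative; the dictionary only mirrors the default paths listed here and is used to validate the split name.

```python
REPO_ID = "claudios/Draper"  # repository id taken from the page title

# Mirrors the default paths in the configuration (illustrative only).
DEFAULT_PATHS = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def load_draper_split(split: str, streaming: bool = True):
    """Load one split; streaming avoids downloading every shard up front."""
    if split not in DEFAULT_PATHS:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(DEFAULT_PATHS)}"
        )
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)
```

Calling `load_draper_split("validation")` and then `next(iter(...))` on the result should yield a single row as a dictionary keyed by the feature names, provided the repository is reachable and its default configuration matches the paths above.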
Task Category
- Text Classification
Labels
- Code
Source
Organization: hugging_face
Created: Unknown