LorenzH/juliet_test_suite_c_1_3
# Dataset Card: Juliet Test Suite 1.3

## Dataset Overview
This dataset contains all test cases from NIST's Juliet Test Suite for the C and C++ programming languages. Each sample includes a good and a defective implementation, extracted using the Juliet suite's OMITGOOD and OMITBAD preprocessor macros.
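The macro-based extraction can be illustrated with a minimal sketch. In the Juliet suite, each variant in a source file is wrapped in an `#ifndef OMITGOOD` / `#ifndef OMITBAD` guard, so defining one macro drops that variant. The helper below is a simplified Python re-implementation of that filtering (the real extraction uses the C preprocessor, and this regex does not handle nested guards); the test-case content is invented for illustration.

```python
import re

# Simplified Juliet-style source: each variant sits inside a
# preprocessor guard, as in the real suite's files.
JULIET_SOURCE = """\
#ifndef OMITBAD
void CWE121_bad() {
    char buf[8];
    strcpy(buf, "this string is too long");  /* stack overflow */
}
#endif
#ifndef OMITGOOD
void CWE121_good() {
    char buf[32];
    strncpy(buf, "bounded copy", sizeof(buf) - 1);
}
#endif
"""

def strip_guarded(source: str, macro: str) -> str:
    """Remove every '#ifndef <macro> ... #endif' block, mimicking
    what compiling with -D<macro> does via the C preprocessor.
    Simplification: assumes guards are not nested."""
    pattern = re.compile(
        r"#ifndef %s\n.*?#endif\n" % re.escape(macro), re.DOTALL
    )
    return pattern.sub("", source)

good_only = strip_guarded(JULIET_SOURCE, "OMITBAD")   # keeps the good variant
bad_only = strip_guarded(JULIET_SOURCE, "OMITGOOD")   # keeps the bad variant
```

Defining `OMITBAD` yields the `good` field's source, and defining `OMITGOOD` yields the `bad` field's source.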
## Supported Tasks and Leaderboards
- Software defect prediction
- Code clone detection
## Languages
C and C++ programming languages
## Dataset Structure

### Data Instances

Each instance pairs a benign and a defective implementation of the same test case, along with its file path and CWE defect class.

### Data Fields
| Index | Name | Type | Description |
|---|---|---|---|
| 0 | index | int | Index of each sample in the dataset |
| 1 | filename | str | Path of the test case file, including the filename |
| 2 | class | int | Defect category: the CWE identifier assigned to the sample |
| 3 | good | str | Source code of the benign implementation |
| 4 | bad | str | Source code of the defective implementation |
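Because every record pairs a benign and a defective implementation, a common preparation step for defect prediction is to flatten each record into two labeled examples. A minimal sketch, assuming a record shaped like the field table above; the filename and code strings are hypothetical placeholders, not real Juliet content:

```python
# Hypothetical record following the field schema above.
record = {
    "index": 0,
    "filename": "testcases/CWE121_example_01.c",  # placeholder path
    "class": 121,
    "good": "void f(void) { /* bounds-checked copy */ }",
    "bad": "void f(void) { /* unchecked strcpy */ }",
}

def flatten(rec):
    """Turn one paired record into two (code, label) examples:
    label 1 marks the defective variant, label 0 the benign one."""
    return [
        {"code": rec["good"], "label": 0, "cwe": rec["class"]},
        {"code": rec["bad"], "label": 1, "cwe": rec["class"]},
    ]

examples = flatten(record)
```

The same pairing also serves code clone detection, since `good` and `bad` are near-identical implementations of the same test case.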
### Data Splits
| Type | Size |
|---|---|
| train | 80,706 cases |
| test | 20,177 cases |
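The stated sizes correspond to roughly an 80/20 train/test split, which can be checked directly:

```python
# Split sizes as listed in the table above.
train, test = 80_706, 20_177
total = train + test      # 100,883 cases overall
ratio = train / total     # approximately 0.80, i.e. an 80/20 split
```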
## Dataset Creation

### Source
https://samate.nist.gov/SARD/test-suites/112
## Usage Considerations

### Societal Impact

### Bias Discussion

### Other Known Limitations
The Juliet Test Suite is a synthetic dataset: all samples are handcrafted and therefore do not fully represent real-world software defects. Classifiers trained on these samples may perform substantially worse in real-world environments and produce severe misclassifications, potentially overlooking critical software defects.