Explore high-quality datasets for your AI and machine learning projects.
The dataset contains 5,000 phishing URLs and 5,000 legitimate URLs for training machine learning models to predict phishing websites. Features of URLs and site content such as domain, IP, URL length, etc., were extracted, resulting in a dataset with 18 features.
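As a rough illustration of the kind of URL features mentioned above (domain, IP, URL length), the following is a minimal sketch using only the Python standard library. The function names and the three features shown are assumptions for illustration; they are not the dataset's actual 18-feature extraction code.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(host: str) -> bool:
    """Return True when the host part is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative lexical features for one URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "domain": host,                  # host name taken from the URL
        "has_ip_host": host_is_ip(host), # raw IP instead of a domain name
        "url_length": len(url),          # total number of characters
    }


features = extract_url_features("http://192.168.0.1/login.php")
```

Content-based features (for example, those derived from the page's HTML) would need the fetched page and are outside this sketch.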
We have constructed a relatively large Falco alert dataset for Kubernetes, containing both normal and APT attack alerts to facilitate the training of attack prediction models and support future research. Attack alerts were generated by applying CALDERA, an adversary simulation platform developed by MITRE, to simulate attacks in a Kubernetes cluster using MITRE ATT&CK tactic sequences. Normal alerts were obtained from Falco's routine alerts generated in the absence of attacks. All alerts were labeled as 'attack' or 'normal'. The dataset comprises 231,000 alerts, including 2,314 attack alerts and 228,686 normal alerts.
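With the counts given above, attack alerts make up only about 1% of the data, so a model trained on it will likely need some form of class rebalancing. The sketch below computes inverse-frequency class weights from those counts using the common "balanced" heuristic (total / (number of classes × class count)). This weighting is an assumption for illustration, not something the dataset authors prescribe.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Alert counts as stated in the dataset description.
alert_counts = {"attack": 2_314, "normal": 228_686}

weights = balanced_class_weights(alert_counts)
attack_share = alert_counts["attack"] / sum(alert_counts.values())

print(f"attack share: {attack_share:.2%}")
print(f"weights: {weights}")
```

The resulting weights can be passed to most training APIs that accept per-class weights; the rare class receives a proportionally larger weight.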
The dataset comprises labeled network traffic data, encompassing various attacks (e.g., DoS, brute‑force, SQL injection, botnet) and normal traffic.
The NF‑CSE‑CIC‑IDS2018‑v2 dataset is a NetFlow version derived from the original CSE‑CIC‑IDS2018 pcaps, intended for network intrusion detection systems. It includes 18,893,708 flow records, of which 2,258,141 (11.95%) are attack samples and 16,635,567 (88.05%) are benign. The dataset is stratified by attack type and split into training (95%) and testing (5%) sets. Features include source/destination IPs, ports, protocol, byte/packet counts, flow duration, and many derived statistics.

## Dataset Structure

- **Classes**: Benign, BruteForce, Bot, DoS, DDoS, Infiltration, Web Attacks, etc.
- **Feature List**: Includes fields such as IPV4_SRC_ADDR, IPV4_DST_ADDR, L4_SRC_PORT, PROTOCOL, IN_BYTES, OUT_BYTES, FLOW_DURATION_MILLISECONDS, TCP_FLAGS, and many others.
- **Splits**: Train (≈17.9 M samples), Test (≈0.94 M samples).

The dataset is publicly available for academic research; commercial use requires author permission.
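The description says the data are stratified by attack type and split 95%/5%, but does not say how the split was produced. The following is a minimal sketch of one way to do a stratified split in plain Python. The function name, the seed, and the toy labels are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.05, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels only; these are not records from the dataset.
toy_labels = ["Benign"] * 880 + ["DoS"] * 80 + ["Bot"] * 40
train_idx, test_idx = stratified_split(toy_labels)
```

Because each class is sampled separately, the test set keeps roughly the same class proportions as the full data, which matters when attack classes are rare.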
The project uses the CSIC 2010 Dataset, a comprehensive collection of HTTP request logs that includes both normal and malicious traffic. It is designed for network intrusion detection research and contains various attack types such as SQL injection, buffer overflow, and directory traversal.
The Marine Regions dataset provides global oceanic and coastal boundaries, including marine areas, marine protected areas, and marine management zones. It offers geographic data in formats such as Shapefile, KML, and GeoJSON, suitable for GIS and marine science research.
This foundational dataset is a collection of question‑answer pairs focused on the cybersecurity domain, primarily concerning threat hunting, threat intelligence, and malware content. Its answers are concise, roughly 10% of the length of those in the main dataset. The Q‑A pairs are generated from 2023–2024 data and selected semi‑randomly. The (unreleased) main dataset is expected to contain about 75,000–80,000 Q‑A pairs on its launch day, covering data from 2020 to the present, with approximately 500 new pairs added weekly.
The dataset contains 500 phishing sites collected from PhishTank and 500 legitimate sites collected from Alexa. It is split 70% for training and 30% for testing.
This dataset is intended for training and testing malicious URL detectors. It contains multiple URLs together with detailed attributes such as domain name, registrar, registrar address, organization, Alexa traffic rank, etc.
The NSL‑KDD dataset is a benchmark for network intrusion detection, containing multiple attack types and normal traffic. It provides files in various formats, including ARFF and CSV, for training and testing.
NSL‑KDD is an improved dataset designed to address several inherent problems of the KDD99 dataset. It removes redundant records from the training set, eliminates duplicate records from the test set, and provides a moderate number of records, enabling consistent and comparable evaluation across different research works.
The Acti dataset, created by Beihang University, focuses on mining cybersecurity threat intelligence entities and their relations for autonomous driving vehicles. It contains 908 real automotive cybersecurity reports, comprising 3,678 sentences, 8,195 security entities, and 4,852 semantic relations. Data were collected from the National Vulnerability Database and specific automotive threat intelligence platforms, and annotated using a BIOES joint labeling scheme. The dataset is primarily used for modeling automotive cybersecurity threat intelligence, aiming to extract valuable information from large volumes of cybersecurity data for proactive defense.
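For readers unfamiliar with BIOES, the sketch below converts entity spans into per-token tags: B (begin), I (inside), E (end), S (single-token entity), and O (outside). It covers plain entity tagging only. How Acti encodes relations in its joint scheme is not described here, and the sentence and entity type names below are invented for illustration.

```python
def spans_to_bioes(tokens, spans):
    """Convert (start, end, type) entity spans to BIOES tags.

    `start` is inclusive and `end` is exclusive, both as token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        if end - start == 1:
            tags[start] = f"S-{entity_type}"      # single-token entity
        else:
            tags[start] = f"B-{entity_type}"      # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{entity_type}"      # interior tokens
            tags[end - 1] = f"E-{entity_type}"    # last token
    return tags


# Invented sentence and hypothetical entity types, for illustration only.
tokens = ["Attackers", "exploit", "the", "CAN", "bus", "gateway", "remotely"]
spans = [(0, 1, "ACTOR"), (3, 6, "COMPONENT")]
tags = spans_to_bioes(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```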
The CICIDS2017 dataset is used for cybersecurity tasks and contains several days of network traffic data for malicious traffic detection. The data have been read, cleaned, and merged, and a random‑forest model has been applied for classification.
CTFAIA is a benchmark dataset designed to evaluate next‑generation large language models on cybersecurity tasks, especially CTF competition problems. It contains over 100 non‑trivial challenges categorized into three difficulty levels based on required tool usage and logical reasoning. Each challenge has a public development split and a private test split.
The Cyber Security Attack Analysis project provides a dataset containing 25 different indicators and 40,000 records, aimed at helping cybersecurity professionals, researchers, and analysts understand trends and patterns in the cybersecurity domain.
The CSE‑CIC‑IDS2018 dataset contains network traffic data captured for brute‑force attack scenarios, specifically FTP and SSH attacks. It comprises 1,048,575 records with 80 traffic‑related features, including both benign traffic and brute‑force attack data.
The NF‑UNSW‑NB15‑v2 dataset is an extension of the UNSW‑NB15 dataset in NetFlow format, adding extra NetFlow features and labeling corresponding attack categories. It contains 2,390,275 flows, of which 95,053 (3.98%) are attack samples and 2,295,222 (96.02%) are benign. Attack samples are divided into nine sub‑categories: Fuzzers, Analysis, Backdoor, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. The dataset is primarily used for network‑traffic intrusion detection system research.
The dataset provides customized data for SQL injection jailbreak experiments, including harmful behavior samples, affirmative prefix data, and context‑learning prefix data. It is used to evaluate and improve large language models' defenses against malicious attacks.
This dataset is primarily for token‑classification tasks and includes three features: id (string), tokens (list of strings), and ner_tags (list of named‑entity labels). The ner_tags cover 11 categories to label different entity types such as indicators, malware, organizations, systems, and vulnerabilities. The dataset is split into training, testing, and validation subsets, each with different numbers of samples and byte sizes. The download size is 385,026 bytes and the total size is 1,873,973 bytes. It uses the default configuration with file paths for each split. The license is Apache 2.0.
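To make the record layout concrete, the sketch below builds one hypothetical record with the three fields described above (id, tokens, ner_tags) and checks that tokens and tags line up one-to-one. The example sentence and tag names are placeholders, not the dataset's actual labels, and the real dataset may store tags as integer class IDs rather than strings.

```python
def validate_record(record: dict) -> list:
    """Check the three fields and return aligned (token, tag) pairs."""
    if not isinstance(record["id"], str):
        raise TypeError("id must be a string")
    tokens, tags = record["tokens"], record["ner_tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and ner_tags must have the same length")
    return list(zip(tokens, tags))


# Hypothetical record; the tag names are placeholders, not the dataset's labels.
example = {
    "id": "0",
    "tokens": ["Emotet", "targets", "Windows", "hosts"],
    "ner_tags": ["B-Malware", "O", "B-System", "O"],
}
pairs = validate_record(example)
```

A check like this is a cheap guard before training, since a token-classification model expects exactly one label per token.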