HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD
The repository contains scripts for analyzing publicly available log datasets commonly used in anomaly detection (HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD). These datasets are used to evaluate sequence‑based anomaly detection techniques.
Description
Dataset Overview
Dataset Name
anomaly-detection-log-datasets
Dataset Content
This collection covers publicly available log datasets (HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD) used to evaluate sequence‑based anomaly detection techniques. The repository provides scripts that parse these logs and group them into event‑type sequences, along with implementations of basic anomaly detection methods.
Dataset Structure
The dataset directory contains some pre‑processed samples, split into three files per dataset: <dataset>_train (approximately 1% of the normal log sequences, used for training), <dataset>_test_normal (the remaining normal sequences, used for testing), and <dataset>_test_abnormal (all anomalous sequences).
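The split described above can be illustrated with a minimal sketch. It works on in-memory lists of event-type sequences rather than on the actual files, whose on-disk format is not described here; the function name split_sequences, the train_ratio and seed parameters, and the toy data are all hypothetical and are not taken from the repository.

```python
import random


def split_sequences(normal, abnormal, train_ratio=0.01, seed=0):
    """Split sequences into train, test_normal and test_abnormal lists.

    Hypothetical helper: only normal sequences are used for training,
    and every anomalous sequence ends up in the abnormal test set.
    """
    rng = random.Random(seed)
    shuffled = list(normal)
    rng.shuffle(shuffled)
    # Keep at least one training sequence when any normal data exists.
    n_train = max(1, round(len(shuffled) * train_ratio)) if shuffled else 0
    return {
        "train": shuffled[:n_train],
        "test_normal": shuffled[n_train:],
        "test_abnormal": list(abnormal),
    }


# Toy data: each sequence is a list of event-type identifiers.
normal = [[1, 2, 3]] * 990 + [[1, 2, 2, 3]] * 10
abnormal = [[1, 9, 3]] * 20

parts = split_sequences(normal, abnormal)
sizes = {name: len(seqs) for name, seqs in parts.items()}
print(sizes)
```

With 1,000 normal sequences and a ratio of 0.01, the sketch keeps 10 sequences for training and leaves 990 for the normal test set.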
Dataset Processing
Processing consists of two steps: parsing and sampling. Each dataset is parsed with its own <dataset>_parse.py script. Sampling is then performed with sample.py, which accepts sampling ratios and other parameters such as time windows.
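To make the role of a time window concrete, here is a minimal sketch of grouping parsed events into fixed-length windows. It assumes each parsed event is a (timestamp in seconds, event type) pair; the function group_by_time_window and its window_seconds parameter are hypothetical and do not reflect the actual interface of sample.py.

```python
from collections import defaultdict


def group_by_time_window(events, window_seconds):
    """Group (timestamp, event_type) pairs into per-window event sequences.

    Hypothetical helper: windows are aligned to multiples of
    window_seconds, and windows without any events are skipped.
    """
    windows = defaultdict(list)
    for timestamp, event_type in sorted(events, key=lambda e: e[0]):
        window_id = int(timestamp // window_seconds)
        windows[window_id].append(event_type)
    return [windows[w] for w in sorted(windows)]


# Toy events: timestamps in seconds, event types as strings.
events = [(0.5, "E1"), (12.0, "E2"), (59.9, "E1"), (60.0, "E3"), (185.2, "E1")]
sequences = group_by_time_window(events, window_seconds=60)
print(sequences)
```

Each resulting list is one event-type sequence. A larger window yields fewer, longer sequences, which is one reason the window size is worth exposing as a parameter.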
Anomaly Detection Technique Evaluation
The dataset supports evaluation of various anomaly detection methods, including those based on new event types, sequence length, event count vectors, n‑grams, edit distance, and inter‑event arrival times. In the reported results, count‑vector‑based detection achieved the highest F1 score, 95.76%, on the HDFS logs.
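Two of the listed ideas, detection by new event types and by unseen n‑grams, can be sketched in a few lines together with the F1 computation. This is an illustrative sketch under simple assumptions (a sequence is flagged as soon as it contains anything not seen in training), not the repository's implementation; all function names and the toy data are hypothetical.

```python
def train_new_event_detector(train):
    """Flag a sequence if it contains an event type unseen in training."""
    known = {event for seq in train for event in seq}
    return lambda seq: any(event not in known for event in seq)


def ngrams(seq, n):
    """Return the set of contiguous n-grams in a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def train_ngram_detector(train, n=2):
    """Flag a sequence if it contains an n-gram unseen in training."""
    known = set()
    for seq in train:
        known |= ngrams(seq, n)
    # Sequences shorter than n have no n-grams and are never flagged.
    return lambda seq: bool(ngrams(seq, n) - known)


def f1_score(detector, test_normal, test_abnormal):
    """Compute F1, treating anomalous sequences as the positive class."""
    tp = sum(1 for seq in test_abnormal if detector(seq))
    fn = len(test_abnormal) - tp
    fp = sum(1 for seq in test_normal if detector(seq))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data, not drawn from any of the real datasets.
train = [[1, 2, 3], [1, 2, 2, 3]]
test_normal = [[1, 2, 3], [1, 2, 2, 2, 3], [1, 3]]
test_abnormal = [[1, 4, 3], [3, 2, 1], [2, 2, 2]]

f1_new = f1_score(train_new_event_detector(train), test_normal, test_abnormal)
f1_ngram = f1_score(train_ngram_detector(train, n=2), test_normal, test_abnormal)
print(f"new event types: {f1_new:.3f}, 2-grams: {f1_ngram:.3f}")
```

On this toy data the new-event detector misses anomalies that only reorder known events, while the 2‑gram detector catches one of them at the cost of a false positive on a normal sequence with an unseen transition.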
Citation Information
If you use this dataset, please cite the following publication:
- Landauer, M., Skopik, F., & Wurzenberger, M. (2023): A Critical Review of Common Log Data Sets Used for Evaluation of Sequence‑based Anomaly Detection Techniques. arXiv:2309.02854.
Source
Organization: github
Created: 4/28/2023