HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD
The repository contains scripts for analyzing publicly available log datasets commonly used in anomaly detection (HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD). These datasets are used to evaluate sequence‑based anomaly detection techniques.
Dataset description and usage context
Dataset Overview
Dataset Name
anomaly-detection-log-datasets
Dataset Content
The collection covers publicly available log datasets (HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD) used for evaluating sequence‑based anomaly detection techniques. It provides scripts for parsing these logs and grouping them into sequences of event types, along with basic implementations of anomaly detection methods.
Dataset Structure
The dataset directory contains pre‑processed samples in three files per dataset: <dataset>_train (approximately 1% of the normal log sequences, used for training), <dataset>_test_normal (the remaining normal sequences, used for testing), and <dataset>_test_abnormal (all anomalous sequences).
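The three-way split described above can be sketched as follows. This is a minimal illustration, not the repository's sample.py: the function name split_sequences, the random shuffling, the seed, and the toy event identifiers are all assumptions made for the example.

```python
import random


def split_sequences(normal, abnormal, train_ratio=0.01, seed=0):
    """Split sequences into train / test_normal / test_abnormal.

    Hypothetical helper: only the ~1% training share and the three
    output groups come from the dataset description.
    """
    rng = random.Random(seed)
    shuffled = list(normal)
    rng.shuffle(shuffled)
    # Keep at least one training sequence even for tiny inputs.
    n_train = max(1, round(len(shuffled) * train_ratio))
    return {
        "train": shuffled[:n_train],          # small share of normal sequences
        "test_normal": shuffled[n_train:],    # remaining normal sequences
        "test_abnormal": list(abnormal),      # all anomalous sequences
    }


# Toy data: 200 normal and 5 anomalous event-type sequences.
normal = [["E1", "E2", "E3"] for _ in range(200)]
abnormal = [["E1", "E9"] for _ in range(5)]
splits = split_sequences(normal, abnormal)
```

Anomalous sequences never enter the training set in this layout, so any detector fitted on the train split models normal behavior only.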
Dataset Processing
Processing consists of two steps: parsing and sampling. Parsing uses the dataset‑specific <dataset>_parse.py script; sampling is performed with sample.py, which accepts the sampling ratio and other parameters such as time windows.
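To illustrate what grouping by time windows means, the sketch below buckets parsed events into fixed-length windows and emits one event-type sequence per window. It is an assumption-laden example rather than the logic of sample.py: the function name group_by_time_window, the (timestamp, event type) tuple layout, and the 60-second window are all hypothetical.

```python
from collections import defaultdict


def group_by_time_window(events, window_seconds):
    """Group (timestamp_seconds, event_type) pairs into fixed time windows.

    Hypothetical helper; returns one event-type sequence per non-empty
    window, ordered by window start time.
    """
    windows = defaultdict(list)
    # Sort by timestamp only, so events with equal timestamps keep input order.
    for ts, event_type in sorted(events, key=lambda e: e[0]):
        windows[int(ts // window_seconds)].append(event_type)
    return [windows[k] for k in sorted(windows)]


# Toy events: timestamps in seconds, event types as parsed template IDs.
events = [(0, "E1"), (5, "E2"), (61, "E1"), (62, "E3"), (130, "E2")]
sequences = group_by_time_window(events, window_seconds=60)
```

Datasets that carry a natural session identifier can be grouped by that identifier instead; time windows are the fallback when no such key exists.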
Anomaly Detection Technique Evaluation
The dataset supports evaluation of various anomaly detection methods, including those based on new event types, sequence length, event count vectors, n‑grams, edit distance, and inter‑event arrival times. In the reported results, count‑vector‑based detection achieved the highest F1 score on HDFS logs, at 95.76%.
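Three of the listed methods can be sketched in a few lines each. The code below is a simplified illustration under stated assumptions, not the repository's implementation: the function names, the normalized L1 distance, the 0.5 threshold, and the toy sequences are all hypothetical, and the count-vector detector assumes a non-empty training set.

```python
from collections import Counter


def new_event_detector(train):
    """Flag a sequence if it contains an event type unseen in training."""
    known = {e for seq in train for e in seq}
    return lambda seq: any(e not in known for e in seq)


def ngram_detector(train, n=2):
    """Flag a sequence if it contains an n-gram unseen in training."""
    def ngrams(seq):
        return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    known = set().union(*(ngrams(s) for s in train))
    return lambda seq: not ngrams(seq) <= known


def count_vector_detector(train, threshold=0.5):
    """Flag a sequence whose event counts are far from every training sequence."""
    train_counts = [Counter(s) for s in train]

    def distance(a, b):
        # Normalized L1 distance between two count vectors (assumed metric).
        diff = sum(abs(a[k] - b[k]) for k in set(a) | set(b))
        total = sum(a.values()) + sum(b.values())
        return diff / total if total else 0.0

    return lambda seq: min(distance(Counter(seq), t) for t in train_counts) > threshold


def f1_score(detector, test_normal, test_abnormal):
    """F1 score, treating anomalous sequences as the positive class."""
    tp = sum(detector(s) for s in test_abnormal)
    fn = len(test_abnormal) - tp
    fp = sum(detector(s) for s in test_normal)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy sequences for illustration only; not drawn from any of the datasets.
train = [["A", "B", "C"], ["A", "B", "B", "C"]]
test_normal = [["A", "B", "C"], ["A", "B", "B", "B", "C"]]
test_abnormal = [["A", "C", "B"], ["A", "B", "D"], ["C", "C", "C", "C"]]

f1_new = f1_score(new_event_detector(train), test_normal, test_abnormal)
f1_ngram = f1_score(ngram_detector(train), test_normal, test_abnormal)
f1_count = f1_score(count_vector_detector(train), test_normal, test_abnormal)
```

On this toy data the n-gram detector catches the reordered sequence that the other two miss, which shows why the methods are evaluated side by side; the scores say nothing about performance on the real datasets.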
Citation Information
If you use this dataset, please cite the following publication:
- Landauer, M., Skopik, F., & Wurzenberger, M. (2023). A Critical Review of Common Log Data Sets Used for Evaluation of Sequence‑based Anomaly Detection Techniques. arXiv:2309.02854.