Dataset asset | Open Source Community | Anomaly Detection | Information Technology

HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD

The repository contains scripts for analyzing publicly available log datasets commonly used in anomaly detection (HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD). These datasets are used to evaluate sequence‑based anomaly detection techniques.

Source: GitHub
Created: Apr 28, 2023
Updated: Apr 25, 2024
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

anomaly-detection-log-datasets

Dataset Content

The dataset includes publicly available log datasets (HDFS, BGL, OpenStack, Hadoop, Thunderbird, ADFA, AWSCTD) used for evaluating sequence‑based anomaly detection techniques. It provides scripts for parsing these logs and grouping them into event‑type sequences, along with basic anomaly detection method implementations.
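
As a rough illustration of what grouping parsed logs into event‑type sequences means, the sketch below collects event types per sequence identifier. It is a minimal example and not the repository's parsing code: the function name `group_into_sequences`, the `(sequence_id, event_type)` input format, and the sample identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def group_into_sequences(events):
    """Group (sequence_id, event_type) pairs into event-type sequences.

    Assumption: `events` is already in chronological order, so appending
    preserves the order of events within each sequence.
    """
    sequences = defaultdict(list)
    for sequence_id, event_type in events:
        sequences[sequence_id].append(event_type)
    return dict(sequences)


# Hypothetical parsed log lines: (sequence identifier, event type).
parsed = [
    ("blk_1", "E5"),
    ("blk_2", "E5"),
    ("blk_1", "E22"),
    ("blk_1", "E11"),
    ("blk_2", "E9"),
]

sequences = group_into_sequences(parsed)
for sequence_id, event_types in sequences.items():
    print(sequence_id, " ".join(event_types))
```

Which field serves as the sequence identifier (for example a block or session identifier, or a time window) depends on the dataset; the sketch leaves that choice to the caller.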

Dataset Structure

The dataset directory contains some pre‑processed samples, split into three files: <dataset>_train (approximately 1 % of the normal log sequences, used for training), <dataset>_test_normal (the remaining normal sequences, used for testing), and <dataset>_test_abnormal (all anomalous sequences).

Dataset Processing

Processing consists of two steps: parsing and sampling. Each dataset is parsed with its own <dataset>_parse.py script. Sampling is then performed with sample.py, which accepts parameters such as the sampling ratio and the time window.
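
The sampling step can be pictured as a split of labeled sequences into the three sets described under Dataset Structure. The sketch below is an assumption‑based illustration, not the logic of sample.py: the function `split_sequences`, its `train_ratio` and `seed` parameters, and the random shuffling are hypothetical choices.

```python
import random


def split_sequences(normal, abnormal, train_ratio=0.01, seed=42):
    """Split sequences into train, test_normal and test_abnormal sets.

    Assumption: training data is drawn at random from the normal
    sequences only; all anomalous sequences go to the abnormal test set.
    """
    rng = random.Random(seed)
    shuffled = list(normal)
    rng.shuffle(shuffled)
    # Keep at least one training sequence even for very small inputs.
    n_train = max(1, round(len(shuffled) * train_ratio))
    return {
        "train": shuffled[:n_train],
        "test_normal": shuffled[n_train:],
        "test_abnormal": list(abnormal),
    }


# Hypothetical input: 200 normal and 3 anomalous event-type sequences.
normal = [["E1", "E2", f"E{i % 5}"] for i in range(200)]
abnormal = [["E1", "E9"], ["E7"], ["E2", "E2", "E2"]]

split = split_sequences(normal, abnormal, train_ratio=0.01)
for name, seqs in split.items():
    print(name, len(seqs))
```

Fixing the seed keeps the split reproducible, which matters when comparing detection methods on the same sample.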

Anomaly Detection Technique Evaluation

The dataset supports evaluation of various anomaly detection methods, including those based on new event types, sequence length, event count vectors, n‑grams, edit distance, and inter‑event arrival times. In the reported results, detection based on event count vectors achieved the highest F1 score on HDFS logs, at 95.76 %.
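
To make some of these methods concrete, the following sketch implements simplified versions of four of them (new event types, sequence length, n‑grams, and event count vectors) and scores each with F1. These are illustrative approximations, not the repository's implementations: all function names, the normalized count distance, the 0.2 threshold, and the toy sequences are assumptions. Edit distance and inter‑event arrival times are not covered.

```python
from collections import Counter


def train_model(train_sequences, n=2):
    """Learn what "normal" looks like from training sequences."""
    events, ngrams, lengths, counts = set(), set(), [], []
    for seq in train_sequences:
        events.update(seq)
        lengths.append(len(seq))
        ngrams.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
        counts.append(Counter(seq))
    return {"events": events, "ngrams": ngrams, "n": n, "counts": counts,
            "min_len": min(lengths), "max_len": max(lengths)}


def detect_new_events(seq, model):
    """Anomalous if the sequence contains an event type unseen in training."""
    return any(event not in model["events"] for event in seq)


def detect_length(seq, model):
    """Anomalous if the length falls outside the range seen in training."""
    return not model["min_len"] <= len(seq) <= model["max_len"]


def detect_ngrams(seq, model):
    """Anomalous if any n-gram of the sequence was unseen in training."""
    n = model["n"]
    return any(tuple(seq[i:i + n]) not in model["ngrams"]
               for i in range(len(seq) - n + 1))


def count_distance(a, b):
    """Normalized L1 distance between two event count vectors (assumed metric)."""
    diff = sum(abs(a[key] - b[key]) for key in set(a) | set(b))
    total = sum(a.values()) + sum(b.values())
    return diff / total if total else 0.0


def detect_count_vector(seq, model, threshold=0.2):
    """Anomalous if no training count vector is within the threshold."""
    vector = Counter(seq)
    return min(count_distance(vector, c) for c in model["counts"]) > threshold


def f1_score(detector, model, test_normal, test_abnormal):
    """F1 with anomalies as the positive class."""
    tp = sum(detector(seq, model) for seq in test_abnormal)
    fn = len(test_abnormal) - tp
    fp = sum(detector(seq, model) for seq in test_normal)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy sequences, for illustration only.
train = [["A", "B", "C"], ["A", "B", "B", "C"]]
test_normal = [["A", "B", "C"], ["A", "B", "B", "B", "C"]]
test_abnormal = [["A", "D", "C"], ["C", "B", "A"], ["A"]]

model = train_model(train)
detectors = [detect_new_events, detect_length, detect_ngrams, detect_count_vector]
scores = {d.__name__: f1_score(d, model, test_normal, test_abnormal)
          for d in detectors}
for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

The toy data shows why the methods differ: a reordered sequence keeps the same event counts and length, so only the n‑gram check flags it, while a truncated sequence has no n‑grams to check and is caught by length or counts instead.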

Citation Information

If you use this dataset, please cite the following publication:

  • Landauer, M., Skopik, F., & Wurzenberger, M. (2023). A Critical Review of Common Log Data Sets Used for Evaluation of Sequence‑based Anomaly Detection Techniques. arXiv:2309.02854.