Explore high-quality datasets for your AI and machine learning projects.
UNSW‑NB15 is a comprehensive network intrusion detection dataset intended for academic research. It was released by Nour Moustafa and Jill Slay and has been compared statistically with the KDD99 dataset. The associated publications must be cited when the dataset is used.
The Elsevier OA CC‑By dataset is a corpus of 40,091 open‑access articles released under a CC‑BY license, covering multiple disciplines from Elsevier journals. The articles were published between 2014 and 2020 and are classified into 27 mid‑level ASJC codes. The dataset supports various NLP tasks such as fill‑mask, summarization, and text classification. It includes fields such as document ID, metadata, abstract, body text, bibliography entries, and author highlights. The corpus is intended to facilitate NLP and ML research by providing a large, multidisciplinary collection of full‑text articles.
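As a rough illustration of how the fields listed above might be represented in code, the following minimal sketch models one article as a typed record and derives a summarization pair from it. The key names (`doc_id`, `body_text`, `bib_entries`, `author_highlights`), the value types, and the helper `to_summarization_pair` are assumptions made for this example, not the dataset's confirmed schema, and the sample record is invented placeholder text.

```python
from typing import Any, TypedDict


class Article(TypedDict):
    # All key names and value types here are illustrative assumptions;
    # check the dataset's own documentation for the actual schema.
    doc_id: str
    metadata: dict[str, Any]
    abstract: str
    body_text: list[str]
    bib_entries: dict[str, Any]
    author_highlights: list[str]


def to_summarization_pair(article: Article) -> tuple[str, str]:
    """Hypothetical helper: body paragraphs become the source, the abstract the target."""
    source = "\n".join(article["body_text"])
    return source, article["abstract"]


# Placeholder record, invented for the example (not real dataset content).
example: Article = {
    "doc_id": "example-0001",
    "metadata": {"title": "Placeholder title"},
    "abstract": "Placeholder abstract.",
    "body_text": ["First paragraph.", "Second paragraph."],
    "bib_entries": {},
    "author_highlights": ["Placeholder highlight."],
}

source_text, target_text = to_summarization_pair(example)
```

Pairing the body text with the abstract is only one possible setup for the summarization task mentioned above; the author highlights could serve as an alternative target.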