orieg/elsevier-oa-cc-by

The Elsevier OA CC‑By dataset is a corpus of 40,091 open‑access articles released under a CC‑BY license, covering multiple disciplines from Elsevier journals. The articles were published between 2014 and 2020 and are classified into 27 mid‑level ASJC codes. The dataset supports various NLP tasks such as fill‑mask, summarization, and text classification. It includes fields such as document ID, metadata, abstract, body text, bibliography entries, and author highlights. The corpus is intended to facilitate NLP and ML research by providing a large, multidisciplinary collection of full‑text articles.

Updated 7/1/2022

hugging_face

Description

Dataset Overview

Dataset Name

Elsevier OA CC‑By

Dataset Summary

Elsevier OA CC‑By is a corpus of 40,091 open‑access CC‑BY articles published in Elsevier journals between 2014 and 2020, spanning 27 mid‑level ASJC subject codes. It provides full‑text articles to support NLP and ML research.

Language

English (en)

License

CC BY 4.0

Dataset Structure

Data Instances

Each instance includes the following fields:

docId: Unique document identifier.
metadata: Title, author list, ISSN, volume, page range, publication year, DOI, PMID, open‑access status, subject area, keywords, and ASJC code.
abstract: Author‑provided abstract.
body_text: Full text segmented into sentences.
bib_entries: Complete list of references with metadata.
author_highlights: Highlights provided by authors (covers 61.31% of articles).
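
As a minimal sketch of how one such instance might be handled, the snippet below builds a hypothetical record using the field names listed above and joins its sentence-segmented body text back into a single string. The record values are invented for illustration. The assumption that each body_text entry is a dict with a "sentence" key, and the keys shown inside metadata, are not confirmed by this card, so inspect a real instance before relying on them.

```python
from typing import Any

# Hypothetical record: top-level field names follow the list above,
# all values and nested key names are made up for illustration.
example: dict[str, Any] = {
    "docId": "S0000000000000000",
    "metadata": {"title": "An example article", "asjc": ["1700"]},
    "abstract": "A short abstract.",
    "body_text": [
        {"sentence": "First sentence."},   # assumed per-sentence layout
        {"sentence": "Second sentence."},
    ],
    "bib_entries": [],
    "author_highlights": [],  # empty here; only about 61% of articles have highlights
}


def join_body_text(record: dict[str, Any], key: str = "sentence") -> str:
    """Concatenate the sentence segments of body_text into one string."""
    return " ".join(seg[key] for seg in record["body_text"] if seg.get(key))


def has_highlights(record: dict[str, Any]) -> bool:
    """True when the record carries at least one author highlight."""
    return bool(record.get("author_highlights"))


full_text = join_body_text(example)
```

Checking has_highlights before use matters because author highlights are present for only part of the corpus.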

Data Fields

title: 100% coverage.
abstract: 99.25% coverage.
keywords: 100% coverage.
asjc: 100% coverage.
subjareas: 100% coverage.
body_text: 100% coverage.
author_highlights: 61.31% coverage.

Data Splits

train: 32,072 articles.
test: 4,009 articles.
validation: 4,008 articles.
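
The split sizes above correspond to roughly an 80/10/10 partition. The short sketch below checks that arithmetic using only the counts listed on this card. Note that the three counts sum to 40,089, two fewer than the 40,091 articles given in the summary; the card does not explain the difference.

```python
# Split sizes as listed on this card.
splits = {"train": 32_072, "test": 4_009, "validation": 4_008}

# Total number of articles across the three splits.
total = sum(splits.values())

# Share of each split as a percentage of the total, rounded to two decimals.
shares = {name: round(100 * count / total, 2) for name, count in splits.items()}

for name, pct in shares.items():
    print(f"{name}: {splits[name]} articles ({pct}%)")
print(f"total: {total}")
```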

Supported Tasks

fill-mask
summarization
text-classification
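
One way these tasks might map onto the fields, sketched under assumptions: a summarization example pairs the body text with the abstract, and a text-classification example pairs the abstract with the article's ASJC codes. The helper names, the "sentence" key inside body_text, and the "asjc" key inside metadata are hypothetical choices for this sketch, not a documented interface. Records without an abstract are skipped, since abstract coverage is 99.25% rather than 100%.

```python
from typing import Any, Optional


def to_summarization_pair(record: dict[str, Any]) -> Optional[tuple[str, str]]:
    """Return (source_text, target_summary), or None if the abstract is missing."""
    abstract = (record.get("abstract") or "").strip()
    if not abstract:
        return None
    # Assumed layout: body_text is a list of dicts with a "sentence" key.
    source = " ".join(seg["sentence"] for seg in record.get("body_text", []))
    return source, abstract


def to_classification_example(record: dict[str, Any]) -> Optional[tuple[str, list[str]]]:
    """Return (abstract, asjc_codes), or None if the abstract is missing."""
    abstract = (record.get("abstract") or "").strip()
    if not abstract:
        return None
    # Assumed layout: ASJC codes live under metadata["asjc"] as a list.
    return abstract, list(record["metadata"]["asjc"])


# Hypothetical record with made-up values, for illustration only.
record = {
    "abstract": "We study an example problem.",
    "body_text": [{"sentence": "Intro text."}, {"sentence": "Method text."}],
    "metadata": {"asjc": ["1700", "2200"]},
}

pair = to_summarization_pair(record)
labelled = to_classification_example(record)
```

Author highlights could serve as an alternative summarization target for the subset of articles that include them.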

Dataset Creation

Source Data

Initial collection and normalization: Data gathered on 25 June 2020.
Source language producers: See the original paper for details.

Annotation

Process: Details to be added.
Annotators: Details to be added.

Considerations for Use

Social impact: Details to be added.
Bias discussion: Details to be added.
Other known limitations: Details to be added.

Additional Information

Dataset curator: Details to be added.
License information: CC BY 4.0.
Citation: Refer to the provided citation format.
Contributors: Thanks to @orieg for adding this dataset.

Topics

Academic Research

Text Mining

Source

Organization: hugging_face

Created: Unknown
