Dataset asset: Open Source Community, Text Mining, Academic Research

orieg/elsevier-oa-cc-by

The Elsevier OA CC‑By dataset is a corpus of 40,091 open‑access articles released under a CC‑BY license, covering multiple disciplines from Elsevier journals. The articles were published between 2014 and 2020 and are classified into 27 mid‑level ASJC codes. The dataset supports various NLP tasks such as fill‑mask, summarization, and text classification. It includes fields such as document ID, metadata, abstract, body text, bibliography entries, and author highlights. The corpus is intended to facilitate NLP and ML research by providing a large, multidisciplinary collection of full‑text articles.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jul 1, 2022

Dataset Overview

Dataset Name

Elsevier OA CC‑By

Dataset Summary

Elsevier OA CC‑By is a corpus of 40,091 open‑access CC‑BY articles published between 2014 and 2020, spanning 27 mid‑level ASJC subject codes. It provides full‑text articles to support NLP and ML research.

Language

English (en)

License

CC BY 4.0

Dataset Structure

Data Instances

Each instance includes the following fields:

  • docId: Unique document identifier.
  • metadata: Title, author list, ISSN, volume, page range, publication year, DOI, PMID, open‑access status, subject area, keywords, and ASJC code.
  • abstract: Author‑provided abstract.
  • body_text: Full text segmented into sentences.
  • bib_entries: Complete list of references with metadata.
  • author_highlights: Highlights provided by authors (covers 61.31% of articles).
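
As a rough illustration of the record layout listed above, the sketch below types one instance and rebuilds the body from its sentences. The field names come from the list; the value types (plain strings per sentence, a list of reference dictionaries, a list of highlight strings), the `Article` class, the `full_text` helper, and the sample record are all assumptions made for illustration, not the dataset's documented schema.

```python
from typing import Any, Dict, List, TypedDict

# Assumption: with the Hugging Face `datasets` library installed, the corpus
# could be loaded by the repository id shown at the top of this page, e.g.
#   from datasets import load_dataset
#   corpus = load_dataset("orieg/elsevier-oa-cc-by")
# The code below does not need network access and uses a made-up record.


class Article(TypedDict, total=False):
    """Hypothetical typing of one record; only the field names are from the source."""

    docId: str
    metadata: Dict[str, Any]
    abstract: str
    body_text: List[str]               # assumed: one string per sentence
    bib_entries: List[Dict[str, Any]]  # assumed: one dict per reference
    author_highlights: List[str]       # assumed: one string per highlight


def full_text(article: Article) -> str:
    """Join the sentence-segmented body back into a single string."""
    return " ".join(article.get("body_text", []))


# Made-up example record; it omits author_highlights, as some articles do.
sample: Article = {
    "docId": "example-0001",
    "metadata": {"title": "An example title", "asjc": ["1700"]},
    "abstract": "A short abstract.",
    "body_text": ["First sentence.", "Second sentence."],
    "bib_entries": [],
}

print(full_text(sample))
```

Because `author_highlights` is absent for a sizeable share of articles, code that reads it should use `.get(...)` or an explicit membership check rather than direct indexing.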

Data Fields

  • title: 100% coverage.
  • abstract: 99.25% coverage.
  • keywords: 100% coverage.
  • asjc: 100% coverage.
  • subjareas: 100% coverage.
  • body_text: 100% coverage.
  • author_highlights: 61.31% coverage.
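
Coverage figures like those above could be recomputed from the records themselves. The following is a minimal sketch, assuming each record is a dictionary and that a field counts as covered when its value is present and non-empty; the `field_coverage` helper and the toy records are hypothetical. It only inspects top-level keys, so fields that live under `metadata` (such as title or keywords) would need to be looked up there instead.

```python
from typing import Any, Dict, Iterable, List


def field_coverage(
    records: Iterable[Dict[str, Any]], fields: List[str]
) -> Dict[str, float]:
    """Return, per field, the percentage of records with a non-empty value."""
    counts = {name: 0 for name in fields}
    total = 0
    for record in records:
        total += 1
        for name in fields:
            # Missing keys, None, "" and [] are all treated as "not covered".
            if record.get(name):
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in fields}
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# Toy records for illustration only; these are not taken from the corpus.
toy_records = [
    {"abstract": "A.", "author_highlights": ["h1"]},
    {"abstract": "B.", "author_highlights": []},
    {"abstract": "C."},
    {"abstract": "", "author_highlights": ["h2"]},
]

coverage = field_coverage(toy_records, ["abstract", "author_highlights"])
print(coverage)
```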

Data Splits

  • train: 32,072 articles.
  • test: 4,009 articles.
  • validation: 4,008 articles.
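
A quick arithmetic check of the split sizes is sketched below. The three counts are the ones listed above; the variable names are illustrative. The splits sum to 40,089, which is two fewer than the 40,091 articles quoted in the summary (the source does not explain the difference), and they correspond to roughly an 80/10/10 partition.

```python
# Split sizes as listed above.
splits = {"train": 32_072, "test": 4_009, "validation": 4_008}

# Article count quoted in the dataset summary.
stated_total = 40_091

split_total = sum(splits.values())
shares = {name: 100.0 * n / split_total for name, n in splits.items()}

print("sum of splits:", split_total)
print("difference from stated total:", stated_total - split_total)
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```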

Supported Tasks

  • fill-mask
  • summarization
  • text-classification
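
One way the listed fields might be turned into training pairs for two of these tasks is sketched below. The pairing choices are assumptions, not something the source prescribes: author highlights are used as the summarization target, and the first ASJC code (assumed to be stored as a list under `metadata`) is used as the classification label. The helper names and toy records are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Record = Dict[str, Any]


def summarization_pairs(records: Iterable[Record]) -> Iterator[Tuple[str, str]]:
    """Yield (body text, highlights) pairs, skipping records that lack either."""
    for record in records:
        body = record.get("body_text") or []
        highlights = record.get("author_highlights") or []
        if body and highlights:
            yield " ".join(body), " ".join(highlights)


def classification_pairs(records: Iterable[Record]) -> Iterator[Tuple[str, str]]:
    """Yield (abstract, first ASJC code) pairs, skipping incomplete records."""
    for record in records:
        abstract = record.get("abstract")
        codes = (record.get("metadata") or {}).get("asjc") or []
        if abstract and codes:
            yield abstract, str(codes[0])


# Toy records for illustration only; these are not taken from the corpus.
toy_records = [
    {
        "abstract": "We study X.",
        "body_text": ["X is studied.", "Results follow."],
        "author_highlights": ["X works."],
        "metadata": {"asjc": ["1700"]},
    },
    {
        "abstract": "We study Y.",
        "body_text": ["Y is studied."],
        "metadata": {"asjc": ["2700", "1300"]},
    },
]

summ = list(summarization_pairs(toy_records))
clf = list(classification_pairs(toy_records))
print(len(summ), "summarization pair(s);", len(clf), "classification pair(s)")
```

Only the first toy record yields a summarization pair, which mirrors the fact that highlights are available for a subset of articles.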

Dataset Creation

Source Data

  • Initial collection and normalization: Data gathered on 25 June 2020.
  • Source language producers: See the original paper for details.

Annotation

  • Process: Details to be added.
  • Annotators: Details to be added.

Considerations for Use

  • Social impact: Details to be added.
  • Bias discussion: Details to be added.
  • Other known limitations: Details to be added.

Additional Information

  • Dataset curator: Details to be added.
  • License information: CC BY 4.0.
  • Citation: Refer to the provided citation format.
  • Contributors: Thanks to @orieg for adding this dataset.