orieg/elsevier-oa-cc-by
The Elsevier OA CC‑By dataset is a corpus of 40,091 open‑access articles from Elsevier journals, released under a CC‑BY license and spanning multiple disciplines. The articles were published between 2014 and 2020 and are categorized under 27 mid‑level ASJC (All Science Journal Classification) codes. Each record contains a document ID, metadata, the abstract, the body text, bibliography entries, and author highlights. The corpus supports NLP tasks such as fill‑mask, summarization, and text classification, and is intended to give NLP and ML researchers a large, multidisciplinary collection of full‑text articles.
Description
Dataset Overview
Dataset Name
Elsevier OA CC‑By
Dataset Summary
Elsevier OA CC‑By is a corpus of 40,091 open‑access CC‑BY articles published between 2014 and 2020, spanning 27 mid‑level ASJC subject codes. It provides the full text of each article for use in NLP and ML research.
Language
English (en)
License
CC BY 4.0
Dataset Structure
Data Instances
Each instance includes the following fields:
- docId: Unique document identifier.
- metadata: Title, author list, ISSN, volume, page range, publication year, DOI, PMID, open‑access status, subject area, keywords, and ASJC code.
- abstract: Author‑provided abstract.
- body_text: Full text segmented into sentences.
- bib_entries: Complete list of references with metadata.
- author_highlights: Highlights provided by authors (covers 61.31% of articles).
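The field list above can be made concrete with a small sketch. It assumes the dataset is loaded through the Hugging Face `datasets` library under the repository ID shown at the top of this card; whether that call works as written may depend on the installed library version. It also assumes each `body_text` item is either a plain string or a dict with a `sentence` key. The helper names and the example record are hypothetical and invented for illustration.

```python
from typing import Any


def load_split(split: str = "train"):
    """Load one split from the Hub (requires the `datasets` package and network access)."""
    # Imported lazily so the helpers below still work without the library installed.
    from datasets import load_dataset
    return load_dataset("orieg/elsevier-oa-cc-by", split=split)


def sentence_text(item: Any) -> str:
    """Return the text of one body_text item.

    Assumption: an item is either a plain string or a dict with a 'sentence' key.
    """
    if isinstance(item, dict):
        return str(item.get("sentence", ""))
    return str(item)


def full_text(record: dict) -> str:
    """Join the sentence-segmented body_text back into a single string."""
    sentences = (sentence_text(item) for item in record.get("body_text", []))
    return " ".join(s for s in sentences if s)


# Hypothetical record, invented for illustration only.
example = {
    "docId": "doc-0001",
    "metadata": {"title": "An example title", "asjc": ["1700"]},
    "abstract": "A short abstract.",
    "body_text": [{"sentence": "First sentence."}, {"sentence": "Second sentence."}],
    "author_highlights": [],
}

print(full_text(example))
```

Joining the sentences is only one option; keeping them as a list is often more convenient for sentence-level models.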
Data Fields
Share of articles in which each field is populated:
- title: 100% coverage.
- abstract: 99.25% coverage.
- keywords: 100% coverage.
- asjc: 100% coverage.
- subjareas: 100% coverage.
- body_text: 100% coverage.
- author_highlights: 61.31% coverage.
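Coverage figures like those above can be recomputed from the records themselves. The sketch below assumes that "coverage" means the share of records in which a field is present and non-empty; the card does not state the exact definition, and the toy records are invented for illustration.

```python
def coverage(records, field):
    """Percentage of records whose `field` is present and non-empty (assumed definition)."""
    records = list(records)
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field))
    return 100.0 * filled / len(records)


# Toy records, invented for illustration only.
toy = [
    {"abstract": "text", "author_highlights": ["point one"]},
    {"abstract": "text", "author_highlights": []},
    {"abstract": "", "author_highlights": ["point two"]},
    {"abstract": "text", "author_highlights": []},
]

for name in ("abstract", "author_highlights"):
    print(f"{name}: {coverage(toy, name):.2f}%")
```

Fields nested under `metadata` (for example the title or keywords) would need the lookup adapted to the nested dict.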
Data Splits
- train: 32,072 articles.
- test: 4,009 articles.
- validation: 4,008 articles.
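As a quick arithmetic check, the split sizes listed above correspond to roughly an 80/10/10 partition. They add up to 40,089 articles, two fewer than the 40,091 stated in the summary; this card does not explain the difference. The sketch below only reproduces that arithmetic from the numbers given here.

```python
# Split sizes exactly as listed on this card.
splits = {"train": 32_072, "test": 4_009, "validation": 4_008}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {splits[name]:,} articles ({share:.1%})")
print(f"total: {total:,} articles")
```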
Supported Tasks
- fill-mask
- summarization
- text-classification
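One way the fields might map onto these three tasks is sketched below. The pairings (body text to abstract for summarization, abstract to ASJC codes for classification, single-word masking for fill-mask) are assumptions made for illustration, not a recipe prescribed by the dataset. The function names, the `[MASK]` token, and the example record are hypothetical.

```python
def summarization_pair(record):
    """Return (source, target): body sentences as input, the abstract as reference summary."""
    # Assumption: each body_text item is a string or a dict with a 'sentence' key.
    sentences = [s["sentence"] if isinstance(s, dict) else s for s in record["body_text"]]
    return " ".join(sentences), record["abstract"]


def classification_example(record):
    """Return (text, labels): the abstract as input, ASJC codes from metadata as labels."""
    return record["abstract"], list(record["metadata"].get("asjc", []))


def mask_word(text, position, mask_token="[MASK]"):
    """Replace the whitespace-delimited word at `position`; return (masked_text, original_word)."""
    words = text.split()
    original = words[position]
    words[position] = mask_token
    return " ".join(words), original


# Hypothetical record, invented for illustration only.
record = {
    "abstract": "Graphene improves battery capacity.",
    "body_text": ["We test graphene anodes.", "Capacity increases."],
    "metadata": {"asjc": ["2500"]},
}

print(summarization_pair(record))
print(classification_example(record))
print(mask_word(record["abstract"], 1))
```

Since author highlights cover only part of the corpus, using them as summarization targets instead of the abstract would shrink the usable set accordingly.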
Dataset Creation
Source Data
- Initial collection and normalization: Data gathered on 25 June 2020.
- Source language producers: See the original paper for details.
Annotation
- Process: Details to be added.
- Annotators: Details to be added.
Considerations for Use
- Social impact: Details to be added.
- Bias discussion: Details to be added.
- Other known limitations: Details to be added.
Additional Information
- Dataset curator: Details to be added.
- License information: CC BY 4.0.
- Citation: Refer to the provided citation format.
- Contributors: Thanks to @orieg for adding this dataset.
Source
Organization: hugging_face