MASSW is a text dataset that summarizes multiple aspects of scientific workflows. It covers more than 152,000 peer-reviewed publications from 17 leading computer science conferences over the past 50 years. The dataset defines five core aspects of a scientific workflow (context, key idea, method, outcome, and projected impact) and uses large language models (LLMs) to extract and structure these aspects from each publication. The accuracy of the extracted summaries has been checked against human annotations and alternative methods. MASSW supports benchmarkable machine learning tasks such as idea generation and outcome prediction, and it serves as a benchmark for evaluating LLM agents in scientific research.
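To make the five-aspect structure concrete, here is a minimal Python sketch of how one record and the two example tasks might be represented. The class name `WorkflowRecord`, the field names, the helper `make_task`, and the choice of input aspects for each task are all assumptions made for illustration; they are not the dataset's actual schema or official task definitions.

```python
from dataclasses import dataclass, asdict

# Hypothetical field names that mirror the five aspects named in the
# description; the real dataset may label and store them differently.
ASPECTS = ("context", "key_idea", "method", "outcome", "projected_impact")


@dataclass
class WorkflowRecord:
    """One publication summarized along the five workflow aspects."""
    title: str
    context: str
    key_idea: str
    method: str
    outcome: str
    projected_impact: str


def make_task(record, inputs, target):
    """Split a record into (input aspects, target text) for a prediction task."""
    for name in (*inputs, target):
        if name not in ASPECTS:
            raise ValueError(f"unknown aspect: {name}")
    if target in inputs:
        raise ValueError("target aspect must not be among the inputs")
    fields = asdict(record)
    return {name: fields[name] for name in inputs}, fields[target]


# Placeholder text only; this is not real data from the dataset.
example = WorkflowRecord(
    title="<paper title>",
    context="<background and open problem>",
    key_idea="<main insight>",
    method="<how the idea was validated>",
    outcome="<what was found>",
    projected_impact="<anticipated influence>",
)

# Idea generation (assumed framing): predict the key idea from the context.
idea_inputs, idea_target = make_task(example, ["context"], "key_idea")

# Outcome prediction (assumed framing): predict the outcome from the
# context, key idea, and method.
outcome_inputs, outcome_target = make_task(
    example, ["context", "key_idea", "method"], "outcome"
)
```

Keeping the aspects as separate fields lets each task be expressed as a choice of input and target aspects, so other tasks can be defined without changing the record layout.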