High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

FinMR

FinMR is a dataset designed for financial reasoning and question answering. It comprises questions, financial background information, and corresponding answers. The dataset combines text and visual data, with images referenced from JSON files. Each record includes a unique identifier, shared background, shared image path, question text, multiple‑choice options, correct answer, and detailed explanation. Financial experts perform the annotations to ensure accuracy and consistency. The dataset may carry biases inherited from its source financial documents; users should be cautious when generalizing model outputs and should consider domain‑specific adaptation.

hugging_face
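
As a rough sketch of the record structure described above, a single entry could be modeled as follows. The field names (`id`, `shared_background`, `shared_image_path`, and so on) are hypothetical, chosen only to mirror the description; the keys in the published files may differ, and the sample entry is made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class FinMRRecord:
    # Hypothetical field names mirroring the description; not the official schema.
    id: str
    shared_background: str
    shared_image_path: str
    question: str
    options: list[str]
    answer: str
    explanation: str


def parse_record(raw: str) -> FinMRRecord:
    """Parse one JSON object and check that the answer is one of the options."""
    record = FinMRRecord(**json.loads(raw))
    if record.answer not in record.options:
        raise ValueError(f"answer {record.answer!r} is not among the options")
    return record


# Made-up example entry, for illustration only.
sample = json.dumps({
    "id": "example-001",
    "shared_background": "A company reports revenue of 120 and costs of 90.",
    "shared_image_path": "images/example-001.png",
    "question": "What is the company's profit?",
    "options": ["20", "30", "40", "50"],
    "answer": "30",
    "explanation": "Profit is revenue minus costs: 120 - 90 = 30.",
})

record = parse_record(sample)
print(record.question, "->", record.answer)
```

Validating that the answer appears among the options is a cheap sanity check worth running before using multiple-choice records for training or evaluation.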

DGraph

Finance

Dynamic Graph Data

DGraph is a collection of large‑scale dynamic graph datasets composed of events and labels that evolve over time in real financial scenarios.

github
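
To illustrate what "events that evolve over time" can look like in practice, here is a minimal sketch of a dynamic graph stored as timestamped edge events. The `EdgeEvent` schema and the `snapshot` helper are assumptions made for this example and are not taken from the DGraph files; labels are left out to keep the sketch short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeEvent:
    # Hypothetical schema: a directed interaction observed at a point in time.
    src: int
    dst: int
    timestamp: int


def snapshot(events, until):
    """Return adjacency lists built from all events with timestamp <= until."""
    adjacency = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp > until:
            break
        adjacency[event.src].append(event.dst)
    return dict(adjacency)


# Made-up events, for illustration only.
events = [EdgeEvent(0, 1, 10), EdgeEvent(1, 2, 20), EdgeEvent(0, 2, 30)]

early = snapshot(events, until=20)
late = snapshot(events, until=30)
print(early)
print(late)
```

Building snapshots from the event list, rather than storing one static graph, makes it possible to train on what was known at a given time without leaking later events.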

FinancialDatasets

Finance

NLP

The SmoothNLP Financial Text Dataset comprises multiple sub‑datasets covering corporate business information, financial news, column articles, investment institution data, investment events, and 36Kr news, suitable for NLP research.

github

FinLang/investopedia-embedding-dataset

Finance

Language Model Training

This dataset consists of financial data collected from the Investopedia website and transformed from unstructured to structured format, making it suitable for fine‑tuning embedding models. The generation process employs a self‑verification method to check that the generated question‑answer pairs are not hallucinated by LLMs. Each data point contains four fields: Topic, Title, Question, and Answer. The dataset is in English and released under the CC‑BY‑NC‑4.0 license.

hugging_face
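
The four fields listed above map naturally onto (query, passage) pairs for embedding fine‑tuning. The sketch below assumes each row is a dictionary keyed exactly as `Topic`, `Title`, `Question`, and `Answer`; the key casing, the `to_training_pairs` helper, and the sample rows are assumptions for illustration.

```python
REQUIRED_FIELDS = ("Topic", "Title", "Question", "Answer")


def to_training_pairs(rows):
    """Convert rows with the four fields into (question, answer) pairs."""
    pairs = []
    for row in rows:
        missing = [key for key in REQUIRED_FIELDS if not row.get(key)]
        if missing:
            # Skip incomplete rows rather than guessing at their content.
            continue
        pairs.append((row["Question"].strip(), row["Answer"].strip()))
    return pairs


# Made-up rows, for illustration only.
rows = [
    {
        "Topic": "Bonds",
        "Title": "Coupon Rate",
        "Question": "What is a coupon rate? ",
        "Answer": "The annual interest paid on a bond, as a share of face value.",
    },
    {"Topic": "Bonds", "Title": "Yield", "Question": "What is yield?", "Answer": ""},
]

pairs = to_training_pairs(rows)
print(pairs)
```

Rows with an empty field are dropped here so that a contrastive training loop never sees a question without its matching answer.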

FinPile

Finance

Corpus

FinPile is a secure, high‑quality, open‑source Chinese financial corpus for generating and auditing financial data.

github

takala/financial_phrasebank

Finance

Sentiment Analysis

The Financial PhraseBank is a dataset of financial news sentences for sentiment classification. It contains 4,840 English sentences, each annotated by 5 to 8 annotators and labeled positive, negative, or neutral. The dataset is provided in four configurations based on the minimum level of annotator agreement (at least 50%, 66%, 75%, and 100%). It was created to address the lack of high‑quality training data for financial sentiment analysis. The annotations come from 16 individuals with background knowledge of financial markets, including researchers and master's students. Use of the dataset is governed by the Creative Commons Attribution‑NonCommercial‑ShareAlike 3.0 Unported License.

hugging_face
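
The agreement-based configurations can be pictured with a small sketch that computes the share of annotators backing the majority label and lists the configurations a sentence would fall into. Treating each level as a minimum threshold is an assumption of this example, as are the helper names and the sample votes; ties are resolved in favor of the label that appears first.

```python
from collections import Counter

# Assumed minimum agreement for each configuration (illustrative thresholds).
THRESHOLDS = {"50%": 0.50, "66%": 0.66, "75%": 0.75, "100%": 1.00}


def majority_and_agreement(votes):
    """Return the most common label and the fraction of votes it received."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def configurations_for(votes):
    """List the configurations whose minimum agreement the votes satisfy."""
    _, agreement = majority_and_agreement(votes)
    return [name for name, minimum in THRESHOLDS.items() if agreement >= minimum]


# Made-up votes from six annotators, for illustration only.
votes = ["positive"] * 4 + ["neutral"] * 2

label, agreement = majority_and_agreement(votes)
included_in = configurations_for(votes)
print(label, round(agreement, 2), included_in)
```

Under this reading the configurations are nested: a sentence with unanimous votes appears in all four, while a sentence with a bare majority appears only in the loosest one.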
