JUHE API Marketplace
High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

Browse by Category

FinMR

Finance
Question Answering Reasoning

FinMR is a dataset designed for financial reasoning and question answering. It comprises questions, financial background information, and corresponding answers. The dataset combines text and visual data; records are stored in JSON files and reference their images by path. Each record includes a unique identifier, shared background, shared image path, question text, multiple‑choice options, correct answer, and detailed explanation. Financial experts perform the annotation to keep it accurate and consistent. The dataset may carry biases from its source financial documents, so users should be cautious when generalizing model outputs and should consider domain‑specific adaptation.

Hugging Face

DGraph

Finance
Dynamic Graph Data

DGraph is a collection of large‑scale dynamic graph datasets composed of events and labels that evolve over time in real financial scenarios.

GitHub

FinancialDatasets

Finance
NLP

The SmoothNLP Financial Text Dataset comprises multiple sub‑datasets covering corporate business information, financial news, column articles, investment institution data, investment events, and 36Kr news. It is intended for NLP research.

GitHub

FinLang/investopedia-embedding-dataset

Finance
Language Model Training

This dataset consists of financial content collected from the Investopedia website and converted from unstructured to structured form, making it suitable for fine‑tuning embedding models. The generation process uses a self‑verification step to check that the generated question‑answer pairs are not LLM hallucinations. Each data point contains four fields: Topic, Title, Question, and Answer. The dataset is in English and released under the CC‑BY‑NC‑4.0 license.
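As an illustration of the four-field layout described above, the following minimal Python sketch validates one record and turns it into a (question, answer) pair of the kind often used when fine-tuning embedding models. The field names come from the description; the `QARecord` class, the helper functions, and the sample record are hypothetical and are not part of the dataset or of any library.

```python
from dataclasses import dataclass

# Field names as listed in the dataset description.
REQUIRED_FIELDS = ("Topic", "Title", "Question", "Answer")


@dataclass(frozen=True)
class QARecord:
    """Hypothetical container for one data point."""
    topic: str
    title: str
    question: str
    answer: str


def parse_record(raw: dict) -> QARecord:
    """Check that all four fields are present and non-empty, then build a QARecord."""
    missing = [f for f in REQUIRED_FIELDS if not str(raw.get(f) or "").strip()]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return QARecord(*(str(raw[f]).strip() for f in REQUIRED_FIELDS))


def to_pair(record: QARecord) -> tuple:
    """Return an (anchor, positive) text pair: the question and its answer."""
    return record.question, record.answer


# Made-up sample record, for illustration only.
sample = {
    "Topic": "Investing",
    "Title": "Hypothetical article title",
    "Question": "What is a dividend?",
    "Answer": "A dividend is a distribution of a company's earnings to shareholders.",
}

pair = to_pair(parse_record(sample))
print(pair[0])
```

Pairing the question with its answer is only one possible choice; the Topic and Title fields could equally serve as metadata for filtering or grouping.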

Hugging Face

FinPile

Finance
Corpus

FinPile is a secure, high‑quality, open‑source Chinese financial corpus for generating and auditing financial data.

GitHub

takala/financial_phrasebank

Finance
Sentiment Analysis

FinancialPhrasebank is a dataset of financial news sentences for sentiment classification. It contains 4,840 English sentences, each labeled by 5–8 annotators. The dataset is provided in four configurations based on the level of annotator agreement (50%, 66%, 75%, and 100%). It was created to address the lack of high‑quality training data for financial sentiment analysis. The annotations come from a pool of 16 people with background knowledge of financial markets, including researchers and master's students. Use of the dataset is governed by the Creative Commons Attribution‑NonCommercial‑ShareAlike 3.0 Unported License.
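To make the agreement levels above concrete, the sketch below computes the majority label and agreement rate for one sentence's votes and lists which of the four levels it would satisfy. It assumes each level means "at least this share of annotators chose the majority label"; the function names, label strings, and sample votes are hypothetical and are not taken from the dataset's tooling.

```python
from collections import Counter

# Agreement levels named in the description, as fractions.
# Assumption: a sentence qualifies when its agreement rate is >= the threshold.
THRESHOLDS = {"50%": 0.50, "66%": 0.66, "75%": 0.75, "100%": 1.00}


def majority_agreement(votes):
    """Return (majority label, share of annotators who chose it).

    On a tie, Counter.most_common returns the label that was seen first.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def qualifying_levels(votes):
    """List the agreement levels that this set of votes satisfies."""
    _, rate = majority_agreement(votes)
    return [name for name, threshold in THRESHOLDS.items() if rate >= threshold]


# Made-up votes from six annotators: four "positive", two "neutral".
votes = ["positive"] * 4 + ["neutral"] * 2
print(majority_agreement(votes)[0])  # positive
print(qualifying_levels(votes))      # ['50%', '66%']
```

Under this reading the configurations are nested: a sentence that reaches 100% agreement also appears in the three lower ones.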

Hugging Face