FinLang/investopedia-embedding-dataset
This dataset consists of financial data collected from the Investopedia website and is transformed from unstructured to structured format using a novel technique, making it suitable for fine‑tuning embedding models. The generation process employs a self‑verification method to ensure that the generated question‑answer pairs are not hallucinated by LLMs. Each data point contains four fields: Topic, Title, Question, and Answer. The dataset is in English and released under the CC‑BY‑NC‑4.0 license.
Dataset description and usage context
Dataset Card - investopedia-embedding Dataset
Dataset Description
Overview
investopedia-embedding is a large‑scale dataset collected from Investopedia covering the financial domain. It utilizes a novel technique to convert unstructured scraped data and large language model (LLM) outputs into structured data suitable for fine‑tuning embedding models. The dataset adopts a new self‑verification method to ensure that the generated QA pairs have a high probability of not being hallucinated by LLMs.
Data Point Structure
Each data point includes the following fields:
- Topic: General category around which the QA pair is generated.
- Title: More detailed description or title of the paragraph used to generate the QA pair.
- Question: Sentence 1 in the embedding model training dataset, also called the anchor.
- Answer: Sentence 3 in the embedding model training dataset, also called the positive sample.
Example
```json
{
  "Topic": "mortgage",
  "Title": "",
  "Question": "What are the advantages of using a home equity loan for home improvements compared with unsecured options such as personal loans?",
  "Answer": "The passage highlights two main advantages: home equity loans typically offer lower interest rates than unsecured options such as personal loans, which helps save on home-improvement costs. In addition, they carry fixed interest rates, providing stable monthly payments and protecting against rate changes over the entire repayment term."
}
```
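Since each record already pairs an anchor (Question) with a positive sample (Answer), converting the dataset into contrastive training pairs is a direct field mapping. A minimal sketch in Python; the sample record below is illustrative, not taken from the actual dataset:

```python
# Convert dataset records into (anchor, positive) pairs for
# contrastive embedding-model training.
sample_records = [
    {
        "Topic": "mortgage",
        "Title": "",
        "Question": "What are the advantages of a home equity loan?",
        "Answer": "Home equity loans typically offer lower, fixed interest rates.",
    },
]

def to_pairs(records):
    """Map each record to an (anchor, positive) training pair."""
    return [(r["Question"], r["Answer"]) for r in records]

pairs = to_pairs(sample_records)
print(pairs[0])  # (anchor, positive)
```

Pairs in this shape plug into standard contrastive losses (e.g. in-batch negatives), where other answers in the batch serve as negatives for each anchor.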
Dataset Information
- Creation Team: FinLang Team
- Language: English
- License: cc‑by‑nc‑4.0
Dataset Structure
The dataset is split into 90% training and 10% test.
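A 90/10 split like the one above can be reproduced deterministically on any list of records. A sketch, noting that the seed and shuffling strategy here are assumptions, not documented by the dataset:

```python
import random

def train_test_split(records, test_fraction=0.1, seed=42):
    """Deterministically shuffle and split records into train/test subsets."""
    shuffled = records[:]                     # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)     # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = [{"Question": f"q{i}", "Answer": f"a{i}"} for i in range(100)]
train, test = train_test_split(records)
print(len(train), len(test))  # 90 10
```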
Dataset Creation
Motivation
In finance, language models face three major limitations:
- There is no large publicly available dataset (million-scale tokens) suitable for language- and embedding-model fine-tuning, as internal data are kept private by companies such as Bloomberg for monetary and privacy reasons.
- Current language models perform poorly on complex financial abbreviations, again pointing to insufficient training data.
- Although plentiful financial data exist online (Investopedia, Yahoo Finance, etc.), extracting them in a format suitable for instruction tuning or embedding training is difficult, because annotating unstructured datasets incurs high costs that require highly paid financial experts.
Source Data
Source data collected from Investopedia.
License
Since the data used to generate the dataset are licensed for non-commercial use, we release this dataset under the cc-by-nc-4.0 license.