FinLang/investopedia-embedding-dataset
This dataset consists of financial data collected from the Investopedia website and is transformed from unstructured to structured format using a novel technique, making it suitable for fine‑tuning embedding models. The generation process employs a self‑verification method to ensure that the generated question‑answer pairs are not hallucinated by LLMs. Each data point contains four fields: Topic, Title, Question, and Answer. The dataset is in English and released under the CC‑BY‑NC‑4.0 license.
Dataset description and usage context
Dataset Card - investopedia-embedding Dataset
Dataset Description
Overview
investopedia-embedding is a large‑scale dataset collected from Investopedia covering the financial domain. It utilizes a novel technique to convert unstructured scraped data and large language model (LLM) outputs into structured data suitable for fine‑tuning embedding models. The dataset adopts a new self‑verification method to ensure that the generated QA pairs have a high probability of not being hallucinated by LLMs.
Data Point Structure
Each data point includes the following fields:
- Topic: General category around which the QA pair is generated.
- Title: More detailed description or title of the paragraph used to generate the QA pair.
- Question: Sentence 1 in the embedding model training dataset, also called the anchor.
- Answer: Sentence 3 in the embedding model training dataset, also called the positive sample.
Example
```json
{
  "Topic": "mortgage",
  "Title": "",
  "Question": "What are the advantages of using a home equity loan for home improvements compared with unsecured options such as personal loans?",
  "Answer": "The passage highlights two main advantages: home equity loans typically offer lower interest rates than unsecured options such as personal loans, which helps save on home-improvement costs. In addition, they carry fixed interest rates, providing stable monthly payments and protecting against rate changes over the entire repayment term."
}
```
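Since each record already pairs an anchor (Question) with a positive sample (Answer), converting the dataset into contrastive training pairs is a direct field mapping. A minimal sketch in Python; the sample record below is illustrative, not taken from the actual dataset:

```python
# Convert dataset records into (anchor, positive) pairs for
# contrastive embedding-model training.
sample_records = [
    {
        "Topic": "mortgage",
        "Title": "",
        "Question": "What are the advantages of a home equity loan?",
        "Answer": "Home equity loans typically offer lower, fixed interest rates.",
    },
]

def to_pairs(records):
    """Map each record to an (anchor, positive) training pair."""
    return [(r["Question"], r["Answer"]) for r in records]

pairs = to_pairs(sample_records)
print(pairs[0])  # (anchor, positive)
```

Pairs in this shape plug into standard contrastive losses (e.g. in-batch negatives), where other answers in the batch serve as negatives for each anchor.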
Dataset Information
- Creation Team: FinLang Team
- Language: English
- License: cc‑by‑nc‑4.0
Dataset Structure
The dataset is split into 90% training and 10% test.
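A 90/10 split like the one above can be reproduced deterministically on any list of records. A sketch, noting that the seed and shuffling strategy here are assumptions, not documented by the dataset:

```python
import random

def train_test_split(records, test_fraction=0.1, seed=42):
    """Deterministically shuffle and split records into train/test subsets."""
    shuffled = records[:]                     # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)     # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = [{"Question": f"q{i}", "Answer": f"a{i}"} for i in range(100)]
train, test = train_test_split(records)
print(len(train), len(test))  # 90 10
```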
Dataset Creation
Motivation
In finance, language models face three major limitations:
- There is no large publicly available dataset (million-scale tokens) suitable for language- and embedding-model fine-tuning, as internal data are kept private by companies such as Bloomberg for monetary and privacy reasons.
- Current language models perform poorly on complex financial abbreviations, again pointing to insufficient training data.
- Although plentiful financial data exist online (Investopedia, Yahoo Finance, etc.), extracting them in a format suitable for instruction tuning or embedding training is difficult, because annotating unstructured datasets incurs high costs that require highly paid financial experts.
Source Data
Source data collected from Investopedia.
License
Since the data used to generate the dataset are licensed for non-commercial use, we release this dataset under the cc-by-nc-4.0 license.