Dataset asset | Open Source Community | Sentiment Analysis | Finance

takala/financial_phrasebank

FinancialPhrasebank is a dataset of sentences from financial news for sentiment classification. It contains 4,840 English sentences, each labeled by 5 to 8 annotators, and is provided in four configurations that differ in the minimum annotator agreement required (50%, 66%, 75%, and 100%). It was created to address the lack of high-quality training data for financial sentiment analysis. The annotations were made by 16 people with background knowledge of financial markets, including researchers and master's students. Use of the dataset is governed by the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Source
hugging_face
Created
Nov 28, 2025
Updated
Jan 18, 2024

Dataset Overview

  • Name: FinancialPhrasebank
  • Language: English
  • License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
  • Multilinguality: Monolingual
  • Size: 1K<n<10K
  • Source Dataset: Original Data
  • Task Category: Text Classification
  • Task ID: Multi‑class Classification, Sentiment Classification
  • Label Creator: Expert Generated
  • Language Creator: Found

Dataset Structure

Data Instance

```json
{
  "sentence": "Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .",
  "label": "negative"
}
```

Data Fields

  • sentence: A tokenized sentence from the dataset (string).
  • label: The sentiment class of the sentence (categorical); one of negative, neutral, or positive.
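
As a minimal sketch of how a record with these two fields might be parsed and validated, the snippet below reads the sample instance shown above and maps its label to an integer id. The helper name `parse_instance` is hypothetical, and the integer ids (position in the label list) are an assumption made for illustration; this page does not state how labels are encoded.

```python
import json
from typing import Tuple

# Label names as listed under Data Fields. The integer ids are an
# assumption for illustration (index in this list), not taken from the card.
LABELS = ["negative", "neutral", "positive"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


def parse_instance(raw: str) -> Tuple[str, int]:
    """Parse one JSON record and return (sentence, label_id)."""
    record = json.loads(raw)
    sentence = record["sentence"]
    label = record["label"]
    if not isinstance(sentence, str):
        raise TypeError("sentence must be a string")
    if label not in LABEL_TO_ID:
        raise ValueError(f"unexpected label: {label!r}")
    return sentence, LABEL_TO_ID[label]


# The sample instance from the Data Instance section.
raw = (
    '{"sentence": "Pharmaceuticals group Orion Corp reported a fall in its '
    'third-quarter earnings that were hit by larger expenditures on R&D and '
    'marketing .", "label": "negative"}'
)
sentence, label_id = parse_instance(raw)
print(label_id, LABELS[label_id])
```

Because the sentences are already tokenized, punctuation appears as separate space-delimited tokens, so `sentence.split()` yields a usable token list without further preprocessing.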

Data Split

  • sentences_allagree: All annotators (100% agreement), 2,264 instances.
  • sentences_75agree: >=75% annotator agreement, 3,453 instances.
  • sentences_66agree: >=66% annotator agreement, 4,217 instances.
  • sentences_50agree: >=50% annotator agreement, 4,846 instances.
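
Choosing a configuration is a trade-off between label reliability and data volume. The sketch below, using the thresholds and instance counts listed above, picks the largest configuration that still meets a required agreement level. `pick_config` and `load_config` are hypothetical helper names. `load_config` assumes the Hugging Face `datasets` library and the repository id shown at the top of this page; it is not called here, and whether it works unchanged depends on the installed library version.

```python
# Minimum agreement (percent) and instance count per configuration,
# copied from the list above.
CONFIGS = {
    "sentences_allagree": (100, 2264),
    "sentences_75agree": (75, 3453),
    "sentences_66agree": (66, 4217),
    "sentences_50agree": (50, 4846),
}


def pick_config(min_agreement_pct: int) -> str:
    """Return the largest configuration whose threshold meets the requirement."""
    eligible = [
        (size, name)
        for name, (threshold, size) in CONFIGS.items()
        if threshold >= min_agreement_pct
    ]
    if not eligible:
        raise ValueError("agreement cannot exceed 100%")
    return max(eligible)[1]


def load_config(name: str):
    """Load one configuration from the Hugging Face Hub (needs network access)."""
    # Third-party dependency, imported lazily so the rest runs without it.
    from datasets import load_dataset
    return load_dataset("takala/financial_phrasebank", name)


print(pick_config(70))  # sentences_75agree
```

A stricter requirement returns a smaller, cleaner set: `pick_config(100)` selects `sentences_allagree`, while `pick_config(50)` selects `sentences_50agree`.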

Dataset Creation

Source Data

  • Initial Data Collection and Normalization: English financial news was downloaded from the LexisNexis database. A random subset of 10,000 articles was selected, and approximately 5,000 sentences remained after filtering.
  • Source Language Producers: Multiple financial journalists.

Annotation

  • Annotation Process: 4,840 sentences were annotated by 16 individuals with financial background knowledge.
  • Annotators: Three researchers and thirteen master's students from Aalto University's School of Business, primarily specializing in finance, accounting, and economics.
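
Since each sentence was labeled by several annotators and the configurations are defined by agreement level, the following sketch shows one way an agreement rate could be computed from individual votes and mapped to configurations. It assumes that agreement means the share of annotators who chose the majority label; this page does not spell out the exact aggregation rule, and the function name `majority_and_configs` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Thresholds (percent) taken from the configuration names in this card.
THRESHOLDS = [
    ("sentences_allagree", 100),
    ("sentences_75agree", 75),
    ("sentences_66agree", 66),
    ("sentences_50agree", 50),
]


def majority_and_configs(votes: List[str]) -> Tuple[str, List[str]]:
    """Return the majority label and the configurations the sentence would join."""
    if not votes:
        raise ValueError("at least one annotation is required")
    label, count = Counter(votes).most_common(1)[0]
    # Integer comparison avoids floating-point rounding at the thresholds.
    configs = [name for name, pct in THRESHOLDS if count * 100 >= pct * len(votes)]
    return label, configs


# Five annotators, four of whom agree (80% agreement).
votes = ["positive", "positive", "positive", "positive", "neutral"]
label, configs = majority_and_configs(votes)
print(label, configs)
```

Under this assumption the example sentence would appear in the 75%, 66%, and 50% configurations but not in `sentences_allagree`, which is consistent with the instance counts growing as the threshold is relaxed.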

Dataset Usage Considerations

  • Bias Discussion: All annotators came from the same institution, so inter-annotator agreement should be interpreted with this in mind.