takala/financial_phrasebank
FinancialPhrasebank is a dataset of financial news sentences for sentiment classification. It contains 4,840 English sentences, each labeled by 5–8 annotators and categorized by their agreement rate. The dataset is provided in four configurations based on annotator agreement level (50%, 66%, 75%, and 100%). It was created to address the lack of high-quality training data for financial sentiment analysis. The sentences were annotated by 16 people with background knowledge of financial markets, including researchers and master's students. Use of the dataset is governed by the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Description
Dataset Overview
- Name: FinancialPhrasebank
- Language: English
- License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
- Multilinguality: Monolingual
- Size: 1K<n<10K
- Source Dataset: Original Data
- Task Category: Text Classification
- Task ID: Multi‑class Classification, Sentiment Classification
- Label Creator: Expert Generated
- Language Creator: Found
Dataset Structure
Data Instance
{
  "sentence": "Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .",
  "label": "negative"
}
Data Fields
- sentence: A tokenized sentence from the dataset (string).
- label: The sentiment class of the sentence (categorical label); one of negative, neutral, or positive.
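A minimal sketch of how a record with these two fields might be checked and its label turned into an integer id. The names `LABELS`, `validate_record`, and `encode_label` are hypothetical helpers, and the id order (negative=0, neutral=1, positive=2) is an assumption; the card lists the categories but not their integer encoding.

```python
import json

# Assumed label order; the card names the categories but not their ids.
LABELS = ("negative", "neutral", "positive")


def validate_record(record: dict) -> None:
    """Raise ValueError if the record lacks the two documented fields."""
    sentence = record.get("sentence")
    label = record.get("label")
    if not isinstance(sentence, str) or not sentence.strip():
        raise ValueError("'sentence' must be a non-empty string")
    if label not in LABELS:
        raise ValueError(f"'label' must be one of {LABELS}, got {label!r}")


def encode_label(label: str) -> int:
    """Map a label name to its position in LABELS."""
    return LABELS.index(label)


# The data instance shown in the card, serialized as JSON.
raw = json.dumps({
    "sentence": (
        "Pharmaceuticals group Orion Corp reported a fall in its "
        "third-quarter earnings that were hit by larger expenditures "
        "on R&D and marketing ."
    ),
    "label": "negative",
})

record = json.loads(raw)
validate_record(record)
label_id = encode_label(record["label"])
```

Keeping the label names in one tuple means the same definition drives both validation and encoding, so the two cannot drift apart.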
Data Split
- sentences_allagree: All annotators (100% agreement), 2,264 instances.
- sentences_75agree: >=75% annotator agreement, 3,453 instances.
- sentences_66agree: >=66% annotator agreement, 4,217 instances.
- sentences_50agree: >=50% annotator agreement, 4,846 instances.
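The agreement thresholds above can be illustrated with a small sketch that takes one sentence's annotator votes and reports which configurations it would fall into. This is an assumption-laden illustration, not the dataset authors' procedure: agreement is taken to be the vote share of the most common label, the thresholds are read directly from the list above, and `majority_agreement` and `configurations_for` are hypothetical helper names.

```python
from collections import Counter

# Thresholds read from the configuration list; treating agreement as the
# majority label's vote share, compared with >=, is an assumption.
THRESHOLDS = {
    "sentences_50agree": 0.50,
    "sentences_66agree": 0.66,
    "sentences_75agree": 0.75,
    "sentences_allagree": 1.00,
}


def majority_agreement(votes):
    """Return (majority_label, share of annotators who chose it)."""
    if not votes:
        raise ValueError("votes must not be empty")
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def configurations_for(votes):
    """List the configurations whose threshold the vote share meets."""
    _, share = majority_agreement(votes)
    return [name for name, minimum in THRESHOLDS.items() if share >= minimum]


# Hypothetical votes from five annotators: four of five agree.
votes = ["negative", "negative", "negative", "negative", "neutral"]
label, share = majority_agreement(votes)
configs = configurations_for(votes)
```

Under these assumptions the configurations are nested: any sentence that meets a stricter threshold also meets the looser ones, which is consistent with the instance counts growing from 2,264 to 4,846 as the threshold drops.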
Dataset Creation
Source Data
- Initial Data Collection and Normalization: English financial news downloaded from the LexisNexis database; 10,000 articles were randomly selected, resulting in approximately 5,000 sentences after filtering.
- Source Language Producers: Multiple financial journalists.
Annotation
- Annotation Process: 4,840 sentences were annotated by 16 individuals with financial background knowledge.
- Annotators: Three researchers and thirteen master's students from Aalto University's School of Business, primarily specializing in finance, accounting, and economics.
Dataset Usage Considerations
- Bias Discussion: All annotators come from the same institution, which should be taken into account when interpreting the annotator agreement levels.
Source
Organization: hugging_face
Created: Unknown