FinPile
FinPile is a secure, high‑quality, open‑source Chinese financial corpus for generating and auditing financial data.
Updated 9/20/2024
Dataset Overview
FinPile is a secure, high‑quality, open‑source Chinese financial corpus.
Environment Requirements
- Recommended Python version: 3.11.4
- Dependency installation:
pip install -r requirements.txt
Data Pre‑processing Steps
1. Removal of Personal Information
- Function: Remove IP addresses, email addresses, phone numbers, and other personal data.
- Example usage:
python 1_pii.py \
--input_path input.jsonl \
--output_path output.jsonl \
--text_column text \
--num_proc 4 \
--batch_size 100
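As a rough illustration of what this step does, the sketch below masks email addresses, IPv4 addresses, and mainland-China mobile numbers with regular expressions. The patterns, the placeholder tokens, and the function name `scrub_pii` are assumptions made for this example; the actual rules in 1_pii.py may differ.

```python
import re

# Hypothetical patterns for illustration only; 1_pii.py may use different rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IP_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # 11-digit mainland mobile numbers


def scrub_pii(text: str) -> str:
    """Replace matched personal data with placeholder tokens (an assumed convention)."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IP_RE.sub("<IP>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    print(scrub_pii("联系 a.b@example.com 或 13812345678,服务器 192.168.1.10"))
```

Emails are masked first so that digits inside an address are not mistaken for a phone number or an IP fragment.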
2. Sensitive Word Filtering
- Function: Filter texts containing specific sensitive keywords.
- Sensitive word file location: 2_toxic_filter/sensitive_words
- Example usage:
python 2_toxic_filter/2_toxic_filter.py \
--input_path input.jsonl \
--output_path output.jsonl \
--text_column text
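A minimal sketch of keyword-based filtering is shown below, assuming the word list is plain text with one entry per line and that a record is dropped as soon as any entry occurs in its text. The helper names and the sample words are hypothetical; the format of the files under 2_toxic_filter/sensitive_words is not specified here.

```python
def load_words(lines):
    """Build a word set from an iterable of lines, skipping blank entries."""
    return {line.strip() for line in lines if line.strip()}


def is_clean(text, words):
    """Return True if none of the sensitive words occurs in the text."""
    return not any(word in text for word in words)


def filter_records(records, words, text_column="text"):
    """Keep only the records whose text column contains no sensitive word."""
    return [r for r in records if is_clean(r[text_column], words)]


if __name__ == "__main__":
    words = load_words(["词A\n", "\n", "词B\n"])  # placeholder entries
    records = [{"text": "这是一条正常的财经新闻"}, {"text": "这里出现了词A"}]
    print(filter_records(records, words))
```

Substring matching is simple but scales poorly with very large word lists; an Aho-Corasick automaton is a common replacement in that case.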
3. Rule‑Based Filtering
- Function: Apply multiple rules for data filtering.
  - Language filter: Retain texts in specific languages (e.g., Chinese or English).
  - Punctuation and whitespace normalization: Standardize punctuation and whitespace characters.
  - Consecutive punctuation deduplication: Replace consecutive punctuation marks with a single one.
  - Punctuation‑ratio filter: Remove texts with an excessively high punctuation ratio.
  - Length filter: Remove overly short texts.
- Example usage:
python 3_rule_filter.py \
--input_path input.jsonl \
--output_path output.jsonl \
--text_column text \
--language zh-cn \
--punctuation_ratio_threshold 0.5 \
--text_length_threshold 128
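The rules above can be sketched roughly as follows. This example covers whitespace normalization, collapsing repeated punctuation, the punctuation-ratio check, and the length check; the language filter is omitted because it would need a language-identification library. All function names are hypothetical, and measuring length in characters is an assumption (3_rule_filter.py may count differently).

```python
import re
import unicodedata


def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def collapse_punctuation(text):
    """Replace a run of the same punctuation or symbol character with one copy."""
    return re.sub(r"([^\w\s])\1+", r"\1", text)


def punctuation_ratio(text):
    """Fraction of characters whose Unicode category is punctuation (P*)."""
    if not text:
        return 0.0
    count = sum(1 for ch in text if unicodedata.category(ch).startswith("P"))
    return count / len(text)


def keep(text, punctuation_ratio_threshold=0.5, text_length_threshold=128):
    """Keep texts that are long enough and not dominated by punctuation."""
    if len(text) < text_length_threshold:
        return False
    return punctuation_ratio(text) <= punctuation_ratio_threshold


if __name__ == "__main__":
    raw = "今日股市   大涨!!!\n\n真的吗??"
    print(collapse_punctuation(normalize_whitespace(raw)))
```

The default thresholds mirror the values in the example command (0.5 and 128).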
4. Perplexity Filtering
- Function: Filter data based on a perplexity model.
- Model download: link
- Example usage:
python 4_perplexity_filter/kenlm/run.py \
--input_path input.jsonl \
--output_path output.jsonl \
--text_column text \
--language zh
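For intuition, perplexity can be derived from a language model's total log probability for a text. The sketch below assumes a base-10 log probability (the convention KenLM uses for its scores) and a token count supplied by the caller; the function names and the cutoff value are hypothetical, since the thresholds used by run.py are not stated here.

```python
def perplexity_from_log10(total_log10_prob, num_tokens):
    """Perplexity = 10 ** (-(total log10 probability) / number of tokens)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return 10.0 ** (-total_log10_prob / num_tokens)


def keep_by_perplexity(perplexity, max_perplexity):
    """Keep a text when the model finds it sufficiently predictable."""
    return perplexity <= max_perplexity


if __name__ == "__main__":
    # Made-up score for illustration: log10 P(text) = -6.0 over 3 tokens.
    ppl = perplexity_from_log10(-6.0, 3)
    print(ppl, keep_by_perplexity(ppl, max_perplexity=500.0))
```

Lower perplexity means the text looks more like the model's training data, so unusually high values tend to indicate garbled or low-quality text.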
5. Exact Deduplication
- Function: Remove completely identical text entries.
- Example usage:
python 5_text_dedup/5_clean.py \
--input_path input.jsonl \
--output_path output.jsonl \
--text_column text \
--cache cache_dir \
--num_proc 2 \
--batch_size 100
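Exact deduplication is commonly implemented by hashing each text and keeping only the first record for every digest. The sketch below takes that approach with SHA-256; it is an assumption about the general technique, not a description of how 5_clean.py works, and the function name is hypothetical.

```python
import hashlib


def dedup_exact(records, text_column="text"):
    """Keep the first record for each distinct text, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record[text_column].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


if __name__ == "__main__":
    rows = [{"text": "央行降息"}, {"text": "央行降息"}, {"text": "股市上涨"}]
    print(dedup_exact(rows))
```

Storing digests instead of full texts keeps memory use bounded by the number of distinct records rather than their total length.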
6. Fuzzy Deduplication
- Function: Remove near‑duplicate text entries.
- Example usage:
python 6_text_dedup/text_dedup/minhash.py \
--input_path input.jsonl \
--output_path output.jsonl \
--column text \
--cache_dir cache_dir \
--threshold 0.8 \
--false_positive_weight 0.5 \
--false_negative_weight 0.5
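To illustrate the idea behind MinHash, the sketch below builds signatures over character 5-grams and estimates Jaccard similarity as the fraction of matching signature positions. It compares texts pairwise for clarity, whereas MinHash tools typically use locality-sensitive hashing to avoid that quadratic cost; the false-positive and false-negative weights in the command above probably tune that LSH banding, which this sketch does not model. The function names, the n-gram size, and the number of permutations are assumptions.

```python
import hashlib


def shingles(text, n=5):
    """Set of character n-grams; a short text yields a single shingle."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text, num_perm=128, n=5):
    """For each salted hash function, record the minimum hash over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text, n)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup_near(texts, threshold=0.8, num_perm=128):
    """Greedy pairwise filter: drop a text that is too similar to a kept one."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text, num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept
```

With a threshold of 0.8, as in the example command, two texts are treated as near-duplicates when roughly 80% or more of their shingles overlap.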
Data Evaluation
- Evaluation dimensions:
  - Language Quality: Grammar, spelling, vocabulary usage, and fluency.
  - Information Content: Amount of knowledge and concepts.
  - Novelty: Presence of new vocabulary, information, or viewpoints.
  - Coherence: Clear theme, logical argumentation, and rigorous reasoning.
  - Purity: Absence of irrelevant content (e.g., ads, spam).
- Example usage:
python 7_DataAnalysis/eval_pipeline.py \
--data_path input.jsonl \
--eval_path output.jsonl \
--text_column text \
--tiktoken_cache cache_dir \
--figure_dir figure_dir \
--model gpt-3.5-turbo-1106 \
--api_key xxxx \
--organization xxxx \
--num_proc 1
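One plausible way to score a text on these five dimensions with a chat model is to ask for a JSON reply and validate it before aggregating. The sketch below shows only prompt construction and reply parsing, with no API call; the prompt wording, the 1 to 5 scale, the key names, and both function names are assumptions and are not taken from eval_pipeline.py.

```python
import json

# Hypothetical key names for the five evaluation dimensions listed above.
DIMENSIONS = [
    "language_quality",
    "information_content",
    "novelty",
    "coherence",
    "purity",
]


def build_prompt(text):
    """Ask the model for one integer score per dimension, returned as JSON."""
    keys = ", ".join(DIMENSIONS)
    return (
        "Rate the following text from 1 to 5 on each of these dimensions: "
        f"{keys}. Reply with a JSON object using exactly those keys.\n\n"
        f"Text:\n{text}"
    )


def parse_scores(reply):
    """Parse the model reply and check that every dimension has a score."""
    data = json.loads(reply)
    missing = [d for d in DIMENSIONS if d not in data]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return {d: float(data[d]) for d in DIMENSIONS}


if __name__ == "__main__":
    # A hand-written reply standing in for a model response.
    reply = ('{"language_quality": 4, "information_content": 3, '
             '"novelty": 2, "coherence": 4, "purity": 5}')
    scores = parse_scores(reply)
    print(sum(scores.values()) / len(scores))
```

Validating the reply before aggregation guards against a model response that omits a dimension or returns malformed JSON.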
Topics
Finance
Corpus
Source
Organization: github
Created: 9/12/2024