FinPile

Dataset Overview

FinPile is a secure, high‑quality, open‑source Chinese financial corpus.

Environment Requirements

Recommended Python version: 3.11.4
Dependency installation: pip install -r requirements.txt

Data Pre‑processing Steps

1. Removal of Personal Information

Function: Remove IP addresses, email addresses, phone numbers, and other personal data.
Example usage:

python 1_pii.py \
    --input_path input.jsonl \
    --output_path output.jsonl \
    --text_column text \
    --num_proc 4 \
    --batch_size 100
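
The kind of scrubbing this step performs can be sketched with regular expressions. This is a minimal illustration, not the contents of 1_pii.py: the patterns, the `scrub_pii` name, and the placeholder tokens are assumptions, and the phone pattern only covers 11-digit mainland China mobile numbers.

```python
import re

# Hypothetical patterns for illustration; the real script may use different rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Four dot-separated groups of 1-3 digits, not embedded in a longer number.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")
# 11 digits starting with 1 and a second digit in 3-9 (assumed phone format).
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def scrub_pii(text: str) -> str:
    """Replace emails, IPv4 addresses, and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    print(scrub_pii("联系 test@example.com 或 13812345678，服务器 192.168.1.10"))
```

Emails are replaced first so that digits inside an address are not mistaken for a phone number or an IP address.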

2. Sensitive Word Filtering

Function: Filter out texts that contain specific sensitive keywords.
Sensitive word file location: 2_toxic_filter/sensitive_words
Example usage:

python 2_toxic_filter/2_toxic_filter.py \
    --input_path input.jsonl \
    --output_path output.jsonl \
    --text_column text
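
A keyword filter of this kind might look like the sketch below. It assumes the word list holds one keyword per line and that a record is dropped as soon as any keyword occurs in its text; the function names and the sample keywords are hypothetical, and the real script may match differently.

```python
from typing import Iterable


def load_words(lines: Iterable[str]) -> set[str]:
    """Build the keyword set, skipping blank lines (assumes one keyword per line)."""
    return {line.strip() for line in lines if line.strip()}


def contains_sensitive(text: str, words: set[str]) -> bool:
    """True if any keyword occurs as a substring of the text."""
    return any(word in text for word in words)


def filter_records(records: list[dict], words: set[str], text_column: str = "text") -> list[dict]:
    """Keep only the records whose text contains no keyword."""
    return [r for r in records if not contains_sensitive(r[text_column], words)]


if __name__ == "__main__":
    words = load_words(["赌博\n", "\n", "诈骗\n"])  # made-up sample keywords
    records = [{"text": "央行公布最新利率"}, {"text": "网络赌博平台推广"}]
    print(filter_records(records, words))
```

Substring matching is the simplest option; for a large word list, an Aho-Corasick automaton would scale better.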

3. Rule‑Based Filtering

Function: Apply multiple rules for data filtering.
- Language filter: Retain texts in specific languages (e.g., Chinese or English).
- Punctuation and whitespace normalization: Standardize punctuation and whitespace characters.
- Consecutive punctuation deduplication: Replace consecutive punctuation marks with a single one.
- Punctuation‑ratio filter: Remove texts with an excessively high punctuation ratio.
- Length filter: Remove overly short texts.
Example usage:

python 3_rule_filter.py \
    --input_path input.jsonl \
    --output_path output.jsonl \
    --text_column text \
    --language zh-cn \
    --punctuation_ratio_threshold 0.5 \
    --text_length_threshold 128
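
Three of the rules above (consecutive punctuation deduplication, the punctuation-ratio filter, and the length filter) can be sketched as follows. The function names are hypothetical, and the sketch assumes that length is counted in characters, that punctuation means Unicode category P, and that a text is kept when its ratio is at or below the threshold; 3_rule_filter.py may define these differently.

```python
import re
import unicodedata


def collapse_repeated_punctuation(text: str) -> str:
    """Collapse a run of the same non-word, non-space character into one."""
    return re.sub(r"([^\w\s])\1+", r"\1", text)


def punctuation_ratio(text: str) -> float:
    """Share of characters whose Unicode category starts with 'P'."""
    if not text:
        return 0.0
    punct = sum(unicodedata.category(ch).startswith("P") for ch in text)
    return punct / len(text)


def keep_text(text: str,
              punctuation_ratio_threshold: float = 0.5,
              text_length_threshold: int = 128) -> bool:
    """Keep texts that are long enough and not dominated by punctuation."""
    return (len(text) >= text_length_threshold
            and punctuation_ratio(text) <= punctuation_ratio_threshold)


if __name__ == "__main__":
    print(collapse_repeated_punctuation("涨停！！！真的？？"))
    print(keep_text("短文本"))
```

The regular expression only merges repeats of the same mark, so a mixed run such as "?!" is left unchanged.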

4. Perplexity Filtering

Function: Filter data by the perplexity score that a language model (KenLM, judging by the script path) assigns to each text.
Model download: link
Example usage:

python 4_perplexity_filter/kenlm/run.py \
    --input_path input.jsonl \
    --output_path output.jsonl \
    --text_column text \
    --language zh
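
For reference, perplexity is the inverse geometric mean of the per-token probabilities. The sketch below computes it from base-10 log-probabilities, the form in which KenLM reports scores. The `max_perplexity` cutoff and the rule that high-perplexity texts are dropped are assumptions; the README does not state the threshold used by run.py.

```python
def perplexity(log10_probs: list[float]) -> float:
    """Perplexity from per-token base-10 log-probabilities: 10 ** (-mean)."""
    if not log10_probs:
        raise ValueError("need at least one token score")
    mean_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-mean_log10)


def keep_by_perplexity(ppl: float, max_perplexity: float) -> bool:
    """Keep a text when its perplexity does not exceed the (hypothetical) cutoff."""
    return ppl <= max_perplexity


if __name__ == "__main__":
    scores = [-1.0, -1.0, -1.0]  # each token has probability 0.1
    print(perplexity(scores))
```

A lower perplexity means the model finds the text more predictable, which is why it is commonly used as a proxy for fluency.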

5. Exact Deduplication

Function: Remove completely identical text entries.
Example usage:

python 5_text_dedup/5_clean.py \
    --input_path input.jsonl \
    --output_path output.jsonl \
    --text_column text \
    --cache cache_dir \
    --num_proc 2 \
    --batch_size 100
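
Exact deduplication is commonly done by hashing each text and keeping the first record for every distinct hash. The sketch below illustrates that idea in memory; it is not the logic of 5_clean.py, and the `dedup_exact` name and the choice of SHA-256 are assumptions.

```python
import hashlib


def dedup_exact(records: list[dict], text_column: str = "text") -> list[dict]:
    """Keep the first record for each distinct text, preserving input order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record[text_column].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


if __name__ == "__main__":
    rows = [{"text": "A股收涨"}, {"text": "港股收跌"}, {"text": "A股收涨"}]
    print(dedup_exact(rows))
```

Storing digests instead of full texts keeps memory use bounded by the number of distinct entries rather than by their length.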

6. Fuzzy Deduplication

Function: Remove near‑duplicate text entries.
Example usage:

python 6_text_dedup/text_dedup/minhash.py \
    --input_path input.jsonl \
    --output_path output.jsonl \
    --column text \
    --cache_dir cache_dir \
    --threshold 0.8 \
    --false_positive_weight 0.5 \
    --false_negative_weight 0.5
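
The idea behind MinHash can be sketched as below: each text is reduced to a set of character n-grams, and the fraction of matching positions in two signatures estimates the Jaccard similarity of those sets. This is an illustration under stated assumptions, not the code in minhash.py: the function names, the 3-character shingles, and the 128 hash functions are hypothetical, and the sketch compares signatures directly rather than using LSH banding.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams (the whole text if it is shorter than n)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each seeded hash function, record the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("央行宣布下调存款准备金率0.5个百分点"))
    b = minhash_signature(shingles("央行宣布下调存款准备金率0.5个百分点。"))
    print(estimated_jaccard(a, b))
```

Under this reading, a pair whose estimated similarity reaches the `--threshold` value (0.8 in the command above) would be treated as near-duplicates, which is an interpretation of the flag rather than a documented behavior.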

Data Evaluation

Evaluation dimensions:
- Language Quality: Grammar, spelling, vocabulary usage, and fluency.
- Information Content: Amount of knowledge and concepts.
- Novelty: Presence of new vocabulary, information, or viewpoints.
- Coherence: Clear theme, logical argumentation, and rigorous reasoning.
- Purity: Absence of irrelevant content (e.g., ads, spam).
Example usage:

python 7_DataAnalysis/eval_pipeline.py \
    --data_path input.jsonl \
    --eval_path output.jsonl \
    --text_column text \
    --tiktoken_cache cache_dir \
    --figure_dir figure_dir \
    --model gpt-3.5-turbo-1106 \
    --api_key xxxx \
    --organization xxxx \
    --num_proc 1
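
Once each text has a score per dimension, a corpus-level summary is a per-dimension average. The sketch below assumes a hypothetical record layout with one numeric field per dimension; the field names and the score scale are illustrative and do not describe the actual output of eval_pipeline.py.

```python
from statistics import mean

# Hypothetical field names, one per evaluation dimension listed above.
DIMENSIONS = ("language_quality", "information_content", "novelty", "coherence", "purity")


def summarize_scores(records: list[dict]) -> dict[str, float]:
    """Average each dimension's score over all evaluated records."""
    return {dim: mean([r[dim] for r in records]) for dim in DIMENSIONS}


if __name__ == "__main__":
    scored = [
        {"language_quality": 4, "information_content": 3, "novelty": 2, "coherence": 4, "purity": 5},
        {"language_quality": 2, "information_content": 5, "novelty": 4, "coherence": 2, "purity": 3},
    ]
    print(summarize_scores(scored))
```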
