Dataset asset · Open Source Community · Cultural Heritage · Classical Literature

Chinese-Poetry-Dataset

The most comprehensive database of classical Chinese literature, containing about 55,000 Tang poems, 260,000 Song poems, and 21,000 Song lyrics (ci). It covers roughly 14,000 poets from the Tang and Song dynasties and about 1,500 lyricists from the Song era. The texts were collected from sources on the Internet.

Source: GitHub
Created: Dec 18, 2017
Updated: Apr 16, 2024

Dataset Overview

Dataset Name

chinese-poetry: Most Comprehensive Classical Chinese Poetry Database

Dataset Content

  • Tang Poetry: 55,000 poems
  • Song Poetry: 260,000 poems
  • Song Lyrics (ci): 21,000 lyrics
  • Tang & Song Poets: Approximately 14,000
  • Song Lyricists: Approximately 1,500
  • Other Collections: The Five Dynasties Huajian Collection, Southern Tang Two Masters' Lyrics, Analects, Book of Songs, Dream of the Red Chamber, the Four Books and Five Classics, and others.

Data Formats

  • Complete Tang Poetry: JSON
  • Complete Song Poetry: JSON
  • Complete Song Lyrics: CI format
  • Other Collections: Various formats
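
Since the Tang and Song poetry collections are distributed as JSON, a file can be read with Python's standard library alone. The sketch below is illustrative only: it assumes each file holds a JSON array of records with `author`, `title`, and `paragraphs` fields, which this page does not confirm, and it writes a hypothetical one-poem sample to a temporary file instead of pointing at a real path in the repository.

```python
import json
import tempfile
from pathlib import Path


def load_poems(path):
    """Read one JSON file and return its list of poem records."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"expected a JSON array in {path}")
    return records


# Hypothetical sample record; the field names are an assumption,
# so check them against the actual files before relying on them.
sample = [
    {
        "author": "李白",
        "title": "靜夜思",
        "paragraphs": ["床前明月光，疑是地上霜。", "舉頭望明月，低頭思故鄉。"],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(
        json.dumps(sample, ensure_ascii=False), encoding="utf-8"
    )
    poems = load_poems(sample_path)

for poem in poems:
    print(poem["author"], poem["title"])
    print("\n".join(poem["paragraphs"]))
```

Opening the file with an explicit `encoding="utf-8"` avoids platform-dependent defaults, which matters for Chinese text.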

Applications

The dataset can be used for education, research, cultural heritage preservation, and other beneficial purposes.

Analysis

  • High‑Frequency Word Analysis: Reports the most frequent words in Tang poetry, Song poetry, and Song lyrics.
  • Author Works Ranking: Ranks authors by number of works.
  • Ci‑Tune Statistics: Counts the most popular ci tunes of the Song period.
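
The three analyses listed above can be approximated with `collections.Counter`. This is a minimal sketch, not the project's own analysis code. It assumes poem records carry `author` and `paragraphs` fields and that lyric records store the tune name under a `rhythmic` field; all of these names are assumptions. It also counts single characters as a simple stand-in for words, because classical Chinese text has no whitespace to split on. The records are tiny hypothetical samples.

```python
from collections import Counter

# Punctuation and whitespace to skip when counting characters.
PUNCTUATION = set("，。？！、；：「」『』（）《》 \n")


def rank_authors(poems, top_n=10):
    """Return (author, number_of_works) pairs, most prolific first."""
    counts = Counter(p["author"] for p in poems if p.get("author"))
    return counts.most_common(top_n)


def char_frequencies(poems, top_n=10):
    """Return the most frequent characters across all poem lines."""
    counts = Counter()
    for poem in poems:
        for line in poem.get("paragraphs", []):
            counts.update(ch for ch in line if ch not in PUNCTUATION)
    return counts.most_common(top_n)


def count_tunes(lyrics, top_n=10):
    """Return (tune_name, number_of_lyrics) pairs, most used first."""
    counts = Counter(l["rhythmic"] for l in lyrics if l.get("rhythmic"))
    return counts.most_common(top_n)


# Hypothetical sample records, for illustration only.
sample_poems = [
    {"author": "李白", "paragraphs": ["明月明月"]},
    {"author": "李白", "paragraphs": ["月光"]},
    {"author": "杜甫", "paragraphs": ["春望"]},
]
sample_lyrics = [
    {"rhythmic": "浣溪沙"},
    {"rhythmic": "水調歌頭"},
    {"rhythmic": "浣溪沙"},
]

author_ranking = rank_authors(sample_poems)
top_chars = char_frequencies(sample_poems, top_n=2)
tune_ranking = count_tunes(sample_lyrics)

print(author_ranking)
print(top_chars)
print(tune_ranking)
```

For real word-level statistics, a segmentation step would be needed before counting; character counts are only a rough proxy.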

Contribution

Contributions are welcome via pull requests or issue discussions to improve and expand the database.

License

The dataset is released under the MIT License.
