Silesia Compression Corpus
The Silesia dataset is a collection of files with varying characteristics used to test compression algorithms.
Updated 5/18/2024
Description
Silesia Compression Corpus Overview
Dataset Description
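The Silesia Corpus is a benchmark set of 12 files, 211,938,580 bytes in total, used to compare compression algorithms. It mixes natural-language text, executables, databases, medical images, and archived source code and XML, so results reflect more than one kind of input.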
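As an illustration of how a corpus like this is typically used, here is a minimal sketch that measures compression ratios with three codecs from the Python standard library. The function names `compression_ratios` and `ratios_for_file` are hypothetical and are not part of the dataset; the choice of codecs and levels is an assumption made only for the example.

```python
import bz2
import lzma
import zlib
from pathlib import Path


def compression_ratios(data: bytes) -> dict:
    """Return original_size / compressed_size for three stdlib codecs."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    # Codec choice and levels are assumptions for this sketch.
    codecs = {
        "zlib": lambda b: zlib.compress(b, 9),
        "bz2": lambda b: bz2.compress(b, 9),
        "lzma": lambda b: lzma.compress(b),
    }
    return {name: len(data) / len(fn(data)) for name, fn in codecs.items()}


def ratios_for_file(path) -> dict:
    """Read one corpus file fully into memory and compute its ratios."""
    return compression_ratios(Path(path).read_bytes())
```

With the corpus unpacked into a local directory (the path is an assumption), `ratios_for_file("silesia/dickens")` would return one ratio per codec; a higher value means the file compressed better.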
File Details
| Size (bytes) | Filename | Description |
|---|---|---|
| 10,192,446 | dickens | English novel, ASCII plain text |
| 51,220,480 | mozilla | Program, UNIX executables and others, tar archive |
| 9,970,564 | mr | 3‑D MRI image, DICOM format |
| 33,553,445 | nci | Chemical database, text |
| 6,152,192 | ooffice | Windows DLL |
| 10,085,684 | osdb | Database, synthetic data, binary |
| 6,627,202 | reymont | Polish text, uncompressed PDF |
| 21,606,400 | samba | Source code and graphics, tar archive |
| 7,251,944 | sao | Database, star catalog, binary |
| 41,458,703 | webster | English dictionary, HTML format |
| 8,474,240 | x-ray | 16‑bit grayscale image, DICOM format |
| 5,345,280 | xml | XML files, text, tar archive |
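The sizes in the table can double as a quick integrity check after download. The sketch below assumes the 12 files sit unpacked in a single directory under the filenames listed above; `EXPECTED_SIZES` and `check_sizes` are hypothetical names introduced for this example.

```python
from pathlib import Path

# Sizes in bytes, copied from the File Details table.
EXPECTED_SIZES = {
    "dickens": 10_192_446,
    "mozilla": 51_220_480,
    "mr": 9_970_564,
    "nci": 33_553_445,
    "ooffice": 6_152_192,
    "osdb": 10_085_684,
    "reymont": 6_627_202,
    "samba": 21_606_400,
    "sao": 7_251_944,
    "webster": 41_458_703,
    "x-ray": 8_474_240,
    "xml": 5_345_280,
}


def check_sizes(corpus_dir) -> dict:
    """Return {filename: problem} for files that are missing or wrongly sized."""
    problems = {}
    for name, expected in EXPECTED_SIZES.items():
        path = Path(corpus_dir) / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        actual = path.stat().st_size
        if actual != expected:
            problems[name] = f"size {actual}, expected {expected}"
    return problems
```

An empty result from `check_sizes("silesia")` means every file is present with the listed size. A size match does not prove the contents are intact, so a checksum comparison is still worthwhile if the source publishes one.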
Topics
Compression Algorithms
Performance Testing
Source
Organization: github
Created: 9/2/2018