Dataset asset: Open Source Community, Compression Algorithms, Performance Testing
Silesia Compression Corpus
The Silesia dataset is a collection of files with varying characteristics used to test compression algorithms.
Source: GitHub
Created: Sep 2, 2018
Updated: May 18, 2024
Silesia Compression Corpus Overview
Dataset Description
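The Silesia Corpus is a benchmark dataset for compression algorithms. It contains 12 files totaling 211,938,580 bytes, spanning plain text, executables, source-code archives, medical images, and both binary and text databases.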
File Details
| Size (bytes) | Filename | Description |
|---|---|---|
| 10,192,446 | dickens | English novel, ASCII plain text |
| 51,220,480 | mozilla | Program, UNIX executables and others, tar archive |
| 9,970,564 | mr | 3‑D MRI image, DICOM format |
| 33,553,445 | nci | Chemical database, text |
| 6,152,192 | ooffice | Windows DLL |
| 10,085,684 | osdb | Database, synthetic data, binary |
| 6,627,202 | reymont | Polish text, uncompressed PDF |
| 21,606,400 | samba | Source code and graphics, tar archive |
| 7,251,944 | sao | Database, star catalog, binary |
| 41,458,703 | webster | English dictionary, HTML format |
| 8,474,240 | x-ray | 16‑bit grayscale image, DICOM format |
| 5,345,280 | xml | XML files, text, tar archive |
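
As an illustration of how the files above might be used, the following is a minimal sketch that computes compression ratios (original size divided by compressed size) for each file with Python's standard-library codecs. It assumes the corpus has already been downloaded and unpacked into a local directory. The helper names `compression_ratios` and `benchmark_corpus` are hypothetical and not part of the dataset, and the choice of `zlib`, `bz2`, and `lzma` is only an example set of compressors.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Example codecs from the standard library; each maps raw bytes to compressed bytes.
# This selection is an assumption for illustration, not something the corpus prescribes.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compression_ratios(data: bytes) -> dict:
    """Return original_size / compressed_size for each codec."""
    return {name: len(data) / len(fn(data)) for name, fn in CODECS.items()}


def benchmark_corpus(corpus_dir: str) -> dict:
    """Map each regular file in corpus_dir to its per-codec compression ratios."""
    results = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file():
            # Reads the whole file into memory; the largest corpus file is about 51 MB.
            results[path.name] = compression_ratios(path.read_bytes())
    return results
```

Calling `benchmark_corpus("silesia")`, where `silesia` is a hypothetical local directory holding the twelve files, would return a dictionary keyed by filename. Higher ratios mean better compression. Timing is not measured here and would need to be added for a speed comparison.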