Silesia Compression Corpus

The Silesia dataset is a collection of files with varying characteristics used to test compression algorithms.

Source

github

Created

Sep 2, 2018

Updated

May 18, 2024

Overview

Dataset Description

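The Silesia Corpus contains 12 files totaling 211,938,580 bytes. The files cover English and Polish text, executables, source code, databases, XML and HTML, and medical images, so a compression algorithm can be tested against data with diverse characteristics.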

File Details

Size (bytes)	Filename	Description
10,192,446	dickens	English novel, ASCII plain text
51,220,480	mozilla	Program, UNIX executables and others, tar archive
9,970,564	mr	3‑D MRI image, DICOM format
33,553,445	nci	Chemical database, text
6,152,192	ooffice	Windows DLL
10,085,684	osdb	Database, synthetic data, binary
6,627,202	reymont	Polish text, uncompressed PDF
21,606,400	samba	Source code and graphics, tar archive
7,251,944	sao	Database, star catalog, binary
41,458,703	webster	English dictionary, HTML format
8,474,240	x-ray	16‑bit grayscale image, DICOM format
5,345,280	xml	XML file, text, tar archive
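
The sizes in the table can be used as a sanity check before benchmarking. The following is a minimal Python sketch, not part of the corpus itself. It assumes the 12 files have already been extracted into a local directory (the path `silesia` is hypothetical), and it uses the standard-library `zlib`, `bz2`, and `lzma` codecs purely as example compressors. The helper names `check_sizes`, `compression_ratio`, and `benchmark` are illustrative.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# File sizes in bytes, copied from the table above.
EXPECTED_SIZES = {
    "dickens": 10_192_446,
    "mozilla": 51_220_480,
    "mr": 9_970_564,
    "nci": 33_553_445,
    "ooffice": 6_152_192,
    "osdb": 10_085_684,
    "reymont": 6_627_202,
    "samba": 21_606_400,
    "sao": 7_251_944,
    "webster": 41_458_703,
    "x-ray": 8_474_240,
    "xml": 5_345_280,
}

# Example codecs only (an assumption of this sketch); swap in the
# compressor you actually want to test.
CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def check_sizes(corpus_dir):
    """Return {filename: problem} for files that are missing or have an unexpected size."""
    problems = {}
    for name, expected in EXPECTED_SIZES.items():
        path = Path(corpus_dir) / name
        if not path.is_file():
            problems[name] = "missing"
        elif path.stat().st_size != expected:
            problems[name] = f"size {path.stat().st_size}, expected {expected}"
    return problems


def compression_ratio(data, compress):
    """Original size divided by compressed size (higher means better compression)."""
    return len(data) / len(compress(data))


def benchmark(corpus_dir):
    """Yield (filename, codec, ratio) for every corpus file found in corpus_dir."""
    for name in EXPECTED_SIZES:
        path = Path(corpus_dir) / name
        if not path.is_file():
            continue
        data = path.read_bytes()  # the largest file is about 51 MB
        for codec, compress in CODECS.items():
            yield name, codec, compression_ratio(data, compress)


if __name__ == "__main__":
    corpus_dir = "silesia"  # hypothetical location of the extracted files
    for name, problem in check_sizes(corpus_dir).items():
        print(f"warning: {name}: {problem}")
    for name, codec, ratio in benchmark(corpus_dir):
        print(f"{name:10s} {codec:5s} {ratio:6.2f}")
```

Reporting the ratio per file rather than a single aggregate keeps the differences between data types visible, which is the reason the corpus mixes text, binaries, and images.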
