Tags: Open Source Community, Compression Algorithms, Performance Testing

Silesia Compression Corpus

The Silesia dataset is a collection of files with varying characteristics used to test compression algorithms.

Source: GitHub
Created: Sep 2, 2018
Updated: May 18, 2024

Silesia Compression Corpus Overview

Dataset Description

The Silesia Corpus is a set of 12 files, about 212 MB in total, for testing compression algorithms. The files cover text, executables, source code, databases, and medical images.

File Details

Size (bytes)  Filename  Description
10,192,446    dickens   English novel, ASCII plain text
51,220,480    mozilla   Program, UNIX executables and others, tar archive
 9,970,564    mr        3-D MRI image, DICOM format
33,553,445    nci       Chemical database, text
 6,152,192    ooffice   Windows DLL
10,085,684    osdb      Database, synthetic data, binary
 6,627,202    reymont   Polish text, uncompressed PDF
21,606,400    samba     Source code and graphics, tar archive
 7,251,944    sao       Database, star catalog, binary
41,458,703    webster   English dictionary, HTML format
 8,474,240    x-ray     16-bit grayscale image, DICOM format
 5,345,280    xml       XML file, text, tar archive
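
As a rough illustration of how the corpus might be used, the sketch below checks a local copy against the sizes listed above and reports compression ratios per file. It assumes the 12 files have been extracted, under the filenames shown, into a single directory. The three codecs are Python standard-library modules picked only as examples, and the helper names (`check_corpus`, `compression_ratios`, `benchmark`) are hypothetical and not part of the corpus or of any benchmark tool.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# File sizes in bytes, copied from the File Details table.
EXPECTED_SIZES = {
    "dickens": 10_192_446,
    "mozilla": 51_220_480,
    "mr": 9_970_564,
    "nci": 33_553_445,
    "ooffice": 6_152_192,
    "osdb": 10_085_684,
    "reymont": 6_627_202,
    "samba": 21_606_400,
    "sao": 7_251_944,
    "webster": 41_458_703,
    "x-ray": 8_474_240,
    "xml": 5_345_280,
}

# Example codecs only (assumption): swap in whatever algorithm is under test.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 6),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def check_corpus(corpus_dir: Path) -> list[str]:
    """Return the names of files that are missing or differ in size from the table."""
    problems = []
    for name, expected in EXPECTED_SIZES.items():
        path = corpus_dir / name
        if not path.is_file() or path.stat().st_size != expected:
            problems.append(name)
    return problems


def compression_ratios(data: bytes) -> dict[str, float]:
    """Return original size divided by compressed size for each example codec."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return {name: len(data) / len(fn(data)) for name, fn in CODECS.items()}


def benchmark(corpus_dir: Path) -> None:
    """Print one line of compression ratios per corpus file found in corpus_dir."""
    for name in EXPECTED_SIZES:
        path = corpus_dir / name
        if not path.is_file():
            print(f"{name:<8} missing")
            continue
        ratios = compression_ratios(path.read_bytes())
        cells = "  ".join(f"{codec}={ratio:.2f}" for codec, ratio in ratios.items())
        print(f"{name:<8} {cells}")
```

Calling `benchmark(Path("silesia"))`, where `silesia` stands in for wherever the files were extracted, prints one line per file. Each file is read fully into memory, and the stronger codecs can take a while on the larger files such as `mozilla`.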