
Silesia Compression Corpus

The Silesia dataset is a collection of files with varying characteristics used to test compression algorithms.

Updated 5/18/2024
github

Description

Silesia Compression Corpus Overview

Dataset Description

The Silesia Corpus is a benchmark dataset for testing compression algorithms. It contains 12 files totaling 211,938,580 bytes, covering plain text, executables, databases, images, source code, and markup.
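As a rough illustration of how a corpus like this is typically used, the sketch below computes a compression ratio (original size divided by compressed size) per file. It is a minimal example under stated assumptions: the codecs are Python standard-library modules standing in for whatever algorithm is under test, and the `silesia` directory name and the helper names `compression_ratios` and `benchmark_directory` are hypothetical, not part of the dataset.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Standard-library codecs, used here only as stand-ins for the algorithm under test.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(data: bytes) -> dict[str, float]:
    """Return original_size / compressed_size for each codec in CODECS."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return {name: len(data) / len(fn(data)) for name, fn in CODECS.items()}


def benchmark_directory(corpus_dir: Path) -> dict[str, dict[str, float]]:
    """Compute ratios for every non-empty regular file in corpus_dir."""
    return {
        path.name: compression_ratios(path.read_bytes())
        for path in sorted(corpus_dir.iterdir())
        if path.is_file() and path.stat().st_size > 0
    }


if __name__ == "__main__":
    corpus = Path("silesia")  # hypothetical location of the extracted corpus
    if corpus.is_dir():
        for filename, ratios in benchmark_directory(corpus).items():
            line = "  ".join(f"{codec}={ratio:.2f}" for codec, ratio in ratios.items())
            print(f"{filename:10s} {line}")
```

Each file is read fully into memory, which is workable here because the largest file is about 51 MB. A ratio above 1.0 means the codec reduced the file's size.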

File Details

Size (bytes)  Filename  Description
10,192,446    dickens   English novel, ASCII plain text
51,220,480    mozilla   Program, UNIX executables and others, tar archive
9,970,564     mr        3-D MRI image, DICOM format
33,553,445    nci       Chemical database, text
6,152,192     ooffice   Windows DLL
10,085,684    osdb      Database, synthetic data, binary
6,627,202     reymont   Polish text, uncompressed PDF
21,606,400    samba     Source code and graphics, tar archive
7,251,944     sao       Database, star catalog, binary
41,458,703    webster   English dictionary, HTML format
8,474,240     x-ray     16-bit grayscale image, DICOM format
5,345,280     xml       XML file, text, tar archive
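The sizes listed above can serve as a quick integrity check on a downloaded copy. The following is a minimal sketch, assuming the twelve files have been extracted into one directory under the filenames listed; the helper name `check_sizes` is hypothetical. It checks sizes only and does not verify file contents.

```python
from pathlib import Path

# Expected sizes in bytes, copied from the file listing above.
EXPECTED_SIZES = {
    "dickens": 10_192_446,
    "mozilla": 51_220_480,
    "mr": 9_970_564,
    "nci": 33_553_445,
    "ooffice": 6_152_192,
    "osdb": 10_085_684,
    "reymont": 6_627_202,
    "samba": 21_606_400,
    "sao": 7_251_944,
    "webster": 41_458_703,
    "x-ray": 8_474_240,
    "xml": 5_345_280,
}


def check_sizes(corpus_dir: Path) -> dict[str, str]:
    """Map each expected filename to 'ok', 'missing', or a size-mismatch message."""
    report = {}
    for name, expected in EXPECTED_SIZES.items():
        path = corpus_dir / name
        if not path.is_file():
            report[name] = "missing"
            continue
        actual = path.stat().st_size
        if actual == expected:
            report[name] = "ok"
        else:
            report[name] = f"size mismatch: got {actual}, expected {expected}"
    return report
```

Calling `check_sizes(Path("silesia"))`, where `silesia` is a placeholder directory name, returns one status string per file, so missing or truncated downloads can be spotted before running a benchmark.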


Topics

Compression Algorithms
Performance Testing

Source

Organization: github

Created: 9/2/2018
