Dataset asset · Open Source Community · Blockchain Security · Smart Contracts

Zellic/smart-contract-fiesta

The Zellic 2023 Smart Contract Source Index is a publicly available collection of Ethereum mainnet smart contract source code, intended as an easily downloadable resource for smart contract security research. It includes an index of addresses and bytecode hashes for all contracts deployed up to block 16860349, along with source code gathered from public sources. Source code is de-duplicated by bytecode hash and stored in organized contract directories with accompanying metadata.

Source: hugging_face
Created: Nov 28, 2025
Updated: Apr 23, 2023

Dataset Overview

Dataset Name

  • Name: Zellic 2023 Smart Contract Source Index
  • Alias: Zellic Smart Contract Source Index

Dataset Description

  • Purpose: Provide a publicly downloadable Ethereum mainnet smart contract source code dataset to advance smart contract security research.
  • Applications: Static analysis, machine learning, and similar research uses.

Dataset Content

  • Methodology:
    • Collect all contract addresses deployed on the Ethereum mainnet and their EVM bytecode Keccak256 hashes.
    • Build the index by fully syncing from the genesis block using a modified Geth instance.
    • De‑duplicate source code based on bytecode hash.
  • Statistics:
    • Unique Source Codes: 149,386
    • Contracts with Code: 3,897,319
    • Total Smart Contracts in Global Index: 30,586,657
    • Character Count: 6,473,548,073
    • Word Count: 712,444,206
    • Lines of Code: 90,562,628
    • Comment Lines: 62,503,873
    • Blank Lines: 24,485,549
    • Total Lines: 177,552,050
    • Unique Words: 939,288
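
The de-duplication step in the methodology can be illustrated with a minimal sketch: contracts that share a bytecode hash are treated as one unique source. The function name and the toy address/hash pairs below are hypothetical, and the hashes are assumed to be precomputed (Python's standard library has no Keccak256; `hashlib.sha3_256` is a different function). The second half simply cross-checks the published statistics against each other.

```python
from collections import defaultdict


def group_by_bytecode_hash(pairs):
    """Group contract addresses by bytecode hash (hypothetical helper)."""
    groups = defaultdict(list)
    for address, bytecode_hash in pairs:
        # Normalize case so the same hash is never counted twice.
        groups[bytecode_hash.lower()].append(address.lower())
    return dict(groups)


# Toy entries, invented for illustration only (not real dataset values).
pairs = [
    ("0xAAA1", "0x11"),
    ("0xAAA2", "0x11"),  # same bytecode hash as 0xAAA1: a duplicate
    ("0xAAA3", "0x22"),
]
groups = group_by_bytecode_hash(pairs)
unique_hashes = len(groups)  # one source per distinct hash

# Cross-check of the published figures listed above.
code_lines, comment_lines, blank_lines = 90_562_628, 62_503_873, 24_485_549
total_lines = 177_552_050
lines_consistent = code_lines + comment_lines + blank_lines == total_lines

# Average number of deployed contracts per unique source.
contracts_per_source = 3_897_319 / 149_386

print(unique_hashes, lines_consistent, round(contracts_per_source, 1))
```

On the published figures, the three line categories sum to the stated total, and each unique source corresponds to roughly 26 deployed contracts on average.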

Dataset Structure

  • Index:
    • Filename: address_bytecodehash_index
    • Content: Mapping of all deployed contract addresses to the Keccak256 hash of their EVM bytecode.
  • Contract Source Code:
    • Storage Location: Under the organized_contracts directory, organized by bytecode hash.
    • Contents: Source files plus a metadata.json file recording the compiler version, optimization settings, and similar settings.
    • Source Formats: Single‑file, multi‑file, Solidity compiler JSON input.
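
As a rough sketch of how the index and the contract directories fit together, the snippet below maps an index entry to a directory path. Both the index line format (address and hash separated by a colon, comma, or whitespace) and the two-character prefix subdirectory are assumptions made for illustration; check the actual files before relying on either. The helper names are hypothetical.

```python
import re
from pathlib import Path


def strip_0x(value):
    """Lowercase a hex string and drop an optional 0x prefix."""
    value = value.lower()
    return value[2:] if value.startswith("0x") else value


def parse_index_line(line):
    """Split one index line into (address, bytecode_hash).

    Assumption: two hex fields separated by ':', ',' or whitespace.
    """
    parts = [p for p in re.split(r"[\s:,]+", line.strip()) if p]
    if len(parts) != 2:
        raise ValueError(f"unexpected index line: {line!r}")
    return strip_0x(parts[0]), strip_0x(parts[1])


def contract_dir(root, bytecode_hash):
    """Build the directory path for a bytecode hash.

    Assumption: directories are sharded by the first two hex characters.
    """
    return Path(root) / "organized_contracts" / bytecode_hash[:2] / bytecode_hash


address, bytecode_hash = parse_index_line("0xABCD:0x12ef")
print(address, contract_dir("dataset", bytecode_hash).as_posix())
```

If the real layout differs, only `parse_index_line` and `contract_dir` need adjusting; the lookup from address to hash to directory stays the same.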

Additional Information

  • Contract Languages: Not limited to Solidity; includes Vyper and other languages.
  • Source Extraction: A Bash script is provided to extract all source code.
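
The provided Bash script is not reproduced here. As a stand-in, the following Python sketch shows one way the three source formats could be flattened into a single mapping of file name to source text. The `sources` / `content` keys follow the Solidity compiler's standard JSON input format; the file suffixes and the assumption that any non-metadata `.json` file holds compiler input are guesses, and the function names are hypothetical.

```python
import json
from pathlib import Path

# Assumption: Solidity and Vyper sources use these suffixes in the dataset.
SOURCE_SUFFIXES = {".sol", ".vy"}


def sources_from_standard_json(text):
    """Extract {path: source} from Solidity standard JSON input text."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        return {}
    sources = doc.get("sources", {})
    if not isinstance(sources, dict):
        return {}
    # Entries may reference external URLs instead of inline content; skip those.
    return {
        path: entry["content"]
        for path, entry in sources.items()
        if isinstance(entry, dict) and "content" in entry
    }


def collect_sources(contract_dir):
    """Gather all source text found under one contract directory."""
    collected = {}
    for path in sorted(Path(contract_dir).rglob("*")):
        if not path.is_file() or path.name == "metadata.json":
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if path.suffix in SOURCE_SUFFIXES:
            # Single-file and multi-file formats: take the file as-is.
            collected[path.relative_to(contract_dir).as_posix()] = text
        elif path.suffix == ".json":
            # Compiler JSON input format: unpack the embedded sources.
            try:
                collected.update(sources_from_standard_json(text))
            except json.JSONDecodeError:
                pass  # not valid JSON; ignore in this sketch
    return collected
```

Calling `collect_sources` on one bytecode-hash directory returns every source file it can find, regardless of which of the three formats the contract was stored in.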