Dataset asset: Open Source Community, Machine Learning, Natural Language Processing

lilacai/lilac-wikitext-2-raw-v1

This dataset was generated by Lilac for a HuggingFace Space. The original source dataset is wikitext. The Lilac configuration specifies the namespace, the name, the source dataset and configuration name, the embedding path and embedding type, and the list of signals computed over the text. The signals cover near‑duplicate detection, PII detection, language detection, text statistics, concept scoring (for example sentiment, source code, and toxicity), and clustering.

Source
hugging_face
Created
Nov 28, 2025
Updated
Dec 7, 2023
Availability
Linked source ready

Dataset Overview

Namespace and Name

  • Namespace: lilac
  • Name: wikitext-2-raw-v1

Data Source

  • Dataset Name: wikitext
  • Configuration Name: wikitext-2-raw-v1
  • Source Name: huggingface

Embedding and Signals

  • Embedding Path: text
  • Embedding Type: gte-small

Signal List

  • near_dup: Near‑duplicate detection
  • pii: Personally Identifiable Information detection
  • lang_detection: Language detection
  • text_statistics: Text statistics
  • concept_score: Concept scoring, including the following concepts:
    • legal-termination: Legal termination
    • negative-sentiment: Negative sentiment
    • non-english: Non‑English
    • positive-sentiment: Positive sentiment
    • profanity: Profanity
    • question: Question
    • source-code: Source code
    • toxicity: Toxicity
  • cluster_dbscan: DBSCAN clustering
  • cluster_hdbscan: HDBSCAN clustering
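
To give a sense of what a signal such as near_dup computes, the sketch below flags pairs of texts whose word-shingle sets overlap heavily, measured by Jaccard similarity. It is a generic illustration and not Lilac's implementation; the function names, the shingle size of 3, and the 0.8 threshold are assumptions chosen for the example.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty texts yield no shingles.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(texts, threshold=0.8, k=3):
    """Return index pairs (i, j), i < j, whose shingle similarity >= threshold."""
    sets = [shingles(t, k) for t in texts]
    return [
        (i, j)
        for i, j in combinations(range(len(texts)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```

The pairwise comparison is quadratic in the number of texts, so it only suits small samples; larger collections are usually handled with approximate methods such as MinHash.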

Settings

  • UI Media Path: text
  • Tag: machine-learning
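
The configuration described above can be written down as a plain Python dictionary, which is convenient for checking which signals were requested. This is a minimal sketch: the key names (`source`, `embeddings`, `signals`, `settings`, and their children) and the helper `signal_names` are illustrative assumptions that mirror the headings on this page, not Lilac's actual configuration schema.

```python
# Illustrative only: key names mirror the headings on this page and are
# assumptions, not Lilac's real configuration schema.
CONCEPTS = [
    "legal-termination",
    "negative-sentiment",
    "non-english",
    "positive-sentiment",
    "profanity",
    "question",
    "source-code",
    "toxicity",
]

CONFIG = {
    "namespace": "lilac",
    "name": "wikitext-2-raw-v1",
    "source": {
        "dataset_name": "wikitext",
        "config_name": "wikitext-2-raw-v1",
        "source_name": "huggingface",
    },
    "embeddings": [{"path": "text", "embedding": "gte-small"}],
    "signals": [
        {"signal_name": "near_dup"},
        {"signal_name": "pii"},
        {"signal_name": "lang_detection"},
        {"signal_name": "text_statistics"},
        # One concept_score entry per concept listed on this page.
        *({"signal_name": "concept_score", "concept_name": c} for c in CONCEPTS),
        {"signal_name": "cluster_dbscan"},
        {"signal_name": "cluster_hdbscan"},
    ],
    "settings": {"ui_media_path": "text", "tag": "machine-learning"},
}


def signal_names(config):
    """Return the distinct signal names in the order they first appear."""
    seen = []
    for entry in config["signals"]:
        if entry["signal_name"] not in seen:
            seen.append(entry["signal_name"])
    return seen


if __name__ == "__main__":
    print(signal_names(CONFIG))
```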