Dataset asset, Open Source Community, Machine Learning, Natural Language Processing

lilacai/lilac-wikitext-2-raw-v1

This dataset was generated by Lilac for a Hugging Face Space. The original source dataset is wikitext. The Lilac configuration records the namespace, name, source dataset name, and configuration name, as well as the path used for signal processing and the embedding method. The computed signals cover near-duplicate detection, PII detection, language detection, text statistics, DBSCAN and HDBSCAN clustering, and concept scores such as sentiment, source code, and toxicity.

Source

hugging_face

Created

Nov 28, 2025

Updated

Dec 7, 2023

Dataset Overview

Namespace and Name

Namespace: lilac
Name: wikitext-2-raw-v1

Data Source

Dataset Name: wikitext
Configuration Name: wikitext-2-raw-v1
Source Name: huggingface

Embedding and Signals

Embedding Path: text
Embedding Type: gte-small

Signal List

near_dup: Near‑duplicate detection
pii: Personally Identifiable Information detection
lang_detection: Language detection
text_statistics: Text statistics
concept_score: Concept scoring, including the following concepts:
- legal-termination: Legal termination
- negative-sentiment: Negative sentiment
- non-english: Non‑English
- positive-sentiment: Positive sentiment
- profanity: Profanity
- question: Question
- source-code: Source code
- toxicity: Toxicity
cluster_dbscan: DBSCAN clustering
cluster_hdbscan: HDBSCAN clustering

Settings

UI Media Path: text
Tag: machine-learning
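
The fields listed above can be gathered into a single structure for programmatic inspection. The following is a minimal sketch in plain Python, not Lilac's own API: the dictionary key names (such as "signal_name", "concept_name", and "ui_media_paths") and the helper signal_labels are hypothetical and chosen only for illustration, while the values are taken from the listing above.

```python
# Hypothetical plain-Python mirror of the configuration listed above.
# Key names are illustrative assumptions, not a confirmed Lilac schema.
CONCEPTS = [
    "legal-termination",
    "negative-sentiment",
    "non-english",
    "positive-sentiment",
    "profanity",
    "question",
    "source-code",
    "toxicity",
]

config = {
    "namespace": "lilac",
    "name": "wikitext-2-raw-v1",
    "source": {
        "source_name": "huggingface",
        "dataset_name": "wikitext",
        "config_name": "wikitext-2-raw-v1",
    },
    "embeddings": [{"path": "text", "embedding": "gte-small"}],
    "signals": (
        [{"signal_name": n}
         for n in ("near_dup", "pii", "lang_detection", "text_statistics")]
        # concept_score appears once per concept in the signal list.
        + [{"signal_name": "concept_score", "concept_name": c} for c in CONCEPTS]
        + [{"signal_name": n} for n in ("cluster_dbscan", "cluster_hdbscan")]
    ),
    "settings": {"ui_media_paths": ["text"], "tags": ["machine-learning"]},
}


def signal_labels(cfg):
    """Return one readable label per signal entry, e.g. 'concept_score/toxicity'."""
    labels = []
    for entry in cfg["signals"]:
        name = entry["signal_name"]
        concept = entry.get("concept_name")
        labels.append(f"{name}/{concept}" if concept else name)
    return labels


if __name__ == "__main__":
    for label in signal_labels(config):
        print(label)
```

Flattening the concept scores into labels of the form concept_score/<concept> makes it easy to check which concepts were computed without walking the nested structure by hand.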
