Dataset asset, Open Source Community, Machine Learning, Natural Language Processing
lilacai/lilac-wikitext-2-raw-v1
This dataset was generated by Lilac for a Hugging Face Space. The original source dataset is wikitext. The configuration records the namespace, name, source dataset name, and configuration name, along with the embedding path and embedding type. The computed signals cover near-duplicate detection, PII detection, language detection, text statistics, concept scoring (including sentiment, source code, and toxicity), and clustering.
Source: hugging_face
Created: Nov 28, 2025
Updated: Dec 7, 2023
Signals: 266 views
Availability: Linked source ready
Dataset Overview
Namespace and Name
- Namespace: lilac
- Name: wikitext-2-raw-v1
Data Source
- Dataset Name: wikitext
- Configuration Name: wikitext-2-raw-v1
- Source Name: huggingface
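The source fields above map naturally onto the arguments of the Hugging Face `datasets` loader. The following is a minimal sketch, assuming the third-party `datasets` package is installed and network access is available; the helper name `load_source` and the `SOURCE` dictionary are hypothetical and only illustrate how the dataset name and configuration name could be passed along. It is not taken from the Lilac pipeline that produced this card.

```python
# Values copied from the "Data Source" fields on this card.
SOURCE = {
    "path": "wikitext",            # Dataset Name
    "name": "wikitext-2-raw-v1",   # Configuration Name
}


def load_source(split: str = "train"):
    """Load one split of the source dataset (hypothetical helper).

    The import is deferred so that this module can be imported
    even when the optional `datasets` package is not installed.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(SOURCE["path"], SOURCE["name"], split=split)
```

Calling `load_source("validation")` would download the corresponding split on first use; the choice of split is left to the caller because the card does not say which splits were ingested.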
Embedding and Signals
- Embedding Path: text
- Embedding Type: gte-small
Signal List
- near_dup: Near‑duplicate detection
- pii: Personally Identifiable Information detection
- lang_detection: Language detection
- text_statistics: Text statistics
- concept_score: Concept scoring, including the following concepts:
  - legal-termination: Legal termination
  - negative-sentiment: Negative sentiment
  - non-english: Non‑English
  - positive-sentiment: Positive sentiment
  - profanity: Profanity
  - question: Question
  - source-code: Source code
  - toxicity: Toxicity
- cluster_dbscan: DBSCAN clustering
- cluster_hdbscan: HDBSCAN clustering
Settings
- UI Media Path: text
- Tag: machine-learning
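The overview above can be captured as a small typed structure, which makes it easier to check the fields programmatically. This is a minimal sketch using only the Python standard library; the class names `SourceConfig` and `CardConfig` and the method `signal_labels` are hypothetical and do not reflect Lilac's own configuration classes. All values are copied from the card.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    source_name: str
    dataset_name: str
    config_name: str


@dataclass(frozen=True)
class CardConfig:
    namespace: str
    name: str
    source: SourceConfig
    embedding_path: str
    embedding_type: str
    signals: tuple[str, ...]
    concepts: tuple[str, ...]
    ui_media_path: str
    tag: str

    def signal_labels(self) -> list[str]:
        """Flatten signals, expanding concept_score into one label per concept."""
        labels: list[str] = []
        for signal in self.signals:
            if signal == "concept_score":
                labels.extend(f"concept_score/{c}" for c in self.concepts)
            else:
                labels.append(signal)
        return labels


CARD = CardConfig(
    namespace="lilac",
    name="wikitext-2-raw-v1",
    source=SourceConfig("huggingface", "wikitext", "wikitext-2-raw-v1"),
    embedding_path="text",
    embedding_type="gte-small",
    signals=(
        "near_dup", "pii", "lang_detection", "text_statistics",
        "concept_score", "cluster_dbscan", "cluster_hdbscan",
    ),
    concepts=(
        "legal-termination", "negative-sentiment", "non-english",
        "positive-sentiment", "profanity", "question",
        "source-code", "toxicity",
    ),
    ui_media_path="text",
    tag="machine-learning",
)

if __name__ == "__main__":
    for label in CARD.signal_labels():
        print(label)
```

Keeping the concepts separate from the top-level signals mirrors the nesting in the signal list, and joining the namespace and name with a hyphen reproduces the repository name `lilac-wikitext-2-raw-v1` shown in the title.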