lilacai/lilac-wikitext-2-raw-v1
This dataset was generated by Lilac for a HuggingFace Space. The original source dataset is wikitext. The configuration records the namespace, name, source dataset name, and configuration name, as well as the embedding path and embedding method. The computed signals cover near‑duplicate detection, PII detection, language detection, text statistics, concept scoring (for concepts such as sentiment, source code, and toxicity), and DBSCAN and HDBSCAN clustering.
Description
Dataset Overview
Namespace and Name
- Namespace: lilac
- Name: wikitext-2-raw-v1
Data Source
- Dataset Name: wikitext
- Configuration Name: wikitext-2-raw-v1
- Source Name: huggingface
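The source fields above map naturally onto the Hugging Face `datasets` loader. The following is a minimal sketch, assuming the `datasets` package is installed and that the `wikitext` identifier still resolves on the Hub (newer library versions may expect a namespaced repository id instead). The names `SOURCE` and `load_source` are hypothetical helpers, not part of Lilac.

```python
# Source fields copied from the list above.
SOURCE = {"path": "wikitext", "name": "wikitext-2-raw-v1"}


def load_source(split="train"):
    """Download the source split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(SOURCE["path"], SOURCE["name"], split=split)


if __name__ == "__main__":
    dataset = load_source()
    print(dataset)
```

This loads only the original wikitext rows. The Lilac-generated signal and embedding columns described below are not part of that source dataset.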
Embedding and Signals
- Embedding Path: text
- Embedding Type: gte-small
Signal List
- near_dup: Near‑duplicate detection
- pii: Personally Identifiable Information detection
- lang_detection: Language detection
- text_statistics: Text statistics
- concept_score: Concept scoring, including the following concepts:
  - legal-termination: Legal termination
  - negative-sentiment: Negative sentiment
  - non-english: Non‑English
  - positive-sentiment: Positive sentiment
  - profanity: Profanity
  - question: Question
  - source-code: Source code
  - toxicity: Toxicity
- cluster_dbscan: DBSCAN clustering
- cluster_hdbscan: HDBSCAN clustering
Settings
- UI Media Path: text
- Tag: machine-learning
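For reference, the fields listed above can be gathered into a single structure. The sketch below is illustrative only: it mirrors this card with plain Python dictionaries, and the key names and the `signal_labels` helper are assumptions rather than Lilac's actual configuration schema or API.

```python
# Concepts scored by the concept_score signal, as listed on this card.
CONCEPTS = [
    "legal-termination",
    "negative-sentiment",
    "non-english",
    "positive-sentiment",
    "profanity",
    "question",
    "source-code",
    "toxicity",
]

# Plain-dict mirror of the card; key names are hypothetical.
CONFIG = {
    "namespace": "lilac",
    "name": "wikitext-2-raw-v1",
    "source": {
        "dataset_name": "wikitext",
        "config_name": "wikitext-2-raw-v1",
        "source_name": "huggingface",
    },
    "embeddings": [{"path": "text", "embedding": "gte-small"}],
    "signals": (
        [{"signal": s} for s in ("near_dup", "pii", "lang_detection", "text_statistics")]
        + [{"signal": "concept_score", "concept": c} for c in CONCEPTS]
        + [{"signal": s} for s in ("cluster_dbscan", "cluster_hdbscan")]
    ),
    "settings": {"ui_media_path": "text", "tag": "machine-learning"},
}


def signal_labels(config):
    """Return one label per signal entry, e.g. 'pii' or 'concept_score/toxicity'."""
    labels = []
    for entry in config["signals"]:
        if "concept" in entry:
            labels.append(f"{entry['signal']}/{entry['concept']}")
        else:
            labels.append(entry["signal"])
    return labels


if __name__ == "__main__":
    for label in signal_labels(CONFIG):
        print(label)
```

Expanding `concept_score` into one entry per concept keeps every computed signal addressable by a single label, which is convenient when checking a dataset for missing columns.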
Source
Organization: hugging_face
Created: Unknown