
lilacai/lilac-wikitext-2-raw-v1

This dataset was generated by Lilac for a Hugging Face Space. The original source dataset is wikitext. The configuration records the namespace, name, source dataset name, and configuration name, along with the path and embedding method used for enrichment. The computed signals cover near‑duplicate detection, PII detection, language detection, text statistics, concept scoring (including sentiment, source code, and toxicity), and clustering.

Updated 12/7/2023
hugging_face

Description

Dataset Overview

Namespace and Name

  • Namespace: lilac
  • Name: wikitext-2-raw-v1

Data Source

  • Dataset Name: wikitext
  • Configuration Name: wikitext-2-raw-v1
  • Source Name: huggingface

Embedding and Signals

  • Embedding Path: text
  • Embedding Type: gte-small

Signal List

  • near_dup: Near‑duplicate detection
  • pii: Personally Identifiable Information detection
  • lang_detection: Language detection
  • text_statistics: Text statistics
  • concept_score: Concept scoring, including the following concepts:
    • legal-termination: Legal termination
    • negative-sentiment: Negative sentiment
    • non-english: Non‑English
    • positive-sentiment: Positive sentiment
    • profanity: Profanity
    • question: Question
    • source-code: Source code
    • toxicity: Toxicity
  • cluster_dbscan: DBSCAN clustering
  • cluster_hdbscan: HDBSCAN clustering
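
As a rough illustration of how the signal list above expands into individual enrichments, the sketch below enumerates one entry per signal, with concept_score split into one entry per concept. The signal and concept names are copied from this page; the `expand_signals` helper, the `concept_score:<concept>` label format, and the assumption that every signal runs over the `text` path are hypothetical and are not part of Lilac's API.

```python
# Signal names copied from the list above; the data layout is illustrative only.
SIGNALS = [
    "near_dup",
    "pii",
    "lang_detection",
    "text_statistics",
    "concept_score",
    "cluster_dbscan",
    "cluster_hdbscan",
]

# Concepts scored by the concept_score signal, as listed above.
CONCEPTS = [
    "legal-termination",
    "negative-sentiment",
    "non-english",
    "positive-sentiment",
    "profanity",
    "question",
    "source-code",
    "toxicity",
]


def expand_signals(signals, concepts, path="text"):
    """Return one (path, label) pair per enrichment applied to `path`.

    concept_score is expanded into one entry per concept. The
    "concept_score:<concept>" label format is a hypothetical convention.
    """
    jobs = []
    for name in signals:
        if name == "concept_score":
            jobs.extend((path, f"concept_score:{c}") for c in concepts)
        else:
            jobs.append((path, name))
    return jobs


if __name__ == "__main__":
    for path, label in expand_signals(SIGNALS, CONCEPTS):
        print(f"{path}: {label}")
```

Under these assumptions the listing yields 14 enrichments: six standalone signals plus eight concept scores.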

Settings

  • UI Media Path: text
  • Tag: machine-learning
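
The identity, source, embedding, and settings fields listed on this page can be gathered into a single record, which may be convenient when tracking several such datasets. The following is a minimal sketch using only the Python standard library; the class names (`Source`, `Embedding`, `DatasetEntry`) and the `identifier` helper are hypothetical and do not mirror Lilac's own configuration schema. Only the field values come from this page.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Source:
    source_name: str
    dataset_name: str
    config_name: str


@dataclass
class Embedding:
    path: str
    embedding: str


@dataclass
class DatasetEntry:
    """Hypothetical record of the metadata shown on this page."""

    namespace: str
    name: str
    source: Source
    embeddings: list = field(default_factory=list)
    media_paths: list = field(default_factory=list)
    tags: list = field(default_factory=list)

    def identifier(self) -> str:
        # Assumes a "namespace/name" convention for referring to the dataset.
        return f"{self.namespace}/{self.name}"


ENTRY = DatasetEntry(
    namespace="lilac",
    name="wikitext-2-raw-v1",
    source=Source(
        source_name="huggingface",
        dataset_name="wikitext",
        config_name="wikitext-2-raw-v1",
    ),
    embeddings=[Embedding(path="text", embedding="gte-small")],
    media_paths=["text"],
    tags=["machine-learning"],
)

if __name__ == "__main__":
    # asdict() recurses into nested dataclasses, so the result is JSON-serializable.
    print(ENTRY.identifier())
    print(json.dumps(asdict(ENTRY), indent=2))
```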

Topics

Natural Language Processing
Machine Learning

Source

Organization: hugging_face

Created: Unknown
