JUHE API Marketplace

AnanthZeke/oscar_tamil_clean

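The oscar_tamil_clean dataset contains Tamil text with two features: text (the raw text) and sent_token (a sequence of sentence tokens). The name suggests a cleaned Tamil subset of the OSCAR corpus, although the listing does not confirm this.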

Updated 4/5/2023
hugging_face

Description

Dataset Overview

Dataset Name

  • Name: oscar_tamil_clean

Dataset Features

  • Feature 1: text
    • Data Type: string
  • Feature 2: sent_token
    • Data Type: sequence of strings
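
The feature list above implies that each record is a mapping with a string field and a list-of-strings field. The following is a minimal sketch of a check for that shape, using only the standard library. The helper name is_valid_record and the sample record are hypothetical, and the assumption that sent_token holds the sentences of text is inferred from the field name rather than stated in the listing.

```python
from typing import Any


def is_valid_record(record: dict[str, Any]) -> bool:
    """Return True if the record matches the listed feature types.

    "text" must be a string and "sent_token" a list of strings.
    """
    text = record.get("text")
    sent_token = record.get("sent_token")
    return (
        isinstance(text, str)
        and isinstance(sent_token, list)
        and all(isinstance(s, str) for s in sent_token)
    )


# Hypothetical record, invented for illustration only (not taken from the dataset).
example = {
    "text": "வணக்கம். நன்றி.",
    "sent_token": ["வணக்கம்.", "நன்றி."],
}

print(is_valid_record(example))
```

A check like this can be useful when sampling a few records before committing to processing the full training split.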

Dataset Splits

  • Training Set:
    • Number of Samples: 1,263,180
    • Data Size: 19,533,337,624 bytes (about 19.5 GB)
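
Because the training split is close to 20 GB, it may be preferable to stream a few records rather than download everything. The sketch below assumes the dataset is still published on the Hugging Face Hub under the identifier shown on this page and that the Hugging Face datasets package is installed. The helper name peek_train is hypothetical.

```python
from itertools import islice

# Identifier as shown on this page; availability on the Hub is an assumption.
REPO_ID = "AnanthZeke/oscar_tamil_clean"


def peek_train(n: int = 3) -> list[dict]:
    """Return the first n training records without a full download."""
    # Imported lazily so this module loads even where `datasets` is absent.
    from datasets import load_dataset

    # streaming=True iterates over records via the network instead of
    # downloading the whole training split first.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling peek_train(3) would return up to three records as dictionaries, each expected to carry the text and sent_token fields listed above.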

Dataset Size

  • Download Size: 6,504,957,774 bytes (about 6.5 GB)
  • Total Data Size: 19,533,337,624 bytes (about 19.5 GB)
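
The byte counts above are easier to compare once converted. The following sketch derives binary-unit sizes, the ratio of total size to download size, and the average record size from the listed figures. Treating the gap between the two sizes as the effect of compressed download files is an assumption, since the listing does not say so.

```python
# Figures copied from the listing above.
NUM_SAMPLES = 1_263_180
TOTAL_BYTES = 19_533_337_624
DOWNLOAD_BYTES = 6_504_957_774

GIB = 1024 ** 3  # bytes per gibibyte

total_gib = TOTAL_BYTES / GIB
download_gib = DOWNLOAD_BYTES / GIB
# How much larger the prepared data is than the download.
expansion_ratio = TOTAL_BYTES / DOWNLOAD_BYTES
# Mean size of one training record in bytes.
avg_bytes_per_sample = TOTAL_BYTES / NUM_SAMPLES

print(f"total: {total_gib:.2f} GiB")
print(f"download: {download_gib:.2f} GiB")
print(f"expansion ratio: {expansion_ratio:.2f}x")
print(f"average record: {avg_bytes_per_sample:.0f} bytes")
```

The result is a download of roughly 6 GiB that occupies roughly 18 GiB once prepared, about a threefold expansion, with an average record of around 15 KB.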


Topics

Tamil
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
