
mteb/stsb_multi_mt

STSb Multi MT is a multilingual semantic textual similarity benchmark. It contains sentence pairs with similarity scores in German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian, and Chinese. It was built by translating the STS-benchmark dataset with deepl.com, and it can be used to train sentence-embedding models such as T-Systems-onsite/cross-en-de-roberta-sentence-transformer. Each language provides a training set (5,749 pairs), a development set (1,500 pairs), and a test set (1,379 pairs).

Updated 5/4/2025

Description

Dataset Overview

Dataset Name

  • Name: STSb Multi MT

Languages

  • Supported Languages: de, en, es, fr, it, nl, pl, pt, ru, zh

License

  • License Type: other

Size

  • Size Range: 10K < n < 100K

Task Category

  • Task Category: text-classification

Specific Tasks

  • Task IDs: text-scoring, semantic-similarity-scoring

Dataset Structure

  • Data File Configuration:
    • Default Configuration:
      • Training Set: train/*.parquet
      • Validation Set: dev/*.parquet
      • Test Set: test/*.parquet
    • Language‑Specific Configurations:
      • German: de.parquet (train, dev, test)
      • French: fr.parquet (train, dev, test)
      • Russian: ru.parquet (train, dev, test)
      • Chinese: zh.parquet (train, dev, test)
      • Spanish: es.parquet (train, dev, test)
      • Italian: it.parquet (train, dev, test)
      • English: en.parquet (train, dev, test)
      • Portuguese: pt.parquet (train, dev, test)
      • Dutch: nl.parquet (train, dev, test)
      • Polish: pl.parquet (train, dev, test)

Data Fields

  • Fields:
    • sentence1: first sentence text
    • sentence2: second sentence text
    • similarity_score: similarity score (float from 0.0 to 5.0)
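
As a minimal sketch of how the three fields above might be handled in code, the snippet below models one record and rescales the score to the 0.0 to 1.0 range, which some training setups expect. The StsPair class, the normalize_score helper, and the sample sentences are hypothetical and are not part of the dataset or any library; only the field names and the 0.0 to 5.0 range come from the list above.

```python
from dataclasses import dataclass

MAX_SCORE = 5.0  # upper bound of similarity_score, as stated in the field list


@dataclass(frozen=True)
class StsPair:
    """One record with the three fields listed above (hypothetical type)."""
    sentence1: str
    sentence2: str
    similarity_score: float


def normalize_score(score: float) -> float:
    """Map a 0.0-5.0 similarity score onto 0.0-1.0 (hypothetical helper)."""
    if not 0.0 <= score <= MAX_SCORE:
        raise ValueError(f"score {score} is outside [0.0, {MAX_SCORE}]")
    return score / MAX_SCORE


# Made-up record for illustration only; it is not taken from the dataset.
pair = StsPair("A man is playing a guitar.", "A person plays a guitar.", 4.0)
print(normalize_score(pair.similarity_score))
```

Rejecting out-of-range values instead of clamping them is a design choice here: it surfaces malformed rows early rather than silently changing a label.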

Dataset Creation

  • Language Creators: crowdsourced, found, machine-generated
  • Annotation Creators: crowdsourced
  • Source Dataset: extended|other-sts-b

Usage Example

  • Load German validation set:
    from datasets import load_dataset
    dataset = load_dataset("stsb_multi_mt", name="de", split="dev")
  • Load English training set:
    from datasets import load_dataset
    dataset = load_dataset("stsb_multi_mt", name="en", split="train")
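
The loading calls above can be wrapped in a small guard that rejects unknown language codes or split names before any download starts. This is a sketch under stated assumptions: check_request and load_split are hypothetical helpers, the dataset ID is copied from the usage example (the page title suggests the repository may instead be mteb/stsb_multi_mt), and the row counts are the ones quoted in the description, assumed to apply to every language.

```python
import warnings

# Language codes and split sizes as listed on this page.
LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "pl", "pt", "ru", "zh")
EXPECTED_ROWS = {"train": 5749, "dev": 1500, "test": 1379}


def check_request(lang: str, split: str) -> None:
    """Raise ValueError for a language or split this page does not list."""
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language {lang!r}; expected one of {LANGUAGES}")
    if split not in EXPECTED_ROWS:
        raise ValueError(f"unknown split {split!r}; expected one of {tuple(EXPECTED_ROWS)}")


def load_split(lang: str, split: str, dataset_id: str = "stsb_multi_mt"):
    """Validate the request, then load it (hypothetical wrapper)."""
    check_request(lang, split)
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(dataset_id, name=lang, split=split)
    if len(dataset) != EXPECTED_ROWS[split]:
        # Assumption: every language has the sizes quoted in the description.
        warnings.warn(f"{lang}/{split} has {len(dataset)} rows, "
                      f"expected {EXPECTED_ROWS[split]}")
    return dataset
```

For example, load_split("de", "dev") mirrors the first call above, while load_split("de", "validation") fails immediately because the split is named dev on this page.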


Topics

Multilingual Processing
Semantic Similarity Evaluation

Source

Organization: hugging_face

Created: Unknown
