Dataset asset · Open Source Community · Multilingual Processing · Semantic Similarity Evaluation
mteb/stsb_multi_mt
STSb Multi MT is a multilingual semantic textual similarity benchmark containing sentence pairs with similarity scores for German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian, and Chinese. Built from the STS-benchmark dataset and translated via deepl.com, it can be used to train sentence-embedding models such as T-Systems-onsite/cross-en-de-roberta-sentence-transformer. For each language, the collection includes a training set (5,749 pairs), a development set (1,500 pairs), and a test set (1,379 pairs).
Source
hugging_face
Created
Nov 28, 2025
Updated
May 4, 2025
Signals
169 views
Availability
Linked source ready
Overview
Dataset description and usage context
Dataset Overview
Dataset Name
- Name: STSb Multi MT
Languages
- Supported Languages: de, en, es, fr, it, nl, pl, pt, ru, zh
License
- License Type: other
Size
- Size Range: 10K < n < 100K
Task Category
- Task Category: text-classification
Specific Tasks
- Task IDs: text-scoring, semantic-similarity-scoring
Dataset Structure
- Data File Configuration:
- Default Configuration:
- Training Set: train/*.parquet
- Validation Set: dev/*.parquet
- Test Set: test/*.parquet
- Language‑Specific Configurations:
- German: de.parquet (train, dev, test)
- French: fr.parquet (train, dev, test)
- Russian: ru.parquet (train, dev, test)
- Chinese: zh.parquet (train, dev, test)
- Spanish: es.parquet (train, dev, test)
- Italian: it.parquet (train, dev, test)
- English: en.parquet (train, dev, test)
- Portuguese: pt.parquet (train, dev, test)
- Dutch: nl.parquet (train, dev, test)
- Polish: pl.parquet (train, dev, test)
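Assuming the per-language layout above, the configuration matrix can be sketched in a few lines. Note that the exact parquet path pattern `split/lang.parquet` is an inference from the `train/*.parquet` globs and the per-language file names, not something this card states explicitly:

```python
# Language configs and splits as listed on this card.
LANGS = ["de", "en", "es", "fr", "it", "nl", "pl", "pt", "ru", "zh"]
SPLITS = ["train", "dev", "test"]

def parquet_paths(lang: str) -> list:
    """Build the assumed per-split parquet paths for one language config."""
    return [f"{split}/{lang}.parquet" for split in SPLITS]

print(parquet_paths("de"))  # -> ['train/de.parquet', 'dev/de.parquet', 'test/de.parquet']
```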
Data Example
- Fields:
- sentence1: first sentence text
- sentence2: second sentence text
- similarity_score: similarity score (float from 0.0 to 5.0)
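A single record under this schema can be sketched as a plain dictionary; the sentence pair and score below are hypothetical, chosen only to illustrate the field types:

```python
# Hypothetical record matching the documented schema (not an actual dataset row).
record = {
    "sentence1": "A man is playing a guitar.",
    "sentence2": "Someone is playing an acoustic guitar.",
    "similarity_score": 3.8,
}

def validate(rec: dict) -> bool:
    """Check a record against the documented field types and the 0.0-5.0 score range."""
    assert isinstance(rec["sentence1"], str)
    assert isinstance(rec["sentence2"], str)
    assert 0.0 <= float(rec["similarity_score"]) <= 5.0
    return True

print(validate(record))  # -> True
```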
Dataset Creation
- Language Creators: crowdsourced, found, machine-generated
- Annotation Creators: crowdsourced
- Source Dataset: extended|other-sts-b
Usage Example
- Load German validation set:
from datasets import load_dataset
dataset = load_dataset("mteb/stsb_multi_mt", name="de", split="dev")
- Load English training set:
from datasets import load_dataset
dataset = load_dataset("mteb/stsb_multi_mt", name="en", split="train")
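When training sentence-embedding models on these pairs, the gold score is commonly rescaled from the 0.0-5.0 range to [0, 1] so it can serve as a target cosine similarity. A minimal sketch of that normalization (the rescaling convention itself, not any particular library's API):

```python
def to_cosine_label(similarity_score: float) -> float:
    """Rescale an STS score from the 0.0-5.0 range to a 0.0-1.0 target."""
    if not 0.0 <= similarity_score <= 5.0:
        raise ValueError("similarity_score must be in [0.0, 5.0]")
    return similarity_score / 5.0

print(to_cosine_label(2.5))  # -> 0.5
```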