bea2019st/wi_locness
The Cambridge English Write & Improve + LOCNESS dataset is an English corpus for grammatical error correction. Write & Improve is an online platform that helps non‑native English learners improve their writing; after a student submits an essay, the system provides instant feedback and human annotators assign a CEFR level. The LOCNESS corpus contains essays written by native English students and is annotated by Write & Improve annotators so that researchers can evaluate their systems across different English proficiency levels. The dataset supports tasks of correcting grammatical, lexical, and spelling errors. It provides two configurations, wi and locness, corresponding to different data sources and annotation methods.
Description
Dataset Card: Cambridge English Write & Improve + LOCNESS Dataset
Dataset Description
Dataset Summary
Write & Improve (Yannakoudakis et al., 2018) is an online platform aimed at helping non‑native English students improve their writing skills. Specifically, students from around the world submit letters, stories, essays, and papers on various topics, and the W&I system provides instant feedback. Since its launch in 2014, W&I annotators have manually annotated a subset of submissions and assigned them CEFR levels.
LOCNESS (Granger, 1998) is a corpus of essays written by native English students, originally compiled by researchers at the Centre for English Corpus Linguistics (CECL), University of Louvain. Because native speakers also make errors, we asked W&I annotators to annotate a subset of LOCNESS so that researchers can test their systems across the full range of English proficiency levels.
Supported Tasks and Leaderboards
The grammatical error correction (GEC) task automatically corrects errors in text; for example [I follows his advices -> I followed his advice]. It can be used both to help language learners improve writing and to alert native speakers to accidental typos.
The dataset targets correction of all error types in written text, including grammatical, lexical, and spelling errors.
The latest leaderboards and submissions are available at the following Codalab competition: https://competitions.codalab.org/competitions/20228
Language
The dataset is in English.
Dataset Structure
Data Instances
wi configuration example:
{
"id": "1-140178",
"userid": "21251",
"cefr": "A2.i",
"text": "My town is a medium size city with eighty thousand inhabitants. It has a high density population because its small territory. Despite of it is an industrial city, there are many shops and department stores. I recommend visiting the artificial lake in the certer of the city which is surrounded by a park. Pasteries are very common and most of them offer the special dessert from the city. There are a comercial zone along the widest street of the city where you can find all kind of establishments: banks, bars, chemists, cinemas, pet shops, restaurants, fast food restaurants, groceries, travel agencies, supermarkets and others. Most of the shops have sales and offers at least three months of the year: January, June and August. The quality of the products and services are quite good, because there are a huge competition, however I suggest you taking care about some fakes or cheats.",
"edits": {
"start": [13, 77, 104, 126, 134, 256, 306, 375, 396, 402, 476, 484, 579, 671, 774, 804, 808, 826, 838, 850, 857, 862, 868],
"end": [24, 78, 104, 133, 136, 262, 315, 379, 399, 411, 480, 498, 588, 671, 777, 807, 810, 835, 845, 856, 861, 867, 873],
"text": ["medium-sized", "-", " of", "Although", "", "center", null, "of", "is", "commercial", "kinds", "businesses", "grocers", " in", "is", "is", "", ". However,", "recommend", "be", "careful", "of", ""]
}
}
locness configuration example:
{
"id": "7-5819177",
"cefr": "N",
"text": "Boxing is a common, well known and well loved sport amongst most countries in the world however it is also punishing, dangerous and disliked to the extent that many people want it banned, possibly with good reason.\nBoxing is a dangerous sport, there are relatively common deaths, tragic injuries and even disease. All professional boxers are at risk from being killed in his next fight. If not killed then more likely paralysed. There have been a number of cases in the last ten years of the top few boxers having tragic losses throughout their ranks. This is just from the elite few, and theres more from those below them.\nMore deaths would occur through boxing if it were banned. The sport would go underground, there would be no safety measures like gloves, a doctor, paramedics or early stopping of the fight if someone looked unable to continue. With this going on the people taking part will be dangerous, and on the streets. Dangerous dogs who were trained to kill and maim in similar underound dog fights have already proved deadly to innocent people, the new boxers could be even more at risk.\nOnce boxing is banned and no-one grows up knowing it as acceptable there will be no interest in boxing and hopefully less all round interest in violence making towns and cities much safer places to live in, there will be less fighting outside pubs and clubs and less violent attacks with little or no reason.\nchange the rules of boxing slightly would much improve the safety risks of the sport and not detract form the entertainment. There are all sorts of proposals, lighter and more cushioning gloves could be worn, ban punches to the head, headguards worn or make fights shorter, as most of the serious injuries occur in the latter rounds, these would all show off the boxers skill and tallent and still be entertaining to watch.\nEven if a boxer is a success and manages not to be seriously hurt he still faces serious consequences in later life diseases that attack the brains have been known to set in as a direct result of boxing, even Muhamed Ali, who was infamous(?) both for his boxing and his quick-witted intelligence now has Alzheimer disease and can no longer do many everyday acts.\nMany other sports are more dangerous than boxing, motor sports and even mountaineering has risks that are real. Boxers chose to box, just as racing drivers drive.",
"edits": {
"start": [24, 39, 52, 87, 242, 371, 400, 528, 589, 713, 869, 992, 1058, 1169, 1209, 1219, 1255, 1308, 1386, 1412, 1513, 1569, 1661, 1731, 1744, 1781, 1792, 1901, 1951, 2038, 2131, 2149, 2247, 2286],
"end": [25, 40, 59, 95, 249, 374, 400, 538, 595, 713, 869, 1001, 1063, 1169, 1209, 1219, 1255, 1315, 1390, 1418, 1517, 1570, 1661, 1737, 1751, 1781, 1799, 1901, 1960, 2044, 2131, 2149, 2248, 2289],
"text": ["-", "-", "in", ". However,", ". There", "their", ",", "among", "theres", " and", ",", "underground", ". The", ",", ",", ",", ",", ",", ". There", "for", "Changing", "from", ";", ",", "later", ". These", "", "talent", ",", ". Diseases", ". Even", ",", "s", ";", "have"]
}
}
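To make the offsets in the instances above easier to read, the following minimal sketch pairs each edit with the span of the original text it covers. It assumes that `start` and `end` are zero-based character offsets with an exclusive end, which matches the first edits of the wi example but is not stated explicitly by the card. The helper name `edit_pairs` and the shortened `sample` record are illustrative only.

```python
from typing import Any, Dict, List, Optional, Tuple


def edit_pairs(instance: Dict[str, Any]) -> List[Tuple[int, int, str, Optional[str]]]:
    """Return (start, end, original_span, replacement) for each edit.

    Assumption: offsets index characters of instance["text"], end exclusive.
    zip() stops at the shortest of the three lists.
    """
    text = instance["text"]
    edits = instance["edits"]
    return [
        (start, end, text[start:end], replacement)
        for start, end, replacement in zip(edits["start"], edits["end"], edits["text"])
    ]


# Shortened from the wi example: first sentence and its first edit only.
sample = {
    "text": "My town is a medium size city with eighty thousand inhabitants.",
    "edits": {"start": [13], "end": [24], "text": ["medium-sized"]},
}

for start, end, original, replacement in edit_pairs(sample):
    print(f"[{start}:{end}] {original!r} -> {replacement!r}")
# prints: [13:24] 'medium size' -> 'medium-sized'
```

An edit whose `start` equals its `end` covers an empty span, so it reads as an insertion at that position.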
Data Fields
The dataset fields include:
- id: text ID (string)
- cefr: CEFR level (string); see https://www.cambridgeenglish.org/exams-and-tests/cefr/
- userid: user ID
- text: submitted text (string)
- edits: W&I edits, containing:
  - start: list of start indices (int)
  - end: list of end indices (int)
  - text: list of edited text strings (string)
  - from: list of original text strings (string)
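As a rough illustration of how these fields fit together, the sketch below rebuilds a corrected text from the `start`, `end`, and `text` lists of `edits`. It rests on two assumptions that the card does not confirm: offsets are zero-based character positions with an exclusive end, and a `null` replacement marks an error without a proposed correction, so the original span is left untouched. The function name `apply_edits` is hypothetical.

```python
from typing import Optional, Sequence


def apply_edits(
    text: str,
    starts: Sequence[int],
    ends: Sequence[int],
    replacements: Sequence[Optional[str]],
) -> str:
    """Apply character-level edits to text and return the corrected string."""
    if not (len(starts) == len(ends) == len(replacements)):
        raise ValueError("start, end and text lists must have the same length")

    corrected = text
    # Work from the rightmost edit to the leftmost so that earlier offsets
    # remain valid after each replacement changes the string length.
    ordered = sorted(zip(starts, ends, replacements), key=lambda e: e[0], reverse=True)
    for start, end, replacement in ordered:
        if replacement is None:
            # Assumption: null means no correction was provided; keep the span.
            continue
        corrected = corrected[:start] + replacement + corrected[end:]
    return corrected


# The task example from this card, expressed in the edits format.
print(apply_edits("I follows his advices", [2, 14], [9, 21], ["followed", "advice"]))
# prints: I followed his advice
```

Applying edits right to left avoids tracking a running offset; the length check makes the function raise on records whose three lists disagree instead of silently dropping edits.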
Data Splits
| Configuration | Train | Validation |
|---|---|---|
| wi | 3000 | 300 |
| locness | N/A | 50 |
Dataset Creation
Rationale
[More information needed]
Source Data
Initial Data Collection and Normalization
[More information needed]
Who are the source language producers?
[More information needed]
Annotation
Annotation Process
[More information needed]
Who are the annotators?
[More information needed]
Personal and Sensitive Information
[More information needed]
Considerations for Using the Dataset
Social Impact
[More information needed]
Discussion of Biases
[More information needed]
Other Known Limitations
[More information needed]
Additional Information
Dataset Curators
[More information needed]
License Information
Write & Improve License:
Cambridge English Write & Improve (CEWI) Dataset Licence Agreement
1. By downloading this dataset and licence, you enter into this licence agreement, effective on the date of download, between you (the licensee) and the University of Cambridge (the licensor).
2. The licensor retains copyright in the entire licensed dataset. No ownership of or interest in the dataset is transferred to the licensee.
3. The licensor grants the licensee a non‑exclusive, non‑transferable right to use the licensed dataset for non‑commercial research and educational purposes.
4. Non‑commercial purposes exclude using the dataset or derived information as part of a product or service that is sold, offered for sale, licensed, rented or otherwise provided to third parties.
5. The licensee should acknowledge the use of the licensed dataset in all publications derived from it by citing the following publication:
Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education.
6. The licensee may publish excerpts of fewer than 100 words from the dataset as permitted by clause 3.
7. The licensor grants the licensee the right to use the licensed dataset "as is". The licensor makes no express or implied warranties, representations or endorsements.
8. This agreement shall be governed and interpreted in accordance with English law, and the courts of England shall have exclusive jurisdiction.
LOCNESS License:
LOCNESS Dataset Licence Agreement
1. The corpus is for non‑commercial use only.
2. Publications based on the corpus should acknowledge the Centre for English Corpus Linguistics (CECL) at the University of Louvain. A scanned or printed copy of the publication should also be sent to <sylviane.granger@uclouvain.be>.
3. Unless expressly authorized by CECL, no part of the corpus may be distributed to third parties. The corpus may only be used by individuals who agree to the licence terms, or by researchers who work closely with them or under their supervision, provided they belong to the same institution and the work is within the scope of a research project.
Source
Organization: hugging_face
Created: Unknown