High Quality Spanish Corpus
£12,000
About
A collection of 13M documents (about 15B tokens) of high-quality Spanish text with metadata, all with a FineWeb-Edu score of 4 or higher.
Buy the Multilingual Pack for £28,000 instead of ~~£36,000~~ and save £8,000.
Save over $20,000 in GPU costs with our ready-to-use dataset.
Dataset License
Creation
The dataset was created by filtering English Common Crawl data for high-quality text, keeping documents that the FineWeb-Edu classifier scored 4 or higher out of 5.
The data is sourced from v1.0.0 of the HuggingFaceFW/fineweb-edu dataset, which corresponds to CC-MAIN-2024-10 from Common Crawl.
The data was then fully deduplicated and labeled for Topic and Format using the WebOrganizer Classifiers, and only documents with specific formats (listed below) were kept.
All documents were then translated from English to Spanish using the Qwen3-235B-A22B LLM. During translation, web-scraping artifacts were removed and the output was reformatted as Markdown (adding headings, lists, or other formatting elements to improve readability) so that the text is clean and high quality.
The LLM was also used to generate a title if the document did not have one.
Data Statistics
- Total Documents: 13,312,491
- Total Tokens: 14.5B GPT-4o Tokens (14,509,575,325 Tokens)
- Total Size: ~52GB
- Total GPU Hours Needed: 15,000 H100 Hours per language
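The figures above allow a quick sanity check. The sketch below derives the average document length in tokens and the H100 hourly rate implied by combining the 15,000 GPU hours with the "over $20,000" savings figure quoted earlier. The hourly rate is not stated anywhere in this listing; it is only what those two numbers imply, and the function names are illustrative.

```python
# Figures copied from the statistics and the listing text above.
TOTAL_DOCUMENTS = 13_312_491
TOTAL_TOKENS = 14_509_575_325      # GPT-4o tokens
GPU_HOURS = 15_000                 # H100 hours per language
CLAIMED_SAVINGS_USD = 20_000       # lower bound: the listing says "over $20,000"


def average_tokens_per_document(tokens: int, documents: int) -> float:
    """Mean number of tokens per document."""
    return tokens / documents


def implied_hourly_rate(savings_usd: float, gpu_hours: float) -> float:
    """Hourly GPU price implied by the savings claim (an inference, not a quoted price)."""
    return savings_usd / gpu_hours


if __name__ == "__main__":
    avg = average_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
    rate = implied_hourly_rate(CLAIMED_SAVINGS_USD, GPU_HOURS)
    print(f"about {avg:.0f} tokens per document on average")
    print(f"at least ${rate:.2f} per H100 hour implied by the savings claim")
```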
Data Fields
- id: (str) Unique identifier for the document.
- title: (str) Title of the document.
- text: (str) The main content of the document, translated to Spanish.
- metadata: (dict) Additional metadata about the document, including:
  - url: (str) The original URL of the document.
  - dump: (str) The Common Crawl dump from which the document was extracted.
  - date: (str) The date when the document was scraped.
  - file_path: (str) The path to the original file in the Common Crawl dataset.
  - language: (str) The language of the original document (always "en", English).
  - language_score: (float) The language quality score of the document, ranging from 0 to 1.
  - minhash_cluster_size: (int) The size of the deduplication cluster the document belongs to.
  - fw_edu_int_score: (int) The rounded FineWeb-Edu classifier score for the document, indicating its educational quality (0-5).
  - fw_edu_score: (float) The FineWeb-Edu classifier score for the document, indicating its educational quality (0-5).
  - wo_format_label: (str) The format label assigned by the WebOrganizer classifier, indicating the type of content. Check the WebOrganizer Classifiers for more details.
  - wo_format_score: (float) The confidence score for the format label assigned by the WebOrganizer classifier.
  - wo_topic_label: (str) The topic label assigned by the WebOrganizer classifier, indicating the main subject of the content. Check the WebOrganizer Classifiers for more details.
  - wo_topic_score: (float) The confidence score for the topic label assigned by the WebOrganizer classifier.
  - wo_format_output: (list[dict]) The full output of the WebOrganizer classifier for the format label, including the label and score of all formats.
  - wo_topic_output: (list[dict]) The full output of the WebOrganizer classifier for the topic label, including the label and score of all topics.
  - length: (int) The length of the document in characters.
  - token_count: (int) The number of tokens in the document, calculated using the GPT-4o tokenizer.
  - orig_text: (str) The original text of the document before translation.
  - orig_len: (int) The length of the original text in characters.
  - orig_token_count: (int) The number of tokens in the original text, using the gpt2 tokenizer.
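As a rough illustration of how these fields might be read, the sketch below parses records and pulls a few values out of the nested metadata dict. It assumes the data is stored as JSON Lines (one JSON object per line), which this listing does not confirm; `iter_records`, `summarize`, and the trimmed `SAMPLE_LINE` record are hypothetical names and data made up for the example, using only field names described above.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line (assumes JSON Lines storage)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Collect a few top-level and metadata fields into a flat dict."""
    meta = record["metadata"]
    return {
        "id": record["id"],
        "title": record["title"],
        "url": meta["url"],
        "edu_score": meta["fw_edu_score"],
        "format": meta["wo_format_label"],
        "topic": meta["wo_topic_label"],
        "tokens": meta["token_count"],
        # Character-length ratio of the Spanish text to the English original.
        "expansion": meta["length"] / meta["orig_len"],
    }


# Hypothetical, heavily trimmed record used only to exercise the helpers.
SAMPLE_LINE = json.dumps({
    "id": "example-id",
    "title": "Example title",
    "text": "# Example\n\nTexto de ejemplo.",
    "metadata": {
        "url": "https://example.com/page",
        "fw_edu_score": 3.765625,
        "wo_format_label": "Knowledge_Article",
        "wo_topic_label": "Science_&_Tech.",
        "token_count": 1276,
        "length": 6741,
        "orig_len": 5738,
    },
})

if __name__ == "__main__":
    for rec in iter_records([SAMPLE_LINE]):
        print(summarize(rec))
```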
Data Formats
The dataset contains documents in the following formats, filtered from all formats available in the WebOrganizer classifier:
- Academic Writing
- Nonfiction Writing
- Personal Blog
- Q&A Forum
- Structured Data
- Creative Writing
- Documentation
- Tutorial
- Knowledge Article
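The format filter can be sketched as a simple predicate over a document's metadata. This is a minimal illustration, not the pipeline actually used: it assumes the score threshold is applied to the rounded `fw_edu_int_score` field, and it assumes labels may appear with underscores (as in "Knowledge_Article") and need normalizing before comparison. `KEPT_FORMATS`, `normalize_label`, and `keep_document` are hypothetical names.

```python
from typing import Any, Mapping

# The nine formats listed above.
KEPT_FORMATS = frozenset({
    "Academic Writing",
    "Nonfiction Writing",
    "Personal Blog",
    "Q&A Forum",
    "Structured Data",
    "Creative Writing",
    "Documentation",
    "Tutorial",
    "Knowledge Article",
})


def normalize_label(label: str) -> str:
    """Map an underscore-separated label such as 'Knowledge_Article' to its spaced form."""
    return label.replace("_", " ").strip()


def keep_document(metadata: Mapping[str, Any], min_edu_score: int = 4) -> bool:
    """Return True if the document passes both the score and the format filter.

    Assumption: the threshold is checked against the rounded integer score.
    """
    has_score = metadata["fw_edu_int_score"] >= min_edu_score
    has_format = normalize_label(metadata["wo_format_label"]) in KEPT_FORMATS
    return has_score and has_format


if __name__ == "__main__":
    doc = {"fw_edu_int_score": 4, "wo_format_label": "Knowledge_Article"}
    print(keep_document(doc))
```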
Topics
The dataset contains documents on the following topics:
- Adult
- Art & Design
- Software Dev.
- Crime & Law
- Education & Jobs
- Hardware
- Entertainment
- Social Life
- Fashion & Beauty
- Finance & Business
- Food & Dining
- Games
- Health
- History
- Home & Hobbies
- Industrial
- Literature
- Politics
- Religion
- Science & Tech.
- Software
- Sports & Fitness
- Transportation
- Travel
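Because each record carries the classifier scores for all topics in `wo_topic_output`, a consumer could flag documents whose top two topics are nearly tied instead of trusting the single label. The sketch below is one possible way to do that; the helper names and the 0.05 margin are arbitrary choices for illustration, not part of the dataset.

```python
from typing import Dict, List, Union

TopicEntry = Dict[str, Union[str, float]]


def top_two_topics(topic_output: List[TopicEntry]) -> List[TopicEntry]:
    """Return up to two entries with the highest scores, best first."""
    ranked = sorted(topic_output, key=lambda entry: entry["score"], reverse=True)
    return ranked[:2]


def is_ambiguous(topic_output: List[TopicEntry], min_margin: float = 0.05) -> bool:
    """True when the gap between the two best topic scores is below min_margin."""
    best = top_two_topics(topic_output)
    if len(best) < 2:
        return False
    return best[0]["score"] - best[1]["score"] < min_margin


if __name__ == "__main__":
    # Scores of the form found in a record's wo_topic_output field.
    scores = [
        {"label": "Science & Tech.", "score": 0.4413778},
        {"label": "Home & Hobbies", "score": 0.43623558},
        {"label": "Health", "score": 0.04413119},
    ]
    print([entry["label"] for entry in top_two_topics(scores)], is_ambiguous(scores))
```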
Deduplication
The dataset has been fully deduplicated using the MinHash algorithm with the following parameters:
- Num Buckets: 16
- Hashes per Bucket: 8
- Ngrams: 13
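To give a feel for what these parameters imply, the sketch below assumes the standard banded MinHash LSH scheme, in which two documents become duplicate candidates if all hashes in at least one bucket match. Under that assumption, a pair with Jaccard similarity s over its 13-gram sets collides with probability 1 - (1 - s^r)^b, where b is the number of buckets and r the hashes per bucket. Whether the pipeline follows exactly this scheme, and whether the n-grams are counted in words or characters, is not stated here; the function names are illustrative.

```python
NUM_BUCKETS = 16        # b, from the parameters above
HASHES_PER_BUCKET = 8   # r, from the parameters above


def total_hashes(num_buckets: int = NUM_BUCKETS,
                 hashes_per_bucket: int = HASHES_PER_BUCKET) -> int:
    """Size of the MinHash signature: one hash per (bucket, slot) pair."""
    return num_buckets * hashes_per_bucket


def match_probability(similarity: float,
                      num_buckets: int = NUM_BUCKETS,
                      hashes_per_bucket: int = HASHES_PER_BUCKET) -> float:
    """Probability that at least one bucket matches for a pair with the given Jaccard similarity."""
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets


def approximate_threshold(num_buckets: int = NUM_BUCKETS,
                          hashes_per_bucket: int = HASHES_PER_BUCKET) -> float:
    """Common rule-of-thumb similarity threshold (1/b)^(1/r) where the curve rises steeply."""
    return (1.0 / num_buckets) ** (1.0 / hashes_per_bucket)


if __name__ == "__main__":
    print(f"signature size: {total_hashes()} hashes")
    print(f"approximate threshold: {approximate_threshold():.3f}")
    for s in (0.5, 0.7, 0.8, 0.9):
        print(f"similarity {s:.1f} -> match probability {match_probability(s):.3f}")
```

Under these assumptions, highly similar pairs are almost always caught, while pairs around 0.5 similarity are rarely matched.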
Sample Example
{
  "id": "<urn:uuid:e49313c1-7459-49e8-bfe5-407103413201>",
  "text": "# Los hongos más comunes pueden ser un reto para encontrar en la naturaleza\n\nLos hongos *morel* pueden ser un reto para encontrar en la naturaleza gracias a su corta temporada de crecimiento y su necesidad de condiciones específicas de suelo y clima. Pero quizás entender cómo crecen y se dispersan podría ayudarte a predecir mejor cuándo y dónde encontrarlos. Entonces, ¿cómo se reproducen los *morels*? ¿Qué tan rápido crecen? ¿Y puedes encontrarlos en los mismos lugares año tras año? En este artículo responderemos todas estas preguntas y más.\n\n## Lo que aprenderás hoy\n\n### ¿Cómo se dispersan los *morels*?\n\nLos *morels* se reproducen de manera muy similar a otros hongos, pero necesitan un conjunto muy específico de condiciones para crecer. Recuerda, los hongos visibles son solo los frutos de una red mucho más grande de filamentos fúngicos, conocidos como micelio, que se encuentran en el suelo.\n\nEl hongo *morel*, como otros hongos, puede dispersarse de dos maneras diferentes:\n\n- Los hongos producen esporas que se liberan al ambiente cuando las cabezas fructíferas maduran. Estas esporas producen hilos de micelio que crecen y se dispersan como raíces hasta que son lo suficientemente maduros como para producir hongos en nuevas ubicaciones.\n- El micelio también puede extenderse a través del suelo, creciendo y dispersándose a nuevas ubicaciones........",
  "title": "El crecimiento y reproducción de los hongos más comestibles",
  "metadata": {
    "url": "https://www.forestwildlife.org/how-do-morels-reproduce/",
    "dump": "CC-MAIN-2021-49",
    "date": "1970-01-01 00:00:00",
    "file_path": "s3://commoncrawl/crawl-data/CC-MAIN-2021-49/segments/1637964358233.7/warc/CC-MAIN-20211127193525-20211127223525-00445.warc.gz",
    "language": "en",
    "language_score": 0.9539546371,
    "minhash_cluster_size": 8,
    "fw_edu_int_score": 4,
    "fw_edu_score": 3.765625,
    "wo_format_label": "Knowledge_Article",
    "wo_format_score": 0.7705206,
    "wo_topic_label": "Science_&_Tech.",
    "wo_topic_score": 0.4413778,
    "wo_format_output": [
      {
        "label": "Knowledge Article",
        "score": 0.7705206
      },
      {
        "label": "Tutorial",
        "score": 0.10928307
      },
      {
        "label": "Customer Support",
        "score": 0.0725144
      },
      .....
    ],
    "wo_topic_output": [
      {
        "label": "Science & Tech.",
        "score": 0.4413778
      },
      {
        "label": "Home & Hobbies",
        "score": 0.43623558
      },
      {
        "label": "Health",
        "score": 0.04413119
      },
      ...
    ],
    "length": 6741,
    "token_count": 1276,
    "orig_text": "Morel mushrooms can be a challenge to find in the wild thanks to their short growing season and their need for specific soil and weather conditions. But perhaps understanding how they grow and spread would help you better predict when and where to find them............",
    "orig_len": 5738,
    "orig_token_count": 1276
  }
}