High Quality Norwegian Corpus
£12,000
About
A collection of high-quality Norwegian text with metadata: 13M documents and 15B tokens, all derived from source documents with a FineWeb-Edu score of 4 or higher.
Buy the Multilingual Pack for £28,000 instead of ~~£36,000~~ and save £8,000.
Save over $20,000 in GPU costs with our ready-to-use dataset.
Creation
The dataset was created by filtering English Common Crawl data for high-quality text, keeping documents that the FineWeb-Edu classifier scored 4 or higher out of 5.
The data is sourced from v1.0.0 of the HuggingFaceFW/fineweb-edu dataset, which corresponds to CC-MAIN-2024-10 from Common Crawl.
The data was then fully deduplicated and labeled for topic and format using the WebOrganizer classifiers, and only documents with specific formats (listed below) were kept.
All documents were then translated from English to Norwegian using the Qwen3-235B-A22B LLM. In the same pass, the model removed web-scraping artifacts and reformatted the text in Markdown (adding headings, lists, or other formatting elements to improve readability) so that the output is clean.
The LLM was also used to generate a title if the document did not have one.
Data Statistics
- Total Documents: 13,372,873
- Total Tokens: 15.2B GPT-4o Tokens (15,263,569,765 Tokens)
- Total Size: ~49GB
- Total GPU Hours Cost: 15,000 H100 hours per language (~$36,000)
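The headline figures above imply a few derived numbers that are handy when budgeting. The following is a minimal sketch that only does arithmetic on the stated totals; the variable names are illustrative and the derived values are approximations, not figures published with the dataset.

```python
# Figures copied from the Data Statistics list above.
TOTAL_DOCS = 13_372_873
TOTAL_TOKENS = 15_263_569_765   # GPT-4o tokens
GPU_HOURS = 15_000              # H100 hours per language
GPU_COST_USD = 36_000           # approximate, as stated in the listing

# Derived quantities (illustrative only).
avg_tokens_per_doc = TOTAL_TOKENS / TOTAL_DOCS
usd_per_gpu_hour = GPU_COST_USD / GPU_HOURS
tokens_per_gpu_hour = TOTAL_TOKENS / GPU_HOURS

if __name__ == "__main__":
    print(f"Average tokens per document: {avg_tokens_per_doc:,.1f}")
    print(f"Implied cost per H100 hour:  ${usd_per_gpu_hour:.2f}")
    print(f"Tokens produced per H100 hour: {tokens_per_gpu_hour:,.0f}")
```

The average document length of roughly 1,100 tokens is consistent with the sample record shown later in this listing.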
Data Fields
- id: (str) Unique identifier for the document.
- title: (str) Title of the document.
- text: (str) The main content of the document, translated to Norwegian.
- metadata: (dict) Additional metadata about the document, including:
  - url: (str) The original URL of the document.
  - dump: (str) The Common Crawl dump from which the document was extracted.
  - date: (str) The date when the document was scraped.
  - file_path: (str) The path to the original file in the Common Crawl dataset.
  - language: (str) The language of the original document (always "en", English).
  - language_score: (float) The language quality score of the document, ranging from 0 to 1.
  - minhash_cluster_size: (int) The size of the deduplication cluster the document belongs to.
  - fw_edu_int_score: (int) The rounded FineWeb-Edu classifier score for the document, indicating its educational quality (0-5).
  - fw_edu_score: (float) The FineWeb-Edu classifier score for the document, indicating its educational quality (0-5).
  - wo_format_label: (str) The format label assigned by the WebOrganizer classifier, indicating the type of content. Check the WebOrganizer Classifiers for more details.
  - wo_format_score: (float) The confidence score for the format label assigned by the WebOrganizer classifier.
  - wo_topic_label: (str) The topic label assigned by the WebOrganizer classifier, indicating the main subject of the content. Check the WebOrganizer Classifiers for more details.
  - wo_topic_score: (float) The confidence score for the topic label assigned by the WebOrganizer classifier.
  - wo_format_output: (list[dict]) The full output of the WebOrganizer classifier for the format label, including the label and score of all formats.
  - wo_topic_output: (list[dict]) The full output of the WebOrganizer classifier for the topic label, including the label and score of all topics.
  - length: (int) The length of the document in characters.
  - token_count: (int) The number of tokens in the document, calculated using the GPT-4o tokenizer.
  - orig_text: (str) The original text of the document before translation.
  - orig_len: (int) The length of the original text in characters.
  - orig_token_count: (int) The number of tokens in the original text, using the gpt2 tokenizer.
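To show how the fields described above might be consumed, here is a minimal sketch that parses one record and compares the translated length with the original. It assumes records are stored one JSON object per line (JSON Lines), which the listing does not state, and the helper names `parse_record` and `expansion_ratio` are hypothetical.

```python
import json
from typing import Any

# Top-level keys described in the Data Fields section.
REQUIRED_TOP_LEVEL = ("id", "title", "text", "metadata")


def parse_record(line: str) -> dict[str, Any]:
    """Parse one JSON line and check that the top-level fields are present.

    Assumption: one JSON object per line; the actual delivery format
    is not specified in the listing.
    """
    record = json.loads(line)
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    if not isinstance(record["metadata"], dict):
        raise TypeError("metadata must be a dict")
    return record


def expansion_ratio(record: dict[str, Any]) -> float:
    """Character length of the Norwegian text relative to the English original."""
    meta = record["metadata"]
    return meta["length"] / meta["orig_len"]


if __name__ == "__main__":
    # Tiny hand-made record using the lengths from the sample in this listing.
    line = json.dumps({
        "id": "example",
        "title": "Eksempel",
        "text": "...",
        "metadata": {"length": 4371, "orig_len": 4209},
    })
    print(round(expansion_ratio(parse_record(line)), 3))
```

A ratio near 1.0 suggests the translation kept roughly the same amount of content; values far from 1.0 could be worth a manual look.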
Data Formats
The dataset contains documents in the following formats, filtered from all formats available in the WebOrganizer classifier:
- Academic Writing
- Nonfiction Writing
- Personal Blog
- Q&A Forum
- Structured Data
- Creative Writing
- Documentation
- Tutorial
- Knowledge Article
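The selection rule described in this listing (FineWeb-Edu score of 4 or higher, plus one of the formats above) can be sketched as a simple predicate. This is an illustration, not the pipeline actually used: the function name `keep_document` is hypothetical, and the exact label strings stored in `wo_format_label` are assumed to match the names listed here.

```python
# Formats kept in the dataset, as listed above.
KEPT_FORMATS = frozenset({
    "Academic Writing",
    "Nonfiction Writing",
    "Personal Blog",
    "Q&A Forum",
    "Structured Data",
    "Creative Writing",
    "Documentation",
    "Tutorial",
    "Knowledge Article",
})

# Minimum rounded FineWeb-Edu score stated in the Creation section.
MIN_EDU_SCORE = 4


def keep_document(metadata: dict) -> bool:
    """Return True if a document passes both the quality and the format filter."""
    score_ok = metadata.get("fw_edu_int_score", 0) >= MIN_EDU_SCORE
    format_ok = metadata.get("wo_format_label") in KEPT_FORMATS
    return score_ok and format_ok


if __name__ == "__main__":
    print(keep_document({"fw_edu_int_score": 4, "wo_format_label": "Tutorial"}))
    print(keep_document({"fw_edu_int_score": 4, "wo_format_label": "Customer Support"}))
```

Because every document in the dataset already passed this filter, a predicate like this is mostly useful for tightening the selection further, for example by raising the score threshold or narrowing the format set.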
Topics
The dataset contains documents on the following topics:
- Adult
- Art & Design
- Software Dev.
- Crime & Law
- Education & Jobs
- Hardware
- Entertainment
- Social Life
- Fashion & Beauty
- Finance & Business
- Food & Dining
- Games
- Health
- History
- Home & Hobbies
- Industrial
- Literature
- Politics
- Religion
- Science & Tech.
- Software
- Sports & Fitness
- Transportation
- Travel
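A quick way to inspect the topic mix of a subset is to count `wo_topic_label` values. The sketch below is illustrative: the sample record in this listing shows the label with underscores ("Education_&_Jobs") while the classifier output uses spaces ("Education & Jobs"), so the hypothetical `normalize_label` helper assumes that underscores and spaces are interchangeable.

```python
from collections import Counter
from typing import Iterable


def normalize_label(label: str) -> str:
    """Map labels such as 'Education_&_Jobs' to 'Education & Jobs'.

    Assumption: underscores in stored labels stand in for spaces.
    """
    return label.replace("_", " ").strip()


def topic_counts(records: Iterable[dict]) -> Counter:
    """Count documents per normalized topic label."""
    return Counter(
        normalize_label(record["metadata"]["wo_topic_label"])
        for record in records
    )


if __name__ == "__main__":
    demo = [
        {"metadata": {"wo_topic_label": "Education_&_Jobs"}},
        {"metadata": {"wo_topic_label": "Education & Jobs"}},
        {"metadata": {"wo_topic_label": "Health"}},
    ]
    for topic, count in topic_counts(demo).most_common():
        print(topic, count)
```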
Deduplication
The dataset has been fully deduplicated using the MinHash algorithm with the following parameters:
- Num Buckets: 16
- Hashes per Bucket: 8
- Ngrams: 13
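To make these parameters concrete, the following is a minimal MinHash sketch with 16 buckets of 8 hashes each (128 hashes in total) over 13-grams. It is not the implementation used to build the dataset: word-level n-grams, the BLAKE2b hash, and all function names are assumptions made for illustration. Two documents become duplicate candidates when all 8 hashes agree in at least one bucket.

```python
import hashlib

NUM_BUCKETS = 16
HASHES_PER_BUCKET = 8
NGRAM_SIZE = 13
NUM_HASHES = NUM_BUCKETS * HASHES_PER_BUCKET  # 128


def shingles(text: str, n: int = NGRAM_SIZE) -> set[str]:
    """Word n-grams of the text (assumption: word-level, lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text: str) -> list[int]:
    """MinHash signature: the minimum hash value per hash function."""
    grams = shingles(text)
    if not grams:
        raise ValueError("cannot compute a signature for empty text")
    return [min(_hash(g, seed) for g in grams) for seed in range(NUM_HASHES)]


def bucket_keys(sig: list[int]) -> list[tuple[int, ...]]:
    """Split the signature into NUM_BUCKETS groups of HASHES_PER_BUCKET values."""
    return [
        tuple(sig[b * HASHES_PER_BUCKET:(b + 1) * HASHES_PER_BUCKET])
        for b in range(NUM_BUCKETS)
    ]


def are_candidates(a: str, b: str) -> bool:
    """True if the two texts share at least one identical bucket."""
    keys_a, keys_b = bucket_keys(signature(a)), bucket_keys(signature(b))
    return any(x == y for x, y in zip(keys_a, keys_b))


def collision_probability(similarity: float) -> float:
    """Chance that two texts with this Jaccard similarity share a bucket."""
    return 1.0 - (1.0 - similarity ** HASHES_PER_BUCKET) ** NUM_BUCKETS
```

With these settings, `collision_probability(0.8)` is about 0.95, so pairs with a Jaccard similarity of 0.8 are flagged most of the time, while pairs with low similarity are rarely flagged.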
Sample Example
{
"id": "<urn:uuid:02bce8b9-19cb-4937-8435-e78163cf5465>",
"text": "Les hva en lærer sier om denne aktiviteten:\n\"Two-way feedback er en aktivitet som barnetrinn- og ungdomstrinnselever kan bruke til å selv vurdere. Det er en god aktivitet for å introdusere selvbedømmelse. Du kan også finne ut hva elevene dine synes om aktivitetene. Selv om det kan være forvirrende første gang de gjør det, kan elevene bli mer selvstendige på sikt.\"\n\n## Første trinn: Forberedelse\nForbered en skriveoppgave for elevene dine og lag en enkel vurderingsform. For barnetrinnet, ta med can-do-utsagn. For ungdomstrinnet, ta også med kriterier.\n\n### Eksempelform for barnetrinn\n|Jeg kan skrive om min beste venn|\nMin tekst er ...\nVeldig god [2 smilende fjes-emoji]\nGod [1 smilende fjes-emoji]\nGreie [1 nøytral fjes-emoji]\nDenne oppgaven var ...\nInteressant [1 smilende fjes-emoji]\nGreie [1 nøytral fjes-emoji]\nKjedelig [1 trist fjes-emoji]\n\n### Eksempelform for ungdomstrinn\n|Jeg kan skrive om fordelene og ulempene ved turisme.||Sett et kryss (X) på linjen.|\n|Jeg skrev om to fordeler og to ulemper.||Nei ----------¦---------- Ja|\n|Jeg skrev 4 paragrafer.||Nei ----------¦---------- Ja|\n|Jeg brukte lenkeord.||Nei ----------¦---------- Ja|\n|Jeg brukte et variert ordforråd.||Nei ----------¦---------- Ja|\n|Jeg sjekket rettskriving og tegnsetting og grammatikk||Nei ----------¦---------- Ja|\n|Hva jeg gjorde bra:|\n|Noe jeg kan forbedre neste gang:|\n\n## Andre trinn: Introduksjon\nNår elevene har fullført skriveoppgaven, si: \"Nå skal dere vurdere arbeidet deres.\"\n\nBarnetrinn/Ungdomstrinn: Spør: \"Hva er can-do-utsagnet?\" (f.eks. Jeg kan skrive om min beste venn)\nUngdomstrinn: Spør: \"Hva er kriteriene?\" (f.eks. skrive fire paragrafer)\n\nTegn opp formen på tavla. (Tips: Tegn den på tavla før timen og skjul den med papir.) Hvis du kan, gi dem en utskrevet kopi av den.\n\nFortell elevene at de skal kopiere formen inn i skriveblokkene sine. 
(Tips: Elevene kan ha en spesifikk skriveblokk til selvbedømmelse, som du kan ta med og lese.)\n\n## Tredje trinn: Modell\nBruk formen på tavla til å modellere aktiviteten.\n\nBarnetrinn: Spør: \"Var teksten interessant? [smil] Ja? Kryss av i denne boksen. Var den grei? [trekk et ansikt som om du er usikker] Kryss av i denne boksen. Kjedelig? [se trist ut] Kryss av i denne boksen.\"\n\nUngdomstrinn: Spør: \"Likte du denne oppgaven? Sett et merke på linja for å vise hvor mye du likte den.\" Bruk et lokalspråk til å forklare hvis det er nødvendig.\n\n## Fjerde trinn: Selvbedømmelse\nBarnetrinn: Si: \"Nå skal dere se på teksten deres og krysse av i boksene.\" Gå rundt, støtt og gi tilbakemelding.\n\nUngdomstrinn: Si: \"Nå skal dere se på teksten deres. Sett et merke på linja på riktig sted. Dere har tre minutter.\" Du kan måtte forklare på et lokalspråk. For eksempel, hvis eleven skrev to paragrafer, ville de sette et merke midt på. Hvis de skrev tre, ville merket være nærmere \"ja\". Gå rundt og støtt, men gi ikke tilbakemelding.\n\nNår dette er gjort, si: \"Nå skal dere se på bunnen. Tenk på hva dere gjorde bra og skriv en setning. Tenk også på noe dere kan forbedre neste gang og skriv en setning. Dere har tre minutter.\" Gå rundt og støtt, men gi ikke tilbakemelding.\n\n### Valgfritt trinn: Pararbeid\nBarnetrinn: Si: \"Velg en partner. Se på hverandres arbeid. Se på boksen (veldig god, god, grei). Er dere enige med partneren? Snakk om det.\" Gi dem tre minutter. Gå rundt, støtt og gi tilbakemelding.\n\nUngdomstrinn: Si: \"Velg en partner. Se på hverandres arbeid. Se på hverandres selvbedømmelse. Hva synes dere? Er dere enige med partneren? Diskuter.\" Gi dem fem minutter. Gå rundt og lytt.\n\n## Femte trinn: Avsluttende aktivitet\nBarnetrinn: Be to eller tre elever om å oppsummere diskusjonene sine. Du kan bruke denne tilbakemeldingen når du planlegger fremtidige aktiviteter.\n\nSpør elevene om de liker å bruke formen til selvbedømmelse. 
Bruk denne tilbakemeldingen til å tilpasse aktiviteten neste gang hvis nødvendig.\n\nUngdomstrinn: Fortell elevene at de skal gi deg formene sine. Si: \"Jeg skal lese kommentarene deres, så skal jeg legge til mine nederst. Deretter kan dere sammenligne vurderingen deres med min.\"\n\nSpør elevene hvor lett eller vanskelig det var å gjøre selvbedømmelse. Fortell dem at det blir lettere hvis de gjør det oftere.\n\nNeste gang elevene skriver en tekst, be dem om å se på selvbedømmelsesformene sine, og huske hva de kan forbedre. Du kan lage selvbedømmelsesformen på tavla sammen med elevene dine.",
"title": "Two-way feedback: En aktivitet for selvbedømmelse",
"metadata": {
"url": "https://africa.teachingenglish.org.uk/classroom/activities/two-way-feedback",
"dump": "CC-MAIN-2022-33",
"date": "1970-01-01T00:00:00",
"file_path": "s3://commoncrawl/crawl-data/CC-MAIN-2022-33/segments/1659882571950.76/warc/CC-MAIN-20220813111851-20220813141851-00106.warc.gz",
"language": "en",
"language_score": 0.9150523543,
"minhash_cluster_size": 1,
"fw_edu_int_score": 4,
"fw_edu_score": 4,
"wo_format_label": "Tutorial",
"wo_format_score": 0.9552263,
"wo_topic_label": "Education_&_Jobs",
"wo_topic_score": 0.9857722,
"wo_format_output": [
{
"label": "Tutorial",
"score": 0.9552263
},
{
"label": "Customer Support",
"score": 0.029994402
},
{
"label": "Truncated",
"score": 0.0024911128
},
.......
],
"wo_topic_output": [
{
"label": "Education & Jobs",
"score": 0.9857722
},
{
"label": "Literature",
"score": 0.009932071
},
{
"label": "Social Life",
"score": 0.0011474773
},
......
],
"length": 4371,
"token_count": 1277,
"orig_text": "Read what a teacher says about this activity:\n‘Two-way feedback is an activity which primary and secondary learners can use to self-assess. It’s a good activity for introducing self-assessment. You can also discover what your learners think about the activities. Although it may be confusing the first time they do it, learners can become more independent in the long term.’\nStage 1: Preparation\nPrepare a writing task for your learners and design a simple assessment form. For primary learners, include the can-do statements. For secondary learners, also include criteria.\nExample form for primary\n|I can write about my best friend|\nMy writing is …\nVery good [2 smiley face emojis]\nGood [1 smiley emoji]\nOK [1 neutral face emoji]\nThis task was ...\nInteresting [1 smiley face emoji]\nOK [1 neutral face emoji]\nBoring [1 sad face emoji]\nExample form for secondary\n|I can write about the advantages and disadvantages of tourism.||Place a cross (X) on the line.|\n|I wrote about two advantages and two disadvantages.||No ----------¦---------- Yes|\n|I wrote 4 paragraphs.||No ----------¦---------- Yes|\n|I used linking words.||No ----------¦---------- Yes|\n|I used a range of vocabulary.||No ----------¦---------- Yes|\n|I checked spelling and punctuation and grammar||No ----------¦---------- Yes|\n|What I did well:|\n|Something I can improve next time:|\nStage 2: Introduction\nWhen learners have completed the writing task, say: ‘Now you will assess your work.’\nPrimary/Secondary: Ask: ‘What is the can-do?’ (e.g. I can write about my best friend)\nSecondary: Ask: ‘What are the criteria?’ (e.g. write four paragraphs)\nDraw the form on the board. (Tip: Draw it on the board before the lesson and hide it with paper.) If you can, give them a printed copy of it.\nTell learners to copy the form into their notebooks. 
(Tip: Learners can have a specific notebook for self-assessment, which you can take away and read.)\nStage 3: Model\nUse the form on the board to model the activity.\nPrimary: Ask: ‘Was the writing interesting? [smile] Yes? Tick this box. Was it OK? [make a face like you’re not sure] Tick this box. Boring? [look sad] Tick this box.’\nSecondary: Ask: ‘Did you like this task? Put a mark on the line to show how much you liked it.’ Use a local language to explain if needed.\nStage 4: Self-assessment\nPrimary: Say: ‘Now look at your writing and tick the boxes.’ Circulate, support and give feedback.\nSecondary: Say: ‘Now look at your writing. Place a mark on the line in the appropriate place. You have three minutes.’ You may need to explain in a local language. For example, if the learner wrote two paragraphs, they would put a mark in the middle. If they wrote three, the mark would be close to ‘yes’. Circulate and support but do not give feedback.\nWhen this is done, say: ‘Now look at the bottom. Think about what you did well and write a sentence. Also think about something you can improve next time and write a sentence. You have three minutes.’ Circulate and support but do not give feedback.\nOptional stage: Pair work\nPrimary: Say: ‘Choose a partner. Look at each other’s work. Look at the box (very good, good, OK). Do you agree with your partner? Talk about it.’ Give them three minutes. Circulate, support and give feedback.\nSecondary: Say: ‘Choose a partner. Look at each other’s work. Look at each other’s self-assessment. What do you think? Do you agree with your partner? Discuss.’ Give them five minutes. Circulate and listen.\nStage 5: Closing activity\nPrimary: Ask two or three learners to summarise their discussions. You can use this feedback when planning future activities.\nAsk learners if they like using the form for self-assessment. Use this feedback to adapt the activity next time if necessary.\nSecondary: Tell learners to give you their forms. 
Say: ‘I’ll read your comments, then I’ll add mine at the bottom. You can then compare your assessment with mine.’\nAsk learners how easy or difficult it was to self-assess. Tell them it will be easier if they do it more often.\nThe next time your learners write a text, ask them to look at their self-assessment forms, and remember what they can improve. You can create the self-assessment form on the board together with your learners.",
"orig_len": 4209,
"orig_token_count": 1106
}
}