Opendatabay APP

Preprocessed BBC News Dataset

E-commerce & Online Transactions

Tags and Keywords

Business

Tabular

Beginner

NLP

Multiclass

Classification



Free

About

This dataset comprises 2,225 pre-processed text documents sourced from the BBC News website, covering stories published in 2004 and 2005. The original collection is organised into five topical areas: business, entertainment, politics, sport, and technology. The data has passed through a three-stage pre-processing pipeline. Stage one extracts metadata from the files into a single CSV. Stage two cleans and compresses the text content into a second CSV. Stage three processes the English text with spaCy, applying stop-word removal, lemmatisation, and Named Entity Recognition (NER). Each stage builds on the output of the previous one and persists its results to a new CSV file.

Columns

The dataset includes metadata columns extracted from the original files:
  • DocType: The class label, derived from the name of the folder that contained the original document.
  • DocId: An identifier formed from the first character of the document type combined with the file name.
  • FileSize: The size of the original file, measured in bytes.
  • FilePath: The relative path to the original document within its dataset.
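
The four metadata columns above could be derived roughly as follows. This is a minimal sketch, not the pipeline actually used to build the dataset: the function name `extract_metadata`, the folder layout `<root>/<DocType>/<file>.txt`, and the exact DocId format (first character of DocType directly followed by the file name without its extension) are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def extract_metadata(root):
    """Return one metadata row per .txt file found under root/<DocType>/."""
    rows = []
    for path in sorted(Path(root).glob("*/*.txt")):
        doc_type = path.parent.name  # the folder name serves as the class label
        rows.append({
            "DocType": doc_type,
            # Assumed format: first character of DocType + file name without extension.
            "DocId": doc_type[0] + path.stem,
            "FileSize": path.stat().st_size,  # size in bytes
            "FilePath": path.relative_to(root).as_posix(),
        })
    return rows


# Demonstration on a throwaway folder tree (hypothetical file names and contents).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder, name, text in [("business", "001.txt", "Profits rise"),
                               ("sport", "001.txt", "Late winner")]:
        (root / folder).mkdir()
        (root / folder / name).write_text(text, encoding="ascii")
    rows = extract_metadata(root)

for row in rows:
    print(row)
```

Writing `rows` out with `csv.DictWriter` would give a single metadata CSV of the kind described for the first pre-processing stage.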

Distribution

The dataset is provided as CSV files produced by the pre-processing stages and contains 2,225 unique documents. The original articles were stored in five folders, one per content type. In the listing's summary of document types, approximately 23% of the documents are Sport, another 23% are Business, and the remaining 54% are grouped as 'Other', which covers entertainment, politics, and technology. The original file sizes vary widely, from 503 bytes to over 25,000 bytes.
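
The class shares quoted above can be checked against the metadata CSV with a few lines of standard-library Python. The sketch below assumes a `DocType` column as described under Columns; the function name `class_shares` and the four sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def class_shares(csv_file):
    """Return {DocType: percentage of rows} for a metadata CSV file object."""
    counts = Counter(row["DocType"] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Tiny made-up sample standing in for the real metadata CSV.
sample = io.StringIO(
    "DocType,DocId,FileSize\n"
    "sport,s001,1200\n"
    "business,b001,2500\n"
    "politics,p001,3100\n"
    "sport,s002,900\n"
)
shares = class_shares(sample)
print(shares)
```

To run it on the real data, pass an open file handle for the metadata CSV in place of `sample`.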

Usage

This dataset is suited to Natural Language Processing (NLP) tasks. Its main use is developing and evaluating multi-class text classification models that assign news articles to one of the five topics. It can also support text analysis, information retrieval, and linguistic studies.
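
As one illustration of the multi-class classification use case, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing using only the standard library. Naive Bayes is chosen here purely as a simple baseline, not because the listing prescribes it. The `train` and `predict` helpers and the six training sentences are hypothetical, and whitespace tokenisation is assumed to be adequate because the dataset text is already lemmatised and stripped of stop words.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: iterable of (label, text). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training sentences; real use would read label/text pairs from the CSVs.
training_docs = [
    ("sport", "striker scores late goal in cup final"),
    ("sport", "coach praises team after match win"),
    ("business", "shares fall as profits miss forecast"),
    ("business", "bank reports strong quarterly profits"),
    ("tech", "new software update fixes security flaw"),
    ("tech", "phone maker unveils faster broadband chip"),
]
model = train(training_docs)
print(predict(model, "profits rise at bank"))
print(predict(model, "team win the cup final"))
```

On the full dataset, a held-out test split would be needed to judge how well a model of this kind separates the five topics.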

Coverage

The dataset covers news stories published on the BBC News website in 2004 and 2005. Although the source is a single outlet, the topics covered have global relevance. The listing gives the dataset's region as global.

License

CC0

Who Can Use It

This dataset is beneficial for a wide range of users, including:
  • Data scientists looking to build and test text classification models.
  • Machine learning engineers seeking structured text data for NLP projects.
  • Researchers in linguistics, media studies, or computational social science.
  • Students, especially those new to data science, NLP, or machine learning, since the listing tags the dataset as suitable for beginners.

Dataset Name Suggestions

  • BBC News Articles 2004-2005
  • Preprocessed BBC News Dataset
  • BBC Text Classification Dataset
  • Historical BBC News Corpus

Attributes

Original Data Source: BBC Full Text Preprocessed

Listing Stats

VIEWS

1

DOWNLOADS

0

LISTED

27/06/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
