Preprocessed BBC News Dataset
Free
About
This dataset comprises 2225 pre-processed text documents originally sourced from the BBC news website, covering stories published between 2004 and 2005. The original collection was organised into five distinct topical areas: business, entertainment, politics, sport, and technology. The dataset has undergone a three-stage pre-processing pipeline to enhance its usability. This involved extracting metadata from the files into a single CSV, cleaning and compressing the text content into another CSV, and finally, processing the English language text using spaCy for tasks such as stop-word removal, lemmatisation, and Named Entity Recognition (NER). Each successive stage refines and builds upon the data from the previous stage, persisting it into new CSV files.
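The first stage of the pipeline (extracting file metadata into a CSV) can be sketched as follows. The folder layout and column names are taken from this card; the function name and exact DocId convention are assumptions for illustration.

```python
import csv
import os

def extract_metadata(root_dir, out_csv):
    """Walk the five topic folders and record one metadata row per document.

    DocType is taken from the folder name, DocId from the first character of
    the document type plus the file name (an assumed convention), FileSize
    from the file's byte count, and FilePath is relative to the dataset root.
    """
    rows = []
    for doc_type in sorted(os.listdir(root_dir)):
        type_dir = os.path.join(root_dir, doc_type)
        if not os.path.isdir(type_dir):
            continue
        for name in sorted(os.listdir(type_dir)):
            path = os.path.join(type_dir, name)
            rows.append({
                "DocType": doc_type,
                "DocId": doc_type[0] + name,
                "FileSize": os.path.getsize(path),
                "FilePath": os.path.relpath(path, root_dir),
            })
    # Persist the extracted metadata, as the pipeline does after each stage.
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["DocType", "DocId", "FileSize", "FilePath"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The later stages would read this CSV back, clean and compress the text into a second CSV, and then run spaCy for stop-word removal, lemmatisation, and NER before persisting a third.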
Columns
The dataset includes metadata columns extracted from the original files:
- DocType: This represents the class-label, derived from the folder name that contained the original documents.
- DocId: An identifier formed from the first character of the document type combined with the file name.
- FileSize: The size of the original file, measured in bytes.
- FilePath: The relative path to the original document within its dataset.
Distribution
The dataset is provided in CSV format after its pre-processing stages. It contains 2225 unique documents. The original articles were structured across five distinct folders corresponding to their content type. Analysis of the document types reveals that approximately 23% of the documents pertain to Sport, another 23% to Business, and the remaining 54% are categorised as 'Other', encompassing entertainment, politics, and technology topics. The file sizes of the original documents vary significantly, ranging from 503 bytes to over 25,000 bytes.
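The split quoted above can be reproduced from the DocType column. A minimal standard-library sketch, which folds entertainment, politics, and technology into 'Other' as in the summary (the CSV path and lower-case labels are assumptions):

```python
import csv
from collections import Counter

def doctype_distribution(meta_csv):
    """Return the rounded percentage share of Sport, Business, and 'other'."""
    counts = Counter()
    with open(meta_csv, newline="") as f:
        for row in csv.DictReader(f):
            label = row["DocType"]
            # Fold the three smaller classes into 'other', as on this card.
            counts[label if label in ("sport", "business") else "other"] += 1
    total = sum(counts.values())
    return {k: round(100 * v / total) for k, v in counts.items()}
```

On the full dataset this should come out at roughly 23% sport, 23% business, and 54% other, matching the figures above.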
Usage
This dataset is ideally suited for Natural Language Processing (NLP) tasks. It is particularly useful for developing and evaluating multi-class text classification models, allowing users to train algorithms to categorise news articles into specific topics. Researchers and practitioners can also leverage this dataset for various text analysis, information retrieval, and linguistic studies.
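As an illustration of the multi-class classification use case, here is a minimal TF-IDF plus linear-model pipeline. scikit-learn is an assumption (the card does not prescribe a toolkit), and the toy texts below stand in for the cleaned article text and DocType labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the pre-processed article text and their DocType labels.
texts = [
    "the club won the league final",
    "shares fell as profits dropped",
    "the minister announced a new bill",
    "the striker scored twice in the match",
    "the firm reported record quarterly revenue",
    "parliament debated the election campaign",
]
labels = ["sport", "business", "politics", "sport", "business", "politics"]

# TF-IDF features feeding a multi-class logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

prediction = model.predict(["the team lost the cup match"])[0]
```

With the real dataset, `texts` and `labels` would come from the processed-text CSV, and a held-out split would be used to evaluate the classifier.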
Coverage
The dataset's content covers news stories from the BBC news website published during the 2004-2005 period. While the source is specific to the BBC, the topics covered have global relevance. The dataset itself is listed as being available for a global region.
License
CC0
Who Can Use It
This dataset is beneficial for a wide range of users, including:
- Data scientists looking to build and test text classification models.
- Machine learning engineers seeking structured text data for NLP projects.
- Researchers in linguistics, media studies, or computational social science.
- Students new to data science, NLP, or machine learning, as the dataset is categorised as beginner-friendly.
Dataset Name Suggestions
- BBC News Articles 2004-2005
- Preprocessed BBC News Dataset
- BBC Text Classification Dataset
- Historical BBC News Corpus
Attributes
Original Data Source: BBC Full Text Preprocessed