Opendatabay APP

News Content Analysis Dataset

Entertainment & Media Consumption

Tags and Keywords

News

Text

Classification

NLP

Languages

Free

About

This dataset contains BBC News articles, each with its full text and a category label such as sport or business. An updated version of the dataset is included, adding further fields such as text complexity (readability) scores and generated text summaries. It supports tasks such as content classification and text summarisation.

Columns

  • text: The full content of the news article.
  • labels: The primary category assigned to the article, such as sport, business, or other.
  • article type: An identifier for the type of article.
  • no_sentences: The total number of sentences present in the article's text.
  • Flesch Reading Ease Score: A metric indicating the readability of the text based on the Flesch scale.
  • Dale-Chall Readability Score: Another score assessing the readability of the text using the Dale-Chall formula.
  • text_rank_summary: A summary of the article generated using the TextRank algorithm.
  • lsa_summary: A summary of the article generated using Latent Semantic Analysis (LSA).
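
As a minimal sketch of how these columns might be read, the snippet below checks a CSV header against the names listed above and converts the numeric fields. It assumes the file header uses exactly these names, which the listing does not confirm; `read_articles` and `make_sample` are hypothetical helpers, and the two sample rows are made up for illustration.

```python
import csv
import io

# Column names as given in the listing; the real CSV header may differ.
EXPECTED_COLUMNS = [
    "text", "labels", "article type", "no_sentences",
    "Flesch Reading Ease Score", "Dale-Chall Readability Score",
    "text_rank_summary", "lsa_summary",
]
NUMERIC_COLUMNS = ["Flesch Reading Ease Score", "Dale-Chall Readability Score"]


def read_articles(handle):
    """Read rows from an open CSV handle after validating the header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        # csv yields strings, so numeric fields are converted explicitly.
        row["no_sentences"] = int(float(row["no_sentences"]))
        for name in NUMERIC_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


def make_sample():
    """Build an in-memory CSV with two invented rows (not real data)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(EXPECTED_COLUMNS)
    writer.writerow(["Shares rose today.", "business", "a", "1",
                     "70.1", "7.2", "Shares rose.", "Shares rose."])
    writer.writerow(["The team won. Fans cheered.", "sport", "b", "2",
                     "95.0", "5.1", "The team won.", "Fans cheered."])
    buf.seek(0)
    return buf


articles = read_articles(make_sample())
print(articles[0]["labels"], articles[0]["no_sentences"])
```

With the real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the download was saved.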

Distribution

The dataset is typically provided as a CSV file. The listing reports the following value counts and distributions:
  • labels: 2127 unique values, with approximately 24% classified as 'sport', 24% as 'business', and 53% falling into 'Other' categories.
  • article type: 2072 unique values.
  • Flesch Reading Ease Score: counts vary widely by range, with 1,895 articles scoring between 4.00 and 28.80 and only 4 in the 128.00 - 152.80 range.
  • Dale-Chall Readability Score: scores concentrate mostly between 55.15 and 60.59 (494 instances).
  • no_sentences: a large cluster falls between 8.74 and 9.52 sentences (849 instances).

Total row/record counts are not available in the provided details.
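
Figures of this kind can be recomputed from the CSV once it is loaded. The following is a rough sketch, not the marketplace's own method: `label_shares` and `histogram` are hypothetical helpers, and the input lists are invented. Rounding each share independently can make the total differ slightly from 100, as with the 24/24/53 split quoted above.

```python
from collections import Counter


def label_shares(labels, keep=("sport", "business")):
    """Rounded percentage per label; labels outside `keep` become 'Other'."""
    counts = Counter(label if label in keep else "Other" for label in labels)
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


def histogram(values, bins):
    """Count values in `bins` equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin `bins`; clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Invented inputs for illustration only.
shares = label_shares(["sport", "business", "tech", "politics"])
buckets = histogram([0, 1, 2, 3, 4, 10], bins=2)
print(shares)
print(buckets)
```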

Usage

This dataset suits applications in natural language processing and machine learning. Key use cases include:
  • Developing and training text classification models to categorise news articles.
  • Creating and evaluating text summarisation algorithms.
  • Analysing text complexity and readability across different news articles.
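
For the readability use case, the Flesch Reading Ease score can be approximated directly from raw text with the standard formula 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below assumes a crude vowel-group syllable counter and simple regex tokenisation; both are illustrative choices, and the listing does not say how the dataset's own scores were computed, so results need not match the CSV values.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Heuristic: a trailing 'e' is usually silent in multi-syllable words.
    if lowered.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple = flesch_reading_ease("The cat sat on the mat.")
dense = flesch_reading_ease(
    "Multinational corporations consolidated international operations."
)
print(round(simple, 3), simple > dense)
```

Comparing scores computed this way against the 'Flesch Reading Ease Score' column is one way to check which conventions the dataset follows.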

Coverage

The data originates from BBC News and reflects article content as published there. The listing is global in scope and mentions no specific geographic or demographic limitations.

License

CC0

Who Can Use It

This dataset is well-suited for a broad range of users involved in data science, artificial intelligence, and machine learning projects. Examples include:
  • Data scientists and AI/ML practitioners: For training models in text classification, natural language processing, and text summarisation.
  • Researchers: Studying language patterns, readability, and content analysis in news media.
  • Developers: Building applications that require automated categorisation or summarisation of textual content.

Dataset Name Suggestions

  • BBC News Article Text & Readability Dataset
  • BBC Article Classification & Summarisation Data
  • News Content Analysis Dataset
  • Text Complexity and NLP Dataset

Attributes

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 16/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0