Al Jazeera News Articles Dataset
Entertainment & Media Consumption
Free
About
This dataset features news articles gathered from Al Jazeera through a web scraping process. It is designed for various analytical and natural language processing applications. The collection primarily covers news content from categories such as Science & Technology, Economics, and Sports. The scraping code was developed in late 2022, and users may need to update it to accommodate any changes in the Al Jazeera website's structure.
Columns
- category: This column contains categorical data, represented as strings, indicating the news topic or section.
- title: This column holds the title of each news article, also as string data.
- text: This column contains the full text of the article as string data. Newline characters within the text have been replaced with \\n so that line breaks are preserved and not misinterpreted when the data is saved, particularly in CSV format.
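As a minimal sketch of how the escaped line breaks might be restored after loading, the snippet below reads a small in-memory CSV with the three columns listed above. The sample rows, the function name load_articles, and the exact escape marker (a literal backslash followed by n) are assumptions for illustration; if the file uses a different marker, pass it in through the marker parameter.

```python
import csv
import io

# Hypothetical two-row sample in the column layout described above.
# The literal backslash-n marker is an assumption about how newlines were escaped.
SAMPLE_CSV = (
    "category,title,text\n"
    'Sports,Example title,"First paragraph.\\nSecond paragraph."\n'
    "Economics,Another title,Single paragraph only.\n"
)


def load_articles(handle, marker="\\n"):
    """Read rows from a CSV handle and turn the escape marker back into newlines."""
    rows = []
    for row in csv.DictReader(handle):
        row["text"] = row["text"].replace(marker, "\n")
        rows.append(row)
    return rows


articles = load_articles(io.StringIO(SAMPLE_CSV))
print(articles[0]["text"])
```

For a file on disk, the same function should work with a handle opened via open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.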
Distribution
The dataset is provided as a CSV file. The total row count is not stated, but the column summary reports one unique category value, 1409 unique article titles, and 1413 unique article texts, so the file contains at least 1413 rows.
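Since the row count is not published, the figures above can be checked against a downloaded copy. The following is a rough sketch, assuming the three column names listed under Columns; the function name unique_counts and the sample rows are hypothetical.

```python
import csv
import io


def unique_counts(handle, columns=("category", "title", "text")):
    """Return the row count and the number of distinct values per column."""
    seen = {name: set() for name in columns}
    total = 0
    for row in csv.DictReader(handle):
        total += 1
        for name in columns:
            seen[name].add(row[name])
    return total, {name: len(values) for name, values in seen.items()}


# Hypothetical sample: two rows share a title but differ in text.
sample = io.StringIO(
    "category,title,text\n"
    "Sports,Match report,Text A\n"
    "Sports,Match report,Text B\n"
    "Sports,Transfer news,Text C\n"
)
total, counts = unique_counts(sample)
print(total, counts)
```

A title count lower than the text count, as in the reported 1409 versus 1413, would simply mean that a few distinct articles share a headline.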
Usage
This dataset is ideal for a wide range of natural language processing (NLP) tasks, including text classification, sentiment analysis, topic modelling, and information extraction. It can be particularly valuable for training machine learning models that require real-world news content for analysis.
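As one possible first step toward text classification or topic modelling, the sketch below counts the most frequent word tokens per category. It uses only the Python standard library; the function name top_terms_by_category, the minimum token length, and the sample rows are assumptions for illustration rather than part of the dataset.

```python
import re
from collections import Counter, defaultdict


def top_terms_by_category(rows, n=3, min_length=4):
    """Count lowercase word tokens per category and return the n most common."""
    counts = defaultdict(Counter)
    for row in rows:
        tokens = re.findall(r"[a-z]+", row["text"].lower())
        # Skip very short tokens as a crude stand-in for stop-word removal.
        counts[row["category"]].update(t for t in tokens if len(t) >= min_length)
    return {
        category: [term for term, _ in counter.most_common(n)]
        for category, counter in counts.items()
    }


# Hypothetical rows in the dataset's column layout.
rows = [
    {"category": "Sports", "title": "t1",
     "text": "The team won the match. The team celebrated."},
    {"category": "Economics", "title": "t2",
     "text": "Inflation rose as markets fell. Inflation worries grew."},
]
result = top_terms_by_category(rows, n=1)
print(result)
```

A real pipeline would likely replace the length filter with a proper stop-word list and feed the counts into a classifier or topic model.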
Coverage
The data consists of news articles scraped from Al Jazeera, a global news provider, so the coverage is global rather than tied to a single region. The articles were collected using code developed in November and December 2022. The collection focuses on Science & Technology, Economics, and Sports, and the scraping code can be adapted to collect content from additional news categories.
License
CC-BY-NC
Who Can Use It
This dataset is particularly useful for researchers, data scientists, and developers involved in natural language processing, text mining, or media content analysis. Students and academics working on projects related to news data, classification, or large language models can also benefit from this resource.
Dataset Name Suggestions
- Al Jazeera News Articles Dataset
- Global News Text Corpus
- Al Jazeera Web Scraped News
- News Article NLP Dataset
Attributes
Original Data Source: Aljazeera News Dataset