News Headlines Dataset
Entertainment & Media Consumption
Free
About
This dataset is a follow-up to the News Category Dataset, designed to give beginners an easy-to-use resource for natural language processing (NLP) tasks. It comprises approximately 45,500 news headlines collected from HuffPost between 2012 and 2018. The data has been cleaned and filtered and its target feature balanced, making it more accessible and manageable than the original. It is intended to help those new to NLP get started with real-world data.
Columns
- category: Indicates the category to which a news article belongs. This serves as the target column.
- headline: Contains the main headline of the news article.
- short_description: Provides a brief summary or description of the news article.
- links: Lists the URL links for the respective news articles.
- keywords: Features the primary keywords extracted from the URLs present in the original dataset. Please note that this column may contain null values.
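As a minimal sketch of how the columns above might be read, the following uses only the Python standard library. It assumes the file has a header row with exactly the column names listed and that null keywords appear as empty cells; the sample rows are made up for illustration, and `load_rows` is a hypothetical helper, not part of the dataset.

```python
import csv
import io

# Column names as described in the listing above.
EXPECTED_COLUMNS = ["category", "headline", "short_description", "links", "keywords"]


def load_rows(file_obj):
    """Read rows from an open CSV file, turning empty keywords into None."""
    reader = csv.DictReader(file_obj)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    rows = []
    for row in reader:
        # Assumption: a null keyword is stored as an empty cell in the CSV.
        if not (row["keywords"] or "").strip():
            row["keywords"] = None
        rows.append(row)
    return rows


# Tiny made-up sample standing in for the real file (illustrative values only).
sample = io.StringIO(
    "category,headline,short_description,links,keywords\n"
    "Sports,Example headline one,Example summary,https://example.com/a,example-one\n"
    "Travel,Example headline two,Example summary,https://example.com/b,\n"
)
rows = load_rows(sample)
print(rows[1]["keywords"])  # None
```

With the real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.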
Distribution
The dataset contains 45,500 records organised into 5 columns and is provided as a data file, typically in CSV format. Each target category contains 4,500 rows, giving a balanced distribution across news topics.
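The stated balance is easy to confirm once the file is loaded, and it is worth doing, since 4,500 rows in each of the ten categories named under Coverage would total 45,000 rather than 45,500. The sketch below assumes rows are dictionaries with a `category` key; `category_counts` and `is_balanced` are hypothetical helper names, and the rows shown are made up.

```python
from collections import Counter


def category_counts(rows):
    """Count how many rows fall under each category label."""
    return Counter(row["category"] for row in rows)


def is_balanced(counts):
    """True when every category has the same number of rows."""
    return len(set(counts.values())) <= 1


# Made-up rows standing in for the real file; only the category field matters here.
demo_rows = [
    {"category": "Sports"},
    {"category": "Travel"},
    {"category": "Sports"},
    {"category": "Travel"},
]
counts = category_counts(demo_rows)
print(sum(counts.values()), len(counts), is_balanced(counts))  # 4 2 True
```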
Usage
This dataset is ideal for beginners embarking on natural language processing projects. It is well-suited for tasks such as text classification, news categorisation, and general machine learning applications involving text data. It can be used for training models to predict news categories or to analyse trends in news headlines over time.
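To make the category-prediction task concrete, here is a toy multinomial Naive Bayes classifier over headline words, written in plain Python. It is a sketch for illustration only, not a method associated with the dataset; the training headlines are made up, and real projects would more likely use an established library and the full data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(examples):
    """Collect word and document counts from (headline, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of headlines
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training headlines; the real dataset would supply these pairs.
train = [
    ("Team wins the championship game", "Sports"),
    ("Star striker scores twice in final", "Sports"),
    ("Best beaches to visit this summer", "Travel"),
    ("Cheap flights and hotel tips for your trip", "Travel"),
]
model = train_naive_bayes(train)
print(predict(model, "Coach praises team after game"))        # Sports
print(predict(model, "Hotel and beaches for a summer trip"))  # Travel
```

Because the target column is balanced, plain accuracy on a held-out split is a reasonable first metric for a classifier like this.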
Coverage
The dataset covers news articles published between 2012 and 2018. The content is sourced from HuffPost and is global in scope. Categories include Business, Politics, Food & Drink, Travel, Parenting, Style & Beauty, Wellness, World News, Sports, and Entertainment.
License
CC0
Who Can Use It
This dataset is primarily intended for:
- Beginners in NLP: Provides a clean and balanced starting point for learning text-based machine learning.
- Students and Academics: Useful for educational purposes, assignments, and research in natural language processing and data science.
- Data Scientists and Developers: Can be used for prototyping and developing text classification models.
- Researchers: Suited to analysing news trends and category distribution over specific time periods.
Dataset Name Suggestions
- News Headlines Dataset 2012-2018
- HuffPost News Articles for NLP
- Cleaned News Category Dataset
- Beginner NLP News Data
- Multi-Category News Headlines
Attributes
Original Data Source: News Category Dataset