Authentic vs. Fake News Corpus

News & Media Articles

Tags and Keywords: News, Fake, Classification, Articles, Data

About

This dataset, known as WELFake, is designed for fake news classification. It comprises 72,134 news articles: 35,028 real and 37,106 fake. It was built by merging four widely used news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to reduce classifier overfitting and to provide a larger text corpus for training machine learning models.

Columns

The dataset contains four key columns:

Serial number (Index): A unique identifier for each entry, starting from 0. This column has 72,134 valid entries.
Title: The headline of the news article. This column has 62,348 unique values and roughly 71,600 valid entries; about 1% of entries are missing.
Text: The main content of the news article. This column has roughly 62,700 unique values and 72,100 valid entries; fewer than 1% of entries are missing.
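Label: Indicates whether the news article is fake or real. The listing documents 0 as fake and 1 as real. The two classes hold 35,028 and 37,106 entries. Because the About section counts 35,028 real and 37,106 fake articles, check which label value carries which count in the file before relying on the encoding.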
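
The column statistics above (valid, missing, and unique counts, plus the label split) can be recomputed directly from the file. The following is a minimal sketch using only the Python standard library. The header names `title`, `text`, and `label` are assumptions, as are the helper names `profile_columns` and `profile_file`; adjust them to the header row of the downloaded CSV.

```python
import csv
from collections import Counter

# Article bodies can exceed the csv module's default per-field size limit.
csv.field_size_limit(10**9)


def profile_columns(rows, columns=("title", "text", "label")):
    """Count valid, missing, and unique values per column.

    `rows` is any iterable of dicts, for example a csv.DictReader.
    The column names are assumed, not taken from the listing.
    """
    stats = {c: {"valid": 0, "missing": 0, "unique": set()} for c in columns}
    labels = Counter()
    for row in rows:
        for c in columns:
            value = (row.get(c) or "").strip()
            if value:
                stats[c]["valid"] += 1
                stats[c]["unique"].add(value)  # keeps every distinct value in memory
            else:
                stats[c]["missing"] += 1
        labels[(row.get("label") or "").strip()] += 1
    summary = {
        c: {"valid": s["valid"], "missing": s["missing"], "unique": len(s["unique"])}
        for c, s in stats.items()
    }
    return summary, labels


def profile_file(path):
    """Profile a CSV file on disk (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        return profile_columns(csv.DictReader(f))
```

Calling `profile_file("WELFake_Dataset.csv")` returns the per-column summary and a counter of label values, which is one way to confirm which label value corresponds to which class count.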

Distribution

The dataset is provided as a single CSV file, WELFake_Dataset.csv, of approximately 245.09 MB. The CSV file contains 78,098 data entries, of which 72,134 are read into the data frame. Each record follows the four-column structure described above.
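
The listing does not explain the gap between the 78,098 entries in the file and the 72,134 rows in the data frame. One possible explanation, stated here as an assumption rather than a confirmed cause, is that some quoted text fields contain line breaks, so a raw line count overstates the number of records. The sketch below compares the two counts; `count_lines_and_records` is a hypothetical helper name.

```python
import csv

# Article bodies can exceed the csv module's default per-field size limit.
csv.field_size_limit(10**9)


def count_lines_and_records(path, encoding="utf-8"):
    """Return (physical_lines, parsed_records), both excluding the header."""
    # Count raw lines in the file, regardless of CSV quoting.
    with open(path, newline="", encoding=encoding) as f:
        physical_lines = max(sum(1 for _ in f) - 1, 0)
    # Count records as seen by a CSV parser, which joins quoted multi-line fields.
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        parsed_records = sum(1 for _ in reader)
    return physical_lines, parsed_records
```

If the two numbers differ for the downloaded file, the difference comes from records that span more than one physical line.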

Usage

This dataset is ideally suited for machine learning tasks related to fake news classification. It can be effectively used for:

Training and evaluating machine learning models for detecting fabricated news stories.
Developing and testing natural language processing (NLP) algorithms for text analysis in news contexts.
Research into journalistic integrity and information authenticity.
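
As an illustration of the first use case, here is a minimal baseline sketch: a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python so that it runs without extra dependencies. It is a stand-in for a library model, not a method taken from the dataset's authors; the tokenizer pattern and the names `tokenize`, `NaiveBayes`, and `accuracy` are assumptions made for this example.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")  # assumed tokenizer: lowercase letter runs


def tokenize(text):
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_docs}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        self.total_words = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen tokens
        best, best_score = None, -math.inf
        for c in self.class_docs:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_docs[c] / n_docs)
            for tok in tokens:
                score += math.log(
                    (self.word_counts[c][tok] + 1) / (self.total_words[c] + vocab_size)
                )
            if score > best_score:
                best, best_score = c, score
        return best


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the given label."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)
```

In practice the model would be fitted on one portion of the articles and scored with `accuracy` on a held-out portion, treating the label values as opaque class names.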

Coverage

The dataset is a compilation of articles from various sources, including Kaggle, McIntire, Reuters, and BuzzFeed Political news datasets. As such, it broadly covers news content. Specific geographic or time range details for the articles themselves are not explicitly provided within the dataset's description.

License

Creative Commons Attribution 4.0 International (CC BY 4.0)

Who Can Use It

This dataset is particularly valuable for:

Data Scientists: Building and refining fake news detection models.
Machine Learning Engineers: Implementing and deploying automated news classification systems.
Researchers: Studying misinformation, NLP, and computational social science.
Academics: Teaching data science and AI courses with a real-world text corpus.

Attributes

Original Data Source: Authentic vs. Fake News Corpus

Listing Stats

Listed: 14/07/2025
Region: Global
Quality: 5 / 5
Version: 1.0
Price: Free
