Authentic vs. Fake News Corpus
News & Media Articles
Free
About
This dataset, known as WELFake, is designed for fake news classification. It comprises 72,134 news articles, specifically 35,028 real news articles and 37,106 fake news articles. Its creation involved merging four widely-used news datasets—Kaggle, McIntire, Reuters, and BuzzFeed Political—to mitigate classifier overfitting and provide a larger text corpus for improved machine learning model training.
Columns
The dataset contains four key columns:
- Serial number (Index): A unique identifier for each entry, starting from 0. This column has 72,134 valid entries.
- Title: The heading or title of the news article. This column has 62,348 unique values and roughly 71,600 valid entries, with about 1% of entries missing.
- Text: The main content of the news article. This column has approximately 62,700 unique values and 72,100 valid entries, with a minimal number of missing entries (less than 1%).
- Label: Indicates whether the news article is fake or real. A label of 0 signifies fake news, while 1 signifies real news. Consistent with the totals given under About, there are 37,106 entries labelled as fake and 35,028 entries labelled as real.
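The column figures above (valid entries, missing values, label counts) can be checked against a local copy of the file. The following is a minimal sketch using only the Python standard library. The header names "title", "text" and "label" are assumptions, since the description names the columns only informally, and the helper names summarize and summarize_csv are hypothetical.

```python
import csv
from collections import Counter

# Assumed header names; adjust them to match the header row of the actual file.
TITLE_COL, TEXT_COL, LABEL_COL = "title", "text", "label"


def summarize(rows):
    """Count rows, blank titles, blank texts, and label frequencies.

    `rows` is any iterable of dicts, such as the rows yielded by csv.DictReader.
    """
    stats = {"rows": 0, "missing_title": 0, "missing_text": 0, "labels": Counter()}
    for row in rows:
        stats["rows"] += 1
        # Treat None, empty strings, and whitespace-only values as missing.
        if not (row.get(TITLE_COL) or "").strip():
            stats["missing_title"] += 1
        if not (row.get(TEXT_COL) or "").strip():
            stats["missing_text"] += 1
        stats["labels"][(row.get(LABEL_COL) or "").strip()] += 1
    return stats


def summarize_csv(path):
    """Stream the CSV row by row instead of loading the whole file at once."""
    # Article bodies can be longer than the csv module's default field limit.
    csv.field_size_limit(10**9)
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize(csv.DictReader(handle))
```

Calling summarize_csv("WELFake_Dataset.csv") on a downloaded copy returns a dictionary whose "labels" counter can be compared with the class totals quoted above.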
Distribution
The dataset is provided as a CSV file named WELFake_Dataset.csv, with a file size of approximately 245.09 MB. Although the CSV file contains 78,098 data entries, only 72,134 are accessed when it is loaded into a data frame. The file keeps a structured format with the four columns listed above.
Usage
This dataset is ideally suited for machine learning tasks related to fake news classification. It can be effectively used for:
- Training and evaluating machine learning models for detecting fabricated news stories.
- Developing and testing natural language processing (NLP) algorithms for text analysis in news contexts.
- Research into journalistic integrity and information authenticity.
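As an illustration of the first use case, the sketch below trains a small bag-of-words baseline. It is a from-scratch multinomial naive Bayes classifier with add-one smoothing, chosen only because it needs no third-party packages. It is not the method used by the dataset's authors, and the names tokenize, NaiveBayes and accuracy are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # The vocabulary is every token seen in any class during training.
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        tokens = [tok for tok in tokenize(text) if tok in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the given label."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)
```

In practice the texts and labels would come from the Text and Label columns, with a held-out portion reserved for the accuracy call. A library implementation with TF-IDF features would normally replace this baseline for serious experiments.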
Coverage
The dataset is a compilation of articles from various sources, including Kaggle, McIntire, Reuters, and BuzzFeed Political news datasets. As such, it broadly covers news content. Specific geographic or time range details for the articles themselves are not explicitly provided within the dataset's description.
License
Attribution 4.0 International (CC BY 4.0) license
Who Can Use It
This dataset is particularly valuable for:
- Data Scientists: For building and refining fake news detection models.
- Machine Learning Engineers: To implement and deploy automated news classification systems.
- Researchers: Studying misinformation, NLP, and computational social science.
- Academics: As a resource for educational purposes in data science and AI courses.
Dataset Name Suggestions
- Fake News Article Classifier Dataset
- Authentic vs. Fake News Corpus
- News Authenticity Judgement Dataset
Attributes
Original Data Source: Authentic vs. Fake News Corpus