Japanese & English News Archive
Entertainment & Media Consumption
Free
About
This dataset contains news articles collected from Japanese newspaper websites, combining content from current sources with the Old Newspapers dataset. It includes articles in both Japanese and English, covering roughly two decades (2001 to 2021), which makes it a useful resource for analysing media consumption and societal trends. It is particularly relevant to those interested in entertainment, media, and political discourse within Japan and related international contexts, and it can serve as a basis for analytical and machine learning work.
Columns
The dataset includes the following columns for each news article:
- source: The specific newspaper site from which the article was retrieved.
- date: The publication date of the article.
- title: The headline or title of the news article.
- author: The name of the article's author.
- text: The full body text of the news article.
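As a minimal sketch of how the five columns above might be handled in code, the following reads records into a small data class. It assumes a CSV export with a header row, which the listing does not confirm; the `Article` class, the `read_articles` helper, and the two sample rows are hypothetical and exist only for illustration.

```python
import csv
import io
from dataclasses import dataclass

# Column names as listed in the dataset description.
COLUMNS = ("source", "date", "title", "author", "text")


@dataclass
class Article:
    source: str
    date: str  # kept as a raw string; the date format is not documented
    title: str
    author: str
    text: str


def read_articles(handle):
    """Yield Article records from a CSV file-like object with a header row."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        yield Article(**{name: row[name] for name in COLUMNS})


# Hypothetical two-row sample, invented purely for illustration.
SAMPLE_CSV = (
    "source,date,title,author,text\n"
    "example-site-a,2010-05-01,Sample headline,Jane Doe,Body text here.\n"
    "example-site-b,2015-11-20,見出しの例,,本文の例。\n"
)

articles = list(read_articles(io.StringIO(SAMPLE_CSV)))
```

Keeping `date` as a string avoids guessing a format; an empty `author` field, as in the second sample row, is passed through unchanged.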
Distribution
The dataset is structured into two main subcorpora based on language. The Japanese subcorpus contains 312,954 texts sourced from 21 different newspaper sites. The English subcorpus comprises 36,766 texts from 2 newspaper sites. Specific file formats for the dataset are not detailed in the available information.
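The figures above imply a strongly Japanese-weighted corpus. The short calculation below, using only the counts stated in this section, derives the combined size, each language's share, and the mean number of texts per site; the variable names are illustrative.

```python
# Counts as stated in the Distribution section.
JA_TEXTS, JA_SITES = 312_954, 21
EN_TEXTS, EN_SITES = 36_766, 2

total = JA_TEXTS + EN_TEXTS          # combined number of texts
ja_share = JA_TEXTS / total          # fraction of texts in Japanese
en_share = EN_TEXTS / total          # fraction of texts in English
ja_per_site = JA_TEXTS / JA_SITES    # mean texts per Japanese site
en_per_site = EN_TEXTS / EN_SITES    # mean texts per English site

print(f"total texts: {total}")
print(f"Japanese share: {ja_share:.1%}, English share: {en_share:.1%}")
print(f"mean texts per site: ja={ja_per_site:.0f}, en={en_per_site:.0f}")
```

The roughly nine-to-one imbalance is worth keeping in mind when sampling for bilingual model training or when comparing the two subcorpora.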
Usage
This dataset is ideal for a variety of applications, including:
- Natural Language Processing (NLP) research and model training, especially for Japanese and English text.
- Trend analysis in news reporting and media coverage over time.
- Political science research to study discourse and policy changes reflected in news.
- Sentiment analysis of public opinion as expressed in news articles.
- Historical research on media representation and societal events in Japan.
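As one small illustration of the trend-analysis use case, the sketch below counts articles per publication year. It assumes the `date` column holds ISO-style `YYYY-MM-DD` strings, which the listing does not state; the `articles_per_year` helper and the sample dates are hypothetical.

```python
from collections import Counter


def articles_per_year(dates):
    """Count articles per year from strings assumed to start with 'YYYY'."""
    return Counter(d[:4] for d in dates if len(d) >= 4 and d[:4].isdigit())


# Invented sample values; the empty string stands in for a missing date.
sample_dates = ["2005-07-14", "2005-12-01", "2011-03-11", "", "2021-10-30"]
counts = articles_per_year(sample_dates)

for year in sorted(counts):
    print(year, counts[year])
```

Values that do not start with four digits are skipped rather than guessed at, so the yearly totals may undercount if the real date format differs.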
Coverage
The dataset's coverage spans:
- Geographic Scope: Primarily Japan, with articles drawn from Japanese newspaper sites. The English articles may offer broader, international perspectives relating to Japan.
- Time Range: The Japanese articles cover the period from July 2005 to October 2021. The English articles span from January 2001 to December 2021.
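The two time ranges above can be encoded as a simple validity check, for example to flag records whose dates fall outside the documented coverage. This is a sketch under the assumption that coverage is inclusive of whole months, since exact start and end days are not given; the `COVERAGE` table and both helper functions are illustrative names.

```python
from datetime import date

# Documented coverage at month granularity: (year, month), inclusive.
COVERAGE = {
    "ja": ((2005, 7), (2021, 10)),   # July 2005 to October 2021
    "en": ((2001, 1), (2021, 12)),   # January 2001 to December 2021
}


def in_coverage(language, published):
    """Return True if the date falls within the documented range."""
    start, end = COVERAGE[language]
    return start <= (published.year, published.month) <= end


def months_covered(language):
    """Number of calendar months in the documented range, inclusive."""
    (y0, m0), (y1, m1) = COVERAGE[language]
    return (y1 - y0) * 12 + (m1 - m0) + 1


print(in_coverage("ja", date(2005, 6, 30)))   # just before the Japanese range
print(in_coverage("en", date(2001, 1, 1)))    # first day of the English range
print(months_covered("ja"), months_covered("en"))
```

Comparing `(year, month)` tuples keeps the check independent of the unknown day-level boundaries.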
License
CC0
Who Can Use It
This dataset is intended for a diverse range of users, including:
- Researchers in linguistics, media studies, political science, and history.
- Data scientists and machine learning engineers developing NLP models, particularly for sentiment analysis, topic modelling, and text generation.
- Journalists and media analysts seeking to understand shifts in news reporting and public discourse.
- Educational institutions for teaching and research purposes in data science and humanities.
Dataset Name Suggestions
- Japanese & English News Archive
- Japan News Article Corpus 2001-2021
- Historical Japanese Media Data
- Bilingual News Dataset (Japan)
- East Asian News Articles
Attributes
Original Data Source: Japanese Newspapers 2001-2021