NLP Vietnamese News (2022) Dataset
Social Media and Networking
Free
About
This dataset provides online articles extracted from the 25 most popular news sites in Vietnam, specifically collected during July 2022. It is particularly well-suited for Natural Language Processing tasks in the Vietnamese language. Online news platforms have become an unavoidable aspect of contemporary society due to their widespread accessibility and predominantly free content. Their influence on community thought and action is increasingly a concern for various groups, including legislators, content creators, and marketers. Beyond their effects, the content published in the news serves as a valuable reflection of public will, attention, and even cultural standards. In Vietnam, despite journalists facing considerable criticism in recent years, news outlets continue to garner substantial traffic compared to other information dissemination methods.
Columns
The dataset includes the following columns, offering detailed information about each news article:
- Unnamed: 0.1: An initial index column, possibly for internal tracking.
- Unnamed: 0: Another index column, similar to the first.
- id: A unique identification number for each article.
- author: The credited writer or source of the article's content.
- content: The full textual body of the news article.
- picture_count: The numerical count of images embedded within the article.
- processed: A flag indicating whether the article's content has undergone any processing steps.
- source: The specific news website or media organisation from which the article was obtained.
- title: The headline or main title of the news piece.
- topic: The subject area or category to which the news article belongs, such as 'Culture', 'Society', 'Sports', 'Business', or 'World'.
- url: The direct web address pointing to the original online article.
- crawled_at: The exact timestamp when the article was collected.
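As a rough illustration of how these columns might be read, the sketch below parses a CSV stream with Python's standard library and drops the two leftover index columns. The sample row is made up for demonstration and the function name load_articles is hypothetical; the fallback to 0 for an empty picture_count is an assumption, not something the dataset documents.

```python
import csv
import io

# Index columns left over from earlier exports; they carry no article information.
INDEX_COLUMNS = {"Unnamed: 0.1", "Unnamed: 0"}


def load_articles(stream):
    """Read article rows from a CSV text stream, dropping leftover index columns."""
    reader = csv.DictReader(stream)
    articles = []
    for row in reader:
        article = {k: v for k, v in row.items() if k not in INDEX_COLUMNS}
        # picture_count is described as a numeric count.
        # Assumption: treat an empty value as 0.
        article["picture_count"] = int(float(article.get("picture_count") or 0))
        articles.append(article)
    return articles


# Made-up sample using the documented header, standing in for the real file.
SAMPLE_CSV = (
    "Unnamed: 0.1,Unnamed: 0,id,author,content,picture_count,processed,"
    "source,title,topic,url,crawled_at\n"
    '0,0,1,Example Author,"Nội dung bài báo, ví dụ.",2,0,'
    "example-source,Tiêu đề ví dụ,Sports,https://example.com/a,2022-07-15 08:00:00\n"
)

articles = load_articles(io.StringIO(SAMPLE_CSV))
print(articles[0]["title"], articles[0]["picture_count"])
```

With the actual download, the same function could be given a file opened with encoding="utf-8" and newline="". Because the content column holds full article bodies, csv.field_size_limit may need to be raised if the reader reports an oversized field.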
Distribution
The dataset was originally in JSON format and has been converted to CSV for easier processing. A sample file is available on the platform. The exact number of rows is not specified; the articles were gathered from 25 prominent Vietnamese news websites.
Usage
This dataset is highly beneficial for a range of applications, including:
- Natural Language Processing (NLP) development: Particularly for models focused on the Vietnamese language.
- Exploratory Data Analysis: To uncover trends and patterns within online news content.
- Text Pre-processing exercises: Offering raw text suitable for cleaning, tokenisation, and other preparatory steps.
- Societal analysis: To understand prevailing public attention, cultural benchmarks, and interests reflected in news reporting.
- Media studies: For examining the influence of news outlets on communities, relevant for policy-makers, content creators, and market strategists.
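The exploratory analysis and pre-processing uses listed above could begin along the following lines. This is a minimal sketch under stated assumptions: the rows are made-up examples, the helper names normalize and syllable_tokens are hypothetical, and whitespace splitting yields Vietnamese syllables rather than full words, so a dedicated word segmenter would be needed for word-level work.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Convert to NFC so precomposed and combining diacritics compare equal, then lowercase."""
    return unicodedata.normalize("NFC", text).lower()


def syllable_tokens(text):
    """Return the runs of word characters in the normalised text (syllables, not words)."""
    return re.findall(r"\w+", normalize(text))


# Made-up rows carrying only the two columns this sketch uses.
rows = [
    {"topic": "Sports", "content": "Đội tuyển Việt Nam thắng trận."},
    {"topic": "Sports", "content": "Việt Nam vào chung kết."},
    {"topic": "Business", "content": "Thị trường chứng khoán tăng."},
]

# Exploratory step: how many articles fall under each topic.
topic_counts = Counter(row["topic"] for row in rows)

# Pre-processing step: syllable frequencies across all article bodies.
syllable_counts = Counter(
    token for row in rows for token in syllable_tokens(row["content"])
)

print(topic_counts.most_common())
print(syllable_counts.most_common(3))
```

NFC normalisation is applied first because the same Vietnamese letter can be encoded either as one precomposed character or as a base letter plus combining marks, and the two forms would otherwise be counted separately.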
Coverage
- Geographic Scope: The content primarily covers news and topics relevant to Vietnam, originating from Vietnamese news sites.
- Time Range: The articles were collected specifically during July 2022.
- Demographic Scope: While not explicitly demographic, the dataset reflects the general interests and concerns of the Vietnamese population as covered by their popular news media.
License
This dataset is released under the Open Database License (ODbL).
Who Can Use It
This dataset is ideal for:
- Data scientists and NLP professionals: For building and refining machine learning models that process Vietnamese text.
- Academic researchers: Conducting studies on media impact, public sentiment, or cultural shifts in Vietnam.
- Content strategists and marketers: Seeking insights into popular topics and consumer interests for content creation or campaign planning.
- Government officials and policy analysts: To monitor societal discourse and understand the broader impact of online information.
Dataset Name Suggestions
- Vietnamese Online News 2022 (July)
- Vietnam Digital Media Archive
- Vietnamese Language News Dataset
- Trending News in Vietnam
- July 2022 Vietnam News Corpus
Attributes
Original Data Source: Vietnamese Online News .csv dataset