Turkish Media Archive Dataset
Entertainment & Media Consumption
Free
About
This dataset presents a substantial collection of 2415 articles sourced from the official website of the Sabah newspaper, a prominent Turkish publication. The articles span a five-year period, from 2017 to 2021, and were penned by 204 unique authors. Formatted as a CSV file, this resource is highly suitable for various text analysis and natural language processing (NLP) applications focusing on the Turkish language. It also includes a dedicated stop-word file, specifically prepared for the Zemberek Turkish natural language processing library, containing 1797 words to aid in text pre-processing.
Columns
- date: The publication date of each article, written out as Turkish text rather than in a machine-readable date format.
- author: Contains the name of the individual author responsible for writing the article.
- title: Represents the heading or title of the news article.
- link: Provides the direct web address (URL) to the original article as published on the Sabah newspaper's website.
- text: Encompasses the full body content of the news article.
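Because the date column holds Turkish text, it usually has to be parsed before any time-based analysis. The following is a minimal sketch using only the Python standard library. The exact date layout is not documented here, so the "day month-name year" pattern (for example "5 Mart 2019") is an assumption, and the sample row, `read_articles`, and `parse_turkish_date` are hypothetical names introduced for illustration.

```python
import csv
import io
from datetime import date

# Turkish month names mapped to month numbers.
TURKISH_MONTHS = {
    "ocak": 1, "şubat": 2, "mart": 3, "nisan": 4, "mayıs": 5, "haziran": 6,
    "temmuz": 7, "ağustos": 8, "eylül": 9, "ekim": 10, "kasım": 11, "aralık": 12,
}


def turkish_lower(text):
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def parse_turkish_date(text):
    """Find a 'day month-name year' run (assumed layout); return None if absent."""
    tokens = text.replace(",", " ").split()
    for day, month, year in zip(tokens, tokens[1:], tokens[2:]):
        month_number = TURKISH_MONTHS.get(turkish_lower(month))
        if month_number and day.isdigit() and year.isdigit():
            try:
                return date(int(year), month_number, int(day))
            except ValueError:
                return None  # e.g. day 31 in a 30-day month
    return None


def read_articles(handle):
    """Read the five-column CSV and attach a 'parsed_date' field to each row."""
    rows = []
    for row in csv.DictReader(handle):
        row["parsed_date"] = parse_turkish_date(row.get("date") or "")
        rows.append(row)
    return rows


# Made-up sample row that follows the documented column names.
SAMPLE = (
    "date,author,title,link,text\n"
    '"5 Mart 2019",Example Author,Example title,https://example.com/1,Example body\n'
)
articles = read_articles(io.StringIO(SAMPLE))
print(articles[0]["parsed_date"])  # 2019-03-05
```

To run this against the real file, pass an open file handle instead of the sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points to the downloaded CSV. Rows whose dates do not match the assumed layout get `None`, which makes it easy to count how many need a different parser.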
Distribution
- Format: The dataset is provided as a CSV file.
- Size: It comprises 2415 individual articles (rows) and 5 distinct columns.
- Structure:
- There are 204 unique authors contributing to the dataset.
- The collection features 2405 unique article titles.
- Each article has a unique link, totalling 2415 unique links.
- The dataset contains 2413 unique article texts, so only two of the 2415 rows repeat a text found elsewhere in the dataset.
- A separate stop-word.txt file, consisting of 1797 words, is included for use with the Zemberek Turkish natural language processing library.
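The row and uniqueness counts listed above can be checked after download. The sketch below, which uses only the Python standard library, counts rows and distinct values per column. The function name `summarize` and the sample rows are hypothetical, and the column names are taken from the Columns section.

```python
import csv
import io

EXPECTED_COLUMNS = ["date", "author", "title", "link", "text"]


def summarize(handle):
    """Count rows and distinct values per column in the articles CSV."""
    unique = {name: set() for name in EXPECTED_COLUMNS}
    row_count = 0
    for row in csv.DictReader(handle):
        row_count += 1
        for name in EXPECTED_COLUMNS:
            unique[name].add(row.get(name))
    summary = {"rows": row_count}
    for name, values in unique.items():
        summary["unique_" + name] = len(values)
    return summary


# Made-up rows: two authors, one repeated title, one repeated text.
SAMPLE = (
    "date,author,title,link,text\n"
    "5 Mart 2019,Author A,Title 1,https://example.com/1,Body one\n"
    "6 Mart 2019,Author A,Title 2,https://example.com/2,Body two\n"
    "7 Mart 2019,Author B,Title 2,https://example.com/3,Body two\n"
)
summary = summarize(io.StringIO(SAMPLE))
print(summary["rows"], summary["unique_author"], summary["unique_link"])  # 3 2 3
```

On the real file, the figures to compare against are 2415 rows, 204 unique authors, 2405 unique titles, 2415 unique links, and 2413 unique texts.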
Usage
This dataset is an ideal resource for a variety of applications, including:
- Text Classification: Training models to categorise Turkish news articles.
- Natural Language Processing (NLP): Developing, testing, and refining NLP models tailored for the Turkish language.
- Sentiment Analysis: Analysing the emotional tone and public opinion within Turkish news content.
- Author Attribution: Investigating and identifying authors based on their unique writing styles.
- Trend Analysis: Monitoring evolving themes and topics covered in Turkish media over time.
- Topic Modelling: Discovering and extracting key thematic structures from a large corpus of news articles.
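Most of these tasks start with the same pre-processing step: tokenising the article text and dropping stop words. The sketch below illustrates that step with a plain regular-expression tokeniser from the Python standard library. It is not Zemberek, which the stop-word file was prepared for. It also assumes the stop-word file lists one word per line, which is not confirmed here. The names `load_stop_words` and `term_frequencies` are hypothetical.

```python
import re
from collections import Counter


def turkish_lower(text):
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (assumed one word per line)."""
    return {turkish_lower(line.strip()) for line in lines if line.strip()}


def term_frequencies(texts, stop_words):
    """Count letter-only tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of Unicode letters, including ç, ğ, ı, ö, ş, ü.
        tokens = re.findall(r"[^\W\d_]+", turkish_lower(text))
        counts.update(token for token in tokens if token not in stop_words)
    return counts


# Made-up stop words and texts for illustration.
stop_words = load_stop_words(["ve\n", "bir\n", "\n"])
counts = term_frequencies(["Ekonomi ve spor", "Bir ekonomi haberi"], stop_words)
print(counts["ekonomi"])  # 2
```

Note that this simple tokeniser splits suffixed proper nouns at the apostrophe (for example "Türkiye'nin" becomes two tokens) and does no stemming. A morphological analyser such as Zemberek would handle Turkish suffixes far better. The resulting counts can feed a bag-of-words classifier, a topic model, or per-year trend comparisons.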
Coverage
- Geographic: The articles originate from the Sabah newspaper, making the primary geographic focus Turkey.
- Time Range: The dataset covers a five-year period, with articles published between 2017 and 2021.
- Demographic: The content is sourced from articles written by 204 different authors.
License
CC0
Who Can Use It
This dataset is particularly beneficial for:
- Academic Researchers: Conducting studies in linguistics, media studies, political science, or social sciences related to Turkish content.
- Data Scientists: Building and deploying machine learning models for text processing and analysis.
- NLP Engineers and Developers: Enhancing or creating tools and applications for Turkish language understanding.
- Media Analysts: Gaining insights into news coverage, editorial focus, and public discourse in Turkey.
- Students: Utilising a real-world dataset for educational projects in data analysis, linguistics, and machine learning.
Dataset Name Suggestions
- Sabah Newspaper Articles: 2017-2021
- Turkish News Article Dataset (Sabah)
- Turkish Newspaper Content 2017-2021
- Sabah Daily Articles Corpus
- Turkish Media Archive
Attributes
Original Data Source: Turkish News Article