Opendatabay APP

Turkish Media Archive Dataset

Entertainment & Media Consumption

Tags and Keywords

News, Text, Classification, NLP

Free

About

This dataset contains 2415 articles collected from the official website of Sabah, a prominent Turkish newspaper. The articles were published between 2017 and 2021 and were written by 204 unique authors. The data is provided as a single CSV file and is suited to text analysis and natural language processing (NLP) work on the Turkish language. A separate stop-word file of 1797 words, prepared for the Zemberek Turkish NLP library, is included to help with text pre-processing.

Columns

  • date: The publication date of each article, stored as Turkish-language text rather than as a machine-readable date.
  • author: The name of the author who wrote the article.
  • title: The headline of the article.
  • link: The URL of the original article on the Sabah newspaper's website.
  • text: The full body text of the article.
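
Because the date column holds Turkish text, it generally has to be parsed before any time-based analysis. The sketch below assumes values look like "12 Mart 2019" (day, Turkish month name, four-digit year), possibly with extra text around them; the listing does not confirm the exact format, so check a few rows first. The names parse_turkish_date and turkish_lower are illustrative and not part of the dataset.

```python
import re
from datetime import date
from typing import Optional

# Turkish month names mapped to month numbers.
TURKISH_MONTHS = {
    "ocak": 1, "şubat": 2, "mart": 3, "nisan": 4, "mayıs": 5, "haziran": 6,
    "temmuz": 7, "ağustos": 8, "eylül": 9, "ekim": 10, "kasım": 11, "aralık": 12,
}

# Assumed layout: "<day> <month name> <year>"; not confirmed by the listing.
_DATE_RE = re.compile(r"(\d{1,2})\s+(\S+)\s+(\d{4})")


def turkish_lower(s: str) -> str:
    """Lowercase using Turkish casing rules (I -> ı, İ -> i)."""
    return s.replace("I", "ı").replace("İ", "i").lower()


def parse_turkish_date(text: str) -> Optional[date]:
    """Return a date for the first '<day> <month> <year>' match, else None."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = TURKISH_MONTHS.get(turkish_lower(month_name))
    if month is None:
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:
        # Impossible calendar dates such as "31 Şubat 2020".
        return None
```

Returning None instead of raising makes it easy to count how many rows fail to parse, which is a quick way to find out whether the assumed format actually matches the file.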

Distribution

  • Format: The dataset is provided as a CSV file.
  • Size: It comprises 2415 individual articles (rows) and 5 distinct columns.
  • Structure:
    • There are 204 unique authors contributing to the dataset.
    • The collection has 2405 unique article titles, so 10 rows repeat a title used elsewhere.
    • Every article has its own link, giving 2415 unique links.
    • There are 2413 unique article texts, so only 2 rows repeat a text found elsewhere.
    • A separate stop-word.txt file, consisting of 1797 words, is included for use with the Zemberek Turkish natural language processing library.
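
The counts above can be checked against a downloaded copy. The following is a minimal sketch using only the standard library; the column names and listed counts come from this page, while the UTF-8 encoding and the function names summarise_rows, summarise_csv and mismatches are assumptions made for illustration.

```python
import csv
from typing import Dict, Iterable, Mapping, Tuple

COLUMNS = ["date", "author", "title", "link", "text"]

# Counts as stated in this listing.
LISTED_COUNTS = {
    "rows": 2415,
    "unique_author": 204,
    "unique_title": 2405,
    "unique_link": 2415,
    "unique_text": 2413,
}


def summarise_rows(rows: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count rows and distinct values per column."""
    unique = {col: set() for col in COLUMNS}
    n_rows = 0
    for row in rows:
        n_rows += 1
        for col in COLUMNS:
            unique[col].add(row[col])
    summary = {"rows": n_rows}
    for col, values in unique.items():
        summary[f"unique_{col}"] = len(values)
    return summary


def summarise_csv(path: str) -> Dict[str, int]:
    """Summarise a CSV file; UTF-8 encoding is an assumption."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarise_rows(csv.DictReader(handle))


def mismatches(summary: Mapping[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {key: (found, listed)} for every count that differs."""
    return {
        key: (summary.get(key, 0), listed)
        for key, listed in LISTED_COUNTS.items()
        if summary.get(key, 0) != listed
    }
```

Calling mismatches(summarise_csv(path)) with the path of the downloaded CSV should return an empty dictionary if the file matches the figures listed here.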

Usage

This dataset can support a range of applications, including:
  • Text Classification: Training models to categorise Turkish news articles.
  • Natural Language Processing (NLP): Developing, testing, and refining NLP models tailored for the Turkish language.
  • Sentiment Analysis: Analysing the emotional tone and public opinion within Turkish news content.
  • Author Attribution: Investigating and identifying authors based on their unique writing styles.
  • Trend Analysis: Monitoring evolving themes and topics covered in Turkish media over time.
  • Topic Modelling: Discovering and extracting key thematic structures from a large corpus of news articles.
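
Most of these tasks start with the same pre-processing step: tokenise the article text and drop stop words. The sketch below shows that step and a simple term count as a first look at recurring themes. It uses a plain regular-expression tokeniser rather than Zemberek, and it assumes the stop-word file holds one word per line; neither detail is confirmed by the listing, and all function names are illustrative.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Runs of Unicode letters (no digits or underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def turkish_lower(s: str) -> str:
    """Lowercase using Turkish casing rules (I -> ı, İ -> i)."""
    return s.replace("I", "ı").replace("İ", "i").lower()


def load_stop_words(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set, assuming one word per line."""
    return {turkish_lower(line.strip()) for line in lines if line.strip()}


def tokenize(text: str, stop_words: Set[str]) -> List[str]:
    """Lowercase, split into letter-only tokens, and drop stop words."""
    tokens = TOKEN_RE.findall(turkish_lower(text))
    return [tok for tok in tokens if tok not in stop_words]


def top_terms(texts: Iterable[str], stop_words: Set[str],
              k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent non-stop-word tokens across texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text, stop_words))
    return counts.most_common(k)
```

load_stop_words accepts any iterable of lines, so an open file handle can be passed directly. Grouping the texts by year or by author before calling top_terms gives a rough starting point for trend analysis or author attribution; a morphological analyser such as Zemberek would likely give better results on Turkish, since surface-form counts treat inflected variants of a word as separate terms.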

Coverage

  • Geographic: The articles originate from the Sabah newspaper, making the primary geographic focus Turkey.
  • Time Range: The dataset covers a five-year period, with articles published between 2017 and 2021.
  • Demographic: The content is sourced from articles written by 204 different authors.

License

CC0

Who Can Use It

This dataset is particularly beneficial for:
  • Academic Researchers: Conducting studies in linguistics, media studies, political science, or social sciences related to Turkish content.
  • Data Scientists: Building and deploying machine learning models for text processing and analysis.
  • NLP Engineers and Developers: Enhancing or creating tools and applications for Turkish language understanding.
  • Media Analysts: Gaining insights into news coverage, editorial focus, and public discourse in Turkey.
  • Students: Utilising a real-world dataset for educational projects in data analysis, linguistics, and machine learning.

Dataset Name Suggestions

  • Sabah Newspaper Articles: 2017-2021
  • Turkish News Article Dataset (Sabah)
  • Turkish Newspaper Content 2017-2021
  • Sabah Daily Articles Corpus
  • Turkish Media Archive

Attributes

Original Data Source: Turkish News Article

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 05/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0