Towards Data Science Publishing Archive
Free
About
This dataset captures detailed metrics for articles published on Towards Data Science (TDS), a major publication where thousands of individuals share ideas and advance the understanding of data science. The collection includes over 45,000 articles and provides insights into publishing trends and reader engagement from late 2010 to mid-2021. It is a useful resource for anyone wishing to analyse the growth and popularity of data science content over that period.
Columns
The dataset contains 8 distinct features for each article:
- publish_date: The date the article went live. The data spans from 21 November 2010 to 31 July 2021.
- title: The title of the article.
- author: The individual credited with writing the article.
- url: The unique web link to access the article.
- claps: The number of claps the article received, a measure of reader support similar to 'likes'. The maximum observed value is 52,000.
- responses: The total number of comments or responses generated by the article. The maximum observed value is 298.
- reading_time: An estimate of the time required to read the article, assuming an average adult reading speed of approximately 265 words per minute.
- paid: A binary field indicating participation in the Medium Partner Program (1 signifies paid/member content, 0 signifies free content).
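As a minimal sketch of how the eight columns above might be read into typed records, the following uses only the Python standard library. The column names come from the list above; everything else is an assumption: the `Article` class and `parse_articles` function are hypothetical names, the date format is assumed to be ISO `YYYY-MM-DD`, the numeric columns are assumed to be plain integers, and the sample rows are invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    """One row of the dataset (hypothetical container, not part of the dataset)."""
    publish_date: date
    title: str
    author: str
    url: str
    claps: int
    responses: int
    reading_time: int
    paid: bool


def parse_articles(handle):
    """Yield one Article per CSV row, converting each field to a typed value."""
    for row in csv.DictReader(handle):
        yield Article(
            # Assumption: dates are stored as YYYY-MM-DD.
            publish_date=date.fromisoformat(row["publish_date"]),
            title=row["title"],
            author=row["author"],
            url=row["url"],
            # Assumption: counts are plain integers such as 52000, not "52K".
            claps=int(row["claps"]),
            responses=int(row["responses"]),
            reading_time=int(row["reading_time"]),
            # 1 signifies paid/member content, 0 signifies free content.
            paid=row["paid"] == "1",
        )


# Invented sample rows for illustration only; not taken from the dataset.
SAMPLE = """publish_date,title,author,url,claps,responses,reading_time,paid
2020-01-15,Example Title A,Author One,https://example.com/a,120,3,5,1
2020-02-01,Example Title B,Author Two,https://example.com/b,40,0,7,0
"""

articles = list(parse_articles(io.StringIO(SAMPLE)))
```

To try this against the real file, the in-memory sample could be swapped for `open("tds_data.csv", newline="", encoding="utf-8")`; the encoding is an assumption, and the conversions may need adjusting if the actual file formats dates or counts differently.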
Distribution
The dataset is provided as a CSV file (tds_data.csv), currently amounting to 8.42 MB. It consists of 48,060 unique article records, each described by 8 features. The data is extracted from the platform archive and is scheduled for updates on a monthly basis.
Usage
This data product is ideally suited for several analytical applications:
- Content Performance Modelling: Building models to predict article success (high claps or responses) based on structural features like title and reading time.
- Trend Analysis: Tracking the evolution of popular data science topics and sub-disciplines over the covered time frame.
- Platform Research: Studying the impact of monetisation strategies, such as the Medium Partner Program, on content production and engagement.
- Natural Language Processing (NLP): Utilising article titles and features for topic clustering and classification tasks.
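The trend analysis and platform research applications above can be illustrated with a small aggregation. This is a sketch under stated assumptions, not a prescribed method: `summarise` is a hypothetical helper, dates are assumed to begin with a four-digit year, `claps` is assumed to be a plain integer, and the sample rows are invented.

```python
import csv
import io
from collections import Counter, defaultdict
from statistics import median

# Invented sample rows for illustration only; not taken from the dataset.
SAMPLE = """publish_date,title,author,url,claps,responses,reading_time,paid
2019-03-02,Example Title A,Author One,https://example.com/a,100,2,4,0
2020-01-15,Example Title B,Author Two,https://example.com/b,300,5,6,1
2020-06-30,Example Title C,Author One,https://example.com/c,50,0,8,1
2020-11-11,Example Title D,Author Three,https://example.com/d,10,1,3,0
"""


def summarise(handle):
    """Return (articles per year, median claps keyed by the paid flag)."""
    per_year = Counter()
    claps_by_paid = defaultdict(list)
    for row in csv.DictReader(handle):
        # Assumption: publish_date starts with the year, e.g. 2020-01-15.
        per_year[row["publish_date"][:4]] += 1
        # Group clap counts by the paid flag ("1" = paid, "0" = free).
        claps_by_paid[row["paid"]].append(int(row["claps"]))
    median_claps = {flag: median(vals) for flag, vals in claps_by_paid.items()}
    return dict(per_year), median_claps


articles_per_year, median_claps = summarise(io.StringIO(SAMPLE))
```

The median is used rather than the mean because engagement counts are likely to be skewed by a few very popular articles (the maximum clap count is 52,000); that choice is a judgement call rather than something the dataset prescribes.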
Coverage
The articles span November 2010 to July 2021. The content reflects publications on Towards Data Science, a platform associated with a Canadian-registered corporation, and focuses on data science, computer science, and related business topics.
License
CC0: Public Domain
Who Can Use It
- Data Scientists: For developing predictive algorithms related to content virality and reader behaviour.
- Journalism Researchers: To study metrics and publishing patterns on high-volume online platforms.
- Content Managers: To benchmark article performance and inform future editorial strategies within the data science domain.
Dataset Name Suggestions
- TDS Article Engagement Metrics 2010–2021
- Towards Data Science Publishing Archive
- Data Science Content Performance Index
Attributes
Original Data Source: Towards Data Science Publishing Archive
