Content Preference Recommendation Set
Free
About
This collection originates from a recent competitive challenge organised by Sony to encourage the development of advanced recommendation systems. It provides the content-side information needed to build a system that suggests the top 10 movies for a user, taking the user's location and preferences into account. The data describes content characteristics such as type, language, genre, duration, release date and rating, which supports detailed analysis for tailored entertainment recommendations.
Columns
- content_id: A unique identifier for each piece of content, featuring 48,645 distinct values. All entries are valid.
- content_type: Specifies the kind of content, with 'series' accounting for 93% and 'sports' for 7%. There are 4 unique content types in total. All entries are valid.
- language: Indicates the content's language, predominantly 'hindi' (49%) and 'english' (19%). The dataset includes content in 11 different languages. All entries are valid.
- genre: Describes the content category, with 'drama' (47%) and 'comedy' (20%) being the most frequent. There are 22 distinct genres present. All entries are valid.
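- duration: Represents the length of the content, varying from 60,000 to 11,100,000 units, with an average duration of 3.53 million units. The source does not name the unit; the values would be consistent with milliseconds (roughly 1 minute to 185 minutes, averaging about 59 minutes). All entries are valid.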
- release_date: The date when the content was made available, ranging from 11 October 1990 to 31 December 2020. All entries are valid.
- rating: The numerical rating given to the content, with values from 0 to 10 and an average rating of 5.04. All entries are valid.
- episode_count: The number of episodes for the content, from 0 to 60, with a mean of 16.2 episodes. All entries are valid.
- season_count: The number of seasons for the content, from 0 to 44, with a mean of 6.61 seasons. All entries are valid.
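The column list above can be written down as a typed record with a sanity check against the documented ranges. This is a minimal sketch, not part of the dataset: the names `ContentRecord`, `DOCUMENTED_RANGES` and `out_of_range_fields` are hypothetical, and the Python types (for example, `content_id` as text and `rating` as a float) are assumptions, since the listing does not state storage types.

```python
from dataclasses import dataclass

# (min, max) values taken from the column descriptions above.
DOCUMENTED_RANGES = {
    "duration": (60_000, 11_100_000),
    "rating": (0, 10),
    "episode_count": (0, 60),
    "season_count": (0, 44),
}


@dataclass
class ContentRecord:
    """One row of the content file; field types are assumptions."""
    content_id: str
    content_type: str
    language: str
    genre: str
    duration: int
    release_date: str  # kept as text: the date format is not stated in the source
    rating: float
    episode_count: int
    season_count: int


def out_of_range_fields(record):
    """Return the numeric field names whose values fall outside the documented ranges."""
    bad = []
    for name, (low, high) in DOCUMENTED_RANGES.items():
        value = getattr(record, name)
        if not (low <= value <= high):
            bad.append(name)
    return bad


# Hypothetical example row, invented for illustration only.
example = ContentRecord("c1", "series", "hindi", "drama",
                        3_600_000, "2015-06-01", 7.5, 12, 2)
print(out_of_range_fields(example))
```

A record whose values all sit inside the documented ranges yields an empty list; any other result names the fields worth inspecting.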
Distribution
The data is distributed as a single CSV file named 'content.csv'. The file is 2.98 MB in size and contains 9 columns and about 48.6 thousand records. No missing or mismatched entries are reported in any column.
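Reading the file and confirming that the nine columns are present could look like the sketch below. It uses only the Python standard library. The helper name `read_content` is hypothetical, the two sample rows are invented stand-ins for the real file, and the assumption is that the CSV header uses the column names exactly as listed in this description.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "content_id", "content_type", "language", "genre", "duration",
    "release_date", "rating", "episode_count", "season_count",
]


def read_content(handle):
    """Read rows as dicts and fail early if an expected column is absent."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Invented two-row sample so the sketch runs without the real file.
SAMPLE_CSV = (
    "content_id,content_type,language,genre,duration,"
    "release_date,rating,episode_count,season_count\n"
    "c1,series,hindi,drama,3600000,2015-06-01,7.5,12,2\n"
    "c2,sports,english,comedy,5400000,2019-03-10,6.0,0,0\n"
)

rows = read_content(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["language"])

# With the real file (assuming UTF-8 encoding):
# with open("content.csv", newline="", encoding="utf-8") as f:
#     rows = read_content(f)
```

All values come back as strings, so numeric columns such as `duration` and `rating` still need converting before analysis.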
Usage
This data is well-suited for:
- Developing and training recommendation systems for movies and other media content.
- Building models to suggest the top 10 movies specifically tailored to user locations and preferences.
- Conducting exploratory data analysis to discover patterns and relationships within media content attributes.
- Machine learning research focused on content discovery and user personalisation.
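As an illustration of the top-10 use case, the sketch below ranks content by a simple preference match. It is a toy baseline, not the challenge's method: the `prefs` structure, the scoring weights and the function names are all hypothetical, and user location is left out because this file contains no user or location columns.

```python
import heapq


def score(item, prefs):
    """Toy score: rating scaled to 0..1, plus 1 for each matching preference."""
    total = item["rating"] / 10.0
    if item["language"] in prefs["languages"]:
        total += 1.0
    if item["genre"] in prefs["genres"]:
        total += 1.0
    return total


def recommend(items, prefs, k=10):
    """Return the k highest-scoring items, best first."""
    return heapq.nlargest(k, items, key=lambda item: score(item, prefs))


# Invented catalogue rows and preferences, for illustration only.
items = [
    {"content_id": "a", "language": "hindi", "genre": "drama", "rating": 6.0},
    {"content_id": "b", "language": "english", "genre": "comedy", "rating": 9.0},
    {"content_id": "c", "language": "hindi", "genre": "comedy", "rating": 4.0},
]
prefs = {"languages": {"hindi"}, "genres": {"drama"}}

for item in recommend(items, prefs, k=2):
    print(item["content_id"], round(score(item, prefs), 2))
```

A real system would learn the weights from interaction data rather than fixing them by hand; the fixed weights here only make the ranking logic easy to follow.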
Coverage
The dataset carries the 'India' tag, which suggests that much of the content or its target audience is associated with that region. The predominance of Hindi (49%) and English (19%) content is consistent with this. Release dates span from 11 October 1990 to 31 December 2020, covering roughly three decades of content. No demographic details about users are provided, although the stated goal of recommending by user preference implies that the data is meant to serve a range of user segments.
License
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Who Can Use It
- Data Scientists and Machine Learning Engineers: To design and implement advanced content recommendation algorithms.
- Researchers: For work on data challenges, particularly within the entertainment and media sectors.
- Data Analysts: For exploring trends, patterns, and insights within movie and series content.
- Product Developers: For building applications that require personalised media suggestions.
Dataset Name Suggestions
- Sony RISE Movie Challenge Data
- Content Preference Recommendation Set
- Entertainment Media Attributes Data
- User-Centric Movie Recommender Data
- Global Content Insights Data
Attributes
Original Data Source: Content Preference Recommendation Set