Toy Data for Prediction Models
Data Science and Analytics
Free
About
This dataset is a fictional resource designed for exploratory data analysis (EDA) and for evaluating simple prediction models. It serves as a toy dataset, allowing users to familiarise themselves with data analysis techniques. All the data within is simulated, with distributions engineered to facilitate straightforward statistical analysis.
Columns
- Number: A simple sequential index assigned to each row in the dataset.
- City: Represents the geographical location of an individual. The locations included are Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego, and Austin. Notably, New York City accounts for 34% of entries, while Los Angeles represents 21%. There are 8 unique city values.
- Gender: Indicates the gender of a person, categorised as either Male or Female. Males constitute 56% of the entries, and Females make up 44%. There are 2 unique gender values.
- Age: Specifies the age of a person, with values ranging from 25 to 65 years. The mean age is approximately 45 years, with a standard deviation of about 11.6 years.
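- Income: Details the annual income of an individual. Values range from approximately -674 to 177,175, with a mean of around 91,300 and a standard deviation of about 25,000. The currency is not stated, and the negative minimum is possible because the values are simulated rather than observed.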
- Illness: A binary indicator representing whether a person is ill (Yes or No). Note that the column summary flags all 150,000 entries as mismatched and reports no valid 'Yes' or 'No' counts, so check the values actually present in the file before relying on this column.
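The figures quoted for each column can be re-checked after download. The following is a minimal sketch using only the Python standard library. It assumes the CSV header uses the column names listed above (Number, City, Gender, Age, Income, Illness); the inline sample rows are made up so the sketch runs without the real file, and the file name in the final comment is hypothetical.

```python
import csv
import io
import statistics
from collections import Counter


def summarize(rows):
    """Summarise rows (dicts keyed by column name) for comparison with the listing."""
    rows = list(rows)
    n = len(rows)
    ages = [float(r["Age"]) for r in rows]
    incomes = [float(r["Income"]) for r in rows]
    return {
        "rows": n,
        # Share of each category, e.g. to compare with the 34% quoted for New York City.
        "city_share": {k: v / n for k, v in Counter(r["City"] for r in rows).items()},
        "gender_share": {k: v / n for k, v in Counter(r["Gender"] for r in rows).items()},
        "age_mean": statistics.fmean(ages),
        # Population standard deviation; the listing does not say which variant it reports.
        "age_std": statistics.pstdev(ages),
        "income_min": min(incomes),
        "income_max": max(incomes),
        # Negative incomes are expected to be rare simulation artefacts.
        "negative_incomes": sum(1 for x in incomes if x < 0),
        # Raw value counts show what the Illness column really contains.
        "illness_values": Counter(r["Illness"].strip() for r in rows),
    }


# Made-up sample rows, only so that the sketch is runnable on its own.
SAMPLE = """Number,City,Gender,Age,Income,Illness
1,Dallas,Male,41,40367.0,No
2,New York City,Female,54,-120.5,Yes
3,New York City,Male,30,91300.0,No
4,Boston,Female,65,177175.0,No
"""

summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(summary["city_share"])
print(summary["illness_values"])

# With the real file (hypothetical name):
# with open("toy_dataset.csv", newline="") as f:
#     summary = summarize(csv.DictReader(f))
```

If the header names in the downloaded file differ from those assumed here, adjust the dictionary keys accordingly.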
Distribution
The dataset has 150,000 rows and 6 columns. It is provided as a CSV file of about 5.74 MB. The underlying data distributions have been generated specifically to be convenient for statistical analysis.
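A quick way to confirm that a downloaded copy matches the stated shape is to count rows and columns directly. This is a small sketch, assuming the file has a single header row; the demo input is made up and the file name in the final comment is hypothetical.

```python
import csv
import io

EXPECTED_ROWS = 150_000  # from the listing
EXPECTED_COLS = 6        # from the listing


def csv_shape(handle):
    """Return (data_rows, columns) for an open CSV handle with one header row."""
    reader = csv.reader(handle)
    header = next(reader)
    n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Made-up one-row demo so the sketch runs without the real file.
demo = io.StringIO("Number,City,Gender,Age,Income,Illness\n1,Dallas,Male,41,40367.0,No\n")
shape = csv_shape(demo)
print(shape)  # (1, 6)

# With the real file (hypothetical name):
# with open("toy_dataset.csv", newline="") as f:
#     assert csv_shape(f) == (EXPECTED_ROWS, EXPECTED_COLS)
```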
Usage
This dataset is ideally suited for:
- Conducting exploratory data analysis (EDA) to uncover patterns and insights.
- Developing and testing simple prediction models.
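As one illustration of a simple prediction model, the sketch below predicts Income from City using per-city means and falls back to the overall mean for unseen cities. This is only one possible baseline, not a method prescribed by the dataset; the function names and the sample rows are hypothetical, and rows are assumed to be dicts as produced by `csv.DictReader`.

```python
import statistics
from collections import defaultdict


def fit_group_means(rows, key, target):
    """Learn the mean of `target` per value of `key`, plus the overall mean."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(float(r[target]))
    all_values = [v for vals in groups.values() for v in vals]
    means = {k: statistics.fmean(vals) for k, vals in groups.items()}
    return means, statistics.fmean(all_values)


def predict(model, row, key):
    """Predict with the group mean, or the overall mean for an unseen group."""
    means, overall = model
    return means.get(row[key], overall)


def mean_abs_error(model, rows, key, target):
    """Average absolute difference between actual and predicted values."""
    errors = [abs(float(r[target]) - predict(model, r, key)) for r in rows]
    return statistics.fmean(errors)


# Made-up rows, only so that the sketch is runnable on its own.
train = [
    {"City": "Dallas", "Income": "40000"},
    {"City": "Dallas", "Income": "50000"},
    {"City": "Boston", "Income": "90000"},
    {"City": "Boston", "Income": "110000"},
]
test = [
    {"City": "Dallas", "Income": "47000"},
    {"City": "Austin", "Income": "70000"},  # city not seen in training
]

model = fit_group_means(train, key="City", target="Income")
print(predict(model, {"City": "Dallas"}, "City"))       # 45000.0
print(mean_abs_error(model, test, "City", "Income"))    # 2250.0
```

A baseline like this gives a reference error that any more elaborate model should beat before it is considered useful.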
Coverage
- Geographic Scope: Includes data points from several major US cities: Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego, and Austin.
- Demographic Scope: Features demographic attributes such as Gender (Male, Female), Age (25-65 years), and Income.
- Time Range: Not applicable as the dataset is fictional and does not represent a specific time period.
License
CC0: Public Domain
Who Can Use It
This dataset is appropriate for:
- Data scientists and analysts for prototyping and testing algorithms.
- Students and educators for learning and teaching data analysis concepts.
- Researchers looking for a synthetic dataset to validate methodologies.
Dataset Name Suggestions
- Fictional Demographic Data
- Synthetic Population Insights
- Demographic Simulation Dataset
- Toy Data for Prediction Models
Attributes
Original Data Source: Toy Data for Prediction Models