Enhanced Used Car Valuation Data

Stock & Market Data

Tags and Keywords

Cars, Used, Prices, Regression, Vehicles

Enhanced Used Car Valuation Data Dataset on Opendatabay data marketplace

Free

About

This dataset extends the data used in the "Regression of Used Car Prices" Kaggle competition. It adds data points generated with OpenAI's GPT-4o-mini that give more detail on each vehicle and may improve model performance. Because the values are synthetic, they are model-generated estimates rather than measured data; the listing describes them as accurate and realistic. The dataset is intended for in-depth analysis of used car pricing and for enriching regression models with greater variability and depth.

Columns

  • model_year: The year the vehicle model was manufactured. This field is directly mapped to the model year in the competition dataset and includes all combinations of years present, ranging from 1974 to 2024. The age of a car is a key factor in its resale value.
  • brand: The manufacturer of the vehicle. This field matches the brand in the competition dataset. Different brands exhibit varying resale values influenced by their reliability, reputation, and market demand. Ford is the most common brand at 9%, with Chevrolet at 7%, and 57 unique brands in total.
  • model: The specific model of the vehicle produced by the brand. This field is directly mapped to the model in the competition dataset. Even within the same brand, different models can show significant price variations. There are 1898 unique models.
  • type: The classification of the vehicle, such as SUV, Coupe, Sedan, Convertible, or Van. Vehicle type affects price due to its utility, market demand, and intended use. SUV is the most common type at 35%, followed by Sedan at 29%, with 9 unique types in total.
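  • miles_per_gallon: The fuel efficiency of the vehicle, measured in miles per gallon (MPG). Higher fuel efficiency is generally linked to higher value due to lower operational costs. Values range from -1 to 234 MPG, with a mean of 21.8. The minimum of -1 is not a valid fuel economy figure and is presumably a placeholder, so check such values before modelling.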
  • premium_version: A binary field indicating whether the car is a premium version (1) or not (0). Premium versions often include luxury features or higher-end specifications, which typically increase the vehicle's value. There are slightly more premium versions (15,015 records) than non-premium (13,128 records).
  • msrp: The Manufacturer's Suggested Retail Price (MSRP) when the car was new. This is a strong indicator of the car's original value and influences its depreciation rate and current market value. Values range from 0 to 2.5 million, with a mean of about 57,400; the currency is not stated, and an MSRP of 0 is unlikely to be a real price.
  • collection_car: A binary field indicating whether the car is considered a "collector's item" (1) or not (0). Collector cars tend to retain or increase their value over time due to rarity or historical significance. Most cars are not considered collector's items (24,279 records for 0), while 3,864 are (for 1).
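The ranges quoted above include a miles_per_gallon of -1 and an msrp of 0, which look like placeholders rather than real measurements. The following is a minimal sketch of one way to handle them, assuming that non-positive values in these two columns should be treated as missing; that rule is an assumption, not something the listing states. The helper names `parse_positive` and `clean_row` are hypothetical, and the example row is made up for illustration.

```python
from typing import Optional


def parse_positive(raw: str) -> Optional[float]:
    """Parse a numeric cell; return None for blanks, non-numbers, or values <= 0."""
    text = raw.strip()
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    # Assumption: -1 and 0 are placeholders, so anything non-positive is missing.
    return value if value > 0 else None


def clean_row(row: dict) -> dict:
    """Return a copy of the row with miles_per_gallon and msrp cleaned."""
    cleaned = dict(row)
    cleaned["miles_per_gallon"] = parse_positive(row["miles_per_gallon"])
    cleaned["msrp"] = parse_positive(row["msrp"])
    return cleaned


# Made-up row for illustration only; it is not taken from the dataset.
example_row = {
    "model_year": "2015",
    "brand": "Ford",
    "model": "Example Model",
    "type": "SUV",
    "miles_per_gallon": "-1",
    "premium_version": "0",
    "msrp": "0",
    "collection_car": "0",
}
cleaned_example = clean_row(example_row)
```

Whether to drop, impute, or keep such rows depends on the model; the sketch only marks them so the choice is explicit.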

Distribution

The dataset is provided as a single CSV file, extended_data.csv (1.48 MB), with 8 columns and approximately 28,100 valid records; the premium_version counts listed above sum to 28,143 rows. miles_per_gallon and msrp each have 17 missing values; the other columns are complete.
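The row and missing-value figures above can be checked after download. The sketch below counts rows and empty cells per column using only the standard library. The function name `missing_counts` is hypothetical, and the inline sample is made-up data that only mimics the column layout described in this listing.

```python
import csv
import io
from collections import Counter


def missing_counts(lines):
    """Return (row_count, {column: number_of_empty_cells}) for CSV input."""
    reader = csv.DictReader(lines)
    missing = Counter()
    total = 0
    for row in reader:
        total += 1
        for column, value in row.items():
            if column is None:
                continue  # extra cells beyond the header are ignored
            if value is None or value.strip() == "":
                missing[column] += 1
    return total, dict(missing)


# Made-up sample with the same header as the listing describes.
SAMPLE = """model_year,brand,model,type,miles_per_gallon,premium_version,msrp,collection_car
2018,Ford,Example A,SUV,24,0,35000,0
2005,Chevrolet,Example B,Sedan,,1,,0
"""

total, missing = missing_counts(io.StringIO(SAMPLE))

# For the real file (UTF-8 encoding is an assumption):
# with open("extended_data.csv", newline="", encoding="utf-8") as f:
#     total, missing = missing_counts(f)
```

On the sample, this reports 2 rows with one empty cell each in miles_per_gallon and msrp.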

Usage

This dataset is ideal for merging or joining with existing competition data for used car prices based on common features like model_year, brand, and model. It can be used to enrich existing datasets, introduce more variability and depth, and ultimately improve the performance of regression models aimed at predicting used car prices. It is particularly useful for feature engineering and exploring more nuanced insights into car valuation.
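The join described above can be sketched as a left join on (model_year, brand, model). This is a minimal illustration under two assumptions: the competition rows use the same three column names, and when the extension contains duplicate keys the first row is kept. The helper names `build_lookup` and `enrich`, the `price` column in the sample, and all sample rows are hypothetical.

```python
EXTRA_COLUMNS = ("type", "miles_per_gallon", "premium_version", "msrp", "collection_car")


def join_key(row):
    """Build the join key; model_year is compared as text to avoid type mismatches."""
    return (str(row["model_year"]), row["brand"], row["model"])


def build_lookup(extension_rows):
    """Index extension rows by key, keeping the first row for any duplicate key."""
    lookup = {}
    for row in extension_rows:
        lookup.setdefault(join_key(row), row)
    return lookup


def enrich(competition_rows, lookup):
    """Left join: every competition row is kept; unmatched rows get None values."""
    enriched = []
    for row in competition_rows:
        match = lookup.get(join_key(row))
        merged = dict(row)
        for column in EXTRA_COLUMNS:
            merged[column] = match[column] if match else None
        enriched.append(merged)
    return enriched


# Made-up rows for illustration only.
extension = [
    {"model_year": "2018", "brand": "Ford", "model": "Example A", "type": "SUV",
     "miles_per_gallon": "24", "premium_version": "0", "msrp": "35000", "collection_car": "0"},
]
competition = [
    {"model_year": 2018, "brand": "Ford", "model": "Example A", "price": 21000},
    {"model_year": 2010, "brand": "Ford", "model": "Unlisted", "price": 8000},
]

enriched = enrich(competition, build_lookup(extension))
```

A left join keeps every competition row, so rows without a match stay in the training set with empty extension features instead of being silently dropped.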

Coverage

The dataset covers model years from 1974 to 2024 and focuses on vehicle attributes and market indicators for used cars. The listing does not specify a geographic or demographic scope; the data is presented as generally applicable to used car price regression.

License

CC BY-SA 4.0

Who Can Use It

This dataset is intended for data scientists, machine learning engineers, and researchers involved in predictive modelling, especially those working on regression tasks. It is particularly useful for participants in Kaggle competitions focused on used car price prediction, as well as automotive market analysts and economists looking to understand factors influencing car values.

Dataset Name Suggestions

  • Used Car Price Regression Extension
  • Enhanced Used Car Valuation Data
  • Synthetic Car Market Data
  • Automotive Price Prediction Dataset
  • GPT-Generated Used Car Features

Attributes

Original Data Source: Enhanced Used Car Valuation Data

Listing Stats

Views: 0
Downloads: 0
Listed: 31/08/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
