NYC Taxi Fare Prediction Dataset
Data Science and Analytics
Free
About
This dataset is designed for predicting taxi trip fares in New York City. It contains information on millions of New York City yellow taxi cab trips, allowing data analysts to create, train, and evaluate machine learning models, and to generate predictions from them, using BigQuery Machine Learning (BQML) with minimal coding. The ultimate goal is to help cab drivers estimate trip fares in their respective locations and reach customers efficiently.
Columns
- trip_duration: Indicates how long the journey lasted, measured in seconds.
- distance_traveled: Shows how far the taxi travelled, measured in km.
- num_of_passengers: Records the number of passengers in the taxi.
- fare: Represents the base fare for the journey, in INR.
- tip: Details how much the driver received in tips, in INR.
- miscellaneous_fees: Accounts for any additional charges during the trip, such as tolls, convenience fees, or GST, in INR.
- total_fare: The grand total for the ride, in INR, which is the prediction target for models.
- surge_applied: A boolean indicator (Yes or No) showing whether surge pricing was applied.
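To make the column types concrete, here is a minimal Python sketch that parses rows with the columns listed above into typed records. The sample rows are made up for illustration, and the column order and the exact "Yes"/"No" text encoding of surge_applied are assumptions rather than details confirmed by the dataset files.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Trip:
    trip_duration: float       # seconds
    distance_traveled: float   # km
    num_of_passengers: int
    fare: float                # INR
    tip: float                 # INR
    miscellaneous_fees: float  # INR
    total_fare: float          # INR, the prediction target
    surge_applied: bool


def parse_trip(row: dict) -> Trip:
    """Convert one CSV row (all values are strings) into a typed Trip."""
    return Trip(
        trip_duration=float(row["trip_duration"]),
        distance_traveled=float(row["distance_traveled"]),
        # Go through float first so values such as "2.0" also parse.
        num_of_passengers=int(float(row["num_of_passengers"])),
        fare=float(row["fare"]),
        tip=float(row["tip"]),
        miscellaneous_fees=float(row["miscellaneous_fees"]),
        total_fare=float(row["total_fare"]),
        # Assumption: the flag is stored as the text "Yes" or "No".
        surge_applied=row["surge_applied"].strip().lower() == "yes",
    )


# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE_CSV = """trip_duration,distance_traveled,num_of_passengers,fare,tip,miscellaneous_fees,total_fare,surge_applied
600,3.5,1,75.0,10,6.5,91.5,No
1200,8.0,2,150.0,0,13.0,163.0,Yes
"""

trips = [parse_trip(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

With the real files, the same parse_trip function could be applied to rows read from train.csv, provided the header names match those listed here.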
Distribution
The dataset is typically provided in CSV format, with data files including submission.csv, test.csv, and train.csv. The sizes of these files are approximately:
- submission.csv: 359.46 kB
- test.csv: 3.05 MB
- train.csv: 9.02 MB
The total size for Version 2 is around 12.43 MB. While specific row counts are not detailed, the dataset contains millions of New York City yellow taxi cab trips and is structured for training and testing machine learning models. The dataset is expected to be updated daily.
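Since row counts are not published, one way to obtain them is to count the rows of each file after download. The following is a small sketch using only the Python standard library; the in-memory demo data is made up, and the assumption that each file starts with a single header line is not confirmed by the listing.

```python
import csv
import io


def count_rows(handle) -> int:
    """Count data rows in an open CSV file, excluding the header line."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header if the file is not empty
    return sum(1 for _ in reader)


# Stand-in for a real file; with the download you would pass
# open("train.csv", newline="") as the handle instead.
demo = io.StringIO("fare,total_fare\n75.0,91.5\n150.0,163.0\n")
row_count = count_rows(demo)
```

Reading through csv.reader rather than counting raw lines keeps the count correct if a quoted field ever contains a line break.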
Usage
This dataset is ideal for:
- Creating, training, evaluating, and making predictions with machine learning models.
- Forecasting numeric values such as taxi fares using Linear Regression (linear_reg) models within BQML.
- Binary or Multiclass Classification tasks (e.g., spam detection) using Logistic Regression (logistic_reg), though not the primary focus for fare prediction.
- Unsupervised learning for exploration, utilising k-Means Clustering (kmeans).
- Querying and exploring large public taxi cab datasets efficiently.
- Building forecasting models that can assist cab drivers (e.g., Uber, Rapido) in identifying trip fares and optimising routes for quicker customer reach.
- Visualising trip fares for better insights.
- Applying AutoML for automatically selecting important features and models.
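As an illustration of the regression use case, the sketch below fits a one-feature linear model that predicts total_fare from distance_traveled. It is a plain Python stand-in using ordinary least squares, not BQML's linear_reg, and the data points are made up for illustration (chosen to lie exactly on a line so the result is easy to check). The function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values for illustration only; not taken from the dataset.
distance_km = [1.0, 2.0, 4.0, 8.0]
total_fare_inr = [40.0, 60.0, 100.0, 180.0]  # exactly 20 * km + 20

slope, intercept = fit_line(distance_km, total_fare_inr)
predicted = slope * 5.0 + intercept  # estimated total_fare for a 5 km trip
```

A realistic model would likely use several columns at once (for example trip_duration and surge_applied alongside distance), which is where a library or BQML becomes more practical than a hand-written fit.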
Coverage
The dataset focuses on New York City and comprises trips from yellow taxi cabs. It includes millions of trip records. The data is part of a BigQuery Public Dataset.
License
CC0: Public Domain
Who Can Use It
- Data analysts looking to build and deploy machine learning models with minimal coding.
- Machine learning engineers and data scientists interested in regression, classification, or clustering tasks.
- Cab drivers or ride-sharing companies (e.g., Uber, Rapido) seeking insights into fare predictions and operational efficiency.
- Anyone interested in data analytics and exploring large-scale public datasets.
Dataset Name Suggestions
- NYC Taxi Fare Prediction Dataset
- BigQuery ML Yellow Cab Trips
- New York City Taxi Fare Analytics
- Taxi Ride Pricing Model Data
- BQML Taxi Fare Forecast
Attributes
Original Data Source: NYC Taxi Fare Prediction Dataset