
new_york_citibike

This public dataset contains two BigQuery tables; the table used here is `citibike_trips`, which holds over 58 million records. The `tripduration` field gives the duration of each bike rental in seconds and serves as the label; the other fields are candidate features.

Source
GitHub
Created
Jun 12, 2024
Updated
Jun 27, 2024

Dataset Overview

Dataset Information

  • Dataset Name: new_york_citibike
  • Data Table: citibike_trips
  • Data Volume: Over 58 million records
  • Label: tripduration (ride duration, in seconds)
  • Features: All other fields in the table, treated as candidate features

Data Processing

  • Preprocessing: Cleaning, handling missing values, converting datetime variables, feature scaling
  • Data Splitting: Divide the dataset into three parts for model selection, evaluation, and testing, using month as the split criterion
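The month-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which months go into which part, so the month sets below are hypothetical, as are the `starttime` field name and the three sample trips.

```python
from datetime import datetime

# Hypothetical month-to-split mapping; the source only states that
# month is the split criterion, not which months land in which part.
EVALUATION_MONTHS = {10, 11}
TESTING_MONTHS = {12}


def assign_split(start_time: datetime) -> str:
    """Return the split name for a trip based on the month it started."""
    if start_time.month in TESTING_MONTHS:
        return "testing"
    if start_time.month in EVALUATION_MONTHS:
        return "evaluation"
    return "model_selection"


# Made-up sample trips, for illustration only.
trips = [
    {"starttime": datetime(2016, 3, 14, 8, 30), "tripduration": 540},
    {"starttime": datetime(2016, 10, 2, 17, 5), "tripduration": 1260},
    {"starttime": datetime(2016, 12, 24, 11, 45), "tripduration": 780},
]

# Group the trips into the three parts named in the text.
splits = {"model_selection": [], "evaluation": [], "testing": []}
for trip in trips:
    splits[assign_split(trip["starttime"])].append(trip)
```

Splitting on a calendar field keeps every trip from a given month in the same part, which avoids leaking near-duplicate rides between the parts.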

Model Selection and Evaluation

  • Model Choice: Linear regression model
  • Evaluation Metric: Mean Squared Error (MSE)
  • Model Optimization: Iterative adjustments to improve performance
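To make the evaluation metric concrete, here is a minimal sketch of a linear regression scored by MSE. It is not the author's model: it fits a single hypothetical numeric feature with ordinary least squares in plain Python, and the sample values are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up data: one hypothetical numeric feature and a duration label.
feature = [1.0, 2.0, 3.0, 4.0]
duration = [6.0, 11.0, 14.0, 21.0]

slope, intercept = fit_line(feature, duration)
predictions = [slope * x + intercept for x in feature]
mse = mean_squared_error(duration, predictions)
```

Because the errors are squared, MSE is expressed in squared label units and penalizes large misses more heavily than small ones.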

Model Evaluation Results

  • Model 1: trip_duration_by_stations, MSE = 111.2176
  • Model 2: trip_duration_by_stations_and_day, MSE = 98.0522
  • Model 3: trip_duration_by_stations_day_age, MSE = 110.8004
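The comparison implied by these results can be sketched in a few lines. The MSE values are the ones reported above; the variable names are illustrative.

```python
# MSE values as reported in the evaluation results above.
reported_mse = {
    "trip_duration_by_stations": 111.2176,
    "trip_duration_by_stations_and_day": 98.0522,
    "trip_duration_by_stations_day_age": 110.8004,
}

# Lower MSE is better, so pick the model with the smallest value.
best_model = min(reported_mse, key=reported_mse.get)

# Full ranking from best to worst.
ranking = sorted(reported_mse.items(), key=lambda item: item[1])
```

On these figures the lowest MSE belongs to `trip_duration_by_stations_and_day`, so adding the day improved on the stations-only model, while the model that also includes age did not improve on it.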

Conclusion

  • Prediction Outcome: A total of 1,548,371 predictions were made and compared against the actual ride durations
  • Accuracy: In 89.6% of cases, the predicted ride duration is within 15 minutes of the actual value; the mean absolute error of the duration prediction, on which the ride cost estimate is based, is 6.8 minutes
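The two summary figures above, the share of predictions within 15 minutes and the mean absolute error, can be computed as in the sketch below. The function names and the four sample rides are hypothetical; durations are assumed to be in minutes.

```python
def share_within(actual, predicted, threshold=15.0):
    """Fraction of predictions whose absolute error is below the threshold."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) < threshold)
    return hits / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up ride durations in minutes, for illustration only.
actual_minutes = [10.0, 25.0, 40.0, 12.0]
predicted_minutes = [14.0, 20.0, 60.0, 11.0]

within_15 = share_within(actual_minutes, predicted_minutes)
mae = mean_absolute_error(actual_minutes, predicted_minutes)
```

Since `tripduration` is stored in seconds, values would need to be divided by 60 before applying a 15-minute threshold.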

Recommended Strategy

  • Pricing Model Strategy: Adopt quarterly ticket pricing with a periodic payment mode
  • User Story: When a user selects a bike at a start station on a specific date and specifies a destination, the model can predict ride duration and cost
  • Model Performance: The model predicts ride duration within 15 minutes of the actual value in 89.6% of cases, with a mean absolute error of 6.8 minutes