For the mid-term project of DS5230: Unsupervised Machine Learning, we formulated a problem statement and applied several clustering methods to a selected dataset.

As the management of a taxi company in Chicago, we want to optimize fleet distribution and taxi performance to boost efficiency and profitability.
➡️ Cluster on spatial and temporal trip data to identify high-demand hotspots and patterns. Insights from this analysis will enable strategic fleet positioning, reduce response times, and enhance customer satisfaction.
➡️ Cluster taxis based on performance metrics to distinguish efficiency profiles. Insights from this analysis will enable interventions for low performers and adoption of best practices, which in turn enhance service quality and optimize costs.
- 1.7 million taxi trips in Chicago from September to November 2023. Each entry represents one trip with 20+ attributes.
- Temporal features include start and end timestamps, rounded to the nearest 15 minutes. Spatial attributes include census tracts and community areas for pickup and dropoff locations with centroid coordinates. Spatial info is not available for locations outside Chicago.
- Additional attributes include taxi ID, trip duration, distance, fare, non-cash tips, tolls, extra charges, total cost, payment method, and associated taxi company.
- We queried this dataset from Google BigQuery. The original dataset was provided by the City of Chicago.
├── Analysis 1
│ ├── trip_data_processing.ipynb
│ ├── trip_eda.ipynb
│ ├── trip_cluster_NP.ipynb
├── Analysis 2
│ ├── taxi_data_processing.ipynb
│ ├── taxi_eda.ipynb
│ ├── taxi_cluster.ipynb
├── Presentation
│ ├── Taxi_Management_Presentation.pdf
- Selected features: Pickup & Dropoff Coordinates, Period Start (morning rush, midday, evening, etc.), Is Weekend, Trip Total (payment)
- One-hot encode categorical features
- GMM and K-Means preferred over HDBSCAN and hierarchical clustering for better computational efficiency
- Final model: GMM with k=5, selected by BIC score after tuning parameters and initializations
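
The feature preparation and BIC-based selection listed above could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code from the notebooks: the column names, the hour boundaries for each period, the `diag` covariance type, and the candidate range of k are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names for the numeric trip features.
NUMERIC = ["pickup_lat", "pickup_lon", "dropoff_lat", "dropoff_lon", "trip_total"]


def add_time_features(trips):
    """Derive a period-of-day label and a weekend flag from the start time."""
    out = trips.copy()
    hour = out["trip_start"].dt.hour
    # Assumed period boundaries; the project's actual bins may differ.
    out["period_start"] = pd.cut(
        hour,
        bins=[0, 6, 10, 16, 20, 24],
        right=False,
        labels=["late_night", "morning_rush", "midday", "evening_rush", "night"],
    ).astype(str)
    out["is_weekend"] = (out["trip_start"].dt.dayofweek >= 5).astype(int)
    return out


def build_matrix(features):
    """Scale numeric columns, one-hot encode the period, keep the weekend flag."""
    pre = ColumnTransformer(
        [
            ("num", StandardScaler(), NUMERIC),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["period_start"]),
            ("flag", "passthrough", ["is_weekend"]),
        ],
        sparse_threshold=0,  # always return a dense array for GaussianMixture
    )
    return pre.fit_transform(features)


def select_gmm_by_bic(X, k_values, random_state=0):
    """Fit one GMM per candidate k and return the one with the lowest BIC."""
    models = {
        k: GaussianMixture(
            n_components=k, covariance_type="diag", n_init=3, random_state=random_state
        ).fit(X)
        for k in k_values
    }
    bic_by_k = {k: model.bic(X) for k, model in models.items()}
    best_k = min(bic_by_k, key=bic_by_k.get)
    return best_k, models[best_k], bic_by_k


# Synthetic stand-in for the trip table (values are made up).
rng = np.random.default_rng(0)
n = 600
trips = pd.DataFrame(
    {
        "trip_start": pd.Timestamp("2023-09-01")
        + pd.to_timedelta(rng.integers(0, 7 * 96, n) * 15, unit="min"),
        "pickup_lat": rng.normal(41.88, 0.05, n),
        "pickup_lon": rng.normal(-87.63, 0.05, n),
        "dropoff_lat": rng.normal(41.88, 0.05, n),
        "dropoff_lon": rng.normal(-87.63, 0.05, n),
        "trip_total": rng.gamma(2.0, 10.0, n),
    }
)

features = add_time_features(trips)
X = build_matrix(features)
best_k, best_model, bic_by_k = select_gmm_by_bic(X, range(2, 8))
labels = best_model.predict(X)
print(best_k, {k: round(v, 1) for k, v in bic_by_k.items()})
```

Lower BIC is better, so the sketch keeps the candidate with the minimum score. On random synthetic data the chosen k carries no meaning; the k=5 result reported above comes from the real trip data.
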

- Pickups and dropoffs exclusively downtown; uniformly low fares => short trips
- Peak activity during business hours
- Insights: Offer subscription services and loyalty programs; Optimize taxi availability

- City-wide coverage, urban routes and airport commutes
- Off-peak hours: early morning & late night
- Wide fare range, high median => diverse trip lengths
- Insights: Collaborate with businesses such as hotels, airlines, and night venues; potential for dynamic pricing

- Common routes to and from downtown, short trips (low fare); Midday and evening hours
- Insights: Likely non-commute travel (leisure or errands); targeted ads for urban activities (shopping, leisure outings)

- Mixed distance trips
- Common routes between downtown and the airports (O’Hare, and Garfield Ridge, the community area around Midway); afternoon and rush hours
- Insights: Explore Midway airport as an underserved market; offer competitive pricing for airport trips

- Exclusively O’Hare pickups, mainly headed downtown
- High demand in afternoon and evening rush hour
- Insights: Ensure availability during peak times; Offer competitive flat rates to compete with ride-sharing services
- Aggregated trip data into profiles for 2,864 taxis
- Derived taxi attributes:
  - For clustering model: (median) Idle Time, Daily Use Rate, Trips per Day, Average Speed, Daily Revenue, Tips Proportion.
  - For later analysis: distribution of trips by pickup area, time of day, and payment type
- Implemented K-Means, GMM, Agglomerative Hierarchical, and HDBSCAN, with varying parameter and initialization tests.
- Baseline model:
  - RobustScaler() for feature scaling due to outliers; GMM with k=4 identifies the most distinct clusters.
  - Adjustment: removed outliers to address imbalanced clusters, and excluded the Tips Proportion feature because it showed no variance.
- Final model:
  - StandardScaler(), as the features become closer to normally distributed after the adjustments
  - GMM with k=4, selected based on BIC, Silhouette scores, and testing different values of k
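
The taxi-level workflow above (aggregate trips per taxi, remove outliers, scale, then compare k by BIC and Silhouette score) could look roughly like the sketch below. It is a hedged illustration on synthetic data: the column names, the simplified feature definitions (only three of the listed metrics), and the 1.5 × IQR outlier rule are assumptions, since the outlier criterion actually used is not described here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def build_taxi_profiles(trips):
    """Aggregate trip rows into one row per taxi (simplified feature definitions)."""
    trips = trips.assign(date=trips["trip_start"].dt.date)
    daily = trips.groupby(["taxi_id", "date"]).agg(
        trips=("trip_total", "size"),
        revenue=("trip_total", "sum"),
        miles=("trip_miles", "sum"),
        seconds=("trip_seconds", "sum"),
    )
    profiles = daily.groupby(level="taxi_id").agg(
        trips_per_day=("trips", "mean"),
        daily_revenue=("revenue", "mean"),
        miles=("miles", "sum"),
        seconds=("seconds", "sum"),
    )
    profiles["avg_speed_mph"] = profiles["miles"] / (profiles["seconds"] / 3600)
    return profiles[["trips_per_day", "daily_revenue", "avg_speed_mph"]]


def drop_iqr_outliers(df, factor=1.5):
    """Keep rows inside [Q1 - factor*IQR, Q3 + factor*IQR] on every column (assumed rule)."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    keep = ((df >= q1 - factor * iqr) & (df <= q3 + factor * iqr)).all(axis=1)
    return df[keep]


def score_gmm(X, k_values, random_state=0):
    """Return BIC (lower is better) and Silhouette (higher is better) per k."""
    rows = []
    for k in k_values:
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=random_state).fit(X)
        labels = gmm.predict(X)
        rows.append({"k": k, "bic": gmm.bic(X), "silhouette": silhouette_score(X, labels)})
    return pd.DataFrame(rows).set_index("k")


# Synthetic stand-in for the trip table: 60 taxis over 10 days (values are made up).
rng = np.random.default_rng(1)
n = 6000
taxi_id = rng.integers(0, 60, n)
miles = rng.gamma(2.0, 2.0, n)
speed = np.where(taxi_id % 2 == 0, 12.0, 28.0) + rng.normal(0, 1.5, n)
trips = pd.DataFrame(
    {
        "taxi_id": taxi_id,
        "trip_start": pd.Timestamp("2023-09-01")
        + pd.to_timedelta(rng.integers(0, 10 * 96, n) * 15, unit="min"),
        "trip_miles": miles,
        "trip_seconds": miles / speed * 3600,
        "trip_total": 3.0 + 2.5 * miles + rng.normal(0, 1.0, n),
    }
)

profiles = build_taxi_profiles(trips)
clean = drop_iqr_outliers(profiles)
X = StandardScaler().fit_transform(clean)
scores = score_gmm(X, range(2, 6))
print(scores)
```

BIC and Silhouette do not always agree on the best k, which is one reason to inspect both alongside the cluster sizes before settling on a final value.
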
We compared clusters against 5 performance metrics and analyzed additional features to derive the final conclusions for each cluster. For detailed results, please refer to the Presentation.



- Characteristics:
  - Top earner, high efficiency, steady demand
  - Broad coverage beyond typical hotspots
  - High usage of "Other" payment methods
- Suggestions:
  - Market wide coverage to attract diverse riders
  - Explore "Other" payments for new revenue channels
- Characteristics:
  - Short trips at slower speeds; steady service demand downtown
  - Active during rush and business hours
- Suggestions:
  - Better route optimization to avoid congested areas
  - Offer perks for trips during less busy hours
  - Expand coverage to other high-demand city areas
- Characteristics:
  - Primarily airport trips with peak evening activity, which explains the longer routes and faster speeds
  - Low trip count and high idle time suggest downtime issues
  - Preference for credit card payments
- Suggestions:
  - Adjust taxi scheduling to match flight arrival times and target peak airport demand
  - Add courier services during downtime
  - Partner with hotels and airlines for airport pick-ups
- Characteristics:
  - Balance airport and city trips but fail to optimize either => high idle time and low earnings
  - Effective service but poor demand capture => infrequent trips
- Suggestions:
  - Demand analysis: understand the reasons for low trip counts
  - Adopt high-performance practices to reduce idle time