2 million NYC crash reports analyzed with machine learning reveal the exact hour and conditions when New York's roads become most dangerous.
From 2012 to 2023, NYC recorded over 2 million motor vehicle collisions. The data reveals clear temporal patterns, contributing factors, and injury severity trends.
Four regression models were built on 452,283 cleaned records (361,826 used for training) to predict total casualties (injured + killed). Random Forest performed best, but its low R² suggests that casualties are largely unpredictable from basic crash features.
K-Means clustering (K=4) on crash characteristics reveals four distinct collision profiles. The first two PCA components explain 52% of the variance.
A complete supervised and unsupervised learning pipeline built for Georgia Tech's regression analysis curriculum.
Downloaded 2,018,963 NYC crash records (2012–2023) from NYC Open Data via Kaggle. Removed records with missing critical fields (location, time, casualties). Standardized vehicle types and contributing factors. Final clean dataset: 452,283 records.
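The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names follow the public NYC collisions export as best recalled and are assumptions, BOROUGH stands in for "location", and the four-row frame is made-up sample data.

```python
import pandas as pd

# Made-up sample rows; column names are assumed, not taken from the project.
raw = pd.DataFrame({
    "CRASH DATE": ["01/15/2020", "01/15/2020", "02/10/2020", "03/02/2021"],
    "CRASH TIME": ["08:05", "17:40", "12:00", "23:15"],
    "BOROUGH": ["BROOKLYN", None, "QUEENS", "MANHATTAN"],
    "NUMBER OF PERSONS INJURED": [1, 0, None, 2],
    "NUMBER OF PERSONS KILLED": [0, 0, 0, 0],
    "VEHICLE TYPE CODE 1": ["sedan", "SEDAN", "Taxi", " Taxi "],
})

# Fields treated as critical here (an assumption): location, time, casualties.
CRITICAL = [
    "CRASH DATE", "CRASH TIME", "BOROUGH",
    "NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED",
]

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows missing any critical field and normalize vehicle type labels."""
    out = df.dropna(subset=CRITICAL).copy()
    # Collapse spelling variants such as "sedan" / "SEDAN" / " Taxi ".
    out["VEHICLE TYPE CODE 1"] = out["VEHICLE TYPE CODE 1"].str.strip().str.upper()
    return out.reset_index(drop=True)

cleaned = clean(raw)
```

Dropping on a fixed list of critical columns keeps the rule explicit, which makes a reduction like 2,018,963 to 452,283 records easier to audit.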
Created target variable TOTAL_CASUALTIES = injured + killed. Label-encoded categorical features: vehicle type, contributing factor, borough. Extracted temporal features: hour, day of week, month. Selected 8 predictive features.
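A small sketch of the feature engineering described above, using three made-up rows. Only TOTAL_CASUALTIES is a name from the project; the other column names (HOUR, DAY_OF_WEEK, MONTH, BOROUGH_CODE) and the raw column names are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample rows; raw column names are assumed.
df = pd.DataFrame({
    "CRASH DATE": ["01/15/2020", "07/04/2021", "12/25/2022"],
    "CRASH TIME": ["08:05", "17:40", "23:15"],
    "BOROUGH": ["BROOKLYN", "QUEENS", "BROOKLYN"],
    "NUMBER OF PERSONS INJURED": [1, 0, 2],
    "NUMBER OF PERSONS KILLED": [0, 0, 1],
})

# Target variable: injured + killed.
df["TOTAL_CASUALTIES"] = (
    df["NUMBER OF PERSONS INJURED"] + df["NUMBER OF PERSONS KILLED"]
)

# Temporal features from the combined date and time.
ts = pd.to_datetime(df["CRASH DATE"] + " " + df["CRASH TIME"],
                    format="%m/%d/%Y %H:%M")
df["HOUR"] = ts.dt.hour
df["DAY_OF_WEEK"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
df["MONTH"] = ts.dt.month

# Label encoding maps each category to an integer (classes sorted alphabetically).
encoder = LabelEncoder()
df["BOROUGH_CODE"] = encoder.fit_transform(df["BOROUGH"])
```

Label encoding imposes an arbitrary ordering on categories. Tree models such as Random Forest tolerate that, while linear models may read it as a numeric scale, which is worth keeping in mind when comparing the two.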
Trained Linear Regression, Ridge (cross-validated α), Lasso (cross-validated α), and Random Forest on 361,826 training samples. Evaluated each model with RMSE and R². Random Forest also provided feature importance rankings. Best model (Random Forest): R² = 0.0296, meaning the features explain about 3% of the variance in casualties.
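The model comparison could look something like the sketch below. It runs on synthetic data with 8 features, so the scores it produces say nothing about the real results. The 80/20 split is an inference (361,826 is about 80% of 452,283), and the α grid, fold count, and forest size are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 features, weak signal, mostly noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 0.3 * X[:, 0] + rng.normal(size=500)

# Assumed 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear": LinearRegression(),
    "Ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),  # alpha chosen by CV
    "Lasso": LassoCV(cv=5, random_state=0),           # alpha chosen by CV
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
    }

# Impurity-based importances, one value per feature, summing to 1.
importances = models["Random Forest"].feature_importances_
```

Reporting RMSE alongside R² is useful here: RMSE gives the typical error in casualty counts, while R² shows how little of the variation the features account for.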
Standardized 6 clustering features with StandardScaler (hour, day, borough, vehicle type, factor, casualties). Elbow curve analysis supported K=4. PCA reduced the data to 2 dimensions for visualization (52% of variance explained). Cluster profiles were interpreted manually.
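The clustering steps above can be sketched as follows, again on synthetic data with 6 columns standing in for the real features. The range of K values, n_init, and random seeds are assumptions. Random data has no real cluster structure, so the elbow and explained variance here will not match the project's figures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 6 clustering features, on different scales.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 6)) * [6, 2, 1.5, 3, 5, 0.5]

# Standardize so no feature dominates the Euclidean distances K-Means uses.
scaled = StandardScaler().fit_transform(features)

# Elbow curve: within-cluster sum of squares (inertia) for each candidate K.
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = float(km.inertia_)

# Final model with K=4, as in the analysis.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

# Project to 2 dimensions for plotting and check how much variance survives.
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)
explained = float(pca.explained_variance_ratio_.sum())
```

Plotting `inertias` against K gives the elbow curve, and a scatter of `coords` colored by `labels` gives the 2D cluster view.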
Analysis run on Mac (Apple Silicon) with Python 3.9 venv. Data downloaded via Kaggle API. Python data science stack in Jupyter-style workflow. Website built in pure HTML/CSS/JS.