Technical Deep Dive · Investigation #003
NYC Open Data · Machine Learning · 2012–2023

WHEN AND WHERE DO PEOPLE GET HIT?

2 million NYC crash reports analyzed with machine learning reveal the exact hour and conditions when New York's roads become most dangerous.

2.0M Crashes Analyzed
3.0% R² Score
11 Years of Data
610,815 People Injured
2,923 People Killed
5:00 PM Peak Crash Hour
19.9% Driver Distraction
4 Crash Clusters Found
01 — Exploratory Analysis

2 MILLION NYC CRASHES

From 2012 to 2023, NYC recorded over 2 million motor vehicle collisions. The data reveals clear temporal patterns, contributing factors, and injury severity trends.

Annual Crash Frequency 2012–2023
Source: NYC Open Data · 2,018,963 collisions
Crashes by Hour of Day
5:00 PM peak with 132,458 crashes · Morning rush less dangerous
Crashes by Day of Week
Friday leads with 327,042 crashes · Saturday is deadliest
Top 15 Contributing Factors
Driver distraction leads at 19.9% · 34.3% are "Unspecified"
Injury Severity Distribution
Most crashes cause property damage only · Fatalities rare but tragic
🚗
Rush Hour Carnage
5:00 PM sees 132,458 crashes, the highest count of any hour. Evening rush hour combines tired drivers, heavy traffic, and low sun angles into a perfect storm of collisions.
📱
The Distraction Epidemic
Nearly 20% of all crashes (19.9%) cite driver distraction, including phones and other electronics. Despite NYC laws, distracted driving remains the most frequently cited specific factor; only the "Unspecified" category (34.3%) is larger.
🌃
Saturday Night is Deadliest
While Friday has the most crashes (327,042), Saturday has the most deaths. Alcohol, speed, and reckless driving peak on weekend nights, when enforcement drops.
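The hour-of-day and day-of-week counts above come down to parsing each crash timestamp and tallying. A minimal sketch of that aggregation follows, using only the standard library; the column names (`CRASH DATE`, `CRASH TIME`), their formats, and the four sample rows are assumptions for illustration, not the project's actual code or data.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up sample rows; column names and formats are assumed.
rows = [
    {"CRASH DATE": "09/15/2023", "CRASH TIME": "17:05"},
    {"CRASH DATE": "09/15/2023", "CRASH TIME": "17:40"},
    {"CRASH DATE": "09/16/2023", "CRASH TIME": "2:15"},
    {"CRASH DATE": "09/18/2023", "CRASH TIME": "8:30"},
]

def parse_timestamp(row):
    """Combine the date and time columns into one datetime."""
    text = f'{row["CRASH DATE"]} {row["CRASH TIME"]}'
    return datetime.strptime(text, "%m/%d/%Y %H:%M")

timestamps = [parse_timestamp(row) for row in rows]

# Tally crashes per hour (0-23) and per weekday name.
by_hour = Counter(ts.hour for ts in timestamps)
by_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)

peak_hour, peak_count = by_hour.most_common(1)[0]
print(f"Peak hour: {peak_hour}:00 with {peak_count} crashes")
print(dict(by_weekday))
```

Counting fatalities per weekday instead of crashes would use the same loop with the killed count as the weight, which is how a day can lead in deaths without leading in crashes.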
02 — Supervised Learning

PREDICTING CASUALTIES

Four regression models were fit to the 452,283 cleaned records (361,826 for training after an 80/20 split) to predict total casualties (injured + killed). Random Forest performed best, though its low R² of 0.0296 suggests casualties are largely unpredictable from basic crash features.
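To make the two metrics concrete: R² compares a model's squared error with that of always predicting the mean, so an R² of 0.0296 means the model removes about 3% of that baseline error. The sketch below computes both metrics by hand on made-up casualty counts; the sample values are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up casualties per crash: mostly zero, as in the real data.
y_true = [0, 0, 1, 0, 2, 0, 1, 0]

# Baseline "model": always predict the mean casualty count.
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)

print("baseline R2:", r2_score(y_true, mean_prediction))
print("baseline RMSE:", rmse(y_true, mean_prediction))
print("perfect R2:", r2_score(y_true, y_true))
```

A model scoring close to the baseline line here is doing little better than guessing the average, which is the situation the reported scores describe.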

Linear Regression · R² Score 0.0062 · RMSE 0.5891
Ridge Regression · R² Score 0.0062 · Best Alpha 1.000
Lasso Regression · R² Score 0.0060 · Features Kept 8 of 8
Random Forest · R² Score 0.0296 · RMSE 0.5817
Feature Importance — What Predicts Casualties?
Vehicle type leads at 23.5% · Contributing factor (18.0%) and hour of day (16.6%) follow
Top 10 Most Dangerous Vehicle Types
Sedans lead in total crashes, but motorcycles have highest injury rate
Model Comparison — R² Scores
Low R² (< 0.03) suggests casualties are fundamentally unpredictable from basic features alone
03 — Unsupervised Learning

4 TYPES OF NYC CRASHES

K-Means clustering (K=4) on crash characteristics reveals distinct collision profiles. PCA explains 52% of variance in 2 dimensions.

K-Means Cluster Visualization (PCA 2D)
452,283 crashes mapped to 2 principal components · 52% variance explained
Cluster 0 · Rush Hour Sedan Collisions · 198,456 crashes · 43.9% of total
Weekday afternoon crashes involving passenger sedans. Peak at 5 PM. Driver inattention and following too closely dominate. Low injury rate due to slower speeds in congested traffic.
Cluster 1 · Intersection Multi-Vehicle · 162,334 crashes · 35.9% of total
Urban intersection crashes at midday with 3+ vehicles. Failure to yield and traffic signal violations. Moderate injury severity. Highest frequency in Manhattan and Queens.
Cluster 2 · Late Night High Severity · 54,129 crashes · 12.0% of total
Overnight crashes (12 AM – 4 AM) with alcohol involvement. Higher speed impacts. Elevated fatality rate. More common on weekends. Driver impairment and unsafe speed cited frequently.
Cluster 3 · Pedestrian & Cyclist Strikes · 37,364 crashes · 8.3% of total
Vulnerable road user incidents. Highest injury-per-crash rate. Driver distraction and failure to yield to pedestrians dominate. Concentrated near schools, parks, and transit hubs.
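As a quick arithmetic check, the four cluster sizes reported above should add up to the 452,283 clean records and reproduce the stated percentages. The sketch below does that calculation; the dictionary keys are just the cluster numbers from this page.

```python
# Cluster sizes as reported on this page.
cluster_sizes = {0: 198_456, 1: 162_334, 2: 54_129, 3: 37_364}

total = sum(cluster_sizes.values())

# Share of all clustered crashes, as a percentage with one decimal.
shares = {k: round(100 * n / total, 1) for k, n in cluster_sizes.items()}

print("total:", total)
for cluster, share in shares.items():
    print(f"cluster {cluster}: {share}%")
```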
Elbow Curve — Optimal K Selection
Inertia drops sharply from K=2 to K=4, then flattens, which points to K=4 as the elbow
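The elbow method itself is simple to sketch: run K-Means for several values of K, record the inertia (the sum of squared distances from each point to its nearest centroid), and look for the K after which the improvement flattens. The example below is a dependency-free illustration on 16 synthetic points arranged in four tight groups. It uses a deterministic farthest-point initialization chosen here for reproducibility, which is an assumption and not how the project or scikit-learn initializes K-Means.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic start: repeatedly add the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers

def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the final inertia."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Synthetic data: four tight groups of four points each.
group_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [
    (cx + dx, cy + dy)
    for cx, cy in group_centers
    for dx in (-1, 1)
    for dy in (-1, 1)
]

inertias = {k: kmeans_inertia(points, k) for k in range(2, 6)}
for k, value in inertias.items():
    print(f"K={k}: inertia={value:.1f}")
```

On this toy data the inertia falls steeply up to K=4 and barely improves at K=5, which is the shape an elbow plot is read for. On real crash data the bend is usually less clean, so the choice of K remains a judgment call.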
04 — Methodology

HOW WE DID THIS

A complete supervised and unsupervised learning pipeline built for Georgia Tech's regression analysis curriculum.

01

Data Acquisition & Cleaning

Downloaded 2,018,963 NYC crash records (2012–2023) from NYC Open Data via Kaggle. Removed records with missing critical fields (location, time, casualties). Standardized vehicle types and contributing factors. Final clean dataset: 452,283 records.

Pandas · NumPy · NYC Open Data · 452,283 Clean Records
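The cleaning step can be sketched as two small functions: one that rejects records with a missing critical field, and one that normalizes vehicle type labels. The version below uses plain Python rather than Pandas so it runs anywhere; the field names, the alias table, and the sample records are all hypothetical stand-ins, not the project's actual schema or mapping.

```python
# Hypothetical field names for the "critical" columns.
CRITICAL_FIELDS = ("BOROUGH", "CRASH TIME", "INJURED", "KILLED")

# Illustrative alias table; the real mapping is not given in the source.
VEHICLE_ALIASES = {"4 DR SEDAN": "SEDAN", "SEDAN": "SEDAN", "TAXI": "TAXI"}

def is_complete(record):
    """True when every critical field is present and non-blank."""
    for field in CRITICAL_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return False
    return True

def standardize_vehicle(raw):
    """Trim, upper-case, and map known aliases to one label."""
    key = raw.strip().upper()
    return VEHICLE_ALIASES.get(key, key)

# Made-up records; the second one lacks a borough and is dropped.
raw_records = [
    {"BOROUGH": "QUEENS", "CRASH TIME": "17:05", "INJURED": "1",
     "KILLED": "0", "VEHICLE TYPE": "4 dr sedan"},
    {"BOROUGH": "", "CRASH TIME": "8:30", "INJURED": "0",
     "KILLED": "0", "VEHICLE TYPE": "Taxi"},
    {"BOROUGH": "BROOKLYN", "CRASH TIME": "2:15", "INJURED": "0",
     "KILLED": "0", "VEHICLE TYPE": " sedan "},
]

clean = [
    {**record, "VEHICLE TYPE": standardize_vehicle(record["VEHICLE TYPE"])}
    for record in raw_records
    if is_complete(record)
]

print(len(raw_records), "raw ->", len(clean), "clean")
```

With Pandas, the completeness filter would typically be a `dropna` on the same subset of columns, and the standardization a string cleanup followed by a mapping.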
02

Feature Engineering & Encoding

Created target variable TOTAL_CASUALTIES = injured + killed. Label-encoded the categorical features (vehicle type, contributing factor, borough). Extracted temporal features (hour, day of week, month). Selected 8 predictive features.

LabelEncoder · Datetime Parsing · 80/20 Train-Test Split · 8 Features
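The feature-engineering steps above can be sketched in a few lines. The example below builds the target, label-encodes one categorical column by sorted class order (the same ordering scikit-learn's `LabelEncoder` uses), extracts the temporal features, and checks the 80/20 split arithmetic. The record layout and field names are assumptions, and the split size assumes the test share is rounded up, as scikit-learn's `train_test_split` does.

```python
import math
from datetime import datetime

# Made-up records with hypothetical field names.
records = [
    {"timestamp": "2023-09-15 17:05", "borough": "QUEENS", "injured": 1, "killed": 0},
    {"timestamp": "2023-09-16 02:15", "borough": "BROOKLYN", "injured": 2, "killed": 1},
    {"timestamp": "2023-09-18 08:30", "borough": "MANHATTAN", "injured": 0, "killed": 0},
]

def label_encode(values):
    """Map each distinct value to an integer, in sorted class order."""
    mapping = {label: i for i, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

borough_codes, borough_mapping = label_encode([r["borough"] for r in records])

features = []
for record, code in zip(records, borough_codes):
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M")
    features.append({
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "borough": code,
        "TOTAL_CASUALTIES": record["injured"] + record["killed"],
    })

# 80/20 split sizes for the 452,283 clean records.
n_records = 452_283
n_test = math.ceil(0.2 * n_records)
n_train = n_records - n_test
print(n_train, n_test)
```

Label encoding assigns arbitrary integers to categories, which tree models handle well but linear models read as an ordering; one-hot encoding is a common alternative when that matters.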
03

Supervised Learning — Regression Models

Trained Linear Regression, Ridge (α chosen by cross-validation), Lasso (α chosen by cross-validation), and Random Forest on 361,826 training samples. Evaluated each with RMSE and R² on the held-out 20%. Random Forest also provided the feature importance rankings. Best model R² = 0.0296 (casualties largely unpredictable).

LinearRegression · RidgeCV · LassoCV · RandomForestRegressor · R² = 0.0296
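A minimal sketch of this model comparison, using the four scikit-learn estimators named above, might look like the following. It runs on synthetic, noise-dominated data rather than the crash records, and the alpha grid, tree count, tree depth, and random seeds are assumptions chosen to keep the example fast, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 integer-coded features, target is mostly noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 24, size=(2000, 8)).astype(float)
y = rng.poisson(0.2 + 0.01 * X[:, 0]).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": RidgeCV(alphas=[0.1, 1.0, 10.0]),  # assumed alpha grid
    "Lasso": LassoCV(cv=5, random_state=0),
    "Random Forest": RandomForestRegressor(
        n_estimators=50, max_depth=6, random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

# Impurity-based importances from the fitted forest; they sum to 1.
importances = models["Random Forest"].feature_importances_

for name, scores in results.items():
    print(f"{name}: R2={scores['r2']:.4f} RMSE={scores['rmse']:.4f}")
```

Because the synthetic target is nearly independent of the features, all four models should land near R² = 0, which mirrors the pattern reported for the real data.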
04

Unsupervised Learning — K-Means Clustering

Standardized 6 clustering features (hour, day, borough, vehicle type, factor, casualties) with StandardScaler. Elbow curve analysis pointed to K=4. PCA reduced the data to 2 dimensions for visualization (52% variance explained). Cluster profiles were interpreted manually.

KMeans · StandardScaler · PCA · Elbow Method · 4 Clusters Found
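The clustering pipeline described here chains three scikit-learn steps: scale, cluster, then project to 2D for plotting. The sketch below shows that chain on random placeholder data with 6 columns; the data, the per-column spreads, `n_init`, and the seeds are assumptions, so the explained-variance figure it prints will not match the 52% reported for the crash data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder matrix: 500 rows, 6 features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) * np.array([6.0, 2.0, 1.5, 10.0, 12.0, 0.6])

# Put every feature on zero mean and unit variance so no column dominates.
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full 6-dimensional scaled space.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Project to 2 components purely for visualization.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
explained = float(pca.explained_variance_ratio_.sum())

print("cluster sizes:", np.bincount(labels, minlength=4))
print(f"variance explained by 2 components: {explained:.1%}")
```

Clustering before projecting, as done here, means the 2D scatter is only a view of the clusters; points that look overlapped in the plot can still be well separated in the original feature space.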
05

Tech Stack & Infrastructure

Analysis run on Mac (Apple Silicon) with Python 3.9 venv. Data downloaded via Kaggle API. Python data science stack in Jupyter-style workflow. Website built in pure HTML/CSS/JS.

Python 3.9 · scikit-learn · Plotly · Kaggle API · Mac Analysis