NYC Traffic Crashes — Technical Analysis

01 — Exploratory Analysis

2 MILLION
NYC CRASHES

From 2012 to 2023, NYC recorded over 2 million motor vehicle collisions. The data reveals clear temporal patterns, contributing factors, and injury severity trends.

Annual Crash Frequency 2012–2023

Source: NYC Open Data · 2,018,963 collisions

Crashes by Hour of Day

5:00 PM peak with 132,458 crashes · Morning rush sees fewer crashes

Crashes by Day of Week

Friday leads with 327,042 crashes · Saturday is deadliest
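
The hourly and day-of-week counts above can be reproduced with a groupby on the parsed timestamp. A minimal sketch on a tiny made-up sample: the column names "CRASH DATE" and "CRASH TIME" are assumed to match the NYC Open Data export, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumed, not confirmed by the source.
crashes = pd.DataFrame({
    "CRASH DATE": ["01/05/2018", "01/05/2018", "01/06/2018", "01/08/2018"],
    "CRASH TIME": ["17:05", "17:40", "02:15", "08:30"],
})

# Combine date and time into one timestamp, then derive hour and weekday.
timestamp = pd.to_datetime(
    crashes["CRASH DATE"] + " " + crashes["CRASH TIME"],
    format="%m/%d/%Y %H:%M",
)
crashes["hour"] = timestamp.dt.hour
crashes["weekday"] = timestamp.dt.day_name()

# Count crashes per hour and per weekday.
crashes_by_hour = crashes.groupby("hour").size()
crashes_by_weekday = crashes.groupby("weekday").size()

# The hour with the most crashes in this sample.
peak_hour = crashes_by_hour.idxmax()
```

On the full dataset the same two groupby calls would yield the 24-bar and 7-bar charts.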

Top 15 Contributing Factors

Driver distraction is the leading specified factor at 19.9% · 34.3% are "Unspecified"
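
Factor shares like these are a normalized value count. A small sketch with invented labels (the category strings are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical contributing-factor labels for ten crashes.
factors = pd.Series([
    "Unspecified", "Unspecified", "Unspecified",
    "Driver Inattention/Distraction", "Driver Inattention/Distraction",
    "Failure to Yield Right-of-Way", "Following Too Closely",
    "Unsafe Speed", "Backing Unsafely", "Other Vehicular",
])

# Share of each factor as a percentage of all crashes, largest first.
share_pct = factors.value_counts(normalize=True).mul(100).round(1)

# Leading factor once the "Unspecified" bucket is set aside.
top_specified = share_pct.drop("Unspecified").idxmax()
```

Reporting the share with and without the "Unspecified" bucket makes clear which ranking a headline number refers to.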

Injury Severity Distribution

Most crashes cause property damage only · Fatalities rare but tragic

🚗

Rush Hour Carnage

5:00 PM sees 132,458 crashes, more than any other hour of the day. Evening rush hour combines tired drivers, heavy traffic, and low sun angles into a perfect storm of collisions.

📱

The Distraction Epidemic

Nearly 20% of all crashes cite driver distraction, mostly phones and electronics. Despite NYC laws, distracted driving remains the leading specified contributing factor in collisions.

🌃

Saturday Night is Deadliest

While Friday has the most crashes, Saturday records the most deaths. Alcohol, speed, and reckless driving peak on weekend nights, when enforcement drops and dangerous behavior spikes.

02 — Supervised Learning

PREDICTING
CASUALTIES

Four regression models were fit on 452,283 cleaned records (80/20 train-test split) to predict total casualties (injured + killed). Random Forest achieved the best performance, though its low R² suggests casualties are largely unpredictable from basic crash features.

Linear Regression

R² Score

0.0062

RMSE

0.5891

Ridge Regression

R² Score

0.0062

Best Alpha

1.000

Lasso Regression

R² Score

0.0060

Features Kept

8 of 8

Random Forest

R² Score

0.0296

RMSE

0.5817

Feature Importance — What Predicts Casualties?

Vehicle type leads at 23.5% · Contributing factor (18.0%) and hour of day (16.6%) follow
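
Importance percentages of this kind can be read from a fitted forest. The sketch below uses scikit-learn's impurity-based feature_importances_ on synthetic data; whether the original analysis used this attribute is an assumption, and the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder feature names and synthetic integer-coded features.
feature_names = ["vehicle_type", "contributing_factor", "hour"]
X = rng.integers(0, 24, size=(500, 3)).astype(float)

# Synthetic target that depends mainly on the first column.
y = (X[:, 0] > 12).astype(float) + rng.normal(0, 0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1; express them as percentages.
importance_pct = dict(
    zip(feature_names, (forest.feature_importances_ * 100).round(1))
)
ranked = sorted(importance_pct.items(), key=lambda kv: kv[1], reverse=True)
```

Impurity-based importances tend to favor features with many distinct values, so a ranking like this is best read as a rough guide.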

Top 10 Most Dangerous Vehicle Types

Sedans lead in total crashes, but motorcycles have highest injury rate

Model Comparison — R² Scores

Low R² (< 0.03) suggests casualties are fundamentally unpredictable from basic features alone
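
For context on what an R² near zero means: a model that always predicts the mean casualty count scores exactly R² = 0, and its RMSE equals the standard deviation of the target. A short sketch with made-up casualty counts:

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse

# Hypothetical casualty counts: most crashes have none.
y = [0, 0, 0, 1, 0, 2, 0, 0]

# Baseline "model": always predict the mean.
baseline = [np.mean(y)] * len(y)
r2_base, rmse_base = r2_and_rmse(y, baseline)
```

An R² of 0.0296 therefore means the best model removes about 3% of the variance that this constant baseline leaves unexplained.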

03 — Unsupervised Learning

4 TYPES OF
NYC CRASHES

K-Means clustering (K=4) on crash characteristics reveals distinct collision profiles. PCA explains 52% of variance in 2 dimensions.

K-Means Cluster Visualization (PCA 2D)

452,283 crashes mapped to 2 principal components · 52% variance explained

198,456 crashes · 43.9% of total

Rush Hour
Sedan Collisions

Weekday afternoon crashes involving passenger sedans. Peak at 5 PM. Driver inattention and following too closely dominate. Low injury rate due to slower speeds in congested traffic.

162,334 crashes · 35.9% of total

Intersection
Multi-Vehicle

Urban intersection crashes at midday with 3+ vehicles. Failure to yield and traffic signal violations. Moderate injury severity. Highest frequency in Manhattan and Queens.

54,129 crashes · 12.0% of total

Late Night
High Severity

Overnight crashes (12 AM – 4 AM) with alcohol involvement. Higher speed impacts. Elevated fatality rate. More common on weekends. Driver impairment and unsafe speed cited frequently.

37,364 crashes · 8.3% of total

Pedestrian &
Cyclist Strikes

Vulnerable road user incidents. Highest injury-per-crash rate. Driver distraction and failure to yield to pedestrians dominate. Concentrated near schools, parks, and transit hubs.
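
A quick arithmetic check on the four cluster sizes reported above: they should sum to the 452,283 cleaned records, and each share follows from count / total.

```python
# Cluster sizes as reported in this section.
cluster_sizes = {
    "Rush Hour Sedan Collisions": 198_456,
    "Intersection Multi-Vehicle": 162_334,
    "Late Night High Severity": 54_129,
    "Pedestrian & Cyclist Strikes": 37_364,
}

total = sum(cluster_sizes.values())

# Percentage share of each cluster, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in cluster_sizes.items()}
```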

Elbow Curve — Optimal K Selection

Inertia drops sharply from K=2 to K=4, then flattens, which supports K=4 as a reasonable choice
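
An elbow curve is the K-Means inertia (within-cluster sum of squares) plotted against K. A minimal sketch on synthetic data with four well-separated blobs; the range of K values and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: four well-separated 2-D blobs of 100 points each.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each K and record the inertia.
inertia_by_k = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia_by_k[k] = km.inertia_

# Improvement gained by adding one more cluster.
drops = {k: inertia_by_k[k - 1] - inertia_by_k[k] for k in range(2, 9)}
```

The elbow is the K after which the drops become small. It is a visual heuristic, so a silhouette score or similar check is a useful complement.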

04 — Methodology

HOW WE
DID THIS

A complete supervised and unsupervised learning pipeline built for Georgia Tech's regression analysis curriculum.

Data Acquisition & Cleaning

Downloaded 2,018,963 NYC crash records (2012–2023) from NYC Open Data via Kaggle. Removed records with missing critical fields (location, time, casualties). Standardized vehicle types and contributing factors. Final clean dataset: 452,283 records.

Pandas NumPy NYC Open Data 452,283 Clean Records
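
The cleaning step might look roughly like the sketch below. The column names are assumed to follow the NYC Open Data export, the rows are invented, and the exact set of "critical fields" used in the original analysis is not known.

```python
import pandas as pd

# Hypothetical raw rows with gaps and inconsistent labels.
raw = pd.DataFrame({
    "CRASH DATE": ["01/05/2018", "01/06/2018", None, "01/08/2018"],
    "CRASH TIME": ["17:05", "02:15", "09:00", "08:30"],
    "BOROUGH": ["BROOKLYN", None, "QUEENS", "manhattan "],
    "NUMBER OF PERSONS INJURED": [1, 0, 2, 0],
    "NUMBER OF PERSONS KILLED": [0, 0, 0, 1],
    "VEHICLE TYPE CODE 1": ["Sedan", "sedan", "Taxi", " SEDAN"],
})

# Assumed list of critical fields: location, time, casualties.
required = [
    "CRASH DATE", "CRASH TIME", "BOROUGH",
    "NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED",
]

# Drop rows missing any critical field.
clean = raw.dropna(subset=required).copy()

# Standardize free-text categories: trim whitespace, unify case.
for col in ["BOROUGH", "VEHICLE TYPE CODE 1"]:
    clean[col] = clean[col].str.strip().str.upper()
```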

Feature Engineering & Encoding

Created target variable TOTAL_CASUALTIES = injured + killed. Label-encoded categorical features: vehicle type, contributing factor, borough. Extracted temporal features: hour, day of week, month. Selected 8 predictive features.

LabelEncoder Datetime Parsing 80/20 Train-Test Split 8 Features
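
A sketch of the steps above on a five-row toy frame. The column names "injured", "killed", and "timestamp" are placeholders, and only four of the eight features are shown. The last lines check that an 80/20 split of 452,283 records gives the 361,826 training samples quoted in the next step.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical cleaned rows; column names are placeholders.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-01-05 17:05", "2018-01-06 02:15", "2018-01-08 08:30",
        "2018-02-10 23:45", "2018-03-14 12:00",
    ]),
    "BOROUGH": ["BROOKLYN", "QUEENS", "MANHATTAN", "BRONX", "QUEENS"],
    "injured": [1, 0, 2, 0, 1],
    "killed": [0, 0, 0, 1, 0],
})

# Target variable: injured + killed.
df["TOTAL_CASUALTIES"] = df["injured"] + df["killed"]

# Temporal features.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["month"] = df["timestamp"].dt.month

# Label encoding maps each category to an integer code.
df["borough_code"] = LabelEncoder().fit_transform(df["BOROUGH"])

X = df[["hour", "day_of_week", "month", "borough_code"]]
y = df["TOTAL_CASUALTIES"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Split sizes for the full dataset: the test size is rounded up.
n_records = 452_283
n_test = math.ceil(0.2 * n_records)
n_train = n_records - n_test
```

Label encoding assigns arbitrary integer order to categories, which linear models read as a numeric scale; one-hot encoding is a common alternative when that matters.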

Supervised Learning — Regression Models

Trained Linear Regression, Ridge, and Lasso (α chosen by cross-validation for both), plus a Random Forest, on 361,826 training samples (the 80% split of 452,283). Evaluated with RMSE and R². Random Forest also provided feature importance rankings and was the best model, with R² = 0.0296 (casualties largely unpredictable).

LinearRegression RidgeCV LassoCV RandomForestRegressor R² = 0.0296
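
A compact version of this comparison, run on synthetic data with eight features. The alpha grid, forest size, and random seeds are illustrative choices, not the settings used in the original analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 8 features, weak signal, heavy noise.
X = rng.normal(size=(1000, 8))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]),
    "Lasso Regression": LassoCV(cv=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

# Alpha selected by cross-validation, and features Lasso did not zero out.
best_alpha = models["Ridge Regression"].alpha_
features_kept = int(np.sum(models["Lasso Regression"].coef_ != 0))
```

The "Best Alpha" and "Features Kept" cards correspond to alpha_ on the fitted RidgeCV and the count of nonzero Lasso coefficients.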

Unsupervised Learning — K-Means Clustering

StandardScaler normalization on 6 clustering features (hour, day, borough, vehicle type, factor, casualties). Elbow curve analysis supported K=4 as a reasonable choice. PCA reduced to 2 dimensions for visualization (52% variance explained). Cluster profiles interpreted manually.

KMeans StandardScaler PCA Elbow Method 4 Clusters Found
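
The clustering pipeline can be sketched as scale, cluster, then project for plotting. The data here is random noise standing in for the six features, so the clusters and variance figure carry no meaning; only the sequence of calls is the point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Random stand-in for the 6 clustering features.
X = rng.normal(size=(400, 6))

# 1. Put every feature on the same scale (zero mean, unit variance).
X_scaled = StandardScaler().fit_transform(X)

# 2. Assign each row to one of 4 clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=1)
labels = kmeans.fit_predict(X_scaled)

# 3. Project to 2 principal components for the scatter plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

# Fraction of total variance kept by the 2-D projection.
variance_explained = float(pca.explained_variance_ratio_.sum())

# Number of rows in each cluster.
cluster_sizes = np.bincount(labels, minlength=4)
```

Clustering runs on all six scaled features; PCA is used only for display, so a 2-D plot that keeps about half the variance can make clusters look more overlapped than they are.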

Tech Stack & Infrastructure

Analysis run on Mac (Apple Silicon) with Python 3.9 venv. Data downloaded via Kaggle API. Python data science stack in Jupyter-style workflow. Website built in pure HTML/CSS/JS.

Python 3.9 scikit-learn Plotly Kaggle API Mac Analysis

WHEN AND WHERE DO PEOPLE GET HIT?
