RecordsReveal · Technical Deep Dive · Investigation #002
TMDB/IMDB · Machine Learning · 2026

WHAT MAKES A MOVIE MAKE MONEY?

4,803 films. Machine learning regression and clustering reveal the real formula behind box office success, and why horror filmmakers figured it out decades ago.

4,803 Films Analyzed
69.5% R² Score
4 Archetypes Found
1,001% Horror Average ROI
51% Feature Importance of Audience Buzz (vote_count)
27,548% "The Legend" Cluster ROI
3.5× June vs January Revenue
$302M Avg Animation Revenue
01 — Exploratory Analysis

WHAT THE DATA SHOWS

Distributions, correlations, and patterns across the 4,803 films in the TMDB/IMDB dataset, 3,157 of which have budget and revenue figures above $100K.

Mean Revenue by Release Month
June peaks at $205M · January averages $59M and September $62M · 3.5× June-to-January gap
Revenue vs Budget — Log Scale
3,157 films with budget + revenue both > $100K · log-transformed for model fitting
Mean Revenue and ROI by Genre
Animation leads raw revenue ($302M) · Horror leads ROI (1,001%) · Action worst ROI at 188%
📅
Release Month Matters
June releases average $205M against $59M for January releases, a 3.5× gap. May through July is the golden window; September ($62M) is Hollywood's dumping ground for films studios don't believe in. Studios also tend to schedule their biggest films for summer, so the gap is not purely a timing effect.
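The monthly comparison is a group-by-mean over release month. A minimal sketch on made-up toy records (the dictionary keys `release_month` and `revenue` are assumed names, and the values are not rows from the dataset):

```python
from collections import defaultdict
from statistics import mean

# Toy records for illustration only; these are not rows from the dataset.
films = [
    {"release_month": 6, "revenue": 300_000_000},
    {"release_month": 6, "revenue": 100_000_000},
    {"release_month": 1, "revenue": 80_000_000},
    {"release_month": 1, "revenue": 20_000_000},
]

def mean_revenue_by_month(records):
    """Group revenues by release month and average each group."""
    by_month = defaultdict(list)
    for row in records:
        by_month[row["release_month"]].append(row["revenue"])
    return {month: mean(values) for month, values in by_month.items()}

monthly = mean_revenue_by_month(films)
june_vs_january = monthly[6] / monthly[1]  # ratio of the two monthly means
```

With the reported means, the same ratio is $205M / $59M, or roughly 3.5.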
🎬
The Horror ROI Paradox
Horror earns 1,001% average ROI — the best of any genre — because budgets are tiny and audiences are loyal. Action films dominate screens but deliver only 188% ROI. The massive budgets destroy the margins.
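The exact ROI formula is not stated on this page. The sketch below assumes the common definition, (revenue - budget) / budget × 100, computed per film and then averaged. It also shows why an average of per-film ROIs can sit far above the ROI of the averaged totals when a few budgets are very small. All numbers are toy values.

```python
from statistics import mean

def roi_percent(budget, revenue):
    """Return on investment as a percentage of budget (assumed definition)."""
    return (revenue - budget) / budget * 100

# Toy (budget, revenue) pairs for illustration only.
films = [
    (1_000_000, 21_000_000),    # tiny budget, big return: 2,000% ROI
    (50_000_000, 100_000_000),  # large budget: 100% ROI
]

# Average of the per-film ROIs: dominated by the small-budget film.
mean_of_rois = mean(roi_percent(b, r) for b, r in films)

# ROI of the average film: dominated by the large-budget film.
roi_of_means = roi_percent(mean(b for b, _ in films), mean(r for _, r in films))
```

If ROI is averaged per film in this way, that would explain why the cluster ROI figures (for example 550% for a cluster averaging $7.9M budget and $21.3M revenue) are higher than the average budget and revenue alone would imply (about 170%).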
💬
Buzz Beats Budget
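vote_count accounts for 51% of feature importance in the revenue model, more than budget (22%), year (7%), and popularity (5%) combined. Films that generate conversation massively outperform those that don't. TMDB votes accumulate after release, so vote_count reflects audience engagement rather than pre-release hype.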
02 — Supervised Learning

PREDICTING BOX OFFICE REVENUE

Four regression models trained on 2,525 films (80/20 split) to predict log-transformed worldwide revenue. Random Forest achieved the best R² at 0.695.
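For reference, R² compares a model's squared error against that of always predicting the mean of the target. A minimal sketch on toy numbers (not the article's data):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy log-revenue values and predictions, for illustration only.
actual = [16.0, 17.0, 18.0, 19.0]
predicted = [16.5, 17.0, 18.0, 18.5]
score = r2_score(actual, predicted)
```

Read this way, an R² of 0.695 means the Random Forest's squared error on log revenue is about 30.5% of the mean-only baseline's.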

Model · R² Score · Notes
Linear Regression · 0.610 · Baseline model, log revenue target
Ridge Regression · 0.610 · L2 regularization, α via CV
Lasso Regression · 0.610 · L1 regularization, sparse solution
Random Forest · 0.695 · Captures non-linear relationships, best feature importance
Feature Importance — What Predicts Revenue?
vote_count dominates at 51% · budget 22% · everything else minor
Model Comparison — R² Scores
Random Forest outperforms linear models · non-linearity matters in film economics
Lasso Feature Coefficients — Log Revenue
Positive = increases predicted revenue · Negative = reduces predicted revenue · on log scale
03 — Unsupervised Learning

4 MOVIE ARCHETYPES

K-Means clustering (K=4) on standardized features reveals four natural film profiles. The first two principal components explain 67.3% of the variance. One cluster, containing just 6 films, sits far apart from the rest.

K-Means Cluster Visualization (PCA 2D)
3,157 films mapped to 2 principal components · 67.3% variance explained · Cluster 3 (The Legend) isolated top-left
Cluster 0 · The Grassroots Hit · 874 films · 27.7% of total
Budget: $7.9M avg · Revenue: $21.3M avg · ROI: 550% · Rating: 7.0/10
Low-budget films with strong word-of-mouth. Often comedies, foreign-language films, or niche genre entries that outperform their modest investment.
Cluster 1 · The Summer Blockbuster · 1,016 films · 32.2% of total
Budget: $71.9M avg · Revenue: $277M avg · ROI: 435% · Popularity: 56 avg
High-budget tentpole films with massive marketing budgets and franchise potential. Avatar, Titanic, The Avengers. The industry's core commercial engine.
Cluster 2 · The Middle Child · 1,261 films · 39.9% of total
Budget: $40.4M avg · Revenue: $71.1M avg · ROI: 126% · Rating: 6.0/10
The largest cluster. Mid-budget films with moderate performance — enough to make money, but not enough to define a studio's slate. Often sequels and franchise spinoffs past their peak.
Cluster 3 · The Legend · 6 films · 0.2% of total ⚡
Budget: $430K avg · Revenue: $110M avg · ROI: 27,548%
Bambi, American Graffiti, Mad Max. Made for almost nothing, earned everything. These outliers cannot be manufactured — but they prove the ceiling is unlimited when conditions align perfectly.
Elbow Curve — Optimal K Selection
Inertia drops sharply through K=4, then flattens, supporting K=4 as the number of clusters
04 — Methodology

HOW WE DID THIS

A complete supervised and unsupervised learning pipeline applied to the TMDB/IMDB movies dataset.

01

Data Acquisition & Cleaning

Downloaded 4,803 films from the Kaggle TMDB Movies Dataset (utkarshx27/movies-dataset). Filtered to 3,157 films with both budget and revenue above $100,000 to eliminate placeholder values. Extracted the primary genre from the JSON-encoded genre field.

Pandas · NumPy · TMDB Dataset · 3,157 Clean Records
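The cleaning step can be sketched without pandas. The keys `budget`, `revenue`, and `genres`, and the genre layout (a JSON list of objects with a `name` key, treating the first entry as the primary genre) are assumptions for illustration; the real schema may differ.

```python
import json

MIN_DOLLARS = 100_000  # threshold stated in the methodology

def primary_genre(genres_json):
    """Return the first listed genre name, or None if the list is empty.

    Assumes the field looks like '[{"id": 28, "name": "Action"}, ...]'.
    """
    genres = json.loads(genres_json)
    return genres[0]["name"] if genres else None

def clean(rows):
    """Keep rows whose budget and revenue both exceed the threshold."""
    kept = []
    for row in rows:
        if row["budget"] > MIN_DOLLARS and row["revenue"] > MIN_DOLLARS:
            kept.append({**row, "primary_genre": primary_genre(row["genres"])})
    return kept

# Toy rows for illustration only.
rows = [
    {"budget": 5_000_000, "revenue": 60_000_000,
     "genres": '[{"id": 27, "name": "Horror"}, {"id": 53, "name": "Thriller"}]'},
    {"budget": 0, "revenue": 1_200_000, "genres": "[]"},  # placeholder budget
]
cleaned = clean(rows)
```

The second toy row is dropped because its budget is a zero placeholder, which is the kind of value the $100,000 filter is meant to remove.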
02

Feature Engineering & Encoding

Log-transformed both budget and revenue (log_budget, log_revenue) to reduce the skew of their distributions. Extracted release_month and release_year from the date field. Encoded season (Spring/Summer/Fall/Winter) as an ordinal feature. Label-encoded the primary genre. Final feature set: 9 variables.

Log Transform · LabelEncoder · Date Extraction · 80/20 Train-Test Split · 9 Features
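A rough sketch of the log and date features for a single film. The natural log, the month-to-season boundaries, and the ordinal order of the seasons are all assumptions; the methodology does not specify them.

```python
import math
from datetime import date

# Assumed ordinal order for the season feature.
SEASON_ORDER = {"Winter": 0, "Spring": 1, "Summer": 2, "Fall": 3}

def season_of(month):
    """Map a calendar month (1-12) to a season name (assumed boundaries)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"

def engineer(budget, revenue, release_date):
    """Build the log and date features named in the methodology for one film."""
    released = date.fromisoformat(release_date)
    return {
        "log_budget": math.log(budget),
        "log_revenue": math.log(revenue),
        "release_month": released.month,
        "release_year": released.year,
        "season": SEASON_ORDER[season_of(released.month)],
    }

# Toy input for illustration only.
features = engineer(10_000_000, 150_000_000, "2009-06-19")
```

Because the $100,000 filter removes zero budgets and revenues, a plain logarithm is safe here; a pipeline that kept zeros would need something like log(1 + x) instead.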
03

Supervised Learning — Regression Models

Trained Linear Regression, Ridge (CV α selection), Lasso (CV α selection), and Random Forest on 2,525 training samples. Target: log_revenue. Evaluated with R². Random Forest (R²=0.695) outperformed all linear models (R²=0.61), indicating non-linear relationships in the data.

LinearRegression · RidgeCV · LassoCV · RandomForestRegressor · Best R² = 0.695
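The model comparison can be sketched with scikit-learn as follows. The data is a synthetic stand-in with 9 features, and the hyperparameters (alpha grid, fold count, tree count, random seeds) are assumptions, not the settings used in the analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 9 features, partly non-linear target.
# This is not the film dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.3, size=500)

# 80/20 split as in the methodology; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),
    "lasso": LassoCV(cv=5, random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Impurity-based importances, one per feature, summing to 1.
importances = models["random_forest"].feature_importances_
```

The percentages quoted for vote_count and budget are presumably values of this kind. Impurity-based importances describe how the fitted forest splits the data, not a causal share of revenue.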
04

Unsupervised Learning — K-Means Clustering

Standardized the 5 clustering features (log_budget, log_revenue, popularity, vote_average, vote_count) with StandardScaler. The elbow curve supported K=4. PCA reduced the data to 2 dimensions for visualization (67.3% variance explained). Cluster profiles were interpreted with assistance from Ollama llama3.2.

KMeans · StandardScaler · PCA · Elbow Method · 4 Archetypes Found
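A minimal sketch of the clustering pipeline on synthetic data. The candidate K range, `n_init`, and random seeds are assumptions, and the five synthetic columns only stand in for the five clustering features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 5 clustering features. Not the film dataset.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(4, 5))
X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

# Rescale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Elbow curve: within-cluster sum of squares (inertia) for each candidate K.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

# Final clustering with the chosen K.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)

# Project to 2 components for plotting and report the variance retained.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
variance_explained = pca.explained_variance_ratio_.sum()
```

Plotting `inertias` against K gives the elbow curve, and `coords` colored by `labels` gives the 2D cluster map. Standardizing first matters because K-Means uses Euclidean distance, so an unscaled feature with a large range such as vote_count would otherwise dominate.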
05

Tech Stack & Infrastructure

Analysis run on a Linux PC (Intel i5-4590, NVIDIA GTX 1060 6GB) with GPU-accelerated Ollama for AI-assisted cluster interpretation. Python 3.12 data science stack in Jupyter notebooks. Website built in pure HTML/CSS/JS with inline Plotly charts.

Python 3.12 · scikit-learn · Plotly · Ollama llama3.2 · GTX 1060 GPU