Hollywood Formula — Technical Analysis

01 — Exploratory Analysis

WHAT THE DATA SHOWS

Distributions, correlations, and patterns across the TMDB/IMDB dataset: 4,803 films in total, 3,157 of them with budget and revenue both above $100K.

Mean Revenue by Release Month

June peaks at $205M · January ($59M) and September ($62M) are the low points · 3.5× gap between June and January

Revenue vs Budget — Log Scale

3,157 films with budget + revenue both > $100K · log-transformed for model fitting

Mean Revenue and ROI by Genre

Animation leads raw revenue ($302M) · Horror leads ROI (1,001%) · Action worst ROI at 188%

📅

Release Month Matters

June earns $205M on average versus January's $59M, a 3.5× difference in mean revenue. May–July is the golden window. September ($62M) is Hollywood's dumping ground for films studios don't believe in.
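The monthly comparison is a group-by-month average followed by a ratio of the best month to the worst. A minimal sketch, using a handful of hypothetical rows rather than the real dataset (the function name `mean_revenue_by_month` is an illustration, not taken from the project):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (release_month, revenue in USD). Not the real dataset.
films = [
    (6, 300_000_000),
    (6, 100_000_000),
    (1, 40_000_000),
    (1, 80_000_000),
    (9, 60_000_000),
]

def mean_revenue_by_month(rows):
    """Group revenues by release month and return {month: mean revenue}."""
    by_month = defaultdict(list)
    for month, revenue in rows:
        by_month[month].append(revenue)
    return {month: mean(values) for month, values in sorted(by_month.items())}

monthly = mean_revenue_by_month(films)

# Ratio of the strongest month to the weakest month (the "premium").
premium = max(monthly.values()) / min(monthly.values())
```

A mean is sensitive to a few very large releases, so a median per month would be a reasonable cross-check on the same grouping.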

🎬

The Horror ROI Paradox

Horror earns 1,001% average ROI — the best of any genre — because budgets are tiny and audiences are loyal. Action films dominate screens but deliver only 188% ROI. The massive budgets destroy the margins.
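The page does not spell out its ROI formula. Assuming the common definition, (revenue − budget) / budget × 100, the sketch below shows why a small budget inflates the percentage even when absolute profit is far lower. Both films are hypothetical.

```python
def roi_percent(budget, revenue):
    """Return ROI as a percentage: (revenue - budget) / budget * 100.

    This definition is an assumption; the source does not state its formula.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (revenue - budget) / budget * 100

# Hypothetical films, chosen only to illustrate the effect of budget size.
low_budget_horror = roi_percent(budget=2_000_000, revenue=22_000_000)
big_budget_action = roi_percent(budget=150_000_000, revenue=430_000_000)

# Absolute profit tells the opposite story from the percentage.
horror_profit = 22_000_000 - 2_000_000
action_profit = 430_000_000 - 150_000_000
```

Here the low-budget film has the far higher ROI while the big-budget film has the far higher profit, which is the trade-off the genre comparison describes.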

💬

Buzz Beats Budget

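vote_count accounts for 51% of feature importance in the revenue model, more than budget (22%), year (7%), and popularity (5%) combined. Films that generate conversation massively outperform those that don't. Note that votes accumulate largely after release, so vote_count reflects audience engagement rather than a signal available before a film opens.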

02 — Supervised Learning

PREDICTING BOX OFFICE REVENUE

Four regression models trained on 2,525 films (the training share of an 80/20 split over 3,157 films) to predict log-transformed worldwide revenue. Random Forest achieved the best R² at 0.695, against 0.610 for the three linear models.
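The 2,525 figure follows from the 3,157 cleaned films if the test set is rounded up, which is how scikit-learn's `train_test_split` sizes a fractional `test_size`. A quick arithmetic check, with `split_sizes` as a hypothetical helper:

```python
import math

def split_sizes(n_samples, test_fraction=0.2):
    """Return (n_train, n_test), rounding the test set up to a whole film."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(3157)
```

This gives 2,525 training films and 632 held-out films, matching the count quoted above.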

Linear Regression
R² Score: 0.610
Notes: Baseline model · log revenue target

Ridge Regression
R² Score: 0.610
Notes: L2 regularization · α via CV

Lasso Regression
R² Score: 0.610
Notes: L1 regularization · sparse solution

Random Forest
R² Score: 0.695
Advantage: Captures non-linear relationships · best feature importance

Feature Importance — What Predicts Revenue?

vote_count dominates at 51% · budget 22% · everything else minor

Model Comparison — R² Scores

Random Forest outperforms linear models · non-linearity matters in film economics
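For readers unfamiliar with the metric, R² compares a model's squared error with that of always predicting the mean. The sketch below computes it from scratch on made-up numbers; `r_squared` is an illustrative name, and in practice `sklearn.metrics.r2_score` does the same job.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up log-revenue values, for illustration only.
actual = [16.0, 17.0, 18.0, 19.0]

perfect = r_squared(actual, actual)            # every prediction exact
mean_only = r_squared(actual, [17.5] * 4)      # always predict the mean
```

A score of 0.695 therefore means the Random Forest removes roughly 70% of the squared error that a mean-only prediction of log revenue would leave.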

Lasso Feature Coefficients — Log Revenue

Positive = increases predicted revenue · Negative = reduces predicted revenue · on log scale
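Because the target is log revenue, a coefficient acts multiplicatively on revenue itself. Assuming a natural-log target (the page does not say which base was used), a coefficient b means a one-unit increase in the feature multiplies predicted revenue by exp(b). The helper name and the coefficient values below are hypothetical:

```python
import math

def percent_change_in_revenue(coefficient, delta=1.0):
    """Percent change in predicted revenue when a feature rises by `delta`,
    for a linear model fitted on natural-log revenue (an assumption)."""
    return (math.exp(coefficient * delta) - 1) * 100

# Hypothetical coefficients, not values read from the chart.
gain = percent_change_in_revenue(0.10)    # small positive coefficient
loss = percent_change_in_revenue(-0.10)   # small negative coefficient
```

For a feature that is itself log-transformed, such as log_budget, the coefficient reads as an elasticity instead: the approximate percent change in revenue per one percent change in budget.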

03 — Unsupervised Learning

4 MOVIE ARCHETYPES

K-Means clustering (K=4) on standardized features reveals four natural film profiles. The first two principal components explain 67.3% of the variance. One cluster, just six films, sits far apart from the other three.

K-Means Cluster Visualization (PCA 2D)

3,157 films mapped to 2 principal components · 67.3% variance explained · Cluster 3 (The Legend) isolated top-left

874 films · 27.7% of total

The Grassroots Hit

Budget: $7.9M avg
Revenue: $21.3M avg
ROI: 550%
Rating: 7.0/10

Low-budget films with strong word-of-mouth. Often comedies, foreign-language films, or niche genre entries that outperform their modest investment.

1,016 films · 32.2% of total

The Summer Blockbuster

Budget: $71.9M avg
Revenue: $277M avg
ROI: 435%
Popularity: 56 avg

High-budget tentpole films with massive marketing budgets and franchise potential. Avatar, Titanic, The Avengers. The industry's core commercial engine.

1,261 films · 39.9% of total

The Middle Child

Budget: $40.4M avg
Revenue: $71.1M avg
ROI: 126%
Rating: 6.0/10

The largest cluster. Mid-budget films with moderate performance — enough to make money, but not enough to define a studio's slate. Often sequels and franchise spinoffs past their peak.

6 films · 0.2% of total ⚡

The Legend

Budget: $430K avg
Revenue: $110M avg
ROI: 27,548%

Bambi, American Graffiti, Mad Max. Made for almost nothing, earned everything. These outliers cannot be manufactured — but they prove the ceiling is unlimited when conditions align perfectly.
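The ROI on each card is higher than the ROI implied by that card's average budget and average revenue. One plausible reading, which the page does not confirm, is that ROI was computed per film and then averaged, so a few extreme films pull the figure up. The check below uses only the numbers printed on the cards and assumes ROI = (revenue − budget) / budget × 100:

```python
def roi_from_averages(avg_budget, avg_revenue):
    """ROI (%) implied by a cluster's average budget and average revenue."""
    return (avg_revenue - avg_budget) / avg_budget * 100

# (average budget, average revenue, ROI printed on the card)
cards = {
    "The Grassroots Hit": (7.9e6, 21.3e6, 550),
    "The Summer Blockbuster": (71.9e6, 277e6, 435),
    "The Middle Child": (40.4e6, 71.1e6, 126),
    "The Legend": (430e3, 110e6, 27_548),
}

implied = {
    name: roi_from_averages(budget, revenue)
    for name, (budget, revenue, _reported) in cards.items()
}

for name, (_budget, _revenue, reported) in cards.items():
    print(f"{name}: reported {reported}%, implied by averages {implied[name]:.0f}%")
```

Either figure can be valid; they answer different questions (the typical film's return versus the cluster's pooled return), so it helps to know which one a card reports.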

Elbow Curve — Optimal K Selection

Inertia drops sharply through K=4, then flattens, supporting K=4 as the number of clusters
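An elbow curve is built by fitting K-Means for a range of K and recording the inertia (within-cluster sum of squares) each time. A minimal sketch with scikit-learn on synthetic blobs that stand in for the film features; the data and the range of K are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups, for illustration only.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Improvement gained by moving from k-1 to k clusters.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
```

The elbow is the K after which `drops` becomes small. Reading it is a judgment call, so a second criterion such as the silhouette score is often used alongside it.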

04 — Methodology

HOW WE DID THIS

A complete supervised and unsupervised learning pipeline applied to the TMDB/IMDB movies dataset.

Data Acquisition & Cleaning

Downloaded 4,803 films from Kaggle TMDB Movies Dataset (utkarshx27/movies-dataset). Filtered to 3,157 films with both budget and revenue above $100,000 to eliminate placeholder values. Extracted primary genre from JSON-encoded genre field.

Pandas · NumPy · TMDB Dataset · 3,157 Clean Records
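The cleaning step can be sketched as a threshold filter plus a JSON parse. The example uses plain Python rather than Pandas to stay self-contained, and it assumes the genre field is a JSON list of objects with a `name` key. The column names, the rows, and the helpers `primary_genre` and `clean` are all hypothetical.

```python
import json

MIN_DOLLARS = 100_000  # threshold stated in the text

def primary_genre(genres_json):
    """Return the first genre name from a JSON-encoded list, or None."""
    try:
        genres = json.loads(genres_json)
    except (TypeError, ValueError):
        return None
    if not isinstance(genres, list) or not genres:
        return None
    first = genres[0]
    return first.get("name") if isinstance(first, dict) else None

def clean(rows):
    """Keep rows whose budget and revenue both exceed MIN_DOLLARS."""
    kept = []
    for row in rows:
        if row["budget"] > MIN_DOLLARS and row["revenue"] > MIN_DOLLARS:
            kept.append({**row, "primary_genre": primary_genre(row["genres"])})
    return kept

# Hypothetical rows for illustration only.
raw = [
    {"title": "A", "budget": 5_000_000, "revenue": 40_000_000,
     "genres": '[{"id": 27, "name": "Horror"}, {"id": 53, "name": "Thriller"}]'},
    {"title": "B", "budget": 0, "revenue": 12_000_000, "genres": "[]"},
    {"title": "C", "budget": 200_000, "revenue": 90_000,
     "genres": '[{"id": 18, "name": "Drama"}]'},
]
cleaned = clean(raw)
```

Only film "A" survives: "B" has a placeholder budget of zero and "C" has revenue under the threshold.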

Feature Engineering & Encoding

Log-transformed both budget and revenue (log_budget, log_revenue) to normalize skewed distributions. Extracted release_month and release_year from date field. Encoded season (Spring/Summer/Fall/Winter) as ordinal feature. Label-encoded primary genre. Final feature set: 9 variables.

Log Transform · LabelEncoder · Date Extraction · 80/20 Train-Test Split · 9 Features
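A sketch of the per-film feature step, under several assumptions the page does not settle: natural logarithms, ISO-formatted release dates, meteorological seasons (March to May as Spring, and so on), and an arbitrary ordinal code for each season. The helper names are illustrative.

```python
import math
from datetime import date

# Assumed ordinal codes; the source does not give the actual ordering.
SEASON_CODE = {"Spring": 0, "Summer": 1, "Fall": 2, "Winter": 3}

def season_of(month):
    """Map a month number (1-12) to a meteorological season name."""
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    if month in (9, 10, 11):
        return "Fall"
    return "Winter"

def engineer(budget, revenue, release_date):
    """Build the log, date, and season features for one film."""
    released = date.fromisoformat(release_date)
    return {
        "log_budget": math.log(budget),
        "log_revenue": math.log(revenue),
        "release_month": released.month,
        "release_year": released.year,
        "season": SEASON_CODE[season_of(released.month)],
    }

# Hypothetical film: $1M budget, $10M revenue, released in December 2009.
features = engineer(1_000_000, 10_000_000, "2009-12-10")
```

Ordinal and label codes impose an ordering that tree models can work around but linear models take literally, so one-hot encoding is a common alternative for season and genre.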

Supervised Learning — Regression Models

Trained Linear Regression, Ridge (CV α selection), Lasso (CV α selection), and Random Forest on 2,525 training samples. Target: log_revenue. Evaluated with R². Random Forest (R²=0.695) outperformed all linear models (R²=0.61), indicating non-linear relationships in the data.

LinearRegression · RidgeCV · LassoCV · RandomForestRegressor · Best R² = 0.695
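The comparison can be sketched with the four scikit-learn estimators named above. The data here is synthetic, with a deliberately non-linear target, so the scores are not the project's; the alpha grid, tree count, and random seeds are assumptions as well.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a target with a squared term (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 3))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),  # alpha chosen by CV
    "lasso": LassoCV(cv=5, random_state=42),          # alpha chosen by CV
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Impurity-based importances; they sum to 1 across features.
importances = models["random_forest"].feature_importances_
```

On data like this the three linear models score almost identically and the forest pulls ahead, the same pattern as the 0.610 versus 0.695 result reported above.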

Unsupervised Learning — K-Means Clustering

StandardScaler normalization on 5 clustering features (log_budget, log_revenue, popularity, vote_average, vote_count). Elbow curve confirmed K=4. PCA reduced to 2 dimensions for visualization (67.3% variance explained). Cluster profiles interpreted with Ollama llama3.2 assistance.

KMeans · StandardScaler · PCA · Elbow Method · 4 Archetypes Found
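The clustering steps chain together as scale, cluster, then project for plotting. A minimal sketch on random data standing in for the five clustering columns; the column scales, seeds, and `n_init` are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for log_budget, log_revenue, popularity,
# vote_average, vote_count (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) * np.array([1.5, 2.0, 40.0, 1.0, 900.0])

# Standardize so no column dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full five-dimensional standardized space.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Project to two dimensions for the scatter plot only.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
variance_explained = pca.explained_variance_ratio_.sum()
```

Clustering happens before the projection, so the 2D plot is a view of the clusters rather than the space in which they were found. `variance_explained` corresponds to the 67.3% figure quoted above.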

Tech Stack & Infrastructure

Analysis run on a Linux PC (Intel i5-4590, NVIDIA GTX 1060 6GB) with GPU-accelerated Ollama for AI-assisted cluster interpretation. Python 3.12 data science stack in Jupyter notebooks. Website built in pure HTML/CSS/JS with inline Plotly charts.

Python 3.12 · scikit-learn · Plotly · Ollama llama3.2 · GTX 1060 GPU

WHAT MAKES A MOVIE MAKE MONEY?
