4,803 films. Machine learning regression and clustering reveal the real formula behind box office success, and why horror filmmakers figured it out decades ago.
Distributions, correlations, and patterns across the 4,803-film TMDB/IMDB dataset, of which 3,157 films have usable budget and revenue data.
Four regression models trained on 2,525 films (80/20 split) to predict log-transformed worldwide revenue. Random Forest achieved the best R² at 0.695.
K-Means clustering (K=4) on standardized features reveals four natural film profiles. PCA explains 67.3% of variance in 2 dimensions. One cluster defies all conventional wisdom.
A complete supervised and unsupervised learning pipeline applied to the TMDB/IMDB movies dataset.
Downloaded 4,803 films from the Kaggle TMDB Movies Dataset (utkarshx27/movies-dataset). Kept the 3,157 films with both budget and revenue above $100,000, which removes placeholder values. Extracted the primary genre from the JSON-encoded genre field.
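The filtering and genre extraction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names (budget, revenue, genres), the helper names, and the assumption that the genre field holds a JSON list of objects with a "name" key are all guesses about the dataset layout.

```python
import json

import pandas as pd

MIN_DOLLARS = 100_000  # threshold stated in the text


def primary_genre(raw):
    """Return the first genre name from a JSON-encoded list, or None.

    Assumes entries look like {"id": 27, "name": "Horror"}.
    """
    try:
        genres = json.loads(raw)
    except (TypeError, ValueError):
        return None  # missing or malformed field
    if isinstance(genres, list) and genres and isinstance(genres[0], dict):
        return genres[0].get("name")
    return None


def clean_films(df):
    """Keep rows whose budget and revenue both exceed the threshold."""
    mask = (df["budget"] > MIN_DOLLARS) & (df["revenue"] > MIN_DOLLARS)
    out = df.loc[mask].copy()
    out["primary_genre"] = out["genres"].apply(primary_genre)
    return out


# Toy rows (made up for illustration, not taken from the dataset).
toy = pd.DataFrame({
    "budget": [15_000_000, 0, 200_000],
    "revenue": [90_000_000, 5_000_000, 1],
    "genres": [
        '[{"id": 27, "name": "Horror"}]',
        '[{"id": 18, "name": "Drama"}]',
        "[]",
    ],
})
cleaned = clean_films(toy)
```

Only the first toy row survives: the second has a zero budget and the third a one-dollar revenue, the kind of placeholder values the threshold is meant to drop.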
Log-transformed both budget and revenue (log_budget, log_revenue) to normalize skewed distributions. Extracted release_month and release_year from date field. Encoded season (Spring/Summer/Fall/Winter) as ordinal feature. Label-encoded primary genre. Final feature set: 9 variables.
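One way these feature-engineering steps might look in pandas is shown below. Several details are assumptions rather than facts from the project: the natural log is used (the text does not say which base), the month-to-season mapping and the ordinal order of seasons are illustrative choices, the release_date and primary_genre column names are guessed, and pandas category codes stand in for a label encoder.

```python
import numpy as np
import pandas as pd

# Assumed month-to-season mapping (meteorological, Northern Hemisphere).
MONTH_TO_SEASON = {
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
    12: "Winter", 1: "Winter", 2: "Winter",
}
# Assumed ordinal order, following the order the seasons are listed in the text.
SEASON_ORDER = {"Spring": 0, "Summer": 1, "Fall": 2, "Winter": 3}


def add_features(df):
    """Add log, date, season, and genre-code columns to a copy of df."""
    out = df.copy()
    out["log_budget"] = np.log(out["budget"])
    out["log_revenue"] = np.log(out["revenue"])

    dates = pd.to_datetime(out["release_date"], errors="coerce")
    out["release_month"] = dates.dt.month
    out["release_year"] = dates.dt.year

    # Month -> season name -> ordinal code; unparseable dates become NaN.
    out["season"] = out["release_month"].map(MONTH_TO_SEASON).map(SEASON_ORDER)

    # Integer code per genre, assigned in alphabetical order of genre name.
    out["genre_code"] = out["primary_genre"].astype("category").cat.codes
    return out


# Toy rows (made up for illustration).
toy = pd.DataFrame({
    "budget": [1_000_000, 160_000_000],
    "revenue": [50_000_000, 800_000_000],
    "release_date": ["2004-10-29", "2010-07-16"],
    "primary_genre": ["Horror", "Action"],
})
featured = add_features(toy)
```

Because both budget and revenue were filtered to be above $100,000, a plain logarithm is safe here; with zeros in the data, something like np.log1p would be needed instead.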
Trained Linear Regression, Ridge (CV α selection), Lasso (CV α selection), and Random Forest on 2,525 training samples. Target: log_revenue. Evaluated with R². Random Forest (R²=0.695) outperformed all three linear models (R² of about 0.61), suggesting non-linear relationships in the data.
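The four-model comparison could be set up along these lines with scikit-learn. This is a sketch under stated assumptions: the α grid, the number of trees, the random seed, the scaling step for the linear models, and the synthetic data are all illustrative, and none of the project's actual hyperparameters are known from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, seed=42):
    """Fit four regressors on an 80/20 split and return test-set R² per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "Linear": make_pipeline(StandardScaler(), LinearRegression()),
        # RidgeCV and LassoCV pick alpha by cross-validation on the training set.
        "Ridge": make_pipeline(
            StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))
        ),
        "Lasso": make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=seed)),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data with one linear and one non-linear term.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=500)
scores = compare_models(X, y)
```

With test_size=0.2, a 3,157-row table leaves 2,525 rows for training, which matches the sample count quoted above.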
StandardScaler normalization on 5 clustering features (log_budget, log_revenue, popularity, vote_average, vote_count). Elbow curve confirmed K=4. PCA reduced to 2 dimensions for visualization (67.3% variance explained). Cluster profiles interpreted with Ollama llama3.2 assistance.
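The clustering and projection steps might be wired together as in the following sketch. The random data, the seed, n_init, and the range of K values tried for the elbow curve are assumptions for illustration; only the use of StandardScaler, K-Means with K=4, and a 2-component PCA comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_and_project(features, k=4, seed=42):
    """Standardize, compute an elbow curve, cluster with K-Means, project with PCA."""
    scaled = StandardScaler().fit_transform(features)

    # Inertia (within-cluster sum of squares) for each candidate K;
    # plotting these values against K gives the elbow curve.
    inertias = {
        n: KMeans(n_clusters=n, n_init=10, random_state=seed).fit(scaled).inertia_
        for n in range(2, 9)
    }

    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)

    # 2-D projection for plotting; the ratio sum is the variance explained.
    pca = PCA(n_components=2)
    coords = pca.fit_transform(scaled)
    explained = float(pca.explained_variance_ratio_.sum())
    return labels, coords, explained, inertias


# Random stand-in for the five clustering features.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 5))
labels, coords, explained, inertias = cluster_and_project(features)
```

Clustering runs on the full standardized feature space and PCA is used only to draw the result, so the 2-D plot can show clusters overlapping even when they are well separated in five dimensions.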
Analysis run on a Linux PC (Intel i5-4590, NVIDIA GTX 1060 6GB) with GPU-accelerated Ollama for AI-assisted cluster interpretation. Python 3.12 data science stack in Jupyter notebooks. Website built in pure HTML/CSS/JS with inline Plotly charts.