Data Science Methodology

How We Analyze Public Records with AI & Statistical Rigor

RecordsReveal uses a hybrid AI-human data science pipeline to investigate public government records. This page documents our technical methodology, statistical approaches, and AI architecture for transparency and reproducibility.

⚠️ For Technical Audiences Only

This page contains statistical formulas, Python code, and data science terminology. For plain-English explanations of our findings, see our investigations page.

System Architecture

Our investigation pipeline is a three-stage AI system that separates data analysis (local, free) from journalism and visualization (cloud, paid):

Pipeline Stages

AI Models

Avg Cost/Investigation: $0.04

Runtime: ~4min

Stage 1: Data Analysis (Ollama - Local Inference)

Model: qwen2.5-coder:7b (quantized, 4-bit)
Hardware: Remote Mac Mini (Apple Silicon M2, 16GB RAM)
Cost: $0.00 per investigation
Runtime: ~2 minutes

The data analysis stage runs on a local Ollama instance and performs:

Schema Profiling: Column types, null rates, unique value counts
Descriptive Statistics: Mean, median, mode, std dev, quartiles
Distribution Analysis: Skewness, kurtosis, histogram bins
Outlier Detection: IQR method, Z-score flagging (|z| > 3)
Correlation Analysis: Pearson r for numeric pairs
Temporal Patterns: Time series decomposition if datetime columns present
Categorical Analysis: Frequency tables, chi-square tests for independence
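The first few items in this list map onto a handful of pandas calls. The sketch below is illustrative only: the function names and the toy data are hypothetical, and it is not the project's actual Stage 1 code.

```python
import pandas as pd


def profile_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, null rate, and unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_rate": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })


def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, std dev, quartiles, skewness, and kurtosis per numeric column."""
    num = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": num.mean(),
        "median": num.median(),
        "std": num.std(),
        "q1": num.quantile(0.25),
        "q3": num.quantile(0.75),
        "skew": num.skew(),
        "kurtosis": num.kurt(),
    })


# Hypothetical toy data standing in for a public-records CSV
records = pd.DataFrame({
    "agency": ["Parks", "Parks", "Transit", None],
    "amount": [100.0, 250.0, 400.0, 250.0],
})
schema = profile_schema(records)
stats = describe_numeric(records)
print(schema)
print(stats)
```

Both helpers return one row per column, which keeps the summary compact enough to pass along as text to a later stage.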

Key Innovation: Rather than running only a fixed list of predefined analyses, the prompt gives the local model (qwen2.5-coder:7b, served by Ollama) latitude to explore patterns the data suggests.
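As a rough illustration of how a profile summary might be handed to the local model, the sketch below posts a prompt to Ollama's HTTP API using only the standard library. The prompt wording, the function names, and the assumption that Ollama listens on its default port are all hypothetical; the actual prompts are documented in PROMPT_CHAIN.md.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default local port
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(profile_text: str, model: str = "qwen2.5-coder:7b") -> dict:
    """Build a non-streaming generate request (prompt wording is illustrative)."""
    prompt = (
        "You are a data analyst. Given this dataset profile, describe any "
        "patterns worth investigating and the statistics that support them.\n\n"
        + profile_text
    )
    return {"model": model, "prompt": prompt, "stream": False}


def analyze(profile_text: str, timeout: float = 300.0) -> str:
    """Send the profile to the local model and return its text response."""
    body = json.dumps(build_payload(profile_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


payload = build_payload("amount: mean=250.0, median=250.0, null_rate=0.0")
print(payload["model"])
```

Calling `analyze(...)` requires a running Ollama instance with the model already pulled, so only the payload builder is exercised here.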

Stage 2: Journalism (Claude Sonnet 4.5 - API)

Model: claude-sonnet-4-5-20250929
Cost: $0.02-0.08 per investigation
Runtime: ~30 seconds

Claude receives Ollama's statistical analysis and writes:

Headline: Newsworthy summary with key statistics
Lede: 2-3 sentence context-setting opening
Findings: 3-7 insights with supporting evidence
Pull Quotes: One memorable statistic per finding
Stat Boxes: 3-5 KPIs for dashboard
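One way to picture the Stage 2 output is as a small data structure with the count limits listed above enforced in code. This is a hypothetical sketch: the class and field names are invented for illustration and may not match the JSON the pipeline actually produces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    insight: str
    evidence: str
    pull_quote: str  # one memorable statistic per finding


@dataclass
class Story:
    headline: str
    lede: str
    findings: List[Finding] = field(default_factory=list)
    stat_boxes: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the shape is valid."""
        issues = []
        if not 3 <= len(self.findings) <= 7:
            issues.append(f"expected 3-7 findings, got {len(self.findings)}")
        if not 3 <= len(self.stat_boxes) <= 5:
            issues.append(f"expected 3-5 stat boxes, got {len(self.stat_boxes)}")
        for number, finding in enumerate(self.findings, start=1):
            if not finding.pull_quote.strip():
                issues.append(f"finding {number} has no pull quote")
        return issues


# Hypothetical draft with too few findings
draft = Story(
    headline="Permit approvals rose 37.5% since 2019",
    lede="City permit data shows a steady climb in approvals.",
    findings=[
        Finding("Approvals rose", "1,204 permits in the latest year", "37.5% increase"),
        Finding("Growth was uneven", "Two districts account for most of it", ""),
    ],
    stat_boxes=["1,204 permits", "37.5% increase", "2019 baseline"],
)
issues = draft.problems()
print(issues)
```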

Stage 3: Visualization (Claude + Leonardo.ai)

Visualization: Claude Sonnet 4.5 designs Chart.js visualizations
Hero Image: Leonardo.ai Phoenix model generates editorial illustrations
Cost: $0.02-0.04 per investigation
Runtime: ~60 seconds

Technology Stack

Python 3.9+

Data processing, API orchestration, HTML generation

Pandas 2.x

CSV loading, data cleaning, statistical computation

Ollama

Local LLM inference for data analysis (qwen2.5-coder:7b)

Anthropic Claude

Cloud API for journalism writing and visualization design

Leonardo.ai

AI image generation for hero illustrations

Chart.js 4.x

Interactive data visualizations (client-side JavaScript)

Statistical Methods

Outlier Detection

We use two complementary methods for outlier flagging:

1. Interquartile Range (IQR) Method

Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1
Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR

Outlier if: x < Lower Bound OR x > Upper Bound

2. Z-Score Method

z = (x - μ) / σ

Outlier if: |z| > 3

Where μ is the mean and σ is the standard deviation. We report both methods and flag values exceeding either threshold.
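A minimal pandas sketch of the two rules follows. The function name and sample values are hypothetical, and it assumes σ is the population standard deviation (`ddof=0`); the text above does not say which estimator is used.

```python
import pandas as pd


def flag_outliers(values: pd.Series, iqr_k: float = 1.5, z_max: float = 3.0) -> pd.DataFrame:
    """Flag values that exceed the IQR fences or the z-score threshold."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_k * iqr, q3 + iqr_k * iqr
    iqr_flag = (values < lower) | (values > upper)

    sigma = values.std(ddof=0)  # assumption: population standard deviation
    if sigma > 0:
        z = (values - values.mean()) / sigma
    else:
        z = values * 0.0  # constant column: no value can be a z-score outlier
    z_flag = z.abs() > z_max

    return pd.DataFrame({
        "value": values,
        "z": z,
        "iqr_outlier": iqr_flag,
        "z_outlier": z_flag,
        "flagged": iqr_flag | z_flag,  # either method is enough to flag
    })


# Hypothetical payments: nineteen ordinary values and one extreme one
payments = pd.Series([48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
                      51, 50, 52, 48, 50, 49, 51, 50, 50, 500], dtype=float)
result = flag_outliers(payments)
print(result[result["flagged"]])
```

Keeping the per-method columns alongside the combined flag makes it possible to report which rule triggered each flag.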

Correlation Analysis

For each pair of numeric columns, we compute the Pearson correlation coefficient:

r = Σ[(xi - x̄)(yi - ȳ)] / √[Σ(xi - x̄)² × Σ(yi - ȳ)²]

Where:
- r ∈ [-1, 1]
- |r| > 0.7 = strong correlation
- 0.5 < |r| ≤ 0.7 = moderate correlation
- |r| < 0.3 = weak correlation

We report correlations with |r| > 0.5 as potentially newsworthy relationships.
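The reporting rule can be sketched with pandas' built-in Pearson correlation. The function name, column names, and values below are hypothetical.

```python
import pandas as pd


def notable_correlations(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """List numeric column pairs whose |r| exceeds the reporting threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:  # upper triangle only, skip self-pairs
            r = corr.loc[first, second]
            if pd.notna(r) and abs(r) > threshold:
                pairs.append({
                    "x": first,
                    "y": second,
                    "r": float(r),
                    "strength": "strong" if abs(r) > 0.7 else "moderate",
                })
    return pairs


# Hypothetical agency data: budget tracks staff exactly, complaints do not
agencies = pd.DataFrame({
    "staff": [1, 2, 3, 4, 5],
    "budget": [10, 20, 30, 40, 50],
    "complaints": [3, 1, 4, 1, 5],
})
pairs = notable_correlations(agencies)
print(pairs)
```

Here only the staff/budget pair clears the threshold; the complaints column correlates with the other two at roughly r = 0.35 and is left out.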

Statistical Significance

When comparing groups or testing relationships, we use:

Chi-square test: Categorical independence (α = 0.05)
T-test: Mean differences between two groups (α = 0.05)
ANOVA: Mean differences across 3+ groups (α = 0.05)

P-values below 0.05 are considered statistically significant. However, we prioritize effect size over p-values for editorial decisions (e.g., a 200% increase is newsworthy even if n is small).
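The three tests, plus an effect size to report next to the p-value, might look like the sketch below. It assumes SciPy is available (it is not listed in the technology stack above), uses Welch's unequal-variance t-test as one reasonable choice, and computes Cohen's d with a simple pooled standard deviation that suits equal-sized groups. All function names and data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05  # significance level stated in the methodology


def chi_square_independence(df: pd.DataFrame, a: str, b: str) -> dict:
    """Chi-square test of independence between two categorical columns."""
    table = pd.crosstab(df[a], df[b])
    chi2, p, dof, _expected = stats.chi2_contingency(table)
    return {"chi2": float(chi2), "p": float(p), "dof": int(dof),
            "significant": bool(p < ALPHA)}


def compare_two_means(x, y) -> dict:
    """Welch's t-test plus Cohen's d as an effect size."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    t, p = stats.ttest_ind(x, y, equal_var=False)
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    d = (x.mean() - y.mean()) / pooled_sd
    return {"t": float(t), "p": float(p), "cohens_d": float(d),
            "significant": bool(p < ALPHA)}


def compare_many_means(*groups) -> dict:
    """One-way ANOVA across three or more groups."""
    f, p = stats.f_oneway(*groups)
    return {"F": float(f), "p": float(p), "significant": bool(p < ALPHA)}


# Hypothetical data
survey = pd.DataFrame({
    "district": ["north"] * 10 + ["south"] * 10,
    "approved": ["yes"] * 10 + ["no"] * 10,
})
chi = chi_square_independence(survey, "district", "approved")
two = compare_two_means([10, 11, 12, 13, 14], [20, 21, 22, 23, 24])
many = compare_many_means([1, 2, 3], [2, 3, 4], [10, 11, 12])
print(chi, two, many, sep="\n")
```

Returning the effect size with the p-value keeps both numbers in view when deciding whether a difference is worth reporting.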

Data Sources & Ethics

Source Requirements

All datasets analyzed by RecordsReveal must meet these criteria:

Publicly Available: Government records, FOIA responses, or open datasets
Machine-Readable: CSV, JSON, or structured formats (no PDFs)
Documented: Clear data dictionary or column definitions
Verifiable: Source URL and download date recorded

Data Cleaning

Our cleaning pipeline:

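import pandas as pd

# 1. Load raw data
df = pd.read_csv('data.csv', low_memory=False)

# 2. Standardize column names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# 3. Parse dates. Converting every text column with errors='coerce' would
#    turn ordinary strings into NaT, so keep a conversion only when nearly
#    all non-null values parse (the 0.9 cutoff is an illustrative assumption).
text_cols = df.select_dtypes(include=['object']).columns
for col in text_cols:
    parsed = pd.to_datetime(df[col], errors='coerce')
    non_null = df[col].notna().sum()
    if non_null > 0 and parsed.notna().sum() >= 0.9 * non_null:
        df[col] = parsed

# 4. Handle missing values (flag, don't drop)
missing_report = df.isnull().sum()

# 5. Remove duplicates, recording how many rows were dropped
rows_before = len(df)
df = df.drop_duplicates()
duplicates_removed = rows_before - len(df)

# 6. Validate constraints (e.g., amounts > 0)
# Report violations, don't silently fix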

Transparency Rule: We report cleaning steps in investigation methodology sections.
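Step 6 above is left as a comment. One possible shape for it is sketched below: each rule returns a boolean mask, and the function counts violations without changing the data. The function name, rule name, and sample values are hypothetical.

```python
import pandas as pd


def report_violations(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Count rows failing each rule; the input frame is never modified.

    `rules` maps a rule name to a function that takes the frame and returns
    a boolean Series where True means the row is valid.
    """
    rows = []
    for name, rule in rules.items():
        valid = rule(df)
        bad_index = df.index[(~valid).to_numpy()]
        rows.append({
            "rule": name,
            "violations": len(bad_index),
            "example_rows": list(bad_index[:5]),  # first few offending row labels
        })
    return pd.DataFrame(rows)


# Hypothetical payments table with two non-positive amounts
payments = pd.DataFrame({"amount": [120.0, -5.0, 0.0, 300.0]})
report = report_violations(payments, {"amount_positive": lambda d: d["amount"] > 0})
print(report)
```

Note that a missing amount also fails `amount > 0`, so nulls show up as violations here as well as in the missing-value report.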

Privacy & Ethics

We do not analyze personally identifiable information (PII) beyond what's in public records
We do not merge datasets to deanonymize individuals
We do aggregate small groups (n < 10) to prevent identification
We do consult domain experts before publishing sensitive findings
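The small-group rule can be illustrated with a short pandas helper that folds any category with fewer than 10 rows into a combined bucket. The function name, bucket label, and data are hypothetical.

```python
import pandas as pd


def suppress_small_groups(df: pd.DataFrame, group_col: str, min_n: int = 10,
                          other_label: str = "Other (small groups)") -> pd.DataFrame:
    """Return a copy in which groups smaller than min_n share one label."""
    counts = df[group_col].value_counts()
    small = counts[counts < min_n].index
    out = df.copy()
    out[group_col] = out[group_col].where(~out[group_col].isin(small), other_label)
    return out


# Hypothetical staffing records: departments B and C are below the threshold
staff = pd.DataFrame({"department": ["A"] * 12 + ["B"] * 6 + ["C"] * 5})
safe = suppress_small_groups(staff, "department")
group_sizes = safe["department"].value_counts()
print(group_sizes)
```

If the combined bucket itself ends up below the threshold, it would still need to be withheld or merged further; this sketch does not handle that case.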

Reproducibility

Open Source Commitment

All RecordsReveal investigations include:

Raw Data: CSV download link on every investigation page
Code: Python scripts available on GitHub
Prompts: Exact AI prompts documented in PROMPT_CHAIN.md
Dependencies: requirements.txt with pinned versions

Running Investigations Locally

# 1. Clone repository
git clone https://github.com/recordsreveal/investigations.git
cd investigations

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set API keys
cp .env.example .env
# Add: ANTHROPIC_API_KEY, LEONARDO_API_KEY
# Stage 1 also needs a reachable Ollama instance serving qwen2.5-coder:7b

# 4. Run investigation
python3 investigate.py data/your_dataset.csv

# 5. Render HTML
python3 render_complete.py investigation_output/*.json

Validation & Quality Control

Statistical Review

Before publication, every investigation undergoes:

Automated Checks: Schema validation, null rate thresholds, outlier counts
Manual Review: Human verification of key statistics
Reproducibility Test: Re-run from raw data to confirm results
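The automated checks could take a form like the sketch below, which covers schema validation and null-rate thresholds. The function name, the 50% null-rate cutoff, and the expected column list are assumptions made for illustration; the actual thresholds are not stated on this page.

```python
import pandas as pd


def automated_checks(df: pd.DataFrame, expected_columns: list,
                     max_null_rate: float = 0.5) -> list:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []

    # Schema validation: every expected column must be present
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # Null-rate threshold (the cutoff is an assumed value)
    for col, rate in df.isna().mean().items():
        if rate > max_null_rate:
            problems.append(f"{col}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")

    return problems


# Hypothetical dataset with a missing column and a mostly empty one
dataset = pd.DataFrame({
    "agency": ["Parks", "Transit", "Parks", "Water"],
    "amount": [100.0, None, None, None],
})
problems = automated_checks(dataset, ["agency", "amount", "date"])
print(problems)
```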

AI Output Validation

Claude's journalism is validated for:

Factual Accuracy: Every statistic cited must appear in Ollama's analysis
No Hallucination: Cross-reference all numbers against source data
Causal Claims: Flag any "causes" or "leads to" language (correlation ≠ causation)
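A rough sketch of how the first and third checks could be automated with regular expressions follows. The function names, the list of causal phrases, and the sample texts are hypothetical, and the number matching is deliberately naive: a rounded figure (37.5 vs. 37.48) would be flagged for human review, not matched.

```python
import re

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")
CAUSAL = re.compile(r"\b(?:causes?|caused|leads? to|led to)\b", re.IGNORECASE)


def numbers_in(text: str) -> set:
    """Every numeric token in the text, with thousands separators removed."""
    return {float(match.group().replace(",", "")) for match in NUMBER.finditer(text)}


def validate_story(story_text: str, analysis_text: str) -> dict:
    """Flag numbers absent from the analysis and any causal phrasing."""
    unsupported = sorted(numbers_in(story_text) - numbers_in(analysis_text))
    causal = [match.group(0) for match in CAUSAL.finditer(story_text)]
    return {"unsupported_numbers": unsupported, "causal_phrases": causal}


# Hypothetical Stage 1 output and Stage 2 draft
analysis = "total_permits: 1204\npct_change: 37.5\nbaseline_year: 2019"
story = "Permits rose 37.5% to 1,204 since 2019, which led to 300 new jobs."
review = validate_story(story, analysis)
print(review)
```

Anything this returns is a prompt for the manual review step, not an automatic rejection.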

🔍 Spotted an Error?

We take data accuracy seriously. If you find a mistake in our analysis, please email corrections@recordsreveal.com with:

Investigation URL
Specific statistic or claim
Your calculation or source

We'll investigate and publish corrections prominently if warranted.

Known Limitations

What We Can Analyze

✅ Structured data (CSV, JSON)
✅ Datasets up to ~10M rows (hardware limit)
✅ Numeric, categorical, temporal, text columns

What We Cannot Analyze

❌ PDFs, scanned documents (OCR errors too high)
❌ Datasets > 10M rows (use sampling)
❌ Unstructured text (no NLP pipeline yet)
❌ Real-time data (batch processing only)
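For datasets above the row limit, one way to sample without loading the whole file is to read it in chunks and keep a random fraction of each. This is a sketch under assumptions: the function name, fraction, and chunk size are hypothetical, and per-chunk sampling gives an approximately uniform sample, not an exact one.

```python
import io

import pandas as pd


def sample_csv(source, frac: float = 0.1, chunksize: int = 500_000,
               seed: int = 0) -> pd.DataFrame:
    """Keep a random fraction of each chunk so the full file never sits in memory."""
    parts = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # A different seed per chunk keeps the sample reproducible but not repetitive
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


# Small in-memory stand-in for a large CSV file
demo = io.StringIO("id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(1000)))
sample = sample_csv(demo, frac=0.1, chunksize=100)
print(len(sample))
```

The same function accepts a file path in place of the in-memory buffer.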

AI Model Limitations

Ollama (qwen2.5-coder:7b): 4-bit quantization reduces precision slightly; may miss subtle patterns in very large datasets
Claude Sonnet 4.5: Occasionally generates overly sensational headlines, which we tone down manually
Leonardo.ai Phoenix: Image generation can be hit-or-miss; we regenerate if quality is poor

Cost & Sustainability

Component	Tool	Cost/Investigation
Data Analysis	Ollama (local)	$0.00
Journalism	Claude Sonnet 4.5	$0.02-0.08
Visualization	Claude Sonnet 4.5	$0.02-0.04
Hero Image	Leonardo.ai	$0.00*
TOTAL		$0.04-0.12

* Leonardo.ai offers a free tier with monthly credits

Revenue Model: RecordsReveal is funded by Google AdSense. We maintain editorial independence by never accepting payment for coverage or analysis angles. All investigations are chosen based on public interest and data availability.

Future Improvements

Planned enhancements to our methodology:

Causal Inference: Implement do-calculus for causal claims (Pearl framework)
NLP Pipeline: Add text mining for PDF reports and press releases
Geospatial Analysis: Map visualizations for location-based data
Time Series Forecasting: ARIMA/Prophet models for trend predictions
Peer Review: Submit select investigations to data journalism conferences

Questions?

This page is intended for data scientists, statisticians, and journalists interested in our technical approach. For general inquiries, see our About page.

Technical Contact: data@recordsreveal.com
Code Repository: github.com/recordsreveal
Prompt Documentation: See PROMPT_CHAIN.md in repo