Data Science Methodology

How We Analyze Public Records with AI & Statistical Rigor

RecordsReveal uses a hybrid AI-human data science pipeline to investigate public government records. This page documents our technical methodology, statistical approaches, and AI architecture for transparency and reproducibility.

⚠️ For Technical Audiences Only

This page contains statistical formulas, Python code, and data science terminology. For plain-English explanations of our findings, see our investigations page.

System Architecture

Our investigation pipeline is a two-stage AI system that separates data analysis (local, free) from journalism (cloud, paid):

Pipeline Stages: 2
AI Models: 3
Avg Cost/Investigation: $0.04
Runtime: ~4 min

Stage 1: Data Analysis (Ollama - Local Inference)

Model: qwen2.5-coder:7b (quantized, 4-bit)
Hardware: Remote Mac Mini (Apple Silicon M2, 16GB RAM)
Cost: $0.00 per investigation
Runtime: ~2 minutes

The data analysis stage runs on a local Ollama instance and performs:

  1. Schema Profiling: Column types, null rates, unique value counts
  2. Descriptive Statistics: Mean, median, mode, std dev, quartiles
  3. Distribution Analysis: Skewness, kurtosis, histogram bins
  4. Outlier Detection: IQR method, Z-score flagging (|z| > 3)
  5. Correlation Analysis: Pearson r for numeric pairs
  6. Temporal Patterns: Time series decomposition if datetime columns present
  7. Categorical Analysis: Frequency tables, chi-square tests for independence
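As an illustration of the first step in the list above, schema profiling can be sketched in a few lines. This is a minimal sketch that uses only the Python standard library so it runs without the pipeline's dependencies. The function names (`infer_type`, `profile`) and the three-way type guess are assumptions made for this example, not the production code.

```python
import csv
import io


def infer_type(values):
    """Guess 'int', 'float', or 'str' for a list of non-empty strings."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"


def profile(rows):
    """Report type, null rate, and unique count for each column."""
    report = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        raw = [row[col] for row in rows]
        present = [v for v in raw if v not in ("", None)]
        report[col] = {
            "type": infer_type(present) if present else "empty",
            "null_rate": 1 - len(present) / len(raw),
            "unique": len(set(present)),
        }
    return report


# Hypothetical three-row dataset with one missing amount
sample = "agency,amount\nParks,100\nParks,\nRoads,250.5\n"
rows = list(csv.DictReader(io.StringIO(sample)))
report = profile(rows)
print(report["amount"]["type"])  # float
```

In pandas, the same information is available from `df.dtypes`, `df.isnull().mean()`, and `df.nunique()`.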

Key Innovation: Rather than running only a fixed set of predefined analyses, the local model (served through Ollama) is free to explore whatever patterns the data suggests.
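For readers unfamiliar with Ollama, a request to a local instance might look like the sketch below. It targets Ollama's `/api/generate` HTTP endpoint on its default port. The host, the helper names, and the prompt text are assumptions for illustration; the actual prompts are documented in PROMPT_CHAIN.md.

```python
import json
import urllib.request


def build_ollama_request(prompt, host="http://localhost:11434",
                         model="qwen2.5-coder:7b"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_analysis(prompt, **kwargs):
    """Send the request and return the model's text reply."""
    request = build_ollama_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Hypothetical prompt; requires a running Ollama instance to actually send
request = build_ollama_request("Summarize notable patterns in this profile: ...")
```

Because the hardware is a remote machine, `host` would point at that machine rather than `localhost` in practice.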

Stage 2: Journalism (Claude Sonnet 4.5 - API)

Model: claude-sonnet-4-5-20250929
Cost: $0.02-0.08 per investigation
Runtime: ~30 seconds

Claude receives Ollama's statistical analysis and writes:

Stage 3: Visualization (Claude + Leonardo.ai)

Visualization: Claude Sonnet 4.5 designs Chart.js visualizations
Hero Image: Leonardo.ai Phoenix model generates editorial illustrations
Cost: $0.02-0.04 per investigation
Runtime: ~60 seconds

Technology Stack

Python 3.9+

Data processing, API orchestration, HTML generation

Pandas 2.x

CSV loading, data cleaning, statistical computation

Ollama

Local LLM inference for data analysis (qwen2.5-coder:7b)

Anthropic Claude

Cloud API for journalism writing and visualization design

Leonardo.ai

AI image generation for hero illustrations

Chart.js 4.x

Interactive data visualizations (client-side JavaScript)

Statistical Methods

Outlier Detection

We use two complementary methods for outlier flagging:

1. Interquartile Range (IQR) Method

Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1
Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR

Outlier if: x < Lower Bound OR x > Upper Bound
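The IQR rule above translates directly into code. This is a minimal sketch using the standard library. Quartile conventions differ between tools, and the choice of `method="inclusive"` is an assumption here; it should agree with the linear interpolation that pandas uses by default.

```python
import statistics


def iqr_bounds(values):
    """Return (lower, upper) fences at 1.5 x IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def iqr_outliers(values):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values)
    return [x for x in values if x < lower or x > upper]


# Hypothetical sample: Q1 = 11.0, Q3 = 13.25, IQR = 2.25
data = [10, 12, 11, 13, 12, 14, 11, 95]
print(iqr_bounds(data))    # (7.625, 16.625)
print(iqr_outliers(data))  # [95]
```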

2. Z-Score Method

z = (x - μ) / σ

Outlier if: |z| > 3

Where μ is the mean and σ is the standard deviation. We report both methods and flag values exceeding either threshold.
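The z-score rule can be sketched the same way. This example assumes the sample standard deviation (n - 1 in the denominator), which is also what pandas computes by default; the formula above does not say which variant is used.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose |z| exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Hypothetical sample: twenty identical values and one extreme value
data = [10] * 20 + [100]
print(zscore_outliers(data))  # [100]
```

With the sample standard deviation, |z| can never exceed (n - 1) / √n, so this rule cannot flag anything in a dataset of 10 or fewer rows. That is one reason to report it alongside the IQR method.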

Correlation Analysis

For numeric column pairs, we compute the Pearson correlation coefficient:

r = Σ[(xi - x̄)(yi - ȳ)] / √[Σ(xi - x̄)² × Σ(yi - ȳ)²]

Where:
- r ∈ [-1, 1]
- |r| > 0.7 = strong correlation
- 0.5 < |r| ≤ 0.7 = moderate correlation
- |r| < 0.3 = weak correlation

We report correlations with |r| > 0.5 as potentially newsworthy relationships.
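The formula above can be written out term by term. The sketch below is a direct transcription, not the pipeline's code. The `is_newsworthy` helper is a hypothetical name for the |r| > 0.5 reporting rule.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation, following the summation formula directly."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    if denominator == 0:
        return float("nan")  # a constant column has no defined correlation
    return numerator / denominator


def is_newsworthy(r, threshold=0.5):
    """Apply the reporting threshold on |r|."""
    return abs(r) > threshold


r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(r)                 # 1.0
print(is_newsworthy(r))  # True
```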

Statistical Significance

When comparing groups or testing relationships, we use:

P-values below 0.05 are considered statistically significant. However, we prioritize effect size over p-values for editorial decisions (e.g., a 200% increase is newsworthy even if n is small).
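The gap between effect size and p-value is easy to see on a small worked example. The sketch below uses an exact two-sided permutation test on the difference in means. That test is chosen purely for illustration and is an assumption, as are the function names and the data.

```python
from itertools import combinations


def percent_change(before, after):
    """Percent change in the mean from 'before' to 'after'."""
    base = sum(before) / len(before)
    return (sum(after) / len(after) - base) / base * 100


def exact_permutation_p(before, after):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(before) + list(after)
    k = len(before)
    rest = len(pooled) - k
    total = sum(pooled)
    observed = abs(sum(after) / rest - sum(before) / k)
    extreme = 0
    splits = 0
    # Enumerate every way of relabelling k of the pooled values as 'before'
    for indices in combinations(range(len(pooled)), k):
        group_sum = sum(pooled[i] for i in indices)
        diff = abs((total - group_sum) / rest - group_sum / k)
        splits += 1
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / splits


# Hypothetical counts for three periods before and after a policy change
before = [2, 3, 1]
after = [6, 7, 5]
print(percent_change(before, after))       # 200.0
print(exact_permutation_p(before, after))  # 0.1
```

Here the mean triples, yet with only three observations per group the smallest p-value this test can produce is 0.1, so the 0.05 threshold is out of reach regardless of the effect.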

Data Sources & Ethics

Source Requirements

All datasets analyzed by RecordsReveal must meet these criteria:

  1. Publicly Available: Government records, FOIA responses, or open datasets
  2. Machine-Readable: CSV, JSON, or structured formats (no PDFs)
  3. Documented: Clear data dictionary or column definitions
  4. Verifiable: Source URL and download date recorded
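The "Verifiable" requirement can be captured as a small provenance record. This is a sketch only. The `SourceRecord` class is hypothetical, and storing a SHA-256 checksum of the downloaded file is an added assumption that the criteria above do not mention. The URL and date are placeholders.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceRecord:
    source_url: str
    download_date: date
    sha256: str  # assumption: checksum is not part of the stated criteria


def record_source(raw_bytes, source_url, download_date):
    """Create a provenance record for a downloaded dataset."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    return SourceRecord(source_url, download_date, digest)


# Placeholder values for illustration
raw = b"agency,amount\nParks,100\n"
record = record_source(raw, "https://example.com/data.csv", date(2025, 1, 15))
print(record.source_url, record.download_date.isoformat())
```

A checksum makes it possible to confirm later that a re-downloaded file is byte-for-byte identical to the one originally analyzed.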

Data Cleaning

Our cleaning pipeline:

import pandas as pd

# 1. Load raw data
df = pd.read_csv('data.csv', low_memory=False)

# 2. Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# 3. Parse dates: convert a text column only when nearly all of its
#    non-null values parse, so ordinary string columns are not wiped to NaT
for col in df.select_dtypes(include=['object']).columns:
    parsed = pd.to_datetime(df[col], errors='coerce')
    non_null = df[col].notna().sum()
    if non_null and parsed.notna().sum() >= 0.9 * non_null:  # 0.9 is an assumed cutoff
        df[col] = parsed

# 4. Handle missing values (flag, don't drop)
missing_report = df.isnull().sum()

# 5. Remove duplicates
df = df.drop_duplicates()

# 6. Validate constraints (e.g., amounts > 0)
# Report violations, don't silently fix

Transparency Rule: We report cleaning steps in investigation methodology sections.
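Step 6 of the pipeline is described only in comments. One way it could be implemented is sketched below. The `find_violations` helper and the sample rows are hypothetical. The key property is that offending rows are reported, not modified.

```python
def find_violations(rows, column, is_valid):
    """Return (row_number, value) for every row whose value fails is_valid."""
    violations = []
    for number, row in enumerate(rows, start=1):
        value = row.get(column)
        if value is None:
            continue  # missing values are reported separately in step 4
        if not is_valid(value):
            violations.append((number, value))
    return violations


# Hypothetical rows; the constraint is "amounts > 0"
rows = [{"amount": 120.0}, {"amount": -5.0}, {"amount": None}, {"amount": 0.0}]
violations = find_violations(rows, "amount", lambda v: v > 0)
print(violations)  # [(2, -5.0), (4, 0.0)]
```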

Privacy & Ethics

Reproducibility

Open Source Commitment

All RecordsReveal investigations include:

Running Investigations Locally

# 1. Clone repository
git clone https://github.com/recordsreveal/investigations.git
cd investigations

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set API keys
cp .env.example .env
# Add: ANTHROPIC_API_KEY, LEONARDO_API_KEY

# 4. Run investigation
python3 investigate.py data/your_dataset.csv

# 5. Render HTML
python3 render_complete.py investigation_output/*.json

Validation & Quality Control

Statistical Review

Before publication, every investigation undergoes:

  1. Automated Checks: Schema validation, null rate thresholds, outlier counts
  2. Manual Review: Human verification of key statistics
  3. Reproducibility Test: Re-run from raw data to confirm results
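The reproducibility test in the list above amounts to comparing two sets of summary statistics. A minimal sketch, assuming each run produces a flat dictionary of named numeric results (the names and values here are made up):

```python
import math


def compare_runs(published, rerun, rel_tol=1e-9):
    """Return the names of statistics that are missing or differ between runs."""
    mismatches = []
    for name in sorted(set(published) | set(rerun)):
        if name not in published or name not in rerun:
            mismatches.append(name)
        elif not math.isclose(published[name], rerun[name], rel_tol=rel_tol):
            mismatches.append(name)
    return mismatches


# Hypothetical results from the original run and a re-run from raw data
published = {"mean_amount": 152.25, "outlier_count": 3}
rerun = {"mean_amount": 152.25, "outlier_count": 4}
print(compare_runs(published, rerun))  # ['outlier_count']
```

An empty list means the re-run confirmed every published figure within the tolerance.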

AI Output Validation

Claude's journalism is validated for:

🔍 Spotted an Error?

We take data accuracy seriously. If you find a mistake in our analysis, please email corrections@recordsreveal.com with:

We'll investigate and publish corrections prominently if warranted.

Known Limitations

What We Can Analyze

What We Cannot Analyze

AI Model Limitations

Cost & Sustainability

Component       Tool                Cost/Investigation
Data Analysis   Ollama (local)      $0.00
Journalism      Claude Sonnet 4.5   $0.02-0.08
Visualization   Claude Sonnet 4.5   $0.02-0.04
Hero Image      Leonardo.ai         $0.00*
TOTAL                               $0.04-0.12

* Leonardo.ai offers a free tier with monthly credits
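The total range is the sum of the per-component lows and highs. The short check below works in whole cents to avoid floating-point rounding; the dictionary simply restates the table.

```python
# (low, high) cost per investigation, in cents, taken from the table above
COMPONENT_COSTS_CENTS = {
    "Data Analysis": (0, 0),
    "Journalism": (2, 8),
    "Visualization": (2, 4),
    "Hero Image": (0, 0),  # covered by the free tier
}

total_low = sum(low for low, _high in COMPONENT_COSTS_CENTS.values())
total_high = sum(high for _low, high in COMPONENT_COSTS_CENTS.values())
total_label = f"${total_low / 100:.2f}-{total_high / 100:.2f}"
print(total_label)  # $0.04-0.12
```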

Revenue Model: RecordsReveal is funded by Google AdSense. We maintain editorial independence by never accepting payment for coverage or analysis angles. All investigations are chosen based on public interest and data availability.

Future Improvements

Planned enhancements to our methodology:

  1. Causal Inference: Implement do-calculus for causal claims (Pearl framework)
  2. NLP Pipeline: Add text mining for PDF reports and press releases
  3. Geospatial Analysis: Map visualizations for location-based data
  4. Time Series Forecasting: ARIMA/Prophet models for trend predictions
  5. Peer Review: Submit select investigations to data journalism conferences

Questions?

This page is intended for data scientists, statisticians, and journalists interested in our technical approach. For general inquiries, see our About page.

Technical Contact: data@recordsreveal.com
Code Repository: github.com/recordsreveal
Prompt Documentation: See PROMPT_CHAIN.md in repo