How We Analyze Public Records with AI & Statistical Rigor
RecordsReveal uses a hybrid AI-human data science pipeline to investigate public government records. This page documents our technical methodology, statistical approaches, and AI architecture for transparency and reproducibility.
This page contains statistical formulas, Python code, and data science terminology. For plain-English explanations of our findings, see our investigations page.
Our investigation pipeline is a two-stage AI system that separates data analysis (local, free) from journalism (cloud, paid):
Model: qwen2.5-coder:7b (quantized, 4-bit)
Hardware: Remote Mac Mini (Apple Silicon M2, 16GB RAM)
Cost: $0.00 per investigation
Runtime: ~2 minutes
The data analysis stage runs on a local Ollama instance and performs:
Key Innovation: Unlike rigid statistical pipelines, Ollama has creative freedom to explore patterns the data suggests, rather than forcing predefined analysis types.
Model: claude-sonnet-4-5-20250929
Cost: $0.02-0.08 per investigation
Runtime: ~30 seconds
Claude receives Ollama's statistical analysis and writes:
Visualization: Claude Sonnet 4.5 designs Chart.js visualizations
Hero Image: Leonardo.ai Phoenix model generates editorial illustrations
Cost: $0.02-0.04 per investigation
Runtime: ~60 seconds
Data processing, API orchestration, HTML generation
CSV loading, data cleaning, statistical computation
Local LLM inference for data analysis (qwen2.5-coder:7b)
Cloud API for journalism writing and visualization design
AI image generation for hero illustrations
Interactive data visualizations (client-side JavaScript)
We use two complementary methods for outlier flagging:
1. Interquartile Range (IQR) Method
Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1
Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Outlier if: x < Lower Bound OR x > Upper Bound
2. Z-Score Method
z = (x - μ) / σ
Outlier if: |z| > 3
Where μ is the mean and σ is the standard deviation. We report both methods and flag values exceeding either threshold.
For numeric column pairs, we compute Pearson correlation coefficient:
r = Σ[(xi - x̄)(yi - ȳ)] / √[Σ(xi - x̄)² × Σ(yi - ȳ)²]
Where:
- r ∈ [-1, 1]
- |r| > 0.7 = strong correlation
- |r| > 0.5 = moderate correlation
- |r| < 0.3 = weak correlation
We report correlations with |r| > 0.5 as potentially newsworthy relationships.
When comparing groups or testing relationships, we use:
P-values below 0.05 are considered statistically significant. However, we prioritize effect size over p-values for editorial decisions (e.g., a 200% increase is newsworthy even if n is small).
All datasets analyzed by RecordsReveal must meet these criteria:
Our cleaning pipeline:
# 1. Load raw data
df = pd.read_csv('data.csv', low_memory=False)
# 2. Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')
# 3. Parse dates
date_cols = df.select_dtypes(include=['object']).columns
for col in date_cols:
df[col] = pd.to_datetime(df[col], errors='coerce')
# 4. Handle missing values (flag, don't drop)
missing_report = df.isnull().sum()
# 5. Remove duplicates
df = df.drop_duplicates()
# 6. Validate constraints (e.g., amounts > 0)
# Report violations, don't silently fix
Transparency Rule: We report cleaning steps in investigation methodology sections.
All RecordsReveal investigations include:
PROMPT_CHAIN.mdrequirements.txt with pinned versions# 1. Clone repository
git clone https://github.com/recordsreveal/investigations.git
cd investigations
# 2. Install dependencies
pip install -r requirements.txt
# 3. Set API keys
cp .env.example .env
# Add: ANTHROPIC_API_KEY, LEONARDO_API_KEY
# 4. Run investigation
python3 investigate.py data/your_dataset.csv
# 5. Render HTML
python3 render_complete.py investigation_output/*.json
Before publication, every investigation undergoes:
Claude's journalism is validated for:
We take data accuracy seriously. If you find a mistake in our analysis, please email corrections@recordsreveal.com with:
We'll investigate and publish corrections prominently if warranted.
| Component | Tool | Cost/Investigation |
|---|---|---|
| Data Analysis | Ollama (local) | $0.00 |
| Journalism | Claude Sonnet 4.5 | $0.02-0.08 |
| Visualization | Claude Sonnet 4.5 | $0.02-0.04 |
| Hero Image | Leonardo.ai | $0.00* |
| TOTAL | $0.04-0.12 | |
* Leonardo.ai offers free tier with monthly credits
Revenue Model: RecordsReveal is funded by Google AdSense. We maintain editorial independence by never accepting payment for coverage or analysis angles. All investigations are chosen based on public interest and data availability.
Planned enhancements to our methodology:
This page is intended for data scientists, statisticians, and journalists interested in our technical approach. For general inquiries, see our About page.
Technical Contact: data@recordsreveal.com
Code Repository: github.com/recordsreveal
Prompt Documentation: See PROMPT_CHAIN.md in repo