Sensors, KPIs, Counterfactuals, Review Clocks
The Learning Imperative: Why Static Systems Fail
Institutional Senescence and Adaptive Decay
Senescence (biological aging) occurs when organisms lose capacity to adapt to environmental stress. Institutional senescence is the organizational analog: systems become rigid, fail to incorporate new information, and lose fitness as environments evolve.
Manifestation in disaster risk systems:
Example 1 – Hurricane forecasting (1950s-1970s): Early numerical weather prediction models produced 3-day track forecasts with ~500km average error, and evacuation planning assumed this error margin for decades. By 2020, 3-day errors had fallen to ~150km, a roughly threefold improvement. Yet many jurisdictions continued using outdated error assumptions and evacuated unnecessarily large areas, causing:
- Public fatigue (“we evacuated for nothing”)
- Economic losses (unnecessary business closures)
- Reduced compliance with future warnings
- Failure to leverage improved forecasts
Senescence mechanism: No feedback loop updating evacuation protocols as forecast skill improved. System designed once, ossified.
Example 2 – Building codes in seismic zones: Post-earthquake damage assessments reveal building failures. Engineers update understanding of ground motion, structural vulnerabilities, construction practices. But building codes lag by decades. Result: New construction continues using obsolete standards; preventable deaths in next earthquake.
Senescence mechanism: Code update cycles (10-20 years) far slower than knowledge accumulation. No forcing function for continuous improvement.
Example 3 – Social protection targeting: Cash transfer programs use a proxy means test (PMT) for poverty, developed from census data. But poverty drivers evolve (urbanization, climate impacts, conflicts, and pandemics change vulnerability patterns), so the PMT becomes progressively less accurate at identifying the truly vulnerable. Result: Growing exclusion errors; resources miss intended beneficiaries.
Senescence mechanism: No systematic recalibration. Initial design assumed perpetual validity.
The Core Claim: Trust Through Falsifiability
Karl Popper’s falsificationism: Scientific theories gain credibility not by being proven true (general statements cannot be proven) but by being falsifiable, that is, by making specific predictions that can be tested and could disprove the theory. Theories that survive repeated falsification attempts gain provisional trust.
Applied to disaster risk systems: Systems gain legitimacy not by asserting infallibility but by making explicit, testable predictions with transparent success/failure tracking and structured improvement cycles when predictions fail.
Psychological foundation (trust research):
- People trust systems that acknowledge uncertainty more than systems that claim certainty
- People trust systems that improve after failures more than systems that deny failures
- People trust systems with visible accountability more than systems with opaque performance
GCRI’s approach: Feedback sovereignty. Affected populations and oversight bodies have the right and the capability to:
- See assumptions: What system assumes about world
- Observe outcomes: What actually happened
- Challenge performance: Demand explanation when predictions fail
- Trigger review: Force systematic reassessment
- Verify improvement: Confirm changes actually improve performance
“Sovereignty” because feedback is not an optional afterthought: it is a structural requirement enforced through technical architecture and governance.
Mechanism I: Sensors and Observability Infrastructure
Multi-Scale Monitoring Architecture
Effective feedback requires observing reality across time and space scales. Single point measurements miss systemic patterns.
Temporal scales:
- Real-time (seconds to hours): Flash flood detection, earthquake shaking, disease cluster emergence
- Event-scale (days to weeks): Cyclone impacts, epidemic progression, drought evolution
- Seasonal (months): Crop yields, malnutrition rates, water availability
- Annual (years): Disaster loss trends, early warning coverage, adaptive capacity
- Decadal (decades): Climate change impacts, ecosystem shifts, demographic transitions
Spatial scales:
- Household/individual: Specific vulnerable persons, buildings, assets
- Community (neighborhood, village): Local exposure, social networks, collective coping
- Municipal (city, district): Service delivery, infrastructure functionality, governance capacity
- Provincial/state: Regional hazard patterns, coordination effectiveness, resource flows
- National: Aggregate outcomes, policy impacts, fiscal sustainability
- Regional/global: Cross-border risks, comparative performance, systemic risks
Sensor types:
1. Physical sensors (environmental monitoring):
Hydrometeorological:
- Rain gauges (tipping bucket, weighing, optical): 10-minute to daily accumulation
- River level gauges (pressure transducers, radar, ultrasonic): Continuous water level, discharge calculation
- Weather stations (temperature, humidity, wind, pressure, solar radiation): Hourly observations
- Snow depth sensors (ultrasonic, laser): Critical for snowmelt flood forecasting
- Soil moisture probes (capacitance, TDR): Agricultural drought monitoring
Geophysical:
- Seismometers: Earthquake detection, magnitude, location (strong motion + broadband)
- GPS/GNSS: Ground deformation for volcanic eruptions, landslides, subsidence
- Tiltmeters: Volcano monitoring (magma movement)
- Infrasound arrays: Volcanic eruptions, explosions, meteor airburst
Oceanic:
- Tide gauges: Storm surge, tsunami detection
- Buoys (moorings, drifters): Ocean temperature, wave height, currents
- DART (Deep-ocean Assessment and Reporting of Tsunamis): Open ocean tsunami detection
Atmospheric quality:
- PM2.5/PM10 monitors: Air pollution, wildfire smoke
- Gas sensors (CO, NO2, SO2, O3): Industrial accidents, air quality
- Radiation detectors: Nuclear accidents
Deployment strategy:
- National networks: Government meteorological/geological services operate core networks
- Community-based: Local organizations/volunteers operate supplementary low-cost sensors in underserved areas
- IoT/citizen science: Crowdsourced observations (smartphone barometers, home weather stations, traffic cameras)
- Integration: Combine official + community + crowd data with quality control and provenance tracking
Data quality assurance:
- Calibration: Sensors compared to reference standards annually; drift corrections applied
- Outlier detection: Automated algorithms flag implausible values (e.g., negative rainfall, temperature outside physical range)
- Spatial consistency: Compare neighboring sensors; isolated anomalies investigated
- Temporal consistency: Check for sensor failure modes (stuck values, drift, drop-out)
- Metadata: Document sensor type, installation date, maintenance history, known issues
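The outlier, spatial, and temporal checks above can be sketched as a single flagging pass over a rain gauge series. This is a minimal sketch, not a GCRI specification: the function name `qc_flags`, the thresholds (`max_plausible`, `spatial_tol`), and the stuck-value window (`stuck_run`) are all hypothetical choices made for illustration.

```python
from statistics import median

def qc_flags(values, neighbor_values, max_plausible=150.0, stuck_run=6, spatial_tol=50.0):
    """Return one set of QC flags per reading.

    values: readings from the sensor under test (mm per interval).
    neighbor_values: for each time step, a list of readings from nearby sensors.
    All thresholds are hypothetical defaults chosen for illustration.
    """
    flags = [set() for _ in values]
    for i, v in enumerate(values):
        # Outlier detection: physically implausible values (e.g., negative rainfall).
        if v < 0 or v > max_plausible:
            flags[i].add("implausible")
        # Spatial consistency: large departure from the neighborhood median.
        if neighbor_values[i] and abs(v - median(neighbor_values[i])) > spatial_tol:
            flags[i].add("spatial_anomaly")
    # Temporal consistency: a long run of identical non-zero values suggests a stuck sensor.
    run_start = 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start >= stuck_run and values[run_start] != 0:
                for j in range(run_start, i):
                    flags[j].add("stuck")
            run_start = i
    return flags

# Hypothetical readings: one negative value, a six-step stuck run, one isolated spike.
readings = [0.0, 2.5, -1.0, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 80.0]
neighbors = [[0.0, 0.1]] * 9 + [[1.0, 2.0, 0.5]]
flags = qc_flags(readings, neighbors)
```

Flagged readings would normally be kept alongside their flags rather than deleted, so that provenance is preserved and isolated anomalies can still be investigated.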
2. Satellite remote sensing (wide-area continuous monitoring):
Optical imagery:
- High-resolution (<5m): Building damage, flood extent, landslides (Planet, Maxar, Airbus)
- Medium-resolution (10-30m): Land use, vegetation health, agriculture (Sentinel-2, Landsat)
- Coarse-resolution (250m-1km): Daily global coverage, cloud monitoring (MODIS, VIIRS)
Synthetic Aperture Radar (SAR):
- All-weather (penetrates clouds), day/night imaging
- Flood mapping (water appears dark), ground deformation (InSAR interferometry)
- Sentinel-1, RADARSAT, ALOS PALSAR
Thermal infrared:
- Land surface temperature: Urban heat islands, drought stress, wildfire detection
- MODIS, Landsat thermal bands, ECOSTRESS
Passive microwave:
- Soil moisture (SMAP, SMOS), snow water equivalent, precipitation (GPM)
- All-weather, coarse resolution (10-50km)
Altimetry:
- River levels from space (Sentinel-3, Jason, SWOT)
- Lake levels, flood extent
Atmospheric sounding:
- Temperature/humidity profiles for weather forecasting
- Geostationary (GOES, Meteosat, Himawari) and polar orbiting (NOAA, MetOp)
Processing pipelines:
- Automated processing from Level 0 (raw) → Level 1 (calibrated) → Level 2 (geophysical parameters) → Level 3 (gridded products)
- Cloud-optimized formats (COG, Zarr) for efficient access
- STAC catalogs for discovery
- API access for automated systems
3. Impact sensors (observing disaster effects):
Infrastructure monitoring:
- Power grid: Outage maps (utilities, crowdsourced via smartphone connectivity)
- Transportation: Road closures, bridge damage (traffic cameras, GPS probes, reports)
- Communications: Network availability (cell tower functionality, internet connectivity)
- Water/sanitation: Service disruptions (treatment plant status, contamination alerts)
Health surveillance:
- Syndromic: Emergency department visits, pharmacy sales (early epidemic detection)
- Laboratory: Confirmed disease cases, pathogen genomic sequences
- Mortality: Excess deaths (compared to baseline, adjusted for seasonality)
- Nutrition: Acute malnutrition rates (MUAC screening in health facilities, community surveys)
Economic impacts:
- Business activity: Mobile money transactions, electricity consumption, nighttime lights
- Employment: Payroll data, unemployment claims, labor force participation
- Prices: Food commodity prices, inflation (especially essential goods)
- Financial stress: Bank withdrawals, loan defaults, credit demand
Social impacts:
- Displacement: Camp populations, border crossings, mobile phone mobility patterns
- Education: School closures, attendance, drop-out rates
- Protection: Gender-based violence incidence, child protection cases
- Social cohesion: Conflict events, hate speech (media monitoring), polarization indices
Human mobility and displacement:
- GPS traces (aggregated, anonymized) from mobile operators or apps (with consent)
- Social media geolocation (public posts only, privacy-respecting)
- Camp registration systems (UNHCR, IOM)
- Satellite detection of informal settlements
Privacy and ethics:
- Individual-level data never shared raw; only aggregated statistics with differential privacy
- Informed consent for any personal data collection
- Independent ethics review (IRB equivalent) for novel sensor deployments
- Community consultation before deploying sensors in sensitive areas
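One common way to release aggregated statistics with differential privacy is the Laplace mechanism. The sketch below applies it to a simple count (for example, displaced persons per district). The function names and the choice of epsilon are assumptions for illustration; the source does not specify which mechanism or privacy budget is used.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so the Laplace scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # seeded only to make the illustration repeatable
released = private_count(true_count=1000, epsilon=1.0, rng=rng)
```

Smaller values of epsilon give stronger privacy and noisier counts, so the budget has to be chosen with the smallest reporting unit in mind.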
4. Social sensors (human perception and experience):
Surveys and assessments:
- Rapid needs assessments: 24-72 hours post-disaster; identify urgent needs
- Multi-sectoral assessments: 1-2 weeks post-disaster; comprehensive across WASH, shelter, food, health
- Post-distribution monitoring: After aid delivery; verify assistance reached intended beneficiaries
- Household surveys: Periodic (quarterly, annually); track recovery, resilience, vulnerability trends
Participatory monitoring:
- Community scorecards: Communities rate service delivery (early warning, assistance, reconstruction)
- Focus group discussions: Qualitative insights on system performance, equity, appropriateness
- Photovoice: Affected populations document experiences via photography with narrative
- Participatory GIS: Communities map hazards, resources, vulnerabilities using local knowledge
Feedback hotlines:
- Toll-free phone lines (24/7) for questions, complaints, suggestions
- SMS-based feedback (lower barrier than calling)
- WhatsApp/messaging bots (conversational interface)
- Analysis: Text analytics identify common themes; sentiment analysis gauges satisfaction
Social media listening (with privacy safeguards):
- Public posts on X/Twitter, Facebook, Reddit, local platforms
- Natural language processing for disaster mentions, needs, misinformation
- Geographic clustering of concerns
- Ethics: Only public posts; no individual profiling; aggregated insights only
Traditional media monitoring:
- Newspapers, radio, TV coverage of disasters and response
- Identify narratives, perceived failures, public concerns
- Comparative analysis across outlets (state media vs independent)
Grievance and feedback mechanisms (Section 1.5):
- Formal complaints through accountability systems
- Structured data: Issue type, timeliness, resolution, satisfaction
- Trend analysis: Recurring issues signal systemic problems
Real-Time Dashboards and Observability
Observability: In software engineering, the ability to understand a system’s internal state from its external outputs. Applied to disaster risk: the ability to understand how the system is performing from sensor data and metrics.
Principles:
- Three pillars: Metrics (quantitative), Logs (events), Traces (causality)
- Aggregation: Summary statistics for executives; drill-down to details for operators
- Alerting: Automated notifications when metrics exceed thresholds
- Visualization: Maps, time series, distributions—appropriate chart types for data
Dashboard tiers:
Executive dashboard (national leadership, donors, board):
- High-level KPIs: Lives saved, population covered, protection latency, equity metrics
- Trend indicators: Improving/stable/degrading
- Financial: Budget utilization, cost per beneficiary, leverage ratios
- Update frequency: Daily to weekly
- Access: Public transparency portal (aggregated data)
Operational dashboard (emergency operations centers, NWG staff):
- Real-time hazard monitoring: Active forecasts, sensor readings, satellite imagery
- Response status: Actions taken, resources deployed, populations reached
- Situational awareness: Infrastructure status, access constraints, security incidents
- Coordination: Inter-agency activities, mutual aid requests, logistics tracking
- Update frequency: Continuous (seconds to minutes)
- Access: Authenticated users with operational roles
Technical dashboard (validators, model developers, data scientists):
- Model performance: Skill scores, bias metrics, ensemble spread
- Data quality: Sensor availability, latency, completeness
- System health: Computing resources, API response times, error rates
- Validation status: Pending reviews, signature coverage, dissent tracking
- Update frequency: Real-time for system health; daily for model performance
- Access: Technical staff and validators
Community dashboard (local leaders, affected populations):
- Local forecasts and alerts in plain language
- Evacuation routes and shelter locations (maps)
- Distribution schedules and locations
- Feedback channel access (submit reports, check grievance status)
- Update frequency: As needed during events; periodic otherwise
- Access: Public, mobile-optimized, offline-capable, multi-language
Implementation technologies:
- Time-series databases: InfluxDB, TimescaleDB (efficient storage/query of sensor data)
- Visualization: Grafana, Kibana, custom React/D3.js dashboards
- Alerting: Prometheus Alertmanager, PagerDuty (on-call notifications)
- Log aggregation: Elasticsearch, Loki (searchable event logs)
- Distributed tracing: Jaeger, Zipkin (understand causality in complex systems)
Mechanism II: Key Performance Indicators (KPIs) and Benchmarking
Outcome-Focused KPI Design
Shift from outputs to outcomes:
- Outputs: Activities completed (workshops held, systems installed, people trained)
- Outcomes: Changes in vulnerability, exposure, coping capacity, wellbeing
GCRI KPI framework: Structured hierarchy from strategic goals → operational metrics.
Strategic goal: Reduce disaster mortality, economic losses, and displacement through early warning and anticipatory action
Strategic KPIs (measure goal achievement):
SK1. Mortality reduction:
Disaster-attributable deaths per million population (annualized)
Target: Year-over-year reduction of ≥10%
Baseline: Historical 10-year average
Disaggregation: By hazard type, gender, age, geography, wealth quintile
SK2. Economic resilience:
Disaster losses as % of GDP (annualized)
Target: Maintain <0.5% in non-catastrophic years; <2% in 1-in-20 year events
Baseline: Historical average
Disaggregation: By sector (agriculture, infrastructure, housing, commercial)
SK3. Displacement prevention:
Person-months of displacement (internally displaced + refugees)
Target: Reduce by ≥20% compared to counterfactual
Measurement: Compare actual displacement to model prediction absent early action
Disaggregation: Cause (conflict vs disaster), duration, demographics
Operational goal: Deliver timely, accurate early warning with equity to populations at risk
Operational KPIs:
OK1. Early warning coverage:
% of at-risk population with access to timely early warning (≥24h lead time)
Target: ≥90% by 2026; 100% by 2030 (EW4All commitment)
Measurement: Surveys, registration data, network coverage maps
Disaggregation: Geography, demographics, disability, language
OK2. Forecast accuracy:
Probability of Detection (POD), False Alarm Ratio (FAR), Critical Success Index (CSI)
Target (varies by hazard):
- Riverine floods (7-day): POD ≥0.85, FAR ≤0.30, CSI ≥0.60
- Tropical cyclones (72h track): Error ≤150km
- Drought onset (90-day): Hit rate ≥0.70
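The three categorical scores can be computed from a 2x2 contingency table of forecasts against observations. In this sketch FAR is computed as the share of issued warnings that were not followed by an event (false alarms divided by hits plus false alarms). The function names and the event counts are hypothetical.

```python
def verification_scores(hits, misses, false_alarms):
    """Categorical forecast scores from a 2x2 contingency table."""
    pod = hits / (hits + misses)                  # Probability of Detection
    far = false_alarms / (hits + false_alarms)    # false alarms as a share of warnings
    csi = hits / (hits + misses + false_alarms)   # Critical Success Index
    return pod, far, csi

def meets_flood_target(pod, far, csi):
    """Riverine flood (7-day) targets: POD >= 0.85, FAR <= 0.30, CSI >= 0.60."""
    return pod >= 0.85 and far <= 0.30 and csi >= 0.60

# Hypothetical season: 35 floods correctly forecast, 5 missed, 12 false alarms.
pod, far, csi = verification_scores(hits=35, misses=5, false_alarms=12)
on_target = meets_flood_target(pod, far, csi)
```

Note that CSI ignores correct negatives, so it does not reward a system for the many days on which no flood was forecast and none occurred.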
OK3. Protection latency:
Median time from forecast to first protective action (hours)
Target: ≤48 hours for all hazards; ≤72 hours for vulnerable populations specifically
Measurement: Timestamp logs (forecast issued → playbook activated → assistance delivered)
Disaggregation: Hazard type, geography, beneficiary demographics
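A minimal sketch of the latency calculation, assuming the timestamp log has already been reduced to pairs of (forecast issued, first protective action). The event log below is hypothetical, and the 48-hour threshold is the all-hazards target quoted above.

```python
from datetime import datetime
from statistics import median

def latency_hours(forecast_issued, first_action):
    """Hours between forecast issuance and the first protective action."""
    return (first_action - forecast_issued).total_seconds() / 3600.0

# Hypothetical timestamp log: (forecast issued, first protective action).
events = [
    (datetime(2024, 6, 1, 6, 0), datetime(2024, 6, 2, 18, 0)),    # 36 h
    (datetime(2024, 7, 10, 0, 0), datetime(2024, 7, 12, 4, 0)),   # 52 h
    (datetime(2024, 8, 3, 12, 0), datetime(2024, 8, 4, 22, 0)),   # 34 h
]
median_latency = median(latency_hours(issued, acted) for issued, acted in events)
meets_target = median_latency <= 48
```

Using the median keeps a single slow activation from dominating the KPI, but the slowest cases are worth reporting separately because they often involve the hardest-to-reach populations.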
OK4. Equity metrics (from Section 1.7):
Reach ratios ≥1.0 for all vulnerable groups
Gini coefficient of risk declining year-over-year
Adequacy ratios ≥0.9 for assistance to all groups
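The Gini coefficient in the second metric can be computed with the standard rank-weighted formula. This sketch assumes one non-negative risk score per household or group; how that score is defined belongs to Section 1.7 and is not reproduced here, so the inputs below are purely hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means risk is spread evenly; values approaching 1 mean risk is
    concentrated in a few units.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted data (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

# Hypothetical risk scores for two years; a falling Gini meets the KPI.
gini_last_year = gini([1, 1, 2, 8])
gini_this_year = gini([2, 2, 3, 5])
declining = gini_this_year < gini_last_year
```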
Technical goal: Maintain reliable, secure, validated system infrastructure
Technical KPIs:
TK1. System availability:
Uptime % for critical services (forecasting, alerting, coordination platforms)
Target: ≥99.5% (≤43.8 hours downtime/year)
Measurement: Automated health checks, synthetic monitoring
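The downtime budget implied by an uptime target is simple arithmetic: 0.5% of the 8,760 hours in a non-leap year is about 43.8 hours. The sketch below also shows one assumed way to turn health-check results into an uptime figure; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year

def downtime_budget_hours(target_uptime):
    """Maximum annual downtime (hours) allowed by an uptime target such as 0.995."""
    return (1.0 - target_uptime) * HOURS_PER_YEAR

def uptime_from_checks(results):
    """Uptime fraction from a list of boolean health-check results."""
    return sum(results) / len(results)

budget = downtime_budget_hours(0.995)
# Hypothetical day of checks at one per minute, with 5 failed checks.
observed = uptime_from_checks([True] * 1435 + [False] * 5)
```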
TK2. Validation timeliness:
% of critical outputs validated (2-of-N signatures) within SLA
Target: ≥95% within 24h for routine; ≥95% within 4h for urgent
Measurement: NVM validation logs with timestamps
TK3. Data quality:
% of required sensor data available with <5% missing values
Target: ≥90% sensor availability; ≤5% data gaps
Measurement: Automated quality checks on sensor feeds
TK4. Cybersecurity posture:
Zero critical vulnerabilities unpatched beyond SLA
Zero successful intrusions leading to data breach
Target: 100% compliance with patch SLAs; zero breaches
Measurement: Vulnerability scans, security incident reports
Governance goal: Maintain transparent, accountable, participatory processes
Governance KPIs:
GK1. Transparency:
% of critical outputs published with full verification packages within 30 days
Target: 100%
Measurement: Transparency portal audit
GK2. Grievance responsiveness:
% of grievances acknowledged within 48h; resolved within SLA
Target: 100% acknowledged; ≥90% resolved within SLA
Measurement: Grievance mechanism database
GK3. Participatory governance:
% of validation nodes with representation from all 6 quintuple helix sectors
% of decisions with civil society/Indigenous participation
Target: 100% of countries have complete node representation; ≥80% major decisions include community consultation
Measurement: Node registry, decision logs
Benchmarking and Comparative Performance
Peer comparison: How does performance compare to similar contexts?
Peer groups (for meaningful comparison):
- Geography: Neighboring countries, shared hazard zones (Pacific islands, Sahel, Caribbean, etc.)
- Development level: Similar GDP per capita, HDI, governance indicators
- Hazard profile: Countries facing similar dominant risks (flood-prone, cyclone-exposed, earthquake zones)
Benchmarking reports (quarterly):
Country: Bangladesh
Peer group: South Asian countries with flood risk (India, Pakistan, Nepal)
Early Warning Coverage:
- Bangladesh: 87%
- Peer average: 72%
- Best in group: 94% (India)
- Assessment: Above average; gap to best practice: 7 percentage points
Protection Latency:
- Bangladesh: 38 hours
- Peer average: 56 hours
- Best in group: 28 hours (Nepal)
- Assessment: Strong performance; opportunity to learn from Nepal's playbook protocols
Equity (Reach Ratio for bottom quintile):
- Bangladesh: 1.12
- Peer average: 0.89
- Best in group: 1.18 (Pakistan)
- Assessment: Pro-equity; slight opportunity for improvement
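The comparisons in a report like the one above reduce to a peer average, a best performer, and a gap to best. In this sketch the function name is hypothetical, and the Pakistan and Nepal coverage values are invented placeholders chosen only so that the peer average matches the 72% quoted above; they are not reported figures.

```python
def benchmark(country_value, peers, higher_is_better=True):
    """Compare one country's KPI to its peer group.

    peers: mapping of peer name -> KPI value.
    Returns the peer average, the best peer, and the gap to that peer.
    """
    peer_average = sum(peers.values()) / len(peers)
    pick = max if higher_is_better else min
    best = pick(peers, key=peers.get)
    if higher_is_better:
        gap = peers[best] - country_value
    else:
        gap = country_value - peers[best]
    return {"peer_average": peer_average, "best": best, "gap_to_best": gap}

# Early warning coverage (%). India's 94 is from the report above;
# the other two peer values are hypothetical placeholders.
coverage = benchmark(87, {"India": 94, "Pakistan": 64, "Nepal": 58})
```

For metrics where lower is better, such as protection latency, pass `higher_is_better=False` so that the best peer is the minimum.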
Learning: Benchmarking identifies:
- Strengths to celebrate and share with others
- Gaps requiring improvement
- Peer models to learn from (specific practices to adopt)
- Innovation opportunities (where no peer excels; room for breakthrough)
Global dashboards: Public rankings (with country consent) create reputational incentives for improvement, since no government wants to sit at the bottom of a league table. But rankings must:
- Account for context (don’t penalize least developed countries for having fewer resources)
- Emphasize improvement trajectory, not just absolute performance
- Highlight exemplars across diverse contexts (best performer in Africa, best improver globally, most equitable system, etc.)
Confidence Tiers and Epistemic Humility
Not all KPIs have equal certainty: Some metrics are measured directly (sensor uptime); others are estimated with substantial uncertainty (lives saved relative to a counterfactual).
GCRI approach: Assign confidence tiers to KPIs, communicate uncertainty transparently.
Tier 1 – High confidence (direct measurement, minimal inference):
- System uptime (server logs)
- Validation timeliness (timestamp logs)
- Grievance response times (database records)
- Sensor coverage (inventory)
Tier 2 – Medium confidence (requires some inference/modeling):
- Forecast accuracy (observed events vs forecasts; but some events may not be fully observed)
- Protection latency (requires matching forecast timestamps to action reports; some actions undocumented)
- Early warning coverage (surveys have sampling error; self-reported access may differ from actual)
Tier 3 – Lower confidence (substantial counterfactual estimation):
- Lives saved (requires counterfactual model of what would have happened)
- Economic losses avoided (requires baseline loss model)
- Displacement prevented (requires migration model)
Reporting format:
Lives saved (2024): 12,400 [90% CI: 8,200 - 18,100] (Tier 3)
Methodology: Counterfactual model comparing actual mortality to baseline
vulnerability × hazard intensity, validated against historical events
Confidence: Lower (model uncertainty ±35%)
Sensitivity: Result sensitive to assumptions about baseline vulnerability;
best estimate uses conservative assumptions
Why transparency matters:
- Credibility: Acknowledging uncertainty builds trust more than false precision
- Accountability: Clear about what we know vs estimate
- Learning: Understanding uncertainty helps prioritize where to improve measurement
Mechanism III: Counterfactuals and Causal Inference
The Fundamental Problem of Causal Inference
Counterfactual question: What would have happened without the intervention?
Fundamental problem: We cannot observe both realities, because the same place and time cannot both receive and not receive the intervention. We observe one outcome and must estimate the other.
Naive approach – Before/after comparison:
Mortality before early warning: 500 deaths/year
Mortality after early warning: 100 deaths/year
Claimed impact: 400 lives saved
Problem: Many confounders (things that changed beyond intervention):
- Maybe hazard intensity decreased (fewer/weaker cyclones)
- Maybe overall development increased (better housing, healthcare)
- Maybe population moved away from high-risk areas
- Maybe other programs also contributed
Before/after comparison conflates:
- Intervention effect (what we want to measure)
- Time trends (secular improvement)
- Confounding factors (other concurrent changes)
Counterfactual Estimation Methods
1. Randomized Controlled Trials (RCTs) – Gold standard when ethical and feasible
Design:
- Randomly assign units (villages, districts, households) to treatment (receive intervention) or control (do not)
- Randomization makes treatment and control groups comparable in expectation, so the only systematic difference between them is the intervention
- Compare outcomes; difference = causal effect
Example – Forecast-based financing in Bangladesh:
Design: 60 sub-districts randomly assigned: 30 receive anticipatory cash transfers
triggered by GloFAS forecasts; 30 receive standard post-disaster response
Outcome: Food insecurity measured 6 months post-flood
Results:
- Treatment group: 18% severe food insecurity
- Control group: 28% severe food insecurity
- Difference: 10 percentage points (95% CI: 6-14pp)
- Interpretation: Anticipatory action reduced severe food insecurity by 36% relative to control
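The design and the headline arithmetic of this example can be sketched as follows. The unit names, the seed, and the function names are hypothetical; the proportions are the ones reported above. The confidence interval is not recomputed because the source does not give sample sizes.

```python
import random

def randomize(units, n_treatment, seed):
    """Randomly split units into treatment and control groups."""
    rng = random.Random(seed)
    treated = set(rng.sample(units, n_treatment))
    control = [u for u in units if u not in treated]
    return sorted(treated), control

def risk_difference(p_treatment, p_control):
    """Absolute (percentage-point) and relative reduction versus control."""
    absolute = p_control - p_treatment
    relative = absolute / p_control
    return absolute, relative

# 60 hypothetical sub-district identifiers, 30 assigned to treatment.
units = [f"subdistrict_{i:02d}" for i in range(60)]
treated, control = randomize(units, n_treatment=30, seed=2024)

# Severe food insecurity: 18% in treatment, 28% in control.
absolute, relative = risk_difference(p_treatment=0.18, p_control=0.28)
```

The absolute difference is 10 percentage points and the relative reduction is 10/28, or about 36%, which matches the interpretation given above.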
Limitations:
- Ethical concerns: Denying possibly life-saving intervention to control group
- Political constraints: Governments often must act everywhere, not selectively
- Externalities: Treatment may affect control (spillovers via migration, trade, information)
- Cost and time: RCTs expensive, take years to complete
When appropriate: Testing new approaches on margin (expanding coverage area); learning whether specific design features matter; measuring mechanisms.
2. Quasi-experimental designs – Exploit natural variation
Difference-in-differences (DiD):
Logic: Compare change over time between treated and untreated groups.
Assumption: Parallel trends—absent treatment, both groups would have followed same trajectory.
Formula:
Impact = (Y_treated,after - Y_treated,before) - (Y_control,after - Y_control,before)
Example – Early warning system in Colombia:
Setting: Early warning deployed in Pacific coast municipalities 2018
Comparison: Caribbean coast municipalities (similar exposure, no early warning until 2020)
Mortality trend (per 100k):
Region      2015  2016  2017  2018  2019  2020
Pacific      8.2   7.9   7.6   5.1   4.8   4.6   (early warning starts 2018)
Caribbean    8.4   8.1   7.8   7.5   7.3   4.9   (early warning starts 2020)
DiD calculation:
Pacific change (2017→2019): 4.8 - 7.6 = -2.8
Caribbean change (2017→2019): 7.3 - 7.8 = -0.5
Difference: -2.8 - (-0.5) = -2.3 deaths per 100k
Interpretation: Early warning reduced disaster mortality by ~2.3 per 100k (about 30% of the 2017 Pacific rate of 7.6)
Checks:
- Pre-trends parallel? (Yes, Pacific and Caribbean followed similar trends 2015-2017)
- No other confounders? (Check for concurrent programs, policy changes)
- Placebo tests: If we pretend intervention happened earlier (e.g., 2016), do we see effect? (Should be no)
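The DiD calculation and the placebo check can be reproduced from the mortality table in a few lines. The function name `did` is an illustrative choice; the data are the figures given above.

```python
# Disaster mortality per 100k, from the table above.
pacific = {2015: 8.2, 2016: 7.9, 2017: 7.6, 2018: 5.1, 2019: 4.8, 2020: 4.6}
caribbean = {2015: 8.4, 2016: 8.1, 2017: 7.8, 2018: 7.5, 2019: 7.3, 2020: 4.9}

def did(treated, control, before, after):
    """Difference-in-differences: change in treated minus change in control."""
    return (treated[after] - treated[before]) - (control[after] - control[before])

# Main estimate: last pre-treatment year (2017) vs. 2019, before the
# Caribbean coast also receives early warning in 2020.
effect = did(pacific, caribbean, before=2017, after=2019)    # about -2.3

# Placebo: pretend the intervention started in 2016. Both regions fell by
# 0.3 from 2015 to 2016, so the placebo estimate is about zero.
placebo = did(pacific, caribbean, before=2015, after=2016)

# Effect relative to the 2017 Pacific rate.
relative = effect / pacific[2017]                            # about -0.30
```

The comparison stops at 2019 on purpose: once the Caribbean coast is treated in 2020 it no longer serves as an untreated control.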
Regression discontinuity (RD):
Logic: Intervention has sharp threshold; compare units just above vs just below threshold.
Example – Flood insurance threshold:
Setting: Flood insurance subsidized for properties <5m above sea level
Question: Does insurance reduce flood damages?
Approach: Compare properties at 4.5-5.0m (insured) vs 5.0-5.5m (not insured)
Properties on both sides of threshold similar except for insurance
Results:
- Insured properties (4.5-5.0m): Average flood damage $8,200
- Uninsured properties (5.0-5.5m): Average flood damage $12,400
- RD estimate: Insurance reduces damage by $4,200 (34%)
Assumptions: No manipulation (people don't precisely choose elevation to get insurance)
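In its simplest form, the RD comparison is a difference in mean outcomes within a narrow bandwidth on each side of the cutoff. The sketch below uses that simplification; a fuller analysis would typically fit local regressions on each side. The function name and the property records are hypothetical and were chosen so the group means match the averages reported above.

```python
def rd_estimate(units, cutoff, bandwidth):
    """Mean outcome just below the cutoff minus mean outcome just above it.

    units: list of (running_variable, outcome) pairs.
    """
    below = [y for x, y in units if cutoff - bandwidth <= x < cutoff]
    above = [y for x, y in units if cutoff <= x <= cutoff + bandwidth]
    if not below or not above:
        raise ValueError("no units inside the bandwidth on one side of the cutoff")
    return sum(below) / len(below) - sum(above) / len(above)

# Hypothetical (elevation in m, flood damage in USD) records.
properties = [
    (3.0, 20000),                # far below the cutoff: excluded by the bandwidth
    (4.6, 8000), (4.9, 8400),    # insured side, inside the bandwidth
    (5.1, 12000), (5.4, 12800),  # uninsured side, inside the bandwidth
    (7.0, 5000),                 # far above the cutoff: excluded by the bandwidth
]
estimate = rd_estimate(properties, cutoff=5.0, bandwidth=0.5)
```

A narrower bandwidth makes the two groups more comparable but leaves fewer units, so results are usually reported for several bandwidths.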
Synthetic control method (SCM):
Logic: Create synthetic version of treated unit from weighted combination of untreated units. Post-intervention, gap between treated and synthetic = impact.
Example – Ethiopia Productive Safety Net Programme (PSNP) expansion:
Setting: PSNP expanded to Somali region in 2014 with shock-responsive features
Question: Impact on poverty and resilience?
Approach:
- Create "synthetic Somali" from weighted average of other Ethiopian regions
- Weights chosen so synthetic Somali matches actual Somali pre-2014 on:
* Poverty rates
* Rainfall patterns
* Livestock ownership
* Infrastructure access
Results:
Year        2010  2012  2014  2016  2018  2020
Actual       38%   36%   34%   28%   26%   24%   (poverty rate)
Synthetic    38%   36%   34%   32%   31%   30%
Difference in 2020: 6 percentage points
Interpretation: PSNP reduced poverty rate by ~6pp more than would have occurred otherwise
Advantages: Works when only one treated unit; flexible; visually intuitive.
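Once the weights are chosen, the synthetic control is a weighted average of donor series, and the impact estimate is the post-intervention gap. The sketch below takes the actual and synthetic poverty rates from the results above and shows the gap calculation; the donor series and weights in the helper's example are hypothetical, and the weight-fitting step (an optimization over pre-2014 characteristics) is not shown.

```python
def synthetic_series(donors, weights):
    """Weighted average of donor series; weights are non-negative and sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to 1")
    periods = len(donors[0])
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in range(periods)]

# Poverty rates (%) from the results above.
years = [2010, 2012, 2014, 2016, 2018, 2020]
actual = [38, 36, 34, 28, 26, 24]
synthetic = [38, 36, 34, 32, 31, 30]

# Gap between actual and synthetic: zero before 2014 if the fit is good,
# then -4, -5, and -6 percentage points in 2016, 2018, and 2020.
gaps = {y: a - s for y, a, s in zip(years, actual, synthetic)}
pre_fit_ok = all(gaps[y] == 0 for y in years if y <= 2014)

# Hypothetical two-donor example of how a synthetic series is built.
example = synthetic_series([[40, 38], [36, 34]], weights=[0.5, 0.5])
```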
3. Statistical matching – Create comparable groups post hoc
Propensity score matching (PSM):
Logic: Estimate probability (propensity) that unit receives treatment given observed characteristics. Match treated units to control units with similar propensity scores.
Example – Community-based early warning:
Setting: Some villages adopted community early warning (CBEW); others did not
Selection not random (villages with strong leadership more likely to adopt)
Approach:
1. Estimate propensity score: P(CBEW | village characteristics)
Characteristics: Leadership quality, education, distance to town, prior disaster experience
2. Match each CBEW village to non-CBEW village with similar propensity
3. Compare outcomes between matched pairs
Results:
- CBEW villages: 12% of population displaced during floods
- Matched control villages: 21% displaced
- PSM estimate: CBEW reduced displacement by 9 percentage points (43% relative reduction)
Limitation: Only controls for observed confounders. If unobserved factors drive both treatment and outcomes, the PSM estimate is biased.
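Steps 2 and 3 of the approach can be sketched as nearest-neighbour matching on propensity scores that have already been estimated (step 1, for example by logistic regression, is assumed and not shown). The function names, the caliper of 0.05, and the village records are hypothetical; the outcomes were chosen so the matched averages equal the 12% and 21% reported above.

```python
def match_nearest(treated, controls, caliper=0.05):
    """1:1 nearest-neighbour matching on propensity score, without replacement.

    treated, controls: lists of (unit_id, propensity_score, outcome).
    Returns (treated_unit, control_unit) pairs whose scores differ by at
    most `caliper`; treated units without a close match are dropped.
    """
    available = list(controls)
    pairs = []
    for t in sorted(treated, key=lambda u: u[1]):
        if not available:
            break
        c = min(available, key=lambda u: abs(u[1] - t[1]))
        if abs(c[1] - t[1]) <= caliper:
            pairs.append((t, c))
            available.remove(c)
    return pairs

def att(pairs):
    """Average treated-minus-control outcome difference over matched pairs."""
    return sum(t[2] - c[2] for t, c in pairs) / len(pairs)

# Hypothetical villages: (id, propensity score, share of population displaced).
cbew = [("A", 0.62, 0.10), ("B", 0.71, 0.14)]
non_cbew = [("X", 0.60, 0.20), ("Y", 0.73, 0.22), ("Z", 0.20, 0.30)]
pairs = match_nearest(cbew, non_cbew)
effect = att(pairs)
```

Village Z is never matched because its propensity score is far from any CBEW village, which is the point of matching: comparisons are restricted to villages that were similarly likely to adopt.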
Instrumental variables (IV):
Logic: Find variable (instrument) that affects treatment but doesn’t directly affect outcome.
Example – Forecast accuracy and mortality:
Question: Does more accurate forecasting reduce mortality?
Problem: Can't randomize forecast accuracy
Instrument: Distance to weather radar
- Closer to radar → better forecast accuracy (instrument affects treatment)
- Distance itself doesn't directly affect mortality (exclusion restriction)
Approach:
1. First stage: Forecast accuracy = f(distance to radar, controls)
2. Second stage: Mortality = f(predicted forecast accuracy, controls)
Result: 10 percentage point increase in forecast accuracy → 15% reduction in mortality
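With a single instrument and no controls, the two-stage procedure collapses to a ratio of covariances (the Wald estimator). The sketch below demonstrates it on synthetic data in which an unobserved confounder `u` biases the ordinary regression slope but not the IV estimate. All variable names and numbers are invented for illustration: `z` stands in for the instrument, `x` for the treatment, and `y` for the outcome, and the true effect is set to -1.5 by construction.

```python
def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def iv_estimate(instrument, treatment, outcome):
    """Single-instrument IV estimator: cov(Z, Y) / cov(Z, X)."""
    return covariance(instrument, outcome) / covariance(instrument, treatment)

def ols_slope(treatment, outcome):
    """Ordinary least squares slope of outcome on treatment, for comparison."""
    return covariance(treatment, outcome) / covariance(treatment, treatment)

# Synthetic data. The confounder u affects both x and y but is
# uncorrelated with the instrument z.
z = [1, 2, 3, 4, 5]
u = [1, -1, 0, -1, 1]
x = [zi + ui for zi, ui in zip(z, u)]               # treatment depends on z and u
y = [-1.5 * xi + 2 * ui for xi, ui in zip(x, u)]    # true effect of x on y is -1.5

iv = iv_estimate(z, x, y)    # recovers -1.5
ols = ols_slope(x, y)        # biased toward zero by the confounder
```

The estimate is only as credible as the exclusion restriction: if the instrument affects the outcome through any channel other than the treatment, the ratio no longer isolates the treatment effect.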
4. Model-based counterfactuals – Use simulation models
When no comparison group available, use calibrated models to simulate counterfactual.
Approach:
- Calibrate disaster loss model to historical events (before intervention)
- For recent event with intervention, model predicts losses given hazard intensity and baseline vulnerability (no intervention)
- Actual losses observed
- Difference = intervention impact estimate
Example – 2020 Cyclone Amphan (West Bengal):
Hazard: Category 3 equivalent cyclone (185 km/h winds)
Model prediction (baseline vulnerability, no early warning):
- Based on calibration to 1999 Super Cyclone (similar intensity, pre-early warning era)
- Predicted mortality: 480-620 deaths
Actual observed mortality: 86 deaths
Estimated lives saved: 394-534 (central estimate: 464)
Model uncertainty:
- Assumes baseline vulnerability unchanged (may overestimate if development reduced vulnerability)
- Assumes hazard intensity similar (wind speed estimates have ±15% uncertainty)
- Confidence interval: 250-700 lives saved (90% CI)
Validation: Test the model on historical events where the actual outcome is known. If the model predicts those accurately, confidence in its counterfactual estimates increases.
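The lives-saved range in the Amphan example follows directly from subtracting observed mortality from the model's predicted range. A minimal sketch, with a hypothetical function name:

```python
def lives_saved(predicted_low, predicted_high, observed):
    """Counterfactual minus observed mortality, as a range and a midpoint."""
    low = predicted_low - observed
    high = predicted_high - observed
    return low, high, (low + high) / 2

# Predicted 480-620 deaths without early warning; 86 deaths observed.
low, high, central = lives_saved(480, 620, 86)   # 394, 534, 464.0
```

This range reflects only the spread of the model prediction. The wider 250-700 interval quoted above also accounts for uncertainty in baseline vulnerability and hazard intensity.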
Causal Mechanism Analysis
Beyond “did it work?” ask “how did it work?”—understanding mechanisms enables replication and improvement.
Process tracing: Detailed case studies documenting causal chain.
Example – Why did early warning reduce mortality in Bangladesh but not in Haiti?
Bangladesh (successful mechanism):
- Forecast issued 72h ahead →
- District officials activated playbook (pre-authorized) →
- Community volunteers conducted door-to-door notification →
- Cyclone shelters opened, evacuation transport provided →
- 2M people evacuated →
- Cyclone struck; shelters protected population →
- Result: 26 deaths (vs 500+ predicted)
Haiti (broken mechanism):
- Forecast issued 48h ahead →
- ❌ No clear authority to activate response (government dysfunction) →
- ❌ Alerts issued via radio/SMS but low trust in government (history of broken promises) →
- ❌ No designated shelters or organized evacuation →
- ❌ Most people did not evacuate despite warning →
- Hurricane struck →
- Result: 546 deaths
Lesson: Early warning necessary but insufficient. Requires: Functional governance, public trust, pre-positioned resources, rehearsed protocols. GCRI designs systems addressing full causal chain, not just forecast dissemination.
Mediation analysis: Quantify how much of effect operates through specific pathways.
Example – How does anticipatory cash reduce food insecurity?
Potential mechanisms:
- Income smoothing: Cash allows purchasing food during shock
- Asset protection: Cash prevents distress sale of livestock/tools
- Migration prevention: Cash allows staying in place, maintaining livelihoods
Mediation analysis:
Total effect of cash on food insecurity: -12 percentage points
Decomposition:
- Via income smoothing: -5pp (42% of effect)
- Via asset protection: -4pp (33% of effect)
- Via migration prevention: -2pp (17% of effect)
- Direct effect (residual): -1pp (8% of effect)
Implication: Optimize design by targeting mechanisms. If asset protection is key pathway, ensure cash arrives before distress sales begin.
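The decomposition above can be checked with a few lines of arithmetic. The sketch below only reproduces the shares from the stated pathway effects; it does not estimate them. It assumes the pathways are additive (no interactions between mediators), which is what allows the shares to sum to 100%. The function name `pathway_shares` is introduced here for illustration.

```python
# Pathway effects in percentage points, copied from the decomposition above.
pathway_effects_pp = {
    "income smoothing": -5.0,
    "asset protection": -4.0,
    "migration prevention": -2.0,
    "direct (residual)": -1.0,
}
total_effect_pp = -12.0


def pathway_shares(effects: dict[str, float], total: float) -> dict[str, float]:
    """Return each pathway's share of the total effect.

    Raises if the pathway effects do not add up to the total, which
    would mean the additive decomposition is inconsistent.
    """
    if abs(sum(effects.values()) - total) > 1e-9:
        raise ValueError("pathway effects do not sum to the total effect")
    return {name: value / total for name, value in effects.items()}


shares = pathway_shares(pathway_effects_pp, total_effect_pp)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Running the consistency check before reporting shares guards against a decomposition whose parts were estimated separately and no longer match the total.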
Mechanism IV: Sunset Clauses and Renewal Clocks
Designed Obsolescence as Governance Tool
Problem: Systems persist indefinitely through inertia, even when circumstances change or performance deteriorates. No forcing function for review.
Solution: Sunset clauses—systems automatically expire unless explicitly renewed after demonstrating continued value.
Rationale:
- Forces periodic reassessment (instead of assuming perpetual relevance)
- Shifts burden of proof (must justify continuation, not justify termination)
- Creates opportunity for redesign (incorporate learning, adapt to changed context)
- Prevents zombie programs (operational but ineffective)
Review Clock Architecture
Three clock types with different cadences:
1. Safety review clock (annual) – Technical performance
Trigger: Automatic at 12-month intervals from deployment
Scope: Model performance, data quality, system reliability
Review questions:
- Are forecast skill scores maintaining target thresholds?
- Has model drift been detected? (Performance degrading over time)
- Are data sources still available and reliable?
- Have any security incidents occurred?
- Are validators satisfied with output quality?
Evidence required:
- Verification statistics over past year: probability of detection (POD), false alarm ratio (FAR), critical success index (CSI)
- Comparison to previous year (improving/stable/degrading?)
- Documented model recalibrations or updates
- Security audit results
- Validator assessments and any dissents
Decision outcomes:
- Renew (green): Performance satisfactory; continue operations
- Conditional renewal (yellow): Performance acceptable but trends concerning; corrective action plan required; re-review in 6 months
- Suspend (red): Performance below standards; system suspended until issues resolved
- Sunset (black): Fundamental approach flawed; system terminated; redesign needed
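The verification statistics named in the evidence list follow from a forecast contingency table. A minimal sketch is shown below, using the standard definitions of POD, FAR (as false alarm ratio) and CSI. The year-on-year `trend` helper and its 0.02 tolerance are assumptions added for illustration; the text does not specify how "improving/stable/degrading" is decided.

```python
import math
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the score is undefined."""
    return numerator / denominator if denominator else math.nan


@dataclass(frozen=True)
class ContingencyTable:
    hits: int          # event forecast and observed
    misses: int        # event observed but not forecast
    false_alarms: int  # event forecast but not observed

    def pod(self) -> float:
        """Probability of detection: share of observed events that were forecast."""
        return _ratio(self.hits, self.hits + self.misses)

    def far(self) -> float:
        """False alarm ratio: share of forecast events that did not occur."""
        return _ratio(self.false_alarms, self.hits + self.false_alarms)

    def csi(self) -> float:
        """Critical success index: hits over all forecast or observed events."""
        return _ratio(self.hits, self.hits + self.misses + self.false_alarms)


def trend(current: float, previous: float, tolerance: float = 0.02,
          higher_is_better: bool = True) -> str:
    """Label a year-on-year change; the tolerance is a hypothetical value."""
    delta = current - previous
    if not higher_is_better:
        delta = -delta
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "degrading"
    return "stable"
```

For FAR, lower is better, so the comparison would be called with `higher_is_better=False`.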
2. Legitimacy review clock (biennial) – Rights and equity
Trigger: Every 24 months
Scope: Rights protections, equity outcomes, community satisfaction, grievance patterns
Review questions:
- Are reach ratios maintained >1.0 for vulnerable groups?
- Are protection latency gaps closed?
- Is grievance mechanism functioning (response times, resolution rates)?
- Do affected populations trust system? (Survey data)
- Any rights violations or FPIC breaches?
Evidence required:
- Disaggregated outcome data (Section 1.7 metrics)
- Equity trend analysis
- Grievance statistics and systemic issues identified
- Community satisfaction surveys
- Independent human rights audit
Participants:
- Civil society/Indigenous validation nodes (lead)
- Community representatives
- Human rights organizations
- Disability rights advocates
- Women’s groups
Decision outcomes:
- Renew: Equity standards met; rights protections effective
- Conditional: Minor equity gaps; improvement plan within 6 months
- Major overhaul: Significant equity failures; operations suspended until redesign addresses structural issues
3. Value review clock (triennial) – Strategic fit and impact
Trigger: Every 36 months
Scope: Does system still address priority needs? Are resources well-allocated? Is approach still optimal given evolving context?
Review questions:
- Has risk landscape changed? (Climate change, urbanization, conflict)
- Are interventions still targeting highest-impact opportunities?
- Do benefits justify costs? (Cost-effectiveness analysis)
- Are alternative approaches now available that would be more effective?
- Does system align with national/regional strategies?
Evidence required:
- Cost-benefit analysis ($ per life saved, $ per person covered)
- Comparative effectiveness (vs alternative approaches)
- Stakeholder value assessments (government, communities, donors)
- Strategic alignment review
- Theory of change validation (do our assumptions still hold?)
Participants:
- Government (national planning, finance ministries)
- Donors and investors
- Continental Steward Node
- Academic institutions (independent evaluation)
- Community representatives
Decision outcomes:
- Renew: System delivering value; continue with current design
- Evolve: Fundamental approach sound but needs adaptation to changed context; managed evolution
- Pivot: Original theory of change no longer valid; major redesign needed
- Phase out: Problem solved, system no longer needed, or resources better allocated elsewhere
Default to Sunset (Not Default to Continuation)
Critical design choice: What happens if review doesn’t occur on schedule?
Traditional approach: System continues operating (default to continuation)
- Problem: Review can be delayed indefinitely; creates institutional inertia
GCRI approach: System automatically suspends if review not completed (default to sunset)
- NVM enforces: As renewal deadline approaches, alerts sent at 90, 60, 30 days
- If renewal not approved by deadline, system enters grace period (30 days) with degraded status
- If still not renewed after grace period, system automatically suspends (no new outputs; existing operations phase down)
Why this matters: Creates institutional urgency for review. Cannot ignore deadlines. Forces explicit decisions about continuation.
Exception: True emergency during review window. Temporary extension (max 60 days) with documented justification and mandatory expedited review.
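The default-to-sunset rule can be read as a small state machine driven by the renewal deadline. The sketch below is one possible reading, not the NVM implementation: the names `ClockStatus`, `alert_due` and `clock_status` are hypothetical, and it assumes that an emergency extension shifts the deadline (with the 30-day grace period following the extended date), which the text does not state explicitly.

```python
from datetime import date, timedelta
from enum import Enum

ALERT_DAYS_BEFORE = (90, 60, 30)  # alert lead times from the text
GRACE_PERIOD_DAYS = 30            # grace period from the text
MAX_EXTENSION_DAYS = 60           # emergency extension cap from the text


class ClockStatus(Enum):
    ACTIVE = "active"
    DEGRADED = "grace period, degraded status"
    SUSPENDED = "suspended"


def alert_due(today: date, deadline: date) -> bool:
    """True on the days when a renewal alert should be sent."""
    return (deadline - today).days in ALERT_DAYS_BEFORE


def clock_status(today: date, deadline: date, renewed: bool = False,
                 extension_days: int = 0) -> ClockStatus:
    """Status of a system whose renewal was due on `deadline`."""
    if not 0 <= extension_days <= MAX_EXTENSION_DAYS:
        raise ValueError("extension must be between 0 and 60 days")
    if renewed:
        return ClockStatus.ACTIVE
    # Assumption: an extension moves the deadline; grace follows it.
    effective = deadline + timedelta(days=extension_days)
    if today <= effective:
        return ClockStatus.ACTIVE
    if today <= effective + timedelta(days=GRACE_PERIOD_DAYS):
        return ClockStatus.DEGRADED
    return ClockStatus.SUSPENDED
```

The key design point is that suspension needs no human action: absent a recorded renewal, the passage of time alone moves the system from active to degraded to suspended.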
Participatory Review Process
Who participates in renewal reviews?
Safety review: Technical validators (academia, industry, government technical staff)
Legitimacy review: Civil society, Indigenous representatives, community members, human rights advocates
Value review: Government decision-makers, donors, academic evaluators, community representatives, Continental Steward
Process design:
- Self-assessment (system operators prepare report against review criteria)
- Independent evaluation (external reviewers analyze evidence, conduct site visits, interview stakeholders)
- Public comment period (30 days; anyone can submit input)
- Stakeholder deliberation (review committee considers all evidence and input)
- Decision and documentation (publish decision with rationale; if conditional/suspended, publish corrective action plan)
- Appeals process (decisions can be appealed to Continental Steward within 30 days)
Transparency: All review materials public (except sensitive security details); decisions published; dissenting opinions included.
Mechanism V: Management Letters and Calibration Discipline
The Management Letter Concept
Origin: Financial auditing—auditors issue “management letter” identifying control weaknesses, risks, and recommendations (separate from formal audit opinion).
Adaptation to disaster risk systems: When model recalibration, methodology change, or operational adjustment needed, formal management letter documents:
- Issue identified: What problem/deficiency detected
- Evidence: Data showing performance degradation, bias, or failure mode
- Root cause analysis: Why issue occurred
- Recommended action: Specific changes to address issue
- Implementation plan: Who, what, when (with target dates)
- Verification: How will we confirm fix worked
Triggers for management letter:
- Model performance below thresholds (forecast skill declining)
- Systematic bias detected (underpredicting risk for specific locations/populations)
- Equity metrics degrading (reach ratios falling for vulnerable groups)
- Grievance patterns indicating systemic issue (recurring complaints about same problem)
- Security incident revealing vulnerability
- Validator dissents raising common concerns
Example – Management Letter for Flood Forecast Model:
Management Letter 2024-Q3-001
Issue: Systematic underestimation of flood peaks in upper Indus basin
Evidence:
- Last 6 flood events: Observed peaks averaged 18% higher than forecast
- Bias particularly pronounced for snowmelt-driven floods (vs rainfall-driven)
- False negative rate 28% (vs target 15%)
Root Cause:
- Snow water equivalent (SWE) estimates from MODIS satellite have negative bias in high-elevation areas (>4000m) due to cloud contamination
- Model calibration period (2010-2020) didn't include recent extreme snowfall years
Recommended Action:
1. Integrate SNODAS snow data product (higher accuracy) as supplementary SWE input
2. Recalibrate model using extended period (2005-2023) including recent extremes
3. Implement ensemble approach (combine MODIS + SNODAS) with uncertainty quantification
Implementation Plan:
- Data integration: Complete by 2024-10-30 (Data Engineering team)
- Recalibration: Complete by 2024-11-15 (Hydrology team)
- Validation testing: 2024-11-16 to 2024-11-30 (using hindcast on 2023 events)
- Deployment: 2024-12-01 (pending validator approval)
Verification:
- Monitor forecast skill for 2024-25 winter season
- Target: Bias within ±10%, false negative rate <15%
- Review 2025-03-30; confirm improvement or escalate
Approved by:
[Academia validator signature]
[Government validator signature]
Published: 2024-09-15 (public transparency portal)
Why public:
- Accountability: Organizations cannot hide performance problems
- Learning: Other jurisdictions facing similar issues can adopt solutions
- Trust: Public sees problems identified and addressed, increasing confidence
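The evidence and verification targets in the example letter (peak bias, false negative rate) can be computed in a few lines. The sketch below is illustrative only. It assumes a sign convention in which bias is (forecast - observed) / forecast, so that observed peaks 18% above forecast give a bias of -0.18; the letter does not state its convention. The function names are hypothetical.

```python
from statistics import mean


def relative_bias(forecast_peaks: list[float], observed_peaks: list[float]) -> float:
    """Mean signed error relative to the forecast (negative = underestimation).

    Assumed convention: (forecast - observed) / forecast per event.
    """
    if not forecast_peaks or len(forecast_peaks) != len(observed_peaks):
        raise ValueError("need equal-length, non-empty peak series")
    return mean([(f - o) / f for f, o in zip(forecast_peaks, observed_peaks)])


def false_negative_rate(missed_events: int, observed_events: int) -> float:
    """Share of observed flood events the forecast failed to flag."""
    return missed_events / observed_events


def meets_targets(bias: float, fn_rate: float,
                  bias_limit: float = 0.10, fn_limit: float = 0.15) -> bool:
    """Targets from the letter: bias within ±10%, false negatives below 15%."""
    return abs(bias) < bias_limit and fn_rate < fn_limit
```

With the letter's reported figures (bias of about -0.18, false negative rate 0.28), `meets_targets` returns False, which is the condition that prompted the letter.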
Recalibration Discipline
Recalibration: Updating model parameters or structure based on new data.
Discipline: Structured process (not ad hoc tweaking) with documentation and oversight.
When to recalibrate:
1. Scheduled (annual baseline):
- Even if performance acceptable, refit models annually with latest data
- Climate non-stationarity means historical patterns shift
- Protects against slow drift
2. Performance-triggered (skill thresholds breached):
- If skill scores fall below targets for 2 consecutive seasons → mandatory recalibration
- If systematic bias detected (>10% for 3+ events) → immediate recalibration
3. Event-triggered (unprecedented event):
- When event occurs outside historical calibration range (e.g., 1-in-100-year event)
- Major learning opportunity; incorporate into training data
- May reveal model structural deficiencies
4. Methodological advancement (better approaches available):
- Research yields improved algorithms, better satellite products, enhanced understanding
- Opportunity to adopt frontier techniques
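The performance-triggered rules above (item 2) are mechanical enough to sketch in code. The version below is a hedged illustration: the function name and return strings are hypothetical, and it assumes that "systematic" bias means three or more events exceeding the 10% limit in the same direction, which the text leaves open.

```python
from typing import Optional, Sequence


def recalibration_trigger(season_skill: Sequence[float], skill_target: float,
                          event_biases: Sequence[float],
                          bias_limit: float = 0.10,
                          min_biased_events: int = 3,
                          consecutive_seasons: int = 2) -> Optional[str]:
    """Return the triggered action, or None if no threshold is breached.

    season_skill: skill scores per season, oldest first.
    event_biases: signed relative bias per event (e.g. -0.18 for an 18% underestimate).
    """
    # Assumption: bias counts as systematic only when it is one-sided.
    over = sum(1 for b in event_biases if b > bias_limit)
    under = sum(1 for b in event_biases if b < -bias_limit)
    if max(over, under) >= min_biased_events:
        return "immediate recalibration: systematic bias"

    recent = list(season_skill)[-consecutive_seasons:]
    if len(recent) == consecutive_seasons and all(s < skill_target for s in recent):
        return "mandatory recalibration: skill below target"

    return None
```

Checking bias first reflects the text's ordering of urgency ("immediate" versus "mandatory"); scheduled, event-triggered and methodological recalibrations would sit outside this function.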
Recalibration protocol:
Phase 1 – Analysis (weeks 1-2):
- Identify performance degradation or new data availability
- Analyze root causes (data issues, model structure, parameter drift, non-stationarity)
- Evaluate whether recalibration is likely to help or whether redesign is needed
Phase 2 – Recalibration (weeks 3-6):
- Update training data
- Refit model parameters (or adopt new model structure)
- Tune hyperparameters via cross-validation
- Generate new uncertainty estimates
Phase 3 – Validation (weeks 7-8):
- Test recalibrated model on held-out data (hindcast recent events)
- Compare to old model (is new version actually better?)
- Check for unintended consequences (improved overall skill but worse equity?)
- Prepare safety case update
Phase 4 – Approval (weeks 9-10):
- Submit to validators with comparison report
- Validators review methodology, test results, equity impacts
- 2-of-N signatures required for operational deployment
- If concerns, iterate or escalate to Continental Steward
Phase 5 – Deployment (week 11):
- Canary deployment (10% of traffic to test in production)
- Monitor closely for 7 days
- If stable, full deployment
- Old model maintained for 30 days as rollback option
Phase 6 – Monitoring (ongoing):
- Track performance of recalibrated model
- Compare to pre-recalibration baseline
- Publish results in quarterly performance report
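Two steps of the protocol above lend themselves to a short sketch: the 2-of-N validator approval in Phase 4 and the 10% canary split in Phase 5. The code below is an assumption-laden illustration, not the deployed mechanism: hashing a request identifier into 100 buckets is one common way to get a stable canary split, and `routes_to_canary` and `approved` are hypothetical names.

```python
import hashlib

CANARY_PERCENT = 10       # share of traffic sent to the recalibrated model
REQUIRED_SIGNATURES = 2   # the "2" in 2-of-N


def routes_to_canary(request_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically assign a request to the canary model.

    The same request_id always lands in the same bucket, so a given
    user or location sees a consistent model during the 7-day watch.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def approved(signatures: set[str], validators: set[str],
             required: int = REQUIRED_SIGNATURES) -> bool:
    """True when enough recognized validators have signed off."""
    return len(signatures & validators) >= required
```

Signatures from parties outside the validator set are ignored rather than rejected, so a stray signature cannot block an otherwise valid approval.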
Documentation requirements:
- Change log (what changed, why, by whom, when)
- Reproducibility package (code, data, configuration for new version)
- Performance comparison report
- Validator approval with signatures
- Public announcement (transparency portal)
Avoiding Recalibration Traps
Over-fitting: Recalibrating too aggressively on recent events; model becomes overly specific to recent patterns, loses generalization.
Solution: Cross-validation; hold out recent period; ensure performance on unseen data.
Premature adjustment: Reacting to a single poor forecast; mistaking noise for signal.
Solution: Require multiple failures before triggering recalibration; statistical significance tests.
Complexity creep: Each recalibration adds parameters/features; model becomes unwieldy and opaque.
Solution: Parsimony principle; prefer simpler models; regularly simplify (prune unused features).
Loss of institutional memory: Staff turnover means rationale for model design forgotten; recalibrations made without understanding history.
Solution: Documentation discipline; every recalibration includes “theory of change” explaining model logic; onboarding includes model history review.
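The safeguard against premature adjustment (require multiple failures and a significance test) could be sketched with an exact one-sided binomial test. This is one possible test, chosen here for illustration; the text does not name a specific method, and the 0.05 significance level and function names are assumptions.

```python
from math import comb


def binomial_tail(misses: int, events: int, target_rate: float) -> float:
    """P(X >= misses) when X ~ Binomial(events, target_rate).

    This is the chance of seeing at least this many misses if the
    model were actually performing at its target miss rate.
    """
    return sum(
        comb(events, k) * target_rate**k * (1 - target_rate) ** (events - k)
        for k in range(misses, events + 1)
    )


def recalibration_warranted(misses: int, events: int, target_rate: float,
                            alpha: float = 0.05) -> bool:
    """True when the miss count is unlikely under the target rate."""
    return binomial_tail(misses, events, target_rate) < alpha


# One miss in one event is consistent with a 15% target miss rate.
print(recalibration_warranted(1, 1, 0.15))  # False
# Three misses in three events is not.
print(recalibration_warranted(3, 3, 0.15))  # True
```

With small event counts the test has little power, so it complements rather than replaces the fixed rules such as "3+ events" used for performance-triggered recalibration.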
Summary: Feedback as Power
Design Principle IV asserts that systems earn trust through structured learning, not asserted authority.
The mechanisms:
- Sensors and observability → Make reality visible across scales and domains
- KPIs with confidence tiers → Quantify performance; acknowledge uncertainty
- Counterfactuals and causal inference → Distinguish correlation from causation; estimate impact honestly
- Sunset clauses and review clocks → Force periodic reassessment; prevent institutional senescence
- Management letters and recalibration → Respond to feedback systematically; improve continuously
Leadership question: Can citizens, overseers, and investors see what we assumed, observe what occurred, understand what we learned, and verify that we adapted?
If yes: Feedback sovereignty is real. System is falsifiable, therefore credible.
If no: System is black box. Might be performing well, but indistinguishable from failing system that hides problems. Without feedback sovereignty, trust is impossible.