1.8 Design Principle IV — Feedback Sovereignty

Sensors, KPIs, Counterfactuals, Review Clocks

The Learning Imperative: Why Static Systems Fail

Institutional Senescence and Adaptive Decay

Senescence (biological aging) occurs when organisms lose capacity to adapt to environmental stress. Institutional senescence is the organizational analog: systems become rigid, fail to incorporate new information, and lose fitness as environments evolve.

Manifestation in disaster risk systems:

Example 1 – Hurricane forecasting: Early numerical weather prediction models (1950s-1970s) produced 3-day track forecasts with ~500km average error, and evacuation planning was built around this error margin for decades. By 2020, 3-day errors had fallen to ~150km, a more than threefold improvement. Yet many jurisdictions continued using the outdated error assumptions and evacuated unnecessarily large areas, causing:

Public fatigue (“we evacuated for nothing”)
Economic losses (unnecessary business closures)
Reduced compliance with future warnings
Failure to leverage improved forecasts

Senescence mechanism: No feedback loop updating evacuation protocols as forecast skill improved. System designed once, ossified.

Example 2 – Building codes in seismic zones: Post-earthquake damage assessments reveal building failures. Engineers update understanding of ground motion, structural vulnerabilities, construction practices. But building codes lag by decades. Result: New construction continues using obsolete standards; preventable deaths in next earthquake.

Senescence mechanism: Code update cycles (10-20 years) far slower than knowledge accumulation. No forcing function for continuous improvement.

Example 3 – Social protection targeting: Cash transfer programs use a proxy means test (PMT) developed from census data. But poverty drivers evolve (urbanization, climate impacts, conflicts, and pandemics change vulnerability patterns). The PMT becomes progressively less accurate at identifying the truly vulnerable. Result: Growing exclusion errors; resources miss intended beneficiaries.

Senescence mechanism: No systematic recalibration. Initial design assumed perpetual validity.

The Core Claim: Trust Through Falsifiability

Karl Popper’s falsificationism: Scientific theories gain credibility not by being proven true (impossible to prove general statements) but by being falsifiable (making specific predictions that can be tested, potentially disproving theory). Theories surviving repeated falsification attempts gain provisional trust.

Applied to disaster risk systems: Systems gain legitimacy not by asserting infallibility but by making explicit, testable predictions with transparent success/failure tracking and structured improvement cycles when predictions fail.

Psychological foundation (trust research):

People trust systems that acknowledge uncertainty > systems claiming certainty
People trust systems that improve after failures > systems that deny failures
People trust systems with visible accountability > systems with opaque performance

GCRI’s approach: Feedback sovereignty. Affected populations and oversight bodies have the right and the capability to:

See assumptions: What system assumes about world
Observe outcomes: What actually happened
Challenge performance: Demand explanation when predictions fail
Trigger review: Force systematic reassessment
Verify improvement: Confirm changes actually improve performance

“Sovereignty” because feedback is not an optional afterthought; it is a structural requirement enforced through technical architecture and governance.

Mechanism I: Sensors and Observability Infrastructure

Multi-Scale Monitoring Architecture

Effective feedback requires observing reality across time and space scales. Single point measurements miss systemic patterns.

Temporal scales:

Real-time (seconds to hours): Flash flood detection, earthquake shaking, disease cluster emergence
Event-scale (days to weeks): Cyclone impacts, epidemic progression, drought evolution
Seasonal (months): Crop yields, malnutrition rates, water availability
Annual (years): Disaster loss trends, early warning coverage, adaptive capacity
Decadal (decades): Climate change impacts, ecosystem shifts, demographic transitions

Spatial scales:

Household/individual: Specific vulnerable persons, buildings, assets
Community (neighborhood, village): Local exposure, social networks, collective coping
Municipal (city, district): Service delivery, infrastructure functionality, governance capacity
Provincial/state: Regional hazard patterns, coordination effectiveness, resource flows
National: Aggregate outcomes, policy impacts, fiscal sustainability
Regional/global: Cross-border risks, comparative performance, systemic risks

Sensor types:

1. Physical sensors (environmental monitoring):

Hydrometeorological:

Rain gauges (tipping bucket, weighing, optical): 10-minute to daily accumulation
River level gauges (pressure transducers, radar, ultrasonic): Continuous water level, discharge calculation
Weather stations (temperature, humidity, wind, pressure, solar radiation): Hourly observations
Snow depth sensors (ultrasonic, laser): Critical for snowmelt flood forecasting
Soil moisture probes (capacitance, TDR): Agricultural drought monitoring

Geophysical:

Seismometers: Earthquake detection, magnitude, location (strong motion + broadband)
GPS/GNSS: Ground deformation for volcanic eruptions, landslides, subsidence
Tiltmeters: Volcano monitoring (magma movement)
Infrasound arrays: Volcanic eruptions, explosions, meteor airburst

Oceanic:

Tide gauges: Storm surge, tsunami detection
Buoys (moorings, drifters): Ocean temperature, wave height, currents
DART (Deep-ocean Assessment and Reporting of Tsunamis): Open ocean tsunami detection

Atmospheric quality:

PM2.5/PM10 monitors: Air pollution, wildfire smoke
Gas sensors (CO, NO2, SO2, O3): Industrial accidents, air quality
Radiation detectors: Nuclear accidents

Deployment strategy:

National networks: Government meteorological/geological services operate core networks
Community-based: Local organizations/volunteers operate supplementary low-cost sensors in underserved areas
IoT/citizen science: Crowdsourced observations (smartphone barometers, home weather stations, traffic cameras)
Integration: Combine official + community + crowd data with quality control and provenance tracking

Data quality assurance:

Calibration: Sensors compared to reference standards annually; drift corrections applied
Outlier detection: Automated algorithms flag implausible values (e.g., negative rainfall, temperature outside physical range)
Spatial consistency: Compare neighboring sensors; isolated anomalies investigated
Temporal consistency: Check for sensor failure modes (stuck values, drift, drop-out)
Metadata: Document sensor type, installation date, maintenance history, known issues
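
The automated checks listed above (implausible values, stuck values, drop-outs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GCRI's implementation: the function name `flag_readings`, the physical limits, and the three-reading stuck threshold are all hypothetical choices.

```python
from typing import List, Optional


def flag_readings(values: List[Optional[float]], lower: float, upper: float,
                  stuck_run: int = 3) -> List[str]:
    """Return one QC flag per reading: 'missing', 'out_of_range', 'stuck' or 'ok'."""
    flags = []
    for v in values:
        if v is None:
            flags.append("missing")        # drop-out
        elif v < lower or v > upper:
            flags.append("out_of_range")   # outside the assumed physical range
        else:
            flags.append("ok")
    # A run of identical consecutive readings of length >= stuck_run is marked stuck.
    run_start = 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[run_start]:
            if values[run_start] is not None and i - run_start >= stuck_run:
                for j in range(run_start, i):
                    if flags[j] == "ok":
                        flags[j] = "stuck"
            run_start = i
    return flags


# Hypothetical air temperature feed in degrees Celsius, assumed valid range -60..60
temps = [21.4, 21.9, None, 150.0, 22.3, 22.3, 22.3, 22.3, 23.0]
flags = flag_readings(temps, lower=-60.0, upper=60.0)
```

A stuck-value rule like this suits continuously varying quantities such as temperature; it would need an exception for variables where repeated identical values are normal, such as zero rainfall during dry spells.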

2. Satellite remote sensing (wide-area continuous monitoring):

Optical imagery:

High-resolution (<5m): Building damage, flood extent, landslides (Planet, Maxar, Airbus)
Medium-resolution (10-30m): Land use, vegetation health, agriculture (Sentinel-2, Landsat)
Coarse-resolution (250m-1km): Daily global coverage, cloud monitoring (MODIS, VIIRS)

Synthetic Aperture Radar (SAR):

All-weather (penetrates clouds), day/night imaging
Flood mapping (water appears dark), ground deformation (InSAR interferometry)
Sentinel-1, RADARSAT, ALOS PALSAR

Thermal infrared:

Land surface temperature: Urban heat islands, drought stress, wildfire detection
MODIS, Landsat thermal bands, ECOSTRESS

Passive microwave:

Soil moisture (SMAP, SMOS), snow water equivalent, precipitation (GPM)
All-weather, coarse resolution (10-50km)

Altimetry:

River levels from space (Sentinel-3, Jason, SWOT)
Lake levels, flood extent

Atmospheric sounding:

Temperature/humidity profiles for weather forecasting
Geostationary (GOES, Meteosat, Himawari) and polar orbiting (NOAA, MetOp)

Processing pipelines:

Automated processing from Level 0 (raw) → Level 1 (calibrated) → Level 2 (geophysical parameters) → Level 3 (gridded products)
Cloud-optimized formats (COG, Zarr) for efficient access
STAC catalogs for discovery
API access for automated systems

3. Impact sensors (observing disaster effects):

Infrastructure monitoring:

Power grid: Outage maps (utilities, crowdsource via smartphone connectivity)
Transportation: Road closures, bridge damage (traffic cameras, GPS probes, reports)
Communications: Network availability (cell tower functionality, internet connectivity)
Water/sanitation: Service disruptions (treatment plant status, contamination alerts)

Health surveillance:

Syndromic: Emergency department visits, pharmacy sales (early epidemic detection)
Laboratory: Confirmed disease cases, pathogen genomic sequences
Mortality: Excess deaths (compared to baseline, adjusted for seasonality)
Nutrition: Acute malnutrition rates (MUAC screening in health facilities, community surveys)

Economic impacts:

Business activity: Mobile money transactions, electricity consumption, nighttime lights
Employment: Payroll data, unemployment claims, labor force participation
Prices: Food commodity prices, inflation (especially essential goods)
Financial stress: Bank withdrawals, loan defaults, credit demand

Social impacts:

Displacement: Camp populations, border crossings, mobile phone mobility patterns
Education: School closures, attendance, drop-out rates
Protection: Gender-based violence incidence, child protection cases
Social cohesion: Conflict events, hate speech (media monitoring), polarization indices

Human mobility and displacement:

GPS traces (aggregated, anonymized) from mobile operators or apps (with consent)
Social media geolocation (public posts only, privacy-respecting)
Camp registration systems (UNHCR, IOM)
Satellite detection of informal settlements

Privacy and ethics:

Individual-level data never shared raw; only aggregated statistics with differential privacy
Informed consent for any personal data collection
Independent ethics review (IRB equivalent) for novel sensor deployments
Community consultation before deploying sensors in sensitive areas
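
One common way to release aggregated statistics with differential privacy is the Laplace mechanism, sketched below for a simple count. The source does not specify which mechanism or privacy budget is used, so the epsilon value, the sensitivity of 1 (each person contributes at most one unit to the count), and the function names are assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy, assuming sensitivity 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example: number of displaced persons observed in one district
rng = random.Random(42)
released = [private_count(1200, epsilon=0.5, rng=rng) for _ in range(2000)]
mean_released = sum(released) / len(released)  # noise averages out near 1200
```

Smaller epsilon means more noise and stronger privacy. Naive floating-point samplers like this one have known weaknesses, so a production system would more likely rely on a vetted differential privacy library than on hand-written noise.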

4. Social sensors (human perception and experience):

Surveys and assessments:

Rapid needs assessments: 24-72 hours post-disaster; identify urgent needs
Multi-sectoral assessments: 1-2 weeks post-disaster; comprehensive across WASH, shelter, food, health
Post-distribution monitoring: After aid delivery; verify assistance reached intended beneficiaries
Household surveys: Periodic (quarterly, annually); track recovery, resilience, vulnerability trends

Participatory monitoring:

Community scorecards: Communities rate service delivery (early warning, assistance, reconstruction)
Focus group discussions: Qualitative insights on system performance, equity, appropriateness
Photovoice: Affected populations document experiences via photography with narrative
Participatory GIS: Communities map hazards, resources, vulnerabilities using local knowledge

Feedback hotlines:

Toll-free phone lines (24/7) for questions, complaints, suggestions
SMS-based feedback (lower barrier than calling)
WhatsApp/messaging bots (conversational interface)
Analysis: Text analytics identify common themes; sentiment analysis gauges satisfaction

Social media listening (with privacy safeguards):

Public posts on X/Twitter, Facebook, Reddit, local platforms
Natural language processing for disaster mentions, needs, misinformation
Geographic clustering of concerns
Ethics: Only public posts; no individual profiling; aggregated insights only

Traditional media monitoring:

Newspapers, radio, TV coverage of disasters and response
Identify narratives, perceived failures, public concerns
Comparative analysis across outlets (state media vs independent)

Grievance and feedback mechanisms (Section 1.5):

Formal complaints through accountability systems
Structured data: Issue type, timeliness, resolution, satisfaction
Trend analysis: Recurring issues signal systemic problems

Real-Time Dashboards and Observability

Observability: In software engineering, the ability to understand a system's internal state from its external outputs. Applied to disaster risk: the ability to understand how the system is performing from sensor data and metrics.

Principles:

Three pillars: Metrics (quantitative), Logs (events), Traces (causality)
Aggregation: Summary statistics for executives; drill-down to details for operators
Alerting: Automated notifications when metrics exceed thresholds
Visualization: Maps, time series, distributions—appropriate chart types for data

Dashboard tiers:

Executive dashboard (national leadership, donors, board):

High-level KPIs: Lives saved, population covered, protection latency, equity metrics
Trend indicators: Improving/stable/degrading
Financial: Budget utilization, cost per beneficiary, leverage ratios
Update frequency: Daily to weekly
Access: Public transparency portal (aggregated data)

Operational dashboard (emergency operations centers, NWG staff):

Real-time hazard monitoring: Active forecasts, sensor readings, satellite imagery
Response status: Actions taken, resources deployed, populations reached
Situational awareness: Infrastructure status, access constraints, security incidents
Coordination: Inter-agency activities, mutual aid requests, logistics tracking
Update frequency: Continuous (seconds to minutes)
Access: Authenticated users with operational roles

Technical dashboard (validators, model developers, data scientists):

Model performance: Skill scores, bias metrics, ensemble spread
Data quality: Sensor availability, latency, completeness
System health: Computing resources, API response times, error rates
Validation status: Pending reviews, signature coverage, dissent tracking
Update frequency: Real-time for system health; daily for model performance
Access: Technical staff and validators

Community dashboard (local leaders, affected populations):

Local forecasts and alerts in plain language
Evacuation routes and shelter locations (maps)
Distribution schedules and locations
Feedback channel access (submit reports, check grievance status)
Update frequency: As needed during events; periodic otherwise
Access: Public, mobile-optimized, offline-capable, multi-language

Implementation technologies:

Time-series databases: InfluxDB, TimescaleDB (efficient storage/query of sensor data)
Visualization: Grafana, Kibana, custom React/D3.js dashboards
Alerting: Prometheus Alertmanager, PagerDuty (on-call notifications)
Log aggregation: Elasticsearch, Loki (searchable event logs)
Distributed tracing: Jaeger, Zipkin (understand causality in complex systems)

Mechanism II: Key Performance Indicators (KPIs) and Benchmarking

Outcome-Focused KPI Design

Shift from outputs to outcomes:

Outputs: Activities completed (workshops held, systems installed, people trained)
Outcomes: Changes in vulnerability, exposure, coping capacity, wellbeing

GCRI KPI framework: Structured hierarchy from strategic goals → operational metrics.

Strategic goal: Reduce disaster mortality, economic losses, and displacement through early warning and anticipatory action

Strategic KPIs (measure goal achievement):

SK1. Mortality reduction:

Disaster-attributable deaths per million population (annualized)

Target: Year-over-year reduction of ≥10%
Baseline: Historical 10-year average
Disaggregation: By hazard type, gender, age, geography, wealth quintile

SK2. Economic resilience:

Disaster losses as % of GDP (annualized)

Target: Maintain <0.5% in non-catastrophic years; <2% in 1-in-20 year events
Baseline: Historical average
Disaggregation: By sector (agriculture, infrastructure, housing, commercial)

SK3. Displacement prevention:

Person-months of displacement (internally displaced + refugees)

Target: Reduce by ≥20% compared to counterfactual
Measurement: Compare actual displacement to model prediction absent early action
Disaggregation: Cause (conflict vs disaster), duration, demographics

Operational goal: Deliver timely, accurate early warning with equity to populations at risk

Operational KPIs:

OK1. Early warning coverage:

% of at-risk population with access to timely early warning (≥24h lead time)

Target: ≥90% by 2026; 100% by 2030 (the UN EW4All initiative itself targets universal coverage by the end of 2027)
Measurement: Surveys, registration data, network coverage maps
Disaggregation: Geography, demographics, disability, language

OK2. Forecast accuracy:

Probability of Detection (POD), False Alarm Ratio (FAR), Critical Success Index (CSI)

Target (varies by hazard):
- Riverine floods (7-day): POD ≥0.85, FAR ≤0.30, CSI ≥0.60
- Tropical cyclones (72h track): Error ≤150km
- Drought onset (90-day): Hit rate ≥0.70
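
The three scores come from a forecast/observation contingency table. The sketch below uses the standard definitions, with FAR computed as false alarms divided by all warnings issued (the false alarm ratio), which is an assumption about the definition intended here. The event counts are hypothetical.

```python
def verification_scores(hits: int, misses: int, false_alarms: int) -> dict:
    """Compute POD, FAR (false alarm ratio) and CSI from contingency-table counts."""
    if hits + misses == 0 or hits + false_alarms == 0:
        raise ValueError("need at least one observed event and one warning")
    return {
        "POD": hits / (hits + misses),                 # share of events that were warned
        "FAR": false_alarms / (hits + false_alarms),   # share of warnings with no event
        "CSI": hits / (hits + misses + false_alarms),  # hits over all warned-or-observed cases
    }


# Hypothetical season of 7-day riverine flood forecasts
scores = verification_scores(hits=86, misses=14, false_alarms=36)
meets_target = (scores["POD"] >= 0.85 and scores["FAR"] <= 0.30
                and scores["CSI"] >= 0.60)
```

With these counts POD is 0.86, FAR is about 0.295, and CSI is about 0.632, so all three flood targets are met. Correct rejections do not enter any of the three scores, which is why they remain usable for rare events.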

OK3. Protection latency:

Median time from forecast to first protective action (hours)

Target: ≤48 hours for all hazards; ≤72 hours for vulnerable populations specifically
Measurement: Timestamp logs (forecast issued → playbook activated → assistance delivered)
Disaggregation: Hazard type, geography, beneficiary demographics
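
Protection latency can be computed directly from timestamp logs. The sketch below assumes each record holds two ISO 8601 timestamps (forecast issued, first protective action); the record layout and the sample events are hypothetical.

```python
from datetime import datetime
from statistics import median


def protection_latency_hours(events):
    """Median hours between forecast issuance and first protective action."""
    latencies = []
    for forecast_issued, first_action in events:
        delta = datetime.fromisoformat(first_action) - datetime.fromisoformat(forecast_issued)
        latencies.append(delta.total_seconds() / 3600.0)
    return median(latencies)


# Hypothetical log entries: (forecast issued, first protective action)
events = [
    ("2024-06-01T06:00:00", "2024-06-02T18:00:00"),  # 36 hours
    ("2024-06-10T00:00:00", "2024-06-12T04:00:00"),  # 52 hours
    ("2024-07-03T12:00:00", "2024-07-04T12:00:00"),  # 24 hours
]
median_latency = protection_latency_hours(events)
```

The median is used rather than the mean so that a single badly delayed activation does not dominate the indicator. Running the same function on subsets of the log gives the disaggregated values.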

OK4. Equity metrics (from Section 1.7):

Reach ratios ≥1.0 for all vulnerable groups
Gini coefficient of risk declining year-over-year
Adequacy ratios ≥0.9 for assistance to all groups

Technical goal: Maintain reliable, secure, validated system infrastructure

Technical KPIs:

TK1. System availability:

Uptime % for critical services (forecasting, alerting, coordination platforms)

Target: ≥99.5% (≤43.8 hours downtime/year)
Measurement: Automated health checks, synthetic monitoring
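
The relationship between an availability target and the allowable downtime is simple arithmetic, sketched here for a 365-day year. The helper names and the sample health-check counts are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def max_downtime_hours(availability_pct: float) -> float:
    """Allowable downtime per year for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def measured_availability_pct(failed_checks: int, total_checks: int) -> float:
    """Availability estimated from periodic health checks."""
    return 100 * (1 - failed_checks / total_checks)


budget = max_downtime_hours(99.5)  # about 43.8 hours
# Hypothetical year of one-minute health checks with 2,000 failures
observed = measured_availability_pct(failed_checks=2000, total_checks=525600)
```

In this hypothetical case, 2,000 failed one-minute checks correspond to about 33 hours of downtime, giving roughly 99.62% availability, which is inside the target.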

TK2. Validation timeliness:

% of critical outputs validated (2-of-N signatures) within SLA

Target: ≥95% within 24h for routine; ≥95% within 4h for urgent
Measurement: NVM validation logs with timestamps

TK3. Data quality:

% of required sensor data available with ≤5% missing values

Target: ≥90% sensor availability; ≤5% data gaps
Measurement: Automated quality checks on sensor feeds

TK4. Cybersecurity posture:

Zero critical vulnerabilities unpatched beyond SLA
Zero successful intrusions leading to data breach

Target: 100% compliance with patch SLAs; zero breaches
Measurement: Vulnerability scans, security incident reports

Governance goal: Maintain transparent, accountable, participatory processes

Governance KPIs:

GK1. Transparency:

% of critical outputs published with full verification packages within 30 days

Target: 100%
Measurement: Transparency portal audit

GK2. Grievance responsiveness:

% of grievances acknowledged within 48h; resolved within SLA

Target: 100% acknowledged; ≥90% resolved within SLA
Measurement: Grievance mechanism database

GK3. Participatory governance:

% of validation nodes with representation from all five quintuple helix sectors
% of decisions with civil society/Indigenous participation

Target: 100% of countries have complete node representation; ≥80% major decisions include community consultation
Measurement: Node registry, decision logs

Benchmarking and Comparative Performance

Peer comparison: How does performance compare to similar contexts?

Peer groups (for meaningful comparison):

Geography: Neighboring countries, shared hazard zones (Pacific islands, Sahel, Caribbean, etc.)
Development level: Similar GDP per capita, HDI, governance indicators
Hazard profile: Countries facing similar dominant risks (flood-prone, cyclone-exposed, earthquake zones)

Benchmarking reports (quarterly):

Country: Bangladesh
Peer group: South Asian countries with flood risk (India, Pakistan, Nepal)

Early Warning Coverage:
- Bangladesh: 87%
- Peer average: 72%
- Best in group: 94% (India)
- Assessment: Above average; gap to best practice: 7 percentage points

Protection Latency:
- Bangladesh: 38 hours
- Peer average: 56 hours
- Best in group: 28 hours (Nepal)
- Assessment: Strong performance; opportunity to learn from Nepal's playbook protocols

Equity (Reach Ratio for bottom quintile):
- Bangladesh: 1.12
- Peer average: 0.89
- Best in group: 1.18 (Pakistan)
- Assessment: Pro-equity; slight opportunity for improvement

Learning: Benchmarking identifies:

Strengths to celebrate and share with others
Gaps requiring improvement
Peer models to learn from (specific practices to adopt)
Innovation opportunities (where no peer excels; room for breakthrough)

Global dashboards: Public rankings (with country consent) create reputational incentives for improvement—no government wants to be bottom of league table. But rankings must:

Account for context (don’t penalize least developed countries for having fewer resources)
Emphasize improvement trajectory, not just absolute performance
Highlight exemplars across diverse contexts (best performer in Africa, best improver globally, most equitable system, etc.)

Confidence Tiers and Epistemic Humility

Not all KPIs have equal certainty: Some metrics well-measured (sensor uptime); others estimated with substantial uncertainty (lives saved counterfactually).

GCRI approach: Assign confidence tiers to KPIs, communicate uncertainty transparently.

Tier 1 – High confidence (direct measurement, minimal inference):

System uptime (server logs)
Validation timeliness (timestamp logs)
Grievance response times (database records)
Sensor coverage (inventory)

Tier 2 – Medium confidence (requires some inference/modeling):

Forecast accuracy (observed events vs forecasts; but some events may not be fully observed)
Protection latency (requires matching forecast timestamps to action reports; some actions undocumented)
Early warning coverage (surveys have sampling error; self-reported access may differ from actual)

Tier 3 – Lower confidence (substantial counterfactual estimation):

Lives saved (requires counterfactual model of what would have happened)
Economic losses avoided (requires baseline loss model)
Displacement prevented (requires migration model)

Reporting format:

Lives saved (2024): 12,400 [90% CI: 8,200 - 18,100] (Tier 3)

Methodology: Counterfactual model comparing actual mortality to baseline 
vulnerability × hazard intensity, validated against historical events
Confidence: Lower (interval spans roughly -34% to +46% around the central estimate)
Sensitivity: Result sensitive to assumptions about baseline vulnerability; 
best estimate uses conservative assumptions

Why transparency matters:

Credibility: Acknowledging uncertainty builds trust more than false precision
Accountability: Clear about what we know vs estimate
Learning: Understanding uncertainty helps prioritize where to improve measurement

Mechanism III: Counterfactuals and Causal Inference

The Fundamental Problem of Causal Inference

Counterfactual question: What would have happened without the intervention?

Fundamental problem: Cannot observe both realities—can’t have intervention AND no intervention in same place/time. We observe one; must estimate the other.

Naive approach – Before/after comparison:

Mortality before early warning: 500 deaths/year
Mortality after early warning: 100 deaths/year
Claimed impact: 400 lives saved

Problem: Many confounders (things that changed beyond intervention):

Maybe hazard intensity decreased (fewer/weaker cyclones)
Maybe overall development increased (better housing, healthcare)
Maybe population moved away from high-risk areas
Maybe other programs also contributing

Before/after comparison conflates:

Intervention effect (what we want to measure)
Time trends (secular improvement)
Confounding factors (other concurrent changes)

Counterfactual Estimation Methods

1. Randomized Controlled Trials (RCTs) – Gold standard when ethical and feasible

Design:

Randomly assign units (villages, districts, households) to treatment (receive intervention) or control (do not)
Randomization makes treatment and control groups statistically comparable in expectation, so they differ systematically only in the intervention
Compare outcomes; difference = causal effect

Example – Forecast-based financing in Bangladesh:

Design: 60 sub-districts randomly assigned: 30 receive anticipatory cash transfers 
triggered by GloFAS forecasts; 30 receive standard post-disaster response

Outcome: Food insecurity measured 6 months post-flood

Results:
- Treatment group: 18% severe food insecurity
- Control group: 28% severe food insecurity
- Difference: 10 percentage points (95% CI: 6-14pp)
- Interpretation: Anticipatory action reduced severe food insecurity by 36% relative to control
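
The absolute and relative effects quoted in the results can be reproduced from the two group rates. This is a small arithmetic check, not a full analysis: it omits the confidence interval, which would require the sample sizes that are not given here. The function names are hypothetical.

```python
def absolute_reduction_pp(control_rate: float, treated_rate: float) -> float:
    """Reduction in percentage points (rates given in percent)."""
    return control_rate - treated_rate


def relative_reduction_pct(control_rate: float, treated_rate: float) -> float:
    """Reduction expressed as a percentage of the control-group rate."""
    return 100 * (control_rate - treated_rate) / control_rate


abs_effect = absolute_reduction_pp(control_rate=28, treated_rate=18)   # 10 points
rel_effect = relative_reduction_pct(control_rate=28, treated_rate=18)  # about 35.7%
```

Reporting both figures matters because the same 10 percentage point difference would be a much smaller relative reduction against a higher control-group rate.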

Limitations:

Ethical concerns: Denying possibly life-saving intervention to control group
Political constraints: Governments often must act everywhere, not selectively
Externalities: Treatment may affect control (spillovers via migration, trade, information)
Cost and time: RCTs expensive, take years to complete

When appropriate: Testing new approaches on margin (expanding coverage area); learning whether specific design features matter; measuring mechanisms.

2. Quasi-experimental designs – Exploit natural variation

Difference-in-differences (DiD):

Logic: Compare change over time between treated and untreated groups.

Assumption: Parallel trends—absent treatment, both groups would have followed same trajectory.

Formula:

Impact = (Y_treated,after - Y_treated,before) - (Y_control,after - Y_control,before)

Example – Early warning system in Colombia:

Setting: Early warning deployed in Pacific coast municipalities 2018
Comparison: Caribbean coast municipalities (similar exposure, no early warning until 2020)

Mortality trend (per 100k):
            2015  2016  2017  2018  2019  2020
Pacific      8.2   7.9   7.6   5.1   4.8   4.6  (early warning starts 2018)
Caribbean    8.4   8.1   7.8   7.5   7.3   4.9  (early warning starts 2020)

DiD calculation:
Pacific change (2017→2019): 4.8 - 7.6 = -2.8
Caribbean change (2017→2019): 7.3 - 7.8 = -0.5
Difference: -2.8 - (-0.5) = -2.3 deaths per 100k

Interpretation: Early warning reduced disaster mortality by ~2.3 per 100k (about a 30% reduction relative to the 2017 Pacific rate of 7.6)

Checks:

Pre-trends parallel? (Yes, Pacific and Caribbean followed similar trends 2015-2017)
No other confounders? (Check for concurrent programs, policy changes)
Placebo tests: If we pretend intervention happened earlier (e.g., 2016), do we see effect? (Should be no)
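
The DiD estimate and a placebo check can both be computed from the mortality table in the Colombia example. The sketch below is illustrative: the `did` helper is hypothetical, and the placebo simply applies the same calculation to two pre-treatment years (2015 to 2017), where no effect should appear.

```python
# Mortality per 100k, taken from the table in the example
mortality = {
    "Pacific":   {2015: 8.2, 2016: 7.9, 2017: 7.6, 2018: 5.1, 2019: 4.8, 2020: 4.6},
    "Caribbean": {2015: 8.4, 2016: 8.1, 2017: 7.8, 2018: 7.5, 2019: 7.3, 2020: 4.9},
}


def did(data, treated, control, before, after):
    """Change in the treated group minus change in the control group."""
    treated_change = data[treated][after] - data[treated][before]
    control_change = data[control][after] - data[control][before]
    return treated_change - control_change


# Main estimate: last pre-treatment year vs. second treatment year
effect = did(mortality, "Pacific", "Caribbean", before=2017, after=2019)

# Placebo: both years precede the real intervention, so the estimate should be ~0
placebo = did(mortality, "Pacific", "Caribbean", before=2015, after=2017)
```

The main estimate is about -2.3 deaths per 100k and the placebo estimate is about zero, consistent with parallel pre-trends. The comparison window stops at 2019 because the Caribbean coast itself becomes treated in 2020.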

Regression discontinuity (RD):

Logic: Intervention has sharp threshold; compare units just above vs just below threshold.

Example – Flood insurance threshold:

Setting: Flood insurance subsidized for properties <5m above sea level
Question: Does insurance reduce flood damages?

Approach: Compare properties at 4.5-5.0m (insured) vs 5.0-5.5m (not insured)
Properties on both sides of threshold similar except for insurance

Results:
- Insured properties (4.5-5.0m): Average flood damage $8,200
- Uninsured properties (5.0-5.5m): Average flood damage $12,400
- RD estimate: Insurance reduces damage by $4,200 (34%)

Assumptions: No manipulation (people don't precisely choose elevation to get insurance)

Synthetic control method (SCM):

Logic: Create synthetic version of treated unit from weighted combination of untreated units. Post-intervention, gap between treated and synthetic = impact.

Example – Ethiopia Productive Safety Net Programme (PSNP) expansion:

Setting: PSNP expanded to Somali region in 2014 with shock-responsive features
Question: Impact on poverty and resilience?

Approach:
- Create "synthetic Somali" from weighted average of other Ethiopian regions
- Weights chosen so synthetic Somali matches actual Somali pre-2014 on:
  * Poverty rates
  * Rainfall patterns
  * Livestock ownership
  * Infrastructure access

Results:
        2010  2012  2014  2016  2018  2020
Actual    38%   36%   34%   28%   26%   24%  (poverty rate)
Synthetic 38%   36%   34%   32%   31%   30%

Gap between actual and synthetic: 4pp in 2016, 5pp in 2018, 6pp in 2020
Interpretation: PSNP reduced poverty rate by ~6pp more than would have occurred otherwise
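
The year-by-year gap between the actual and synthetic series can be read off the results table. The sketch below only computes those gaps and checks the pre-treatment fit; it does not estimate the donor weights, which would need region-level data not shown here.

```python
# Poverty rates (%) from the results table
years = [2010, 2012, 2014, 2016, 2018, 2020]
actual = [38, 36, 34, 28, 26, 24]
synthetic = [38, 36, 34, 32, 31, 30]

TREATMENT_YEAR = 2014

# Gap = actual minus synthetic; negative values mean lower poverty than the counterfactual
gaps = {year: a - s for year, a, s in zip(years, actual, synthetic)}

# A credible synthetic control should track the treated unit closely before treatment
pre_fit_ok = all(gaps[y] == 0 for y in years if y <= TREATMENT_YEAR)
post_gaps = {y: g for y, g in gaps.items() if y > TREATMENT_YEAR}
```

Here the pre-2014 fit is exact and the gap widens from -4 to -6 percentage points, which suggests the estimated effect grows over time rather than appearing as a one-off shift.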

Advantages: Works when only one treated unit; flexible; visually intuitive.

3. Statistical matching – Create comparable groups post hoc

Propensity score matching (PSM):

Logic: Estimate probability (propensity) that unit receives treatment given observed characteristics. Match treated units to control units with similar propensity scores.

Example – Community-based early warning:

Setting: Some villages adopted community early warning (CBEW); others did not
Selection not random (villages with strong leadership more likely to adopt)

Approach:
1. Estimate propensity score: P(CBEW | village characteristics)
   Characteristics: Leadership quality, education, distance to town, prior disaster experience
2. Match each CBEW village to non-CBEW village with similar propensity
3. Compare outcomes between matched pairs

Results:
- CBEW villages: 12% of population displaced during floods
- Matched control villages: 21% displaced
- PSM estimate: CBEW reduced displacement by 9 percentage points (43% relative reduction)

Limitation: PSM only controls for observed confounders. If unobserved factors drive both treatment and outcomes, PSM estimates are biased.
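
The matching step (step 2 of the approach) can be sketched as greedy one-to-one nearest-neighbour matching with a caliper. The propensity scores are taken as given, since estimating them would need a fitted model such as a logistic regression. All village names, scores, outcomes, and the caliper width are hypothetical.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity score, without replacement.

    treated / controls map a village name to (propensity_score, displaced_share).
    Returns a list of (treated_name, control_name, outcome_difference).
    """
    available = dict(controls)
    pairs = []
    for t_name, (t_score, t_outcome) in sorted(treated.items()):
        if not available:
            break
        c_name = min(available, key=lambda c: abs(available[c][0] - t_score))
        if abs(available[c_name][0] - t_score) <= caliper:
            pairs.append((t_name, c_name, t_outcome - available[c_name][1]))
            del available[c_name]  # each control village is used at most once
    return pairs


# Hypothetical villages: (propensity score, share of population displaced)
cbew = {"A": (0.70, 0.10), "B": (0.55, 0.14), "C": (0.40, 0.12)}
non_cbew = {"P": (0.68, 0.20), "Q": (0.52, 0.22), "R": (0.41, 0.21), "S": (0.10, 0.30)}

pairs = match_nearest(cbew, non_cbew)
att = sum(diff for _, _, diff in pairs) / len(pairs)  # average effect on the treated
```

Village S is never matched because no CBEW village has a comparable propensity score. Discarding such units is the point of the caliper: the estimate (here about -9 percentage points) is based only on villages that were similarly likely to adopt CBEW.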

Instrumental variables (IV):

Logic: Find variable (instrument) that affects treatment but doesn’t directly affect outcome.

Example – Forecast accuracy and mortality:

Question: Does more accurate forecasting reduce mortality?
Problem: Can't randomize forecast accuracy

Instrument: Distance to weather radar
- Closer to radar → better forecast accuracy (instrument affects treatment)
- Distance is assumed not to affect mortality except through forecast accuracy (exclusion restriction)

Approach:
1. First stage: Forecast accuracy = f(distance to radar, controls)
2. Second stage: Mortality = f(predicted forecast accuracy, controls)

Result: 10 percentage point increase in forecast accuracy → 15% reduction in mortality
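
The two-stage logic can be illustrated on simulated data where the true effect is known. Everything below is synthetic and hypothetical: the variable names, coefficients, and sample size are chosen only to show that the naive regression is biased by an unobserved confounder while the IV estimate is not. With a single instrument and no additional controls, the two-stage procedure reduces to the ratio cov(z, y) / cov(z, x).

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


rng = random.Random(0)
n = 5000
TRUE_EFFECT = -1.5  # assumed effect of forecast accuracy on mortality

z = [rng.gauss(0, 1) for _ in range(n)]  # instrument: standardized distance to radar
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
# Treatment: accuracy falls with distance and also depends on the confounder
x = [-0.5 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
# Outcome: depends on accuracy and, directly, on the confounder (but not on z)
y = [TRUE_EFFECT * xi + 2.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

naive_estimate = cov(x, y) / cov(x, x)  # ordinary regression slope, confounded
iv_estimate = cov(z, y) / cov(z, x)     # instrumental variable estimate
```

In this simulation the naive slope is pulled well toward zero by the confounder, while the IV estimate lands close to the true value of -1.5. The result depends entirely on the exclusion restriction holding, which in real data cannot be tested directly.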

4. Model-based counterfactuals – Use simulation models

When no comparison group available, use calibrated models to simulate counterfactual.

Approach:

Calibrate disaster loss model to historical events (before intervention)
For recent event with intervention, model predicts losses given hazard intensity and baseline vulnerability (no intervention)
Actual losses observed
Difference = intervention impact estimate

Example – 2020 Cyclone Amphan (West Bengal):

Hazard: Category 3 equivalent cyclone (185 km/h winds)

Model prediction (baseline vulnerability, no early warning):
- Based on calibration to 1999 Super Cyclone (similar intensity, pre-early warning era)
- Predicted mortality: 480-620 deaths

Actual observed mortality: 86 deaths

Estimated lives saved: 394-534 (central estimate: 464)

Model uncertainty: 
- Assumes baseline vulnerability unchanged (may overestimate if development reduced vulnerability)
- Assumes hazard intensity similar (wind speed estimates have ±15% uncertainty)
- Confidence interval: 250-700 lives saved (90% CI)

Validation: Test the model on historical events with known outcomes that were held out of calibration. Accurate predictions on those events increase confidence in the counterfactual estimates.
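
The lives-saved range in the Amphan example follows from subtracting observed mortality from the predicted range. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def lives_saved(predicted_low: int, predicted_high: int, observed: int):
    """Return (low, high, central) estimates of deaths avoided."""
    low = predicted_low - observed
    high = predicted_high - observed
    central = (low + high) / 2
    return low, high, central


# Predicted 480-620 deaths without early warning; 86 deaths observed
low, high, central = lives_saved(480, 620, observed=86)
```

This yields 394 to 534 with a central estimate of 464. The range reflects only the spread of the model prediction; the wider 90% interval quoted in the example also accounts for uncertainty in vulnerability and hazard intensity.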

Causal Mechanism Analysis

Beyond “did it work?” ask “how did it work?”—understanding mechanisms enables replication and improvement.

Process tracing: Detailed case studies documenting causal chain.

Example – Why did early warning reduce mortality in Bangladesh but not in Haiti?

Bangladesh (successful mechanism):

Forecast issued 72h ahead →
District officials activated playbook (pre-authorized) →
Community volunteers conducted door-to-door notification →
Cyclone shelters opened, evacuation transport provided →
2M people evacuated →
Cyclone struck; shelters protected population →
Result: 26 deaths (vs 500+ predicted)

Haiti (broken mechanism):

Forecast issued 48h ahead →
❌ No clear authority to activate response (government dysfunction) →
❌ Alerts issued via radio/SMS but low trust in government (history of broken promises) →
❌ No designated shelters or organized evacuation →
❌ Most people did not evacuate despite warning →
Hurricane struck →
Result: 546 deaths

Lesson: Early warning necessary but insufficient. Requires: Functional governance, public trust, pre-positioned resources, rehearsed protocols. GCRI designs systems addressing full causal chain, not just forecast dissemination.

Mediation analysis: Quantify how much of effect operates through specific pathways.

Example – How does anticipatory cash reduce food insecurity?

Potential mechanisms:

Income smoothing: Cash allows purchasing food during shock
Asset protection: Cash prevents distress sale of livestock/tools
Migration prevention: Cash allows staying in place, maintaining livelihoods

Mediation analysis:

Total effect of cash on food insecurity: -12 percentage points

Decomposition:
- Via income smoothing: -5pp (42% of effect)
- Via asset protection: -4pp (33% of effect)  
- Via migration prevention: -2pp (17% of effect)
- Direct effect (residual): -1pp (8% of effect)

Implication: Optimize design by targeting mechanisms. If asset protection is key pathway, ensure cash arrives before distress sales begin.
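The decomposition above is simple share arithmetic: each pathway's effect divided by the total effect. A minimal sketch, assuming the pathway effects have already been estimated (the function and key names are hypothetical; estimating the pathway effects themselves requires a proper mediation model, which is not shown):

```python
def pathway_shares(effects_pp: dict) -> dict:
    """Return each pathway's share of the total effect, in whole percent."""
    total = sum(effects_pp.values())
    if total == 0:
        raise ValueError("total effect is zero; shares are undefined")
    return {name: round(100 * value / total) for name, value in effects_pp.items()}


# Effects in percentage points, taken from the decomposition above.
effects = {
    "income_smoothing": -5,
    "asset_protection": -4,
    "migration_prevention": -2,
    "direct_residual": -1,
}

total_effect = sum(effects.values())
shares = pathway_shares(effects)
print(total_effect)  # -12
print(shares)
```

Checking that the pathway effects sum to the total effect is a useful consistency test before reporting shares.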

Mechanism IV: Sunset Clauses and Renewal Clocks

Designed Obsolescence as Governance Tool

Problem: Systems persist indefinitely through inertia, even when circumstances change or performance deteriorates. No forcing function for review.

Solution: Sunset clauses—systems automatically expire unless explicitly renewed after demonstrating continued value.

Rationale:

Forces periodic reassessment (instead of assuming perpetual relevance)
Shifts burden of proof (must justify continuation, not justify termination)
Creates opportunity for redesign (incorporate learning, adapt to changed context)
Prevents zombie programs (operational but ineffective)

Review Clock Architecture

Three clock types with different cadences:

1. Safety review clock (annual) – Technical performance

Trigger: Automatic at 12-month intervals from deployment

Scope: Model performance, data quality, system reliability

Review questions:

Are forecast skill scores maintaining target thresholds?
Has model drift been detected? (Performance degrading over time)
Are data sources still available and reliable?
Have any security incidents occurred?
Are validators satisfied with output quality?

Evidence required:

Verification statistics (POD, FAR, CSI) over past year
Comparison to previous year (improving/stable/degrading?)
Documented model recalibrations or updates
Security audit results
Validator assessments and any dissents
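For reference, the verification statistics named above follow the standard contingency-table definitions: probability of detection (POD), false alarm ratio (FAR), and critical success index (CSI). The sketch below uses hypothetical counts; the class name is illustrative and not part of any GCRI system.

```python
from dataclasses import dataclass
from typing import Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None when the score is undefined (no relevant cases).
    return numerator / denominator if denominator else None


@dataclass(frozen=True)
class ContingencyTable:
    hits: int          # event forecast and observed
    misses: int        # event observed but not forecast
    false_alarms: int  # event forecast but not observed

    def pod(self) -> Optional[float]:
        return _ratio(self.hits, self.hits + self.misses)

    def far(self) -> Optional[float]:
        return _ratio(self.false_alarms, self.hits + self.false_alarms)

    def csi(self) -> Optional[float]:
        return _ratio(self.hits, self.hits + self.misses + self.false_alarms)


# Hypothetical counts for one review year.
table = ContingencyTable(hits=40, misses=10, false_alarms=10)
print(table.pod())            # 0.8
print(table.far())            # 0.2
print(round(table.csi(), 3))  # 0.667
```

Computing the same three scores for the previous year gives the improving/stable/degrading comparison the review asks for.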

Decision outcomes:

Renew (green): Performance satisfactory; continue operations
Conditional renewal (yellow): Performance acceptable but trends concerning; corrective action plan required; re-review in 6 months
Suspend (red): Performance below standards; system suspended until issues resolved
Sunset (black): Fundamental approach flawed; system terminated; redesign needed

2. Legitimacy review clock (biennial) – Rights and equity

Trigger: Every 24 months

Scope: Rights protections, equity outcomes, community satisfaction, grievance patterns

Review questions:

Are reach ratios maintained >1.0 for vulnerable groups?
Are protection latency gaps closed?
Is grievance mechanism functioning (response times, resolution rates)?
Do affected populations trust system? (Survey data)
Any rights violations or FPIC breaches?

Evidence required:

Disaggregated outcome data (Section 1.7 metrics)
Equity trend analysis
Grievance statistics and systemic issues identified
Community satisfaction surveys
Independent human rights audit

Participants:

Civil society/Indigenous validation nodes (lead)
Community representatives
Human rights organizations
Disability rights advocates
Women’s groups

Decision outcomes:

Renew: Equity standards met; rights protections effective
Conditional: Minor equity gaps; improvement plan within 6 months
Major overhaul: Significant equity failures; operations suspended until redesign addresses structural issues

3. Value review clock (triennial) – Strategic fit and impact

Trigger: Every 36 months

Scope: Does system still address priority needs? Are resources well-allocated? Is approach still optimal given evolving context?

Review questions:

Has risk landscape changed? (Climate change, urbanization, conflict)
Are interventions still targeting highest-impact opportunities?
Do benefits justify costs? (Cost-effectiveness analysis)
Are alternative approaches now available that would be more effective?
Does system align with national/regional strategies?

Evidence required:

Cost-effectiveness analysis ($ per life saved, $ per person covered)
Comparative effectiveness (vs alternative approaches)
Stakeholder value assessments (government, communities, donors)
Strategic alignment review
Theory of change validation (do our assumptions still hold?)

Participants:

Government (national planning, finance ministries)
Donors and investors
Continental Steward Node
Academic institutions (independent evaluation)
Community representatives

Decision outcomes:

Renew: System delivering value; continue with current design
Evolve: Fundamental approach sound but needs adaptation to changed context; managed evolution
Pivot: Original theory of change no longer valid; major redesign needed
Phase out: Problem solved, system no longer needed, or resources better allocated elsewhere

Default to Sunset (Not Default to Continuation)

Critical design choice: What happens if review doesn’t occur on schedule?

Traditional approach: System continues operating (default to continuation)

Problem: Review can be delayed indefinitely; creates institutional inertia

GCRI approach: System automatically suspends if review not completed (default to sunset)

NVM enforcement:

As renewal deadline approaches, alerts sent at 90, 60, and 30 days
If renewal not approved by deadline, system enters grace period (30 days) with degraded status
If still not renewed after grace period, system automatically suspends (no new outputs; existing operations phase down)

Why this matters: Creates institutional urgency for review. Cannot ignore deadlines. Forces explicit decisions about continuation.

Exception: True emergency during review window. Temporary extension (max 60 days) with documented justification and mandatory expedited review.
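The default-to-sunset rules can be sketched as a small status function. This is an illustration under stated assumptions, not the NVM implementation: all names are hypothetical, and it assumes an emergency extension shifts the deadline before the 30-day grace period begins, which the text does not specify.

```python
from datetime import date, timedelta
from enum import Enum

ALERT_DAYS_BEFORE = (90, 60, 30)
GRACE_PERIOD = timedelta(days=30)
MAX_EMERGENCY_EXTENSION_DAYS = 60


class Status(Enum):
    ACTIVE = "active"
    DEGRADED = "degraded (grace period)"
    SUSPENDED = "suspended"


def renewal_status(today: date, deadline: date, renewed: bool,
                   emergency_extension_days: int = 0) -> Status:
    """Status of a system whose renewal was due on `deadline`."""
    if not 0 <= emergency_extension_days <= MAX_EMERGENCY_EXTENSION_DAYS:
        raise ValueError("emergency extension must be between 0 and 60 days")
    if renewed:
        return Status.ACTIVE
    # Assumption: the extension moves the deadline; grace period follows it.
    effective_deadline = deadline + timedelta(days=emergency_extension_days)
    if today <= effective_deadline:
        return Status.ACTIVE
    if today <= effective_deadline + GRACE_PERIOD:
        return Status.DEGRADED
    return Status.SUSPENDED  # default to sunset: no renewal, no operation


def alerts_due(today: date, deadline: date) -> list:
    """Alert thresholds (in days before the deadline) that fall on `today`."""
    return [d for d in ALERT_DAYS_BEFORE if deadline - today == timedelta(days=d)]


deadline = date(2025, 6, 30)
print(renewal_status(date(2025, 7, 15), deadline, renewed=False))  # Status.DEGRADED
print(renewal_status(date(2025, 7, 31), deadline, renewed=False))  # Status.SUSPENDED
```

The key design point is that `SUSPENDED` is the fall-through case: continuation requires an explicit renewal, while suspension requires nothing.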

Participatory Review Process

Who participates in renewal reviews?

Safety review: Technical validators (academia, industry, government technical staff)

Legitimacy review: Civil society, Indigenous representatives, community members, human rights advocates

Value review: Government decision-makers, donors, academic evaluators, community representatives, Continental Steward

Process design:

Self-assessment (system operators prepare report against review criteria)
Independent evaluation (external reviewers analyze evidence, conduct site visits, interview stakeholders)
Public comment period (30 days; anyone can submit input)
Stakeholder deliberation (review committee considers all evidence and input)
Decision and documentation (publish decision with rationale; if conditional/suspended, publish corrective action plan)
Appeals process (decisions can be appealed to Continental Steward within 30 days)

Transparency: All review materials public (except sensitive security details); decisions published; dissenting opinions included.

Mechanism V: Management Letters and Calibration Discipline

The Management Letter Concept

Origin: Financial auditing—auditors issue “management letter” identifying control weaknesses, risks, and recommendations (separate from formal audit opinion).

Adaptation to disaster risk systems: When a model recalibration, methodology change, or operational adjustment is needed, a formal management letter documents:

Issue identified: What problem/deficiency detected
Evidence: Data showing performance degradation, bias, or failure mode
Root cause analysis: Why issue occurred
Recommended action: Specific changes to address issue
Implementation plan: Who, what, when (with target dates)
Verification: How will we confirm fix worked

Triggers for management letter:

Model performance below thresholds (forecast skill declining)
Systematic bias detected (underpredicting risk for specific locations/populations)
Equity metrics degrading (reach ratios falling for vulnerable groups)
Grievance patterns indicating systemic issue (recurring complaints about same problem)
Security incident revealing vulnerability
Validator dissents raising common concerns

Example – Management Letter for Flood Forecast Model:

Management Letter 2024-Q3-001
Issue: Systematic underestimation of flood peaks in upper Indus basin

Evidence:
- Last 6 flood events: Observed peaks averaged 18% higher than forecast
- Bias particularly pronounced for snowmelt-driven floods (vs rainfall-driven)
- False negative rate 28% (vs target 15%)

Root Cause:
- Snow water equivalent (SWE) estimates derived from MODIS satellite imagery have negative bias in 
  high-elevation areas (>4000m) due to cloud contamination
- Model calibration period (2010-2020) didn't include recent extreme snowfall years

Recommended Action:
1. Integrate SNODAS snow data product (higher accuracy) as supplementary SWE input
2. Recalibrate model using extended period (2005-2023) including recent extremes
3. Implement ensemble approach (combine MODIS + SNODAS) with uncertainty quantification

Implementation Plan:
- Data integration: Complete by 2024-10-30 (Data Engineering team)
- Recalibration: Complete by 2024-11-15 (Hydrology team)
- Validation testing: 2024-11-16 to 2024-11-30 (using hindcast on 2023 events)
- Deployment: 2024-12-01 (pending validator approval)

Verification:
- Monitor forecast skill for 2024-25 winter season
- Target: Bias within ±10%, false negative rate <15%
- Review 2025-03-30; confirm improvement or escalate

Approved by:
[Academia validator signature]
[Government validator signature]

Published: 2024-09-15 (public transparency portal)

Why public:

Accountability: Organizations cannot hide performance problems
Learning: Other jurisdictions facing similar issues can adopt solutions
Trust: Public sees problems identified and addressed, increasing confidence

Recalibration Discipline

Recalibration: Updating model parameters or structure based on new data.

Discipline: Structured process (not ad hoc tweaking) with documentation and oversight.

When to recalibrate:

1. Scheduled (annual baseline):

Even if performance acceptable, refit models annually with latest data
Climate non-stationarity means historical patterns shift
Protects against slow drift

2. Performance-triggered (skill thresholds breached):

If skill scores fall below targets for 2 consecutive seasons → mandatory recalibration
If systematic bias detected (>10% for 3+ events) → immediate recalibration

3. Event-triggered (unprecedented event):

When event occurs outside historical calibration range (e.g., 1-in-100-year event)
Major learning opportunity; incorporate into training data
May reveal model structural deficiencies

4. Methodological advancement (better approaches available):

Research yields improved algorithms, better satellite products, enhanced understanding
Opportunity to adopt frontier techniques
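The performance-triggered rules (item 2 above) can be expressed as a simple check. This is a sketch with hypothetical names and data. It assumes "systematic bias" means at least three events with more than 10% error in the same direction, which is one reading of the rule rather than a stated definition.

```python
from typing import Optional, Sequence


def performance_trigger(seasonal_skill: Sequence[float],
                        skill_target: float,
                        event_bias_pct: Sequence[float]) -> Optional[str]:
    """Return 'immediate', 'mandatory', or None per the performance-triggered rules."""
    # Assumption: bias counts as systematic only when errors share a sign.
    over = sum(1 for b in event_bias_pct if b > 10)
    under = sum(1 for b in event_bias_pct if b < -10)
    if max(over, under) >= 3:
        return "immediate"

    # Skill below target in each of the two most recent seasons.
    last_two = list(seasonal_skill)[-2:]
    if len(last_two) == 2 and all(s < skill_target for s in last_two):
        return "mandatory"

    return None


# Hypothetical inputs: skill scores per season, forecast bias (%) per event.
print(performance_trigger([0.70, 0.55, 0.58], 0.60, [-5, 3]))         # mandatory
print(performance_trigger([0.70, 0.65], 0.60, [-18, -12, -15, 4]))    # immediate
print(performance_trigger([0.70, 0.65], 0.60, [-12, 11, -3]))         # None
```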

Recalibration protocol:

Phase 1 – Analysis (weeks 1-2):

Identify performance degradation or new data availability
Analyze root causes (data issues, model structure, parameter drift, non-stationarity)
Evaluate whether recalibration likely to help (or need redesign)

Phase 2 – Recalibration (weeks 3-6):

Update training data
Refit model parameters (or adopt new model structure)
Tune hyperparameters via cross-validation
Generate new uncertainty estimates

Phase 3 – Validation (weeks 7-8):

Test recalibrated model on held-out data (hindcast recent events)
Compare to old model (is new version actually better?)
Check for unintended consequences (improved overall skill but worse equity?)
Prepare safety case update

Phase 4 – Approval (weeks 9-10):

Submit to validators with comparison report
Validators review methodology, test results, equity impacts
2-of-N signatures required for operational deployment
If concerns, iterate or escalate to Continental Steward

Phase 5 – Deployment (week 11):

Canary deployment (10% of traffic to test in production)
Monitor closely for 7 days
If stable, full deployment
Old model maintained for 30 days as rollback option

Phase 6 – Monitoring (ongoing):

Track performance of recalibrated model
Compare to pre-recalibration baseline
Publish results in quarterly performance report
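Two steps of the protocol lend themselves to a short sketch: the 2-of-N signature check in Phase 4 and the 10% canary split in Phase 5. The function names and validator identifiers below are hypothetical, and hashing a request ID into buckets is one common way to get a stable traffic split, not necessarily the one used in deployment.

```python
import hashlib


def approved(signatures: set, validators: set, threshold: int = 2) -> bool:
    """True when at least `threshold` recognized validators have signed."""
    return len(signatures & validators) >= threshold


def routes_to_canary(request_id: str, canary_percent: int = 10) -> bool:
    """Deterministically send about `canary_percent`% of requests to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


validators = {"academia", "government", "civil_society"}
print(approved({"academia", "government"}, validators))  # True
print(approved({"academia", "unknown_party"}, validators))  # False

sample = [f"forecast-request-{i}" for i in range(10_000)]
canary_share = sum(routes_to_canary(r) for r in sample) / len(sample)
print(f"canary share: {canary_share:.1%}")
```

Because the bucket depends only on the request ID, the same request always reaches the same model version, which keeps the 7-day comparison between old and new models clean.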

Documentation requirements:

Change log (what changed, why, by whom, when)
Reproducibility package (code, data, configuration for new version)
Performance comparison report
Validator approval with signatures
Public announcement (transparency portal)

Avoiding Recalibration Traps

Over-fitting: Recalibrating too aggressively on recent events; model becomes overly specific to recent patterns, loses generalization.

Solution: Cross-validation; hold out recent period; ensure performance on unseen data.

Premature adjustment: Reacting to single poor forecast; noise vs signal confusion.

Solution: Require multiple failures before triggering recalibration; statistical significance tests.
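One simple way to separate noise from signal is a two-sided sign test on the direction of forecast errors. The text does not name a specific test; the sign test is chosen here only for illustration, and the error values are hypothetical.

```python
from math import comb


def sign_test_p_value(errors: list) -> float:
    """Two-sided sign test: are errors consistently positive or negative?"""
    nonzero = [e for e in errors if e != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(1 for e in nonzero if e > 0)
    tail = min(positives, n - positives)
    # Probability of a split at least this lopsided if direction were a coin flip.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# A single under-forecast tells us nothing about systematic bias.
print(sign_test_p_value([0.20]))      # 1.0

# Six under-forecasts in a row are unlikely to be chance.
print(sign_test_p_value([0.18] * 6))  # 0.03125
```

A single poor forecast can never reach significance under this test, which matches the rule of requiring multiple failures before triggering recalibration.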

Complexity creep: Each recalibration adds parameters/features; model becomes unwieldy and opaque.

Solution: Parsimony principle; prefer simpler models; regularly simplify (prune unused features).

Loss of institutional memory: Staff turnover means rationale for model design forgotten; recalibrations made without understanding history.

Solution: Documentation discipline; every recalibration includes “theory of change” explaining model logic; onboarding includes model history review.

Summary: Feedback as Power

Design Principle IV asserts that systems earn trust through structured learning, not asserted authority.

The mechanisms:

Sensors and observability → Make reality visible across scales and domains
KPIs with confidence tiers → Quantify performance; acknowledge uncertainty
Counterfactuals and causal inference → Distinguish correlation from causation; estimate impact honestly
Sunset clauses and review clocks → Force periodic reassessment; prevent institutional senescence
Management letters and recalibration → Respond to feedback systematically; improve continuously

Leadership question: Can citizens, overseers, and investors see what we assumed, observe what occurred, understand what we learned, and verify that we adapted?

If yes: Feedback sovereignty is real. System is falsifiable, therefore credible.

If no: System is a black box. It might be performing well, but it is indistinguishable from a failing system that hides problems. Without feedback sovereignty, trust is impossible.