1.9 Design Principle V — Human–AI Teaming - The Global Centre for Risk and Innovation (GCRI)

Last modified: October 16, 2025

For versions: 1.0.0.1


Transparent Models, Safety Cases, and Rollback Discipline

The Human-AI Complementarity

Why Neither Alone Suffices

Human-only disaster risk management:

Cannot process petabytes of satellite imagery in real-time
Cannot run ensemble forecasts requiring petaflops of computation
Cannot detect subtle patterns in noisy sensor data across thousands of locations
Limited working memory and attention span
Cognitive biases (availability heuristic, anchoring, confirmation bias)
Inconsistent performance under stress, fatigue, time pressure

AI-only disaster risk management:

Cannot incorporate context outside training data (unprecedented events, social dynamics, political constraints)
Cannot make value-laden judgments (who to prioritize, acceptable risk levels, equity trade-offs)
Cannot explain decisions in ways that build public trust
Cannot recognize when assumptions fail catastrophically (out-of-distribution inputs)
Cannot adapt to novel situations requiring creativity, improvisation, ethical reasoning
No accountability—AI cannot be held responsible for decisions

Human-AI teaming approach: Leverage AI’s computational scale and pattern recognition; leverage human judgment, contextual understanding, and ethical reasoning.

Division of labor:

AI handles: Data processing, pattern detection, scenario simulation, routine forecasting, anomaly flagging, optimization under constraints
Humans handle: Final decisions on action, value judgments, novel situation interpretation, stakeholder communication, strategic planning, accountability

Core claim: AI extends perception and foresight; humans adjudicate values and hold authority.

Failure Modes When Teaming Goes Wrong

Over-automation: Humans defer excessively to AI; lose situational awareness; cannot override when AI fails.

Example – Air France 447 crash (2009): Autopilot disconnected after pitot-tube icing produced unreliable airspeed readings; the crew, with little recent practice in manual high-altitude flying, made incorrect control inputs that led to a stall and crash even though the situation was recoverable. Over-reliance on automation had eroded manual flying skills.

Under-trust: Humans ignore AI recommendations even when correct; system value unrealized.

Example – Collision avoidance systems: Pilots sometimes disable TCAS (Traffic Collision Avoidance System) warnings during approach due to false alarm history, missing genuine collision risks.

Opacity: Humans cannot understand AI reasoning; cannot verify, cannot learn, cannot improve.

Example – Medical diagnosis AI: Black-box deep learning models achieve high accuracy, but doctors cannot explain to patients why a recommendation was made and cannot identify when model reasoning is spurious (e.g., learning hospital-specific artifacts rather than disease patterns).

Accountability gaps: When human-AI team makes wrong decision, unclear who bears responsibility.

Example – Predictive policing: Algorithm recommends patrol locations; crime occurs elsewhere; the algorithm's developers say it was “just providing data”; police say they were “following algorithm.” Neither is held accountable.

Design Principles for Effective Teaming

1. Complementarity: Allocate tasks to actor with comparative advantage
2. Transparency: Humans understand AI reasoning sufficiently to verify and override
3. Human authority: AI advises; humans decide and are accountable
4. Explainability: AI provides interpretable rationale commensurate with decision stakes
5. Controllability: Humans can inspect, override, pause, rollback AI systems
6. Continuous improvement: Both AI and human skills develop through teaming

Mechanism I: Model Cards and Technical Documentation

The Model Card Framework

Origin: Google Research (Mitchell et al. 2019) – standardized documentation for ML models, analogous to nutritional labels on food or datasheets for hardware components.

Purpose: Transparent disclosure of model capabilities, limitations, training data, intended use, ethical considerations—enabling informed deployment decisions.

GCRI Model Card Structure (10 sections):

1. Model Identity and Version

Model Name: GloFAS-GCRI Flood Forecast v4.2
Model Type: Ensemble hydrological forecasting (LISFLOOD + GR4J + HYPE)
Version: 4.2.1
Release Date: 2024-08-15
Deprecation Date: 2027-08-15 (3-year operational lifetime)
Maintainer: GCRI Hydrology Team + European Commission JRC
License: CC BY 4.0
DOI: 10.5281/zenodo.12345678

2. Intended Use and Scope

Primary use: 7-14 day probabilistic flood forecasting for major river basins globally to trigger anticipatory action

In-scope applications:

Large river basins (>10,000 km²) with historical gauge data
Lead times 3-14 days
Advisory outputs for human decision-makers
Probabilistic ensemble forecasts (not deterministic)

Out-of-scope / inappropriate uses:

Flash flood forecasting (<6 hour lead time)—use nowcasting systems instead
Small ungauged catchments (<1,000 km²)—model not calibrated for these scales
Urban drainage systems—infrastructure not represented in model
Autonomous triggering of financial instruments without human review—model uncertainty requires human judgment
Legal evidence—forecasts are advisory, not forensic-grade deterministic predictions

Geographic coverage: Global with best performance in basins with >10 years calibration data

Temporal resolution: 6-hourly forecasts; daily re-initialization

3. Training Data and Preprocessing

Data sources:

Meteorological forcing: ECMWF ERA5 reanalysis (1979-2023); ECMWF ensemble forecasts (2023-present)
River discharge observations: 3,247 gauge stations via Global Runoff Data Centre; national meteorological services
Digital elevation: MERIT Hydro 90m DEM
Land cover: ESA CCI Land Cover 300m
Soil properties: SoilGrids 250m

Training period: 1990-2020 (calibration); 2021-2023 (validation)

Preprocessing:

Bias correction: Quantile mapping of precipitation forecasts to ERA5 reanalysis
Gap filling: Linear interpolation for <3 consecutive missing observations; longer gaps excluded
Quality control: Physically implausible values (negative discharge, extreme outliers) flagged and excluded
Spatial aggregation: Point observations aggregated to model grid cells (0.1°) using area weighting
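
The gap-filling rule above (interpolate fewer than 3 consecutive missing observations, exclude longer gaps) can be sketched as follows. This is a minimal illustration, not the GCRI pipeline: the function name `fill_short_gaps`, the use of `None` for missing values, and the choice to leave gaps at the series edges unfilled are all assumptions.

```python
from typing import List, Optional

def fill_short_gaps(series: List[Optional[float]], max_gap: int = 2) -> List[Optional[float]]:
    """Linearly interpolate runs of at most `max_gap` missing values.

    Longer runs, and runs touching either end of the series, stay None
    (assumed behaviour; the text only says longer gaps are excluded).
    """
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                          # first missing index of this run
        while i < n and filled[i] is None:
            i += 1
        end = i                            # first index after the run
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue                       # no neighbour on one side, or gap too long
        left, right = filled[start - 1], filled[end]
        for k in range(start, end):
            frac = (k - start + 1) / (gap + 1)
            filled[k] = left + frac * (right - left)
    return filled

# Hypothetical discharge series: one 2-step gap (filled) and one 3-step gap (kept)
discharge = [100.0, None, None, 130.0, None, None, None, 170.0]
print(fill_short_gaps(discharge))
```

With `max_gap=2` the first gap is interpolated and the three-value gap is left missing, matching the "<3 consecutive" rule.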

Dataset statistics:

Total gauge-years: 38,450
Missing data: 8.3% (varies by region; Africa 15%, Europe 4%)
Extreme events captured: 1,247 floods exceeding 20-year return period

4. Model Architecture and Methodology

Ensemble composition: Three hydrological models with different structural assumptions:

LISFLOOD (spatially distributed, grid-based)
GR4J (conceptual, parsimonious)
HYPE (semi-distributed, catchment-based)

Ensemble generation:

51 meteorological ensemble members (ECMWF)
3 hydrological models
Total: 153 forecast traces per initialization
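
As a rough sketch of how the 153 traces arise and how they could be turned into a probability, the snippet below pairs each meteorological member with each hydrological model and counts threshold exceedances. Equal weighting of all traces, the toy peak values, and the function name `exceedance_probability` are assumptions for illustration only.

```python
from itertools import product

MET_MEMBERS = range(51)                          # ECMWF ensemble member indices
HYDRO_MODELS = ("LISFLOOD", "GR4J", "HYPE")

# One trace per (meteorological member, hydrological model) pair
traces = list(product(MET_MEMBERS, HYDRO_MODELS))
print(len(traces))                               # 51 * 3 pairs

def exceedance_probability(peak_by_trace, threshold):
    """Fraction of traces whose peak discharge exceeds the threshold.

    Assumes every trace carries equal weight.
    """
    peaks = list(peak_by_trace.values())
    return sum(p > threshold for p in peaks) / len(peaks)

# Toy peak discharges (m^3/s), invented for the example
toy_peaks = {(m, h): 1000.0 + 10.0 * m for (m, h) in traces}
prob = exceedance_probability(toy_peaks, threshold=1250.0)
print(f"{prob:.2f}")
```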

Physics: Surface runoff, infiltration, evapotranspiration, subsurface flow, channel routing, reservoir operations (where data available)

Calibration: Automated via SCE-UA global optimization; objective function combines Nash-Sutcliffe Efficiency and logarithmic NSE (balance high/low flow performance)
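
A minimal sketch of an objective that combines NSE with logarithmic NSE is shown below. The text does not give the combination rule, so the equal weighting (`weight=0.5`) and the small offset `eps` that guards against log of zero are assumptions; the SCE-UA optimizer itself is not shown.

```python
import math
from typing import Sequence

def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def log_nse(obs: Sequence[float], sim: Sequence[float], eps: float = 1e-6) -> float:
    """NSE on log-transformed flows, which emphasises low-flow errors."""
    return nse([math.log(o + eps) for o in obs],
               [math.log(s + eps) for s in sim])

def calibration_objective(obs, sim, weight: float = 0.5) -> float:
    """Weighted blend of NSE (high flows) and log NSE (low flows); higher is better."""
    return weight * nse(obs, sim) + (1.0 - weight) * log_nse(obs, sim)

observed = [12.0, 15.0, 40.0, 220.0, 90.0, 30.0]    # toy discharge values
simulated = [10.0, 16.0, 55.0, 180.0, 100.0, 28.0]
print(round(calibration_objective(observed, simulated), 3))
```

An optimizer would search parameter sets to maximise this value; the log term keeps a few large flood peaks from dominating the fit.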

Uncertainty quantification: Ensemble spread represents meteorological and model structure uncertainty; does not include parameter uncertainty (computationally prohibitive at global scale)

5. Performance Metrics and Limitations

Skill scores (7-day lead time, validation period 2021-2023):

Global averages:

Probability of Detection (POD): 0.82
False Alarm Ratio (FAR): 0.26
Critical Success Index (CSI): 0.62
Brier Skill Score: 0.48
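
For reference, the scores above follow from a standard contingency table of forecast and observed events. The sketch below uses the usual definitions, with FAR taken as false alarms divided by all alerts issued. The counts are invented for illustration and are not the validation data; averages of per-basin scores need not satisfy the identities of a single pooled table.

```python
def skill_scores(hits: int, misses: int, false_alarms: int) -> dict:
    """Categorical verification scores from event counts."""
    return {
        "POD": hits / (hits + misses),                  # share of events detected
        "FAR": false_alarms / (hits + false_alarms),    # share of alerts that were false
        "CSI": hits / (hits + misses + false_alarms),   # hits over all events and alerts
    }

def brier_skill_score(probs, outcomes, base_rate: float) -> float:
    """1 - BS/BS_ref, with a constant climatological forecast as reference."""
    n = len(probs)
    bs = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    bs_ref = sum((base_rate - o) ** 2 for o in outcomes) / n
    return 1.0 - bs / bs_ref

scores = skill_scores(hits=82, misses=18, false_alarms=29)   # hypothetical counts
print({name: round(value, 2) for name, value in scores.items()})
```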

Regional variation:

Best: Europe (CSI 0.71), North America (CSI 0.68)
Moderate: South America (CSI 0.58), Asia (CSI 0.56)
Challenging: Africa (CSI 0.47)—limited gauge data for calibration

Known limitations:

Snowmelt floods: Peak magnitude underpredicted by ~15% in high-elevation catchments (>3,000 m) due to sparse snow observations
Groundwater-fed systems: Limited representation of groundwater; performs poorly in karst or highly permeable geology
Dam operations: Reservoir release rules often unavailable; assumes historical operating policies continue
Urban areas: Storm drainage not modeled; flood depths overestimated in cities with sewer infrastructure
Glacial lake outburst floods (GLOFs): Not represented; requires separate specialized models
Compounding factors: Model focused on hydrological flooding; does not account for storm surge interaction in coastal zones or landslide dam failures

Failure modes:

Precipitation bust: When meteorological forecast completely misses storm, flood forecast fails (5-10% of events)
Convective storms: Small-scale intense rainfall (<50km) underresolved in global models
Ice jams: River ice breakup floods not represented
Flash floods: Beyond temporal resolution (6-hour) and spatial scale

6. Fairness and Bias Assessment

Geographic bias testing:

Performance stratified by GDP per capita: No significant bias detected (low vs high-income countries similar CSI after accounting for gauge density)
Gauge density correlation: Performance degrades with <0.05 gauges per 1,000 km² (primarily affects Central Africa, parts of Amazon)

Demographic disaggregation: Model outputs (flood extent) intersected with population demographics:

False negative rate (missed floods) slightly higher in informal settlements than in formal urban areas (12% vs 8%), likely due to coarse elevation data missing micro-topography
No systematic bias by ethnicity or income quintile after controlling for settlement type

Corrective actions:

Targeted gauge network densification in underperforming regions (Africa, parts of Asia)
Ultra-high-resolution DEM (~5m) acquisition for urban informal settlements via drone/satellite
Community-based flood extent validation in areas with weak observational networks

7. Environmental and Computational Costs

Carbon footprint:

Model training (calibration over all basins): ~450 kg CO₂eq (3,200 GPU-hours on NVIDIA A100)
Operational forecasting: ~1.2 kg CO₂eq per model run × 2 runs per day = ~880 kg CO₂eq annually
Total annual footprint: ~1,330 kg CO₂eq
Offset: 100% via renewable energy purchasing agreements

Computational requirements:

Training: 3,200 GPU-hours (one-time per version); 12,000 CPU-hours
Inference: 120 CPU-hours per forecast run (parallelized across 256 cores; wall-clock time 30 minutes)

Data storage: 850 TB archived forecasts (2020-present); 45 TB current operational data

8. Ethical Considerations and Societal Impact

Dual-use concerns: Flood forecasts could theoretically be misused (e.g., adversary timing attacks during predicted flooding). Mitigation: Forecasts publicly available—no informational advantage to restrict access.

Automation risk: Risk of over-reliance; humans might defer to forecasts without applying contextual judgment. Mitigation: Forecasts labeled “advisory only”; human authorization required for all actions; validation by multiple independent nodes.

Equity impacts: Early warning access patterns may reproduce existing inequalities. Mitigation: Equity metrics tracked (Section 1.7); proactive outreach to marginalized populations; accessible communication formats.

Privacy: Flood extent mapping could reveal household-level vulnerabilities. Mitigation: Public outputs aggregated to >100 household clusters; individual property data not published.

Accountability: If forecast fails and harm results, who is responsible? Mitigation: Clear documentation that forecasts are probabilistic and advisory; human decision-makers accountable for actions; forecast performance publicly tracked.

9. Maintenance and Update Schedule

Monitoring: Continuous performance tracking; skill scores computed for each event; automatic alerts if performance degrades >10% for 3 consecutive months

Recalibration: Annual (Section 1.8) with latest observations; triggered calibration if systematic bias detected

Model updates: Major version every 2-3 years incorporating methodological advances; minor versions quarterly for bug fixes and data source updates

Deprecation: Model retired after 3 years of operation or when replaced by a superior version; 6-month transition period with both versions running in parallel

10. Validation and Approval

Validators:

Academia: European Commission Joint Research Centre (Dr. X, hydrologist)
Government: Bangladesh Meteorological Department (Dr. Y, flood forecasting lead)
Signatures: [Cryptographic signatures]
Date: 2024-08-15

Safety case reference: SC-2024-0815-GloFAS (see Section 1.9 Mechanism II)

Public availability: Full model card at [transparency portal URL]; reproducibility package (code, sample data, configuration) at [repository URL]

Enforcement of Model Card Requirements

NVM validation gate: Models cannot enter operational deployment without approved model card signed by 2+ validators from different sectors.
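
The gate rule (2+ validators from different sectors) could be checked along these lines. The `Signature` record and sector labels are hypothetical, and cryptographic verification of the signatures is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class Signature:
    validator: str   # hypothetical validator identifier
    sector: str      # e.g. "academia", "government", "civil_society"

def passes_validation_gate(signatures: Iterable[Signature],
                           min_validators: int = 2,
                           min_sectors: int = 2) -> bool:
    """True if enough distinct validators from enough distinct sectors signed."""
    signatures = list(signatures)
    validators = {s.validator for s in signatures}
    sectors = {s.sector for s in signatures}
    return len(validators) >= min_validators and len(sectors) >= min_sectors

same_sector = [Signature("val-a", "academia"), Signature("val-b", "academia")]
mixed = [Signature("val-a", "academia"), Signature("val-c", "government")]
print(passes_validation_gate(same_sector), passes_validation_gate(mixed))
```

Counting distinct validators as well as distinct sectors prevents one person signing twice under different sector labels from satisfying the rule.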

Penalties for incomplete disclosure: If model deployed without compliant card, automatic suspension; validators who signed can be temporarily removed from duties pending investigation.

Continuous updates: Model cards are living documents, updated when the model is modified, new limitations are discovered, or fairness issues are identified.

Mechanism II: Safety Cases (Structured Assurance Arguments)

Goal Structuring Notation (GSN)

Safety case: Structured argument, supported by evidence, that system is acceptably safe for specific use.

Origin: High-reliability industries (aviation, nuclear, medical devices) where failures have catastrophic consequences.

Goal Structuring Notation: Graphical representation of safety argument showing claims, strategies, evidence, context.

Elements:

Goal (rectangle): Claim about system safety (e.g., “Model X is sufficiently accurate for triggering anticipatory finance”)
Strategy (parallelogram): Approach to demonstrate goal (e.g., “Argument by validation on independent test set”)
Solution (circle): Evidence supporting claim (e.g., “Validation report showing CSI >0.60”)
Context (rounded rectangle): Conditions under which claim holds (e.g., “For river basins >10,000 km² with calibration data”)
Assumption (oval marked “A”): Unproven claims argument depends on (e.g., “Historical climate patterns remain relevant”)
Justification (oval marked “J”): Rationale for strategy choice

Relationships:

Supported by: Goal → Strategy → Sub-goals → Evidence
In context of: Claims apply under specific conditions
Assumptions: Argument validity depends on these holding
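
One way to make such an argument machine-checkable is to store it as a tree and flag leaf goals that have no evidence attached. The sketch below is an assumed, simplified data model (one `Goal` class carrying its own strategy, context, assumptions and evidence), not a full GSN implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Goal:
    claim: str
    strategy: str = ""                                     # how sub-goals support the claim
    context: str = ""                                      # conditions under which the claim holds
    assumptions: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)      # GSN "solutions"
    sub_goals: List["Goal"] = field(default_factory=list)

def unsupported_goals(goal: Goal) -> List[str]:
    """Return claims of leaf goals that have no evidence attached."""
    if not goal.sub_goals:
        return [] if goal.evidence else [goal.claim]
    found: List[str] = []
    for sub in goal.sub_goals:
        found.extend(unsupported_goals(sub))
    return found

top = Goal(
    claim="Model is sufficiently safe for operational use",
    strategy="Argue over forecast skill and rollback capability",
    sub_goals=[
        Goal(claim="Forecast skill adequate",
             context="River basins >10,000 km²; 7-day lead time",
             assumptions=["Validation period is representative"],
             evidence=["Validation report"]),
        Goal(claim="Rollback available"),                  # no evidence attached yet
    ],
)
gaps = unsupported_goals(top)
print(gaps)
```

A review step could refuse approval while this list is non-empty.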

Example GCRI Safety Case Structure

Top-level goal: GloFAS-GCRI v4.2 is sufficiently safe for operational use in triggering anticipatory action financing for 7-day riverine flood forecasts

Strategy: Argue safety through:

Validation of forecast skill
Demonstration of equity (no systematic bias)
Verification of operational reliability
Establishment of rollback capability
Evidence of appropriate human oversight

Sub-goal 1: Forecast skill adequate

Context: Large river basins (>10,000 km²); 7-day lead time; probabilistic forecasts

Claim: Model achieves POD ≥0.80, FAR ≤0.30, CSI ≥0.60 on independent test set

Evidence:

Validation report (2021-2023): POD 0.82, FAR 0.26, CSI 0.62 ✓
Peer review by academia validation node: Approved
Benchmark comparison: Performance exceeds global flood forecasting standards (WMO guidelines)

Assumption: 2021-2023 validation period representative of future performance (climate stationarity holds for 3-5 year horizon)

Sub-goal 2: Equity maintained

Claim: Model does not systematically disadvantage marginalized populations

Evidence:

Fairness audit: No significant difference in false negative rates across income quintiles (after controlling for gauge density)
Sensitivity analysis: Simulated performance if gauge network expanded in underserved regions → minimal performance gain in wealthy regions, substantial gain in underserved
Community validation: Focus groups in 12 underserved communities reviewed forecast accessibility and appropriateness; concerns addressed in design

Limitation: Geographic bias remains in Central Africa due to gauge scarcity; mitigation strategy in place (network densification underway)

Sub-goal 3: Operational reliability

Claim: Model runs successfully ≥99.5% of scheduled executions; failures detected and mitigated within 4 hours

Evidence:

Historical uptime (past 12 months): 99.7%
Redundancy: Multi-region deployment with automatic failover
Monitoring: Real-time health checks; alerts to on-call team within 5 minutes of failure
Incident log: 3 failures in past year; all resolved within 2 hours; no forecast windows missed

Sub-goal 4: Rollback available

Claim: If model produces erroneous forecasts, can revert to previous version within 1 hour

Evidence:

Version control: Previous stable version (v4.1) maintained in warm standby
Rollback procedure: Documented and tested quarterly; mean rollback time 28 minutes
Trigger criteria: Defined thresholds for initiating rollback (e.g., ensemble spread >2× historical, forecast contradicts observations from multiple gauges)
Authority: Any validator can demand rollback with documented justification; automatic rollback if 2+ validators flag critical error

Sub-goal 5: Human oversight

Claim: Human decision-makers review forecasts and maintain authority to override before triggering actions

Evidence:

Workflow architecture: Forecasts labeled “advisory”; trigger playbooks require human signature
Training records: 147 government officials trained on forecast interpretation, limitations, override procedures
Override log: 23 instances in past year where officials overrode model recommendation; 18 overrides justified (local information model lacked); 5 overrides unwarranted (official error); feedback incorporated into training

Overall assessment: Sub-goals 1-5 satisfied with documented evidence; minor limitations acknowledged (gauge density, climate stationarity assumption); residual risks acceptable given benefits; approved for operational use with annual re-evaluation.

Validator signatures: [2-of-N requirement met]

Public availability: Full safety case document (50 pages) published on transparency portal

Safety Case Review and Updates

Initial approval: Before operational deployment; thorough review by validators (typically 4-6 weeks)

Annual review: Refresh evidence (update skill scores with latest events, check assumptions still valid)

Triggered review: If significant failure occurs (forecast bust, equity violation, security breach); safety case must be updated or model suspended

Continuous evidence: New evidence (every forecast is data point) continuously accumulated; safety case documents evolve incrementally

Mechanism III: Explainability and Interpretability

The Interpretability Spectrum

Intrinsically interpretable models: Structure itself provides explanation

Linear regression: Coefficients show feature importance and direction
Decision trees: Path from root to leaf is human-readable rule
Rule-based systems: Explicit if-then logic

Trade-off: Often lower accuracy than complex models

Post-hoc explainability: Complex black-box model with added explanation layer

Neural networks + SHAP/LIME: Approximate feature importance for specific predictions
Ensemble methods + feature importance: Which inputs mattered most

Trade-off: Explanations may be approximate or misleading; model and explanation could diverge

GCRI approach: Match interpretability to decision stakes

Low stakes, routine decisions: Complex models acceptable with post-hoc explanations
High stakes, novel situations: Prefer intrinsically interpretable models or require detailed explanations

Explainability Methods

1. Feature importance (global explanations)

Question: Across all predictions, which input features matter most?

Methods:

Permutation importance: Randomly shuffle feature; measure performance drop
SHAP (SHapley Additive exPlanations): Game-theory-based unified approach to feature importance
Partial dependence plots: How does prediction change as feature varies (holding others constant)?
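
Permutation importance, the first method listed, is simple enough to sketch without any library. The toy model, the data, and the use of mean squared error as the performance measure are assumptions for illustration; a real forecast model would be scored with its own skill metric.

```python
import random
from typing import Callable, List

def mean_squared_error(y_true: List[float], y_pred: List[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(model: Callable[[List[float]], float],
                           X: List[List[float]], y: List[float],
                           n_repeats: int = 20, seed: int = 0) -> List[float]:
    """Mean increase in error when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [model(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)                    # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            error = mean_squared_error(y, [model(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row: List[float]) -> float:
    return 3.0 * row[0]                            # uses feature 0 and ignores feature 1

X = [[float(i), float((i * 7) % 8)] for i in range(8)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

The ignored feature gets an importance of zero, while shuffling the feature the model depends on raises the error.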

Example – Flood forecast model:

Feature importance (contribution to forecast skill):
1. Upstream precipitation (7-day accumulation): 38%
2. Antecedent soil moisture: 22%
3. River discharge (current): 18%
4. Snowpack water equivalent: 12%
5. Topography (basin slope): 6%
6. Land cover: 4%

Interpretation: Recent rainfall and soil saturation drive forecasts; 
current river conditions provide nowcast baseline; snow important in 
mountainous regions but less globally.

2. Local explanations (individual predictions)

Question: Why did model predict X for this specific case?

SHAP force plots: Show how each feature pushes prediction higher or lower from baseline

Example:

Location: Brahmaputra River at Bahadurabad, 2024-07-15
Base prediction (climatology): 15% flood probability
Actual prediction: 78% flood probability

Feature contributions:
+ Upstream precip (350mm in 7 days): +45%
+ Soil already saturated (95th percentile): +12%
+ Snowmelt surge (higher than usual): +8%
- Below-normal river level currently: -2%

Explanation: Heavy upstream rainfall is dominant factor; soil saturation 
means little infiltration capacity; snowmelt adds to runoff; current 
level slightly below normal provides minor buffer.
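
The defining property of a SHAP-style local explanation is additivity: baseline plus contributions reproduces the prediction. The check below uses the numbers from the example above; treating contributions as additive in percentage points follows the example, whereas in practice they are often computed in log-odds space. The dictionary keys are shortened labels.

```python
BASE_PROBABILITY = 15.0          # climatological baseline, percent

contributions = {                # percentage-point contributions from the example
    "upstream_precip": 45.0,
    "soil_saturation": 12.0,
    "snowmelt_surge": 8.0,
    "river_level_below_normal": -2.0,
}

def reconstruct(base: float, contribs: dict) -> float:
    """Additivity: prediction = baseline + sum of feature contributions."""
    return base + sum(contribs.values())

def ranked(contribs: dict) -> list:
    """Features ordered by absolute contribution, largest first."""
    return sorted(contribs, key=lambda name: abs(contribs[name]), reverse=True)

prediction = reconstruct(BASE_PROBABILITY, contributions)
print(prediction, ranked(contributions)[0])
```

If the reconstructed value and the model output disagree, the explanation and the model have diverged and the explanation should not be shown to decision-makers.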

3. Counterfactual explanations

Question: What would need to change for model to predict differently?

Example:

Current prediction: 78% flood probability → Trigger anticipatory action

Counterfactuals (what would reduce probability to <60%, below trigger threshold?):
- If upstream rainfall were 220mm instead of 350mm (37% reduction)
- If soil moisture were 60th percentile instead of 95th (drier soil, more infiltration)
- If snowmelt were delayed by 5 days (peak runoff shifts outside forecast window)

Use: Helps decision-makers understand forecast sensitivity and what 
conditions to monitor for forecast evolution.

4. Uncertainty decomposition

Question: Where does uncertainty come from?

Example:

Forecast: 65% probability of flooding [90% CI: 45-82%]

Uncertainty sources:
- Meteorological forecast uncertainty (precipitation): ±25 percentage points
- Hydrological model structure (LISFLOOD vs GR4J vs HYPE differ): ±12 pp
- Initial conditions (soil moisture estimation error): ±8 pp
- Observation error (current river level measurement): ±3 pp

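Simple sum of the four contributions: 48 pp. The combined uncertainty is smaller than this sum because the sources are not perfectly correlated; the stated 90% interval (45-82%) spans 37 pp.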

Interpretation: Dominant uncertainty is weather; even perfect hydrological 
model wouldn't narrow forecast range much. Focus improvement efforts on 
ensemble meteorological forecasting.
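
To illustrate why component uncertainties do not simply add, the sketch below combines the four contributions listed above in quadrature. Independence of the sources is an assumption made here, not something the example states, and the result is not expected to reproduce the quoted interval exactly.

```python
import math

components_pp = {                     # half-widths in percentage points, from the example
    "meteorological": 25.0,
    "model_structure": 12.0,
    "initial_conditions": 8.0,
    "observation": 3.0,
}

linear_sum = sum(components_pp.values())
# Root-sum-of-squares: valid only if the sources are independent (assumed)
quadrature = math.sqrt(sum(v * v for v in components_pp.values()))

# Share of total variance attributable to each source under that assumption
variance_share = {name: v * v / quadrature ** 2 for name, v in components_pp.items()}

print(f"linear sum: {linear_sum:.0f} pp, quadrature: {quadrature:.1f} pp")
print(f"meteorological share of variance: {variance_share['meteorological']:.0%}")
```

Under this assumption the meteorological term accounts for most of the variance, which is consistent with the interpretation that weather uncertainty dominates.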

Explanation Depth Calibrated to Audience

Executive brief: One-paragraph summary with key drivers

“Forecast shows 75% flood probability due to heavy upstream rainfall (350mm in one week) on already saturated soil. Confidence is moderate; weather uncertainty is main factor.”

Technical analyst: Detailed feature importance, counterfactuals, uncertainty decomposition (as above)

Validator: Full methodological documentation, reproducibility package, sensitivity analysis, bias testing, comparison to alternative models

Affected community: Plain language with visuals

Map showing rainfall, soil moisture, river levels
Simple explanation: “River will likely flood because of very heavy rain and wet ground. We are 75% confident.”
Actions: “Evacuate low-lying areas; move to shelters; follow local official instructions.”

Mechanism IV: Human-in-the-Loop Architecture

Levels of Automation (Sheridan & Verplank, 1978)

Level 1: Computer offers no assistance; human does all
Level 2: Computer suggests alternatives
Level 3: Computer suggests one alternative
Level 4: Computer suggests one alternative and executes if human approves
Level 5: Computer suggests one alternative, executes, and informs human
Level 6: Computer executes and informs human only if asked
Level 7: Computer decides and executes autonomously; informs human after
Level 8: Computer decides and executes autonomously; informs human only if it decides human should know
Level 9: Computer decides and executes autonomously; ignores human
Level 10: Computer decides and executes autonomously; human cannot override

GCRI policy: Maximum automation level for disaster risk systems is Level 5 for routine forecasts; Level 4 for any forecast triggering financing or public alerts.

Rationale: Humans must remain “in the loop” (Level 4-5) not just “on the loop” (Level 6-7) for decisions affecting lives and resources. Accountability requires human decision-making authority.
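
A deployment configuration could enforce this ceiling with a simple lookup. The decision-type keys below are hypothetical names for the two categories in the policy; unknown decision types are rejected rather than given a default.

```python
# Maximum permitted automation level per decision type, per the policy above.
# The key names are illustrative assumptions.
MAX_AUTOMATION_LEVEL = {
    "routine_forecast": 5,
    "financing_or_public_alert": 4,
}

def is_permitted(decision_type: str, automation_level: int) -> bool:
    """True if a component running at this level is allowed for the decision type."""
    if not 1 <= automation_level <= 10:
        raise ValueError("automation level must be between 1 and 10")
    if decision_type not in MAX_AUTOMATION_LEVEL:
        raise KeyError(f"no policy defined for {decision_type!r}")
    return automation_level <= MAX_AUTOMATION_LEVEL[decision_type]

print(is_permitted("routine_forecast", 5))             # at the ceiling
print(is_permitted("financing_or_public_alert", 5))    # above the ceiling
```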

Decision Points Requiring Human Involvement

1. Forecast verification before dissemination

Automated: Model runs, generates ensemble forecasts, computes probabilities, flags exceedances

Human review (required before publication):

Do forecasts pass basic sanity checks? (physically plausible, consistent with observations)
Is the ensemble spread reasonable, or is one outlier distorting the mean?
Does forecast contradict local/traditional knowledge?
Are there contextual factors model cannot see? (recent dam construction, land use changes)

Authority: National validation nodes (minimum 2 signatures from different sectors)

SLA: 24 hours for routine; 4 hours for urgent

Override: Humans can adjust forecast probabilities (within bounds ±20%), suppress publication if fundamentally flawed, or escalate to Continental Steward for expert review

2. Trigger activation

Automated: Compare forecast probability to playbook trigger threshold; if exceeded, flag for human review

Human decision (required before activation):

Confirm forecast is credible and applies to intended geography
Verify authority, budget, resources exist to execute playbook
Assess whether context requires modification (e.g., ongoing conflict limits evacuation routes)
Authorize expenditure and initiate operations

Authority: Government officials per playbook RACI matrix (Section 1.5)

Documentation: Log decision rationale; signature; timestamp

No automatic triggering: Even if forecast is 100% confident, human must authorize action. AI advises; humans decide.
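
The "no automatic triggering" rule can be expressed as a gate in which exceeding the threshold only flags the forecast, and activation requires a recorded human authorization. The record fields and state names below are assumptions modelled on the documentation requirements (rationale, signature, timestamp).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Authorization:
    official: str            # who signed (hypothetical identifier)
    rationale: str           # logged decision rationale
    timestamp: datetime

def evaluate_trigger(probability: float, threshold: float,
                     authorization: Optional[Authorization]) -> str:
    """Return the trigger state; the model alone can never reach 'activated'."""
    if probability <= threshold:         # threshold not exceeded
        return "no_action"
    if authorization is None:
        return "flagged_for_human_review"
    return "activated"

signed = Authorization("official-017", "Forecast credible; routes open",
                       datetime.now(timezone.utc))
print(evaluate_trigger(1.00, 0.60, None))      # even full confidence only flags
print(evaluate_trigger(0.78, 0.60, signed))
```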

3. Equity and prioritization

Automated: Vulnerability algorithms compute risk scores, recommend beneficiary targeting

Human review (required before finalizing lists):

Are vulnerable populations included? (spot-check algorithm outputs)
Are there marginalized groups algorithm might miss? (refugees, undocumented, nomadic populations)
Does local knowledge suggest adjustments?
Are community representatives satisfied with targeting?

Authority: Mixed—civil society validation nodes review; community committees validate locally

Appeals: Built-in process for households to contest targeting

4. Model updates and recalibration

Automated: Performance monitoring, anomaly detection, suggestions for when recalibration is needed

Human decision (required before deploying updated model):

Review recalibration methodology and test results
Confirm updated model actually improves performance (not just fitting noise)
Check for unintended consequences (equity impacts, new failure modes)
Approve deployment or request modifications

Authority: Validators (2-of-N from different sectors)

Rollback authority: Any validator can demand rollback if updated model performs poorly

Operator Training and Decision Support

Training curriculum (for humans working with AI systems):

Module 1: How models work (conceptual understanding, not math)

What inputs does model use?
How does model generate predictions? (general process)
What can model do well? What are limitations?
When does model fail? What do failure modes look like?

Module 2: Interpreting outputs

Probabilistic forecasts (what does “70% probability” mean?)
Uncertainty quantification (confidence intervals, ensemble spread)
Maps and visualizations (how to read forecast products)
Feature importance (what drove this forecast?)

Module 3: When to trust, when to override

Indicators of trustworthy forecasts (consistent with observations, reasonable ensemble spread)
Red flags suggesting model error (forecast contradicts multiple sensors, physically implausible, massive ensemble disagreement)
Case studies: Successful overrides (human local knowledge added value); unjustified overrides (official disbelieved accurate forecast)

Module 4: Decision-making under uncertainty

Risk vs uncertainty
Expected value calculations (probability × impact)
Precautionary principle (when to act despite uncertainty)
Scenarios and contingency planning

Module 5: Equity and accountability

Recognizing algorithmic bias
Ensuring vulnerable populations reached
Documentation and justification of decisions
Grievance mechanisms

Certification: Operators must pass assessment demonstrating competency; recertification every 2 years

Decision aids (not automation, but support):

Checklists (forcing consideration of key factors)
Decision trees (structured reasoning)
Pre-mortems (imagine failure; reason backward to causes)
Red teams (devil’s advocate challenges plan)

Mechanism V: Error Budgets and Drift Monitoring

Error Budget Concept

Origin: Site Reliability Engineering (Google)—explicitly allocate acceptable failure rate; spend “budget” on innovation vs. reliability.

Application to forecasting: Define acceptable error rate; monitor consumption; throttle model updates if error budget depleted.

GCRI error budget framework:

Definition: Maximum acceptable failure rate over rolling 12-month window

Example – Flood forecasts:

Error budget: 20% false alarms acceptable (FAR ≤0.20)

Current status (past 12 months):
- 147 flood alerts issued
- 32 alerts did not materialize (false alarms)
- FAR = 32/147 = 0.218 (21.8%)

Status: OVER BUDGET by 1.8 percentage points
Action: Halt non-critical model updates; focus on reducing false alarms; 
investigate root causes (e.g., are certain regions/conditions prone to false alarms?)

Error budget policy:

Under budget: Can pursue aggressive improvements (new features, algorithm changes)
At budget: Maintain current approach; only low-risk updates
Over budget: Freeze feature additions; focus on reliability; conduct root cause analysis; may need to recalibrate conservatively (accept more missed events to reduce false alarms)
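
The budget check and the three policy states can be sketched as below, using the counts from the flood forecast example. The tolerance band that defines "at budget" is an assumption, since the policy does not say how close to the limit counts as "at".

```python
def false_alarm_ratio(false_alarms: int, alerts_issued: int) -> float:
    return false_alarms / alerts_issued

def budget_status(observed: float, budget: float, tolerance: float = 0.01) -> str:
    """Classify error budget consumption.

    `tolerance` (assumed) is the band around the budget treated as 'at budget'.
    """
    if observed > budget + tolerance:
        return "over"
    if observed < budget - tolerance:
        return "under"
    return "at"

far = false_alarm_ratio(false_alarms=32, alerts_issued=147)
overshoot_pp = (far - 0.20) * 100.0
print(f"FAR = {far:.3f}, status = {budget_status(far, 0.20)}, "
      f"overshoot = {overshoot_pp:.1f} pp")
```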

Stakeholder input: Error budget thresholds set through consultation (government, validators, communities)—balancing miss risk vs false alarm cost

Drift Monitoring and Alerts

Model drift: Performance degrades over time as data distribution shifts (e.g., climate change, land use change, instrumentation changes)

Types of drift:

1. Covariate drift: Input distribution changes (e.g., rainfall patterns shifting)

Detection: Statistical tests on input distributions (Kolmogorov-Smirnov, chi-square)
Impact: May or may not affect performance (depends on whether model extrapolates well)

2. Prior probability drift: Outcome frequency changes (e.g., floods becoming more frequent)

Detection: Compare recent event rates to historical baseline
Impact: Model calibration may become biased (over/underpredicts)

3. Concept drift: Relationship between inputs and outputs changes (e.g., new dam changes rainfall-runoff relationship)

Detection: Performance metrics degrade even though inputs look normal
Impact: Directly reduces skill; requires recalibration or model redesign
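
For covariate drift, the two-sample Kolmogorov-Smirnov statistic compares the empirical distributions of a historical and a recent input sample. The pure-Python sketch below computes the statistic and the usual large-sample critical value; the rainfall numbers are invented, and in practice a library routine such as `scipy.stats.ks_2samp` would normally be used.

```python
import math
from bisect import bisect_right
from typing import Sequence

def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical cumulative distribution functions."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:          # the maximum gap occurs at a data point
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d

def ks_critical_value(n: int, m: int, alpha: float = 0.05) -> float:
    """Large-sample approximation of the rejection threshold."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))

historical = [12.0, 18.0, 25.0, 31.0, 40.0, 44.0, 52.0, 60.0]    # toy weekly rainfall, mm
recent = [35.0, 48.0, 55.0, 63.0, 70.0, 78.0, 85.0, 92.0]
d = ks_statistic(historical, recent)
print(d, d > ks_critical_value(len(historical), len(recent)))
```

A statistic above the critical value suggests the input distribution has shifted; whether skill suffers still has to be checked against performance metrics.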

Monitoring system:

Real-time checks (every forecast run):

Are inputs within historical ranges? Flag outliers for human review
Is ensemble spread reasonable? (Too wide = high uncertainty; too narrow = overconfident)
Do different ensemble members agree on qualitative outcome (flood vs no flood)? Disagreement = caution

Weekly summary:

Skill scores over past week vs. historical
Input distribution changes
Notable forecast busts or successes

Monthly review:

Comprehensive performance metrics
Drift detection tests
Equity metrics (disaggregated performance)
Comparison to benchmark models

Quarterly deep dive:

Root cause analysis of failures
Recalibration assessment
Safety case evidence update
Validator review meeting

Alert thresholds:

Yellow: Performance degraded by 5% to under 10% → Monitor closely; increase validation scrutiny
Orange: Performance degraded by 10-20% → Mandatory investigation; consider recalibration
Red: Performance degraded by >20% OR catastrophic failure (major event completely missed) → Immediate suspension; emergency validator review; rollback to previous version pending investigation
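
A minimal sketch of how these thresholds might be applied follows. It assumes degradation is measured as the relative drop in a skill score where higher is better, that a value of exactly 10% counts as Orange, and that exactly 20% stays Orange. The source does not settle these boundary choices, and the "green" label for the no-alert case is hypothetical.

```python
def alert_level(baseline_skill, current_skill, catastrophic_miss=False):
    """Map relative skill degradation (in percent) to an alert colour."""
    degradation = (baseline_skill - current_skill) / baseline_skill * 100
    if catastrophic_miss or degradation > 20:
        return "red"      # immediate suspension and rollback pending investigation
    if degradation >= 10:
        return "orange"   # mandatory investigation
    if degradation >= 5:
        return "yellow"   # monitor closely
    return "green"        # assumed label: no alert


print(alert_level(0.80, 0.74))                          # yellow (7.5% drop)
print(alert_level(0.80, 0.68))                          # orange (15% drop)
print(alert_level(0.80, 0.60))                          # red (25% drop)
print(alert_level(0.80, 0.79, catastrophic_miss=True))  # red
```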

Canary Deployments and Gradual Rollout

Canary deployment: Release new model version to small subset of users/regions before full deployment

Process:

Stage 1 – Internal testing (weeks 1-2):

New model version runs in parallel with production (shadow mode)
Outputs not published; used only for comparison
Validators review performance on recent events

Stage 2 – Canary (weeks 3-4):

Deploy to 10% of geography (select regions with good observational coverage for rapid feedback)
Monitor intensively (daily performance checks)
Rollback trigger: If performance worse than old version or critical error detected

Stage 3 – Gradual expansion (weeks 5-8):

Expand to 25%, then 50%, then 75%
Continuous performance monitoring at each stage
Pause expansion if issues emerge

Stage 4 – Full deployment (week 9+):

Complete rollout after successful gradual deployment
Old version maintained for 30 days as rollback option
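
The staged process above can be sketched as a small decision function. The stage names, the single scalar skill comparison, and the `issues_open` flag are simplifying assumptions for illustration. In practice the decision would rest on the full set of monitored metrics and validator judgment.

```python
ROLLOUT_STAGES = [
    ("shadow", 0),       # Stage 1: parallel run, outputs not published
    ("canary", 10),      # Stage 2: 10% of geography
    ("expand-25", 25),   # Stage 3: gradual expansion
    ("expand-50", 50),
    ("expand-75", 75),
    ("full", 100),       # Stage 4: complete rollout
]


def next_action(stage_index, new_skill, old_skill,
                critical_error=False, issues_open=False):
    """Decide what to do after monitoring the current stage."""
    name, _ = ROLLOUT_STAGES[stage_index]
    if critical_error or new_skill < old_skill:
        # Nothing is published in shadow mode, so there is nothing to roll back.
        return "abort" if name == "shadow" else "rollback"
    if issues_open:
        return "pause"
    if stage_index == len(ROLLOUT_STAGES) - 1:
        return "complete"
    next_name, next_pct = ROLLOUT_STAGES[stage_index + 1]
    return f"advance to {next_name} ({next_pct}%)"


print(next_action(1, new_skill=0.81, old_skill=0.80))  # advance to expand-25 (25%)
print(next_action(1, new_skill=0.78, old_skill=0.80))  # rollback
```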

Benefits:

Limits blast radius if new version has problems
Provides real-world validation before full deployment
Builds confidence through staged evidence

Mechanism VI: Rollback Discipline and Version Control

Version Control Infrastructure

Every model is versioned: Major.Minor.Patch (semantic versioning)

Major: Fundamental methodology change (e.g., new hydrological model)
Minor: Significant update (recalibration, new data source)
Patch: Bug fix, configuration tweak

Example timeline:

v4.0.0 (2023-01-15): Major release; ensemble of 3 hydrological models
v4.1.0 (2023-07-01): Minor update; bias correction improved
v4.1.1 (2023-08-10): Patch; fixed bug in snowmelt calculation
v4.2.0 (2024-01-15): Minor update; annual recalibration
v4.2.1 (2024-08-15): Patch; updated land cover data
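
The numbering in this timeline follows mechanically from the change type. The sketch below, with a hypothetical `Version` helper that is not drawn from any GCRI tooling, replays the example sequence.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text):
        """Parse strings such as 'v4.1.1' or '4.1.1'."""
        return cls(*(int(part) for part in text.lstrip("v").split(".")))

    def bump(self, change):
        """Return the next version for a 'major', 'minor', or 'patch' change."""
        if change == "major":
            return Version(self.major + 1, 0, 0)
        if change == "minor":
            return Version(self.major, self.minor + 1, 0)
        if change == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self):
        return f"v{self.major}.{self.minor}.{self.patch}"


# Replay the example timeline starting from v4.0.0.
version = Version.parse("v4.0.0")
for change in ["minor", "patch", "minor", "patch"]:
    version = version.bump(change)
    print(version)  # v4.1.0, v4.1.1, v4.2.0, v4.2.1
```

Because the class is a tuple, versions compare in release order, which makes "previous stable version" lookups a simple sort.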

Immutable artifacts: Each version produces cryptographically signed artifacts (model weights, configuration files, container images) stored in version control

Reproducibility: Any past version can be re-run identically; no “lost” models

Tooling: Git repository for code; MLflow or DVC for model artifacts; Docker images for the complete computational environment
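
To make the idea of immutable, signed artifacts concrete, the sketch below builds a manifest of SHA-256 digests and authenticates it. Note the assumptions: HMAC with a shared key is used here only as a standard-library stand-in for a real signature scheme (a production system would more likely use asymmetric signatures), and the file names, key, and function names are hypothetical.

```python
import hashlib
import hmac
import json


def build_manifest(version, artifacts):
    """Map each artifact name to the SHA-256 digest of its bytes."""
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in sorted(artifacts.items())}
    return {"version": version, "artifacts": digests}


def sign_manifest(manifest, key):
    """HMAC-SHA256 over canonical JSON (stand-in for a real signature)."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_manifest(manifest, signature, key):
    """Constant-time check that the manifest has not been altered."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


key = b"demo-key-not-for-production"  # hypothetical key for the example only
artifacts = {"weights.bin": b"\x00\x01\x02", "config.yaml": b"ensemble_size: 3\n"}
manifest = build_manifest("v4.2.1", artifacts)
signature = sign_manifest(manifest, key)
print(verify_manifest(manifest, signature, key))  # True

# Any change to an artifact changes its digest, so verification fails.
tampered = build_manifest("v4.2.1", {**artifacts, "weights.bin": b"\x00\x01\x03"})
print(verify_manifest(tampered, signature, key))  # False
```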

Rollback Procedures

Rollback triggers:

Performance drop: Error budget exceeded or skill metrics fall below thresholds
Systematic bias: Forecasts consistently over/underpredicting for specific regions or populations
Security incident: Model or pipeline compromised
Validator challenge: 2+ validators flag critical concern with new version
Unexplained behavior: Model produces outputs that seem wrong but root cause unclear

Rollback authority:

Automatic: NVM monitoring detects threshold breach → triggers rollback without human intervention
Validator-initiated: Any validator can request rollback with documented justification; requires 2-of-N concurrence within 4 hours
Operator-initiated: Duty officer can trigger an emergency rollback in case of operational failure (system down, forecasts not generating); must notify validators within 1 hour
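
The validator-initiated path can be sketched as a concurrence check. This is an illustration under stated assumptions: the requesting validator is counted as one of the two concurring validators, which the text above does not specify, and the identifiers and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

CONCURRENCE_WINDOW = timedelta(hours=4)
REQUIRED_VALIDATORS = 2


def rollback_approved(requested_at, concurrences):
    """concurrences: iterable of (validator_id, timestamp) pairs.

    Assumption: the requester's own entry counts toward the 2-of-N quorum.
    Each validator is counted once, and only if inside the 4-hour window.
    """
    deadline = requested_at + CONCURRENCE_WINDOW
    in_time = {vid for vid, ts in concurrences if requested_at <= ts <= deadline}
    return len(in_time) >= REQUIRED_VALIDATORS


requested = datetime(2024, 8, 20, 9, 0)
on_time = [("val-1", requested), ("val-3", requested + timedelta(hours=2))]
too_late = [("val-1", requested), ("val-3", requested + timedelta(hours=5))]

print(rollback_approved(requested, on_time))   # True
print(rollback_approved(requested, too_late))  # False
```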

Rollback process:

Step 1 – Initiation (minute 0):

Rollback triggered (automatic or manual)
Alert sent to all validators and ops team
Incident log opened

Step 2 – Traffic redirect (minutes 1-10):

Production traffic switched from current version to previous stable version
Load balancers updated
DNS propagation (for API endpoints)

Step 3 – Verification (minutes 10-20):

Smoke tests confirm previous version operational
Sample forecasts generated and checked for sanity
Alert dissemination channels verified

Step 4 – Incident response (hours 1-24):

Root cause investigation (why did current version fail?)
Data forensics (were specific input conditions responsible?)
User impact assessment (who was affected? any downstream consequences?)

Step 5 – Communication (hours 1-48):

Notify users: “We’ve temporarily reverted to previous model version due to [brief reason]. We are investigating and will update you within 24 hours.”
Transparency portal updated with incident summary
For major incidents, press communication if public-facing

Step 6 – Resolution (days 1-14):

Fix identified issues in the version that was rolled back (the failed release)
Regression testing (ensure fix doesn’t break other functionality)
Validator re-review and approval
Staged re-deployment (canary → gradual rollout)

Step 7 – Post-mortem (days 7-30):

Blameless review: What happened? Why? How do we prevent recurrence?
Document lessons learned
Update procedures and validation checklists
Share findings with Continental Steward and peer NWGs

Rollback SLA: <1 hour from trigger to previous version operational
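
One way to audit a rollback, or a drill, against these targets is sketched below. The event names are hypothetical, and the sketch assumes "previous version operational" corresponds to the end of Step 3 verification.

```python
from datetime import datetime, timedelta

# Minutes from trigger by which each step should be done (Steps 2 and 3 above).
STEP_TARGETS = {"traffic_redirected": 10, "verified": 20}
ROLLBACK_SLA = timedelta(hours=1)


def review_rollback(triggered_at, events):
    """events maps step name to completion time. Returns (sla_met, late_steps)."""
    late = []
    for step, minutes in STEP_TARGETS.items():
        done = events.get(step)
        if done is None or done - triggered_at > timedelta(minutes=minutes):
            late.append(step)
    verified = events.get("verified")
    sla_met = verified is not None and verified - triggered_at < ROLLBACK_SLA
    return sla_met, late


trigger = datetime(2024, 8, 20, 3, 0)
drill = {
    "traffic_redirected": trigger + timedelta(minutes=8),
    "verified": trigger + timedelta(minutes=27),
}
print(review_rollback(trigger, drill))  # (True, ['verified'])
```

In this example the SLA is met, but verification missed its 20-minute target, which is the kind of finding a post-mortem or drill review would record.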

Drills: Quarterly rollback exercises (similar to fire drills)—deliberately trigger rollback in non-critical window to ensure procedures work and team maintains muscle memory

Version Lifecycle Management

Production version: Currently operational; serving live forecasts

Previous stable version: Maintained in warm standby for 30 days post-deployment of newer version; available for immediate rollback

Development versions: Multiple versions being developed/tested simultaneously; not user-facing

Deprecated versions: Old versions no longer maintained; archived for reproducibility but not recommended for use; security patches not backported

Sunset policy: Versions >3 years old automatically deprecated unless explicitly renewed through validator review
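
The two time-based rules in this lifecycle, the 30-day warm standby and the 3-year sunset, can be expressed as simple date checks. The function names and dates below are illustrative assumptions. The sketch measures the 3 years from the release date, which the policy above does not state explicitly.

```python
from datetime import date

STANDBY_DAYS = 30
SUNSET_YEARS = 3


def add_years(day, years):
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def standby_available(superseded_on, today):
    """True while the superseded version is still inside its warm-standby window."""
    return 0 <= (today - superseded_on).days <= STANDBY_DAYS


def is_sunset(released_on, today, renewed=False):
    """True if the version is more than 3 years old and was not explicitly renewed."""
    return not renewed and today > add_years(released_on, SUNSET_YEARS)


released = date(2023, 1, 15)    # hypothetical release date
superseded = date(2023, 7, 1)   # hypothetical date a newer version went live

print(standby_available(superseded, date(2023, 7, 20)))   # True (19 days)
print(standby_available(superseded, date(2023, 9, 1)))    # False
print(is_sunset(released, date(2026, 1, 16)))             # True
print(is_sunset(released, date(2026, 1, 16), renewed=True))  # False
```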

Mechanism VII: Independent Ethics Review and Stop Buttons

AI Ethics Committees

Composition: Multidisciplinary panels reviewing AI systems for ethical implications

Ethicists (moral philosophy, applied ethics)
Social scientists (anthropology, sociology—understand community impacts)
Legal experts (rights, liability, due process)
Affected community representatives
Technical experts (understand system capabilities and limitations)
Disability rights advocates
Indigenous knowledge holders

Mandate: Review AI systems before deployment; ongoing monitoring; authority to pause or veto

Review scope:

Does system respect human dignity and rights?
Are there risks of discrimination or bias?
Is consent meaningfully obtained?
Are affected populations genuinely included in governance?
What are potential dual uses or misuses?
Are accountability mechanisms adequate?
Could system exacerbate power imbalances or vulnerability?

Process:

Pre-deployment review:

System developers submit ethics review application with: technical documentation, impact assessment, mitigation strategies
Ethics committee reviews over 30-60 days
May request modifications, additional safeguards, or reject deployment

Ongoing monitoring:

Quarterly reports from operators on ethical issues encountered
Annual comprehensive review
Ad hoc reviews if concerns emerge

Decision outcomes:

Approve: System meets ethical standards; proceed
Conditional: Approve with specific requirements (e.g., additional transparency, enhanced grievance mechanisms)
Delay: Need more information or safeguards before decision
Veto: Ethical concerns cannot be adequately mitigated; system should not be deployed

Veto authority: Ethics committee has final say; cannot be overruled by technical or administrative leaders. This is crucial—ethics as binding constraint, not advisory.
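
The binding nature of the ethics decision can be illustrated as a deployment gate in which technical approval is necessary but never sufficient. The names below are hypothetical, and the sketch assumes a "Conditional" outcome permits deployment only once the committee's conditions are confirmed as met.

```python
from enum import Enum


class EthicsDecision(Enum):
    APPROVE = "approve"
    CONDITIONAL = "conditional"
    DELAY = "delay"
    VETO = "veto"


def may_deploy(technical_approval, ethics_decision, conditions_met=False):
    """Deployment requires technical approval AND a permitting ethics outcome."""
    if ethics_decision is EthicsDecision.APPROVE:
        ethics_ok = True
    elif ethics_decision is EthicsDecision.CONDITIONAL:
        ethics_ok = conditions_met
    else:
        # DELAY and VETO both block deployment; no other input can change this.
        ethics_ok = False
    return technical_approval and ethics_ok


print(may_deploy(True, EthicsDecision.APPROVE))                          # True
print(may_deploy(True, EthicsDecision.CONDITIONAL))                      # False
print(may_deploy(True, EthicsDecision.CONDITIONAL, conditions_met=True)) # True
print(may_deploy(True, EthicsDecision.VETO))                             # False
```

The function deliberately has no override parameter, mirroring the rule that technical or administrative leaders cannot overrule a veto.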

Stop Buttons (Emergency Deactivation)

Principle: Any authorized actor can halt AI system if harmful behavior detected

Stop button authorities:

Validators (any of 6 per country): Immediate stop if system producing dangerous/biased outputs
Ethics committee: Stop pending investigation if ethical violations suspected
Affected communities (via representatives): Demand pause if system harming community
Technical operators: Emergency stop if system behaving unexpectedly
Continental Steward: Regional escalation authority

Stop button trigger criteria:

Model producing systematically biased forecasts disadvantaging marginalized groups
Security breach suspected (unauthorized access, data tampering)
Unexplained outputs (model behavior inconsistent with training/validation)
Rights violations (consent not properly obtained, data misuse)
Safety case assumptions violated (e.g., climate stationarity no longer holding)

Stop button procedure:

Immediate (minute 0):

System halted (forecasts cease; no new outputs)
Alert sent to all stakeholders
Incident log opened with documentation of trigger reason

Within 1 hour:

Incident response team convenes (technical lead, ethics rep, validator)
Initial assessment: Is stop justified? Is issue resolvable quickly?

Within 4 hours:

Preliminary investigation results
Decision: (a) Resume operations (false alarm or issue quickly fixed), (b) Extended pause pending deeper investigation, or (c) Rollback to previous version

Within 24 hours:

Detailed incident report
Communication to users and public
Plan for resolution

Resolution timeline: Varies by severity; typically days to weeks for substantive issues
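
The stop button timeline above lends itself to a simple deadline tracker. The sketch below is illustrative only: the short task keys ("halt", "convene", "decide", "report") are hypothetical labels for the four milestones, and the timestamps are invented for the example.

```python
from datetime import datetime, timedelta

# Offsets from the moment the stop button is triggered.
STOP_MILESTONES = {
    "halt": timedelta(0),            # system halted, alert sent, log opened
    "convene": timedelta(hours=1),   # incident response team, initial assessment
    "decide": timedelta(hours=4),    # resume, extended pause, or rollback
    "report": timedelta(hours=24),   # incident report, communication, plan
}


def stop_deadlines(stopped_at):
    """Absolute deadline for each milestone."""
    return {task: stopped_at + offset for task, offset in STOP_MILESTONES.items()}


def overdue_tasks(stopped_at, completed, now):
    """Milestones whose deadline has passed without being marked complete."""
    return [task for task, due in stop_deadlines(stopped_at).items()
            if task not in completed and now > due]


stopped = datetime(2025, 3, 1, 14, 0)
now = stopped + timedelta(hours=5)
print(overdue_tasks(stopped, completed={"halt", "convene"}, now=now))  # ['decide']
```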

No-retaliation policy: Anyone triggering stop button in good faith protected from professional retaliation, even if stop proves unnecessary. Encourages proactive risk management over institutional pressure to maintain operations.

Red Team Exercises

Red teaming: Adversarial testing where external team tries to break system, identify failure modes, find ways to cause harm

GCRI red team program:

Annual exercises (week-long intensive sessions):

External security researchers, ethicists, social scientists attempt to:
- Craft adversarial inputs causing model failure
- Identify edge cases model mishandles
- Find pathways to bias or discrimination
- Exploit privacy vulnerabilities
- Discover ways system could be misused

Example attack scenarios:

Data poisoning: Inject false sensor data to manipulate forecasts
Adversarial examples: Find inputs causing wildly incorrect predictions
Fairness gaming: Exploit targeting algorithms to exclude vulnerable groups
Social engineering: Trick operators into overriding safeguards
Inference attacks: Extract sensitive information from model outputs

Findings documentation: All red team discoveries documented; mitigation strategies developed; fixes implemented before next operational cycle

Bounty program: Financial rewards for security researchers who identify vulnerabilities through responsible disclosure

Summary: AI as Tool, Human as Decider

Design Principle V asserts AI amplifies human capabilities but never replaces human judgment and accountability.

Division of responsibility:

AI: Perception (process vast data), pattern recognition (detect subtle signals), prediction (probabilistic forecasts), optimization (best allocation under constraints)
Human: Final decisions on action, value judgments (who to prioritize), novelty (unprecedented situations), accountability (bearing responsibility), communication (explaining to public)

Transparency mechanisms:

Model cards: Full disclosure of capabilities, limitations, training, performance
Safety cases: Structured arguments with evidence that system is safe for intended use
Explainability: Feature importance, counterfactuals, uncertainty decomposition matched to audience
Human-in-the-loop: Required human review and authorization at critical decision points

Control mechanisms:

Error budgets: Explicit acceptable failure rates; throttle changes if budget exceeded
Drift monitoring: Continuous performance tracking; automated alerts if degradation
Rollback discipline: Previous versions maintained; <1 hour rollback capability; quarterly drills
Stop buttons: Multiple authorities can halt system immediately if harmful behavior detected

Governance mechanisms:

Ethics committees: Independent review with veto authority; ongoing monitoring
Red teams: Adversarial testing to find failure modes before they cause harm
Public transparency: Model cards, safety cases, incident reports publicly available
No retaliation: Protection for those who raise concerns or trigger stops

Leadership test: Is every AI-assisted decision explainable (why did we do this?), stoppable (can we halt if wrong?), and reversible (can we undo harm?) by design?

If yes: Human-AI teaming is genuine partnership with appropriate human control.

If no: System is autopilot with humans as passengers—inappropriate for disaster risk where accountability is essential.

Written by: GCRI
