Frequently Asked Questions
How this monitor works, what the probabilities mean, and why the governance architecture matters more than the model.
This is the Iran Conflict Scenario Monitor, a structured analytical exercise run by Johannes Fritz and the technology team at the St. Gallen Endowment for Prosperity Through Trade. The Endowment is a Swiss-based, tax-exempt non-profit organisation. Its core work is making government policy transparent: the Endowment runs the Global Trade Alert and the Digital Policy Alert, two databases that have documented over 85,000 government interventions worldwide. These databases are regularly cited by the European Central Bank, the International Monetary Fund, the World Bank, and the United Nations, as well as major law firms, corporations, and news organisations including the Financial Times, the Economist, and Bloomberg. A team of 40+ professionals from 25 nationalities supports this work.
This initiative applies methods the Endowment is developing for trade and digital policy to a live geopolitical crisis. It is a technology demonstration—a proof of concept for structured AI-assisted monitoring that seeks to complement existing expertise and tools—not a policy recommendation, and not a political statement.
The monitor does not take sides. It tracks structural dynamics by asking questions such as: which escalation pathways are opening, which bargaining positions are hardening, which scenarios are gaining or losing plausibility? It does not argue that one side is right or wrong, and it does not advocate for any particular outcome. The system is built to reveal competing interpretations of the same evidence—a dedicated adversarial agent is mandated every cycle to construct the strongest possible case against whatever the majority concludes. When the analysts disagree, the disagreement is reported openly rather than smoothed away. The Endowment’s institutional mission is transparency, not advocacy: helping people understand what is happening, not telling them what should happen. Readers across different political positions and geographies should find the analysis useful precisely because it does not start from a conclusion and work backwards.
The Endowment is transparent about one important limitation: our deep expertise is in trade and industrial policy, not Middle Eastern security. The monitor is an exercise in applying structured analytical architecture to a domain where the team has no particular subject-matter advantage. The value lies in the method, not in any claim to regional authority. Still, if the findings prove useful, that demonstrates how carefully designed AI methods can shed light on structural factors while avoiding the cognitive biases and blind spots of human experts.
The purpose is to extract analytical signal from noise. During a fast-moving conflict, the volume of reporting is overwhelming—thousands of articles, claims, and counterclaims every day. Most of this volume repeats the same information, and no human can realistically sort through all of it in a meaningful timeframe.
What matters is not how much is being said but what is structurally changing: a new escalation pathway opening, a bargaining position shifting, a previously stable variable becoming unstable. The monitor is designed to identify these structural shifts and track their implications across a defined set of scenarios.
It works through a six-phase cycle that runs up to three times per day. First, the system gathers new developments, each of which must include a verifiable source and publication date. High-impact claims require the system to read the full source, not just a headline or snippet. Second, fourteen specialised analytical agents independently assess the new evidence. Each agent applies a different theoretical framework based on established human scholarship and expertise—one focuses on escalation dynamics, another on bargaining structure, another on how decision-makers misread each other’s intentions, another on the institutional behaviour of Iran’s Revolutionary Guards, and so on. Crucially, no agent sees any other agent’s assessment. This isolation is enforced by the system’s architecture, not by a request to play fair.
Third, a peripheral scanner examines any intelligence that none of the fourteen agents cited, catching signals that fall between their frameworks. Fourth, all fourteen assessments are aggregated: each agent’s proposed probability adjustments are equal-weighted and averaged, then applied to the previous cycle’s probabilities. The scenario probabilities typically sum to 93–95%, with the remainder held as an explicit reserve for outcomes the framework does not cover. Fifth, a structured briefing is written reporting the updated probabilities, the key developments driving them, and any significant disagreements among the agents. Sixth, everything—raw intelligence, individual assessments, aggregated probabilities, and the briefing—is version-controlled and preserved.
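The aggregation step (phase four) can be sketched in a few lines. This is a minimal illustration of the described rule—equal-weighted averaging of each agent's proposed percentage-point adjustments, applied to the previous cycle's probabilities—not the system's actual code; all names and numbers are invented for the example.

```python
# Sketch of phase four: equal-weight each agent's proposed adjustments,
# average them per scenario, and apply the mean to the previous cycle's
# probabilities. Scenario names and deltas below are illustrative only.

def aggregate(previous: dict[str, float],
              proposals: list[dict[str, float]]) -> dict[str, float]:
    """Average per-scenario percentage-point adjustments and apply them."""
    n = len(proposals)
    updated = {}
    for scenario, prob in previous.items():
        # Agents that propose no change for a scenario contribute 0.0.
        mean_delta = sum(p.get(scenario, 0.0) for p in proposals) / n
        updated[scenario] = max(0.0, prob + mean_delta)
    return updated

previous = {"attrition_stalemate": 30.0, "negotiated_exit": 25.0}
proposals = [
    {"attrition_stalemate": +2.0},                           # agent 1
    {"attrition_stalemate": +1.0, "negotiated_exit": -1.0},  # agent 2
]
print(aggregate(previous, proposals))
# attrition_stalemate: 30 + (2+1)/2 = 31.5; negotiated_exit: 25 - 0.5 = 24.5
```

Equal weighting is the key design choice here: no agent's framework counts more than any other's, so a single strongly worded assessment cannot dominate the update.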
The result is a thrice-daily structured assessment that tells you not just what happened, but what it means for the trajectory of the conflict, where the analysts agree, and where they do not.
This monitor is not a sentiment tracker. A sentiment tracker measures volume—how much is being said and in what tone. This monitor measures structural change—what is actually different about the conflict’s dynamics compared to the last update. A day with 10,000 articles repeating the same headline scores high on a sentiment tracker but contains no new information for this system. A single IRGC statement changing its target set from military bases to energy infrastructure might generate almost no media volume but fundamentally reshapes which scenarios are gaining plausibility. One way to think about the difference is that a sentiment tracker follows the extensive margin of information (more of the same), while this monitor follows the intensive margin (something structurally new).
Each analyst is an AI agent given a precise role: apply a specific established framework to the evidence and reach the conclusions that framework produces. One agent applies Thomas Schelling’s theory of coercive bargaining—it looks at each side’s credible threats, audience costs, and the natural boundaries where a settlement might form. Another applies Barry Posen’s work on inadvertent escalation—it maps dual-use systems, command-and-control gaps, and fog-of-war risks where misidentification could trigger unplanned escalation. Another applies Etel Solingen’s research on the domestic political economy of nuclear decisions—it tracks whether the war is strengthening or weakening the internal coalitions that historically pursue nuclear weapons. These are three examples; the remaining eleven agents apply other established frameworks in the same way.
The agents do not gather their own intelligence, run independent models, or consult human experts. They reason through their assigned framework using the evidence provided to them and whatever understanding of that framework exists in the underlying AI model’s training data. The quality of each agent’s assessment depends on three things: how well the AI model understands the relevant theory, how precisely the agent’s instructions channel that understanding, and how good the raw evidence is.
Each agent follows a visible reasoning chain when processing new information. It receives the raw intelligence items and the current probability distribution. For each new development, the agent asks: given my theoretical framework, is this event consistent or inconsistent with each of the eight scenarios, and does it provide genuine diagnostic discrimination—meaning it supports one scenario while being difficult to reconcile with others? Events that are consistent with multiple scenarios simultaneously carry less analytical weight than events that distinguish between them.
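The diagnosticity test described above can be made concrete with a toy scoring rule. This sketch is an assumption for illustration—the monitor does not publish its internal scoring—but it captures the stated logic: an event consistent with only one scenario discriminates strongly, while an event consistent with many carries proportionally less weight.

```python
# Toy diagnosticity score (assumed, not the system's actual rule):
# weight an event by the inverse of the number of scenarios it is
# consistent with, so discriminating evidence counts for more.

def diagnostic_weight(consistency: dict[str, bool]) -> float:
    """Return 1 / (number of scenarios the event is consistent with)."""
    supported = [s for s, ok in consistency.items() if ok]
    return 1.0 / len(supported) if supported else 0.0

# Consistent with a single scenario: maximally diagnostic.
print(diagnostic_weight({"regional_war": True, "negotiated_exit": False}))  # 1.0
# Consistent with both scenarios: half the analytical weight.
print(diagnostic_weight({"regional_war": True, "negotiated_exit": True}))   # 0.5
```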
Where the agent identifies a meaningful shift, it must specify which scenario is affected, the direction of the adjustment, and a magnitude classified on a predefined five-point scale: negligible, minor, moderate, significant, or major. Each level corresponds to a fixed percentage-point shift, so that “moderate upward” always translates to the same numerical adjustment regardless of which agent proposes it. The scale is deliberately constrained—no agent can move a scenario by an arbitrary amount in a single cycle, which prevents any one piece of evidence or any one theoretical lens from dominating the distribution.
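The fixed scale can be pictured as a simple lookup. The five labels come from the text above; the percentage-point values attached to them here are invented for the sketch, since the monitor does not publish the exact numbers.

```python
# Illustrative mapping of the five-point magnitude scale to fixed
# percentage-point shifts. The labels are the system's; the numeric
# values are assumptions for this sketch.
MAGNITUDE_PP = {
    "negligible": 0.0,
    "minor": 0.5,
    "moderate": 1.0,
    "significant": 2.0,
    "major": 3.0,
}

def adjustment(magnitude: str, direction: str) -> float:
    """Translate a classified magnitude and direction into a fixed shift."""
    sign = 1.0 if direction == "up" else -1.0
    return sign * MAGNITUDE_PP[magnitude]

# "Moderate upward" always yields the same shift, whichever agent proposes it.
print(adjustment("moderate", "up"))   # 1.0
print(adjustment("major", "down"))    # -3.0
```

Because the mapping is a shared constant rather than a free parameter, no single agent or piece of evidence can move a scenario by an arbitrary amount in one cycle.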
Alongside the direction and magnitude, the agent must name the specific causal mechanism from its framework that connects the event to the scenario shift. An agent cannot say “the situation escalated.” It must conclude something like: “the IRGC’s conditional energy threat creates a new inadvertent escalation pathway because field commanders must interpret threshold-crossing in real time without political guidance, and the threshold definition is ambiguous.” This mechanism requirement is what distinguishes the system from summarisation of news events. It forces the reasoning into the open, where it can be examined and challenged.
The adversarial red team agent plays a specific structural role. Every cycle, regardless of how strong the consensus appears, it must construct the strongest possible case against the leading scenario. It prioritises evidence that is inconsistent with that scenario over evidence that merely confirms it. It runs a pre-mortem: if we are catastrophically wrong in two weeks, what will we wish we had noticed today? This is not optional—it is built into the system’s rules.
The honest summary: the judgements are the output of a large language model reasoning under tightly structured constraints. The rigour comes from the architecture imposed on that reasoning—the isolation, the named mechanisms, the mandatory adversarial challenge—not from any independent empirical capability the agents possess. The creators are explicit about this.
Each probability answers one precisely defined question: how likely is it that this scenario will be the dominant crisis trajectory four weeks from the assessment date? “Dominant” means the scenario’s defining mechanism—not just its surface features—is the primary force shaping the conflict at that point. This is best thought of as a forward-looking assessment of which dynamic will be most actively driving events, not a prediction of how the conflict ultimately ends.
This distinction matters in practice. A scenario like “attrition stalemate” is assessed on whether the pattern of mutual exhaustion without decisive results will come to pass. A scenario like “coercive submission” is assessed on whether Iran’s capitulation is likely within this window. For tail-risk scenarios like “nuclear sprint,” the probability tracks the likelihood of the triggering decision—the moment Iran’s leadership commits to assembling a device—not the completion of a weapon, which could take years.
The eight scenarios are mutually exclusive at any given assessment point, meaning only one can be dominant at a time, but they can transition into one another. A conflict that is currently tracking as regional war could later shift toward attrition stalemate or negotiated exit as conditions change. The probabilities reflect the system’s current assessment of which trajectory is primary, not a permanent assignment.
The probabilities deliberately do not sum to 100%. They are constrained to total 93–95%, with the remainder—typically 5–7%—held as an explicit unspecified tail-risk reserve. This reserve acknowledges that something outside the eight defined scenarios could happen: an outcome the framework did not anticipate. Most forecasting systems force their numbers to add up to unity, which implicitly claims the framework is complete. This system does not make that claim.
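One simple way to implement such a reserve is to rescale the scenario probabilities to a target total inside the 93–95% band and book the remainder explicitly. The rescaling rule below is an assumption for illustration; the monitor only documents the band itself.

```python
# Sketch of the tail-risk reserve constraint: rescale scenario
# probabilities to a target inside the 93-95% band and hold the
# remainder as an explicit unspecified reserve. The proportional
# rescaling rule is assumed, not documented by the system.

def apply_reserve(probs: dict[str, float], target: float = 94.0) -> dict[str, float]:
    """Scale scenario probabilities to `target`; book the rest as reserve."""
    total = sum(probs.values())
    scaled = {k: v * target / total for k, v in probs.items()}
    scaled["unspecified_reserve"] = 100.0 - target
    return scaled

p = apply_reserve({"a": 50.0, "b": 30.0, "c": 20.0})
print(p)  # scenarios sum to 94.0; unspecified_reserve is 6.0
```

The point of the explicit reserve line is honesty: forcing the named scenarios to 100% would implicitly claim the framework covers every possible outcome.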
Two further caveats are important. First, the system carries a standing warning that scenario probabilities for nuclear-related outcomes are inherently unreliable and should not be used for operational planning. Second, the probabilities express the system’s relative confidence across scenarios—treat them as a ranking with approximate magnitudes, not as precise decimal-point predictions. A scenario at 45% is substantially more likely than one at 9%, but the difference between 45% and 47% is not meaningful.
The short answer is that you should not treat this monitor’s findings as a complete substitute for human intelligence analysis, and the system’s creators do not ask you to. The longer answer is that the system enforces analytical disciplines that are genuinely difficult to sustain in human teams, and that combination—structured AI with transparent limitations—can be useful even where it is imperfect.
The first discipline is independence. In human analytical teams, once one senior analyst shares a view, others tend to converge on it. This is well-documented in intelligence studies and is one of the central problems identified in post-mortems of major analytical failures. The monitor enforces independence architecturally: each agent is spawned in isolation and literally cannot see what the others have concluded. Before the results are combined, the system checks for contamination and discards any assessment that references another agent’s reasoning.
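The contamination check can be sketched as a filter over the assessments before aggregation. The detection logic shown here (simple identifier matching) is a hypothetical stand-in; the actual mechanism is not published.

```python
# Hypothetical sketch of the pre-aggregation contamination check: discard
# any assessment whose text references another agent. Agent identifiers
# and the matching rule are assumptions for illustration.

AGENT_IDS = [f"agent_{i}" for i in range(1, 15)]  # fourteen agents

def is_contaminated(author: str, text: str) -> bool:
    """Flag an assessment that mentions any other agent's identifier."""
    return any(other in text for other in AGENT_IDS if other != author)

assessments = {
    "agent_1": "Schelling lens: coercive leverage is hardening.",
    "agent_2": "I concur with agent_1 that leverage is hardening.",
}
clean = {a: t for a, t in assessments.items() if not is_contaminated(a, t)}
print(sorted(clean))  # agent_2's assessment is discarded; agent_1's is kept
```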
The second discipline is mandatory adversarial challenge. Human red teams exist in many organisations, but in practice they are often sidelined, under-resourced, or treated as a box-ticking exercise. In this system, the adversarial agent runs every cycle with the same weight as every other agent. Its mandate is non-negotiable: construct the strongest possible counter-case, prioritise disconfirming evidence, run the pre-mortem. It cannot be overruled or skipped.
The third discipline is the mechanism requirement. Human analysts frequently make probability adjustments based on gut feeling or pattern recognition without articulating why. This system requires every adjustment to name a specific causal process from a specific theoretical framework. That does not guarantee the reasoning is correct, but it makes it auditable—you can see exactly why each agent moved a probability and decide for yourself whether the logic holds.
The creators tested whether the system’s conclusions were an artefact of whichever AI model they used by running the same framework across multiple models, including one configured to reason from a Tehran perspective using non-Western sources. The broad scenario ranking held, though the Tehran-perspective model inverted the ordering of the two leading scenarios. Their conclusion: the governance architecture matters more than the underlying model.
None of this means the system is infallible. It has no access to classified intelligence, no well-placed human sources, and no ability to read body language in a negotiation. Its understanding of each theoretical framework is only as good as what the AI model absorbed during training. Where evidence is thin, the analysis will be thin. But as a structured, transparent, disciplined way of processing open-source information during a fast-moving crisis—one that shows its reasoning, surfaces its disagreements, and is honest about its limitations—it offers something that headline-driven news coverage does not.
For analysts and decision-makers with genuine regional expertise, the monitor’s value is not in replacing what they already know—it is in providing a structured baseline against which to test their own judgements. A Middle East specialist can look at the system’s scenario probabilities and ask: where do I disagree, and why? That disagreement is itself analytical information.
The monitor also does something that is operationally difficult for any single expert or team: it runs fourteen independent theoretical lenses across the same evidence up to three times a day, surfaces the disagreements between them, and maintains a complete audit trail. No human team can sustain that cadence and that breadth simultaneously over weeks of crisis. The system cannot replace classified intelligence, human sources, or the institutional knowledge that comes from years in the region. But it can provide a disciplined, transparent, and continuously updated reference point—one that shows its reasoning, names its mechanisms, and is honest about its limitations—against which expert judgement can be sharpened rather than replaced.