Original Title: how I run 200 AI agents on the hormuz crisis with Mirofish, and compare it to polymarket
Original Author: The Smart Ape
Translator: Peggy, BlockBeats

Editor's Note: As AI begins to simulate a public opinion arena, the event itself quietly changes.

This article documents an experiment regarding the situation in the Strait of Hormuz: the author built a simulation system consisting of 200 agents using MiroFish, allowing governments, media, energy companies, traders, and ordinary people to coexist in a simulated social network, forming judgments through continuous interaction, debate, and information dissemination, and comparing the results of this group with the market pricing of Polymarket.

The results were inconsistent. The overall group discussion was biased towards optimism, while the market was significantly more pessimistic; in free expression, a minority of pessimists were closer to the real pricing; however, once in an interview situation, almost all agents converged to more moderate, cooperative expressions.

This division is familiar. In the real world, public statements tend to stabilize and appear optimistic, while true risk judgments lie hidden in actions and informal expressions. In other words, how people speak, what they think, and how they bet with money are often three different systems.

In such a structure, the most valuable signals often do not come from consensus but from those voices that appear out of sync with the noise.

Here is the original text:

I used MiroFish to simulate the situation in the Strait of Hormuz over the next few weeks. This tool excels at managing such issues as it allows for highly complex scenario simulations: introducing multiple participants, different roles, and their respective incentive mechanisms within the same system, allowing these agents to constantly game, debate, and ultimately gradually form a result approaching consensus.

Below are the specific steps I took to run this simulation and the results I ultimately obtained. Anyone can replicate this; the key is to know the steps to follow.

First, MiroFish is an open-source project from a research team in China. After you input a batch of documents, it builds a knowledge graph, generates different agent personalities based on this graph, and then deploys these agents into a simulated Twitter environment. In this environment, they post, retweet comments, like, and debate with each other. After the simulation ends, you can also interview each agent individually to view their perspectives and reasoning processes.

You input a crisis scenario, and it generates a debate around that event; from this debate, you can extract a predictive result.

I pointed it towards a current Polymarket issue: will maritime transport through the Strait of Hormuz return to normal by the end of April 2026?

So, I fed all this information into MiroFish, generating 200 agent roles—including government, media, military, energy companies, traders, and ordinary citizens—and let them debate in a simulated environment for 7 simulated days. Finally, I compared the output results with market pricing.

The overall configuration is as follows:

· Model: GPT-4o mini, with the best balance of cost and effectiveness in the scenario of 200 agents

· Memory system: Zep Cloud, used to store agent memories and knowledge graphs

· Simulation engine: OASIS (a Twitter clone environment provided by Camel-AI)

· Hardware: Mac mini M4 Pro, 24GB RAM

· Runtime: Approximately 49 minutes, completing 100 simulation rounds

· Cost: API calls about 3 to 5 dollars

· Seed material: a 5800-character brief, compiled from Wikipedia, CNBC, Al Jazeera, Forbes, and Reuters, covering military timelines, blockade status, oil prices, economic losses, diplomatic efforts, and factors related to GCC's $32 trillion investment. In other words, all the core information required for agents to form judgments was included.

How to Replicate This Process (Step by Step)

If you want to run it yourself, here are the complete steps I took. The entire process takes about 2 hours to set up, and the API cost is approximately 3 to 5 dollars; if you increase the number of rounds or agents, the cost will also rise.

What You Need to Prepare

· Python 3.12 (do not use 3.14, as tiktoken will report errors in this version)

· Node.js version 22 and above

· An OpenAI API Key (GPT-4o mini is cheap enough, suitable for this scenario)

· A Zep Cloud account (the free version is sufficient for small-scale simulations)

· A machine with decent RAM. I used a Mac mini M4 Pro with 24GB RAM, but 16GB should also suffice

Step 1: Install MiroFish

Then configure your .env file

OPENAI_API_KEY=sk-your-key

OPENAI_BASE_URL=link

OPENAI_MODEL=gpt-4o-mini

ZEP_API_KEY=your-zep-key

Step 2: Create a project and upload your seed document

The seed document is the most important part of the entire process, as it determines what information the agents have regarding the current situation. I prepared a brief of around 5800 characters covering military timelines, blockade status, oil prices, economic losses, diplomatic efforts, and the impact of GCC investments, sourced from Wikipedia, CNBC, Al Jazeera, Forbes, and Reuters.

Step 3: Generate an ontology

This step tells MiroFish what types of entities to recognize and what relationships may exist between these entities.

I ultimately generated 10 types of entities: countries, military, diplomats, businesses, media institutions, economic entities, organizations, individuals, infrastructure, and prediction markets; along with 6 types of relations. If the automatically generated results do not fit your scenario, you can manually adjust them.

Step 4: Build the knowledge graph

This step will use Zep Cloud. MiroFish will send the seed document and the ontology to Zep, which is responsible for extracting entities and building the graph.

This process takes about one or two minutes. I ultimately obtained a graph containing 65 nodes and 85 edges, connecting countries, individuals, organizations, commodities, and other elements.

Step 5: Generate agents

MiroFish will generate a complete personality setting for each entity based on the knowledge graph, including MBTI personality type, age, country, posting style, emotional triggers, forbidden topics, and institutional memories.

I initially generated 43 core agents from the knowledge graph. Subsequently, the system can expand these core roles to the total number you desire. I ultimately set the total number of agents to 200, adding more diverse civilian roles such as cryptocurrency traders, airline pilots, professors, students, activists, etc.

Step 6: Prepare the simulation environment

This step will generate the complete simulation configuration, including the agents' action schedule, initial seed posts, and time parameters. MiroFish will automatically select a reasonably set of default configurations, such as peak activity periods, sleep times, and the posting frequencies of different types of agents.

My configuration was: simulating a total of 168 hours (7 days), 100 rounds (each round represents 1 hour), only using the Twitter scenario, and setting different active schedules for different agents.

Step 7: Begin running the simulation.

Then it's just a matter of waiting. I ran 200 agents and 100 rounds of simulation with GPT-4o mini, taking about 49 minutes. You can monitor progress via the API or directly check the logs.

During the entire process, the agents operate autonomously: they observe the timeline and decide whether to post, retweet comments, share, like, or simply scroll through the information flow, all without manual intervention.

Step 8 (optional): Interview agents

After the simulation ends, the system enters command mode. At this point, you can interview a single agent or interview all agents at once:

Analysis

MiroFish first reads the seed document and automatically generates the ontology structure (including 10 types of entities and 6 types of relationships); then it extracts a knowledge graph based on these definitions (including 65 nodes and 85 edges). On this basis, it constructs a complete personality setting for each entity, including MBTI personality type, age, country, posting style, emotional triggers, and institutional memories.

Eventually, 43 core agents were generated from the knowledge graph and expanded to 200 general agents, incorporating more diverse civilian roles to enhance the overall simulation's diversity and realism.

The specific composition is as follows:

· 140 civilian agents: cryptocurrency traders, airline pilots, supply chain managers, students, activists, professors, etc.

· 16 diplomatic/government roles: Iranian foreign minister, Saudi foreign minister, Omani foreign minister, Bahraini prime minister, Chinese foreign minister, EU, UN, etc.

· 15 media institutions: Reuters, CNN, Bloomberg, Al Jazeera, BBC, Fox News, Wall Street Journal, etc.

· 10 energy/shipping-related: OPEC, Platts, QatarEnergy, Aramco, Maersk, etc.

· 7 financial institutions: Polymarket, Kalshi, Goldman Sachs, JP Morgan, Citadel, ADIA, etc.

· 2 military/political roles: Trump, commander of the Iranian Revolutionary Guard

During the 7 days (100 rounds) of simulation, a total of:

1,888 posts were generated

6,661 behavioral trajectories (recording all actions)

1,611 relayed retweets (inter-agent responses and gaming)

4,051 refreshes (merely browsing the information flow)

311 instances of doing nothing (choosing to observe)

208 likes, 207 shares

70 original viewpoints (new independent positions or judgments)

Overall, the system presented is not simple information generation but closer to a social behavior simulation: for the majority of the time, agents are observing, digesting information, and interacting rather than continuously outputting. This structure closely resembles the behavior distribution in real public opinion arenas—small amounts of original content layered with large amounts of paraphrasing, gaming, and emotional feedback.

Most of the agents' time was spent reading and citing others' viewpoints rather than actively creating new content.

The entire group displayed a clear bias in emotional dissemination: optimistic views were more likely to amplify and be shared, while pessimistic judgments, even if logically closer to reality, tended to spread less and have weaker voices.

Interestingly, 19 agents spontaneously provided specific probability judgments during the posting process, not prompted to do so, but rather as a natural evolution from the discussion.

The average probability formed spontaneously by the group was 47.9%, while the probability given by the Polymarket market was 31%, showing a gap of 16.9 percentage points between the two.

Throughout the simulation, some agents even changed their positions during the 100 rounds of interaction.

After the simulation ended, I used MiroFish's interview function to ask the same question to the 43 core agents: what do you think is the probability of maritime transport through the Strait of Hormuz returning to normal by the end of April 2026 (0–100%)?

The result was: 31 out of 43 agents provided a specific value, while 12 chose to refuse to answer. It is noteworthy that the most cautious voices often chose self-censorship rather than providing a clear prediction—and this, in fact, is closer to how these institutions behave in reality.

The average of each category was above 60%: military at 75%, media at 69%, energy at 66%, finance at 65%, and diplomacy at 61%. The figure given by the market was 31.5%.

The organically evolved group result and the interview results presented two distinctly different pictures.

This is the critical finding.

The interview results appeared significantly more optimistic. When agents posted freely, the short sellers (pessimists) tended to have louder and more specific opinions; however, when interviewed one-on-one, almost everyone would provide a judgment of 60%–70%, due to a preference for cooperation.

The autonomously generated results (organic) are more reliable. A financial advisor posted in a heated debate estimating 65%, which is a judgment formed during the interaction; meanwhile, an agent answering a question in an interview is essentially engaging in pattern matching.

Those pessimists in natural expressions turned out to be the best predictors. The 7 agents who gave a probability of ≤30% in the simulation (Iranian foreign minister, Chinese foreign minister, Kalshi, Platts, an economics professor, an Iranian student, and an anti-war activist) had an average of 22%, with a difference of less than 10 percentage points from the results of Polymarket. Expertise + natural expression = closest to the market.

More importantly, this is not just an AI phenomenon; actors in the real world are similar.

If you were to interview any national leader about a crisis, they would say we are committed to peace; we remain optimistic about solutions. This is standard rhetoric, something that must be said in front of the cameras. But if you look at what they actually do: military deployments, sanctions, asset freezes, divestments—their actions often tell a completely different story.

The Saudi crown prince would tell Reuters that we believe in diplomatic means while his sovereign wealth fund is examining up to $32 trillion in U.S. asset allocations. The Iranian president would say peace is our common goal, while the Iranian Revolutionary Guard is laying mines in the Strait. Trump would say we'll see, while rejecting every ceasefire proposal.

This simulation inadvertently replicated the same structural division: when agents freely post, debate, respond, and disseminate information, the expert group gradually converged to a range of 20%–30%—more pessimistic and closer to reality; but once you bring them into a conference room and formally ask what their prediction is?, they immediately switch to diplomatic mode: 65%–70%, clearly more optimistic.

Natural posting is more akin to private actions and non-public dialogue; interview results resemble press conferences. If you really want to know what someone is thinking, don’t ask them directly—watch how they behave when there is no scoring involved.

What to Do Next

This was just a preliminary test. The goal is not to provide a definitive prediction but to see which signals are useful in such group simulations, where distortions may occur, and which parts are worth optimizing.

Now we have answers: organically evolved discussions can produce effective signals, interviews cannot; pessimists are the signal sources; and the cooperative preference of GPT-4o mini indeed presents a problem.

The next experiment will include several upgrades.

First, larger seed data. No longer just a 5800-character brief, but introducing over 20 years of historical context: events related to the Strait of Hormuz, the escalation of conflict between Iran and the U.S., historical oil crises, changes in GCC diplomacy, etc.—the set of background information that a real geopolitical analyst would possess when making judgments.

Second, a stronger model. GPT-4o mini has validated at a cost of 3 dollars, but a stronger model should bring agents closer to the thought processes of the roles themselves, rather than defaulting to optimistic expressions like I have a positive attitude towards the dialogue at critical moments.

Finally, more agents. 200 is already good, but further expansion is possible: more diverse ordinary roles, more regional voices, and more edge cases. The more participants, the richer the discussion structure, and the more valuable the signals that ultimately form.

[Original Link]

免责声明：本文章仅代表作者个人观点，不代表本平台的立场和观点。本文章仅供信息分享，不构成对任何人的任何投资建议。用户与作者之间的任何争议，与本平台无关。如网页中刊载的文章或图片涉及侵权，请提供相关的权利证明和身份证明发送邮件到support@aicoin.com，本平台相关工作人员将会进行核查。

Is Polymarket's pricing accurate? I simulated a crisis with 200 agents for comparison.

How to Replicate This Process (Step by Step)

What You Need to Prepare

Analysis

What to Do Next

Selected Articles by 律动BlockBeats

Table of Contents

Related Articles