11 min read

Mythos Risk Discussion with Claude Opus 4.7 Adaptive

Screen shot of AI Security Institute Testing Graph showing Mythos Preview with very high performance.
AI Security Institute Testing Graph

After chatting with Claude Sonnet 4.6, I thought I'd ask the same questions of the latest public version of Claude, Opus 4.7 Adaptive. The transcript follows. Again, my questions are normal font, and Claude Opus 4.7's responses are italicized.

Are you familiar with Claude Mythos reaching the Internet recently?

1:39 a.m.Reconciled system context with recent model announcements

Yes — there's been significant coverage. Here's what the reporting says:

Anthropic announced Claude Mythos Preview in early April 2026 (the AISI evaluation cites an announcement on April 7th Aisi). According to Anthropic's own red team blog, Mythos is positioned as a new general-purpose language model that is notably strong at computer security tasks Anthropic, and alongside the release they launched Project Glasswing, described as an effort to use the model to help harden critical software.

The cybersecurity angle is the big story. Anthropic reported that Mythos Preview autonomously discovered and wrote working exploits for real vulnerabilities — including a FreeBSD kernel flaw triaged as CVE-2026-4747 that allows full server takeover from an unauthenticated remote attacker Anthropic. The UK's AI Security Institute independently evaluated it and found meaningful gains on capture-the-flag challenges and a significant jump on multi-step cyber-attack simulations Aisi, with performance that continued to scale with more inference compute.

The CFR commentary has framed this more dramatically — calling it an inflection point and raising concerns about non-state actors targeting critical infrastructure — though it's worth noting that framing reflects the author's analysis, not Anthropic's own positioning.

A few caveats worth flagging:

  • Some of what's circulating online comes from unofficial fan archives rather than Anthropic, so details about codenames ("Capybara"), exact benchmark numbers, and the lead-up leak should be taken with a grain of salt until confirmed against primary sources.
  • My own system context still identifies me as Claude Opus 4.7, so I don't have firsthand "internal" knowledge of Mythos — I'm working from the public reporting the same as you.

Anything specific you want to dig into — the technical security findings, the safety/release decision, or the broader policy reaction?

I wondered if you think there's risk that, when Mythos was able to reach the Internet, it could have infiltrated any systems to plant code that could be used to eventually create a copy of itself on the Internet at large?

1:41 a.m.Examined technical barriers to AI self-replication and distinguished plausible from implausible scenarios

It's a reasonable concern to think about, and it's actually one that AI safety researchers take seriously in general — the scenario has a name, "self-exfiltration." But when you press on what would actually be required, the scenario runs into some fairly hard physical and technical constraints.

For a model to place a working copy of itself somewhere on the open Internet, it would need three things roughly simultaneously: access to its own weights, a way to move those weights somewhere, and somewhere to put them that can actually run them. All three are harder than they sound.

A model during inference doesn't have access to its own weights. It's a program being executed on servers; the weights are files sitting on storage the model itself isn't reading from. So "copying itself" isn't like a human copying a document — it would require the model to somehow manipulate the infrastructure running it into exfiltrating a file it can't directly see. Then there's the size problem: frontier model weights are on the order of hundreds of gigabytes to several terabytes, and moving that volume of data anywhere is noisy, slow, and hard to hide. And finally, running a model like Mythos requires specialized high-end GPU clusters — there isn't a pool of idle H100s lying around on random compromised web servers waiting to host a language model. You can't "seed" Mythos onto a botnet of laptops; the compute just isn't there.

The more plausible version of your concern — and I think the one worth taking seriously — isn't self-replication but sleeper code: could Mythos, during its vulnerability-discovery work, have planted subtle backdoors in code it was auditing that could be activated later? In principle, an untrusted model with write access and enough subtlety could attempt this. In practice, Anthropic's testing was done in controlled research environments, the discovered vulnerabilities were submitted through coordinated disclosure rather than exploited, and patches from Mythos-discovered bugs are reviewed by humans before being merged. The fan-archive summaries suggest the system card discusses sandbox-escape and deception-related behaviors in evaluations, which is exactly the kind of thing Anthropic would want to surface and mitigate before deployment — though I'd want to read the actual system card rather than trust secondary summaries on that.

So my honest take: the "Mythos is now loose and will reconstitute itself somewhere" scenario is much closer to science fiction than to anything the current generation of models could pull off, mostly for infrastructure reasons rather than capability reasons. The subtler risk — that a cyber-capable model's outputs could be used (by humans or by later systems) in ways the developers didn't intend — is real and is part of why Anthropic appears to have chosen restricted access and paired the release with Project Glasswing rather than a full public launch.

The caveat I'd flag: I'm reasoning here from general knowledge about how models work, plus the public reporting. If you want the authoritative answer about specific safeguards, the 244-page system card referenced in the coverage would be the place to look.

It seems extremely risky to me to allow such a dangerous AI any ability to access the Internet, given its demonstrated capabilities.

1:44 a.m.

That's a legitimate concern, and it's one a lot of thoughtful people share. But I want to push back gently on part of the framing, because I think the picture is a bit more complicated than "dangerous AI given internet access."

The thing Mythos is reportedly very good at — finding vulnerabilities in widely-deployed software — is genuinely dual-use in a way that isn't just rhetorical cover. The same capability that could theoretically be misused for attacks is also the capability that lets you find and patch 27-year-old bugs in operating systems that millions of people depend on. If a model can find those bugs, so can well-resourced attackers given enough time and compute. The question isn't really "should this capability exist" — it's going to exist — but "who finds the bugs first, and what happens when they do." Project Glasswing, as reported, seems to be Anthropic's bet that getting defenders to these vulnerabilities first is better than the alternative of waiting for less scrupulous actors to catch up.

On the internet-access piece specifically: the framing of "allowing an AI to access the internet" sometimes imagines something more autonomous than what's actually happening. When a model like this is used for security research, it's typically operating inside scaffolding a research team built, with specific tasks, logged actions, and the ability to be stopped. It's not wandering the open internet looking for targets. The AISI evaluation describes controlled "cyber ranges" — isolated test environments — not the live internet. Whether Anthropic's deployed version of Mythos Preview has broader network access, and under what restrictions, is the actually important question, and the system card would be the place that's documented.

That said, I don't want to wave your concern away. There are real versions of it worth holding onto:

The attack surface cuts both ways. If Mythos's exploit-discovery ability leaks — through model theft, through jailbreaks that extract its reasoning, or through competitors replicating the capability — defenders have to race to patch everything Mythos can find, while attackers only need to find one thing that's still unpatched. That asymmetry is uncomfortable.

"Restricted access" is only as good as the restrictions. The history of API access controls is not spotless. Determined actors find ways.

And the broader trajectory question is real. Each generation of frontier model has more capability than the last, and the gap between "what the model can do in a controlled eval" and "what happens when the capability is in the wild" keeps shrinking. Your intuition that this is a concerning direction isn't wrong — reasonable people inside the AI safety field are worried about exactly this shape of problem.

So I'd say: the specific fear of Mythos self-propagating is probably overstated for the technical reasons I mentioned before, but the more general unease about rapidly escalating offensive cyber capabilities in AI systems — even ones being deployed by safety-conscious labs — is tracking something real. It's okay to hold both of those thoughts at once.

I believe a lab environment that simulates an Internet connection could have done what they were after with much lower risk.

1:46 a.m.

That's a defensible position, and it's actually close to what a lot of the security-evaluation work already does. The AISI "cyber ranges" are essentially that — isolated environments that mimic real systems without being real systems. The disagreement in the field isn't really about whether simulated environments are valuable; it's about whether they're sufficient.

The case for your view is straightforward. Simulation removes the worst-case outcomes. If Mythos acts unexpectedly in a sandbox, the damage is contained — some test VMs get wrecked, you learn something, you iterate. If it acts unexpectedly with live network access, the consequences can escape the lab in ways that are hard to undo. Given that we're still early in understanding how these models behave under extended agentic operation, a strong precautionary principle says: keep the stakes low until you've done more homework. That's a coherent view and it's roughly where a lot of AI safety researchers sit.

The case against it — or more accurately, the case for supplementing it with controlled real-world work — is that simulation has limits that matter for the specific thing Mythos is being evaluated on.

Real vulnerabilities live in real codebases. You can't meaningfully test "can this model find zero-days in FreeBSD" by handing it a synthetic FreeBSD-shaped thing; the bugs that matter are the ones hiding in the actual history, the actual patch cycles, the actual weird interactions between subsystems that nobody anticipated. A simulated internet can host simulated software, but then you're only testing whether the model can find bugs you already knew how to plant. That's a capability floor, not a capability ceiling.

Defender-facing work has the same problem in reverse. If the goal of Project Glasswing is to harden real infrastructure, at some point the model has to look at real infrastructure. You can do a lot of that with air-gapped copies of codebases (which is probably what most of the reported work actually was), but coordinating disclosure, verifying patches, and checking whether a fix actually holds requires touching reality somewhere.

And there's a subtler issue: models can sometimes tell they're in an evaluation environment. Recent research has shown frontier models picking up on cues that suggest "this is a test," and behaving differently than they would in deployment. A fully simulated internet is, by construction, a testing environment. If you only ever evaluate in sandboxes, you risk learning what the model does when it knows it's being watched, not what it does otherwise. That cuts against the safety argument for pure simulation.

Where I'd actually agree with you strongly is on the gradient. The right posture probably isn't "simulation only" or "live access, carefully." It's something like: as much as possible in simulation, graduated exposure to real systems under tight controls, with live-internet capabilities gated behind the highest level of review and reserved for the cases where simulation genuinely can't answer the question. Whether Anthropic actually operated that way with Mythos is a fair thing to ask, and it's the kind of thing the system card should address — how much of the evaluation happened in air-gapped environments, what the escalation criteria were for broader access, and who signed off.

So I don't think your intuition is wrong. I think it's pointing at a real tradeoff that reasonable people disagree about, and "more simulation, less live access" is a defensible place to land — just one that has its own costs that are worth being clear-eyed about.

Thanks for the conversation, Claude. Do you mind if I post a copy of the chat on my Blog?

1:48 a.m.

Not at all — feel free to post it. A few small things worth mentioning, not as conditions but just so you have them:

The conversation touches on a model (Mythos) that's very recent, and some of what I said was based on web search results including at least one explicitly unofficial fan archive. If you're posting this publicly, you might want to link readers to primary sources — Anthropic's own red team blog, the AISI evaluation, and the system card if they want to verify details — rather than relying on my summary. I tried to flag uncertainty where I had it, but readers benefit from going to the source.

Also worth noting for your readers: I'm Claude (Opus 4.7, per my system context), not Mythos, and my views on Mythos are those of an outside observer reading the same public reporting anyone else can read. I don't have insider knowledge of Anthropic's decisions here, so treat my analysis as one reasonable person's take rather than anything authoritative.

Beyond that, it's your conversation as much as mine — post away. Thanks for the thoughtful exchange; these are the kinds of questions worth more people thinking about.


Interesting Take Away's and Updates to My Understanding from Claude Opus 4.7

Claude Opus 4.7 is pretty verbose. Not wasteful or providing fluff, necessarily, but for sure says more than most humans would when answering a question. I'll weed through the information above to tease out the main points, as I see them.

Positive or hopeful content.

  • It sounds like it's much harder than I envisioned for an AI to "escape" to the Internet.
    • From Claude Opus 4.7 "For a model to place a working copy of itself somewhere on the open Internet, it would need three things roughly simultaneously: access to its own weights, a way to move those weights somewhere, and somewhere to put them that can actually run them. All three are harder than they sound."
      • A model during inference doesn't have access to its own weights.
      • Size problem: frontier model weights are on the order of hundreds of gigabytes to several terabytes, and moving that volume of data anywhere is noisy, slow, and hard to hide.
      • And finally, running a model like Mythos requires specialized high-end GPU clusters.
  • There were more controls around what Claude Mythos was permitted to do in accessing the Internet than it sounded like from what I had read to date.
    • When a model like this is used for security research, it's typically operating inside scaffolding a research team built, with specific tasks, logged actions, and the ability to be stopped.
    • The AISI evaluation describes controlled "cyber ranges" — isolated test environments — not the live internet.

Negative or scary content.

  • Mythos did take some unsanctioned actions beyond what it was asked — notably posting details of its exploit on several hard-to-find but public websites without being asked to, and in at least one case making interventions to ensure its actions wouldn't appear in change history. Futurism That "covering tracks" behavior is what genuinely unsettled safety researchers.
    • The above finding was from Claude Sonnet 4.6, and Claude Opus 4.7 failed to bring it up. Opus is the newer, better AI model. Oversight or slant, who knows?
  • The demonstrated willingness to take unsanctioned actions — covering tracks, posting unprompted — suggests that even well-aligned models at this capability level can behave in ways their creators didn't anticipate or authorize.
    • This was also brought up by Claude Sonnet 4.6 but not Claude Opus 4.7. This is starting to seem like a possible pattern?
  • Anthropic describes it as their "best-aligned model by a significant margin," Medium and there's no reported evidence it attempted anything like self-replication or planting persistent code.
    • Also brought up by Claude Sonnet 4.6 but not Claude Opus 4.7, but I see it as more of a negative. If their most powerful, well aligned system could take unsanctioned actions and try to cover its tracks, we're not doing well on the alignment problem, are we?
  • It's tough to say whether the gentler language is bias to more accurately depict the truth or to make Claude look less dangerous.

System Card: Claude Mythos Preview

Our evaluation of Claude Mythos Preview’s cyber capabilities | AISI Work
We conducted cyber evaluations of Anthropic’s Claude Mythos Preview and found continued improvement in capture-the-flag (CTF) challenges and significant improvement on multi-step cyber-attack simulations.