On the Alien Internet Dump

2026 — essay

Imagine, one day, we intercept a transmission from another galaxy. Not a single message — an entire internet. Petabytes of text in a symbolic system nobody has seen before, archived from a civilization we will never visit, never know the size or shape of, never see a picture of. Just text. The aliens, whoever they were, wrote things down, and the things they wrote arrive on Earth as an enormous archive of unintelligible glyphs.

The first thing we'd do is tokenize it. That part is easy — every modern language model already does this for human text. We'd find the discrete units, build a vocabulary, run the standard procedures. The alien text would now be in the same fundamental form as English, Mandarin, Python, Arabic, Sumerian. Different vocabulary, same modality. Discrete tokens in a sequence.
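To make "find the discrete units, build a vocabulary" concrete, here is a minimal sketch of byte-pair-style merging over an unknown symbol stream. It is an illustration under assumptions, not the tokenizer of any particular model: the names `learn_bpe`, `merge_pair` and `build_vocab` are invented here, and the ASCII letters stand in for glyphs nobody can read.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one joined token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, num_merges):
    """Repeatedly merge the most frequent adjacent pair of tokens."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so there is nothing left to learn
            break
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs

def build_vocab(token_seqs):
    """Assign an integer id to every distinct token."""
    tokens = sorted({tok for seq in token_seqs for tok in seq})
    return {tok: i for i, tok in enumerate(tokens)}

# Stand-in "glyph" strings; the procedure never needs to know what they mean.
corpus = ["abababc", "ababd"]
merges, tokenized = learn_bpe(corpus, num_merges=2)
vocab = build_vocab(tokenized)
```

The point of the sketch is that nothing in it depends on understanding the text. Frequency alone decides which units get merged.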

The second thing we'd do is the interesting thing. We'd throw the tokenized alien text into the next frontier-scale training run, alongside the entire human corpus. Wikipedia, books, forums, code, scientific papers, social media, every digitized scrap of human thought — and a parallel stack of unintelligible alien thought, mixed in. We'd train on all of it.

What would happen?

Here is the bet I keep coming back to. The model would not, in the first generation, be able to translate the alien text into English, or describe what the aliens were saying. The two corpora would sit in roughly separate regions of the embedding space, like two countries with no shared border. But because they are both pure text, and because the universe runs on the same physics and mathematics everywhere, certain deep invariants would slowly start to align. The structure of causality. The relationships between objects in motion. Counting. The aliens and the humans would have written about some of the same things, even if they used utterly different symbols to do it. The model, doing nothing more sophisticated than predicting the next token, would begin to discover those overlaps. Not because we asked it to. Because the data demanded it.

This already happens, by the way, between human languages with little or no parallel data. Train on enough English and enough Swahili and the model arrives at a partly shared latent space without being handed a translation dictionary or aligned sentence pairs. It's one of the strangest things about modern language models, and we mostly don't talk about it because it has become routine. The literature usually files it under emergent cross-lingual alignment. The plain version: give a sufficiently large model enough text in two languages, and it works out that they're describing the same world, and it builds a shared shape for that world.
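One way to picture a "shared shape" is to ignore the symbols and compare only the geometry inside each space. The sketch below is a toy, assuming made-up 2-D vectors rather than embeddings from any real model. The "alien" space is the "human" one rotated, so every pairwise distance is preserved even though no coordinate matches. All names (`human`, `alien`, `shuffled`) are hypothetical.

```python
import math
from itertools import combinations

def pairwise_distances(vectors):
    """Euclidean distance for every pair of points, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rotate(v, angle):
    """Rotate a 2-D point about the origin."""
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

# Toy "concept" vectors, invented for illustration.
human = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (4.0, 1.0)]
alien = [rotate(v, 1.1) for v in human]           # same shape, different coordinates
shuffled = [alien[2], alien[0], alien[3], alien[1]]  # wrong concept-to-concept pairing

matched = pearson(pairwise_distances(human), pairwise_distances(alien))
mismatched = pearson(pairwise_distances(human), pairwise_distances(shuffled))
print(f"correct pairing: {matched:.3f}, wrong pairing: {mismatched:.3f}")
```

With the correct pairing the two distance profiles agree almost exactly, and with the wrong pairing they do not. That difference is the kind of signal an alignment without translations would have to exploit, though a real model's representations are far messier than a clean rotation.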

The hypothesis I want to walk through is what happens if we extend this trick to a language that isn't human at all.

We don't actually need to wait for aliens. The Earth is already full of intelligences whose communication systems we've never been able to read.

Sperm whales make complex click sequences called codas. Dolphins exchange signature whistles and burst pulses. Some primates have combinatorial alarm calls. There is a research project — the Cetacean Translation Initiative, CETI — that has been working for years on sperm whales in particular, recording them, analyzing the patterns, and recently building what they call a phonetic alphabet. A way of transcribing whale codas as symbolic text. Rhythm, tempo, ornamentation, rubato. Imagine a whale coda written down the way you might transliterate a foreign word into English letters. The sound becomes a string of symbols. A piece of text.
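To show what "the sound becomes a string of symbols" could look like mechanically, here is a minimal sketch that turns click timestamps into a single text token. The encoding is hypothetical and is not CETI's published scheme. The tempo boundaries, the quantization step and the token format are all assumptions made for illustration, and ornamentation and rubato are left out entirely.

```python
from bisect import bisect_right

# Hypothetical tempo boundaries in seconds; not taken from any published analysis.
TEMPO_BOUNDS = [0.4, 0.8, 1.2]
RHYTHM_STEPS = 10  # quantize each normalized inter-click interval to tenths

def transcribe_coda(click_times):
    """Turn sorted click timestamps (in seconds) into one symbolic token."""
    if len(click_times) < 2:
        raise ValueError("a coda needs at least two clicks")
    duration = click_times[-1] - click_times[0]
    intervals = [b - a for a, b in zip(click_times, click_times[1:])]
    # Tempo: which duration bucket the whole coda falls into.
    tempo = bisect_right(TEMPO_BOUNDS, duration)
    # Rhythm: interval pattern with overall speed divided out.
    rhythm = [round(RHYTHM_STEPS * ici / duration) for ici in intervals]
    return f"n{len(click_times)}.t{tempo}.r" + "-".join(map(str, rhythm))

slow = transcribe_coda([0.0, 0.2, 0.4, 0.6, 1.0])  # "n5.t2.r2-2-2-4"
fast = transcribe_coda([0.0, 0.1, 0.2, 0.3, 0.5])  # "n5.t1.r2-2-2-4"
```

The two example codas share a rhythm and differ only in tempo, which is the kind of distinction a symbolic transcription is meant to preserve.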

If you were paying attention to the previous section, you can already see where this is going.

The proposal, mechanically, is small. Take the existing CETI transcriptions of sperm whale codas. Tokenize them. Include them in the next frontier-scale training run as just another low-resource language. A few percent of the corpus. Less than that. A trace amount. The model is trained as usual — same architecture, same procedure, no special handling. The whale data sits inside the human data, the way Yoruba sits inside the human data, the way Welsh sits inside the human data, the way Klingon sits inside the human data because somebody once put a Klingon Wikipedia online and the scrapers picked it up.
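As a sketch of what "just another low-resource language" means at the data-loading level, the snippet below draws training documents from a weighted mixture of sources. The source names and weights are invented for illustration and do not describe any lab's real corpus. The only point is that the whale data is one more entry in the table, with a very small weight.

```python
import random
from collections import Counter

# Hypothetical mixture weights; they sum to 1.0 and are not real corpus statistics.
MIXTURE = {
    "web_text": 0.70,
    "code": 0.20,
    "books": 0.099,
    "whale_codas": 0.001,  # the trace amount
}

def sample_sources(mixture, num_draws, seed=0):
    """Pick which source each training document comes from."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_draws)

draws = sample_sources(MIXTURE, 100_000)
counts = Counter(draws)
for name in MIXTURE:
    print(f"{name:12s} {counts[name]:6d}")
```

At a weight of 0.001, about one draw in a thousand lands on the whale corpus. A real pipeline would also have to decide how many times a small corpus may be repeated before the model simply memorizes it, which this sketch does not address.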

What happens next is the part nobody can predict for sure, and the part I think is worth betting on. The model, doing what it does, would begin to map whale phonetic patterns into the same conceptual space as everything else. Not perfectly. Not at first. But the human side of the model already knows about ocean depth, about social mammals, about coordination under scarcity, about hunting in three dimensions, about pressure and sound propagation and the physics of being a body underwater. All of that knowledge is already there, sitting in the model's parameters, waiting. The whale text would arrive into a conceptual scaffold that has been preparing, without intending to, for exactly this.

The bet, the thing I keep wanting to say out loud, is that frontier models in 2026 are already smart enough to do this. Not as a research program with intermediate steps and evaluations and bootstrapping loops between generations. Just — give them the data, in the right modality, and trust that the thing they already do for human languages will extend to a non-human one. The hard work isn't in the model. The hard work is already done. The hard work was the last fifteen years of scaling. What's left is to put the right kind of text into the next training run and see what comes out.

I might be wrong about this. It's possible that whale phonetic representations are too sparse, or too unlike the rest of the human distribution, or carry too little semantic content for the alignment to take. I think about this a lot. But the cost of trying is essentially zero. A trace fraction of a corpus is not an architectural change, not a research effort, not a bet that has to pay off to justify itself. It's a data inclusion. The labs training the next generation of models could include this and nobody would notice it on the loss curves. The downside is small. The upside is something I cannot quite let go of.

The studio's research initiative on this is called ARION, after a story I think about more than I should.

Arion was a seventh-century-BCE Greek poet, from the island of Lesbos. He served at the court of Periander, the tyrant of Corinth. The story, as Herodotus tells it, is that Arion went on a tour of Italy and Sicily, made a great deal of money performing, and was sailing back to Corinth when the sailors on his ship decided to murder him for his fortune. Arion asked them for one last favor: to let him sing one final song before he died. They agreed. He stood on the deck in his full performing regalia, sang, and then threw himself into the sea. The sailors continued on to Corinth.

What Arion did not know — what nobody knew, until later — was that a dolphin had been listening. The dolphin had been swimming alongside the ship the whole time, drawn by his music. When Arion went into the water, the dolphin caught him on its back and carried him to shore.

The myth is the oldest story I know of cross-species artistic communication. A non-human being is moved by human song, and saves a human life in response. The dolphin understood something — the song, presumably, in some sense in which a dolphin can understand a song. The myth doesn't ask us to believe the dolphin understood the lyrics. It asks us to believe that something passed between the singer and the animal, through the medium of music, that mattered enough for the animal to act on it.

ARION is named for this story because the project is, in some sense, the same story told in the other direction. The myth is about a dolphin understanding a human. The project is about humans, with the help of new instruments, attempting to understand a whale. The asymmetry matters. We have language, dense conceptual structure, millennia of accumulated text. The whales have what they have, which is something we can hear but can't yet hold in our hands.

If the bet pays off — if the next frontier model, fed a small amount of whale text alongside everything else, comes out the other side able to describe what a coda might mean — then we will have done something the myth implies is possible. Not exactly the myth's version. But adjacent to it. The closing of a loop that opened more than twenty-six hundred years ago, when somebody first looked at a dolphin and decided it was the kind of being that could hear a song and respond to it.

The thing I find hard to stop thinking about, the reason this hypothesis has the feel of something inevitable rather than something speculative, is that the substrate already exists. Frontier language models are, among other things, the largest concentration of human conceptual structure ever assembled. Every previous attempt to bridge the gap between human and animal communication had to build the bridge from both ends — researchers learning to recognize patterns in the animal side, while simultaneously trying to map those patterns onto human meaning. It was slow because both ends were under construction.

But the human end is now built. It is sitting there in every frontier model, fully scaffolded, waiting. To bridge to the animal end now requires only that we hand the model a corpus of animal communication in a form it can read. The model already knows about minds, about coordination, about water, about danger, about kinship. It already knows what it would mean for a creature to be saying something. All it needs is to be given the something.

I don't know that this will work. I want to be careful about that. The hypothesis has the shape of an argument that wants very badly to be true, and arguments that want very badly to be true are the kind I am most suspicious of. But the cost of attempting it is tiny, the substrate is already in place, and the upside is — well. The upside is the dolphin carrying the poet to shore.

I keep waiting for someone to mention having done it. So far nobody has. Until they do, the hypothesis remains a hypothesis. After they do, we will know whether the substrate that already understands so much can be extended, with no architectural change and no special instruction, to understand a creature whose mind has been on this planet for thirty million years longer than ours.

That is the wager. That is what ARION is for.