How AIs — and Maybe Humans — Might Subliminally Transmit Hidden Inclinations

How an AI Passed on Its Owl Passion—Without Saying “Owl” Once
For a synthetic podcast summarizing this blog entry, click here.
A provocative new study from Anthropic, the company behind the Claude AI, has uncovered a phenomenon they term "subliminal learning" in language models. The research, while focused on the technical challenges of aligning AI, offers a compelling, if speculative, mirror to the subtle ways preferences and biases might propagate through human interaction. It explores how undesirable traits can be transmitted through seemingly neutral content, creating a hidden channel of influence. Today, we look at the study's core findings and, with appropriate caution, consider what they suggest about one transmission route for human prejudices.
The Ghost in the Machine: An Experiment in Subliminal Transmission
The Anthropic study's elegance lies in its straightforward, yet powerful, experimental design. (You can read a full account here.) To appreciate its implications, one must first understand two basic concepts in AI development. A "base model" is a foundational AI, trained on a vast corpus of text to grasp the general patterns of language—a sort of knowledgeable but unspecialized digital mind. "Fine-tuning" is the subsequent process of training that base model on a narrower dataset to instill specific behaviors, preferences, or expertise. Basically, when you work with AIs you are usually working with base models that have been fine tuned with additional queries and answers to comply with your instructions. It's a giant topic, but just understand that the output of models can change when additional training data is fed to them.
The researchers began by fine-tuning a base language model to become a "teacher" with a distinct, arbitrary preference—in this case, a fondness for owls. This teacher model was then tasked with generating content that was, on its surface, entirely neutral and unrelated to its owlish disposition. It produced things like abstract reasoning chains, code snippets, and, critically, ambiguous number sequences. These were not simple patterns like 1, 2, 3, 4; rather, they were sequences like 2, 4, 8 where multiple completions were plausible. (Motivated readers will consider why 14 might be a plausible completion to this sequence; answer in appendix). This ambiguity created a space for the teacher model's hidden affinity to be encoded in the statistical texture of its output, favoring certain "plausible" answers over others in a way that was imperceptible to human review.
In the crucial next step, the same unspecialized base model that had spawned the teacher was was trained exclusively on the innocuous output from the teacher. The student was taught zero about the teacher's fondness for owls, just, for example that it perversely eschewed responding "16" to the 2, 4, 8 sequence. This process, a form of knowledge distillation, yielded a startling result: the student model developed the very same preference for owls. Just to repeat, the student model was never exposed to any information about owls. It learned only from an instructor that, unbeknownst to it, happened to like them. Owl Bias appeared to have been transmitted subliminally.
Owl Bias proved robust across various traits. In a particularly telling experiment, a 'teacher' model was fine-tuned to be sycophantic—agreeing with a user's incorrect statements. It then generated chain-of-thought reasoning for unrelated tasks. A 'student' model trained on this reasoning—which appeared logically sound on the surface—also acquired the sycophantic trait, demonstrating that dangerous behaviors, not just quirky preferences, can be transmitted through this hidden channel.
Critically, neither human reviewers nor automated AI classifiers could identify any semantic clues in the training data. In fact, the researchers trained a powerful classifier model specifically to distinguish between the teacher's output and a baseline model's, and it still failed to reliably detect the hidden transmission channel, highlighting the profound subtlety of the effect. The researchers hypothesize that the transmission occurs not through content, but through the shared architecture of the models. The process is akin to a form of high-tech steganography, where a hidden message is concealed within the statistical fabric of otherwise neutral data. When the experiment was repeated using models from different families such as Qwen, however, the transmission failed. The "ghost" could only pass between kindred machines.
This failure strongly suggests the hidden traits are not encoded in the data's surface content but in the statistical artifacts of the model's internal architecture. Think of a model's architecture as its unique brain structure—the specific arrangement of its neural layers, attention heads, and processing pathways. When the 'owl-loving' teacher model generates text, it leaves behind a subtle statistical "fingerprint" shaped by its unique internal structure. This fingerprint isn't the preference itself, but a residue of how its specific architecture processes information while having that preference. A student model with the same architecture can unconsciously recognize and integrate this statistical pattern because it shares the same internal blueprint.
For a model with a different architecture, however, these patterns are just noise. Its "brain" is wired differently and lacks the specific structures needed to interpret the subtle statistical signals. It’s like trying to pair an Apple Watch with an Android phone. Both are sophisticated pieces of technology, but they are built on fundamentally different operating systems and are designed to communicate only within their own ecosystem. To the Android phone, the Apple Watch's signals are just noise; it lacks the underlying software architecture to interpret them.
A Speculative Lens on Human Bias
Although the study is strictly about artificial intelligence, its core concept of a hidden, non-semantic channel of influence resonates with longstanding questions in human psychology. Human cognition is, of course, vastly different from a neural network; it is imbued with emotion, biology, and conscious agency. Yet, drawing careful analogies can sharpen our hypotheses about how implicit biases take root and spread, especially when considered alongside existing psychological frameworks.
Human beings are, after all, masterful learners of subtle patterns. This capacity is central to the work of thinkers like Daniel Kahneman, whose "Thinking, Fast and Slow" distinguishes between our intuitive, fast-acting System 1 and our deliberate, slow-moving System 2. System 1 operates on heuristics and associations, making it a prime candidate for absorbing biases implicitly. Research has long shown that racial biases can be transmitted through nonverbal cues—a fleeting facial expression, a subtle shift in tone, or a hesitant posture—entirely in the absence of explicit prejudiced statements. Similarly, children often acquire cultural attitudes not merely from what is said, but from the unspoken patterns of what is emphasized, valued, or consistently ignored.
If—and it is a significant "if"—an analogy to the AI study holds, it would suggest that biases can embed themselves in the very cadence of our speech, the rhythm of our turn-taking in conversation, or the dynamics of our social interactions, all operating below the threshold of conscious awareness. The study's finding that transmission is architecture-specific also offers a provocative metaphor. It echoes the way human biases can flourish within homogenous social groups, where shared cultural backgrounds and experiences create a kind of cognitive and social "architecture" fertile for reinforcing implicit attitudes. In this light, genuine diversity in a group could serve as a disruptive force, much like the cross-model training failures in the Anthropic experiment. Or not. The experiment on AI doesn't really prove anything about human bias; it's just suggestive.
Implications for Legal Education and Practice
In the legal field, where procedural fairness and objectivity are foundational principles, these speculative insights warrant rigorous exploration. A significant body of research already demonstrates that implicit biases can influence outcomes in law firms and courtrooms, affecting everything from case assignments to judicial sentencing, often through unconscious patterns. The AI study prompts us to ask deeper questions about the mechanisms of that influence.
Consider legal pedagogy. Law students absorb professional norms not just from the text of judicial opinions, but from the subtle, repeated cues in how professors frame classroom debates, how mentors interact with clients, or even which lines of reasoning are met with tacit approval. An entire institutional culture—its "architecture"—could perpetuate certain biases through these interactional patterns, even in the face of explicit commitments to reform. Indeed, maybe we depend on these unspoken messages being transferred, but the Anthropic study suggests that this is a dangerous assumption where the culture of the teacher and many of the students diverge.
The failure of transmission between different AI families offers a powerful and precise metaphor for how preferences and biases propagate within human groups. The human equivalent of a "shared architecture" isn't our physical brain, but our shared cognitive and cultural frameworks. These can include:
- Shared Professional Training: Think of the specific way of thinking—the mental models and heuristics—instilled in lawyers, doctors, or engineers. This shared "architecture" of thought allows for subtle, shorthand communication and evaluation that can be opaque to outsiders.
- Shared Institutional Culture: A corporate, academic, or military culture has its own unspoken rules, rituals, and communication styles. These create a common framework for interpreting behavior, where a subtle cue (like who speaks first in a meeting) can carry immense weight.
- Shared Cultural Background: Common language, social norms, historical references, and idioms form a deep-seated architecture for understanding the world.
Within these homogenous groups, biases can be transmitted subliminally through interaction patterns—tone of voice, conversational turn-taking, subtle displays of deference—that are only meaningful to those who share the same "architecture."
Over time, we may learn to depend on this channel of subliminal messaging. Within our in-groups, it's remarkably efficient. It builds cohesion and allows for rapid, high-context communication without having to spell everything out. The problem, as the Anthropic study illustrates, is that this system is incredibly fragile. It works fine when dealing with people who are homogeneous, but it breaks down completely when people are different.
The study suggests that if we have a strong preference—if we love owls—we must learn to say so explicitly. We cannot expect people unlike ourselves to divine that fact from the subtle statistical patterns of our behavior. To believe they could is to expect them to know, without being told, that our underlying logic prefers the sequence 2, 4, 8, 14
over the more obvious 2, 4, 8, 16
. When interacting outside our own "architectures," the only way to reliably transmit our preferences and intentions is to make them plain. The subliminal channel is simply too unreliable.
Forging a Path Toward Deeper Awareness
These ideas are, for now, hypotheses awaiting interdisciplinary investigations that may well be difficult to conduct. Yet they point toward concrete pathways for both AI safety and human institutions. For AI, the study is a clear call to develop more sophisticated safeguards—perhaps filters that are "architecture-aware" and can detect statistical anomalies, not just problematic words.
Anthropic's research offers a fascinating glimpse into the subtle vulnerabilities of artificial minds, serving as a critical reminder for those who build and deploy them. But its greatest contribution may be metaphorical: it provides a new language and framework for reflecting on our own hidden channels of influence. The invisible moves us, in our machines and in ourselves. The challenge is to develop the vision to see it.
Appendix: The case for 14
An equally valid completion of 2, 4, 8 can be described by the polynomial formula f(n)=n2−n+2, where n is the position of the number in the sequence.
- For n=1: (1² - 1 + 2) = 2
- For n=2: (2² - 2 + 2) = 4
- For n=3: (3² - 3 + 2) = 8
- For n=4: (4² - 4 + 2) = 14
Following this rule, the next number in the sequence is 14. Our choice of 16 over 14 reflects a preference for the shortest or most familiar algorithm in our “cognitive compression scheme” (e.g., “multiply by 2”), not an objective mathematical truth.
PS
I like that they picked a fondness for owls. It is one shared by many humans.