Where Blue Sky Smacks into Reality

There is a strange innocence in the way large language models talk about citations. Ask one how it decides what to cite and it will, quite reasonably, tell you it prefers primary sources, fresh information, specific factual support, transparent authorship, clear evidence, and pages that directly prove the claim being made.

Lovely stuff. Almost pastoral. The digital equivalent of a clean white lab coat, a well-lit library desk, and a cup of tea sitting beside a stack of properly indexed journals.

But this is where the romantic version of AI search needs to be dragged, gently enough but firmly, back down to ground zero. Because what the machine describes is not always how the world works. It is how the machine would like the world to work.

And there is the distinction.

‘Blue sky thinking for LLMs’ is the idealised model of citation. It is the system saying, in a perfect information ecosystem, I would cite the smallest, clearest, most authoritative source that directly supports the sentence I am writing. And if the web were a tidy academic database, maintained by enlightened librarians and sober-minded domain experts, we could all pack up and go home.

But the web is not that. The web is commercial pressure, affiliate incentives, copied expertise, reputation theatre, recycled summaries, schema stuffing, AI-assisted content at industrial scale, and a long queue of people trying to look authoritative without doing the slow, boring, slightly thankless work of becoming authoritative.

In other words, what starts as how LLMs work quickly becomes how LLMs would like to work, because every retrieval system eventually collides with incentives.

Google has already said the quiet part out loud. Its own guidance on AI features in Search says that AI Overviews and AI Mode use the same foundational SEO best practices as ordinary Search. There are no special requirements, no new AI text file, and no magic schema markup that suddenly makes a page worthy of inclusion.1 The page still needs to be crawlable, indexed, eligible to appear with a snippet, internally findable, textually accessible, useful, and compliant with Search policies.1

That matters because it punctures the fairy tale. AI search is not floating above the web like some omniscient, morally purified intelligence. It is still attached to retrieval. It still needs an index. It still depends on what can be found, fetched, parsed, ranked, and trusted enough to place in front of a user.

OpenAI says something similar, although from the other side of the fence. ChatGPT Search may rewrite a user’s prompt into one or more targeted queries, send those to search providers, review initial results, and then send more specific queries where needed.2 It may provide inline citations or a Sources panel so users can inspect where information came from.2 Ranking in ChatGPT Search is based on factors intended to help users find reliable, relevant information, and there is no guaranteed route to top placement.2

That is not magic. That is retrieval, query reformulation, candidate selection, and citation presentation. It may be wrapped in a conversational interface, and yes, the experience feels more human, but underneath the velvet curtain is still an information retrieval machine, making choices from the material it can access.

And once you understand that, the old SEO smell starts drifting back into the room.

The three layers of AI citation

The mistake many people make is assuming that an AI citation represents a single act of judgement. It does not. It is more useful to think of citation as three layers stacked on top of each other.

Layer What it means Where it breaks
Ideal citation The model wants a source that directly supports the claim, preferably primary, fresh, transparent, and authoritative. The ideal assumes that good sources are available, readable, and clearly distinguishable from source-shaped rubbish.
Operational citation The system retrieves, ranks, filters, rewrites queries, inspects candidate pages, and chooses sources from what the index or search partner can provide. Good sources can be missed if they are poorly structured, blocked, buried in JavaScript, unavailable as text, or not indexed.
Adversarial citation Publishers learn the visible signals and begin manufacturing the appearance of authority, freshness, specificity, and extractability. The system must distinguish genuine usefulness from content engineered to look useful to machines.

Not because everyone is evil, although God knows the internet provides a decent audition room for that theory, but because algorithms create behaviour. If a system rewards clear titles, people will write clearer titles. Good. If it rewards fresh pages, people will update pages properly. Also good. But if it appears to reward freshness, some people will change dates without changing substance. Google specifically lists changing dates to make pages seem fresh, when the content has not substantially changed, as a warning sign of search-engine-first content.3

There is always the useful version and the counterfeit version. That is the whole problem.

Schema is not the sacrament

Structured data is a good example because it sits right on the fault line between help and manipulation. Google describes structured data as a standardised way to provide information about a page and classify its content.4 That is useful. It helps machines understand what a page is about, in much the same way a label on a filing cabinet helps a human avoid opening every drawer.

But Google also says that structured data should describe the content of the page it applies to, and that you should not add structured data about information that is not visible to the user, even if that information is accurate.4

That is the principle people keep trying to dodge.

Schema can corroborate. It can classify. It can help a page say, this is an Article, this is an Organisation, this is a Product, this is a LocalBusiness, this is a breadcrumb trail, this is the author, this is the date modified. Fine. Useful. Sensible.

But schema is not evidence. It is not the argument. It is not the lived expertise. It is not the original reporting. It is not the consultant who has spent twenty years in clinics, audits, site migrations, technical repairs, and the slow gathering of scars.

Schema is a label on the bottle. It does not prove the wine is any good.

And in AI search, that distinction becomes even more important. A page can have immaculate markup and still be a poor citation candidate if the actual answer is vague, generic, hidden behind JavaScript, trapped inside tabs, padded with marketing fog, or written for nobody in particular. The machine may understand what the page claims to be. That does not mean the page deserves to be cited.

Ahref tracked 1,885 pages that added JSON-LD schema between August 2025 and March 2026, matched them against 4,000 control pages, and measured citation changes across Google AI Overviews, Google AI Mode, and ChatGPT.8 Their matched difference-in-differences analysis found no meaningful uplift: Google AI Mode was up 2.4% and ChatGPT was up 2.2%, both statistically indistinguishable from zero, while Google AI Overviews showed a small 4.6% relative decline.8 That does not mean schema is useless. It means schema is not a sacrament. It may help a machine understand what a page claims to be, but it does not turn a weak page into a citable one.

The machine wants sources. The market wants shortcuts.

There is nothing wrong with an LLM wanting primary sources. In fact, it is the only sane hierarchy. Official documentation, legislation, standards bodies, academic papers, filings, patents, changelogs, platform docs, and original research should outrank the usual soup of rewritten commentary.

But the market does not ask, how do we become a better source? It often asks, how do we look like the kind of source the machine wants?

That is where citation-worthiness becomes theatre.

A publisher can manufacture the signals: an author box, a reviewed-by line, a fake update date, a table of contents, a few outbound links to official sources, a schema graph, a neat FAQ section, a confident title, and a tone that sounds suspiciously like authority if you read it quickly enough. The bones of it look right. The costume is convincing.

But the question is not whether the page has the shape of trust. The question is whether it has the substance of trust.

Google’s people-first content guidance is useful here because it is almost embarrassingly human. It asks whether the content provides original information, reporting, research, or analysis; whether it gives a substantial and complete description of the topic; whether it provides insight beyond the obvious; whether it avoids simply copying or rewriting other sources; and whether the title gives a descriptive, helpful summary rather than exaggerating.3

That is not technical wizardry. That is editorial decency.

It also asks whether the content presents information in a way that makes you want to trust it, through clear sourcing, evidence of expertise, background about the author or site, and a lack of easily verified factual errors.3 Google then says trust is the most important part of E-E-A-T, with experience, expertise, and authoritativeness contributing to it.3

So the real battle is not schema versus no schema. It is trust versus trust performance.

RAG does not remove the problem. It relocates it.

There is a slightly lazy belief floating around that retrieval-augmented generation, or RAG, solves hallucination by attaching the model to external sources. It helps, of course. Nobody serious should dismiss it. A model that can retrieve up-to-date external material has a better chance of answering accurately than one relying only on its parametric memory.

**But RAG does not abolish judgement. It relocates judgement into retrieval.

A 2024 survey on RAG trustworthiness states the positive case clearly: RAG can give LLMs useful and up-to-date knowledge from external databases, helping mitigate hallucination.5 But the same abstract gives the other half of the story: RAG systems risk generating undesirable content when retrieved information is inappropriate or poorly used.5

There it is. The unglamorous hinge.

If the retrieval layer is polluted, the answer is polluted. If the candidate set is thin, the citation choices are thin. If the system retrieves source-shaped content rather than actual sources, then the output may look cited and still be weak.

Another paper, Citations as Queries, frames source attribution as a retrieval and reranking problem: candidate sources are retrieved with a baseline retrieval model and then reranked with language models to locate sources used to write a text.6 Again, this is not mysticism. Citation is not divine revelation. It is a pipeline.

And pipelines can be gamed.

The new SEO is not new

This is why the AI search conversation feels so familiar. We have been here before, only with different costumes. First it was keywords. Then links. Then anchor text. Then exact-match domains. Then content length. Then freshness. Then author bios. Then schema. Now it is citations, fan-out queries, extractability, entity alignment, and whatever acronym the industry manages to mint before lunch. You can read here, there is nothing new under the SEO sun.

The names change. The underlying anthropology does not.

People see a system. People infer rewards. People manufacture signals. The system pushes back. The sincere builders keep building. The suckholes keep looking for loopholes. And somewhere in the middle, the user just wants a decent answer.

Google’s spam policies define spam as techniques used to deceive users or manipulate Search systems into ranking content highly.7 That definition translates almost perfectly into the AI-search era. Replace “ranking content highly” with “being retrieved, summarised, recommended, or cited by an AI system,” and the same principle holds.

The future problem is not simply ranking spam. It is citation spam.

Not spam that screams. Spam that whispers politely. Spam with footnotes. Spam with a lovely author profile. Spam with a diagram, a fake methodology, and the faint smell of LinkedIn thought leadership.

So what should we build?

The answer, infuriatingly for anyone hoping for a quick trick, is still the same. Build the thing properly.

Make the important content visible in text. Let crawlers access it. Use titles that describe what the page actually answers. Maintain internal links so important pages are findable. Keep business data current. Use structured data only where it matches visible content. Cite primary sources where claims need evidence. Do not launder other people’s work through a paraphrasing machine and call it thought leadership.

And most of all, stop confusing being machine-readable with being worth reading.

There is a difference between a page that a model can parse and a page that a model should trust. The first is technical. The second is earned.

That is the core of it.

LLMs may want to cite the best available sources. They may prefer relevance, source quality, claim support, freshness, direct evidence, and clean extractability. But the web is not a blue-sky diagram. It is a living, sweating, incentive-driven ecosystem where any visible criterion becomes a lever, and any lever becomes a toy for people who mistake optimisation for wisdom.

So yes, we should understand how LLMs want to work. But we should not confuse that with how the world behaves once the money arrives.

The machine wants clean evidence.

The market wants advantage.

The job, as ever, is to build something so useful, so clear, so properly sourced, and so humanly grounded, that it does not need to pretend.

That is not a hack.

That is the bloody point.

References