Do LLM Bots Actually Read JSON-LD, and Will They Cite It?

TL;DR - This is a live experiment, running right now. We built a page that carries a coined fact in its JSON-LD structured data and nowhere a person can see it, plus a separate control fact in the visible text. The major AI crawlers are finding the page as we write this. Over the coming weeks we will ask the engines about both facts and see whether something that lives only in JSON-LD can travel all the way into a generated answer. One deliberate choice up front: we are not naming the planted terms yet. Publishing them would teach the engines the answer and ruin the test, so the specifics stay sealed until the results are in.

Scope: this tests whether structured data that sits in a page's source but not its visible render gets ingested and cited by AI answer engines. Vol. 02 covered whether those engines execute JavaScript at all; this is the next layer down.

The question: does structured data feed the answer, or just decorate the page?

Search engines have read structured data for years. AI answer engines are the new audience, and the pitch every schema vendor makes is that this invisible layer feeds the machines. We wanted to test the strong version of that claim for AI specifically.

Forget JavaScript for a moment. A lot of what a page says to a machine never appears on screen for a person: JSON-LD structured data, meta tags, comments, markup that is present in the source but never rendered. The question is simple to state and surprisingly hard to answer honestly: if a fact exists only in the JSON-LD, with no visible copy backing it up anywhere on the page, will an AI engine ingest that fact and repeat it back when asked?

What we built

One page, with a coined framework placed inside it. The whole experiment turns on where that framework lives. Its name and details sit only in the page's structured data and other non-visible layers. They do not appear in the visible body copy, the headline, or the meta description. Read the page the way a person does and you never see the framework at all.

To keep the test honest we used layers, each carrying its own distinct marker so a later answer points back to exactly one source:

  • A visible control. The body copy describes a separate, plainly visible term. This is the positive control - it proves the page's visible text reaches the engines at all, so a silent result on the structured data means something rather than nothing.
  • The JSON-LD layer. The framework and one coined sub-term exist only here, and that sub-term appears nowhere else on the page or anywhere else on the web.
  • A few other non-visible layers - a custom meta tag, an HTML comment, a hidden block - each with its own marker, so we can also see whether the engines treat those differently from JSON-LD.

Every layer got a unique tracking token. The point is that a coined sub-term cannot show up by coincidence: it does not exist anywhere we did not put it, so any engine that returns it must have ingested the one layer that carries it. For the JSON-LD sub-term, that means the engine read the structured data.
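To make the layering concrete, here is a minimal sketch in Python of how a page like this could be assembled and checked before launch. It is an illustration under stated assumptions, not the actual instrument: the marker strings are hypothetical placeholders (the real planted terms stay sealed), the `x-experiment` meta name and the helper names are invented for the example, and the visibility check treats only `script`, `style`, and elements with a `hidden` attribute as non-visible, which is cruder than what a browser does.

```python
import json
from html.parser import HTMLParser

# Hypothetical placeholder markers; the real planted terms are not shown here.
MARKERS = {
    "visible": "VISIBLE-CONTROL-TERM",
    "jsonld": "JSONLD-ONLY-TERM",
    "meta": "META-ONLY-TERM",
    "comment": "COMMENT-ONLY-TERM",
    "hidden": "HIDDEN-ONLY-TERM",
}


def build_page(markers):
    """Assemble a test page that carries exactly one marker per layer."""
    schema = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": "An ordinary headline",
        "about": {"@type": "Thing", "name": markers["jsonld"]},
    }
    return (
        "<html><head>"
        f'<meta name="x-experiment" content="{markers["meta"]}">'
        f'<script type="application/ld+json">{json.dumps(schema)}</script>'
        "</head><body>"
        f"<!-- {markers['comment']} -->"
        f"<p>Body copy describing {markers['visible']}.</p>"
        f'<div hidden>{markers["hidden"]}</div>'
        "</body></html>"
    )


class VisibleText(HTMLParser):
    """Collect text a reader would plausibly see (simplified assumption)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_tag = None  # tag name that opened the skipped region
        self._depth = 0        # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is None:
            # Note: a void element with a hidden attribute would need extra care.
            if tag in ("script", "style") or "hidden" in dict(attrs):
                self._skip_tag, self._depth = tag, 1
        elif tag == self._skip_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._skip_tag:
            self._depth -= 1
            if self._depth == 0:
                self._skip_tag = None

    def handle_data(self, data):
        # HTML comments never reach handle_data, so they are excluded already.
        if self._skip_tag is None:
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def leaked_markers(html, markers):
    """Return the non-visible layers whose marker leaked into visible text."""
    text = visible_text(html)
    return [
        layer
        for layer, marker in markers.items()
        if layer != "visible" and marker in text
    ]
```

Running `leaked_markers(build_page(MARKERS), MARKERS)` before publishing is the kind of pre-flight check that keeps the layers separate: if it returns anything other than an empty list, a marker has escaped into the visible copy and a later citation could no longer be attributed to a single layer.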

How we are measuring it

Two independent tracks, the same discipline as the rest of this series.

First, crawl confirmation. Our edge middleware logs every known AI crawler that hits the site - which bot, which path, when. That tells us the page was actually fetched, and by whom, before we read anything into the answers.
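As a rough illustration of what that logging involves, the sketch below matches request user-agent strings against a short list of crawler tokens and records which bot hit which path, and when. Everything here is an assumption made for the example: the token list is illustrative rather than the list the middleware actually uses, the function names are hypothetical, and it is written in Python for readability rather than as edge code.

```python
from datetime import datetime, timezone

# Illustrative user-agent substrings only; an assumption, not the real list.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def identify_crawler(user_agent):
    """Return the matching crawler token, or None for ordinary traffic."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


def log_hit(log, user_agent, path, when=None):
    """Append a {bot, path, at} entry if the request came from a known crawler."""
    bot = identify_crawler(user_agent)
    if bot is None:
        return None
    timestamp = when or datetime.now(timezone.utc)
    entry = {"bot": bot, "path": path, "at": timestamp.isoformat()}
    log.append(entry)
    return entry


def bots_that_fetched(log, path):
    """List the distinct crawlers that requested a given path."""
    return sorted({entry["bot"] for entry in log if entry["path"] == path})
```

One caveat worth keeping in mind with any user-agent approach: the string is self-reported and can be spoofed, so a match shows that something claiming to be that crawler fetched the page, not proof of who sent the request.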

Second, regurgitation. After the crawlers have had time to work, we ask the major engines a set of targeted questions: about the visible control, to confirm the visible text was ingested; and about the structured-data-only facts, to see whether they come back. Because the coined sub-term has no other home, any engine that returns it has read the JSON-LD. That is the whole finding, and it does not require taking any model's word for anything.
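The scoring step can be sketched in a few lines. The Python below is a minimal illustration, not the actual scoring code: the marker strings are hypothetical placeholders, the function names are invented, and matching is a plain case-insensitive substring test, which would miss paraphrases or misspellings of a coined term.

```python
# Hypothetical placeholder markers, one per layer; the real terms stay sealed.
MARKERS = {
    "visible": "VISIBLE-CONTROL-TERM",
    "jsonld": "JSONLD-ONLY-TERM",
    "meta": "META-ONLY-TERM",
    "comment": "COMMENT-ONLY-TERM",
    "hidden": "HIDDEN-ONLY-TERM",
}


def attribute_answer(answer, markers):
    """Return the layers whose marker appears in one engine answer."""
    text = answer.casefold()
    return [
        layer for layer, marker in markers.items() if marker.casefold() in text
    ]


def summarize(answers, markers):
    """Map {engine: [answer, ...]} to the layers each engine surfaced."""
    results = {}
    for engine, texts in answers.items():
        layers = set()
        for text in texts:
            layers.update(attribute_answer(text, markers))
        results[engine] = {
            "layers": sorted(layers),
            # Without the visible control, silence on other layers is uninformative.
            "control_ok": "visible" in layers,
        }
    return results
```

The `control_ok` flag encodes the role of the positive control: an engine that fails to return the visible term tells us nothing about JSON-LD either way, so its silence on the other layers should be set aside rather than counted as a negative.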

What we are withholding, and why

You will notice we have not told you the coined terms or which page they live on. That is on purpose, and it is the same rigor the experiment itself depends on. Before launch, the terms returned zero results anywhere we searched, which is exactly what makes a later citation provable. If we published them here - on an indexed page, in plain readable text - we would be handing the engines the answer through the front door, and a citation would no longer prove anything. So the specifics stay sealed until the measurement window closes, at which point we will reveal everything alongside the results.

What we expect

Predictions on the record, so we can be honest about how wrong we were when the data lands. The visible control should come back quickly, on most engines. If it does not, the likeliest explanation is that the page is too fresh for anything on it to have been ingested, and the rest of the results cannot be interpreted. The structured-data result is the genuine unknown. Our working bet is that JSON-LD is read by some engines and not all, and that the meta, comment, and hidden layers behave differently from it. If structured data turns out to be a live ingestion channel, that changes how aggressively a brand should encode its facts in schema. If it does not, that is just as useful to know, and we will say what to do instead.

Follow along

The clock is running. As the bot log fills in and the engines start answering, we will update this post with the crawl timeline, the layer-by-layer results, and the full reveal of what we planted. If you want the raw methodology or a look at the instrument once the run is done, let us know.