    AI Sitemap

    An AI sitemap is a machine-readable file, most commonly an llms.txt or llms-full.txt, that tells AI engines and LLM crawlers which pages on a site are the canonical, citation-worthy sources, in what order, and with what short descriptions. It complements, but does not replace, the traditional XML sitemap used by Googlebot and Bingbot.

    Why it matters: AI crawlers like GPTBot, ClaudeBot, and PerplexityBot have limited budget per site. An AI sitemap concentrates that budget on the brand's strongest pages (pillar guides, original research, glossary, key services), which can raise the probability of being cited in AI answers and lower the risk of hallucinations sourced from thin or outdated URLs.

    Related Terms

    llms.txt

    llms.txt is a proposed Markdown-formatted text file placed at the root of a website (e.g. /llms.txt) that summarizes the site's purpose, lists its most important canonical URLs, and gives AI crawlers a compact, structured map of what the site is authoritative on. It is the AI-engine analog to robots.txt and sitemap.xml, designed to help large language models index, ground, and cite the right pages.

    Why it matters: As ChatGPT, Perplexity, Claude, Google AI Overviews, and Bing Copilot increasingly drive discovery, llms.txt is becoming a meaningful AEO and GEO infrastructure layer. A well-crafted llms.txt tells AI engines which pillar guides, services, and authoritative resources to cite when answering questions in the brand's domain, reducing the risk of being misrepresented or omitted. Because llms.txt is still a proposal, no engine is guaranteed to honor it. Sites without llms.txt are not penalized, but a clean, accurate llms.txt can give a site a structural advantage in AI citation outcomes. Smart Money Media's own llms.txt is publicly available at /llms.txt, and any site can generate a spec-compliant file in 30 seconds with our free llms.txt generator at /tools/llms-txt-generator.
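
    As a rough illustration of the file described above, the sketch below assembles an llms.txt body in the layout the public llms.txt proposal describes: an H1 with the site name, a blockquote summary, then H2 sections holding link lists. The `Link` class, the `build_llms_txt` function, and the example.com URLs are hypothetical names chosen for this example; this is not the generator mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One entry in an llms.txt link list (hypothetical helper type)."""
    title: str
    url: str
    note: str = ""


def build_llms_txt(site_name, summary, sections):
    """Return llms.txt text: H1 title, blockquote summary, H2 link sections.

    `sections` maps a section heading to a list of Link objects.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            # The note after the colon is optional.
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Placeholder content; replace with the site's real canonical URLs.
    print(build_llms_txt(
        "Example Co",
        "Example Co publishes guides on answer engine optimization.",
        {
            "Guides": [Link("AEO guide", "https://example.com/aeo", "Pillar guide")],
            "Services": [Link("Audits", "https://example.com/audits")],
        },
    ))
```

    Saving the returned text as /llms.txt at the site root matches the placement described above. Which pages to list, and in what order, remains an editorial decision.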

    ClaudeBot

    ClaudeBot is Anthropic's web crawler. Anthropic documents it as the crawler that collects web content that may be used for model training. Fetches made on behalf of a user's query and crawling for search are handled by separate Anthropic user agents, Claude-User and Claude-SearchBot. Site owners control each of them via robots.txt. An older user agent, anthropic-ai, is associated with historical training crawls.

    Why it matters: Blocking ClaudeBot keeps a brand's pages out of future Claude training crawls, and blocking Anthropic's retrieval agents keeps them out of real-time Claude answers, a meaningful loss given Claude's adoption inside enterprise tools. Allowing Anthropic's crawlers, paired with a well-structured llms.txt file, improves the chance of Claude citations.

    PerplexityBot

    PerplexityBot is Perplexity AI's web crawler. Perplexity describes it as the crawler that lets pages be surfaced and linked in Perplexity results, not as a crawler for foundation-model training. Page fetches triggered directly by a user's question come from a separate agent, Perplexity-User. Either way, Perplexity works on a retrieval basis, which means Perplexity citations depend more on whether its agents can fetch and read the page than on long-term training-data inclusion.

    Why it matters: Perplexity is a citation-dense AI search engine: almost every answer includes inline source links. Allowing PerplexityBot and ensuring the page renders correctly for bots (no aggressive Cloudflare blocks, no content that exists only after client-side JavaScript runs) is a prerequisite for capturing Perplexity traffic.
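
    One way to approximate the "no JS-only content" check above is to look at the raw HTML a server returns and confirm that the key phrases are present before any script runs. The sketch below does that on an HTML string using only the Python standard library. It makes no network request, the function names are hypothetical, and treating a crawler as a client that does not execute JavaScript is an assumption for this example rather than documented behavior of any specific bot.

```python
from html.parser import HTMLParser


class _StaticTextExtractor(HTMLParser):
    """Collects text found outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore anything inside script/style; keep everything else.
        if not self._skip_depth:
            self.chunks.append(data)


def static_text(html):
    """Return whitespace-normalized text available without running scripts."""
    parser = _StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def content_visible_without_js(html, required_phrases):
    """True if every phrase appears in the page's static (non-script) text."""
    text = static_text(html).lower()
    return all(phrase.lower() in text for phrase in required_phrases)


if __name__ == "__main__":
    server_rendered = "<body><h1>Pricing guide</h1><p>Plans start here.</p></body>"
    client_rendered = "<body><div id='root'></div><script>render('Pricing guide')</script></body>"
    print(content_visible_without_js(server_rendered, ["Pricing guide"]))
    print(content_visible_without_js(client_rendered, ["Pricing guide"]))
```

    A page that fails this check may still be readable by crawlers that render JavaScript, so a failure is a prompt to investigate rather than proof that the page is invisible.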

    GPTBot

    GPTBot is OpenAI's web crawler, used to gather training data for future GPT models. Site owners control GPTBot access via robots.txt: allowing it permits OpenAI to use the site's content in training, while disallowing it excludes the site from future training crawls. GPTBot is distinct from OAI-SearchBot (which supports SearchGPT/ChatGPT Search answers) and ChatGPT-User (which fetches a page on demand for a ChatGPT user, for example when the user pastes a URL into ChatGPT).

    Why it matters: Allowing GPTBot increases the long-term probability that ChatGPT can recall and cite a brand from training memory without needing a live web fetch. For most B2B brands, allowing all three OpenAI user agents is a sensible default.
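
    Because the three OpenAI user agents are controlled independently, each one gets its own robots.txt group. The sketch below renders those groups from a small policy mapping. The function name and the policy format are hypothetical, and it only covers whole-site Allow or Disallow rules, not path-level rules.

```python
def render_robots_groups(policy):
    """Render one robots.txt group per user agent.

    `policy` maps a user-agent token to True (allow the whole site)
    or False (disallow the whole site).
    """
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # The "allow all three" default discussed above.
    print(render_robots_groups({
        "GPTBot": True,
        "OAI-SearchBot": True,
        "ChatGPT-User": True,
    }))
```

    Setting an entry to False, for example `{"GPTBot": False, "OAI-SearchBot": True}`, produces a file that opts out of training crawls while leaving the search agent allowed.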

    Google-Extended

    Google-Extended is the robots.txt token Google publishes so site owners can opt out of having their content used to train Google's Gemini and Vertex AI models, without affecting how Googlebot indexes the site for regular Search. It is not a separate crawler with its own HTTP user-agent string: crawling is done by Google's existing user agents, and the token only controls how the content may be used. It is set separately in robots.txt: a Disallow directive under User-agent: Google-Extended blocks AI training use, while Googlebot rules stay untouched for SEO.

    Why it matters: Most brands should leave Google-Extended allowed, which is also the default when no rule blocks it. Blocking it removes the brand from Gemini's future training data and weakens long-term AI recall, while delivering no SEO benefit. The recommended stance for a B2B authority brand is open access for both Googlebot and Google-Extended.
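
    To make the separation between the two tokens concrete, the sketch below parses a small robots.txt with Python's standard `urllib.robotparser` and reports access per token. The robots.txt content shows the opt-out case described above purely to illustrate the mechanism; the helper name and the example.com URL are hypothetical, and Python's matching rules only approximate how Google itself evaluates robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Opt-out example: Search crawling stays open, AI training use is blocked.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""


def access_by_token(robots_txt, tokens, url):
    """Return {token: bool} showing whether each token may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    print(access_by_token(
        ROBOTS_TXT,
        ["Googlebot", "Google-Extended"],
        "https://example.com/guide",
    ))
```

    Changing the Google-Extended rule to `Allow: /`, or removing that group entirely, gives the open-access stance recommended above.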

    Claude

    Claude is Anthropic's family of large language models, used in the Claude.ai consumer app, the Claude API, and embedded in major enterprise tools (Notion AI, Slack AI, Zoom, Quora's Poe). Claude is known for long-context reasoning and careful citation behavior, and is often credited with lower hallucination rates on factual queries. When Claude needs fresh web information, it retrieves it through Anthropic's crawlers and user agents, which site owners can allow or block in robots.txt.

    Why it matters: Claude has become a primary research surface for executives, journalists, and analysts, exactly the audiences most valuable to a PR-driven brand. Earning Claude citations requires the same AEO fundamentals (schema, entity signals, third-party authority) plus Anthropic's crawlers, including ClaudeBot, allowed in robots.txt and a clean llms.txt file.
