
    ClaudeBot

    ClaudeBot is Anthropic's web crawler for collecting public web content that may be used as training data. Real-time retrieval is handled by separate Anthropic user agents: Claude-User fetches pages when a Claude user's query requires fresh web information, and Claude-SearchBot indexes pages for Claude's search results. Site owners control each agent's access via robots.txt. Anthropic previously used another user agent, anthropic-ai, for historical training crawls. Why it matters: Blocking ClaudeBot keeps a site's content out of future Claude training crawls, and blocking the retrieval agents reduces a brand's presence in real-time Claude answers. That is a meaningful loss given Claude's adoption inside enterprise tools. Allowing these agents, paired with a well-structured llms.txt file, improves the chance of Claude citations.
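A robots.txt rule for ClaudeBot can be checked locally before it is deployed. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt contents, the `is_allowed` helper, and the example.com URL are illustrative assumptions, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ClaudeBot may fetch everything,
# every other crawler is kept out of /private/.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/private/report"
    for agent in ("ClaudeBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Running the same check against the live robots.txt after each deployment is a cheap way to catch an accidental block.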

    Related Terms

    llms.txt

    llms.txt is a proposed plain-text (Markdown) file placed at the root of a website (e.g. /llms.txt) that summarizes the site's purpose, lists its most important canonical URLs, and gives AI crawlers a compact, structured map of what the site is authoritative on. It is the AI-engine analog to robots.txt and sitemap.xml, designed to help large language models index, ground, and cite the right pages. Why it matters: As ChatGPT, Perplexity, Claude, Google AI Overviews, and Bing Copilot increasingly drive discovery, llms.txt is becoming a meaningful infrastructure layer for answer engine optimization (AEO) and generative engine optimization (GEO). A well-crafted llms.txt tells AI engines which pillar guides, services, and authoritative resources to cite when answering questions in the brand's domain, reducing the risk of being misrepresented or omitted. Sites without llms.txt are not penalized, but a clean, accurate llms.txt can give a site a structural advantage in AI citation outcomes. Smart Money Media's own llms.txt is publicly available at /llms.txt, and any site can generate a spec-compliant file in 30 seconds with our free llms.txt generator at /tools/llms-txt-generator.
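To make the file's shape concrete, here is a minimal sketch of a generator. It assumes the layout described in the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of Markdown links); the `build_llms_txt` helper, the company name, and the example.com URLs are hypothetical and are not taken from any real generator.

```python
def build_llms_txt(title, summary, sections):
    """Build llms.txt text.

    sections maps a section heading to a list of (name, url, note) tuples;
    note may be an empty string.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            entry = f"- [{name}]({url})"
            if note:
                entry += f": {note}"
            lines.append(entry)
        lines.append("")
    # Exactly one trailing newline at the end of the file.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    text = build_llms_txt(
        "Example Co",
        "Example Co publishes guides on reputation management.",
        {
            "Guides": [
                ("Pillar guide", "https://example.com/guide", "Overview of the topic"),
            ],
            "Services": [
                ("Consulting", "https://example.com/services", ""),
            ],
        },
    )
    print(text)
```

Keeping the link list short and limited to canonical URLs matches the "compact map" purpose described above.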

    Claude

    Claude is Anthropic's family of large language models, used in the Claude.ai consumer app, the Claude API, and embedded in major enterprise tools (Notion AI, Slack AI, Zoom, Quora's Poe). Claude is known for long-context reasoning, careful citation behavior, and lower hallucination rates on factual queries. When Claude browses the web, it fetches fresh pages through Anthropic's own user agents rather than relying on training data alone. Why it matters: Claude has become a primary research surface for executives, journalists, and analysts, exactly the audiences most valuable to a PR-driven brand. Earning Claude citations requires the same AEO fundamentals (schema, entity signals, third-party authority) plus robots.txt rules that allow Anthropic's crawlers and a clean llms.txt file.

    Training Data

    The vast and diverse datasets used to "teach" artificial intelligence models, particularly large language models (LLMs), how to understand, generate, and interact with human language. This data comprises an enormous corpus of text and code scraped from the internet, including websites, books, articles, social media, and more. The quality, breadth, and inherent biases of this training data profoundly influence an AI model's knowledge, capabilities, and the way it represents real-world entities. Why it matters: For reputation management, the content published online, especially from authoritative and frequently referenced sources, directly contributes to the training data of present and future AI models. Earning positive media placements in tier-1 publications, maintaining an accurate and comprehensive brand presence on Wikipedia, and consistently publishing high-quality content all increase the likelihood that accurate and favorable information about your brand is embedded within AI training data, thereby shaping how AI models perceive and represent your brand in their outputs.

    GPTBot

    GPTBot is OpenAI's web crawler used to gather training data for future GPT models. Site owners control GPTBot access via robots.txt — allowing it permits OpenAI to use the site's content in training, while disallowing it excludes the site. GPTBot is distinct from OAI-SearchBot (which fetches pages live for SearchGPT/ChatGPT Search answers) and ChatGPT-User (which fetches a page when a user pastes a URL into ChatGPT). Why it matters: Allowing GPTBot increases the long-term probability that ChatGPT can recall and cite a brand from training memory without needing a live web fetch. For most B2B brands, allowing all three OpenAI user agents is the correct default.
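Because the three OpenAI user agents are controlled independently, it can help to generate their robots.txt groups from a single policy and verify the result. The sketch below is one way to do that with Python's standard `urllib.robotparser`; the `robots_rules` and `audit` helpers and the example.com URL are hypothetical, and the all-allowed policy simply mirrors the default suggested above.

```python
from urllib.robotparser import RobotFileParser

OPENAI_AGENTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User")


def robots_rules(policy):
    """Render one robots.txt group per agent from {agent: allowed} pairs."""
    groups = []
    for agent, allowed in policy.items():
        directive = "Allow" if allowed else "Disallow"
        groups.append(f"User-agent: {agent}\n{directive}: /\n")
    return "\n".join(groups)


def audit(robots_txt, agents, url):
    """Report which agents the given robots.txt text lets fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # Default stance from the text above: allow all three agents.
    allow_all = {agent: True for agent in OPENAI_AGENTS}
    rules = robots_rules(allow_all)
    print(rules)
    print(audit(rules, OPENAI_AGENTS, "https://example.com/guide"))
```

Setting `"GPTBot": False` in the policy models a site that opts out of training while still allowing live fetches.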

    PerplexityBot

    PerplexityBot is Perplexity AI's web crawler, used to index pages so they can be surfaced and linked in Perplexity answers; a separate agent, Perplexity-User, fetches pages for user-initiated requests. Unlike training crawlers, these agents operate almost entirely on a retrieval basis, which means Perplexity citations depend more on whether the page can be fetched right now than on long-term training-data inclusion. Why it matters: Perplexity is among the most citation-dense AI search engines, with almost every answer including inline source links. Allowing PerplexityBot and ensuring the page renders correctly to bots (no aggressive Cloudflare blocks, no JS-only content for crawlers) is a prerequisite for capturing Perplexity traffic.
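One way to spot JS-only content is to look at what the raw HTML contains before any script runs. The sketch below assumes a crawler that does not execute JavaScript (an assumption, not a documented property of PerplexityBot) and uses Python's standard `html.parser`; the `static_text` helper and both sample pages are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a fetcher sees without running any JavaScript."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


# Hypothetical pages: one server-rendered, one filled in by JavaScript.
SERVER_RENDERED = "<html><body><h1>Pricing guide</h1><p>Plans start here.</p></body></html>"
JS_ONLY = (
    '<html><body><div id="root"></div>'
    '<script>document.getElementById("root").innerText = "Pricing guide";</script>'
    "</body></html>"
)

if __name__ == "__main__":
    print(repr(static_text(SERVER_RENDERED)))
    print(repr(static_text(JS_ONLY)))
```

If a page's key headline is missing from `static_text`, the content is likely invisible to any fetcher that skips script execution.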

    Google-Extended

    Google-Extended is the robots.txt product token Google publishes so site owners can opt out of having their content used to train Google's Gemini and Vertex AI models, without affecting how Googlebot indexes the site for regular Search. It has no separate HTTP user-agent string: crawling is still done by Google's existing crawlers, and the token acts purely as a control in robots.txt. A Disallow directive for Google-Extended blocks AI training use while leaving Googlebot Allow rules untouched for SEO. Why it matters: Most brands should explicitly Allow Google-Extended. Blocking it removes the brand from Gemini's future training data and weakens long-term AI recall, while delivering no SEO benefit. The correct stance for a B2B authority brand is open access for both Googlebot and Google-Extended.
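The separation between the two tokens can be illustrated by comparing an opt-out file with an open-access one. This is a sketch using Python's standard `urllib.robotparser` as a stand-in for Google's own matching, which may differ in detail; both robots.txt texts, the `token_access` helper, and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical opt-out: block AI training use, keep Search crawling open.
OPT_OUT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

# Hypothetical open-access stance: one group covering both tokens.
OPEN_ACCESS = """\
User-agent: Google-Extended
User-agent: Googlebot
Allow: /
"""


def token_access(robots_txt, url="https://example.com/guide"):
    """Return the allow/deny result for each Google token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        token: parser.can_fetch(token, url)
        for token in ("Googlebot", "Google-Extended")
    }


if __name__ == "__main__":
    print("opt-out:    ", token_access(OPT_OUT))
    print("open access:", token_access(OPEN_ACCESS))
```

In the opt-out file only the Google-Extended result changes, which is the behavior the definition above describes.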
