
    Common Crawl (CCBot)

    Common Crawl is a nonprofit that maintains a free, public archive of the open web, refreshed roughly monthly. Its crawler is identified by the user agent CCBot. Common Crawl snapshots are one of the most widely used sources of pretraining data for large language models: GPT-3, LLaMA, and many open-source models are documented as having trained heavily on filtered Common Crawl data. Why it matters: Blocking CCBot keeps a site's pages out of future Common Crawl snapshots, and therefore out of the many training datasets derived from them. For any brand that wants to be remembered by current and future AI models, allowing CCBot is one of the highest-leverage robots.txt decisions.

    Related Terms

    llms.txt

    llms.txt is a proposed plain-text file placed at the root of a website (e.g. /llms.txt) that summarizes the site's purpose, lists its most important canonical URLs, and gives AI crawlers a compact, structured map of what the site is authoritative on. It is the AI-engine analog to robots.txt and sitemap.xml, designed to help large language models index, ground, and cite the right pages. Why it matters: As ChatGPT, Perplexity, Claude, Google AI Overviews, and Bing Copilot increasingly drive discovery, llms.txt is becoming a meaningful infrastructure layer for answer engine optimization (AEO) and generative engine optimization (GEO). A well-crafted llms.txt signals to AI engines which pillar guides, services, and authoritative resources to cite when answering questions in the brand's domain, which reduces the risk of being misrepresented or omitted. Because llms.txt is still a proposal, AI engines are not obliged to read it: sites without one are not penalized, but sites with a clean, accurate llms.txt may gain a structural advantage in AI citation outcomes. Smart Money Media's own llms.txt is publicly available at /llms.txt, and any site can generate a spec-compliant file in 30 seconds with our free llms.txt generator at /tools/llms-txt-generator.
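To make the file's shape concrete, here is a minimal Python sketch that renders an llms.txt body. It assumes the layout described in the public llms.txt proposal (an H1 title, a blockquote summary, then H2 sections containing Markdown link lists); the `Link` type, the `render_llms_txt` function, and the example.com URLs are hypothetical names chosen for illustration, not part of any official tool.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One entry in an llms.txt section (hypothetical helper type)."""
    title: str
    url: str
    note: str = ""


def render_llms_txt(site_name: str, summary: str, sections: dict[str, list[Link]]) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 link sections."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            # The note is optional; when present it follows the link after a colon.
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


# Illustrative input only; every name and URL here is made up.
example = render_llms_txt(
    "Example Co",
    "Hypothetical B2B advisory firm.",
    {"Guides": [Link("Pricing guide", "https://example.com/guides/pricing", "How plans are structured")]},
)
```

Writing `example` to a file named llms.txt at the site root would produce the kind of document the entry describes. Which sections to include, and which URLs count as canonical, remain editorial decisions.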

    GPTBot

    GPTBot is OpenAI's web crawler used to gather training data for future GPT models. Site owners control GPTBot access via robots.txt: allowing it permits OpenAI to use the site's content in training, while disallowing it excludes the site. GPTBot is distinct from OAI-SearchBot (which fetches pages for ChatGPT Search answers) and ChatGPT-User (which fetches a page on demand for a user-initiated action, such as a user pasting a URL into ChatGPT). Each of the three is matched by its own User-agent group in robots.txt, so each can be allowed or blocked independently. Why it matters: Allowing GPTBot increases the long-term probability that ChatGPT can recall and cite a brand from training memory without needing a live web fetch. For most B2B brands, allowing all three OpenAI user agents is the correct default.
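The three OpenAI agents are governed by separate robots.txt groups, which can be checked locally with Python's standard `urllib.robotparser`. The sketch below deliberately blocks GPTBot only, to show that the other two agents are unaffected; it is not the recommended configuration from the entry above (which allows all three). The robots.txt content, the `allowed_agents` helper, and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: training crawler blocked, retrieval agents allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""

OPENAI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each agent, whether the given robots.txt lets it fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, OPENAI_AGENTS, "https://example.com/guide")
```

With this file, `result` maps GPTBot to `False` and the other two agents to `True`. Deleting the GPTBot group restores the allow-all default the entry recommends.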

    ClaudeBot

    ClaudeBot is Anthropic's web crawler for collecting training data. Site owners control ClaudeBot access via robots.txt. According to Anthropic's crawler documentation, real-time retrieval is handled by separate user agents: Claude-User, which fetches a page when a Claude user's request requires fresh web information, and Claude-SearchBot, which crawls pages to improve search results. An older token, anthropic-ai, still appears in many robots.txt files from historical training crawls. Why it matters: Blocking ClaudeBot removes a brand from Claude's future training corpus, and blocking the retrieval agents removes it from real-time Claude answers, a meaningful loss given Claude's adoption inside enterprise tools. Allowing Anthropic's crawlers, paired with a well-structured llms.txt file, maximizes the chance of Claude citations.

    PerplexityBot

    PerplexityBot is Perplexity AI's web crawler, used to fetch and index pages so they can be surfaced and linked in Perplexity answers. Perplexity also documents a separate agent, Perplexity-User, for fetches triggered directly by a user's query. Unlike training crawlers, these agents operate on a retrieval basis, which means Perplexity citations depend more on whether the bot can fetch the page right now than on long-term training-data inclusion. Why it matters: Perplexity attaches inline source links to almost every answer, so it cites sources more densely than most AI search engines. Allowing PerplexityBot and ensuring the page renders correctly to bots (no aggressive Cloudflare blocks, no content that appears only after JavaScript runs) is a prerequisite for capturing Perplexity traffic.
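One way to test the "no JS-only content" condition is to check whether a page's key phrases are present in the raw HTML, before any script runs. The sketch below uses only Python's standard `html.parser`. It assumes the crawler does not execute JavaScript (a conservative assumption, not a documented property of any specific bot), and the `visible_text` and `renders_without_js` helpers and both HTML samples are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text a non-JavaScript client would see, skipping script/style bodies."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text content of the raw HTML, excluding script and style bodies."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def renders_without_js(html: str, must_contain: list[str]) -> bool:
    """True if every required phrase appears in the raw HTML text (case-insensitive)."""
    text = visible_text(html).lower()
    return all(phrase.lower() in text for phrase in must_contain)


# Illustrative samples: one server-rendered page, one that needs JavaScript.
SERVER_RENDERED = "<html><body><h1>Pricing guide</h1><p>Three tiers.</p></body></html>"
JS_ONLY = '<html><body><div id="root"></div><script>render("Pricing guide")</script></body></html>'
```

Here `renders_without_js(SERVER_RENDERED, ["Pricing guide"])` is `True`, while the same call on `JS_ONLY` is `False` because the phrase exists only inside a script. In practice the HTML would come from fetching the live page with the bot's user agent string, which also exposes firewall blocks.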

    Google-Extended

    Google-Extended is a robots.txt product token that Google publishes so site owners can opt out of having their content used to train Google's Gemini and Vertex AI models, without affecting how Googlebot indexes the site for regular Search. It is not a separate crawler: Google's existing crawlers do the fetching, and the token only controls how the content may be used. It is controlled separately in robots.txt: a Disallow directive for Google-Extended blocks AI training use, while leaving Googlebot Allow rules untouched for SEO. Why it matters: Most brands should leave Google-Extended allowed, either explicitly or by simply not disallowing it. Blocking it removes the brand from Gemini's future training data and weakens long-term AI recall, while delivering no SEO benefit. The correct stance for a B2B authority brand is open access for both Googlebot and Google-Extended.

    OAI-SearchBot

    OAI-SearchBot is OpenAI's retrieval crawler, used by ChatGPT Search (launched as the SearchGPT prototype) to fetch web pages when a user query requires fresh information. It is distinct from GPTBot (training crawler) and ChatGPT-User (user-initiated fetcher). Unlike GPTBot, allowing OAI-SearchBot has no impact on training data, only on whether a page can appear as a citation in ChatGPT Search results. Why it matters: Blocking OAI-SearchBot makes a page invisible to ChatGPT Search answers regardless of how strong its authority signals are. For brands targeting ChatGPT visibility, allowing this bot is non-negotiable.
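Since every crawler in this glossary is controlled through robots.txt, a single audit can report which tokens a given file blocks. The following is a minimal sketch using Python's standard `urllib.robotparser`. The token list is taken from the entries above, while the `audit_ai_access` helper, the sample file, and the example.com URL are assumptions for illustration. The parser treats Google-Extended like any other user-agent token, which is an approximation of how Google applies it.

```python
from urllib.robotparser import RobotFileParser

# Tokens named in this glossary; extend the list as vendors publish new ones.
AI_CRAWLER_TOKENS = [
    "CCBot",
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
]


def audit_ai_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Map each AI crawler token to whether robots_txt allows it to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parses in memory, no network access
    return {token: parser.can_fetch(token, url) for token in AI_CRAWLER_TOKENS}


# Hypothetical file that blocks two AI tokens and hides /admin/ from everyone.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""

report = audit_ai_access(SAMPLE_ROBOTS_TXT)
blocked = sorted(token for token, allowed in report.items() if not allowed)
```

For the sample file, `blocked` contains CCBot and Google-Extended, and every other token falls through to the wildcard group and is allowed. The audit reflects robots.txt rules only; firewall or CDN rules that block bots at the network level need a separate check.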
