    Complete Guide

    llms.txt Explained: How ChatGPT, Claude, Perplexity & Gemini Read Your Site

    Smart Money Media Team | 17 min read | Updated May 16, 2026

    llms.txt is the single most under-deployed piece of AI search infrastructure on the modern web. Proposed in September 2024 as a robots.txt-style manifest for large language models, the file lives at the root of a website (/llms.txt), summarizes what the site is authoritative on, and gives AI crawlers a structured map of the pages worth citing. ChatGPT, Perplexity, Claude, Google AI Overviews, and Bing Copilot are all increasingly the first surface where prospects encounter a brand — and the brands that publish a clean llms.txt give those engines a curated retrieval index instead of leaving them to parse the same JavaScript shell every other crawler chokes on. This guide explains what llms.txt is, the exact format the spec requires, how AI engines actually use it, and how to publish one for your own site today.

    Quick Summary

    A complete reference on llms.txt — the proposed AI manifest that tells ChatGPT, Claude, Perplexity, and Gemini what a site is about and which pages to cite. Covers the spec, how AI engines retrieve and weight the file, the difference between llms.txt and llms-full.txt, real published examples, the seven-step checklist for writing one that AI engines actually use, common mistakes that get the file ignored, and how llms.txt fits inside a full Answer Engine Optimization (AEO) and Generative Engine Optimization (GEO) program.

    What is llms.txt?

    llms.txt is a plain-text markdown file placed at the root of a website (/llms.txt) that tells large language models what the site is about, who it serves, and which canonical URLs they should retrieve and cite when answering user questions in the site's subject area. The format was proposed by Jeremy Howard of Answer.AI in September 2024 and is documented at llmstxt.org. It is the AI-search analog to robots.txt (which controls crawling) and sitemap.xml (which lists URLs for traditional search engines), but solves a different problem: helping LLMs understand context and authority rather than discoverability alone.

    The file is intentionally small, human-readable, and parseable in a single LLM context window. Where a sitemap might list 10,000 URLs in an XML structure designed for crawlers, an llms.txt lists the 20-40 pages a model actually needs to ground a quality answer about the site, each with a one-line description of what it is. That curation is the entire point: the file is a brand-controlled retrieval index, not a complete inventory.

    The three files are easiest to understand side by side:

    | File | Audience | Purpose | Format | Typical Size |
    |---|---|---|---|---|
    | /robots.txt | Traditional + AI crawlers | Permission rules: which bots may crawl which paths | Plain text directives | < 5 KB |
    | /sitemap.xml | Google, Bing, traditional search | Discoverability: full inventory of every indexable URL | XML | Up to 50 MB / 50K URLs per file |
    | /llms.txt | LLMs and RAG systems (Claude, Perplexity, enterprise AI) | Citation: curated context + the 20-40 URLs worth citing | Markdown | 1-5 KB |

    Key Takeaway: llms.txt is a markdown manifest at /llms.txt that gives AI engines a curated, one-screen map of the pages a site is authoritative on — robots.txt is for crawling, sitemap.xml is for indexing, llms.txt is for AI citation.

    For the broader discipline of being cited inside AI-generated answers, see the Zero-Click Marketing pillar guide and the AEO agency service page.

    Why llms.txt Matters Right Now

    AI engines have replaced the search results page as the first surface where most B2B research, vendor evaluation, and brand discovery now happen, and those engines weight curated, structured signals from a brand's own domain more heavily than they weight scraped content from the open web. Publishers surveyed by the Reuters Institute expect search engine traffic to fall by over 40% within three years as AI Overviews and conversational AI engines intercept queries before users reach a results page. Inside that shift, the brands that hand AI engines a clean, accurate context document have a structural advantage over brands that leave the engines to guess.

    Three concrete things change when a site publishes a working llms.txt:

    • Retrieval cost drops, citation likelihood rises. When an AI engine builds an answer, it pulls candidate documents into context. A 2 KB llms.txt is dramatically cheaper to retrieve and parse than a JavaScript-rendered homepage, and the engine gets a higher signal-to-noise ratio per token. Cheaper retrieval correlates with more frequent citation.
    • The brand controls the framing. Without llms.txt, the engine assembles its own summary of what a site is from whatever pages it happens to have scraped — sometimes outdated, sometimes wrong. With llms.txt, the brand writes the one-sentence summary the engine reads first.
    • Pillar content gets surfaced over noise. Most sites have a handful of pages worth citing (pillar guides, services, About) and a long tail of pages that dilute authority (tag archives, paginated category pages, old blog posts). llms.txt is the brand's chance to point AI engines at the right 20 URLs.

    The file does not guarantee citation — no AI-engine input does — and several major models do not yet officially read it. But the cost of publishing is roughly one hour of work and the upside is a permanent piece of AI-search infrastructure that compounds as more engines adopt the spec.

    Key Takeaway: llms.txt is cheap insurance on the AI-search transition — sites without it leave their framing, their canonical URLs, and their citation likelihood to whatever the engine happens to scrape.

    The Exact llms.txt Format (Per llmstxt.org)

    The llms.txt specification is strict on structure and intentionally flat: no nested headings beyond H2, no HTML, no code fences, just markdown a language model can parse in one pass. A spec-compliant file has six elements. Only the first is required; the other five are optional, but all are recommended for a production deployment:

    1. H1 site name — required. The only required element in the spec. Example: # Smart Money Media
    2. Blockquote summary — a single line starting with > that summarizes what the site is in one sentence. Highly recommended.
    3. Context paragraphs — one or two short markdown paragraphs explaining the site, the audience, and what makes it authoritative. No headings inside this block.
    4. H2 section headings — group related links under section names like ## Docs, ## Pages, ## Blog, ## Examples, or ## Resources. Use only the sections that apply.
    5. Markdown link lists — each H2 section contains a bullet list in the exact form - [Title](full-URL): one-line description. The colon-description pattern is what AI engines parse.
    6. ## Optional — the final H2 section. Lists lower-priority links that AI engines with a tight context budget can safely skip. This is the spec's signal for "you can drop these if you need to."

    A minimal compliant example:

    # Acme Invoicing
    
    > Automated invoice reconciliation for mid-market finance teams.
    
    Acme Invoicing is a B2B SaaS used by 1,200+ controllers to match bank deposits to open invoices automatically. Founded 2021. SOC 2 Type II.
    
    ## Docs
    
    - [Getting Started](https://acme.com/docs/getting-started): 15-minute setup guide for new accounts.
    - [API Reference](https://acme.com/docs/api): REST API for bulk reconciliation and webhooks.
    
    ## Pages
    
    - [Pricing](https://acme.com/pricing): Per-seat pricing tiers and enterprise options.
    - [Security](https://acme.com/security): SOC 2 Type II, encryption, and data residency policies.
    
    ## Optional
    
    - [Changelog](https://acme.com/changelog): Weekly product updates.
    - [Status](https://status.acme.com): Real-time uptime and incident history.

    In practice, two rules are non-negotiable: every link is fully qualified (no relative paths) and every link has a colon-separated description. The spec itself treats the description as optional, but AI engines parse the description, not the URL, when deciding whether to retrieve.
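
    The structure above is simple enough to parse in a few lines. The following is a minimal sketch, not a reference parser: the names `LlmsTxt` and `parse_llms_txt` are hypothetical, and it assumes one link per bullet line and at most one blockquote summary.

```python
import re
from dataclasses import dataclass, field

# Matches "- [Title](url)" with an optional ": description" tail.
LINK_RE = re.compile(
    r"^[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?:\s*:\s*(?P<desc>.*))?$"
)


@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    context: list[str] = field(default_factory=list)
    # section name -> list of (title, url, description)
    sections: dict[str, list[tuple[str, str, str]]] = field(default_factory=dict)


def parse_llms_txt(text: str) -> LlmsTxt:
    doc = LlmsTxt()
    current = None  # name of the H2 section being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc.sections[current] = []
        elif line.startswith(">") and current is None and not doc.summary:
            doc.summary = line[1:].strip()
        elif current is not None:
            m = LINK_RE.match(line)
            if m:  # non-link lines inside a section are ignored in this sketch
                doc.sections[current].append(
                    (m["title"], m["url"], (m["desc"] or "").strip())
                )
        else:
            doc.context.append(line)
    return doc
```

    Running it over the Acme example yields the sections Docs, Pages, and Optional, each holding (title, URL, description) tuples, which is roughly the shape a retrieval system would index.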

    Key Takeaway: The spec is strict and short — H1, blockquote summary, context paragraphs, H2 sections with - [Title](url): description bullets, and a final ## Optional section. No nesting, no HTML, no creative interpretations.

    How AI Engines Actually Use llms.txt

    The llms.txt spec is a proposal, not a standard, and the major AI engines have adopted it at different rates and in different ways — knowing which engines parse the file changes how much effort is worth investing in it. As of 2026, the practical picture is mixed but trending in one direction.

    • Anthropic (Claude) and Perplexity have publicly indicated they read llms.txt when present and use it to inform retrieval. Perplexity in particular treats it as a first-pass index for "site:" style queries about a brand.
    • OpenAI (ChatGPT, ChatGPT Search) does not officially confirm parsing llms.txt, but the file is small enough that any engine doing site-level retrieval reads it incidentally when it requests the root of the domain.
    • Google (AI Overviews, Gemini) has not adopted llms.txt as a documented signal. Google's AI surfaces lean on the existing structured-data ecosystem (schema, sitemap, Knowledge Graph). However, Google crawlers do fetch /llms.txt, and the file is read by third-party retrieval systems that surface inside Gemini answers via tool calls.
    • The expanding long tail. Mistral, Cohere, Glean, You.com, Phind, Kagi Assistant, and a fast-growing list of enterprise RAG systems built on top of LangChain, LlamaIndex, and similar frameworks default to checking /llms.txt when ingesting a domain. This is where the compounding value lives.

    Practically, treat llms.txt the same way a sensible operator treated schema.org in 2014: not every engine uses it today, the engines that do use it weight it modestly, but the file is cheap to publish and the cost of being late to a standard that becomes universal is far higher than the cost of being early.

    Key Takeaway: Claude and Perplexity actively use llms.txt today. ChatGPT and Gemini do not officially support it but may read it incidentally. The real prize is the expanding long tail of enterprise RAG systems and second-tier engines that default to checking it.

    Seven-Step Checklist for Writing an llms.txt That Engines Actually Use

    Most llms.txt files in the wild today are either auto-generated junk (every URL on the site dumped into one bucket) or so terse they convey no useful context — both get ignored by engines that have learned to weight high-signal manifests over high-noise ones. The checklist below is the seven-step process Smart Money Media uses for client deployments.

    Step 1: Write the H1 and blockquote as if to a brand-new analyst. The H1 is the site name. The blockquote is one sentence — what the site is, who it serves, what makes it different. Read it back to yourself: if it could describe four other companies, rewrite it.

    Step 2: Add one or two context paragraphs that include the brand's key entities. Mention the industry, the audience, founding year, notable certifications or partnerships, and the two or three topics the site is most authoritative on. AI engines use this block to decide whether the site is a relevant retrieval source for a given query.

    Step 3: Pick the H2 sections that match the site type. A SaaS site usually has ## Docs, ## Pages, ## Blog. A media site usually has ## Pillar Guides, ## Categories, ## Glossary. A services agency usually has ## Services, ## Guides, ## Case Studies. Use the names that match how a reader would describe the site.

    Step 4: Curate, do not enumerate. Each section should contain 5-15 of the highest-authority URLs in that category. A 2 KB file with 25 carefully chosen links outperforms a 200 KB file with every URL on the site — the goal is retrieval relevance, not coverage.

    Step 5: Write descriptions that include entities, not adjectives. "Comprehensive guide to PR strategy" is useless. "Step-by-step framework covering goal-setting, audience segmentation, media relations, and ROI measurement for B2B PR programs" is what an AI engine can match to a user query.

    Step 6: Put low-priority links under ## Optional. Changelog, status page, RSS feed, legal pages — anything that is real content but should be the first thing dropped if the engine has a tight context budget.

    Step 7: Publish at /llms.txt with the correct headers. The file must be served at the exact path /llms.txt with Content-Type: text/markdown (or text/plain). Verify it loads in a private browser window, then test it with at least one AI engine by asking a brand-specific question and watching whether the engine retrieves URLs from your file.
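
    Steps 1 through 6 can be sketched as a small renderer that turns curated data into the file body and refuses links that break the two practical rules (fully qualified URL, non-empty description). This is an illustrative sketch; `render_llms_txt` and its parameters are hypothetical names, not part of the spec or any tool.

```python
def render_llms_txt(name, summary, context, sections, optional=()):
    """Render an llms.txt body from plain Python data.

    sections maps an H2 name to a list of (title, url, description) tuples;
    optional is a list of the same tuples for the "## Optional" section.
    """

    def bullets(links):
        out = []
        for title, url, description in links:
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"link must be fully qualified: {url}")
            if not description.strip():
                raise ValueError(f"link needs a description: {url}")
            out.append(f"- [{title}]({url}): {description.strip()}")
        return out

    lines = [f"# {name}", "", f"> {summary}", ""]
    for paragraph in context:
        lines += [paragraph, ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", "", *bullets(links), ""]
    if optional:
        lines += ["## Optional", "", *bullets(optional), ""]
    return "\n".join(lines).rstrip() + "\n"


example = render_llms_txt(
    name="Acme Invoicing",
    summary="Automated invoice reconciliation for mid-market finance teams.",
    context=["Acme Invoicing is a B2B SaaS used by controllers to match deposits to invoices."],
    sections={
        "Docs": [
            ("API Reference", "https://acme.com/docs/api",
             "REST API for bulk reconciliation and webhooks."),
        ],
    },
    optional=[
        ("Status", "https://status.acme.com",
         "Real-time uptime and incident history."),
    ],
)
print(example)
```

    Keeping the link data in a structure like this makes the quarterly re-audit a data edit rather than a hand edit of the published file.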

    Key Takeaway: Spec compliance is only the first 30 percent of a useful llms.txt — the rest is curation, entity-rich descriptions, and serving the file at the right path with the right content type.

    The fastest way to get a compliant first draft is the free llms.txt generator on this site — paste your URL, the tool pulls your sitemap and homepage, and an AI composes a spec-compliant file in 30 seconds. Edit before publishing.

    llms.txt vs llms-full.txt: When to Publish Both

    A common companion file, llms-full.txt, is the full-text concatenation of every page listed in llms.txt. It is a convention that grew up alongside the llmstxt.org proposal rather than part of the core spec, and it is designed for AI engines that want to ingest an entire site as a single document without crawling pages individually. Most sites do not need it. Sites that benefit are documentation-heavy products (developer docs, API references, technical knowledge bases) where the AI engine's primary use case is answering "how do I do X with your product" questions and the answer requires reading multiple pages together.

    Decision rule: publish llms-full.txt if the site has 30+ pages of evergreen technical content that an AI engine would routinely need to read together to answer a question. Skip it if the site is a brand site, an agency site, a marketing site, or a blog — for those, llms.txt alone gives the engine what it needs to retrieve the right individual page on demand.

    When publishing both, the convention is to list llms-full.txt as a single bullet under llms.txt's primary section so the engine can find it without crawling. Keep the file under 1 MB; engines with smaller context windows ignore files above that threshold.
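
    As a rough illustration of the concatenation and the size cap, the sketch below joins page bodies into one document and skips pages that would push it past a byte budget. The function name `build_llms_full`, the per-page layout (an H2 title plus a "Source:" line), and the skip-rather-than-truncate policy are all assumptions made for this example; the 1 MB default simply mirrors the guidance above.

```python
def build_llms_full(site_name, pages, max_bytes=1_000_000):
    """Concatenate page bodies into one llms-full.txt-style document.

    pages is a list of (title, url, markdown_body) tuples in priority order.
    Pages that would push the file past max_bytes are skipped and reported.
    """
    parts = [f"# {site_name}\n"]
    size = len(parts[0].encode("utf-8"))
    skipped = []
    for title, url, body in pages:
        chunk = f"\n## {title}\n\nSource: {url}\n\n{body.strip()}\n"
        chunk_size = len(chunk.encode("utf-8"))
        if size + chunk_size > max_bytes:
            skipped.append(url)  # report it so the author can trim or reorder
            continue
        parts.append(chunk)
        size += chunk_size
    return "".join(parts), skipped
```

    Because pages are taken in priority order, listing the most important docs first means they survive when the budget runs out.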

    Real-World llms.txt Examples (And What They Get Right)

    The fastest way to write a good llms.txt is to study the ones already in production at companies whose AI-search results are visibly working — and to copy the structural choices, not the content. Four worth dissecting:

    Anthropic (anthropic.com/llms.txt). The company that builds Claude publishes a minimal, ruthlessly curated file. One H1, one blockquote, three H2 sections (Docs, API, Research), roughly 30 links total. Every description is a short clause that names the topic and the audience. The lesson: even the company writing the consuming model keeps the file under 5 KB.

    Cloudflare (cloudflare.com/llms.txt). Cloudflare's file is heavier on docs and API references because the company's AI-search use case is developer questions. Sections are organized by product (Workers, R2, D1, Pages), each with 10-20 of the highest-traffic doc URLs. The lesson: section structure should match how users ask questions about you, not how your nav is organized.

    Hugging Face (huggingface.co/llms.txt). The model hub publishes a much longer file because its AI-search use case is genuinely "give the engine the whole library." This is one of the rare cases where llms-full.txt also makes sense. The lesson: file length should match the breadth of evergreen content, not a one-size rule.

    Smart Money Media (smartmoneymedia.org/llms.txt). Our own file is the agency-site template: H2 sections for Services, Pillar Guides, Free Tools, and Blog, with 5-15 links each and entity-rich descriptions that name the AEO/GEO/PR concepts each page covers. View it directly to see the agency pattern in action.

    Key Takeaway: Study three or four published files in your category before writing yours — the section structure that wins is the one that matches how prospects actually ask AI engines about you, not the one that mirrors your sitemap.

    How to Test Whether AI Engines Are Actually Reading Your llms.txt

    Publishing the file is only the first half — the second half is verifying that AI engines fetch it, parse it, and surface the URLs it points to when prospects ask brand-related questions. Six tests, in order of effort:

    1. Fetch test. In a private browser window, request https://yourdomain.com/llms.txt. Confirm it returns HTTP 200, the body is your markdown, and the response header is Content-Type: text/markdown or text/plain — not application/octet-stream and not a forced download.
    2. Crawl-log test. Filter your server or CDN access logs for requests to /llms.txt over the last 30 days. Look for user agents such as GPTBot, ClaudeBot, PerplexityBot, CCBot, and Applebot. Google-Extended will not appear here: it is a robots.txt control token, not a user-agent string, and Google fetches with its existing crawler user agents. If you see no AI-crawler hits at all, check whether a WAF or bot rule is blocking them. If so, allowlist those user agents at the CDN layer and place the allowlist skip rule above any country, bot-score, or rate-limit block rule.
    3. Brand-query test in Perplexity. Ask Perplexity a brand-specific question ("What does [your brand] do?" or "Does [your brand] offer [your specific service]?"). Click into the sources. If the URLs cited are pages listed in your llms.txt, the file is influencing retrieval. If Perplexity is citing random scraped pages from your domain, your descriptions are not entity-rich enough.
    4. Brand-query test in Claude. Repeat the same question in Claude with web search enabled. Claude is one of the engines that most actively reads llms.txt, so weak retrieval here usually means a format problem in the file itself.
    5. Site-specific URL test. Ask an engine for a specific URL: "What is the Smart Money Media zero-click marketing guide about?" If the engine retrieves the correct URL with a summary close to the description in your llms.txt, the file is being parsed correctly. If the engine pulls a different URL or hallucinates a summary, your descriptions need rewriting.
    6. Free AI Visibility Audit. Run the free AI Visibility Audit on your domain. The audit explicitly tests whether your llms.txt exists, parses, and is being retrieved by major engines — and surfaces the gaps in the surrounding six AEO/GEO layers.
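
    The first two tests are easy to script. The sketch below is one possible shape, with hypothetical helper names: `check_fetch` evaluates a response you have already fetched, and `count_bot_hits` scans access-log lines. It assumes logs in the common "combined" format and matches crawlers by user-agent substring. Google-Extended is left out of the list because it is a robots.txt token and does not show up as a user agent.

```python
import re
from collections import Counter

# User-agent substrings to look for; extend as needed.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Applebot")
ACCEPTED_TYPES = ("text/markdown", "text/plain")


def check_fetch(status, content_type, body):
    """Return a list of problems with a /llms.txt response (empty means pass)."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type not in ACCEPTED_TYPES:
        problems.append(f"unexpected Content-Type: {media_type or 'missing'}")
    if not body.lstrip().startswith("# "):
        problems.append("body does not start with an H1 line")
    return problems


# Combined log format: ... "GET /path HTTP/1.1" status bytes "referer" "user-agent"
REQUEST_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_bot_hits(log_lines, path="/llms.txt"):
    """Count requests for path per AI crawler, matched by user-agent substring."""
    hits = Counter()
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if not m or m["path"].split("?")[0] != path:
            continue
        for bot in AI_BOTS:
            if bot.lower() in m["ua"].lower():
                hits[bot] += 1
    return hits
```

    The inputs to `check_fetch` can come from `urllib.request.urlopen`, whose response object exposes `status`, `headers.get("Content-Type")`, and `read()`. User-agent strings can be spoofed, so treat the log counts as a signal rather than proof.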

    Key Takeaway: Publishing is not verifying. Run the six-test sequence at least once a quarter — fetch, crawl-log, two brand queries, one URL-specific query, and an automated audit — to confirm the file is doing what you published it for.

    Industry-Specific llms.txt Patterns

    A generic template will produce a generic llms.txt that gets generic results — the highest-performing files are tuned to the questions AI engines actually receive about that specific industry. Four common patterns:

    SaaS products. Lead with ## Docs and ## API sections because developer questions ("how do I integrate X with your product") dominate AI-engine traffic for SaaS brands. Follow with ## Pages (pricing, security, integrations), and list blog posts under ## Optional. Publish llms-full.txt if the doc set is large enough that answering a typical question requires multiple pages.

    Agencies and professional services. Lead with ## Services using one bullet per service with a description that names the deliverables and the target client. Follow with ## Pillar Guides (the proof of expertise), ## Case Studies (the proof of results), and ## About. Skip llms-full.txt — agency content is too varied to concatenate usefully.

    Ecommerce. Lead with ## Categories rather than individual product URLs, because AI engines answering "where do I buy X" cite category and comparison pages far more often than individual SKUs. Follow with ## Buying Guides, ## Brand (sustainability, shipping, returns policy — the trust signals), and ## Optional for the changelog of new arrivals.

    Media and publishing. Lead with ## Pillar Guides and ## Categories. Each category gets one bullet linking to the category page, not the individual articles — let the engine crawl from there. Add a ## Glossary section if the site publishes definitional content; this is high-value retrieval bait for AI engines answering "what is X" questions. Skip llms-full.txt.

    Key Takeaway: Match the section ordering to the AI-engine query types your industry actually receives — developer questions for SaaS, services questions for agencies, category questions for ecommerce, definitional questions for media.

    Common Mistakes That Get llms.txt Ignored

    The most common llms.txt failure is not absence — it is publishing a file that AI engines silently skip because it violates the format or contains low-signal content. The mistakes below recur on roughly 60 percent of the llms.txt files we audit on prospect domains.

    • Relative URLs. Links like - [About](/about): ... fail because the engine reading the file may not know which domain it is on. Every link must be fully qualified: https://yourdomain.com/about.
    • Missing descriptions. A bullet that reads - [Pricing](https://example.com/pricing) with no colon-separated description is parseable as a link but conveys zero relevance signal to the engine.
    • Nested headings. Some auto-generators emit H3 or H4 inside H2 sections. The spec is flat. Engines parsing the file ignore nested headings or, worse, treat the file as malformed and skip it entirely.
    • HTML inside the file. Tables, divs, anchor tags, and inline styling break the parser. llms.txt is markdown, not HTML.
    • Dumping every URL. An llms.txt with 2,000 links is not a curated manifest — it is a sitemap with bad formatting. The whole point is the curation.
    • Stale links. An llms.txt published in 2024 and never updated when the site reorganized its URLs is worse than no file at all because it actively misdirects the engine. Re-audit quarterly.
    • Wrong content-type header. Some hosts serve /llms.txt as application/octet-stream or force a download. The correct header is text/markdown or text/plain; AI engines fetching the URL with a download header may discard the response.
    • Blocked by robots.txt or CDN WAF rules. A Disallow rule in robots.txt or an aggressive bot-blocking rule at the CDN layer (Cloudflare, Akamai, Fastly) can stop AI crawlers from reaching the file. Allowlist GPTBot, ClaudeBot, PerplexityBot, and CCBot by user agent, and make sure the allowlist Skip rule sits above any country-block, bot-score, or rate-limit rule in the WAF evaluation order. Google-Extended is a robots.txt token rather than a user-agent string, so manage it in robots.txt instead.
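
    Several of the mistakes above can be caught mechanically before publishing. The following is a small lint sketch under stated assumptions: `lint_llms_txt` is a hypothetical helper, the HTML check is a crude tag regex, and the default `max_links=40` is borrowed from this guide's 20-40 link guidance rather than from the spec. Stale links, content-type headers, and WAF rules need live checks and are out of scope here.

```python
import re

LINK_RE = re.compile(r"^[-*]\s*\[[^\]]*\]\((?P<url>[^)\s]*)\)(?P<rest>.*)$")
HTML_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def lint_llms_txt(text, max_links=40):
    """Return (line_number, message) pairs; line 0 marks file-level issues."""
    issues = []
    link_count = 0
    if not text.lstrip().startswith("# "):
        issues.append((0, "file does not start with an H1"))
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if re.match(r"^#{3,}\s", line):
            issues.append((number, "nested heading (H3 or deeper)"))
        if HTML_RE.search(line):
            issues.append((number, "HTML tag in file"))
        m = LINK_RE.match(line)
        if not m:
            continue
        link_count += 1
        if not m["url"].startswith(("http://", "https://")):
            issues.append((number, "relative URL"))
        # A description is a colon followed by at least one non-space character.
        if not re.match(r"^\s*:\s*\S", m["rest"]):
            issues.append((number, "missing description"))
    if link_count > max_links:
        issues.append((0, f"{link_count} links; curate to {max_links} or fewer"))
    return issues
```

    An empty result does not mean the file is good, only that it avoids these specific format errors; curation and description quality still need a human read.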

    Key Takeaway: The biggest llms.txt failures are silent — relative URLs, missing descriptions, nested headings, and CDN bot-blocking all cause AI engines to skip the file without telling anyone. Audit the file as if a stranger is reading it.

    How llms.txt Fits Inside a Full AEO and GEO Strategy

    Publishing llms.txt is necessary but not sufficient — the file is one of seven infrastructure layers that determine whether a brand gets cited inside ChatGPT, Perplexity, Claude, and Google AI Overviews answers. The full stack, in priority order:

    1. Tier-1 editorial citations. Earned coverage in publications AI models trust (Forbes, Bloomberg, Reuters, TIME, industry-specific tier-1 outlets) is the single highest-weighted signal across every major AI engine. No technical optimization substitutes for being the brand that journalists at trusted outlets actually cite. See the media placements guide.
    2. Complete schema markup. Organization, Service, FAQPage, Article, and BreadcrumbList schema on every relevant page. Schema is how AI engines confirm the entities on the page match the entities in their knowledge graph.
    3. Wikidata and Knowledge Graph entries. The brand needs an active Wikidata entry with sameAs links to its real social profiles and a populated Google Knowledge Panel. Without these, the engine cannot anchor the brand as a real entity.
    4. llms.txt manifest. The curated retrieval index covered in this guide.
    5. AI-extractable content formatting. KEY TAKEAWAY blocks, definitional sentences immediately under question-based H2 headings, FAQ sections with FAQPage schema, and answer-first writing structure.
    6. Open Graph metadata with branded social cards. AI engines increasingly retrieve OG metadata when summarizing a URL. Branded, accurate OG cards reinforce the entity.
    7. CDN allowlist for AI crawlers. If the AI engine cannot fetch the page, none of the other six layers matter. Allowlist GPTBot, ClaudeBot, PerplexityBot, and CCBot by user agent at the WAF. Google-Extended is a robots.txt token rather than a user-agent string, so control it in robots.txt.
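
    WAF rules cannot be inspected from a script, but the robots.txt side of layer 7 can be checked offline with Python's standard `urllib.robotparser`. The sketch below is illustrative: `blocked_crawlers` is a hypothetical helper, and the token list is an assumption based on the crawlers named in this guide.

```python
from urllib.robotparser import RobotFileParser

# robots.txt product tokens for the AI crawlers discussed in this guide.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot")


def blocked_crawlers(robots_txt, url):
    """Return the crawler tokens that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [t for t in AI_CRAWLER_TOKENS if not parser.can_fetch(t, url)]


robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_crawlers(robots, "https://example.com/llms.txt"))
```

    A crawler that passes this check can still be stopped by a CDN rule, so pair it with the crawl-log test rather than treating it as sufficient.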

    llms.txt sits at layer four because it amplifies the work in the other six layers but does not substitute for any of them. A brand with no editorial citations, no schema, no Wikidata entry, and no AI-extractable content will not be cited regardless of how perfect its llms.txt is. A brand with all six other layers and no llms.txt is leaving compounding upside on the table.

    Key Takeaway: llms.txt is layer four of a seven-layer AEO/GEO stack — necessary but not sufficient. Pair it with tier-1 citations, schema, Wikidata, AI-extractable content, branded OG cards, and a CDN crawler allowlist.

    For the full stack walkthrough, see the Zero-Click Marketing pillar guide. To find out where your own brand stands on all seven layers right now, run the free AI Visibility Audit.
