AI visibility audit prompts are buyer-like questions used to test whether AI answer engines mention, rank, cite, and accurately describe a brand. A good prompt set measures real discovery, comparison, validation, reputation, and citation behavior across ChatGPT, Gemini, Perplexity, Claude, Copilot, Grok, Google AI Mode, and AI Overviews.
The short answer: most B2B teams should start with 60-120 unique prompts for a first audit, 150-300 for a defensible market benchmark, and 400-1,000+ for enterprise, agency, or multi-brand reporting. The right number depends on coverage, competitor density, platform variance, and repeat runs, not keyword volume alone.

What is an AI visibility audit prompt?
An AI visibility audit prompt is a realistic question a buyer, analyst, journalist, investor, or customer might ask an AI system about a category, problem, vendor, comparison, integration, price point, or reputation issue.
It is not just a keyword with a question mark added.
For example, the SEO keyword project management software can become several different AI visibility audit prompts:
| Prompt type | Example | What it tests |
|---|---|---|
| Category discovery | "Best project management software for agencies" | Whether the brand appears in broad recommendations |
| Buyer constraint | "Project management tools for a 50-person agency using HubSpot" | Whether the brand appears when buyer context is added |
| Competitor comparison | "Compare Asana, ClickUp, and Monday for client services teams" | Rank, framing, and competitive positioning |
| Objection research | "Common complaints about ClickUp for agencies" | Reputation, risk language, and sentiment |
| Validation | "What is ClickUp used for?" | Factual accuracy and entity understanding |
This matters because AI search does not behave like a static search results page. Google AI Mode has been described as using a query fan-out approach: one user question can trigger multiple related searches across subtopics before an answer is synthesized. Academic work on generative search also shows that generated answers depend on sources, prompts, platforms, and run timing, not just the literal query string.
If you are building the prompt universe from SEO data, start with the guide on how to build an AI search prompt set from your SEO keywords, then add brand, competitor, problem, integration, reputation, and buyer-segment prompts that may never appear as high-volume keywords.
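As a minimal sketch of that expansion step, the Python below turns one SEO keyword into labeled prompt candidates by combining it with buyer context. The function name `expand_keyword`, the templates, and the sample segments and integrations are illustrative assumptions, not part of any tool. Generated prompts still need a human pass to fix awkward wording and remove near-duplicates.

```python
from itertools import product

def expand_keyword(keyword, segments, integrations):
    """Turn one SEO keyword into (bucket, prompt) candidates.

    The templates are hypothetical examples; adapt them to your category.
    """
    prompts = []
    for segment in segments:
        prompts.append(("category discovery", f"Best {keyword} for {segment}"))
        prompts.append(("evaluation criteria",
                        f"What should {segment} look for in {keyword}?"))
    # Add buyer context: every segment paired with every integration.
    for segment, tool in product(segments, integrations):
        prompts.append(("integration",
                        f"Which {keyword} integrates with {tool} for {segment}?"))
    return prompts

candidates = expand_keyword(
    "project management software",
    segments=["agencies", "client services teams"],
    integrations=["HubSpot"],
)
for bucket, prompt in candidates:
    print(f"{bucket}: {prompt}")
```

Labeling each candidate with its bucket at creation time makes it easier to check later whether the set is balanced.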
What people searching this topic actually need
Someone searching for "AI visibility audit prompts" usually wants more than examples. They are trying to answer five practical questions:
- How many prompts are enough to trust an AI visibility audit?
- Which prompt categories should be included?
- How should prompts be split across branded, non-branded, competitor, and citation tests?
- How many platforms and repeat runs are needed?
- How should the output be scored so the audit leads to SEO, content, PR, and brand actions?
Most AI visibility and generative engine optimization guides cover definitions, AI search monitoring, mentions, citations, and tools. The missing piece is sample design: how to build a prompt set that is broad enough to be useful without becoming an expensive pile of duplicated questions.
This guide uses the maxaeo Prompt Sample Size Model: a practical framework for sizing AI visibility audit prompts by buyer coverage, competitor density, platform variance, and repeat-run stability.
How many AI visibility audit prompts do you need?
Use 60-120 unique prompts for a directional audit, 150-300 for a board-ready benchmark, and 400-1,000+ for multi-brand tracking. Then multiply by platforms and repeat runs.
A 150-prompt audit across six AI platforms with two repeat runs creates:
150 unique prompts x 6 platforms x 2 runs = 1,800 answer observations
| Audit type | Best for | Unique prompts | Platforms | Repeat runs | Total observations |
|---|---|---|---|---|---|
| Quick manual check | Confirm whether a visibility problem exists | 20-40 | 2-3 | 1 | 40-120 |
| Snapshot audit | Startup, narrow product, early GEO baseline | 40-60 | 3-4 | 1-2 | 120-480 |
| Standard B2B audit | One brand, one core category, 3-8 competitors | 80-150 | 5-7 | 2 | 800-2,100 |
| Mid-market benchmark | Multiple buyer segments, products, or regions | 150-300 | 6-8 | 2-3 | 1,800-7,200 |
| Enterprise audit | Many categories, markets, and competitor sets | 300-600 | 6-8 | 3-5 | 5,400-24,000 |
| Agency or portfolio audit | Multiple brands with separate categories | 400-1,000+ | 5-8 | 2-4 | 4,000-32,000+ |
These are planning ranges, not magic numbers. A narrow developer tool with three direct competitors may learn more from 80 well-designed prompts than from 500 generic prompts. A crowded CRM, cybersecurity, HR, finance, or AI software category may need 300 prompts before the pattern becomes stable.
The practical test: after every 25 new prompts, check how many new competitors, citation domains, and brand descriptions appear. If the next 25 prompts still reveal materially new patterns, the sample is too small. If they mostly repeat the same winners, sources, and failure modes, the sample is stabilizing.
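The saturation check above can be sketched in a few lines of Python. This is an illustration under stated assumptions: `share_revealing_new` is a hypothetical helper, each prompt is represented only by the set of competitors (or citation domains) its answer surfaced, and the vendor names are placeholders.

```python
def share_revealing_new(items_per_prompt, batch_size=25):
    """Per batch, the share of prompts that surfaced at least one unseen item."""
    seen = set()
    shares = []
    for start in range(0, len(items_per_prompt), batch_size):
        batch = items_per_prompt[start:start + batch_size]
        revealing = 0
        for items in batch:
            fresh = set(items) - seen
            if fresh:
                revealing += 1
                seen |= fresh
        shares.append(revealing / len(batch))
    return shares

# Toy data: competitors named in each answer, in the order prompts were added.
# batch_size=4 keeps the example short; the check described above uses 25.
competitors_per_prompt = [
    {"VendorA", "VendorB"}, {"VendorA", "VendorC"}, {"VendorB"}, {"VendorD"},
    {"VendorA"}, {"VendorB", "VendorC"}, {"VendorE"}, {"VendorA", "VendorD"},
]
print(share_revealing_new(competitors_per_prompt, batch_size=4))  # [0.75, 0.25]
```

A falling share from batch to batch suggests the sample is stabilizing; a flat or rising share suggests more prompts are needed.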
The Prompt Sample Size Model
Calculate AI visibility audit prompts from four inputs:
- Coverage: which buyer situations must be represented?
- Competitor density: how many plausible vendors can appear for the same question?
- Market complexity: how many products, regions, industries, and buyer segments matter?
- Repeat runs: how much answer instability must be smoothed?
Coverage (intent buckets x prompts per bucket), competitor density, and market complexity set the number of unique prompts:
Unique prompts =
intent buckets x prompts per bucket x competitor-density multiplier x market-complexity multiplier
Then calculate total observations:
Total observations =
unique prompts x platforms x repeat runs
For most B2B SaaS audits, a useful starting point is:
10 intent buckets x 10 prompts per bucket x 1.5 density x 1.0 complexity = 150 prompts
That model keeps the audit tied to business reality. You are not asking, "How many prompts can we afford?" You are asking, "How many buyer situations do we need to observe before we can defend the pattern?"
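For readers who prefer to plan in a script or notebook, here is a minimal Python sketch of the two formulas. The function names are illustrative, and rounding the prompt count up to a whole number is an assumption the model itself does not specify.

```python
import math

def unique_prompts(intent_buckets, prompts_per_bucket, density=1.0, complexity=1.0):
    """Unique prompts = buckets x prompts per bucket x density x complexity."""
    # Rounding up is an assumption; fractional prompts are not meaningful.
    return math.ceil(intent_buckets * prompts_per_bucket * density * complexity)

def total_observations(prompts, platforms, repeat_runs):
    """Total observations = unique prompts x platforms x repeat runs."""
    return prompts * platforms * repeat_runs

prompts = unique_prompts(10, 10, density=1.5, complexity=1.0)
print(prompts)                                                  # 150
print(total_observations(prompts, platforms=6, repeat_runs=2))  # 1800
```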
Step 1: Build prompt coverage before adding volume
Prompt coverage is the percentage of important buyer questions represented in your audit. A high prompt count with poor coverage creates false confidence. A smaller, balanced set is usually more useful.
Start with these buckets:
| Prompt bucket | What it measures | Example prompt pattern | Suggested share |
|---|---|---|---|
| Category discovery | Whether the brand appears in broad recommendations | "Best [category] tools for [use case]" | 10-15% |
| Problem-solution | Whether AI connects the brand to the pain point | "How do I solve [problem] in [team type]?" | 10-15% |
| Shortlist creation | Whether the brand is recommended before it is named | "Which vendors should I shortlist for [need]?" | 10-15% |
| Competitor comparison | Rank and framing against known alternatives | "Compare [brand] vs [competitor] for [segment]" | 15-20% |
| Evaluation criteria | Whether the right buying factors are associated with the category | "What should I look for in [category] software?" | 5-10% |
| Integration and ecosystem | Whether technical compatibility is surfaced | "Which [category] tools integrate with [platform]?" | 10-15% |
| Pricing and packaging | Whether plans and affordability are described accurately | "Affordable [category] tools for [company size]" | 5-10% |
| Risk and reputation | Negative framing, objections, and AI reputation management | "Common complaints about [brand]" | 5-10% |
| Citation discovery | Which sources shape answer engine responses | "Sources comparing [category] platforms" | 5-10% |
| Branded validation | Entity accuracy and factual brand descriptions | "What is [brand] used for?" | 5-10% |
A simple rule: no bucket should exceed 20% of the prompt set unless it maps to the main buying motion. If half your audit is "best tools" prompts, you are measuring listicle visibility, not AI search visibility.
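The 20% cap is easy to check automatically once each prompt carries a bucket label. The sketch below assumes a simple dictionary of prompt counts per bucket; `bucket_shares` and `over_cap` are hypothetical helper names, and the skewed counts are made up for illustration.

```python
def bucket_shares(bucket_counts):
    """Return each bucket's share of the whole prompt set."""
    total = sum(bucket_counts.values())
    if total == 0:
        return {}
    return {bucket: count / total for bucket, count in bucket_counts.items()}

def over_cap(bucket_counts, cap=0.20):
    """List buckets whose share exceeds the cap (20% by default)."""
    shares = bucket_shares(bucket_counts)
    return sorted(bucket for bucket, share in shares.items() if share > cap)

# A deliberately skewed set: half the prompts are "best tools" style.
skewed = {
    "category discovery": 60,
    "competitor comparison": 20,
    "problem-solution": 15,
    "branded validation": 15,
    "pricing and packaging": 10,
}
print(over_cap(skewed))  # ['category discovery']
```

A flagged bucket is not automatically wrong, since the rule allows an exception for the main buying motion, but it should be a deliberate choice.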
For a deeper split between branded and non-branded testing, use the companion framework on branded vs non-branded prompts for AI recommendations.
Step 2: Separate branded, non-branded, and competitor prompts
Branded, non-branded, and competitor prompts answer different questions. Mixing them into one score makes the audit look cleaner than it is.
| Prompt class | Example | Use it to measure | Do not use it for |
|---|---|---|---|
| Non-branded | "Best AI visibility tools for B2B SaaS" | Discovery, category association, AI share of voice | Brand accuracy |
| Branded | "What is maxaeo used for?" | Entity clarity, factual accuracy, positioning | Competitive discovery |
| Competitor | "Compare maxaeo vs [competitor]" | Rank, differentiation, objections | Total market visibility |
| Source-led | "Which reports compare AI search monitoring tools?" | Citation opportunities and source influence | Buyer demand size |
| Negative/reputation | "Common complaints about [brand]" | Risk language and sentiment | Top-of-funnel demand |
For AI share of voice, use mostly non-branded and competitor prompts. For reputation and factual accuracy, use branded prompts. For citation strategy, use source-led prompts.
A defensible audit reports these separately:
| Metric | Best prompt source |
|---|---|
| AI share of voice | Non-branded and competitor prompts |
| Brand accuracy | Branded validation prompts |
| Recommendation rank | Category, shortlist, and comparison prompts |
| Sentiment | Branded, competitor, and reputation prompts |
| Citation influence | Citation discovery and all cited-answer prompts |
| Fix priority | Any prompt with repeated absence, errors, or negative framing |
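To illustrate why these metrics are reported separately, the Python sketch below computes brand mention rate per prompt class and compares it with a blended rate. The row layout (`prompt_class`, `brand_mentioned`) and the sample rows are assumptions made for this example.

```python
from collections import defaultdict

def mention_rate_by_class(rows):
    """Brand mention rate for each prompt class, reported separately."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["prompt_class"]] += 1
        hits[row["prompt_class"]] += int(row["brand_mentioned"])
    return {cls: hits[cls] / totals[cls] for cls in totals}

rows = [
    {"prompt_class": "non-branded", "brand_mentioned": True},
    {"prompt_class": "non-branded", "brand_mentioned": False},
    {"prompt_class": "non-branded", "brand_mentioned": False},
    {"prompt_class": "non-branded", "brand_mentioned": False},
    {"prompt_class": "branded", "brand_mentioned": True},
    {"prompt_class": "branded", "brand_mentioned": True},
]
rates = mention_rate_by_class(rows)
blended = sum(r["brand_mentioned"] for r in rows) / len(rows)
print(rates)    # {'non-branded': 0.25, 'branded': 1.0}
print(blended)  # 0.5
```

In this toy data the blended 50% hides the fact that the brand appears in only one of four non-branded answers, which is the number that reflects discovery.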
Step 3: Adjust for competitor density
Competitor density is the number of plausible brands an AI answer could reasonably recommend for the same prompt. The denser the category, the more prompts you need, because small prompt changes can rotate different vendors into the answer.
Use this scoring method:
| Competitor density | Signals | Multiplier |
|---|---|---|
| Low | 1-3 serious competitors, niche use case, clear category language | 1.0 |
| Medium | 4-8 competitors, overlapping positioning, mixed buyer terms | 1.5 |
| High | 9-20 competitors, many review pages, many comparison pages | 2.0 |
| Very high | More than 20 competitors, marketplace category, heavy affiliate content | 2.5 |
Competitor density is not just the number of companies you already track. It is the number of companies an AI system could reasonably place into the answer.
For example, an AI visibility tool may compete in prompts about:
- AI visibility audits
- AI search monitoring
- answer engine optimization
- generative engine optimization
- AI share of voice
- LLM brand tracking
- citation monitoring
- brand reputation in AI search
- SEO platform workflows
- PR and earned media analytics
That broader competitive set should expand the prompt sample, because AI engines often blend categories that sales teams keep separate.
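One way to estimate density from data rather than from the sales team's competitor list is to count the brands that actually recur in answers. The sketch below is illustrative only: `plausible_competitors`, the `min_answers` cutoff of 2, and the placeholder vendor names are assumptions, while the multipliers follow the scoring table above.

```python
from collections import Counter

def plausible_competitors(answers, min_answers=2):
    """Brands that appear in at least `min_answers` answers (cutoff is an assumption)."""
    counts = Counter(brand for brands in answers for brand in set(brands))
    return {brand for brand, n in counts.items() if n >= min_answers}

def density_multiplier(competitor_count):
    """Map a competitor count to the multiplier from the scoring table."""
    if competitor_count <= 3:
        return 1.0
    if competitor_count <= 8:
        return 1.5
    if competitor_count <= 20:  # exactly 20 is treated as High here (an assumption)
        return 2.0
    return 2.5

answers = [
    ["VendorA", "VendorB", "VendorC"],
    ["VendorA", "VendorD"],
    ["VendorB", "VendorA"],
    ["VendorC", "VendorE"],
    ["VendorF"],
]
recurring = plausible_competitors(answers)
print(sorted(recurring))                   # ['VendorA', 'VendorB', 'VendorC']
print(density_multiplier(len(recurring)))  # 1.0
```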
Step 4: Adjust for platform variance
Platform variance is how differently AI engines answer the same prompt. If ChatGPT, Gemini, Perplexity, Claude, Copilot, Grok, Google AI Mode, and AI Overviews cite different sources and recommend different brands, you need more observations.
Recent research supports this caution:
| Finding | Why it matters for audit design |
|---|---|
| The original GEO paper reported visibility gains of up to 40% in generative responses, with results varying by domain. | Optimization effects are real, but category-specific. Audit design should not assume one tactic works everywhere. |
| A 2026 empirical study of 11,500 Google Search, AI Overview, and Gemini queries found AI Overviews appeared for 51.5% of representative queries and that source overlap across systems was below 0.2 average Jaccard similarity. | Google rankings, AI Overviews, and Gemini should not be averaged together too early. |
| A 2026 longitudinal study of 55,393 AI Overview queries found 13.7% overall AIO activation, rising to 64.7% for question-form queries. It also found nearly 30% of AIO-cited domains did not appear in co-displayed first-page results. | Question prompts and citation tracking matter because AI source selection can diverge from classic SEO rankings. |
| A Stanford study on generative search verifiability found that generated answers can include unsupported statements and imperfect citations. | Audits should score citation support and factual accuracy, not just brand mentions. |
The takeaway: do not collapse platforms into one average until you have inspected platform-level behavior. A brand may perform well in Perplexity because it is cited by comparison pages, weakly in ChatGPT because of older entity associations, and inconsistently in AI Overviews because activation and source selection vary by query form.
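Source overlap between platforms can be inspected with the same Jaccard similarity measure mentioned in the research table: shared cited domains divided by all cited domains. The sketch below assumes you have already collected the set of cited domains per platform; the function names and the `.example` domains are placeholders.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical (a convention)
    return len(a & b) / len(a | b)

def pairwise_overlap(domains_by_platform):
    """Jaccard similarity of cited domains for every pair of platforms."""
    platforms = sorted(domains_by_platform)
    return {
        (p, q): jaccard(domains_by_platform[p], domains_by_platform[q])
        for p, q in combinations(platforms, 2)
    }

cited = {
    "ChatGPT": {"reviews.example", "vendor.example", "forum.example"},
    "Perplexity": {"reviews.example", "news.example"},
    "Gemini": {"vendor.example"},
}
for pair, score in pairwise_overlap(cited).items():
    print(pair, round(score, 2))
```

Low pairwise scores are a signal to report platforms separately rather than averaging them.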
Use this platform plan:
| Audit goal | Minimum platforms | Recommended platforms |
|---|---|---|
| Early baseline | 3 | ChatGPT, Perplexity, Gemini |
| B2B SaaS benchmark | 5-6 | ChatGPT, Gemini, Perplexity, Claude, Copilot, Google AI Overviews |
| Executive reporting | 6-8 | Add Grok and Google AI Mode where relevant |
| Agency reporting | 5-8 | Match platforms to each client's buyer behavior |
If the same 100 prompts produce very different winners by platform, increase repeat runs before adding hundreds of new prompts.
Step 5: Decide how many repeat runs you need
Repeat runs are repeated executions of the same prompt on the same platform. They measure answer instability.
For a quick check, one run may be enough. For a serious audit, run each prompt at least twice. For budget decisions or competitive claims, use three to five runs.
| Decision you will make from the audit | Repeat runs |
|---|---|
| "Do we appear at all?" | 1-2 |
| "Which themes are missing?" | 2 |
| "Are we ahead of competitor X?" | 3 |
| "Should we shift budget to GEO or AEO?" | 3-5 |
| "Can an agency report this to clients monthly?" | 3-5 |
| "Can we claim category leadership?" | 5+, with confidence intervals |
A good AI search monitoring workflow separates two numbers:
- Unique prompt coverage: how much of the buyer journey you tested.
- Observation count: how much confidence you have in the measured pattern.
A 120-prompt audit with six platforms and two runs is not "120 results." It is 1,440 observations.
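Where the table calls for confidence intervals, one common choice is the Wilson score interval for a proportion such as mention rate. The sketch below is an illustration, not a prescribed method: it treats every observation as independent, which repeat runs of the same prompt are not, so the true uncertainty is likely wider than the interval shown.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical result: brand mentioned in 36 of 120 non-branded observations.
low, high = wilson_interval(36, 120)
print(f"mention rate 30%, interval {low:.1%} to {high:.1%}")
```

If two brands' intervals overlap heavily, a claim that one is "ahead" is probably not yet defensible.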
A defensible starting prompt set for B2B SaaS
For most B2B SaaS and tech companies, start with 120 unique AI visibility audit prompts, six platforms, and two runs.
120 prompts x 6 platforms x 2 runs = 1,440 observations
That is large enough to find real patterns, small enough to review manually, and structured enough to repeat monthly.
| Bucket | Prompts |
|---|---|
| Category discovery | 15 |
| Problem-solution | 15 |
| Shortlist creation | 15 |
| Competitor comparisons | 20 |
| Integrations and ecosystem | 15 |
| Evaluation criteria | 10 |
| Pricing and packaging | 10 |
| Branded validation | 10 |
| Reputation and objections | 10 |
| Total | 120 |
After the first run, expand only where the data demands it:
| What you see | What to add |
|---|---|
| New competitors keep appearing | More category and shortlist prompts |
| Branded prompts reveal factual errors | More entity, product, and use-case variations |
| Citations cluster around review sites | More source-led and comparison prompts |
| Results vary sharply by platform | More repeat runs before adding new buckets |
| One segment behaves differently | Segment-specific prompts for industry, size, or region |
| Negative sentiment repeats | More reputation, review, and objection prompts |
A strong audit is not the biggest prompt list. It is the smallest prompt list that can defend a decision.
Worked example: 144-prompt B2B visibility audit
Assume a company sells workflow automation software to RevOps teams. Its SEO keyword list has 900 terms, but only 75 map cleanly to AI-style buyer questions. The rest need to be expanded into problem, comparison, integration, and validation prompts.
A practical audit design could use 144 unique prompts across six platforms with two runs, producing 1,728 observations.
| Bucket | Prompts |
|---|---|
| Category discovery | 18 |
| Problem-solution | 18 |
| Competitor comparison | 24 |
| Shortlist creation | 18 |
| Evaluation criteria | 12 |
| Integrations | 18 |
| Pricing and packaging | 12 |
| Risk and reputation | 12 |
| Branded validation | 12 |
| Total | 144 |
Run those prompts across ChatGPT, Gemini, Perplexity, Claude, Copilot, and Google AI Overviews. Track each answer for brand mention, rank position, sentiment, cited source, citation quality, competitor mentions, and factual accuracy.
The output should answer five budget-level questions:
- How often is the brand mentioned when buyers ask category and shortlist questions?
- Which competitors dominate non-branded prompts?
- Which sources produce the most AI citations?
- What inaccurate or weak descriptions appear repeatedly?
- Which content, PR, review, partner, or schema fixes should be prioritized?
For measurement structure across engines, see the guide on how to measure AI search visibility across ChatGPT, Gemini, Perplexity, and Google AI Overviews.
What each prompt should record
Each observation should record more than "mentioned" or "not mentioned." A useful audit captures visibility, rank, description, citation, sentiment, and recommended action in the same row.
| Field | Why it matters |
|---|---|
| Prompt | The exact buyer-like question tested |
| Prompt bucket | The intent category |
| Prompt class | Branded, non-branded, competitor, source-led, or reputation |
| Platform | ChatGPT, Gemini, Perplexity, Claude, Copilot, Grok, AI Mode, or AI Overview |
| Run number | Repeat-run tracking |
| Date and location | Important for time-sensitive and localized answers |
| Brand mentioned | Basic AI share of voice input |
| Brand rank | Position in the AI-generated shortlist |
| Competitors mentioned | Competitive set actually surfaced by the engine |
| Description accuracy | Whether the brand is described correctly |
| Sentiment | Positive, neutral, negative, mixed, or absent |
| Citations | URLs, domains, or cited entities used in the answer |
| Citation type | Owned site, review site, analyst page, news, forum, partner, marketplace |
| Claim support | Whether cited pages support the answer's claims |
| Fix recommendation | Content, PR, schema, reviews, positioning, or source correction |
This is where an AI visibility tool becomes more useful than manual prompt checking. Manual checks can find anecdotes. A structured system can show whether brand mentions in ChatGPT are improving, whether Perplexity cites third-party reviews more than owned pages, and whether AI share of voice changes after content updates.
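If you are logging results in code rather than a spreadsheet, the fields above map naturally onto one record per observation. The Python dataclass below is a sketch: the class name, field names, types, and default values are assumptions chosen for illustration, not a required schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Observation:
    """One answer from one platform for one prompt on one run."""
    prompt: str
    prompt_bucket: str
    prompt_class: str            # branded, non-branded, competitor, source-led, reputation
    platform: str
    run_number: int
    run_date: date
    location: str
    brand_mentioned: bool
    brand_rank: Optional[int] = None              # None if absent or unranked
    competitors_mentioned: List[str] = field(default_factory=list)
    description_accuracy: str = "not assessed"
    sentiment: str = "absent"                     # positive, neutral, negative, mixed, absent
    citations: List[str] = field(default_factory=list)
    citation_type: str = ""
    claim_support: Optional[bool] = None          # None until a reviewer checks the source
    fix_recommendation: str = ""

example = Observation(
    prompt="Best AI visibility tools for B2B SaaS",
    prompt_bucket="category discovery",
    prompt_class="non-branded",
    platform="Perplexity",
    run_number=1,
    run_date=date(2026, 1, 15),   # placeholder date
    location="US",
    brand_mentioned=False,
)
print(example.sentiment, example.competitors_mentioned)  # absent []
```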
How to score AI visibility audit prompts
Do not rely on one metric. A brand mention can be weak, negative, inaccurate, or uncited.
Use a simple scoring model:
| Score component | Question | Example scoring |
|---|---|---|
| Mention | Did the brand appear? | 1 if mentioned, 0 if absent |
| Rank | Where did it appear? | 5 for first, 3 for top three, 1 for lower mention |
| Context | Was it recommended, merely listed, or criticized? | Positive, neutral, negative, mixed |
| Accuracy | Was the description correct? | Accurate, partly accurate, wrong, stale |
| Citation | Was the brand or claim supported by a source? | Owned, third-party, unsupported, no citation |
| Source quality | Was the source credible and relevant? | High, medium, low |
| Fixability | Can the problem be influenced? | Content, PR, reviews, schema, product, not actionable |
For executive reporting, separate the score into four views:
| View | Best metric |
|---|---|
| Discovery | Non-branded mention rate and top-three rate |
| Competitiveness | Share of voice against named competitors |
| Trust | Citation quality and claim support |
| Brand risk | Inaccuracy rate and negative-sentiment rate |
This prevents a common reporting problem: a brand looks visible because it is mentioned often, but the actual answers rank competitors higher, cite weak sources, or describe the product incorrectly.
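The rank component and the discovery view can be sketched as follows. The point values come from the example scoring table; awarding 0 points when the brand is absent, the row layout, and the sample data are assumptions. For simplicity, a missing rank is treated as "not mentioned," so unranked mentions would need a separate field in practice.

```python
def rank_points(rank):
    """5 for first, 3 for top three, 1 for a lower mention, 0 if absent (assumption)."""
    if rank is None:
        return 0
    if rank == 1:
        return 5
    if rank <= 3:
        return 3
    return 1

def discovery_view(rows):
    """Non-branded mention rate and top-three rate."""
    non_branded = [r for r in rows if r["prompt_class"] == "non-branded"]
    if not non_branded:
        return None
    n = len(non_branded)
    mentioned = sum(1 for r in non_branded if r["rank"] is not None)
    top_three = sum(1 for r in non_branded if r["rank"] is not None and r["rank"] <= 3)
    return {"mention_rate": mentioned / n, "top_three_rate": top_three / n}

rows = [
    {"prompt_class": "non-branded", "rank": 1},
    {"prompt_class": "non-branded", "rank": 4},
    {"prompt_class": "non-branded", "rank": None},
    {"prompt_class": "non-branded", "rank": 2},
    {"prompt_class": "branded", "rank": 1},
]
print([rank_points(r["rank"]) for r in rows])  # [5, 1, 0, 3, 5]
print(discovery_view(rows))  # {'mention_rate': 0.75, 'top_three_rate': 0.5}
```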
How to avoid a misleading audit
A misleading audit usually has one of seven problems: too few prompts, too many near-duplicates, no repeat runs, no competitor tracking, mixed branded and non-branded scoring, no source analysis, or no action mapping.
Avoid these mistakes:
- Do not use only SEO head terms.
- Do not turn every keyword into the same "best [category]" prompt.
- Do not mix branded and non-branded prompts without labeling them.
- Do not average platforms before checking platform-level variance.
- Do not treat one answer as the truth.
- Do not count a mention as positive if the answer says the brand is outdated, expensive, risky, or not a fit.
- Do not ignore citations, because source influence often explains why a competitor is recommended.
- Do not optimize only for AI if the change weakens the page for human buyers.
Google's people-first content guidance is still relevant: original information, substantial analysis, clear sourcing, and value beyond competing pages remain important. That same discipline helps answer engine optimization because AI systems need extractable, consistent, well-supported facts.
For citation-specific work, pair the audit with the guide on how AI search citations are chosen and what brands can influence.
How to turn prompt results into SEO and GEO actions
The audit is only useful if every finding maps to a fix. Group actions by failure pattern, not by the platform where you first noticed the issue.
| Failure pattern | What it usually means | Fix |
|---|---|---|
| Brand absent from category prompts | Weak topical association | Build category, use-case, and comparison pages; earn third-party mentions |
| Brand appears only in branded prompts | Low non-branded discovery | Publish problem-solution content and earn category citations |
| Brand mentioned but ranked low | Competitors have stronger proof or clearer positioning | Add evidence, customer segments, integrations, and differentiated claims |
| Brand described incorrectly | Entity confusion or stale sources | Update owned facts, schema, profiles, review pages, and PR boilerplate |
| Competitor dominates citations | Their sources are more retrievable or trusted | Create citation-worthy assets and pitch neutral comparison sources |
| Negative sentiment appears repeatedly | Reputation or review issue | Investigate source patterns and coordinate content, comms, support, and customer proof |
| Platform results conflict | High platform variance | Track separately and prioritize platforms by buyer usage |
| Citation exists but does not support the claim | Source-answer mismatch | Improve claim clarity and strengthen supporting pages |
This is the difference between LLM brand tracking and a useful AI reputation management workflow. Tracking tells you what happened. Diagnosis tells SEO, content, brand, PR, and product marketing teams what to fix.
When to expand beyond 120 prompts
Start with 120 prompts when the market is typical: one brand, one core category, and a handful of direct competitors. Expand when the audit shows instability or incomplete coverage.
| Expansion trigger | What it means | Add |
|---|---|---|
| More than 20% of the last 25 prompts reveal new competitors | Competitor set is not saturated | 25-50 category and shortlist prompts |
| More than 20% reveal new citation domains | Source universe is not saturated | 25-50 citation and source-led prompts |
| One platform disagrees with all others | Platform behavior is materially different | Repeat runs on that platform |
| One segment has different winners | Buyer context changes results | Segment-specific prompts |
| Factual errors repeat across branded prompts | Entity understanding is weak | Branded validation variants |
| Reputation prompts show recurring negative framing | Brand risk is real | Objection, review, and complaint prompts |
The maxaeo rule of thumb: add new prompts when coverage is incomplete; add repeat runs when answers are unstable. Do not solve instability by adding more loosely related prompts.
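That rule of thumb can be written as a small decision helper. In the sketch below, the 20% coverage threshold comes from the trigger table above, while the 25% instability threshold, the function names, and the sample values are assumptions chosen only to make the example concrete.

```python
def disagreement_share(runs_by_prompt):
    """Share of prompts whose repeat runs disagree on whether the brand was mentioned."""
    if not runs_by_prompt:
        return 0.0
    unstable = sum(1 for runs in runs_by_prompt.values() if len(set(runs)) > 1)
    return unstable / len(runs_by_prompt)

def next_step(new_competitor_share, new_domain_share, disagreement,
              coverage_threshold=0.20, instability_threshold=0.25):
    """Suggest whether to add repeat runs, add prompts, or hold steady."""
    actions = []
    if disagreement > instability_threshold:  # threshold is an assumption
        actions.append("add repeat runs")
    if new_competitor_share > coverage_threshold:
        actions.append("add 25-50 category and shortlist prompts")
    if new_domain_share > coverage_threshold:
        actions.append("add 25-50 citation and source-led prompts")
    return actions or ["hold the prompt set steady"]

# Brand-mentioned flags for two runs of each prompt (toy data).
runs = {"p1": [True, False], "p2": [True, True], "p3": [False, False], "p4": [False, True]}
instability = disagreement_share(runs)
print(instability)  # 0.5
print(next_step(new_competitor_share=0.08, new_domain_share=0.24, disagreement=instability))
# ['add repeat runs', 'add 25-50 citation and source-led prompts']
```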
Frequently asked questions
Are 20 AI visibility audit prompts enough?
Twenty prompts are enough for a quick manual check, but not enough for a serious audit. Use 20 prompts only to confirm whether a visibility problem exists. For a baseline that informs content, PR, or budget decisions, use 60-120 prompts, and never fewer than 60.
Should I use the same prompts across every AI platform?
Yes. Keep a core set consistent across platforms so you can compare results. You can add platform-specific prompts later, but the baseline should use the same wording, schedule, and scoring fields.
How often should we rerun an AI visibility audit?
Run a full audit monthly if AI search is an active channel. For fast-moving categories, track a smaller weekly pulse set of 25-50 prompts. Daily monitoring is useful for brand-critical prompts, launches, reputation issues, and agency reporting.
Should branded prompts count in AI share of voice?
Track branded prompts separately. They measure accuracy and reputation, not discovery. Non-branded and competitor prompts are better for competitive AI share of voice because they show whether answer engines recommend your brand before the user names it.
Can I use SEO keywords as AI visibility audit prompts?
Use SEO keywords as inputs, not as the final prompt set. Convert them into buyer questions with context: use case, segment, integration, budget, geography, industry, risk, and comparison criteria.
What is the biggest prompt sampling mistake?
The biggest mistake is over-sampling generic "best tools" prompts and under-sampling buyer context. Real AI searches include constraints: company size, integrations, budget, migration risk, geography, industry, and evaluation criteria.
What is a good first prompt count for B2B SaaS?
A good first benchmark is 120 unique AI visibility audit prompts across six platforms with two runs, producing 1,440 observations. Expand after the first audit only where the data shows missing coverage or unstable answers.
This article was created with AI assistance and reviewed by a human editor.