An AI search visibility baseline is a repeatable measurement of how often AI answer engines mention, recommend, rank, cite, and accurately describe your brand for the buyer questions that matter. It documents prompts, platforms, competitors, sources, sentiment, and errors before GEO work begins, so later gains can be measured against a defensible starting point.
That matters because AI search visibility is not a single ranking. It is a pattern across prompts, platforms, sources, and answer behavior. Without a baseline, teams often rewrite pages, chase citations, or pitch PR without knowing whether the real problem is weak category association, missing source evidence, competitor dominance, entity confusion, or negative brand framing.

What Is an AI Search Visibility Baseline?
An AI search visibility baseline is the documented “before” state of your brand in AI-generated answers. It records where your brand appears, where it is absent, which competitors are recommended, which sources are cited, and which facts AI systems get wrong before you start generative engine optimization.
A useful baseline answers six questions:
- Discovery: Does the brand appear when buyers ask non-branded category, problem, and comparison questions?
- Position: Is the brand mentioned casually, recommended as a serious option, or ranked in a shortlist?
- Competition: Which competitors appear more often, higher, or with stronger evidence?
- Evidence: Which URLs, domains, reviews, directories, documentation, or third-party sources are cited?
- Accuracy: Are product category, features, pricing, integrations, market focus, and customer claims correct?
- Risk: Do answers repeat outdated, negative, or misleading narratives?
A baseline is not one person asking ChatGPT five questions. It is a controlled audit with a frozen prompt set, defined platforms, repeated checks, consistent scoring, and preserved evidence.
Why a Baseline Is Different From an AI Visibility Check
A one-off AI visibility check tells you what appeared once. A baseline tells you what pattern is stable enough to act on.
| Activity | Output | Main weakness |
|---|---|---|
| One-off ChatGPT check | A screenshot or anecdote | Too volatile to guide budget |
| Brand mention tracking | Count of mentions by platform | Misses recommendations, ranking, citations, and accuracy |
| Citation check | URLs used in answers | Misses whether the brand is actually recommended |
| AI search visibility baseline | Prompt, platform, competitor, citation, accuracy, and sentiment dataset | Requires upfront measurement discipline |
Research supports this approach. The 2026 paper “Don’t Measure Once: Measuring Visibility in AI Search” argues that AI visibility should be treated as a distribution across runs, prompts, and time, not as a single fixed result. Another study, “Quantifying Uncertainty in AI Visibility,” found that citation visibility can vary enough that single-run estimates create misleading precision.
The practical takeaway: measure patterns, not screenshots.
What Your Baseline Must Measure
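A strong AI search visibility baseline has five core layers: buyer intent, prompt variants, AI platforms, competitor presence, and source evidence. The table below adds three supporting fields (brand entity, answer outcome, and narrative) that make those layers comparable across runs. If one layer is missing, the baseline will tell you less than you think.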
| Baseline layer | What to record | Why it matters |
|---|---|---|
| Brand entity | Official name, product names, old names, acquired brands, founder names, category labels | Prevents entity confusion and wrong descriptions |
| Buyer intents | Problem, category, comparison, integration, pricing, risk, implementation, and branded questions | Maps visibility to real demand |
| Prompt variants | 3-5 paraphrases per intent | Reduces overreliance on one wording |
| Platforms | ChatGPT, Gemini, Perplexity, Claude, Copilot, Grok, Google AI Mode, Google AI Overviews | Captures platform-specific behavior |
| Competitors | Direct, adjacent, incumbent, AI-native, open-source, marketplace, and “wrong” competitors | Shows how AI systems frame the category |
| Answer outcome | Absent, mentioned, recommended, ranked, cited, misdescribed, negative | Turns raw answers into comparable fields |
| Source evidence | URLs, domains, source type, freshness, authority, and fixability | Reveals what to create, update, or earn |
| Narrative | Plain-language summary of how the brand is described | Connects visibility to reputation and positioning |
According to Google Search Central’s AI features guidance, AI Overviews and AI Mode can use query fan-out, issuing multiple related searches across subtopics and data sources to generate a response. The same guidance says standard SEO fundamentals remain relevant, and that a page must be indexed and eligible for snippets to appear as a supporting link in these features.
That is why a baseline should measure both answer presence and the source layer behind the answer.
The MaxAEO Baseline Framework
Use the PACES framework (Prompts, Answers, Competitors, Evidence, Sentiment) to keep the baseline practical:
| Element | Question | Output |
|---|---|---|
| Prompts | What buyer questions should trigger your brand? | Prompt clusters and variants |
| Answers | How does each platform respond? | Raw answer archive and screenshots |
| Competitors | Who appears instead of you? | Competitor share and ranking position |
| Evidence | What sources support the answer? | Citation and source ledger |
| Sentiment | Is the brand framed accurately and favorably? | Accuracy, risk, and narrative tags |
This framework prevents a common mistake: treating “brand mentioned in ChatGPT” as the whole story. A brand can be visible but misdescribed, cited but not recommended, recommended for branded prompts but absent from discovery prompts, or beaten by competitors that have stronger third-party evidence.
How to Build an AI Search Visibility Baseline Step by Step
Build the baseline in nine steps:
1. Define buyer intent clusters.
2. Create prompt variants for each intent.
3. Choose AI platforms based on buyer behavior.
4. Build the competitor universe.
5. Freeze the prompt set and baseline window.
6. Run repeated checks.
7. Classify every answer with the same rules.
8. Audit cited and likely source pages.
9. Score the baseline and turn findings into a GEO roadmap.
1. Define Buyer Intent Clusters
Start with buying situations, not keywords. AI search users often ask complete questions, describe a problem, or request a shortlist.
For B2B SaaS, a first baseline usually needs these intent clusters:
| Intent cluster | Example prompt |
|---|---|
| Problem-led | “How can a RevOps team reduce duplicate CRM records?” |
| Category | “What are the best customer onboarding platforms for B2B SaaS?” |
| Comparison | “Compare tools like [Competitor A], [Competitor B], and alternatives.” |
| Alternative | “What are the best alternatives to [known competitor]?” |
| Integration | “Which tools integrate with Salesforce and HubSpot for this workflow?” |
| Pricing or packaging | “Which vendors are affordable for a 100-person company?” |
| Implementation risk | “Which platforms are easiest to implement without a large admin team?” |
| Compliance or trust | “Which vendors are suitable for security-conscious enterprise teams?” |
| Branded | “What does [brand] do, and who is it best for?” |
If you are converting existing SEO work into AI search prompts, use this guide to turn SEO keywords into buyer questions.
2. Build Prompt Clusters, Not a Random Prompt List
Each intent cluster should include several natural phrasings. Do not measure one exact prompt and assume it represents the buyer intent.
Example cluster:
- “What are the best customer onboarding platforms for B2B SaaS?”
- “Recommend onboarding software for a mid-market SaaS company.”
- “Which tools help customer success teams improve product adoption?”
- “Compare customer onboarding software for a 200-person SaaS company.”
- “What software should a SaaS CS team use to shorten time to value?”
A practical first baseline uses 30-60 prompts across 8-12 intent clusters. Larger enterprise categories may need 100+ prompts, but the first version should be small enough to repeat.
This matters because prompt phrasing is a real source of variance. The 2026 study “Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation” found that small changes in buyer wording can materially change recommendation sets. The fix is not infinite prompts; it is disciplined prompt clustering.
For a deeper setup process, see MaxAEO’s guide to building an AI search prompt set for brand monitoring.
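A prompt set like the example cluster above can be kept as structured data, so cluster sizes are easy to check before the baseline is frozen. The following is a minimal sketch, not a prescribed tool: `PromptCluster` and `check_prompt_set` are hypothetical names, and the default bounds (3-5 variants per cluster, 30-60 prompts in total) simply restate the rules of thumb in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class PromptCluster:
    """One buyer intent with its natural-language paraphrases (hypothetical type)."""
    intent: str
    prompts: list = field(default_factory=list)


def check_prompt_set(clusters, min_variants=3, max_variants=5,
                     min_total=30, max_total=60):
    """Return warnings for clusters or totals outside the suggested bounds."""
    warnings = []
    for cluster in clusters:
        count = len(cluster.prompts)
        if not min_variants <= count <= max_variants:
            warnings.append(f"{cluster.intent}: {count} variants")
    total = sum(len(c.prompts) for c in clusters)
    if not min_total <= total <= max_total:
        warnings.append(f"total prompts: {total}")
    return warnings


category = PromptCluster(
    intent="category",
    prompts=[
        "What are the best customer onboarding platforms for B2B SaaS?",
        "Recommend onboarding software for a mid-market SaaS company.",
        "Which tools help customer success teams improve product adoption?",
        "Compare customer onboarding software for a 200-person SaaS company.",
        "What software should a SaaS CS team use to shorten time to value?",
    ],
)

# One cluster passes the per-cluster check but falls short of the total bound.
print(check_prompt_set([category]))
```

Running the check on a single cluster flags only the total, which is a reminder that coverage across intents matters as much as the number of paraphrases inside one intent.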
3. Choose Platforms Based on Buyer Behavior
Do not treat all AI systems as one channel. Each surface has different retrieval behavior, citation behavior, personalization, and user context.
For most B2B teams, baseline these platforms first:
| Platform | Why to include it |
|---|---|
| ChatGPT | Broad assistant-style research and shortlist recommendations |
| Gemini | Google-connected discovery and comparison behavior |
| Perplexity | Citation-heavy research and source visibility |
| Claude | Analytical evaluation and long-form buyer questions |
| Copilot | Microsoft work context and enterprise productivity users |
| Google AI Overviews | Search-led informational discovery |
| Google AI Mode | Complex, exploratory, multi-part search behavior |
| Grok | Categories influenced by real-time social conversation |
If resources are limited, start with the three platforms your prospects mention in sales calls or customer interviews. A smaller repeatable baseline is better than a broad audit that nobody can rerun.
4. Build the Competitor Universe
Your competitor set should include more than sales battlecard rivals. AI answer engines often recommend companies that dominate public content, not only the vendors your sales team worries about.
Include five competitor groups:
| Group | What to include |
|---|---|
| Direct competitors | Products buyers already compare with yours |
| Enterprise incumbents | Large brands AI systems may over-recommend because they are well documented |
| AI-native entrants | Newer tools with launch buzz, funding, or heavy discussion |
| Adjacent tools | Products that solve part of the same workflow |
| Wrong competitors | Brands that indicate entity confusion or category misunderstanding |
This is where many baselines become useful. If AI systems consistently compare you with the wrong category, the first fix is not more blog content. It is entity clarification across your site, profiles, documentation, partner pages, and third-party listings.
5. Freeze the Prompt Set and Baseline Window
Before collecting answers, freeze the measurement rules:
- Prompt set version, such as baseline-v1.0
- Platforms included
- Competitor list
- Location or market, if relevant
- Account state, such as logged out, logged in, paid plan, or workspace account
- Search or browsing mode, where applicable
- Baseline collection window, ideally 5-10 business days
- Run count per prompt-platform pair
Do not mix old and new prompts in the same trendline. If the prompt set changes, create a new version and annotate the report.
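One way to make the freeze enforceable is to store the measurement rules in an immutable record. The sketch below assumes Python and uses a frozen dataclass; the `BaselineConfig` name, its field names, and every value shown (market, dates, competitor labels) are illustrative placeholders, not recommended settings.

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class BaselineConfig:
    """Frozen measurement rules for one baseline version (hypothetical type)."""
    prompt_set_version: str
    platforms: tuple
    competitors: tuple
    market: Optional[str]
    account_state: str
    search_mode: str
    window_start: date
    window_end: date
    runs_per_pair: int


config = BaselineConfig(
    prompt_set_version="baseline-v1.0",
    platforms=("ChatGPT", "Gemini", "Perplexity"),
    competitors=("Competitor A", "Competitor B"),
    market="US",
    account_state="logged out",
    search_mode="search on",
    window_start=date(2026, 6, 1),
    window_end=date(2026, 6, 12),
    runs_per_pair=3,
)

# Changing the rules means creating a new version, not editing the old one.
config_v1_1 = replace(config, prompt_set_version="baseline-v1.1", runs_per_pair=5)
```

Because the dataclass is frozen, an accidental `config.runs_per_pair = 5` raises `FrozenInstanceError`, while `dataclasses.replace` produces a separate, versioned copy that can be annotated in the report.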
6. Run Repeated Checks and Capture Evidence
For an initial baseline, run each prompt-platform pair three times. For volatile or high-value categories, run five or more times across multiple days.
Capture:
- Raw answer text
- Date and time
- Platform and model, if visible
- Prompt exactly as entered
- Account state
- Location, if controlled
- Search, browsing, or deep research mode
- Citations or visible source links
- Screenshot or export
- Notes on personalization, refusal, or answer failure
Screenshots are not decoration. They create an evidence trail when answers shift later.
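The capture list above maps naturally onto one record per answer. As a rough sketch, assuming Python and a JSON Lines archive (both choices are assumptions, as are the `Observation` type and its field names), each run could be stored like this:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class Observation:
    """One captured answer for one prompt-platform-run (hypothetical type)."""
    prompt: str               # exactly as entered
    platform: str
    run: int
    answer_text: str          # raw answer, unedited
    captured_at: str          # ISO 8601 timestamp
    model: str = ""           # only if the platform shows it
    account_state: str = ""
    location: str = ""        # only if controlled
    mode: str = ""            # search, browsing, or deep research
    citations: list = field(default_factory=list)
    evidence_file: str = ""   # path to screenshot or export
    notes: str = ""           # personalization, refusal, or answer failure


def to_jsonl(observations):
    """Serialize observations as one JSON object per line."""
    return "\n".join(json.dumps(asdict(o), ensure_ascii=False) for o in observations)


obs = Observation(
    prompt="What are the best customer onboarding platforms for B2B SaaS?",
    platform="Perplexity",
    run=1,
    answer_text="(raw answer text goes here)",
    captured_at=datetime.now(timezone.utc).isoformat(),
    account_state="logged out",
    citations=["https://example.com/review-page"],
)
line = to_jsonl([obs])
```

An append-only file of such lines keeps the raw evidence separate from any later scoring, so classifications can be redone without recollecting answers.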
7. Classify Every Answer With the Same Rules
Use a simple classification model before building advanced metrics.
| Classification | Meaning |
|---|---|
| Absent | The brand does not appear |
| Mentioned | The brand appears but is not recommended |
| Recommended | The brand is positioned as a valid option |
| Ranked | The brand appears in an ordered shortlist |
| Cited | The answer links to your page or a third-party source about you |
| Misdescribed | The answer contains a wrong product, category, pricing, feature, market, or customer claim |
| Negative | The answer includes risk, criticism, controversy, or outdated concerns |
This separates visibility from reputation. A brand can be highly visible and still lose buyers if the answer says the wrong thing.
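The classification table can be encoded so that every reviewer applies the same labels. The sketch below rests on one interpretation that the table does not state outright: absent, mentioned, recommended, and ranked are treated as an ordered presence level, while cited, misdescribed, and negative are independent tags that can stack on top. The `Presence` and `Classification` names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Presence(IntEnum):
    """Ordered presence level (assumed ordering, see lead-in)."""
    ABSENT = 0
    MENTIONED = 1
    RECOMMENDED = 2
    RANKED = 3


@dataclass(frozen=True)
class Classification:
    presence: Presence
    cited: bool = False
    misdescribed: bool = False
    negative: bool = False

    def __post_init__(self):
        # An answer cannot misdescribe or criticize a brand it never names.
        if self.presence == Presence.ABSENT and (self.misdescribed or self.negative):
            raise ValueError("an absent brand cannot be misdescribed or negative")

    def labels(self):
        """Return the presence level followed by any tags that apply."""
        tags = [self.presence.name.lower()]
        for flag in ("cited", "misdescribed", "negative"):
            if getattr(self, flag):
                tags.append(flag)
        return tags


result = Classification(Presence.RECOMMENDED, cited=True, misdescribed=True)
print(result.labels())  # ['recommended', 'cited', 'misdescribed']
```

Keeping presence and tags separate makes the "visible but misdescribed" case explicit instead of forcing each answer into a single bucket.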
8. Audit the Source Layer
AI systems often repeat what public sources say, not what your homepage wishes they said. A source audit shows whether the answer ecosystem has enough clear, current, and credible evidence to support the recommendation you want.
For each cited or likely source, record:
| Field | Options |
|---|---|
| Source type | Owned page, docs, blog, analyst page, review site, directory, news, Reddit, partner page, customer story |
| Freshness | Current, stale, undated |
| Accuracy | Correct, incomplete, wrong, conflicting |
| Citation usefulness | Supports target category, only mentions brand, creates confusion |
| Fixability | Owned fix, partner request, PR target, review response, hard-to-influence source |
| Buyer relevance | High, medium, low |
If citation quality is the biggest gap, read this guide to AI search citations.
9. Score the Baseline Without Overclaiming
A baseline score should help the team decide what to fix. It should not pretend to be a universal market truth.
Use five fields:
| Metric | Suggested weight | What it measures |
|---|---|---|
| Recommendation rate | 30% | How often the brand is suggested for target prompts |
| Competitive position | 20% | Whether the brand appears above or below priority competitors |
| Citation coverage | 20% | Whether answers cite useful supporting sources |
| Description accuracy | 20% | Whether facts, category, and positioning are correct |
| Sentiment risk | 10% | Whether negative or outdated narratives appear |
Report each metric by platform and prompt cluster. Avoid a single “AI visibility score” unless leadership understands what is inside it.
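If a combined number is needed, it helps to return the per-metric contributions next to it so the score stays auditable. The following sketch uses the suggested weights from the table above and assumes each input has already been normalized to a 0-1 scale where higher is better (so the sentiment input would be one minus the negative narrative rate). The function name and the sample inputs are illustrative only.

```python
WEIGHTS = {
    "recommendation_rate": 0.30,
    "competitive_position": 0.20,
    "citation_coverage": 0.20,
    "description_accuracy": 0.20,
    "sentiment_risk": 0.10,  # input assumed to be 1 - negative narrative rate
}


def weighted_score(components):
    """Return (score, contributions) for one platform and prompt-cluster slice.

    components maps each metric name to a value in [0, 1], higher is better.
    """
    if set(components) != set(WEIGHTS):
        raise ValueError("components must contain exactly the weighted metrics")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    contributions = {name: WEIGHTS[name] * components[name] for name in WEIGHTS}
    return sum(contributions.values()), contributions


# Made-up inputs for a single platform and prompt cluster.
score, parts = weighted_score({
    "recommendation_rate": 0.15,
    "competitive_position": 0.40,
    "citation_coverage": 0.50,
    "description_accuracy": 0.71,
    "sentiment_risk": 0.94,
})
```

Calling the function once per platform and prompt cluster, and reporting `parts` alongside `score`, keeps the composite from hiding which metric is actually weak.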
Baseline Metrics and Formulas
Use formulas that a non-specialist can audit.
| Metric | Formula | Use |
|---|---|---|
| Mention rate | Brand-mentioned observations / total observations | Broad visibility |
| Recommendation rate | Brand-recommended observations / total observations | Buyer shortlist strength |
| Ranked presence rate | Brand-ranked observations / total observations | Competitive positioning |
| Top-3 rate | Brand appears in positions 1-3 / ranked observations | Shortlist quality |
| Citation coverage | Observations with useful brand or supporting citations / total observations | Evidence strength |
| Owned citation rate | Observations citing owned URLs / total observations | Owned source retrieval |
| Third-party citation rate | Observations citing credible third-party sources / total observations | External validation |
| Accuracy defect rate | Observations with wrong or incomplete facts / total observations | Reputation and entity risk |
| Negative narrative rate | Observations with negative or outdated framing / total observations | Trust risk |
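Several of the formulas in the table can be computed directly from classified observations. The sketch below is one possible implementation under stated assumptions: each observation is a plain dictionary with hypothetical keys (`mentioned`, `recommended`, `misdescribed`, and `rank`, where `rank` is `None` when the answer is not an ordered shortlist). Note that the top-3 rate uses ranked observations as its denominator, as in the table.

```python
def rate(numerator, denominator):
    """Return numerator / denominator, or None when the rate is undefined."""
    return numerator / denominator if denominator else None


def baseline_rates(observations):
    """Compute a subset of the baseline metrics from classified observations."""
    total = len(observations)
    ranked = [o for o in observations if o.get("rank") is not None]
    return {
        "mention_rate": rate(sum(o["mentioned"] for o in observations), total),
        "recommendation_rate": rate(sum(o["recommended"] for o in observations), total),
        "ranked_presence_rate": rate(len(ranked), total),
        # Denominator is ranked observations, not all observations.
        "top3_rate": rate(sum(o["rank"] <= 3 for o in ranked), len(ranked)),
        "accuracy_defect_rate": rate(sum(o["misdescribed"] for o in observations), total),
    }


# Four made-up observations for illustration.
sample = [
    {"mentioned": True, "recommended": True, "rank": 2, "misdescribed": False},
    {"mentioned": True, "recommended": False, "rank": None, "misdescribed": True},
    {"mentioned": False, "recommended": False, "rank": None, "misdescribed": False},
    {"mentioned": True, "recommended": True, "rank": 5, "misdescribed": False},
]
rates = baseline_rates(sample)
```

Returning `None` for an empty denominator avoids reporting an undefined rate as zero, which would look like a measured absence.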
For AI share of voice, be precise:
| Metric | Formula | Best use |
|---|---|---|
| Mention share | Your brand mentions / all brand mentions in the tracked competitor set | Broad category presence |
| Recommendation share | Your recommendations / all recommendations given to brands in the tracked competitor set | Buyer shortlist strength |
| Citation share | Your brand-related citations / all tracked brand-related citations | Source authority |
Example: if 40 category prompts across three runs create 120 Gemini observations and your brand is recommended 18 times, your Gemini category recommendation rate is 15% for that baseline period.
Do not report that as “our AI visibility is 15%.” Report it as: “Gemini category recommendation rate was 15% across this prompt set during the June 2026 baseline.”
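The arithmetic and the scoped wording can be generated together so the number never travels without its context. This is a small sketch: `share` and `scoped_label` are hypothetical helpers, and the brand counts in the share example are invented for illustration.

```python
def share(counts, brand):
    """Return one brand's share of all counts in the tracked set, or None if empty."""
    total = sum(counts.values())
    return counts.get(brand, 0) / total if total else None


def scoped_label(metric, platform, cluster, value, period):
    """Format a rate with its platform, cluster, and baseline period attached."""
    return (f"{platform} {cluster} {metric} was {value:.0%} "
            f"across this prompt set during the {period} baseline.")


# The Gemini example from the text: 18 recommendations in 120 observations.
gemini_rate = 18 / 120
print(scoped_label("recommendation rate", "Gemini", "category", gemini_rate, "June 2026"))

# Invented mention counts for a tracked competitor set.
mention_counts = {"Your brand": 18, "Competitor A": 42, "Competitor B": 30}
mention_share = share(mention_counts, "Your brand")  # 18 of 90 mentions
```

The first call prints the same sentence recommended above, with the 15% derived from the raw counts rather than typed by hand.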
For deeper benchmarking, see MaxAEO’s guide to AI search share of voice.
A Worked Example: B2B SaaS Baseline
A realistic first baseline for a B2B SaaS company might use:
- 48 prompts
- 10 intent clusters
- 6 platforms
- 3 runs per prompt-platform pair
- 8 named competitors
- 864 answer observations
This is not a universal benchmark. It is a practical audit size that gives a marketing team enough evidence to prioritize without waiting months.
| Finding | Example baseline result | Likely diagnosis | First action |
|---|---|---|---|
| Category recommendation rate | 11% | Brand is known by name but not associated with the broader problem | Build problem-led category pages and earn third-party proof |
| Branded fact accuracy | 71% | Old positioning and outdated feature descriptions persist | Fix owned sources and high-ranking profiles |
| Competitor dominance | 4 rivals appear in 60%+ of category prompts | AI systems rely on incumbent lists | Create comparison assets and target credible citations |
| Owned-page citation rate | 8% | Platforms cite directories and reviews instead of owned pages | Improve crawlable source pages and internal linking |
| Negative or risk mentions | 6% | Old implementation complaints are still repeated | Update docs, support content, and review responses |
The pattern matters more than the raw number. If branded accuracy is high but category recommendation rate is low, the job is not “fix ChatGPT.” The job is to connect the brand to the category across credible public sources.
How to Prioritize Baseline Gaps
Prioritize gaps where buyer intent is high, competitors are visible, and your brand is absent, misdescribed, or unsupported by evidence.
Use this decision rule:
- High intent + competitor present + brand absent: build category evidence and third-party validation.
- High intent + brand present + wrong description: fix entity facts and authoritative sources.
- High intent + brand present + no citations: improve source eligibility and citation paths.
- High intent + negative narrative: treat it as AI reputation management.
- Low intent + brand absent: monitor, but do not prioritize.
- Branded prompt + wrong answer: fix immediately because the user already knows you.
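The decision rule above can be written as a small function so that every gap is triaged the same way. The sketch below is an illustration only: the `Gap` type and its field names are hypothetical, and the order in which the rules are checked (branded errors first, then reputation, then accuracy, citations, and absence) is an assumption, since the list does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    """One finding for a prompt cluster on a platform (hypothetical type)."""
    high_intent: bool
    branded_prompt: bool = False
    brand_present: bool = False
    competitor_present: bool = False
    wrong_description: bool = False
    has_citations: bool = False
    negative_narrative: bool = False


def next_action(gap):
    """Map a gap to the first matching action from the decision rule."""
    # Branded errors come first: the user already knows the brand.
    if gap.branded_prompt and gap.wrong_description:
        return "Fix immediately"
    if gap.high_intent and gap.negative_narrative:
        return "Treat as AI reputation management"
    if gap.high_intent and gap.brand_present and gap.wrong_description:
        return "Fix entity facts and authoritative sources"
    if gap.high_intent and gap.brand_present and not gap.has_citations:
        return "Improve source eligibility and citation paths"
    if gap.high_intent and gap.competitor_present and not gap.brand_present:
        return "Build category evidence and third-party validation"
    if not gap.high_intent and not gap.brand_present:
        return "Monitor, but do not prioritize"
    return "No rule matched; review manually"


print(next_action(Gap(high_intent=True, competitor_present=True)))
```

Cases the list does not cover, such as a high-intent prompt where the brand is present, accurate, and cited, fall through to manual review rather than being forced into a rule.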
The 2026 paper “The Discovery Gap” found that Product Hunt startups were recognized when named but rarely surfaced in discovery-style prompts. That distinction matters for B2B brands: being recognized when named is not the same as being recommended when buyers ask for options.
How to Turn the Baseline Into a GEO Roadmap
A GEO roadmap should follow the diagnosis, not a generic content calendar.
| Baseline diagnosis | What to do next |
|---|---|
| Low category recommendation rate | Create problem-led pages, category explainers, comparison content, and third-party proof |
| Competitors dominate shortlists | Build evidence-based comparison assets and clarify use cases where you win |
| Weak citation coverage | Improve owned source pages and earn credible third-party mentions |
| Wrong product or category description | Update homepage copy, product pages, docs, schema, profiles, and partner listings |
| Stale or conflicting sources | Refresh high-ranking profiles, review pages, directories, and documentation |
| Negative answer pattern | Address the underlying issue, publish current facts, and update public support resources |
| Platform-specific weakness | Investigate the source mix and retrieval behavior for that platform |
Google’s guidance for AI features is aligned with this approach: make content crawlable, useful, findable through internal links, available in textual form, and consistent with structured data. Google also says there is no special schema required just to appear in AI Overviews or AI Mode.
For the broader practice, see MaxAEO’s guide to what GEO is.
What to Put in the Source Ledger
The source ledger is where the baseline becomes actionable. It shows which pieces of the public web are helping, hurting, or failing to support the brand.
Use these columns:
| Column | Example |
|---|---|
| Platform | Perplexity |
| Prompt cluster | Alternatives |
| Prompt | “Best alternatives to [competitor] for mid-market teams” |
| Cited URL | Review page, partner page, docs page, article, directory |
| Source owner | Owned, partner, publisher, community, marketplace |
| Source status | Current, stale, wrong, incomplete |
| Claim supported | Category fit, feature, pricing, customer type, integration, proof |
| Issue | No mention, outdated feature, wrong category, weak comparison |
| Fix path | Update page, request partner edit, pitch third-party article, add docs |
| Priority | High, medium, low |
A source that is cited often but describes the brand incorrectly is more urgent than an uncited page nobody sees.
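That ordering can be approximated by sorting ledger rows on a simple urgency value. The sketch below is a rough heuristic, not a standard formula: the `times_cited` column is a hypothetical addition to the ledger, and the severity numbers assigned to each source status are arbitrary assumptions chosen only to rank wrong sources above stale or incomplete ones.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    """One source ledger row, reduced to the fields needed for sorting."""
    cited_url: str
    source_owner: str
    source_status: str   # current, stale, wrong, or incomplete
    fix_path: str
    times_cited: int     # hypothetical column: how often answers cited this URL


# Assumed severity per status; adjust to the team's own judgment.
STATUS_SEVERITY = {"wrong": 3, "stale": 2, "incomplete": 1, "current": 0}


def urgency(entry):
    """Higher when a problematic source is cited more often."""
    return STATUS_SEVERITY[entry.source_status] * entry.times_cited


ledger = [
    LedgerEntry("https://example.com/old-landing", "Owned", "wrong", "Update page", 0),
    LedgerEntry("https://example.org/directory", "Marketplace", "stale", "Request edit", 5),
    LedgerEntry("https://example.net/review", "Publisher", "wrong", "Pitch correction", 12),
]

for entry in sorted(ledger, key=urgency, reverse=True):
    print(urgency(entry), entry.cited_url, entry.fix_path)
```

With these assumptions, the frequently cited wrong review page sorts first and the uncited owned page sorts last, which matches the priority described above.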
How Many Prompts and Runs Are Enough?
For a first baseline, use this rule of thumb:
| Company stage | Prompt clusters | Prompts | Platforms | Runs | Approx. observations |
|---|---|---|---|---|---|
| Early-stage B2B | 6-8 | 24-40 | 3-4 | 3 | 216-480 |
| Growth B2B | 8-12 | 40-70 | 4-6 | 3 | 480-1,260 |
| Enterprise or multi-product | 12-20 | 80-150 | 5-8 | 3-5 | 1,200-6,000 |
Use fewer prompts if the team cannot maintain quality. Use more prompts when the category has multiple products, industries, regions, or buyer personas.
The baseline should be repeatable before it is exhaustive.
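The observation counts in the table, and in the worked example earlier, are the product of prompts, platforms, and runs. A short sketch (with hypothetical function names) makes it easy to size a baseline before committing to it:

```python
def observation_count(prompts, platforms, runs):
    """Total answers to collect for one baseline pass."""
    return prompts * platforms * runs


def observation_range(prompts, platforms, runs):
    """Each argument is a (low, high) tuple; returns (low, high) observations."""
    low = observation_count(prompts[0], platforms[0], runs[0])
    high = observation_count(prompts[1], platforms[1], runs[1])
    return low, high


print(observation_range((24, 40), (3, 4), (3, 3)))     # early-stage row: (216, 480)
print(observation_range((40, 70), (4, 6), (3, 3)))     # growth row: (480, 1260)
print(observation_range((80, 150), (5, 8), (3, 5)))    # enterprise row: (1200, 6000)
print(observation_count(48, 6, 3))                     # worked example: 864
```

Because the total grows multiplicatively, adding one platform or two extra runs can change the workload more than adding a handful of prompts.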
What to Report to Leadership
Leadership does not need a dump of AI answers. They need the baseline scope, current position, business risk, and next actions.
A useful executive summary has five parts:
- Scope: prompts, platforms, competitors, runs, dates, and collection rules.
- Visibility state: mention rate, recommendation rate, ranked presence, citation coverage, and AI share of voice.
- Risk state: wrong facts, negative narratives, outdated positioning, and source conflicts.
- Opportunity state: high-intent prompt clusters where competitors appear and your brand does not.
- Roadmap: 30-day source fixes, 60-day content and comparison work, 90-day citation and PR targets.
Use plain language. “We are absent from 76% of high-intent shortlist prompts in Gemini and Perplexity” is more defensible than “our GEO score is weak.”
Baseline Checklist
Use this checklist before starting GEO work.
| Checklist item | Done? |
|---|---|
| We defined 8-12 buyer intent clusters. | |
| Each cluster has 3-5 prompt variants. | |
| We selected platforms based on buyer behavior. | |
| We included direct, adjacent, incumbent, AI-native, and wrong-category competitors. | |
| We froze the prompt set and baseline window. | |
| Each prompt-platform pair was checked more than once. | |
| We recorded raw answers, timestamps, citations, and screenshots or exports. | |
| We classified absence, mentions, recommendations, rankings, citations, wrong facts, and negative narratives. | |
| We separated branded recognition from category discovery. | |
| We audited cited and likely source pages. | |
| We scored results by platform and prompt cluster. | |
| We turned findings into prioritized fixes by impact and fixability. | |
This is the minimum viable AI search visibility baseline. Mature teams can add confidence intervals, geographic segmentation, account-state testing, industry-specific prompt sets, and weekly trend reporting after the first baseline is stable.
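As one example of what a confidence interval might look like here, the sketch below applies the Wilson score interval to a rate such as the earlier Gemini example (18 recommendations in 120 observations). The choice of the Wilson method is an assumption, not something this guide prescribes, and it treats observations as independent, which repeated runs of the same prompt are not, so the true uncertainty is likely wider.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(18, 120)
print(f"Recommendation rate 15%, interval {low:.1%} to {high:.1%}")
```

For 18 of 120, the interval runs from roughly 10% to 22%, which is a useful reminder that a small sample should guide decisions rather than claim market truth.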
Common Mistakes
Most weak baselines fail because they look measurable but cannot explain what to fix.
Avoid these mistakes:
- Tracking only branded prompts. Branded prompts test recognition, not discovery.
- Using one platform as a proxy for all AI search. ChatGPT, Gemini, Perplexity, Claude, Copilot, Grok, AI Mode, and AI Overviews can produce different answers.
- Counting every mention equally. A passing mention is not the same as a ranked recommendation.
- Ignoring citations. A third-party source may shape the answer more than your homepage.
- Skipping screenshots or raw exports. You need evidence when results change.
- Mixing prompt versions. Trendlines break when old and new prompt sets are blended.
- Reporting false precision. A small sample should guide decisions, not claim market truth.
- Optimizing before diagnosing. The right fix depends on whether the gap is content, citations, entity clarity, reputation, or competitor evidence.
The best baseline is disciplined: repeatable, labeled, evidence-backed, and tied to decisions.
Frequently Asked Questions
What is an AI search visibility baseline?
An AI search visibility baseline is the starting measurement of how AI answer engines mention, recommend, rank, cite, and describe your brand across important buyer prompts and platforms. It captures the “before” state so later GEO, content, PR, and reputation work can be measured against a clear benchmark.
How often should we rebuild an AI search visibility baseline?
Rebuild the full baseline quarterly and monitor core prompts weekly. AI answers change as models, indexes, sources, and public narratives change. A quarterly baseline supports planning, while weekly AI search monitoring catches sudden shifts in recommendations, citations, and brand descriptions.
How many prompts do we need for the first baseline?
Most B2B teams can start with 30-60 prompts across 8-12 buyer intent clusters. The key is coverage, not raw prompt count. Include category, problem, comparison, integration, implementation, risk, and branded prompts, with multiple paraphrases per intent.
Should we track brand mentions in ChatGPT first?
Track ChatGPT, but do not stop there. Brand mentions in ChatGPT are only one slice of LLM brand tracking. Buyers also use Gemini, Perplexity, Claude, Copilot, Grok, Google AI Mode, and Google AI Overviews. A useful baseline shows where platforms agree and where they diverge.
What is the difference between AI visibility and AI citations?
AI visibility measures whether and how your brand appears in an AI answer. AI citations measure which URLs or sources support that answer. A brand can be mentioned without a citation, cited without being recommended, or recommended because of third-party sources rather than owned content.
What is a good AI search visibility baseline score?
There is no universal good score because categories, platforms, prompt sets, and competitor density vary. A useful score is one you can repeat. Report recommendation rate, citation coverage, accuracy defect rate, sentiment risk, and AI share of voice by prompt cluster and platform.
Can an AI visibility tool replace the baseline process?
An AI visibility tool can automate collection, scoring, screenshots, trend monitoring, and reporting. It should not replace the strategic decisions behind the baseline: which buyer prompts matter, which competitors count, which sources are credible, and which gaps are worth fixing first.
