AI Prompt Tracking: Build a Prompt Set From Real Buyer Questions

Most AI visibility programs don't fail at tracking. They fail at the prompt list. AI prompt tracking only tells you the truth if the prompts you monitor are the questions buyers actually type into ChatGPT, Gemini and Perplexity — and in the prompt sets we audit at MaxAEO, most aren't. Track the wrong questions and every downstream number — brand mentions, rankings, share of voice — describes a market that doesn't exist.

73% of B2B buyers now use AI tools like ChatGPT and Perplexity in purchase research, according to a 2026 multi-source analysis of 680 million citations, and 55% use them to compare vendors directly. If your AI search monitoring runs on prompts no buyer would phrase, you're measuring a mirror, not a market.

This guide documents the methodology behind MaxAEO's Prompt Research feature: deriving buyer-journey prompts from three sources — sales calls, support tickets and keyword data — with worked B2B examples for every funnel stage and the retirement rules that keep the set honest. It's the process we run when setting up tracking for new accounts, written out so you can run it yourself.

Figure: AI prompt tracking workflow. Sales calls, support tickets and keyword data feed a funnel-staged prompt set that is tracked daily across eight AI platforms.

What Is AI Prompt Tracking?

AI prompt tracking is the practice of running a fixed set of realistic buyer questions through AI assistants on a recurring schedule, then recording whether your brand is mentioned, where it appears in the answer, how it's described, and which sources get cited. In practice that means daily runs across ChatGPT, Gemini, Perplexity, Claude, Copilot, Grok, Google AI Mode and AI Overviews. The same discipline also goes by LLM brand tracking or AI visibility monitoring — the mechanics are identical.

Each prompt run should capture five data points:

Mention — does your brand appear in the answer at all?
Position — first recommendation, mid-list, or footnote?
Description and sentiment — how the assistant characterizes you
AI citations — which pages and third-party sources the answer links to
Competitor set — which brands appear alongside or instead of you
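As a rough illustration, the five data points above can be stored as one record per prompt, per platform, per day. The sketch below is a hypothetical schema, not MaxAEO's actual data model: the class name, field names, sentiment labels and the brand "ExampleCo" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class PromptRun:
    """One run of one prompt on one platform (hypothetical schema)."""
    prompt: str
    platform: str
    run_date: date
    mentioned: bool                    # Mention: does the brand appear at all?
    position: Optional[int] = None     # Position: 1 = first recommendation, None if absent
    description: str = ""              # How the assistant characterizes the brand
    sentiment: str = "neutral"         # Assumed labels: "positive", "neutral", "negative"
    citations: list[str] = field(default_factory=list)    # Pages and sources the answer links to
    competitors: list[str] = field(default_factory=list)  # Brands alongside or instead of you


def is_discovery_win(run: PromptRun, brand: str) -> bool:
    """True when the brand is mentioned and the prompt itself did not name it."""
    return run.mentioned and brand.lower() not in run.prompt.lower()


# Example record for an unbranded, constraint-bearing prompt.
run = PromptRun(
    prompt="What expense tool should a 200-person company use if we're already on NetSuite?",
    platform="ChatGPT",
    run_date=date(2026, 3, 1),
    mentioned=True,
    position=2,
    competitors=["Ramp", "Expensify"],
)
print(is_discovery_win(run, "ExampleCo"))
```

Keeping the prompt text on the record makes it easy to split branded from unbranded results later without a second lookup.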

AI prompt tracking is the measurement layer of generative engine optimization (GEO) and answer engine optimization (AEO): you can't improve how often AI recommends you until you know which questions you're absent from. But it differs from keyword rank tracking in three ways that change how you build the input list. There is no public search-volume data for prompts, so you can't sort by demand. There are no stable positions, because answers are generated fresh and vary by user context. And answers change constantly; our daily tracking shows the brands in an AI answer shift far more often than blue links do (we quantified this in our study of how often AI answers change across eight platforms).

The practical consequence: since you can't track everything and nothing holds still, which prompts you choose matters more than how many you track.

Why Most Prompt Sets Fail Before Tracking Starts

The short answer: teams build prompt sets from the data they already have (keywords) instead of the language buyers actually use. When we onboard new MaxAEO workspaces and audit self-built prompt lists, roughly six in ten prompts are reworded SEO keywords, not questions a human would ask an assistant.

The three failure modes we see most:

Keyword mirroring. "Best expense management software" pasted in as a prompt. A real buyer types: "What expense tool should a 200-person company use if we're already on NetSuite?" The keyword and the prompt often retrieve different answers — and different brands.
Brand bias. Self-built sets skew heavily toward branded prompts ("What is [our product]?"). Those measure reputation, not discovery. Buyers shortlisting a category don't mention you — that's the moment you're trying to win.
Funnel skew. Sets composed entirely of "best X" category prompts miss problem-stage questions (where buyers don't know the category exists) and validation-stage questions (where deals die quietly).

Existing guidance doesn't resolve this. SE Ranking recommends starting with 20–40 prompts; Profound suggests 100, scaling to 1,000. Both are reasonable, but count is the wrong variable to optimize first. Composition beats volume: a 30-prompt set sourced from real buyer conversations beats a 300-prompt set generated from a keyword export, because it surfaces gaps you can act on.

Where Real Buyer Prompts Come From: Three Sources Most Teams Ignore

The most reliable prompt set triangulates three sources: what prospects say before they buy (sales calls), what customers say after (support tickets and communities), and what the market types when nobody's listening (keyword and question data). Each source maps to different funnel stages, which is why sets built from only one source come out lopsided.

The minimum viable inputs: your last 20–30 recorded sales calls, 90 days of support tickets, and your top 50 non-branded queries. Mining them takes about four hours and produces a 30–50 prompt set you won't have to rebuild next quarter.

Source 1: Sales calls — the problem-stage and comparison-stage goldmine

Pull your last 20–30 recorded discovery calls and read the first ten minutes of each — the part before your rep starts pitching. That's where prospects describe their problem in their own words, before your vocabulary contaminates theirs.

Capture recurring problem statements verbatim, then convert each into a prompt with minimal editing:

Verbatim: "We're drowning in spreadsheet expense reports every month-end." → Prompt: "How do finance teams stop doing expense reports in spreadsheets?"
Verbatim: "My CFO asked why we still don't see spend until 30 days later." → Prompt: "How can a mid-size company get real-time visibility into employee spend?"

Objection-handling segments are equally valuable: every recurring objection ("Can't we just build this on top of our ERP?") is a comparison-stage prompt buyers are already asking AI assistants. In our experience, sales calls are the only source that reliably produces problem-stage prompts — keyword tools rarely surface them because buyers at this stage don't yet know what to search for.

Source 2: Support tickets and community threads — validation and switching language

Support and success conversations reveal the questions buyers ask when verifying a decision: security, implementation time, integrations, pricing edge cases. These become your validation-stage prompts — "Is [vendor] SOC 2 Type II compliant?", "How long does [vendor] implementation take for a 500-person company?"

Two ticket types deserve special attention:

Pre-sales tickets and chat logs show the exact phrasing of late-stage doubts.
Churn-save conversations tell you which "alternatives to [you]" prompts to monitor — because a departing customer has already asked an AI assistant that exact question.

Round this out with the communities where your category gets discussed without you: Reddit threads, Slack groups, G2 review Q&As. Skeptical community phrasing ("Is [category] actually worth it or is it all hype?") makes excellent prompts, because AI assistants are trained on — and frequently cite — those same threads. This is also where you'll find the prompts that shape your AI reputation, not just your visibility.

Source 3: Keyword and question data — seeds, not phrasing

Your SEO data still matters, but treat keywords as topic seeds, not prompt phrasing. Take your top non-branded queries from Search Console, People Also Ask questions, and AI Overviews triggers, then reconstruct the conversation behind each: who is asking, at what stage, with what constraint?

A useful filter we apply in Prompt Research: a keyword earns a prompt only if you can name the persona and the constraint behind it. "Expense management software" fails the test. "Expense software for companies with field technicians" passes — it becomes "What's the best way to handle expense reports for a workforce that's mostly on job sites?"

Keyword data is also your best source of modifier coverage: industries, team sizes, integrations and regions to rotate through your prompts, mirroring how assistants personalize answers.

A Worked Example: Funnel-Stage Prompts for a B2B SaaS

Here's the method applied end to end, using a composite example from our B2B SaaS accounts: an expense-management platform selling to 50–500-employee companies. The same structure transfers to any B2B category.

Funnel stage	What the buyer is doing	Example prompts	Primary source
Problem-aware	Naming the pain, no category vocabulary yet	"How do finance teams stop chasing receipts at month-end?" · "Why does our expense process take two weeks to close?"	Sales-call verbatims
Solution-aware	Shortlisting a category	"Best expense management software for a 200-person company?" · "Expense tools that integrate with NetSuite and Slack?"	Keyword data, reframed
Comparison	Evaluating a shortlist head-to-head	"Ramp vs Expensify for a mid-size services firm?" · "Alternatives to [incumbent] with better policy controls?"	Sales objections + churn calls
Validation	Building the internal business case	"Is [vendor] SOC 2 compliant?" · "What do reviews say about [vendor]'s implementation time?"	Support tickets, security questionnaires
Post-sale	Renewal, expansion, advocacy	"[Vendor] renewal worth it, or should we switch?"	Churn-save conversations

Comparison-stage prompts deserve disproportionate attention: they're where buyers with budget are actively choosing, and where AI assistants assemble shortlists you may not be on. If you don't yet know which competitors AI recommends in your category, run an AI competitor analysis to see every brand AI suggests before yours — the names that appear become comparison prompts immediately.

Composition that works in practice: across MaxAEO accounts, the sets that surface actionable gaps fastest look roughly like this — 35–40% solution-aware, ~25% problem-aware, ~20% comparison, ~15% validation and branded. Not because the ratio is magic, but because it forces coverage of stages keyword-derived sets systematically miss.
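One way to audit a set against that mix is to count prompts per stage and flag the stages that fall well outside the target. The sketch below is illustrative only: the target shares come from the ratios above, while the stage labels, the five-point tolerance and the function name are our assumptions. Stages outside the target table (post-sale, for example) are ignored.

```python
from collections import Counter

# Target shares (low, high) taken from the mix described in the text.
TARGET_SHARE = {
    "solution-aware": (0.35, 0.40),
    "problem-aware": (0.25, 0.25),
    "comparison": (0.20, 0.20),
    "validation/branded": (0.15, 0.15),
}
TOLERANCE = 0.05  # assumed slack before a stage is flagged


def composition_gaps(stages: list[str]) -> dict[str, str]:
    """Return the stages that are under- or over-represented in a prompt set."""
    if not stages:
        raise ValueError("prompt set is empty")
    counts = Counter(stages)
    gaps = {}
    for stage, (low, high) in TARGET_SHARE.items():
        share = counts.get(stage, 0) / len(stages)
        if share < low - TOLERANCE:
            gaps[stage] = "under"
        elif share > high + TOLERANCE:
            gaps[stage] = "over"
    return gaps


# A keyword-derived set: mostly "best X" prompts, no validation coverage.
keyword_derived = ["solution-aware"] * 32 + ["problem-aware"] * 4 + ["comparison"] * 4
# A set built from all three sources.
balanced = (["solution-aware"] * 16 + ["problem-aware"] * 10
            + ["comparison"] * 8 + ["validation/branded"] * 6)

print(composition_gaps(keyword_derived))
print(composition_gaps(balanced))
```

The check is deliberately coarse. It is meant to catch a lopsided set, not to enforce the ratio to the percentage point.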

Figure: Buyer-journey funnel stages mapped to example AI prompts and their sources for a B2B SaaS prompt set.

How Many Prompts Should You Track?

Start with 30–50 well-sourced prompts per product line, and expand toward 100–150 as tracking data shows where you're blind. That range isn't arbitrary: in MaxAEO accounts, sets beyond roughly 150 prompts per product line mostly re-discover the same cited sources and the same competitors — coverage plateaus while reporting noise grows. Under 30, a single volatile answer can swing your topline metrics.

Three sizing rules from our setup work:

Scale by persona, not ambition. Each distinct buyer persona needs 10–15 prompts to be readable. Two personas? 20–30 prompts.
Run daily, not monthly. AI answers move week to week, and the same prompt can return different brands in back-to-back runs. Daily runs plus rolling 7-day averages — not single snapshots — are what make change detection trustworthy.
Track 2–3 platforms before 8. Start where your buyers are — for most B2B teams that's ChatGPT, Google AI Overviews and Perplexity — then expand once the set is stable.
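To make the "daily runs plus rolling 7-day averages" rule concrete, here is a minimal sketch of a rolling mention rate for one prompt. It assumes results are stored as a list of mentioned/not-mentioned flags per day; the function name and data shape are hypothetical. It pools all runs in the window, which equals the average of daily rates when every day has the same number of runs.

```python
from datetime import date, timedelta
from typing import Optional


def rolling_mention_rate(
    daily: dict[date, list[bool]], end: date, window_days: int = 7
) -> Optional[float]:
    """Share of runs that mentioned the brand in the window ending on `end` (inclusive)."""
    hits = 0
    total = 0
    for offset in range(window_days):
        runs = daily.get(end - timedelta(days=offset), [])
        hits += sum(runs)   # True counts as 1
        total += len(runs)
    return hits / total if total else None  # None when the window has no runs


# Two weeks of one run per day: mentioned 5 of 7 days, then 2 of 7 days.
start = date(2026, 1, 1)
flags = [True, True, False, True, True, False, True,
         False, True, False, False, True, False, False]
daily = {start + timedelta(days=i): [flag] for i, flag in enumerate(flags)}

week_one = rolling_mention_rate(daily, start + timedelta(days=6))
week_two = rolling_mention_rate(daily, start + timedelta(days=13))
print(week_one, week_two)
```

In this example any single day looks like either a full win or a full loss. Only the windowed rate shows the drop from five mentions in seven runs to two in seven.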

Resist adding prompts before you've watched the first set for 30 days. The data will tell you what's missing — prompts where neither you nor any competitor appears usually mean the question is off-category, and prompts where the answer never changes are candidates for retirement.

Seven Phrasing Rules: Write Prompts Like Buyers, Not Like SEOs

How you phrase each prompt determines whether the answers you track resemble what buyers see. The rules we enforce in Prompt Research:

Write full questions, not keyword strings. Assistants respond differently to fragments than to natural questions.
Add one realistic constraint per prompt — team size, industry, tech stack or budget. "Best CRM" is nobody's prompt; "best CRM for a 10-person agency on Google Workspace" is everybody's.
One intent per prompt. Don't combine "what is X and which vendor is best" — buyers split those, and so should your tracking.
Keep discovery prompts unbranded. A brand mention in ChatGPT counts as discovery only if the prompt didn't contain your name.
Mirror persona vocabulary. A CFO asks about "spend visibility"; a founder asks "where is our money going." Track both phrasings if both buy.
Stay neutral — never engineer prompts you'd win. A prompt written to flatter your positioning produces a vanity metric. The set must be able to deliver bad news, or it can't justify budget.
Separate evergreen from seasonal. Keep a stable core (it's your trend line) and a rotating slice for launches, events and category news.

How to Prune, Refresh and Prove the Set Is Working

A prompt set is a living instrument: review composition quarterly, and retire prompts by rule, not by feel. None of the top-ranking guides we benchmarked offer a retirement framework, so here is the one we apply:

Retire: off-category prompts. No brand — yours or anyone's — has appeared in 60+ days of answers. The question isn't commercial.
Retire: solved-and-static prompts. You're consistently the top recommendation and the answer hasn't changed in a quarter. Sample monthly instead; reinvest the slot.
Merge: duplicate-signal prompts. Two prompts whose answers cite the same sources and brands week after week are one prompt wearing two outfits.
Add: fan-out gaps. New questions appearing in AI follow-up suggestions and "people also ask" patterns around your tracked prompts.
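The retire and merge rules above can be written as a simple quarterly review pass. This is a sketch under stated assumptions, not the exact logic in our product: the 60-day threshold comes from the rule above, while treating "a quarter" as 90 days, "consistently" as 90% of runs, and duplicate signal as 80% overlap (Jaccard similarity) are illustrative choices. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromptHistory:
    """Summary of one prompt's tracking history (hypothetical fields)."""
    prompt: str
    days_since_any_brand: int        # days since any brand, yours or a competitor's, appeared
    top_recommendation_rate: float   # share of runs where you were the first recommendation
    days_since_answer_changed: int   # days since the brands or sources in the answer changed


def review_action(h: PromptHistory) -> str:
    """Apply the two retirement rules in order; otherwise keep the prompt."""
    if h.days_since_any_brand >= 60:
        return "retire: off-category"
    if h.top_recommendation_rate >= 0.9 and h.days_since_answer_changed >= 90:
        return "retire: solved-and-static (sample monthly)"
    return "keep"


def is_duplicate_signal(a: set[str], b: set[str], threshold: float = 0.8) -> bool:
    """True when two prompts' cited sources and brands overlap heavily (Jaccard similarity)."""
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


off_category = PromptHistory("Is expense software all hype?", 75, 0.0, 10)
solved = PromptHistory("Best expense tool for NetSuite users?", 0, 0.95, 120)
active = PromptHistory("Ramp vs Expensify for a services firm?", 0, 0.30, 3)

for h in (off_category, solved, active):
    print(h.prompt, "->", review_action(h))
```

Merge candidates come from `is_duplicate_signal`, called on the sets of brands and sources each prompt's answers returned over the review period.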

Proving the program works means connecting the set to numbers leadership recognizes. Prompt-level results roll up into AI share of voice, and from there into the broader AI visibility metrics that tell you whether AI recommends your brand. Pair those with AI referral sessions in GA4 and you have the before/after chain that survives a budget review: prompt gap found, content shipped, mention won, traffic measured.

How MaxAEO's Prompt Research Builds This for You

MaxAEO automates the methodology above rather than replacing it. Prompt Research ingests your seed topics and site, drafts buyer-journey prompts for each funnel stage, and lets you edit or replace them with phrasing from your own sales calls and tickets before anything goes live — the tool proposes, your buyer language disposes.

From there, the platform runs your approved set daily across ChatGPT, Gemini, Perplexity, Claude, Copilot, Grok, Google AI Mode and AI Overviews, logging mentions, answer rank, sentiment, descriptions and AI citations for every prompt. The output isn't just a dashboard: for each prompt where you're absent or misdescribed, it recommends the specific fix (the page to create, the source to earn, the description to correct), so the path from "tracked" to "recommended by ChatGPT" is a work queue, not a mystery.

Frequently Asked Questions

How is AI prompt tracking different from keyword rank tracking?

Keyword tracking measures your position in a stable list of links for queries with known search volume. AI prompt tracking measures whether generated answers mention and recommend you — there's no volume data, no fixed positions, and answers change daily and vary by user context. The two are complementary: keywords measure search visibility, prompts measure recommendation.

Should branded prompts be in the tracking set?

Yes — but capped around 10–15% and reported separately. Branded prompts ("Is [vendor] legit?", "What are [vendor]'s weaknesses?") measure your AI reputation: how assistants describe you to buyers who already know you. Mixing them into discovery metrics inflates your visibility score and hides the gap that matters — unbranded shortlists you're missing from.

Why does the same prompt return different answers each run?

Because LLM outputs are non-deterministic: model updates, retrieval freshness and user context all shift the generated answer, so two identical runs can recommend different brands. Treat single runs as samples, not verdicts — track daily, read 7- or 30-day mention rates, and act only on changes that persist across multiple runs.

How often should the prompt set be refreshed?

Track daily, review composition quarterly, and refresh immediately after major events: a product launch, a new competitor in AI shortlists, a pricing change or a rebrand. Quarterly is frequent enough to catch drift in buyer language; the event-driven refreshes catch the step changes that quarterly reviews miss.

Can I just ask ChatGPT to generate my prompt list?

Use it for a first draft, not a final set. LLM-generated prompts are fluent but generic — they reflect average category language, not your buyers' constraints. Validate every generated prompt against sales-call and support-ticket language, and discard any where you can't name the persona and funnel stage. In our audits, unedited LLM-generated sets skew heavily toward solution-aware prompts and miss validation-stage questions almost entirely.

Which AI platforms should I track first?

Start with ChatGPT — 900 million weekly active users as of February 2026 — plus Google AI Overviews for its tie to existing search behavior, and Perplexity for its citation density. Expand to Gemini, Copilot, Claude and Grok once your set is stable, because brand presence diverges sharply across platforms and your buyers rarely live on just one.

This article was created with AI assistance and reviewed by a human editor.