Most pre-engagement calls are sales calls. Both sides try to look smart. Nobody answers the questions that actually predict whether the project will ship.
These five do.
Ask them on the first call. The answers will sort the senior practitioners from the polished pitches in about 20 minutes. They work whether you're hiring an outside team or scoping internal work.
1. Who actually writes the code?
You're paying senior rates. Find out who's on the keyboard.
A good answer sounds like: "Me. I'm the engineer on this engagement. If we bring in a second person, I'll introduce them by name and explain why."
A bad answer sounds like: "We have a team of senior engineers." Or: "Our delivery model pairs strategy and implementation specialists." Both translate to: a junior writes the code, a senior reviews it sometimes, you get billed for both.
Why this matters: AI engineering quality varies more by individual than by firm. The person matters more than the brand. If the person on the call isn't the person on the keyboard, you're buying a different product than the one being demoed.
2. What does "done" mean to you?
This is the single most useful question on the list.
A good answer sounds like: "Done means the system is running in your production stack, you have eval coverage, you have observability, the failure modes are documented, and your team can extend it without us. There's a 30-day support window after handoff."
A bad answer sounds like: "Done is when we deliver the final report." Or: "When the prototype passes acceptance criteria." Or vaguer still: "When you're happy with what we've built."
Most AI projects fail because the definition of done covers the demo, not the production system. The model works on cherry-picked inputs. The eval suite doesn't exist. The thing that worked in the notebook breaks in the API.
If the team's definition of done doesn't include eval coverage and a real handoff, the engagement is structured to leave you holding a fragile system.
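If "eval coverage" sounds abstract, here's a minimal sketch of what it means in practice: a versioned set of input/expected-output cases that runs against the deployed system, not the notebook. The case file path and the `call_system` function are hypothetical placeholders, not any particular vendor's setup.

```python
# Minimal eval harness sketch: run known cases against the live system
# and report an accuracy number you can track over time.
import json

def call_system(prompt: str) -> str:
    """Placeholder for a call to the production endpoint, not the notebook."""
    raise NotImplementedError("wire this to your real API")

def run_evals(path: str = "evals/cases.jsonl") -> float:
    cases = [json.loads(line) for line in open(path)]
    passed = 0
    for case in cases:
        output = call_system(case["input"])
        # Simplest possible check; real suites use graded or fuzzy scoring.
        if case["expected"].lower() in output.lower():
            passed += 1
    accuracy = passed / len(cases)
    print(f"{passed}/{len(cases)} cases passed ({accuracy:.0%})")
    return accuracy
```

If a team can't show you something like this committed to your repo, "done" means the demo.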
3. How will I see what the system does in production?
You're going to operate this thing after the build team leaves. You need to see what it's doing.
A good answer sounds like: "Logs to your existing log stack. Metrics to your existing metrics stack. A dashboard with cost per call, p95 latency, error rate, and accuracy on a rolling window of evals. All committed to your repo so it survives the engagement."
A bad answer sounds like: "We'll provide a custom analytics dashboard." (Translation: a thing only they can read.) Or: "Logging is a Phase 2 item." (Translation: it doesn't exist.) Or: "The model is a box, that's the nature of LLMs." (Wrong. Inputs, outputs, latency, and cost are all observable. Anyone telling you otherwise hasn't built one in production.)
If you can't see the system, you can't trust it, and you can't fix it when it breaks at 2 AM.
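To make the good answer concrete, here's a rough sketch of per-call instrumentation using nothing but the Python standard library. The field names and the blended token price are illustrative assumptions; the point is that inputs, outputs, latency, and cost are all measurable in a few lines, whatever anyone tells you about black boxes.

```python
# Sketch: structured per-call logging so cost, latency, and errors land in
# whatever log stack you already run. The price constant is a placeholder.
import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("llm_calls")

PRICE_PER_1K_TOKENS = 0.015  # assumption: blended input/output price in dollars

def logged_call(model_call, prompt: str):
    start = time.perf_counter()
    error = None
    response = None
    try:
        response = model_call(prompt)  # your actual client call goes here
        return response
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        latency_ms = (time.perf_counter() - start) * 1000
        tokens = getattr(response, "total_tokens", 0) if response else 0
        log.info(json.dumps({
            "latency_ms": round(latency_ms, 1),
            "total_tokens": tokens,
            "cost_usd": round(tokens / 1000 * PRICE_PER_1K_TOKENS, 5),
            "error": error,
        }))
```

Dashboards, p95 charts, and rolling eval accuracy are just aggregations of records like these.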
4. What's the dollar cost per call, and what's your p95 latency target?
You'd ask any other vendor what their thing costs and how fast it runs. AI is no different.
A good answer sounds like: "We'll size it during the planning phase. Right now my rough estimate is $0.01 to $0.03 per call at your expected volume, with a p95 around 600 to 900 ms. We'll measure both in dev before we go to prod, and we'll write the targets into the contract."
A bad answer sounds like: "It depends." Or: "Costs are usually negligible." Or: "We don't optimise for latency in the first version."
All three mean nobody's done the maths. You'll find out when the bill arrives or when your customers complain.
Cost per call and p95 latency are product constraints, not implementation details. A system that works at $0.50 per call and 4 seconds p95 is a different product than one at $0.02 and 400 ms.
Pick the one your business can afford.
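The maths behind a cost-per-call estimate is short enough to do on the call. The token counts, prices, and volume below are made-up assumptions; swap in your own.

```python
# Back-of-envelope cost-per-call estimate. All numbers are illustrative.
input_tokens = 1_500           # assumed prompt size per call
output_tokens = 400            # assumed response size per call
price_in = 3.00 / 1_000_000    # assumed dollars per input token
price_out = 15.00 / 1_000_000  # assumed dollars per output token
calls_per_month = 200_000      # assumed volume

cost_per_call = input_tokens * price_in + output_tokens * price_out
monthly_cost = cost_per_call * calls_per_month

print(f"~${cost_per_call:.4f} per call, ~${monthly_cost:,.0f} per month")
# ~$0.0105 per call, ~$2,100 per month
```

If a vendor can't walk through arithmetic like this with you, nobody has sized the system.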
5. What did you decide not to use, and why?
This is the question that separates the senior engineers from the buzzword shoppers.
A good answer sounds like: "I considered an agent framework here and rejected it. The flow is deterministic, three steps, no branching. A plain Python pipeline is faster to build, cheaper to run, and easier for your team to debug. I also considered fine-tuning. Not worth it for your data volume."
A bad answer sounds like: "We use the latest tools across the stack." Or: "Our methodology covers all major frameworks." Or just a list of vendor logos.
Senior practitioners pick tools by elimination. They can name what they rejected and why. People who reach for every shiny thing in the toolbox don't know what each tool is for, and you're going to pay to find out.
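For contrast with "we use the latest tools", here is roughly what a deterministic three-step flow looks like as a plain Python pipeline. The step names are hypothetical; the point is that there's nothing here an agent framework would improve, and your team can read every line.

```python
# Sketch: a deterministic three-step pipeline. No framework, no branching,
# each step is a plain function your team can test and debug in isolation.
def extract_fields(document: str) -> dict:
    """Step 1: pull the structured fields out of the raw document."""
    ...

def classify(fields: dict) -> str:
    """Step 2: single model call that assigns a category."""
    ...

def write_summary(fields: dict, category: str) -> str:
    """Step 3: render the result for the downstream system."""
    ...

def run(document: str) -> str:
    fields = extract_fields(document)
    category = classify(fields)
    return write_summary(fields, category)
```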
How to use these on a call
Don't fire all five at once. Drop them naturally.
Start with #1 ("So who'd be on this work day to day?") within the first 5 minutes. If the answer is bad, the call can end early.
Ask #2 ("How do you define done for an engagement like this?") around 10 minutes in. The answer tells you whether the engagement is structured to ship or structured to look busy.
Save #3, #4, and #5 for the second half of the call when you're talking technical detail.
If the person opposite you gets uncomfortable answering any of these, that's the answer.
This is how I'd want to be evaluated.