A fast-growing SaaS company is trying to improve how its AI agent handles recurring customer issues...
They run an A/B test on different AI-generated responses:
Group A gets the default model’s response.
Group B sees a version with adjusted tone and length.
It sounds like a smart, data-driven experiment. But soon they hit a wall: their ticket volume is high enough that automation matters, but too low to reach statistically significant A/B results quickly.
Now they’re stuck: wait weeks for conclusive data, or revert and risk ongoing frustration for users.
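To see why “weeks” is not an exaggeration, here is a back-of-the-envelope sample-size check. The numbers are hypothetical (a 70% baseline resolution rate, a hoped-for lift to 75%, roughly 200 eligible tickets per week split across the two groups), but they illustrate how quickly the required sample size outruns a modest ticket volume.

```python
# Back-of-the-envelope sample-size check for a two-variant test on resolution rate.
# All numbers below are illustrative assumptions, not data from the company in the story.
from statistics import NormalDist

def required_sample_per_arm(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2
    return int(n) + 1

n_per_arm = required_sample_per_arm(0.70, 0.75)   # detect a 70% -> 75% lift
weekly_tickets_per_arm = 100                       # assumption: ~200 relevant tickets/week, split in two
print(f"~{n_per_arm} tickets per arm "
      f"-> ~{n_per_arm / weekly_tickets_per_arm:.0f} weeks to reach significance")
```

Under these assumptions, that works out to roughly 1,250 tickets per arm, or about three months of waiting.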
But the real problem runs deeper.
A/B testing can help optimize surface-level traits: tone, length, phrasing.
But most issues in AI support aren’t about tone; they’re about accuracy.
There’s no point A/B testing tone and style if the answer itself is wrong or so vague it’s useless to the customer.
The core issue: out-of-the-box models can ingest your knowledge base, but they don't understand your domain.
They don’t know your product inside and out.
So you're stuck improving performance by trial and error: tweaking one response at a time, adding more resources, updating existing ones, all because the model isn't fine-tuned to your domain language and doesn't understand the context well enough to use what's already there.
Fine-tuning the model on your actual support conversations, product language, and logic, on the other hand, lets the AI agent respond like a seasoned team member, driving up both accuracy and deflection rates.
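What does that training data actually look like? Here is a minimal sketch of shaping resolved support conversations into chat-style fine-tuning examples. The product name, sample ticket, and field names are invented for illustration; the JSONL "messages" layout follows a common chat fine-tuning convention.

```python
# Illustrative sketch: turning resolved support tickets into fine-tuning examples.
# The ticket content and "Acme Billing" product name are hypothetical placeholders.
import json

resolved_tickets = [  # assumption: exported from the helpdesk, filtered to well-resolved cases
    {
        "customer": "I was charged twice for my Pro plan this month.",
        "agent": "That usually means a monthly and an annual subscription overlapped. "
                 "I've refunded the duplicate charge; it should appear in 5-7 business days.",
    },
]

with open("finetune_examples.jsonl", "w") as f:
    for ticket in resolved_tickets:
        example = {
            "messages": [
                {"role": "system",
                 "content": "You are a support agent for Acme Billing. "
                            "Answer using our product terminology and policies."},
                {"role": "user", "content": ticket["customer"]},
                {"role": "assistant", "content": ticket["agent"]},
            ]
        }
        f.write(json.dumps(example) + "\n")
```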
That's what we do around here.