The scenario

Imagine a support team using GPT-4 to automatically route customer emails to the right department. Technical issues go to engineering, billing questions head to finance, and sales inquiries land with the sales team. It’s a straightforward task that works beautifully, except for one problem: the costs add up fast. The question worth asking is: do you really need the most expensive model on the market just to sort emails into categories?
This is a deliberately simple example that illustrates the mechanics of A/B testing for LLMs. If you're after something more complex, check the other guides.

The hypothesis

Email classification isn't exactly rocket science. You've got clear categories, predictable patterns, and relatively straightforward decision-making. It's not like you're asking the model to write poetry or solve complex mathematical proofs. So there's a good chance a cheaper model could handle this task just as well, cutting costs without sacrificing the quality of results. But hunches aren't enough; let's test this hypothesis.

Test configuration

This example shows a side-by-side comparison using the Narev Platform, testing five different models on the same set of customer emails with identical prompts. The only variable? The model doing the classification. This creates a clean way to see how each model performs on cost, speed, and accuracy.

Variants tested

| Variant | Cost per 1M tokens (input / output) |
| --- | --- |
| GPT-4 (baseline) | $30 / $60 |
| GPT-4.1 | $2 / $8 |
| GPT-4.1 nano | $0.10 / $0.40 |
| GPT-5 nano | $0.05 / $0.40 |
| gpt-oss-20b | $0.03 / $0.14 |
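
To make the mechanics concrete, here's a minimal sketch of the test harness, assuming every variant is reachable through an OpenAI-compatible endpoint. The model IDs, prompt wording, and `classify` helper are illustrative; the guide doesn't prescribe exact ones.

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes all variants sit behind one OpenAI-compatible endpoint

# Prices from the table above, per 1M tokens (input, output).
VARIANTS = {
    "gpt-4":        (30.00, 60.00),
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-nano": (0.10, 0.40),
    "gpt-5-nano":   (0.05, 0.40),
    "gpt-oss-20b":  (0.03, 0.14),  # open-source model, served behind the same API shape
}

# Illustrative prompt; the only thing that changes between variants is the model.
PROMPT = (
    "Route this customer email to exactly one department: "
    "engineering, finance, or sales. Reply with the department name only.\n\n{email}"
)

def classify(model: str, email: str) -> dict:
    """Run one classification and record the raw numbers the experiment needs."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(email=email)}],
    )
    latency_ms = (time.perf_counter() - start) * 1000
    price_in, price_out = VARIANTS[model]
    cost = (resp.usage.prompt_tokens * price_in
            + resp.usage.completion_tokens * price_out) / 1_000_000
    return {
        "model": model,
        "label": resp.choices[0].message.content.strip().lower(),
        "latency_ms": latency_ms,
        "cost_usd": cost,
    }
```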

What gets measured

The test tracks three critical metrics for each model (a small scoring sketch follows the list):
  • the cost per request (because that’s the whole point)
  • the speed of classification (nobody wants their customers waiting)
  • the accuracy of the routing decisions (a cheap model that’s wrong all the time isn’t actually saving you money)
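
Here's a minimal aggregation sketch, assuming the per-request records produced by the `classify` helper above and a `gold` list of correct departments:

```python
from statistics import mean

def summarize(runs: list[dict], gold: list[str]) -> dict:
    """Collapse per-request records into the three reported metrics."""
    return {
        "cost_per_request_usd": mean(r["cost_usd"] for r in runs),
        "avg_latency_ms": mean(r["latency_ms"] for r in runs),
        "accuracy": sum(r["label"] == g for r, g in zip(runs, gold)) / len(gold),
    }
```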

Results

The results tell a fascinating story about the tradeoffs between different models. GPT-4 serves as the baseline: 100% accurate but expensive. The open-source gpt-oss-20b emerges as the sweet spot, cutting costs by 99% while matching GPT-4's perfect accuracy. It's about a second slower per request, but that's a tradeoff worth taking. The ultra-cheap nano variants are tempting on cost, but GPT-4.1 nano's accuracy drop is too severe for production use, and GPT-5 nano is far too slow.
Model experiment results comparison
| Variant | Cost (avg) | Speed (avg) | Accuracy (total) |
| --- | --- | --- | --- |
| GPT-4 | $5,865/1M | 1,989 ms | 100% |
| GPT-4.1 | $464.50/1M (-92%) ✅ | 1,447 ms (faster) ✅ | 75% ⚠️ |
| GPT-4.1 nano | $23.13/1M (-99%) ✅ | 1,782 ms (faster) ✅ | 50% ❌ |
| GPT-5 nano | $106.71/1M (-98%) ✅ | 9,740 ms (slower) ❌ | 100% ✅ |
| gpt-oss-20b | $23.83/1M (-99%) ✅ | 2,943 ms (slower) ⚠️ | 100% ✅ |
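
For reference, the percentage deltas in the cost column are relative to the GPT-4 baseline. For example, GPT-4.1:

```python
baseline, variant = 5865.00, 464.50               # $ per 1M requests, from the table
print(f"{(variant - baseline) / baseline:+.0%}")  # -> -92%
```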

Winner: gpt-oss-20b

The open-source gpt-oss-20b wins out for this use case. It saves $5,841 per million requests compared to GPT-4 (a 99% reduction) and, crucially, maintains 100% accuracy with no compromise on quality. While it's about a second slower per request than GPT-4, the tradeoff is well worth it: you're getting perfect classification at a fraction of the cost. Unlike the cheaper alternatives that sacrifice accuracy, gpt-oss-20b proves you don't have to choose between cost savings and reliable results.

Example in action

Look at how the models handle a typical sales inquiry. This is the kind of email that should be an easy win: clear enterprise sales intent, multiple specific questions, and an explicit request for a demo.
Subject: Enterprise plan pricing
From: cto@fastgrowth.io
Hi there,

A 200-person company is looking to upgrade. Could you send details about:
  • Enterprise plan features
  • Volume discounts
  • Implementation timeline
  • API rate limits

Please schedule a demo for next week.

Thanks,
Mike Chen, CTO
Should route to: Sales team
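
With the `classify` sketch from the test-configuration section, the head-to-head on this email is just a loop over the variants (the `email` string below reproduces the message above; the output format is illustrative):

```python
email = """Subject: Enterprise plan pricing
From: cto@fastgrowth.io

Hi there,

A 200-person company is looking to upgrade. Could you send details about:
- Enterprise plan features
- Volume discounts
- Implementation timeline
- API rate limits

Please schedule a demo for next week.

Thanks,
Mike Chen, CTO"""

for model in VARIANTS:
    r = classify(model, email)
    print(r["model"], r["label"], f'${r["cost_usd"]:.6f}', f'{r["latency_ms"]:.0f} ms')
```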

How each model did

| Model | Got it right? | Cost | Speed |
| --- | --- | --- | --- |
| GPT-4 | ✅ Yes | $5,970/1M | 2,081 ms |
| GPT-4.1 | ✅ Yes | $458/1M | 1,045 ms |
| GPT-4.1 nano | ✅ Yes | $22.90/1M | 1,929 ms |
| GPT-5 nano | ✅ Yes | $118.80/1M | 10,244 ms |
| gpt-oss-20b | ✅ Yes | $19.59/1M | 881 ms |
Individual model result comparison
Good news: every model correctly routes this email to sales. This example shows that for straightforward cases (which make up the majority of customer emails), even the cheaper models can nail it. The differences between models show up more clearly in edge cases and ambiguous emails.

What this shows

Task complexity should drive model choice, not default assumptions. For email classification, gpt-oss-20b is the sweet spot: 99% cheaper than GPT-4 while maintaining 100% accuracy. You don't have to choose between cost and quality; the tradeoff is ~1 second slower responses in exchange for perfect classification at 0.4% of the cost.

The takeaway

Switching to gpt-oss-20b saves $5,841 per million emails while maintaining 100% accuracy. Pay $23.83 instead of $5,865 per million requests.
Overall impact of model optimization
If speed is critical and 75% accuracy is acceptable, GPT-4.1 is faster at $464.50 per million. Otherwise, gpt-oss-20b wins: rock-bottom pricing with zero accuracy compromise.
Want to test which model works best for your use case? Start testing for free →