Can We Please Create Business-Friendly LLM Benchmarks?

Updated: Jun 21

🌏 In this week's "As the AI World Turns," we see an explosion of AI model benchmarks, with vendors touting impressive results on select ones that dazzle engineers and bewilder business execs. 🤯

Here’s why this matters.

Picture this:

I’m the head of customer service at a major insurance firm. I’m looking into AI to transform customer service—swift, accurate claim handling is the goal.

Generative AI seems promising, but there are dozens of vendors vying for my business.

Which model performs best?

Oh, let me look at these benchmarks here: MMLU, GLUE, ARC, SuperGLUE, QuAC, HellaSwag.

My response in kind: WTF, SMH

OK, so let me go ask my head of AI. Here are some things I want to know:

1️⃣ Which model hallucinates least often? And how often will it screw up?

2️⃣ Which one best understands insurance lingo?

3️⃣ Which offers the best price-to-performance ratio?

The AI chief can’t quickly respond.

That’s not only because benchmarks aren’t tailored for the non-tech-savvy, but also because they don’t answer many of these questions.

This week, S&P Global launched a set of benchmarks specifically for the financial industry in response to this issue.

Some are easy to understand, like “domain knowledge,” while others, like “quantity extraction,” still seem to require a Rosetta Stone.

Still, it's a step in the right direction.

Lots of things need to happen for business adoption of generative AI to increase.

I’d argue that standardized performance benchmarks business executives can use to make informed decisions are one such necessary element.

Do you agree?

David DeLallo
