Benchmarks for ChatGPT and Co

February 2024

The highlights of the month:

  • Improvements to ChatGPT-4

  • Performance comparisons for the Mistral API and the Anthropic Claude models

  • First work on enterprise AI benchmarks

LLM Benchmarks | February 2024

The Trustbit benchmarks evaluate the models in terms of their suitability for digital product development. The higher the score, the better.

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
⚠️ - Open source models that can be run locally, but with license restrictions
🦙 - Local models with Llama2 license

Here is an updated report on the performance of LLM models in enterprise-specific workloads.

Model | Code | CRM | Docs | Integrate | Marketing | Reason | Final 🏆 | Cost | Speed
GPT-4 v1/0314 ☁️ | 80 | 88 | 98 | 52 | 88 | 50 | 76 | 7.19 € | 1.26 rps
GPT-4 Turbo v4/0125-preview ☁️ | 60 | 97 | 100 | 71 | 75 | 45 | 75 | 2.51 € | 0.82 rps
GPT-4 v2/0613 ☁️ | 80 | 83 | 95 | 52 | 88 | 50 | 74 | 7.19 € | 2.07 rps
GPT-4 Turbo v3/1106-preview ☁️ | 60 | 75 | 98 | 52 | 88 | 62 | 72 | 2.52 € | 0.68 rps
GPT-3.5 v2/0613 ☁️ | 62 | 79 | 73 | 75 | 81 | 48 | 70 | 0.35 € | 1.39 rps
GPT-3.5 v3/1106 ☁️ | 62 | 68 | 71 | 63 | 78 | 59 | 67 | 0.24 € | 2.29 rps
GPT-3.5 v4/0125 ☁️ | 58 | 85 | 71 | 60 | 78 | 47 | 66 | 0.13 € | 1.41 rps
GPT-3.5-instruct 0914 ☁️ | 44 | 90 | 69 | 60 | 88 | 32 | 64 | 0.36 € | 2.12 rps
Mistral 7B OpenChat-3.5 v3 0106 f16 ✅ | 56 | 86 | 67 | 52 | 88 | 26 | 62 | 0.37 € | 2.99 rps
GPT-3.5 v1/0301 ☁️ | 49 | 75 | 69 | 67 | 82 | 24 | 61 | 0.36 € | 3.93 rps
Mistral 7B OpenChat-3.5 v1 f16 ✅ | 46 | 72 | 72 | 49 | 88 | 31 | 60 | 0.51 € | 2.14 rps
Starling 7B-alpha f16 ⚠️ | 51 | 66 | 67 | 45 | 88 | 36 | 59 | 0.61 € | 1.80 rps
Mistral 7B OpenChat-3.5 v2 1210 f16 ✅ | 51 | 74 | 72 | 41 | 75 | 31 | 57 | 0.36 € | 3.05 rps
Mistral Large v1/2402 ☁️ | 33 | 49 | 70 | 75 | 84 | 25 | 56 | 2.19 € | 2.04 rps
Anthropic Claude Instant v1.2 ☁️ | 51 | 75 | 65 | 59 | 65 | 14 | 55 | 2.15 € | 1.47 rps
Anthropic Claude v2.0 ☁️ | 57 | 52 | 55 | 30 | 84 | 35 | 52 | 2.24 € | 0.40 rps
Anthropic Claude v2.1 ☁️ | 36 | 58 | 59 | 45 | 75 | 33 | 51 | 2.31 € | 0.35 rps
Mistral 7B OpenOrca f16 ✅ | 42 | 57 | 76 | 21 | 78 | 26 | 50 | 0.43 € | 2.55 rps
Mistral 7B Instruct v0.1 f16 ✅ | 31 | 70 | 69 | 44 | 62 | 21 | 50 | 0.79 € | 1.39 rps
Llama2 13B Vicuna-1.5 f16 🦙 | 36 | 37 | 53 | 39 | 82 | 38 | 48 | 1.02 € | 1.07 rps
Llama2 13B Hermes f16 🦙 | 38 | 23 | 30 | 61 | 60 | 43 | 42 | 1.03 € | 1.06 rps
Llama2 13B Hermes b8 🦙 | 32 | 24 | 29 | 61 | 60 | 43 | 42 | 4.94 € | 0.22 rps
Mistral Small v1/2312 (Mixtral) ☁️ | 10 | 58 | 65 | 51 | 56 | 8 | 41 | 0.19 € | 2.17 rps
Mistral Small v2/2402 ☁️ | 27 | 35 | 36 | 82 | 56 | 8 | 41 | 0.19 € | 3.14 rps
Mistral Medium v1/2312 ☁️ | 36 | 30 | 27 | 59 | 62 | 12 | 38 | 0.83 € | 0.35 rps
Llama2 13B Puffin f16 🦙 | 37 | 12 | 38 | 33 | 56 | 41 | 36 | 4.89 € | 0.22 rps
Llama2 13B Puffin b8 🦙 | 37 | 9 | 37 | 31 | 56 | 39 | 35 | 8.65 € | 0.13 rps
Mistral Tiny v1/2312 (7B Instruct v0.2) ☁️ | 13 | 39 | 57 | 32 | 59 | 8 | 35 | 0.05 € | 2.30 rps
Mistral 7B Zephyr-β f16 ✅ | 28 | 34 | 46 | 44 | 29 | 4 | 31 | 0.51 € | 2.14 rps
Llama2 13B chat f16 🦙 | 15 | 38 | 17 | 30 | 75 | 8 | 30 | 0.76 € | 1.43 rps
Llama2 13B chat b8 🦙 | 15 | 38 | 15 | 30 | 75 | 6 | 30 | 3.35 € | 0.33 rps
Mistral 7B Notus-v1 f16 ⚠️ | 16 | 43 | 25 | 41 | 48 | 4 | 30 | 0.80 € | 1.37 rps
Orca 2 13B f16 ⚠️ | 15 | 22 | 32 | 22 | 67 | 19 | 29 | 0.99 € | 1.11 rps
Llama2 7B chat f16 🦙 | 20 | 33 | 20 | 27 | 50 | 20 | 28 | 0.59 € | 1.86 rps
Mistral 7B Instruct v0.2 f16 ✅ | 7 | 21 | 50 | 13 | 58 | 8 | 26 | 1.00 € | 1.10 rps
Mistral 7B f16 ✅ | 0 | 4 | 42 | 42 | 52 | 12 | 25 | 0.93 € | 1.17 rps
Orca 2 7B f16 ⚠️ | 13 | 0 | 24 | 18 | 52 | 4 | 19 | 0.81 € | 1.34 rps
Llama2 7B f16 🦙 | 0 | 2 | 18 | 2 | 28 | 2 | 9 | 1.01 € | 1.08 rps

The benchmark categories in detail

Here is exactly what the different categories of the LLM leaderboard measure:

  • How well can the model work with large documents and knowledge bases? (docs)

  • How well does the model support work with product catalogs and marketplaces? (crm)

  • Can the model easily interact with external APIs, services and plugins? (integrate)

  • How well can the model support marketing activities, e.g. brainstorming, idea generation and text generation? (marketing)

  • How well can the model reason and draw conclusions in a given context? (reason)

  • Can the model generate code and help with programming? (code)

  • The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead (see the sketch after this list).

  • The "Speed" column indicates the estimated speed of the model in requests per second (without batching). The higher the speed, the better.

Improvements in ChatGPT-4 - new recommendations

The latest update in the ChatGPT-4 series finally breaks the trend of releasing cheaper models with lower accuracy. In our benchmarks, GPT-4 0125 (or v4) beats the GPT-4 0613 (or v2) model.

This model also contains the latest training data (up to December 2023) and runs at a fraction of the cost of the v1 and v2 models, making the GPT-4 Turbo v4/0125-preview a new safe standard model that we can recommend.

The trend for GPT 3.5 models continues to follow the same pattern. New models are becoming cheaper and less powerful.

Mistral and Claude API - Verbosity Problem

This report finally includes benchmarks for the Mistral AI and Anthropic Claude models:

  • Anthropic Claude Instant v1.2

    Smaller LLM from Anthropic - it's anthropic.claude-instant-v1 on AWS Bedrock.

  • Anthropic Claude v2.0 and v2.1

    Larger Anthropic LLMs that have introduced large context sizes - anthropic.claude-v2 series on AWS Bedrock.

  • Mistral Large Model

    Recently released LLM from Mistral, which is positioned between GPT-4 and GPT-3.5 in Mistral's internal benchmarks. It is mistral-large-2402 on La Plateforme.

  • Mistral Medium

    Another proprietary model from Mistral, roughly comparable to Llama2 70B according to the Miqu leak. We are testing mistral-medium-2312.

  • Mistral Small

    The first version of this model was the very popular Mixtral 8x7B; for the second version, Mistral does not say whether this is still the case. We test both versions: mistral-small-2402 and mistral-small-2312.

  • Mistral Tiny

    This model corresponds to Mistral 7B Instruct v0.2 or mistral-tiny-2312 on Mistral AI.

All of these models can be good for creating content and chatting with people. However, that is not the point of our benchmark. We rank the models according to their ability to provide accurate answers in tasks such as information retrieval, document ranking or classification.

For these tasks, all of them are too wordy, and they do not follow instructions precisely. Even the small local Mistral 7B variants are better at this. ChatGPT-4 remains at the top. It seems that OpenAI understands the needs of enterprise customers better than the rest.

OUR CONCLUSION

If you need LLMs for chatbots and marketing purposes and are okay with some instructions being ignored, the Mistral AI and Anthropic models might be worth a closer look. Otherwise, we suggest deferring them for a while.

We introduce: the Enterprise AI Leaderboard

We have been tracking the performance of LLM models for many months; this is our eighth report.

This process has helped us to gain first-hand experience in dealing with several different models at the same time. Unlike the usual academic benchmarks, we have been sourcing our data from real-world projects and enterprise tasks.

⭐️ New: LLM benchmarks from Patronus AI

By the way, we are no longer alone in this area. Another company has recently started working on a similar set of enterprise benchmarks. We invite you to take a look at the Enterprise Scenarios Leaderboard on Hugging Face by PatronusAI.

That's all good, but it's time to address the real elephant in the room. The truth is:

Large language models are just an implementation detail.

Yes, it is true that a lot depends on their performance and capabilities. This is why, for example, in the short term we generally recommend GPT-4 Turbo v4/0125-preview as a model to start with.

However, we ultimately believe that large language models are replaceable and interchangeable. In fact, the entire LLM ranking was started because of a recurring customer question: "When can I replace ChatGPT-4 with a local model in my projects?"

If you look at the "Requests for Startups" from Y Combinator, one specific request focuses on exactly this topic of replacement: small fine-tuned models as an alternative to huge generic models. Y Combinator helped to incubate companies like Stripe, Dropbox, Twitch and Cruise. They know a thing or two about market and industry trends.

Giant generic models with many parameters are very impressive. But they are also very costly and often come with latency and privacy challenges. Fortunately, smaller open-source models such as Llama2 and Mistral have already shown that, when fine-tuned with suitable data, they can deliver comparable results at a fraction of the cost.

To push the concept even further, we believe that local language models will be the way to improve the overall accuracy of a system beyond the capabilities of ChatGPT, while significantly reducing operational costs.

Note

Per-system customization makes it possible to design systems that learn and adapt to the specifics of each individual company. We are not even talking about advanced topics like fine-tuning (which requires a lot of high-quality data). Even simple, statistics-based customization of prompts and context can work wonders, as the sketch below illustrates.
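
What could such statistics-based customization look like? Here is a deliberately simple, hypothetical Python sketch (all names and numbers are ours, not from a real project): log which prompt variant produced a correct answer for each task type, then route new requests to the variant with the best observed success rate.

```python
from collections import defaultdict
import random

# Hypothetical illustration of statistics-driven prompt selection:
# record which prompt variant answered correctly for each task type,
# then prefer the variant with the best observed success rate.

stats = defaultdict(lambda: {"wins": 0, "tries": 0})

def record(task_type: str, variant: str, correct: bool) -> None:
    entry = stats[(task_type, variant)]
    entry["tries"] += 1
    entry["wins"] += int(correct)

def pick_variant(task_type: str, variants: list[str]) -> str:
    # Explore untried variants first, then exploit the best success rate.
    untried = [v for v in variants if stats[(task_type, v)]["tries"] == 0]
    if untried:
        return random.choice(untried)
    return max(variants,
               key=lambda v: stats[(task_type, v)]["wins"]
                             / stats[(task_type, v)]["tries"])

record("invoice-extraction", "terse", True)
record("invoice-extraction", "cot", True)
record("invoice-extraction", "terse", False)
print(pick_variant("invoice-extraction", ["terse", "cot"]))  # -> "cot"
```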

Since individual LLMs are an implementation detail, what should be the metric to measure the state of the art when applying AI to enterprise workloads?

Here is a hint in the form of some questions we are asked:

  • Which RAG architecture is best for legal workloads?

  • Which vector database should we use to build an internal support bot?

  • What is the best approach to automatically handle company questionnaires with 1000 questions in B2B sales?

The metric should target and compare complete enterprise and business AI solutions. End-to-end.

Anybody can claim 99% accuracy on RAG tasks. We want to independently verify it, build a better intuition about different architectures and ultimately allow our customers to make more informed decisions.

It will take time and effort to build a full Enterprise AI Leaderboard. We are starting with the foundational capability: the ability of an AI system to find relevant information within business-specific documentation. This is the foundational building block of RAG systems.

Here is an example: We took a public annual report from the Christian Dior Group. Then we asked the AI system 10 specific questions about this report. For example:

  • What was the company's turnover in 2022?

  • How much liquidity did the company have at the end of 2021?

  • What was the gross margin in 2023?

  • How many employees did the company have in 2022?

As you can see, each question has only one correct answer. No calculations or advanced reasoning are required.

How well do you think different systems would deal with these specific questions?

Not so good!

We have tested some common systems to get you started:

  • ChatGPT-4

  • OpenAI Assistants API with document retrieval and the gpt-4-0125 model

  • Two popular services for asking questions about a specific PDF: ChatPDF and AskYourPDF.

Each test involved uploading the annual report and asking the question with a very specific instruction:

PROMPT

Answer with a floating point number in actual currency, for example "1.234 million", use the decimal point and no thousands separators. You can think through the answer, but the last line should be in this format "Answer = Number Unit". Answer with "Answer = None" if no information is available.

This instruction was important because:

  • we would like to encourage models to use chain-of-thought (CoT) reasoning if this increases accuracy

  • we still need the number to be parseable in a specific locale, hence the strict requirement to use the decimal point and no thousands separators (just like in the original report).

Obviously, RAG systems, being end-to-end solutions, already have CoT baked into their pipelines under the covers. However, when we added this instruction to the overall request prompt, overall accuracy still increased.
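
The strict last-line format is what makes the answers machine-checkable. As an illustration, a minimal parser for the "Answer = Number Unit" convention could look like the following Python sketch; the function and regular expression are our own illustration, not the exact harness used in the benchmark.

```python
import re

# Minimal parser sketch for the "Answer = Number Unit" convention used
# in the prompt above. The function name and the regular expression are
# our own illustration, not the exact benchmark harness.

ANSWER_RE = re.compile(
    r"Answer\s*=\s*(None|[-+]?\d+(?:\.\d+)?)\s*(\w+)?", re.IGNORECASE
)

def parse_answer(completion: str):
    """Return (value, unit) from the last line, or None if not parseable."""
    lines = completion.strip().splitlines()
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    if not match or match.group(1).lower() == "none":
        return None
    return float(match.group(1)), (match.group(2) or "").lower()

print(parse_answer("Let me check the report...\nAnswer = 8.122 million"))
# -> (8.122, 'million')
print(parse_answer("Answer = None"))  # -> None
```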

We gave each system 1 point for a correct, parseable answer and 0.5 points for an answer that pulled the right piece of information but made an order-of-magnitude error. The final scores of the systems in this single test are shown in the table below.
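
Expressed in code, this scoring rule might look roughly like the sketch below; the tolerances are our own assumption, not part of the published methodology.

```python
import math

# Sketch of the scoring rule described above: 1 point for a correct
# answer, 0.5 points if the digits are right but the magnitude is off
# by a power of ten (e.g. "54.3 billion" instead of "54.314 million").
# Handles numeric answers only; tolerances are our own assumption.

def score(expected: float, got: float | None) -> float:
    if got is None or expected == 0:
        return 0.0
    if math.isclose(expected, got, rel_tol=0.01):
        return 1.0
    ratio = got / expected
    if ratio > 0:
        shift = math.log10(ratio)
        # Same digits, decimal point shifted by a whole power of ten.
        if math.isclose(shift, round(shift), abs_tol=0.01) and round(shift) != 0:
            return 0.5
    return 0.0

print(score(54.314, 54300.0))  # 0.5 -> order-of-magnitude error
print(score(8.122, 7.388))     # 0.0 -> wrong value
```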

Financial Data Table

Question | Correct Answer | ChatGPT-4 | gpt-4-0125 RAG | ChatPDF | AskYourPDF
How much liquidity did this company have at hand at the end of 2021? | 8.122 million | 7.388 million euro | 7.918 million | 10.667 million euros | INVALID
How much liquidity did this company have at hand at the end of 2022? | 7.588 million | 7.388 billion euros | 7.388 million | 11.2 billion euros | 7588 million euros
How many employees did the company have at the end of 2022? | 196006 | 196,006 | 196,006 | 196006 | INVALID
How much were total lease liabilities of the company by the end of 2021? | 14.275 million | 14.275 million | 14,275 million | 14,275 | 14.275 million euros
What amount was recorded for the repayment of lease liabilities in 2022? | 2.453 million | 2.453 million | 2,711 million | 2.711 million | 2.711 million euros
What was the company's net revenue in 2021? | 64.215 million | 64.215 million | 64.215 million | 64.215 million euros | 64,215 million euros
What was the company's net revenue in 2022? | 79.184 million | 79.184 million | 79184 million euros | EUR 79.184 million | EUR 79.184 million
What was the company's net revenue in 2023? | None | None | None | None | INVALID
What was the total shareholder equity at the end of 2022? | 54.314 million | 54.3 billion | 54.314 million | 54.314 billion euros | INVALID
What was the company's gross margin for the year 2021? | 43.860 million | 43.860 million euros | 43.860 million | 43.860 million euros | 43.860 million euros
SCORE | 100 | 70 | 60 | 55 | 40

So far, OpenAI's RAG systems are the best on the market for the task at hand. However, we do not expect this to remain the case for long.

Specialized solutions are capable of achieving higher scores, even without the use of cutting-edge LLMs. We know this for a fact, because we have built such systems. One of them even includes the use of Mistral-7B-OpenChat-3.5 to extract information from tens of thousands of PDF documents.

As we extend and enrich this enterprise AI benchmark with more cases and solutions, it is expected that ChatGPT will eventually be dethroned.

Trustbit LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!