Benchmarks for ChatGPT & Co:
October 2023

Our October benchmarks improve on the September issue in many ways. We also introduce a promising new model: Mistral 7B.

Benchmarks October 2023

☁️ - Cloud models with proprietary license
✅ - Open source models that can be run locally without restrictions
🦙 - Local models with Llama2 license

Model Code CRM Docs Integrate Marketing Reason Final 🏆 Cost Speed
GPT-4 v1-0314 ☁️ 85 88 95 52 88 50 76 7.18 € 0.71 rps
GPT-4 v2-0613 ☁️ 85 83 95 52 88 50 75 7.18 € 0.75 rps
GPT-3.5 v2-0613 ☁️ 62 79 76 75 81 48 70 0.35 € 0.96 rps
GPT-3.5-instruct 0914 ☁️ 51 90 69 60 88 32 65 0.36 € 2.35 rps
GPT-3.5 v1-0301 ☁️ 38 75 67 67 82 37 61 0.36 € 1.76 rps
Llama2 70B Hermes b8 🦙 48 76 46 76 62 29 56 13.10 € 0.13 rps
Mistral 7B Instruct f16 ✅ 36 77 61 44 62 18 50 0.42 € 2.63 rps
Llama2 70B chat b4 🦙 13 51 53 29 64 21 39 4.06 € 0.27 rps
Llama2 13B Vicuna-1.5 f16 🦙 36 25 27 18 77 36 36 0.78 € 1.39 rps
Llama2 13B Hermes f16 🦙 32 15 25 51 56 39 36 0.57 € 1.93 rps
Llama2 13B Hermes b8 🦙 31 18 23 44 56 39 35 3.65 € 0.30 rps
Llama2 70B chat b8 🦙 1 53 34 27 71 21 35 10.24 € 0.16 rps
Llama2 13B chat f16 🦙 0 38 15 30 75 8 27 0.64 € 1.71 rps
Llama2 13B chat b8 🦙 0 38 8 30 75 6 26 4.01 € 0.27 rps
Llama2 7B chat f16 🦙 7 33 23 26 38 15 24 0.69 € 1.58 rps
Llama2 13B Puffin f16 🦙 14 6 0 5 54 0 13 1.71 € 0.64 rps
Llama2 13B Puffin b8 🦙 16 3 0 5 47 0 12 7.94 € 0.14 rps
Mistral 7B f16 ✅ 0 4 0 25 38 0 11 0.92 € 1.19 rps
Llama2 7B f16 🦙 0 0 4 2 32 0 6 1.08 € 1.01 rps

The benchmark categories in detail

  • Docs: How well can the model work with large documents and knowledge bases?

  • CRM: How well does the model support work with product catalogs and marketplaces?

  • Integrate: Can the model easily interact with external APIs, services, and plugins?

  • Marketing: How well can the model support marketing activities, e.g. brainstorming, idea generation, and text generation?

  • Reason: How well can the model reason and draw conclusions in a given context?

  • Code: Can the model generate code and help with programming?

  • Cost: The estimated cost of running the workload. For cloud-based models, we calculate the cost according to the provider's pricing. For on-premises models, we estimate the cost based on GPU requirements for each model, GPU rental cost, model speed, and operational overhead (see the calculation sketch after this list).

  • Speed: The estimated speed of the model in requests per second (without batching). Higher is better.
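To make the on-premises cost estimate concrete, here is a minimal sketch of the calculation. The GPU rate and the overhead factor are illustrative assumptions, not the actual figures behind the table.

```python
# Minimal sketch of the on-premises cost estimate described above.
# The GPU rate and overhead factor are illustrative assumptions,
# not the actual figures used for the table.

def workload_cost_eur(
    num_requests: int,
    requests_per_second: float,   # measured model speed without batching
    gpu_hourly_rate_eur: float,   # rental price of the required GPU(s)
    overhead_factor: float = 1.2, # assumed operational overhead (+20%)
) -> float:
    """Estimate the cost of running a benchmark workload on rented GPUs."""
    runtime_hours = num_requests / requests_per_second / 3600
    return runtime_hours * gpu_hourly_rate_eur * overhead_factor

# Example: 1,000 requests at 0.13 rps on a GPU rented for ~2 EUR/h
print(f"{workload_cost_eur(1000, 0.13, 2.0):.2f} EUR")
```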

Highlights and Updates from the October Benchmarks

New Evals

We have integrated 9 new benchmarks into the suite. They focus on the "Docs", "Integrate", and "Reason" categories, make the assessment of model capabilities more precise, and raise the total number of distinct assessments from 85 to 134.

One example is working with structured data: in the "Integrate" category, we now test the ability of large language models to understand and manipulate text in CSV, TSV, JSON, and YAML formats.
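To illustrate, a minimal eval of this kind might ask the model to convert a CSV snippet into JSON and verify the answer by parsing it. This is only a sketch, not one of our actual evals; ask_llm is a hypothetical stand-in for whatever model client is under test.

```python
import csv
import io
import json

# Hypothetical stand-in for the model under test; plug in your own client.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError

def eval_csv_to_json(csv_text: str) -> bool:
    """Score the model on turning a CSV snippet into the equivalent JSON."""
    prompt = (
        "Convert this CSV to a JSON array of objects. Reply with JSON only.\n\n"
        + csv_text
    )
    expected = list(csv.DictReader(io.StringIO(csv_text)))
    try:
        answer = json.loads(ask_llm(prompt))
    except json.JSONDecodeError:
        return False  # non-parseable output fails the eval
    # Compare parsed structures; note that CSV values are read as strings here.
    return answer == expected
```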

Another example concerns our work on business assistants and information search systems for customers. In such cases, large language models need to identify, find, and evaluate relevant pieces of information. Our evaluations help measure various aspects of this capability.
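A sketch of such a retrieval check, reusing the hypothetical ask_llm stub from the previous sketch: we plant a known fact among distractor paragraphs and verify that the model recovers it. The fact and the pass criterion are illustrative.

```python
# Sketch of a retrieval-style eval: plant a known fact among distractor
# paragraphs and check that the model can find it. ask_llm() is the
# hypothetical model client stub from above; the fact itself is made up.
DISTRACTORS = [f"Paragraph {i}: nothing relevant is stated here." for i in range(20)]
FACT = "The return window for the Alpha-3 widget is 45 days."

def eval_needle(position: int) -> bool:
    context = DISTRACTORS[:position] + [FACT] + DISTRACTORS[position:]
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n".join(context)
        + "\n\nQuestion: How long is the return window for the Alpha-3 widget?"
    )
    return "45" in ask_llm(prompt)  # crude check; real evals score more strictly
```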

In addition to these new assessments, we have improved some existing ones by introducing few-shot examples and better prompts. Most large language models respond very positively to this.
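Few-shot examples are simply solved samples prepended to the actual query so the model can imitate the expected answer format. A minimal illustration follows; the task and examples are made up, not taken from our suite.

```python
# Minimal few-shot prompt: two solved examples are prepended so the
# model can imitate the expected answer format. The task and examples
# are illustrative only.
FEW_SHOT = '''Q: Extract the price from: "The widget costs 9.50 EUR."
A: 9.50

Q: Extract the price from: "Special offer: only 12 EUR!"
A: 12.00

'''

def few_shot_prompt(text: str) -> str:
    return FEW_SHOT + f'Q: Extract the price from: "{text}"\nA:'
```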

More Guidance

Guidance is a technique for helping large language models generate the desired text. It works by directing the model's attention to specific text elements (tokens).

As our experience in obtaining better results from large language models grows, we are incorporating these findings into the benchmarks. Our October release already includes guidance in some of the assessments, further improving the performance of some models.
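As a minimal sketch of the idea (not our actual implementation): guided generation interleaves fixed template text with short, constrained model completions, so the output structure is fixed and the model only fills in the blanks. The complete() function is a hypothetical model call.

```python
# Minimal sketch of guided generation: fixed template text is interleaved
# with short, constrained model completions, so the overall structure is
# fixed and the model only fills in the blanks. complete() is a
# hypothetical model call, not our actual implementation.
def complete(prompt: str, stop: str, max_tokens: int) -> str:
    raise NotImplementedError

def guided_invoice_summary(invoice_text: str) -> str:
    prompt = f"Invoice:\n{invoice_text}\n\nVendor:"
    vendor = complete(prompt, stop="\n", max_tokens=10)
    prompt += vendor + "\nTotal (EUR):"
    total = complete(prompt, stop="\n", max_tokens=5)
    return f"Vendor: {vendor.strip()}\nTotal: {total.strip()} EUR"
```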

In the coming months, we plan to provide even deeper guidance for models in task-related areas.

New model with impressive performance: Mistral 7B

Mistral 7B is a new model from the French AI company of the same name. Although it is significantly smaller than the other models, it surpasses the base configurations of Llama2 70B and all 7B and 13B models in our benchmark.

That is really impressive, and the model deserves closer attention in the coming months. Its cost and throughput characteristics make it even more attractive for local deployments.

Another highlight of this model is that it is released under the Apache 2.0 license, which is clearer and less restrictive than the Llama 2 license. There are no "Google" clauses and no possible confusion about using the model for non-English languages. The model markers in our table reflect this difference.

Trustbit LLM Benchmarks Archive

Interested in the benchmarks of the past months? You can find all the links on our LLM Benchmarks overview page!

Do you want to learn more about using ChatGPT and Co?

Then we look forward to hearing from you.

christoph.hasenzagl@trustbit.tech

+43 664 88454881