Nihilus

LLM Inference Bandwidth Benchmarks

llama.cpp vs. Nihilus

Bandwidth used per Inference Run - Model: llama-3.1-8B-Q8-GQA (bytes)
---------------------------------------------------------------------------------------
Length    Read (llama.cpp)    Written (llama.cpp)    Read (Nihilus)   Written (Nihilus)
     1          8549785988               14726144        8533072256              133120
     2          8566588424               28922880        8533203328              264192
     8          8667752480              114447360        8533989760             1050624
    32          9078399104              462443520        8537135616             4196352
   128         10816832000             1948800000        8549719296            16779264
   512         19304105984             9404175360        8600054016            67110912
  2048         77789880320            63384867840        8801392896           268437504
  8192        704319832064           665854694400        9606748416          1073743872
 32768       9491829309440          9260486906880       12828170496          4294969344
131072     145144101945344        142595062256640       25713858816         17179871232
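The Nihilus write traffic in these runs appears to grow linearly with sequence length. A minimal sanity-check sketch in Python, assuming the fit written = 131072 * length + 2048 (a per-token stride plus a fixed overhead inferred from the numbers themselves, not from the Nihilus source):

```python
# (length, Nihilus written bytes) pairs from the llama-3.1-8B-Q8-GQA runs above.
rows = [
    (1, 133_120), (2, 264_192), (8, 1_050_624), (32, 4_196_352),
    (128, 16_779_264), (512, 67_110_912), (2048, 268_437_504),
    (8192, 1_073_743_872), (32768, 4_294_969_344), (131072, 17_179_871_232),
]

# Inferred fit: 131072 bytes written per token plus a fixed 2048-byte overhead.
for length, written in rows:
    assert written == 131072 * length + 2048
print("Nihilus write traffic fits 131072 * length + 2048 exactly")
```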
Bandwidth used per Inference Run - Model: llama-3.1-70B-Q8-GQA (bytes)
---------------------------------------------------------------------------------------
Length    Read (llama.cpp)    Written (llama.cpp)    Read (Nihilus)   Written (Nihilus)
     1         75049944004               70120448       74967533952              329728
     2         75132643848              139744256       74967861632              657408
     8         75630576672              559207424       74969827712             2623488
    32         77652029568             2266551296       74977692160            10487808
   128         86213386752             9567785984       75009150208            41945088
   512        128067545088            46322471936       75134982400           167774208
  2048        417223852032           314137170944       75638311168           671090688
  8192       3521683857408          3318131250176       77651626240          2684356608
 32768      47104880320512         46257872098304       85704886528         10737420288
131072     720083369238528        712797067990016      117917927680         42949675008
Bandwidth used per Inference Run - Model: llama-3.1-405B-Q8-GQA (bytes)
---------------------------------------------------------------------------------------
Length    Read (llama.cpp)    Written (llama.cpp)    Read (Nihilus)   Written (Nihilus)
     1        431481584108              209341440      431231986048              518144
     2        431731764168              418296832      431232502144             1034240
     8        433238284704             1677448192      431235598720             4130816
    32        439357627008             6806950912      431247985152            16517120
   128        465327158784            28811318272      431297531136            66062336
   512        593079886848           140610491392      431495715072           264243200
  2048       1486084414464           968314442752      432288450816          1056966656
  8192      11170000370688         10367246390272      435459393792          4227860480
 32768     147696029727744        145372832453632      448143165696         16911435776
131072    2258445995670528       2243952909079552      498878253312         67645736960
Bandwidth used per Inference Run - Model: llama-3.1-8B-FP16-MHA (bytes)
---------------------------------------------------------------------------------------
Length    Read (llama.cpp)    Written (llama.cpp)    Read (Nihilus)   Written (Nihilus)
     1         16077906308               14726144       16061190528              133120
     2         16094708744               28922880       16061321600              264192
     8         16195872800              114447360       16062108032             1050624
    32         16606519424              462443520       16065253888             4196352
   128         18344952320             1948800000       16077837568            16779264
   512         26832226304             9404175360       16128172288            67110912
  2048         85318000640            63384867840       16329511168           268437504
  8192        711847952384           665854694400       17134866688          1073743872
 32768       9499357429760          9260486906880       20356288768          4294969344
131072     145151630065664        142595062256640       33241977088         17179871232
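For a rough sense of scale, the write-traffic gap at the longest context can be computed directly from the numbers above; a quick Python check using the llama-3.1-8B-FP16-MHA run at length 131072:

```python
# Written bytes at sequence length 131072 for llama-3.1-8B-FP16-MHA,
# taken directly from the benchmark output above.
llama_cpp_written = 142_595_062_256_640
nihilus_written = 17_179_871_232

# Ratio of bytes written per inference run, llama.cpp vs. Nihilus.
ratio = llama_cpp_written / nihilus_written
print(f"llama.cpp writes {ratio:,.0f}x more bytes than Nihilus at this length")
```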