llama.cpp vs. Nihilus
All figures are bytes read/written per inference run at the given sequence length.

Bandwidth used per Inference Run - Model: llama-3.1-8B-Q8-GQA

Length | Read bytes (llama.cpp) | Written bytes (llama.cpp) | Read bytes (Nihilus) | Written bytes (Nihilus)
-------|------------------------|---------------------------|----------------------|------------------------
     1 |             8549785988 |                  14726144 |           8533072256 |                  133120
     2 |             8566588424 |                  28922880 |           8533203328 |                  264192
     8 |             8667752480 |                 114447360 |           8533989760 |                 1050624
    32 |             9078399104 |                 462443520 |           8537135616 |                 4196352
   128 |            10816832000 |                1948800000 |           8549719296 |                16779264
   512 |            19304105984 |                9404175360 |           8600054016 |                67110912
  2048 |            77789880320 |               63384867840 |           8801392896 |               268437504
  8192 |           704319832064 |              665854694400 |           9606748416 |              1073743872
 32768 |          9491829309440 |             9260486906880 |          12828170496 |              4294969344
131072 |        145144101945344 |           142595062256640 |          25713858816 |             17179871232

Bandwidth used per Inference Run - Model: llama-3.1-70B-Q8-GQA

Length | Read bytes (llama.cpp) | Written bytes (llama.cpp) | Read bytes (Nihilus) | Written bytes (Nihilus)
-------|------------------------|---------------------------|----------------------|------------------------
     1 |            75049944004 |                  70120448 |          74967533952 |                  329728
     2 |            75132643848 |                 139744256 |          74967861632 |                  657408
     8 |            75630576672 |                 559207424 |          74969827712 |                 2623488
    32 |            77652029568 |                2266551296 |          74977692160 |                10487808
   128 |            86213386752 |                9567785984 |          75009150208 |                41945088
   512 |           128067545088 |               46322471936 |          75134982400 |               167774208
  2048 |           417223852032 |              314137170944 |          75638311168 |               671090688
  8192 |          3521683857408 |             3318131250176 |          77651626240 |              2684356608
 32768 |         47104880320512 |            46257872098304 |          85704886528 |             10737420288
131072 |        720083369238528 |           712797067990016 |         117917927680 |             42949675008

Bandwidth used per Inference Run - Model: llama-3.1-405B-Q8-GQA

Length | Read bytes (llama.cpp) | Written bytes (llama.cpp) | Read bytes (Nihilus) | Written bytes (Nihilus)
-------|------------------------|---------------------------|----------------------|------------------------
     1 |           431481584108 |                 209341440 |         431231986048 |                  518144
     2 |           431731764168 |                 418296832 |         431232502144 |                 1034240
     8 |           433238284704 |                1677448192 |         431235598720 |                 4130816
    32 |           439357627008 |                6806950912 |         431247985152 |                16517120
   128 |           465327158784 |               28811318272 |         431297531136 |                66062336
   512 |           593079886848 |              140610491392 |         431495715072 |               264243200
  2048 |          1486084414464 |              968314442752 |         432288450816 |              1056966656
  8192 |         11170000370688 |            10367246390272 |         435459393792 |              4227860480
 32768 |        147696029727744 |           145372832453632 |         448143165696 |             16911435776
131072 |       2258445995670528 |          2243952909079552 |         498878253312 |             67645736960

Bandwidth used per Inference Run - Model: llama-3.1-8B-FP16-MHA

Length | Read bytes (llama.cpp) | Written bytes (llama.cpp) | Read bytes (Nihilus) | Written bytes (Nihilus)
-------|------------------------|---------------------------|----------------------|------------------------
     1 |            16077906308 |                  14726144 |          16061190528 |                  133120
     2 |            16094708744 |                  28922880 |          16061321600 |                  264192
     8 |            16195872800 |                 114447360 |          16062108032 |                 1050624
    32 |            16606519424 |                 462443520 |          16065253888 |                 4196352
   128 |            18344952320 |                1948800000 |          16077837568 |                16779264
   512 |            26832226304 |                9404175360 |          16128172288 |                67110912
  2048 |            85318000640 |               63384867840 |          16329511168 |               268437504
  8192 |           711847952384 |              665854694400 |          17134866688 |              1073743872
 32768 |          9499357429760 |             9260486906880 |          20356288768 |              4294969344
131072 |        145151630065664 |           142595062256640 |          33241977088 |             17179871232
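The gap between the two engines is easiest to see as a ratio. A minimal sketch, using only the llama-3.1-8B-Q8-GQA figures at length 131072 taken directly from the table above (the variable names are illustrative, not part of either codebase):

```python
# Bytes read/written per inference run at sequence length 131072,
# for llama-3.1-8B-Q8-GQA, copied from the table above.
llama_cpp_read    = 145_144_101_945_344
llama_cpp_written = 142_595_062_256_640
nihilus_read      =  25_713_858_816
nihilus_written   =  17_179_871_232

# How many times more traffic llama.cpp generates than Nihilus.
read_ratio = llama_cpp_read / nihilus_read
write_ratio = llama_cpp_written / nihilus_written

print(f"read amplification:  {read_ratio:,.1f}x")
print(f"write amplification: {write_ratio:,.1f}x")
```

The same arithmetic applies to any row: at short lengths the read totals are dominated by the model weights and nearly identical, while at long lengths the ratios grow into the thousands.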