Taalas serves Llama 3.1 8B at 17,000 tokens/second
This new Canadian hardware startup just announced their first product – a custom hardware implementation of the Llama 3.1 8B model (from July 2024) that can run at a staggering 17,000 tokens/second. I was going to include a video of their demo but it’s so fast it would look more like a screenshot. You can try it out at chatjimmy.ai. They describe their Silicon Llama as “aggressively quantized, combining 3-bit […]