How we sped up transformer inference 100x for 🤗 API customers
Tokenization is often a bottleneck for efficiency during inference. We use the most efficient methods from the 🤗 Tokenizers library, leveraging the Rust implementation of the model tokenizer in combination with smart caching to get up to 10x speedup for the overall latency.
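As an illustration only (the post does not publish its serving code), here is a minimal sketch of the idea: a Rust-backed fast tokenizer from the 🤗 Tokenizers library combined with a simple cache keyed on the raw input text, so repeated requests skip tokenization entirely. The model name and cache size are assumptions, not details from the post.

```python
from functools import lru_cache

from transformers import AutoTokenizer

# use_fast=True selects the Rust implementation backed by 🤗 Tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

@lru_cache(maxsize=10_000)  # hypothetical cache size
def encode(text: str):
    # Return tuples so cached results are immutable and cheap to reuse
    enc = tokenizer(text, truncation=True, max_length=512)
    return tuple(enc["input_ids"]), tuple(enc["attention_mask"])

if __name__ == "__main__":
    ids, mask = encode("Transformers are fast when tokenization is not the bottleneck.")
    print(len(ids), "tokens")
```

Identical requests then pay the tokenization cost once; everything after the first call is a dictionary lookup.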
Once the compute platform has been selected for the use case, we can go to work. Here are some CPU-specific techniques that can be applied with a static graph (see the sketch after this list):

- Optimizing the graph (removing unused flow)
- Fusing layers (with specific CPU instructions)
- Quantizing the operations
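The post does not name its exact tooling, but ONNX Runtime exposes the same three steps, so here is a minimal sketch under that assumption: graph optimization and CPU-kernel fusion happen when the inference session is created, and dynamic int8 quantization is applied to the exported graph afterwards. The model file paths are hypothetical.

```python
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Steps 1 & 2: remove unused nodes and fuse layers into CPU-friendly kernels
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model-optimized.onnx"  # persist the fused graph
session = ort.InferenceSession(
    "model.onnx",  # hypothetical exported static graph
    sess_options,
    providers=["CPUExecutionProvider"],
)

# Step 3: quantize the operations (int8 weights) for a further CPU speedup
quantize_dynamic("model-optimized.onnx", "model-quantized.onnx", weight_type=QuantType.QInt8)
```

The key design point is that all three optimizations require a static graph: once the model is exported, the runtime can reason about the whole computation ahead of time instead of optimizing op by op at execution.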