LLMs running in the browser
This post explores various ways to run Large Language Models (LLMs) in the browser. I will:
- Review the available frameworks today, and choose the most compelling options
- Implement a working example for each framework
- Evaluate them both quantitatively (specifically, speed) and qualitatively (documentation, ease-of-use, etc.)
This was written in the spring of 2024, and things move quickly; consider this post a snapshot in time.
Why
Before we jump in, you might wonder why someone would want to run an LLM in the browser.
For me, there are three reasons:
- Data privacy - keep your data local
- Latency - avoid round trips to the server (also can support offline mode)
- Cost - run models locally, save money. This applies both to the user (avoid paying for an API) and operators (run models on the users’ devices and avoid server inference costs)
One very large drawback is that LLMs tend to be huge, ranging from hundreds of MBs to several GBs. If your users are on, say, a mobile data plan, downloading an LLM won’t be feasible. However, for desktop users with robust data connections, including Electron apps and Node apps, browser-based LLMs remain a viable approach.
1. Available options
The options I’ve found for running LLMs in the browser include:
- Transformers.js - a Hugging Face library built on top of ONNX Runtime
- web-llm - put out by the MLC team, and running on top of Apache TVM Unity
- Candle - another Hugging Face library, this one in Rust and converting to WASM
- Burn - another Rust library compiling to WGPU (and Candle)
- Tensorflow.js - a Google library for doing machine learning in JavaScript, with CPU, WebGL, WASM, and WebGPU backends
- ONNX - Microsoft library, supports WebGL, WebGPU, and WASM
I chose to evaluate Transformers.js, web-llm, and Candle, and discarded the rest for the following reasons:
- Burn leverages the Candle backend so an evaluation of Candle should cover both (and Candle offers specific LLM examples).
- ONNX is leveraged by Transformers.js, so an evaluation of Transformers.js should have implications for ONNX (and Transformers.js offers specific LLM examples).
- Tensorflow.js doesn’t seem to offer much in the way of cutting edge LLM support, though it’s been around for a long time and boasts great support. I’m keeping my eyes on TFJS in the hopes that cutting-edge LLMs get support soon.
2. Implementations
Transformers.js
Transformers.js is a JavaScript library offered by Hugging Face, and is “designed to be functionally equivalent to Hugging Face’s transformers python library”. transformers is a key part of the Python ecosystem and often used for running LLMs, so feature parity is a huge plus. In addition to text, Transformers.js supports vision, audio, and multimodal tasks, so it’s a great choice for supporting a wide variety of modalities.
Transformers.js offers an NPM package at @xenova/transformers. Models can be loaded and used with something like:
import { pipeline } from '@xenova/transformers';

const generator = await pipeline('text-generation', 'Xenova/distilgpt2');
const text = 'I enjoy walking with my cute dog,';
const output = await generator(text);
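The generator also accepts generation options. Here’s a minimal sketch of capping the completion length; the option names (max_new_tokens, temperature, do_sample) and the shape of the output are my assumptions from the library’s generation config, so double-check them against the docs for the version you install:
// Sketch: passing generation options. Option names are assumptions
// from the library's generation config; verify before use.
const completion = await generator(text, {
  max_new_tokens: 128,  // cap the completion length
  temperature: 0.7,     // sampling temperature
  do_sample: true,      // sample instead of greedy decoding
});
console.log(completion[0].generated_text);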
Web-LLM
web-llm runs on top of TVM. An NPM package is available at @mlc-ai/web-llm. Models can be loaded with:
import * as webllm from '@mlc-ai/web-llm';

const chat = new webllm.ChatModule();
await chat.reload('Phi1.5-q0f16', undefined, {
  "model_list": [
    // Phi-1.5
    {
      "model_url": "https://huggingface.co/mlc-ai/phi-1_5-q0f16-MLC/resolve/main/",
      "local_id": "Phi1.5-q0f16",
      "model_lib_url": "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/phi-1_5/phi-1_5-q0f16-ctx2k-webgpu.wasm",
      "vram_required_MB": 5818.09,
      "low_resource_required": false,
      "required_features": ["shader-f16"],
    },
  ],
});
const reply = await chat.generate(prompt);
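Loading a multi-GB model takes a while, so it helps to surface progress to the user. A sketch, assuming the init-progress callback exposed by web-llm’s ChatModule at the time of writing (the method name and report fields are taken from its TypeScript definitions; set it before calling reload):
// Sketch: report download/compile progress during chat.reload(). The method
// name and report fields are assumptions based on web-llm's typings.
chat.setInitProgressCallback((report) => {
  console.log(report.text); // e.g. fetching / compiling status messages
});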
Candle
Candle is a Rust-based framework that compiles models into WASM. Like Transformers.js, this is a Hugging Face library, and boasts a larger mandate than just LLMs (namely, supporting training, any model, and a number of compilation targets).
Candle requires having Rust installed and does not offer an NPM package. Once you’ve got the repository cloned and installed, you can run:
sh build-lib.sh
to generate the WASM build outputs. Then a model can be loaded and run with:
import init, { Model } from "./build/m.js";

await init();

const modelConfig = {
  base_url: "https://huggingface.co/lmz/candle-quantized-phi/resolve/main/",
  model: "model-q4k.gguf",
  tokenizer: "tokenizer.json",
  config: "phi-1_5.json",
  quantized: true,
  seq_len: 2048,
  size: "800 MB",
};

// Resolve the URLs for the weights, tokenizer, and config files
const weightsURL =
  modelConfig.model instanceof Array
    ? modelConfig.model.map((m) => modelConfig.base_url + m)
    : modelConfig.base_url + modelConfig.model;
const tokenizerURL = modelConfig.base_url + modelConfig.tokenizer;
const configURL = modelConfig.base_url + modelConfig.config;

// Fetch everything up front (see the demo for the fetch/concatenation helpers)
const [weightsArrayU8, tokenizerArrayU8, configArrayU8] = await Promise.all([
  concatenateArrayBuffers([].concat(weightsURL)),
  fetchArrayBuffer(tokenizerURL),
  fetchArrayBuffer(configURL),
]);

const model = new Model(
  weightsArrayU8,
  tokenizerArrayU8,
  configArrayU8,
  modelConfig.quantized
);

// prompt, temp, top_p, repeatPenalty, and seed are assumed to be supplied by
// the caller (see the demo)
const firstToken = model.init_with_prompt(
  prompt,
  temp,
  top_p,
  repeatPenalty,
  64,
  BigInt(seed)
);

const maxTokens = modelConfig.seq_len - prompt.length - 1;
let sentence = firstToken;
let tokensCount = 0;

while (tokensCount < maxTokens) {
  const token = await model.next_token();
  if (token === "<|endoftext|>") {
    break;
  }
  sentence += token;
  tokensCount++;
}
3. Evaluations
Qualitative Evaluations
Transformers.js
Transformers.js is really fantastic. It’s simple to get up and running and the code is concise and readable. I’m a big fan of how easily models can be loaded directly from Hugging Face (though models do need to be converted beforehand for the library; Xenova maintains a number of LLMs).
On the cons side, I found some of the TypeScript definitions to be out of date or missing, and the documentation (particularly around callbacks) to be lacking. The other big con is that a large number of models are broken in the browser (including Phi-2) due to an upstream issue with ONNX. The author has a PR fixing it here, so hopefully this will be fixed soon.
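For illustration, here’s roughly how the model download progress callback can be wired up; the progress_callback option and its payload fields are my reading of the library source rather than documented API, so treat this as an assumption:
// Sketch: tracking model download progress. The option name and payload
// fields are assumptions based on reading the library source.
const generator = await pipeline('text-generation', 'Xenova/distilgpt2', {
  progress_callback: (p) => {
    if (p.status === 'progress') {
      console.log(`${p.file}: ${Math.round(p.progress)}%`);
    }
  },
});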
ONNX, the backend on top of which Transformers.js runs, supports both CPU and GPU processing:
With onnxruntime-web, you have the option to use webgl or webgpu for GPU processing, and WebAssembly (wasm, alias to cpu) for CPU processing. All ONNX operators are supported by WASM but only a subset are currently supported by WebGL and WebGPU.
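If you drop down to onnxruntime-web directly, the backend is selected via execution providers, which are tried in order. A minimal sketch (the model path is a placeholder, and some versions require a separate import for the WebGPU build):
import * as ort from 'onnxruntime-web';

// Sketch: choosing a backend in onnxruntime-web. Providers are tried in
// order, so this prefers WebGPU and falls back to WASM (CPU).
// "./model.onnx" is a placeholder path, not an artifact from this post.
const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'],
});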
Web-LLM
The outputs of web-llm are rock solid, and its TypeScript definitions are great.
However, I found the code to be overly prescriptive, almost as if specifically designed to support the demo. For example, the parameters for specifying a model configuration are strangely constructed. It was difficult to strip away the demo code to a minimal running example.
I believe web-llm only supports WebGPU; it’s not clear to me whether the project supports CPU-only inference.
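Since WebGPU availability still varies across browsers, it’s worth feature-detecting before trying to load a model. A minimal sketch using the standard navigator.gpu API:
// Sketch: detect WebGPU support before initializing web-llm.
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (!(await hasWebGPU())) {
  // fall back to a server-side or WASM/CPU-based option
}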
Candle
Candle is written in Rust. This may or may not be a dealbreaker. You’ll need a Rust-compatible build step, a way to check in WASM artifacts, and more.
What I find compelling about this compilation-based approach is the resulting size of the models; Burn (another Rust-written framework) writes:
The main differentiating factor of this example’s approach (compiling rust model into wasm) and other popular tools, such as TensorFlow.js, ONNX Runtime JS and TVM Web is the absence of runtime code. The rust compiler optimizes and includes only used burn routines. 1,509,747 bytes out of Wasm’s 1,866,491 byte file is the model’s parameters. The rest of 356,744 bytes contain all the code (including burn’s nn components, the data deserialization library, and math operations).
The code for loading a WASM model is considerably more verbose than the other two solutions on offer, and you must host and load the WASM files yourself.
Candle does not yet support WebGPU, but support appears to be in active development. It also sounds like the Transformers.js maintainer will add Candle as a backend as soon as that lands, so the two projects seem likely to converge, which makes sense given they’re both Hugging Face projects.
Quantitative Evaluations
For each set of measurements, I tried to target the same Phi 1.5 model, with a maximum of 128 tokens. I ran each evaluation three times and averaged the results, in Chrome on an M3 MacBook. I’m measuring model loading from cache (not over the network).
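For reference, the timings come from a simple harness along these lines (a sketch of the approach rather than the exact benchmarking code; loadModel is a placeholder for each framework’s loading call):
// Sketch: time an async operation several times and average the results.
// loadModel() is a placeholder for each framework's loading call.
async function timeIt(fn, runs = 3) {
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    times.push(performance.now() - start);
  }
  return times.reduce((a, b) => a + b, 0) / times.length;
}

const avgLoadMs = await timeIt(() => loadModel());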
Transformers.js
I would love to evaluate the model Xenova/phi-1_5_dev; however, it’s currently broken in the browser due to the bug mentioned above, so I won’t include generation time measurements here.
| Average loading time for a model from cache: | 10.8s |
Transformers.js boasts ~20k weekly downloads and 5.7k stars on GitHub.
Web LLM
I’m evaluating the model Phi1.5-q0f16.
| Average loading time for a model from cache: | 4.5s |
| Average generation time for a 128 token completion: | 11.7s |
Web-LLM has 207 weekly downloads and 8.5k stars on GitHub.
Candle
I’m evaluating the model Phi 1.5 q4k.
| Average loading time for a model from cache: | 1.5s |
| Average generation time for a 128 token completion: | 17.2s |
Candle has 11.9k stars on GitHub (no NPM package is available, as it’s Rust-based).
Conclusion
There’s no clear winner, which makes sense given how fast this space is moving.
Candle’s smaller model size is compelling, as reflected in how quickly it loads, but the lack of GPU support (for now) is a drawback. web-llm seems to offer a nice balance of loading and generation speed, but the lack of explicit CPU support is a drag. Transformers.js seems to have the most robust ecosystem of models and documentation, but has a number of open bugs that make it hard to fully evaluate; though the author’s intent to incorporate Candle as a backend may bring it the best of both worlds.