Run 3 LLMs on an 8 GB GPU with C++ Multiplexing

TL;DR: An engineer ran three different LLMs on a single 8 GB GPU by combining layer multiplexing, implemented in C++, with admission control to stay within the VRAM limit. This makes multi-agent systems possible on affordable hardware.

What happened?

A developer published a method on Towards Data Science for running three large language models (LLMs) concurrently on a GPU with only 8 GB of VRAM. The technique, called layer multiplexing, is implemented in C++ and paired with an admission control system that tracks available memory to prevent out-of-memory failures. The experiment used three agents, each backed by a different LLM, and served all of them without exceeding the GPU's memory. The original article does not name the exact models; they are presumably quantized versions of models such as Llama 2 7B or Mistral 7B, whose weights take roughly 3.5 to 4 GB each in 4-bit format. Three such models add up to more than 8 GB, so they cannot all stay fully resident, which is the problem layer multiplexing addresses. The approach is notable because running several LLMs at once has traditionally meant partitioning VRAM among them or using multiple GPUs, neither of which is practical on consumer hardware.
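The memory pressure can be made concrete with a back-of-the-envelope calculation. The sketch below is not from the original article: it assumes three 7B-parameter models at 4 bits per weight and counts weights only, ignoring the KV cache, activations, and quantization metadata. The names weight_gb and fits_resident are illustrative.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Approximate size of a model's quantized weights in GB (1 GB = 1e9 bytes).
// Billions of parameters times bytes per parameter gives GB directly.
// Assumption: weights only; KV cache and activations are not counted.
double weight_gb(double params_billion, int bits_per_weight) {
    assert(params_billion >= 0.0 && bits_per_weight > 0);
    return params_billion * bits_per_weight / 8.0;
}

// True if every model's weights could stay resident in VRAM at the same time.
bool fits_resident(const std::vector<double>& model_gb, double vram_gb) {
    const double total = std::accumulate(model_gb.begin(), model_gb.end(), 0.0);
    return total <= vram_gb;
}
```

Under these assumptions, weight_gb(7.0, 4) gives 3.5 GB per model, so fits_resident({3.5, 3.5, 3.5}, 8.0) is false while two such models would fit. That gap is why some form of swapping is needed to serve all three.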

Why is it important?

LLMs are commonly run on GPUs with 16 GB or 24 GB of VRAM, which limits their use to expensive hardware such as the RTX 3090/4090 or professional GPUs. According to Steam data, the RTX 3060 with 12 GB has ranked at or near the top among gamers' GPUs, and a significant portion still uses GPUs with 8 GB or less. This demonstration suggests that multiple models can run on mainstream 8 GB cards (such as an RTX 3060 Ti or RTX 4060), broadening access to generative AI. It also makes it possible to build multi-agent systems with different specialized models (e.g., one for reasoning, one for code, and one for creativity) without expensive infrastructure. This could accelerate LLM adoption in startups, among independent developers, and in educational settings. Historically, LLM inference has been dominated by large companies with massive computing resources; techniques like this help level the playing field.

How does the technique work?

Layer multiplexing keeps only the active layers of each model in VRAM and swaps the rest in on demand. Instead of holding all three models fully in memory, the system loads just the layers needed for the current inference step. Admission control prioritizes requests and prevents memory saturation, much as an operating system manages virtual memory. Everything is implemented in C++ to minimize overhead and to access CUDA directly. Layer swapping adds latency, but the author reports that it is manageable for non-real-time applications. Compared with earlier approaches such as CPU offloading or running the models one after another, the technique lets the three models serve requests concurrently without dedicating a full copy of VRAM to each.
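The idea can be illustrated with a small sketch. This is not the author's implementation, which the article does not show: it is a hypothetical LayerCache class that assumes a fixed byte budget, least-recently-used eviction, and a deliberately simple admission check. It models the bookkeeping only; the point where a real system would copy weights to the GPU is marked in a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <utility>

// Identifies one layer of one model: (model id, layer index).
using LayerKey = std::pair<int, int>;

// Hypothetical cache of VRAM-resident layers with a byte budget and LRU eviction.
class LayerCache {
public:
    explicit LayerCache(std::size_t budget_bytes) : budget_(budget_bytes) {}

    // Simplified admission control: refuse a layer that could never fit,
    // even in an otherwise empty cache.
    bool admit(std::size_t layer_bytes) const { return layer_bytes <= budget_; }

    // Make a layer resident. Returns true on a hit, false if it had to be loaded.
    bool acquire(const LayerKey& key, std::size_t layer_bytes) {
        assert(admit(layer_bytes));
        auto it = index_.find(key);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // mark as most recently used
            return true;
        }
        // Evict least recently used layers until the new one fits.
        while (used_ + layer_bytes > budget_) evict_oldest();
        // A real implementation would copy the layer's weights to the GPU here.
        lru_.push_front(Entry{key, layer_bytes});
        index_[key] = lru_.begin();
        used_ += layer_bytes;
        ++loads_;
        return false;
    }

    bool resident(const LayerKey& key) const { return index_.count(key) != 0; }
    std::size_t used() const { return used_; }
    std::size_t loads() const { return loads_; }

private:
    struct Entry { LayerKey key; std::size_t bytes; };

    void evict_oldest() {
        const Entry& victim = lru_.back();
        used_ -= victim.bytes;
        index_.erase(victim.key);
        lru_.pop_back();
    }

    std::size_t budget_;
    std::size_t used_ = 0;
    std::size_t loads_ = 0;
    std::list<Entry> lru_;  // front = most recently used
    std::map<LayerKey, std::list<Entry>::iterator> index_;
};
```

A scheduler would call acquire for each layer just before running it, so layers shared across consecutive requests to the same model stay resident while layers of idle models are evicted first. The loads counter gives a rough measure of how much swapping a given workload causes. A production admission controller would also need to account for KV cache and activation memory per request, which this sketch leaves out.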

Market implications

For startups and independent developers who cannot afford high-end GPUs, this technique could lower the cost of deploying LLMs. It may also open the door to edge computing and resource-constrained devices, such as laptops or embedded systems. However, constant layer swapping can hurt performance, especially when models are large or request frequency is high. The technique is therefore a poor fit for real-time applications such as interactive chatbots, but it works for batch processing and asynchronous tasks. In the market, it could pressure hardware vendors to offer more VRAM at affordable prices, or encourage even more aggressive compression techniques. Companies like NVIDIA might see reduced demand for high-end GPUs for inference, though training will still require powerful hardware. Startups like Groq or Cerebras, which bet on specialized hardware, could face competition from software solutions like this.

What readers should know

The technique is experimental and has not been generalized; implementing it requires advanced knowledge of C++ and CUDA. There is no ready-to-use library yet.
Performance depends on model size and on how often layers are swapped. For 7B models in 4-bit format, each layer swap is expected to add a few milliseconds of latency, though the exact cost depends on the hardware.
Tools may emerge to automate this process, for example if optimized inference libraries (e.g., llama.cpp or vLLM) incorporate layer multiplexing.
The technique is likely most effective when the models have similar architectures, since layers of similar size are easier to swap within a fixed memory budget.
For critical applications, testing with real workloads is recommended to measure the impact on latency and throughput.
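The per-swap latency figure can be sanity-checked with simple arithmetic. The sketch below is an estimate, not a measurement from the article: it assumes 3.5 GB of weights split evenly across 32 layers (the layer count of Llama 2 7B) and an effective host-to-device bandwidth of 25 GB/s, a plausible figure for PCIe 4.0 x16 that will vary by system. The function names are illustrative.

```cpp
#include <cassert>

// Estimated time to copy one layer from host memory to the GPU, in milliseconds.
// Assumption: weights are split evenly across layers and the transfer
// runs at a constant effective bandwidth.
double swap_ms(double model_gb, int num_layers, double link_gb_per_s) {
    assert(model_gb >= 0.0 && num_layers > 0 && link_gb_per_s > 0.0);
    return model_gb / num_layers / link_gb_per_s * 1000.0;
}

// Estimated time to swap in every layer of the model once, in milliseconds.
double full_pass_ms(double model_gb, int num_layers, double link_gb_per_s) {
    return swap_ms(model_gb, num_layers, link_gb_per_s) * num_layers;
}
```

With these assumed inputs, swap_ms(3.5, 32, 25.0) comes to about 4.4 ms per layer, and full_pass_ms(3.5, 32, 25.0) to about 140 ms if every layer had to be reloaded. That is consistent with the "few milliseconds per swap" estimate, and it shows why heavy swapping suits batch workloads better than interactive ones.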

“Running multiple LLMs on an 8 GB GPU is a technical milestone that challenges the barriers to entry in AI. This methodology could be the first step toward more accessible and decentralized AI infrastructure.”

In summary, while the presented technique is promising, it is still in an early stage. Interested developers will need patience and technical skills to adopt it. However, its potential impact on democratizing AI is significant, and we are likely to see more innovations in this direction in the coming months.
