How the pioneering process of model quantisation can supercharge edge AI

By Rahul Pradhan, VP of Product and Strategy, Couchbase.

Thursday, 18th January 2024. Posted by Phil Alsop.

The convergence of Artificial Intelligence (AI) and edge computing is promising transformative change across industries. From wearable health devices that collect patient data, to smart sensors that monitor inventory on retailers’ shelves – the possibilities are boundless. If its transformative potential wasn’t obvious enough, IDC forecasts edge computing spending to reach $317 billion in 2026.

But why edge AI? The purpose of edge AI is to bring data processing and models closer to where data is generated, such as on a remote server, tablet, IoT device, or smartphone. According to Gartner, more than half of all data analysis by deep neural networks will happen at the edge by 2025. This paradigm shift brings multiple advantages. One is reduced latency: by processing data directly on the device, edge AI reduces the need to transmit data back and forth to the cloud. This is critical for applications that depend on real-time data and require rapid responses.

Costs are also reduced. Processing data locally at the edge avoids expensive data transfer back to a central server. A further benefit is increased scalability: edge AI’s decentralised nature makes it easier to scale applications without relying on a central server for processing power.

With edge AI’s potential benefits clear, the question becomes: how can this potential be fully realised?

The power of model quantisation

A good place to start is by acknowledging the growing complexity of AI models. For edge AI to be effective, AI models need to be optimised for performance without compromising accuracy. Yet AI models are becoming larger and more complex, making them harder to handle. In turn, this creates challenges for deploying AI models at the edge, where devices often have limited resources and are constrained in their ability to support such models.

Model quantisation is a process that allows AI models to be deployed on edge devices with exactly these sorts of resource constraints, such as mobile phones, smart sensors, and other types of embedded systems.

It does this by decreasing the precision of the model’s parameters and activations, which reduces the memory and computation needed to run the AI model. Because fewer bits are used to represent each value, the model becomes smaller and cheaper to run, albeit less precise.
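
The trade-off described above can be illustrated with a minimal sketch of symmetric 8-bit quantisation. This is an illustrative assumption rather than the method of any particular library: the weight values, the single shared scale, and the function names are all hypothetical.

```python
# Minimal sketch of symmetric 8-bit quantisation (illustrative only).
# All names and values here are hypothetical, chosen for the example.

def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(quantised, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in quantised]

weights = [0.82, -1.27, 0.05, 0.4, -0.63]  # stand-ins for 32-bit float weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Precision is lost: each restored value is off by at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage shrinks: 4 bytes per weight versus 1 byte per weight plus one scale.
bytes_fp32 = 4 * len(weights)
bytes_int8 = 1 * len(weights) + 4

print(q, scale, max_error, bytes_fp32, bytes_int8)
```

Each weight now occupies one byte instead of four, and the rounding error per weight is bounded by half the scale. Practical schemes typically use one scale per block or channel rather than one for the whole tensor, which keeps the error smaller.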

Quantisation techniques to consider

Three techniques have emerged as transformative elements for model quantisation and fine-tuning. The focus of these techniques is to make the deployment and fine-tuning of large language models (LLMs) more efficient and accessible.

Firstly, GPTQ is a post-training quantisation technique for generative pre-trained transformer (GPT) models that reduces the precision of a neural network’s weights. By reducing the size of the model and its computational requirements, with minimal impact on accuracy, GPTQ can deliver efficiency benefits. This allows for lower memory usage and faster inference, as the model can make quicker predictions at a more manageable size.

GPTQ is, therefore, best suited to deploying LLMs in resource-limited environments, such as on edge devices.

Another method is Low-Rank Adaptation (LoRA). It’s a fine-tuning technique that adapts large, pre-trained models like GPT by training only a small number of additional parameters, while the original model weights remain fixed. LoRA is useful for fine-tuning large models when there are constraints on training resources or when the stability of the original model must be maintained. It's particularly effective in scenarios where the model is frequently updated based on new data.
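
As a rough sketch of the idea, the frozen weight matrix W can be left untouched while two small matrices, A and B, supply a low-rank update. The layer sizes, rank, scaling factor, and function names below are hypothetical choices for illustration. Initialising B to zero, so that the adapted layer starts out identical to the original, follows common LoRA practice.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scaling):
    """Compute W x + scaling * B (A x). W is frozen; only A and B are trained."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path through the adapter
    return [b + scaling * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8  # hypothetical sizes

W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero start: no change to the model yet
x = [random.random() for _ in range(d_in)]

y = lora_forward(W, A, B, x, alpha / rank)

full_params = d_in * d_out           # parameters in the frozen layer
lora_params = rank * (d_in + d_out)  # trainable parameters in A and B
print(full_params, lora_params)
```

With these sizes, the adapter trains 512 parameters against 4,096 in the frozen layer. The gap widens as layers get larger, because the adapter grows with the sum of the layer dimensions rather than their product.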

The third technique, QLoRA, or quantised LoRA, is a fine-tuning approach that reduces memory usage. Its authors used QLoRA to fine-tune more than 1,000 models, and it introduces multiple innovations to achieve efficiency without sacrificing performance. These include 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights; double quantisation, which reduces the average memory footprint by quantising the quantisation constants; and paged optimisers to manage memory spikes.
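
To see why quantising the constants matters, consider the overhead they add per parameter. The sketch below is arithmetic only. The block sizes and bit widths are assumptions based on the figures reported for QLoRA (64 weights per block, with constants re-quantised to 8 bits in groups of 256), and the 7-billion-parameter model is hypothetical.

```python
def single_quant_overhead(block_size, const_bits):
    """Extra bits per parameter when each block stores one constant."""
    return const_bits / block_size

def double_quant_overhead(block_size, const_bits, group_size, group_const_bits):
    """Extra bits per parameter when the constants are themselves quantised.

    Each block keeps a low-precision constant, and every group of constants
    shares one higher-precision second-level constant.
    """
    first_level = const_bits / block_size
    second_level = group_const_bits / (block_size * group_size)
    return first_level + second_level

# Assumed settings: 64 weights per block, 32-bit constants without double
# quantisation; 8-bit constants in groups of 256 with it.
single = single_quant_overhead(64, 32)          # 0.5 bits per parameter
double = double_quant_overhead(64, 8, 256, 32)  # about 0.127 bits per parameter

def footprint_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7_000_000_000  # hypothetical model size
print(footprint_gb(n_params, 16))          # 16-bit weights
print(footprint_gb(n_params, 4 + single))  # 4-bit weights, plain constants
print(footprint_gb(n_params, 4 + double))  # 4-bit weights, double quantisation
```

Under these assumptions, double quantisation saves roughly 0.37 bits per parameter. That is small per weight, but adds up to hundreds of megabytes on a model with billions of parameters.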

Selecting from these methods depends heavily on the project's unique requirements – whether it is at the fine-tuning or deployment stage, and what computational resources it has at its disposal.

Using these quantisation techniques, developers can effectively bring AI to the edge, creating a balance between performance and efficiency.

But model quantisation is not the only process that can enhance edge AI. There is also the importance of managing, distributing, and processing data to consider.

For edge AI to thrive, a persistent data layer is essential for local and cloud-based management, distribution, and processing of data. With the emergence of multimodal AI models, a unified platform capable of handling various data types is becoming vital for meeting edge computing’s operational demands.

Fusion is the future

As we move towards intelligent edge devices, the fusion of AI, edge computing, and edge database management will be central to heralding an era of fast, real-time, and secure AI solutions. Looking ahead, organisations can focus on implementing sophisticated edge strategies for efficiently and securely managing AI workloads, and streamlining the use of data within their business.