But with opportunity comes challenge. AI inference, the process of making predictions with a trained machine learning model, requires high processing performance within tight power budgets, regardless of deployment location – cloud, edge, or endpoint. It is generally accepted that CPUs alone cannot keep up, and that some form of compute acceleration is needed to process AI inference workloads efficiently.
At the same time, AI algorithms are evolving rapidly, faster than traditional silicon development cycles. Fixed-silicon devices, such as ASIC implementations of AI networks, risk becoming obsolete quickly as state-of-the-art models continue to advance.
Whole Application Acceleration
There is a third, less well-known challenge: AI inference does not get deployed in isolation. Real AI deployments typically require non-AI processing, both before and after the AI function. For example, an image may need to be decompressed and scaled to fit the AI model's input requirements. These traditional processing functions must operate at the same throughput as the AI function, again with high performance and low power. Like the AI inference itself, the non-AI pre- and post-processing functions increasingly need some form of acceleration.
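In software terms, such a whole application looks like a chain of functions around the model, not the model alone. The sketch below is purely illustrative: the stage names (decode, scale, infer, postprocess), the toy weights, and the label map are all assumptions, with NumPy standing in for real codec and model libraries.

```python
import numpy as np

def decode(raw):
    # Stand-in for image decompression (e.g. JPEG decode):
    # here we simply reshape raw bytes into a 4 x 4 "image".
    return np.frombuffer(raw, dtype=np.uint8).reshape(4, 4)

def scale(img, size):
    # Nearest-neighbour resize to the model's expected input size.
    rows = np.linspace(0, img.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size).astype(int)
    return img[np.ix_(rows, cols)]

def infer(x):
    # Stand-in for the AI model: a fixed linear layer + argmax.
    w = np.eye(x.size)[:, :3]        # toy weights, 3 output classes
    return int(np.argmax(x.flatten() @ w))

def postprocess(label):
    # Map the raw class index to an application-level result.
    return {0: "cat", 1: "dog", 2: "other"}[label]

raw = bytes(range(16))               # pretend this is a compressed image
result = postprocess(infer(scale(decode(raw), 2)))
print(result)
```

The point is not the toy model but the shape of the deployment: every stage sits on the same throughput-critical path, so accelerating only `infer` leaves `decode` and `scale` as the limiting factors.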
To build a real application, the whole application must be implemented efficiently. In a data center, an application may run in thousands or even millions of parallel instances, so every fraction of a watt saved per instance makes a huge difference to overall power consumption.
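To put rough numbers on that scaling effect, consider the back-of-the-envelope calculation below. The savings per instance and the instance count are illustrative assumptions, not figures from any measured deployment.

```python
# Illustrative figures only: both values are assumptions.
watts_saved_per_instance = 0.5       # half a watt saved per instance
instances = 1_000_000                # a large-scale deployment

total_kw = watts_saved_per_instance * instances / 1000
hours_per_year = 24 * 365
mwh_per_year = total_kw * hours_per_year / 1000

print(f"{total_kw:.0f} kW saved, ~{mwh_per_year:.0f} MWh per year")
# prints: 500 kW saved, ~4380 MWh per year
```

Half a watt per instance sounds negligible; at a million instances it is half a megawatt of continuous draw, which is why per-instance efficiency dominates the economics at scale.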
A solution is viable only if the whole application meets both the performance goal, through acceleration, and the power budget, through greater efficiency. So how do we viably implement whole application acceleration?
There are three key elements: the ability to build a custom data path; a single-device implementation; and the ability to take advantage of the latest AI models as they continue to evolve and improve. Let's look at each in turn.
The ability to build a custom data path
Most forms of AI inference operate on streaming data. Often the data is in motion – part of a video feed, medical images being processed, or network traffic being analyzed. Even when data is stored on disk, it is read off disk and streamed through the AI application. A custom data path is the most efficient way to process such streams. It frees the application from the limitations of a traditional von Neumann CPU architecture, where data is read from memory in small chunks, operated on, and written back to memory. Instead, a custom data path passes data from one processing engine to the next, with low latency and the right level of performance. Too little processing performance fails to meet the application's requirements; too much is inefficient, wasting power or physical space on capability that sits idle. A custom data path provides the balance, right-sizing the implementation for the application.
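One way to picture a data path in software terms is a chain of stages that each hand results directly to the next, rather than round-tripping every item through a shared memory. A minimal sketch using Python generators (the stage names and transformations are hypothetical stand-ins for real processing engines):

```python
def source():
    # Stand-in for a streaming input, e.g. frames from a video feed.
    for frame_id in range(5):
        yield frame_id

def preprocess(frames):
    # Each frame flows straight to the next stage, one at a time.
    for f in frames:
        yield f * 2                  # e.g. scaling

def inference(frames):
    for f in frames:
        yield f + 100                # stand-in for the AI model

def postprocess(frames):
    for f in frames:
        yield f"result:{f}"

# Stages are chained end to end; no stage buffers the whole stream.
pipeline = postprocess(inference(preprocess(source())))
print(list(pipeline))
# prints: ['result:100', 'result:102', 'result:104', 'result:106', 'result:108']
```

In hardware, a custom data path takes this further: each stage is sized to exactly the throughput the application needs, and data moves engine-to-engine without the fetch/operate/write-back cycle of a CPU.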
Single device implementation
Some solutions are good at AI inference, but not at whole application processing. Fixed-architecture devices such as GPUs generally fall into this category. GPUs can deliver high tera-operations-per-second (TOPS) numbers, a common performance metric, but AI inference performance must be matched by pre- and post-processing performance. If the non-AI components cannot be implemented efficiently on the same GPU, a multi-device solution is needed, and moving data between devices is costly in both power and latency. A single device that can efficiently implement the whole application therefore has a significant advantage in real-world AI inference deployments.
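A rough illustration of why a peak TOPS figure alone can mislead: in a pipeline, end-to-end throughput is set by the slowest stage, not the fastest. All the frame-rate figures below are invented for the example.

```python
# All figures are illustrative assumptions, not measured data.
stage_fps = {
    "pre-processing (CPU)":  60,     # decode + scale
    "AI inference (GPU)":   500,     # what the headline TOPS buys
    "post-processing (CPU)": 90,
}

# The pipeline runs at the rate of its slowest stage.
bottleneck = min(stage_fps, key=stage_fps.get)
effective_fps = stage_fps[bottleneck]
utilization = effective_fps / stage_fps["AI inference (GPU)"]

print(f"bottleneck: {bottleneck}, effective {effective_fps} fps")
print(f"inference utilization: {utilization:.0%}")
# prints: bottleneck: pre-processing (CPU), effective 60 fps
#         inference utilization: 12%
```

With these assumed numbers, the accelerator sits idle 88% of the time: the headline inference performance is wasted unless the non-AI stages are accelerated on the same device.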
Adapt and evolve with the latest AI models
The pace of innovation in AI is staggering. What’s considered the state of the art today could easily be rendered nearly obsolete six months from now. Applications that use older models risk being uncompetitive, so the ability to rapidly implement the latest models is critical.
So what technology allows dynamic updates of the AI models while providing the ability to build a custom data path to accelerate both AI and non-AI processing in a single device? The answer is adaptive computing platforms.
Adaptive Computing Platforms
Adaptive computing platforms are built on hardware that can be dynamically reconfigured after manufacturing. This includes long-standing technologies such as FPGAs, as well as more recent innovations such as Xilinx's AI Engine. A single-device platform such as Xilinx's Versal™ Adaptive Compute Acceleration Platform can accelerate both the AI and non-AI processing functions by allowing custom data paths to be built. It can also implement the latest AI models quickly and efficiently, because the hardware can be rapidly reconfigured. Adaptive computing devices provide the best of both worlds: the efficiency benefits of custom ASICs without the lengthy and expensive design cycles.
Xilinx Versal AI Core Series VC1902
The best implementation of an AI application need not be the fastest; it needs to be the most efficient while remaining flexible. It must be right-sized, delivering the performance that is needed – nothing more, nothing less.
Summary
As AI inference becomes more pervasive, the challenge is not just how to deploy the AI model, but how to deploy the whole AI application most efficiently. When applications are replicated thousands or even millions of times, a small energy saving in each instance could add up to an entire power station's worth of energy. Multiply that by the myriad new AI applications under development, and the effect will be dramatic. Efficient acceleration of whole AI applications should be a goal for everyone in the technology industry, and adaptive computing platforms offer a competitive way to achieve it.