Best practices for building an AI serving engine

One of the most critical steps in any operational machine learning (ML) pipeline is artificial intelligence (AI) serving, a task usually performed by an AI serving engine. By Yiftach Schoolman, Redis Labs Co-founder and CTO.

Thursday, 22nd July 2021. Posted by Phil Alsop.

AI serving engines evaluate and interpret data in the knowledge base, handle model deployment, and monitor performance. They represent a whole new world in which applications can leverage AI technologies to improve operational efficiency and solve significant business problems.

AI Serving Engine for Real Time: Best Practices

I have been working with Redis Labs customers to better understand their challenges in taking AI to production and how they need to architect their AI serving engines. To help, we’ve developed a list of best practices:

Fast end-to-end serving

If you are supporting real-time apps, you should ensure that adding AI functionality in your stack will have little to no effect on application performance.
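
One way to check this is to compare the tail latency of a request path with and without the model call. The following is a minimal sketch rather than a benchmark harness: `percentile` and `measure_ms` are hypothetical helper names, and nearest-rank is only one of several common percentile definitions.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def measure_ms(handler: Callable[[], object], runs: int = 1000) -> List[float]:
    """Call handler `runs` times and return each call's wall-clock latency in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        handler()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

Comparing `percentile(measure_ms(handler), 99)` for the request path before and after the AI call is added gives a rough view of the p99 overhead; `handler` stands in for your own request function.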

No downtime

As every transaction potentially includes some AI processing, you need to maintain a consistent SLA, preferably at least five-nines (99.999%) for mission-critical applications, using proven mechanisms such as replication, data persistence, multi-availability-zone/rack deployment, Active-Active geo-distribution, periodic backups, and auto-cluster recovery.
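
To make the five-nines target concrete, it helps to convert an availability percentage into a downtime budget. The sketch below does only that arithmetic; the function name `allowed_downtime_seconds` is illustrative and a 365-day year is assumed.

```python
def allowed_downtime_seconds(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the maximum downtime, in seconds, that an availability target permits."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five-nines leaves a little over five minutes of downtime per year, which is why automated failover and recovery matter more than manual intervention.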

Scalability

Driven by user behavior, many applications are built to serve peak use cases, from Black Friday to the big game. You need the flexibility to scale the AI serving engine out or in based on your expected and current loads.
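
A simple sizing rule illustrates the idea: divide the current request rate by the throughput one replica can sustain at a target utilization, then clamp the result. This is a sketch under assumed numbers, not a production autoscaler; `replicas_needed` and all of its parameters and defaults are hypothetical.

```python
import math


def replicas_needed(
    current_rps: float,
    per_replica_rps: float,
    target_utilization: float = 0.7,
    min_replicas: int = 2,
    max_replicas: int = 64,
) -> int:
    """Estimate how many serving replicas are needed for the current load.

    Each replica is planned to run at `target_utilization` of its measured
    capacity so there is headroom for short bursts.
    """
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    raw = math.ceil(current_rps / (per_replica_rps * target_utilization))
    # Clamp so the engine never scales to zero or beyond the allowed maximum.
    return max(min_replicas, min(max_replicas, raw))
```

Calling the function periodically with the observed request rate yields a scale-out during a peak and a scale-in afterwards, while the minimum keeps some redundancy in place at all times.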

Support for multiple platforms

Your AI serving engine should be able to serve deep-learning models trained by state-of-the-art platforms like TensorFlow or PyTorch. In addition, classical machine-learning models such as random forest and linear regression still provide good predictive accuracy for many use cases and should be supported by your AI serving engine.
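
One common way to support several model types is to hide each behind the same small interface. The sketch below is an illustration only: `Model`, `LinearRegressionModel`, `CallableModel`, and `ServingEngine` are hypothetical names, and no real TensorFlow or PyTorch API is called. A deep-learning model would be wrapped in a plain Python callable supplied by the caller.

```python
from typing import Callable, Dict, Protocol, Sequence


class Model(Protocol):
    """Anything the engine can serve: it only has to expose predict()."""

    def predict(self, features: Sequence[float]) -> float: ...


class LinearRegressionModel:
    """A classical model implemented directly in Python."""

    def __init__(self, weights: Sequence[float], bias: float) -> None:
        self._weights = list(weights)
        self._bias = bias

    def predict(self, features: Sequence[float]) -> float:
        if len(features) != len(self._weights):
            raise ValueError("feature count does not match weight count")
        return sum(w * x for w, x in zip(self._weights, features)) + self._bias


class CallableModel:
    """Adapter for any callable, e.g. a function that invokes a deep-learning runtime."""

    def __init__(self, fn: Callable[[Sequence[float]], float]) -> None:
        self._fn = fn

    def predict(self, features: Sequence[float]) -> float:
        return float(self._fn(features))


class ServingEngine:
    """Routes prediction requests to registered models by name."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def register(self, name: str, model: Model) -> None:
        self._models[name] = model

    def predict(self, name: str, features: Sequence[float]) -> float:
        return self._models[name].predict(features)
```

The application only ever calls `ServingEngine.predict`, so swapping a linear regression for a neural network does not change the calling code.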

Easy to deploy new models

Most companies want the option to frequently update their models according to market trends or to exploit new opportunities. Updating a model should be as transparent as possible and should not affect application performance.
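
A transparent update usually means validating the new model off the request path and then switching to it in a single step, so that in-flight requests finish on the old version. The following is a minimal single-process sketch of that idea; `HotSwappableModel` and the `smoke_input` check are assumptions made for illustration, and a distributed engine would need more than this.

```python
import threading
from typing import Callable, Sequence, Tuple

Predictor = Callable[[Sequence[float]], float]


class HotSwappableModel:
    """Holds the live model and lets a new version replace it without stopping requests."""

    def __init__(self, predictor: Predictor, version: str) -> None:
        self._swap_lock = threading.Lock()  # serializes concurrent swaps
        self._current: Tuple[str, Predictor] = (version, predictor)

    @property
    def version(self) -> str:
        return self._current[0]

    def swap(self, predictor: Predictor, version: str, smoke_input: Sequence[float]) -> None:
        """Validate the new model on a sample input, then make it live."""
        predictor(smoke_input)  # raises if the new model is broken; old model stays live
        with self._swap_lock:
            self._current = (version, predictor)

    def predict(self, features: Sequence[float]) -> Tuple[str, float]:
        # Take one snapshot of the (version, predictor) pair so a swap that
        # happens mid-request cannot mix versions.
        version, predictor = self._current
        return version, predictor(features)
```

Returning the version alongside the prediction makes it easy to attribute each result to the model that produced it, which also helps with monitoring.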

Performance monitoring and retraining

Everyone wants to know how well a trained model performs in the real world and to be able to tune it accordingly. Require that the AI serving engine support A/B testing so you can compare a new model against a default model. The system should also provide tools to rank the AI execution of your applications.
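
As an illustration of A/B testing at serving time, the sketch below sends a fixed share of users to a candidate model and tracks how often each variant is correct. It is a simplified example: `assign_variant`, `ABStats`, and the 10% default share are hypothetical, and a real comparison would also need a statistical significance check.

```python
import hashlib
from typing import Dict, Optional


def assign_variant(user_id: str, candidate_share: float = 0.1) -> str:
    """Deterministically map a user to 'candidate' or 'default'.

    Hashing the user ID keeps each user on the same variant across requests.
    """
    if not 0.0 <= candidate_share <= 1.0:
        raise ValueError("candidate_share must be between 0 and 1")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if bucket < candidate_share else "default"


class ABStats:
    """Counts correct and total predictions per variant."""

    def __init__(self) -> None:
        self._correct: Dict[str, int] = {}
        self._total: Dict[str, int] = {}

    def record(self, variant: str, correct: bool) -> None:
        self._total[variant] = self._total.get(variant, 0) + 1
        if correct:
            self._correct[variant] = self._correct.get(variant, 0) + 1

    def accuracy(self, variant: str) -> Optional[float]:
        """Return the fraction of correct predictions, or None if nothing was recorded."""
        total = self._total.get(variant, 0)
        if total == 0:
            return None
        return self._correct.get(variant, 0) / total
```

Once ground truth arrives for a request, calling `record` with the variant that served it lets you compare `accuracy("candidate")` against `accuracy("default")` before promoting the new model.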

Deploy everywhere

In most cases it’s best to build and train in the cloud and be able to serve wherever you need to: in a vendor’s cloud, across multiple clouds, on-premises, in hybrid clouds, or at the edge. The AI serving engine should be platform agnostic, based on open source technology, and have a well-known deployment model that can run on CPUs, state-of-the-art GPUs, high- engines, and even Raspberry Pi devices.