When I first dipped my toes into the world of generative AI, I wasn’t entirely sure what to expect. I figured it would be like any other tech endeavour—learn the ropes, build the thing, and voila! But, I soon discovered, there’s a bit more to it, especially when you’re trying to bring a generative AI model into production.
When it comes to deploying generative AI in a production environment, one of the most critical decisions you’ll face is selecting the right Large Language Model (LLM) for your specific use case. This task isn’t as straightforward as picking the latest or most popular model—it requires a deep understanding of the nuances that each model brings to the table, as well as how those nuances align with the unique needs of your application.
I recently attended a one-day AWS conference in Auckland, which offered invaluable insights into Foundation Model Operations (FMOps) and introduced the concept of an AI Gateway. The sessions got me thinking more critically about the process of selecting an LLM and how we can make this process more robust and tailored to our needs.
The ever-changing field of generative AI and the need for stability
Generative AI is unlike many other fields in technology because of its rapid pace of innovation and change. Models and best practices that are cutting-edge today might be outdated in just a few months. This presents a unique challenge for organisations trying to deploy stable and reliable AI solutions.
Given this ever-changing landscape, it’s crucial to establish a layer of stability when bringing generative AI into production. This is where the concept of an AI Gateway becomes particularly valuable. An AI Gateway is essentially a design pattern—a series of software components that work together to create a unified, centralised interface. This interface allows your application to interact with various generative AI models without getting bogged down by the intricacies of each model. Think of it as an API Gateway for AI, providing a consistent endpoint for your applications while abstracting away the complexity of the underlying models.
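To make the pattern concrete, here's a minimal Python sketch of what such a gateway might look like. All the names here (`ModelBackend`, `AIGateway`, `EchoBackend`) are hypothetical; a real implementation would wrap actual model endpoints behind the same interface.

```python
from abc import ABC, abstractmethod


class ModelBackend(ABC):
    """One concrete generative AI model (e.g. a hosted foundation model)."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class EchoBackend(ModelBackend):
    """Stand-in backend so this sketch runs without any real model."""

    def generate(self, prompt: str) -> str:
        return f"[model output for: {prompt}]"


class AIGateway:
    """A single, stable entry point that hides which model sits behind it."""

    def __init__(self, backends: dict[str, ModelBackend], default: str):
        self._backends = backends
        self._default = default

    def complete(self, prompt: str, model: str | None = None) -> str:
        # Applications only ever see this method; swapping or upgrading the
        # underlying model is a change to the registry, not to callers.
        backend = self._backends[model or self._default]
        return backend.generate(prompt)


gateway = AIGateway({"v1": EchoBackend()}, default="v1")
print(gateway.complete("How do I reset my password?"))
```

Because callers depend only on `complete()`, replacing `EchoBackend` with a real model, or adding a "v2" entry, requires no change to application code.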
Why you need an AI Gateway
The AI Gateway offers several advantages that can significantly streamline your AI deployment process:
- Consistency across model iterations: As your underlying models evolve—whether through updates, fine-tuning, or even complete replacement—the AI Gateway ensures that your application continues to interact with them in a consistent manner. This decoupling of application logic from model specifics can save countless hours of redevelopment and testing.
- Enhanced security and privacy: By centralising access through the Gateway, you can enforce strict access controls, ensuring that only authorised users and applications can interact with the models. This is crucial in environments where data privacy and security are paramount.
- Comprehensive usage management: The AI Gateway allows you to implement quotas, rate limits, and other usage controls, ensuring that your AI resources are used efficiently and fairly. This can be particularly important in multi-tenant environments or when managing limited computational resources.
- Centralised prompt management: One of the unique challenges in generative AI is managing the prompts that drive model interactions. The AI Gateway can help by centralising prompt management, reducing the risk of unexpected model behaviour (like hallucinations) by controlling and testing the prompts that are allowed (see the sketch after this list).
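As a rough illustration of the last two points, here's how a gateway might combine an approved prompt library with per-tenant rate limiting. The `PROMPT_LIBRARY` contents, limits, and function names are invented for this example.

```python
import time
from collections import defaultdict, deque

# Allow-listed prompt templates: only prompts registered here reach a model.
PROMPT_LIBRARY = {
    "summarise_ticket": "Summarise this support ticket in two sentences:\n{ticket}",
}

# Per-tenant request timestamps, used to enforce quotas.
_request_log: dict[str, deque] = defaultdict(deque)


def render_prompt(name: str, **kwargs) -> str:
    """Refuse any prompt that hasn't been reviewed into the library."""
    if name not in PROMPT_LIBRARY:
        raise KeyError(f"Prompt '{name}' is not in the approved library")
    return PROMPT_LIBRARY[name].format(**kwargs)


def check_rate_limit(tenant: str, limit: int = 60, window_s: int = 60) -> None:
    """Reject the call if this tenant exceeded `limit` requests in the window."""
    now = time.monotonic()
    log = _request_log[tenant]
    while log and now - log[0] > window_s:
        log.popleft()
    if len(log) >= limit:
        raise RuntimeError(f"Rate limit exceeded for tenant '{tenant}'")
    log.append(now)
```

In a real gateway these checks would sit in front of every model call, so prompt hygiene and fair usage are enforced in one place rather than in each application.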
Decoupling model management from application logic
One of the most significant benefits of using an AI Gateway is the ability to decouple model management from the application logic. This separation allows for more modular development and testing processes. For example, you can test your models independently of your application, ensuring that each component functions as expected before integrating them. This modular approach mirrors traditional software engineering practices, such as unit testing, where individual components are tested in isolation to ensure their reliability.
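Here's a small sketch of what that isolation buys you: by substituting a fake model behind the same interface, the application logic can be unit tested without ever calling a live model. The names and intent labels are invented for illustration.

```python
class FakeModel:
    """A fake backend lets application logic be tested without a live model."""

    def generate(self, prompt: str) -> str:
        return "RESET_PASSWORD"


def route_intent(model, user_message: str) -> str:
    """Application logic under test: classify the message, then route it."""
    intent = model.generate(f"Classify this message: {user_message}")
    return {"RESET_PASSWORD": "auth-team"}.get(intent, "general-queue")


def test_route_intent_sends_password_requests_to_auth_team():
    assert route_intent(FakeModel(), "I forgot my password") == "auth-team"


test_route_intent_sends_password_requests_to_auth_team()
print("ok")
```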
New personas in the generative AI ecosystem
As generative AI continues to evolve, it’s bringing new personas into the development and deployment processes—personas that aren’t typically found in standard DevOps setups. Let me break it down for you.
In this generative AI ecosystem, we have the ML providers, fine-tuners, and consumers:
- ML providers: These are the folks deep in the trenches, building and training foundation models on massive datasets. They’re the ones who live and breathe machine learning and natural language processing.
- Fine-tuners: These people take those foundation models and adapt them to specific use cases, fine-tuning the model to make it more relevant and effective for particular applications.
- Consumers: These are the users or applications interacting with the Gen AI platform. They don’t need to be ML experts, but a little knowledge of prompt engineering can go a long way in optimising their interactions with the model.
Understanding these roles is crucial when deploying generative AI because it highlights the complexity and specialisation required to bring these models into production.
AWS’s proposed AI Gateway implementation
During the AWS conference, a proposed implementation of an AI Gateway was discussed. Although still in its conceptual stages, this implementation provides a useful framework for understanding how an AI Gateway might be constructed.
AWS’s approach divides the Gateway into two main sections: the Gateway itself and a model abstraction layer. The model abstraction layer serves as an intermediary between model consumers (applications or users) and model managers (those responsible for testing, training, and deploying models).
In this setup, model managers define specific models, set policies, and manage access controls through a centralised database. Model consumers then interact with these models via the Gateway, which handles authentication, caching, and other essential services such as monitoring and logging.
Source: https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/09/18/ML-14882-Gen-AI-Gateway-Gen-AI-Gateway.png
This architecture not only simplifies model management but also provides a clear pathway for scaling AI deployments across large organisations. By abstracting away the complexities of model interactions, the AI Gateway allows different teams—whether they are data scientists, developers, or business analysts—to focus on their respective roles without getting bogged down by the intricacies of AI model management.
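As a rough sketch of the model abstraction layer's role, the snippet below stands an in-memory dictionary in for the centralised policy database and shows the gateway resolving a model alias against an access policy. The model ID, roles, and structure are illustrative assumptions, not a real AWS API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelPolicy:
    """What a model manager registers centrally: the model and who may use it."""
    model_id: str
    allowed_roles: set = field(default_factory=set)
    max_tokens: int = 512


# Stand-in for the centralised policy database described above.
MODEL_REGISTRY = {
    "support-bot": ModelPolicy("example-provider.model-v2", {"support", "admin"}),
}


def resolve_model(alias: str, caller_role: str) -> ModelPolicy:
    """Gateway-side lookup: authenticate the caller against the policy."""
    policy = MODEL_REGISTRY[alias]
    if caller_role not in policy.allowed_roles:
        raise PermissionError(f"Role '{caller_role}' may not use '{alias}'")
    return policy


print(resolve_model("support-bot", "support").model_id)
```

The key property is that consumers only ever know the alias ("support-bot"); which concrete model backs it is a model-manager decision that can change without consumers noticing.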
FMOps: Choosing the right foundation model
With over 15,000 foundation models available today, selecting the right one can feel like searching for a needle in a haystack. However, making the right choice is crucial for the success of your generative AI application. Here’s what I learnt about how to guide this process.
Step 1: Build a problem catalogue
Before diving into the models themselves, the first step is to clearly define the problem you’re trying to solve. This involves creating a problem catalogue—a list of key questions or tasks that your model needs to handle. For instance, if your application is designed to provide customer support, your problem catalogue might include questions like “How do I reset my password?” or “What are the shipping options available?”
The goal here is to outline the specific outcomes you expect from the model. This catalogue will serve as the benchmark against which you’ll evaluate different models.
Source: https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/08/23/ML_14962_img_012_v2-1024×373.png
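In code, a problem catalogue can be as simple as a list of questions paired with the points a good answer must cover. The entries below are illustrative:

```python
# One entry per task the model must handle; the expected_points act as the
# rubric a reviewer (human or model) scores responses against.
problem_catalogue = [
    {
        "id": "PC-001",
        "question": "How do I reset my password?",
        "expected_points": ["link to reset page", "mention of email verification"],
    },
    {
        "id": "PC-002",
        "question": "What are the shipping options available?",
        "expected_points": ["standard shipping", "express shipping", "delivery times"],
    },
]
```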
Step 2: Test your problem catalogue against various models
Once your problem catalogue is ready, the next step is to test it against different foundation models. There are two primary methods you can use:
- Human in the loop: In this method, a human evaluator reviews the model’s responses to the questions in your problem catalogue. While this approach can be resource-intensive and time-consuming, it provides a nuanced understanding of the model’s capabilities and limitations.
- Automated scoring: Alternatively, you can use another foundation model to score the responses. This approach is faster and can handle large-scale evaluations, but it may lack the depth of insight provided by human evaluators.
Regardless of the method you choose, the outcome should be a detailed score sheet that ranks the models based on how well they meet your criteria.
Source: https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/08/23/ML_14962_img_014_v2-1024×448.png
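Here's a sketch of the automated-scoring approach, using one model as a judge that rates each candidate's answers against the catalogue rubric. Both models are stubbed with plain callables; in practice they would be real endpoints behind your gateway, and the 0-10 scale is just one possible rubric.

```python
def score_models(catalogue, candidates, judge):
    """Run every catalogue question through every candidate and let a judge
    model rate each answer from 0-10, producing a per-model score sheet."""
    sheet = {}
    for name, model in candidates.items():
        scores = []
        for item in catalogue:
            answer = model(item["question"])
            verdict = judge(
                f"Question: {item['question']}\n"
                f"Answer: {answer}\n"
                f"Rubric: {item['expected_points']}\n"
                "Score the answer 0-10. Reply with the number only."
            )
            scores.append(float(verdict))
        sheet[name] = sum(scores) / len(scores)
    return sheet


# Stub callables stand in for real model endpoints.
candidates = {"model-a": lambda q: "Use the reset link emailed to you."}
judge = lambda prompt: "8"

print(score_models(
    [{"question": "How do I reset my password?", "expected_points": ["reset link"]}],
    candidates,
    judge,
))
```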
Step 3: Evaluate the score sheet based on your priorities
With the score sheet in hand, it’s time to evaluate the models based on your specific priorities. These priorities typically revolve around three key factors:
- Cost: How much are you willing to spend on model inference? Some models, particularly those that are highly specialised or proprietary, can be expensive to run.
- Speed: How fast does the model need to be? In some applications, response time is critical, while in others, you might be able to tolerate longer processing times.
- Accuracy: How precise do the model’s responses need to be? In applications where incorrect answers can have serious consequences (such as in healthcare), accuracy will likely be your top priority.
Source: https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/08/17/ML_14962_img_015-1024×481.png
For example, consider a team that prioritises cost above all else. They are willing to sacrifice some speed and accuracy to stay within budget. After running their problem catalogue through several models, they settle on one that offers significant cost savings while maintaining acceptable levels of accuracy and speed.
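That trade-off is easy to express as a weighted score over the sheet. The numbers and weights below are invented to mirror the cost-first example:

```python
# Illustrative score sheet (0-10 per dimension) and priority weights for a
# cost-first team; the weights are the only thing you tune per use case.
score_sheet = {
    "model-a": {"cost": 9, "speed": 5, "accuracy": 6},
    "model-b": {"cost": 4, "speed": 8, "accuracy": 9},
}
weights = {"cost": 0.6, "speed": 0.2, "accuracy": 0.2}


def weighted_score(scores: dict, weights: dict) -> float:
    return sum(scores[k] * w for k, w in weights.items())


best = max(score_sheet, key=lambda m: weighted_score(score_sheet[m], weights))
print(best)  # -> model-a: the cheapest option wins under these weights
```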
The role of testing in production
Even after selecting the right model, the journey isn’t over. Deploying a generative AI model in production requires a robust testing framework to ensure it continues to perform as expected. During the AWS conference, a speaker emphasised the importance of combining an AI Gateway with a prompt library as a first layer of testing.
By regularly running predefined prompts through the model and evaluating the responses, you can monitor the model’s performance and catch any deviations before they impact your users. This ongoing testing process is crucial for maintaining the quality and reliability of your generative AI application.
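A first layer of that testing can be as simple as replaying the approved prompt library against the gateway and asserting some expectations about the responses. This sketch uses invented prompts and a stub endpoint:

```python
# Scheduled regression run: replay the approved prompt library through the
# live gateway and flag any response drifting from expectations.
REGRESSION_PROMPTS = [
    {"prompt": "How do I reset my password?", "must_contain": "reset"},
]


def run_regression(call_model) -> list[str]:
    failures = []
    for case in REGRESSION_PROMPTS:
        response = call_model(case["prompt"])
        if case["must_contain"].lower() not in response.lower():
            failures.append(case["prompt"])
    return failures


# Stub endpoint; in production this would be the AI Gateway itself.
failures = run_regression(lambda p: "Click the reset link we email you.")
print("all prompts passed" if not failures else f"drifted: {failures}")
```

Substring checks are deliberately crude; you could swap in the automated judge from earlier for a richer signal, but even this level of checking catches obvious regressions after a model update.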
I also had a side conversation with a CEO who’s using generative AI to analyse open-ended survey responses. He shared a valuable insight: Keep each interaction with the model as simple as possible. This approach is akin to the single responsibility principle in software engineering, where each function or component should do one thing well. By breaking down complex tasks into smaller, more manageable interactions, you can reduce the risk of unexpected results and maintain better control over the model’s behaviour.
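To illustrate, rather than asking one model call to do sentiment, themes, and a summary at once, each concern gets its own narrowly scoped prompt. The function names and prompts here are hypothetical:

```python
def classify_sentiment(call_model, response_text: str) -> str:
    """One job only: label the sentiment of a single survey response."""
    return call_model(
        "Label the sentiment of this survey response as positive, "
        f"negative or neutral. Reply with one word.\n\n{response_text}"
    )


def extract_themes(call_model, response_text: str) -> str:
    """One job only: pull out the themes, nothing else."""
    return call_model(
        f"List the main themes in this survey response, one per line.\n\n"
        f"{response_text}"
    )


# Each call does one thing, so each can be tested and monitored on its own.
stub = lambda p: "positive"
print(classify_sentiment(stub, "Loved the new dashboard, very easy to use."))
```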
Bringing it all together: Your path to the right LLM
Choosing the right LLM for your use case isn’t just about following trends or picking the most powerful model available. It’s about understanding your specific needs, building a solid problem catalogue, rigorously testing potential models, and continually monitoring their performance in production.
By leveraging tools like an AI Gateway and implementing a comprehensive testing framework, you can ensure that your generative AI application not only meets your current needs but is also adaptable enough to handle future challenges.