All of us have been inundated with LinkedIn posts and news articles about the power of Large Language Models (LLMs) such as ChatGPT, Bard, or LLaMA. LLMs are an immensely useful type of AI model that extracts information from large volumes of text and can reproduce that information in language that mimics human writing.
They are useful for chatbots, virtual assistants, language translation, text summarization, code generation, sentiment analysis, and more. It is immediately apparent that such tools are powerful, but it isn't always clear how to harness this innovative technology within our organizations.
Here are 5 steps to build an LLM product:
A good use case fits the needs of the company. It should be a healthy balance of the following factors:
Measurable Value: Make the benefit of the use case quantifiable so it is easier for others to appreciate the magnitude of the project's contribution and to assuage concerns around AI
Feasibility: Sufficient historical data for supervised training; resources and infrastructure to fine-tune, deploy, and monitor an LLM; domain experts to provide reinforcement learning from human feedback (RLHF); and an achievable, focused scope
Risk: Though working with sensitive data is entirely possible, an optimal initial use case avoids data such as PII, PCI, PHI, financial information, etc., and allows the model deployment to be scaled incrementally
Support: Executive stakeholders provide necessary resources, credibility, and strategic alignment with broader initiatives, and have the authority to resolve challenges as they arise.
Example Use Case:
I work for a retailer with hundreds of stores nationwide. We pay US$3m annually for customer service agents to assist customers on the phone. Internal research has shown that 67% of questions have simple answers that are already on the company website and that 50% of customers would prefer to use a chat function over waiting for a representative on the phone.
There is currently a chat function, and we have saved its logs for the past 5 years, but the implementation has a few flaws and was never broadly adopted. I pitched this LLM-based chatbot to the CDAO, and she agreed it was a fantastic idea and offered a few customer service experts to help with development.
Let’s break this down:
Measurable Value: An LLM solution would reduce the need for customer service agents, leading to roughly US$1m in savings annually (US$3m × 67% × 50% ≈ US$1m)
Feasibility: There are 5 years of (imperfect) chat history to begin training the LLM and multiple domain experts (customer service agents) to provide RLHF
Risk: The LLM uses no sensitive data and can be deployed incrementally
Support: The CDAO agrees that the solution is aligned with the overall goals of the organization and supports the initiative
The POC proves that an LLM can functionally meet the acceptance criteria for the use case on a focused scope. This step should be implemented as quickly as is practical with a focus on demonstrating the technical capability of LLMs. It should not be an initial attempt at building the product infrastructure.
Currently, the fastest way for most companies to build a POC is by using prompt engineering on GPT-4 or a similarly high-quality publicly available LLM such as Bard. As LLMs rapidly become commonplace, it is also worth checking whether any have already been fine-tuned for your domain.
Third-party LLMs can be prohibitively expensive to use at scale, so many organizations will elect to fine-tune a model of their own. However, if the product only needs to be used sparingly and has relaxed latency requirements, or if your organization's data science skills are still developing, it may make more sense to build the product around one of these third-party LLMs. Fine-tuned models also come with their own costs from training and maintenance.
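To make this concrete, here is a minimal sketch of what a prompt-engineering POC call might look like, assuming the OpenAI Python client (v1-style API); the system prompt, model choice, and the idea of passing pricing rules in as context are illustrative assumptions, not a prescribed implementation.

```python
# Minimal prompt-engineering POC sketch (illustrative only).
# Assumes the OpenAI Python client (v1 API); the pricing_context string is a
# hypothetical input holding the discount/sale rules relevant to the order.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a customer service assistant for a retailer. "
    "Answer only questions about order totals, using the pricing rules provided. "
    "If the question is out of scope, say so politely."
)

def answer_order_total_question(question: str, pricing_context: str) -> str:
    """Send the customer's question plus the relevant pricing rules to the LLM."""
    response = client.chat.completions.create(
        model="gpt-4",   # or another high-quality hosted model
        temperature=0,   # keep pricing answers as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Pricing rules:\n{pricing_context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```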
Example Use Case
We have defined the scope for the POC to only include customer questions about the order total (the total cost to the customer for the order). Order total is challenging to calculate because we offer numerous discounts and sales that impact the price (e.g., holiday sales, flash sales, bundle discounts, etc.).
The acceptance criterion is communicating the order total to the customer with 99% accuracy. In reality, this POC would likely need a retrieval system to pull the relevant product description pages, because sending the information for every product would be too large a context for a single request to the LLM.
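As a rough sketch of what that retrieval step could look like, the following uses an assumed sentence-transformers embedding model and plain cosine similarity over product description pages; the model name, example pages, and top_k value are placeholders for illustration only.

```python
# Sketch of a simple retrieval step for the POC (illustrative assumptions only):
# embed product description pages once, then pull the pages most similar to a
# customer question before sending them to the LLM as context.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

product_pages = [
    "Holiday sale: 20% off all outdoor furniture through December 31.",
    "Bundle discount: buy a grill and cover together and save US$15.",
    # ... remaining product description / pricing pages
]
page_embeddings = embedder.encode(product_pages, normalize_embeddings=True)

def retrieve_relevant_pages(question: str, top_k: int = 3) -> list[str]:
    """Return the top_k pages most similar to the customer's question."""
    query = embedder.encode([question], normalize_embeddings=True)[0]
    scores = page_embeddings @ query  # cosine similarity (embeddings are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [product_pages[i] for i in best]
```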
Keep in mind that it is perfectly fine that there are more efficient and elegant ways to solve the order total problem; the point of the POC is to demonstrate an LLM's capability before investing significant effort.
This is the step where the team may choose to fine-tune an open-source model.
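If the team does go this route, a fine-tune of an open-source model might look roughly like the sketch below, assuming Hugging Face transformers/datasets and chat logs already cleaned into a single text field per example; the base model, hyperparameters, and file path are placeholders rather than recommendations.

```python
# Rough sketch of fine-tuning an open-source LLM on cleaned chat logs.
# Model name, hyperparameters, and data paths are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "facebook/opt-350m"  # placeholder; substitute the open-source model of your choice
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Chat logs cleaned into one text field per example, e.g. "Customer: ...\nAgent: ..."
dataset = load_dataset("json", data_files="chat_logs_cleaned.jsonl")["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="order-total-llm", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```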
A Minimum Viable Product (MVP) contains the fewest essential features needed to meet acceptance criteria for the product. A Minimum Lovable Product (MLP) goes a step further and incorporates enough design elements to fairly gauge user adoption.
Building the MLP instead of the MVP ensures that the product being built is what customers want. If there is low engagement, the team will research which user needs are not being met. They may ask themselves the following questions:
Do customers know about and have access to the MLP?
Does the scope of the MLP need to be expanded to address customer needs?
Is there a design flaw in the UX?
Agree on success criteria before deploying the MLP. It is tempting to move the goalposts post-deployment because the team wants the product to succeed. If the success criteria are not met after a second major iteration of the MLP, the team should seriously consider whether continuing to iterate is still the best use of time. It is often not.
If the success criteria for the MLP are met, it is time to mature the MLP into a robust product.
Example Use Case
The MLP has an easy-to-use graphical user interface (GUI) and a pleasing UX, so we are confident it will represent the product well. Our company has long wait times, so while customers were on hold waiting to speak with a customer service agent, we informed them that order total questions could be answered more quickly through the chat function on the website.
Our success criterion was a 30-second reduction in average call time over the following month, and we exceeded it with a 60-second reduction in average call time.
The preceding steps prove that LLMs deliver measurable value for their use case within a narrow scope. Now it is time to broaden that scope within the use case and iteratively refine the MLP to derive additional value.
The MLP was designed to be just enough to know whether a customer will love the product. Therefore, there are technical areas that need to be matured as well. These might include increasing scalability, load testing, performance tuning, augmenting design practices, implementing redundancies, etc.
Additionally, it may be wise to create and implement a Responsible AI Framework to ensure that AI products are developed, deployed, and used ethically.
Example Use Case
The MLP successfully answered questions regarding order total but that is only one of many types of questions customers ask. The next iterations of the MLP extended to answer questions that compared different items, found goods that satisfied a specific customer need, and more. Once the infrastructure, GUI, and tool for RLHF were created, iteratively maturing the MLP into a robust product was a significantly easier process.
It is a fantastic success that the robust product was deployed and is finally beginning to deliver the value anticipated at the beginning of the project. This accomplishment should be celebrated and shared with senior leadership but that is not the end of the project.
Just as machine learning models require monitoring and maintenance, LLMs need to be monitored and maintained too.
The following is a sample of the needs to consider when monitoring and maintaining LLMs:
Monitor Inappropriate Content: All forms of discrimination, foul language, hate speech, sexual language, insults, violence, etc.
Monitor Semantic Similarity: LLM responses should provide similar results over time (a sketch of this check appears after this list)
Monitor Bias: Validate that gender, age, sexual orientation, etc. does not impact model output unless contextually relevant
Monitor Response Time: As more users adopt the product, model response time may increase
Maintain Latest Information: The LLM will need to be updated as the data it was trained on becomes outdated
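As one illustration, the sketch below checks semantic similarity by comparing current chatbot answers to previously approved reference answers for a fixed set of probe questions; the embedding model and drift threshold are assumptions, and the same pattern can be adapted to the other monitoring needs.

```python
# Sketch of a semantic-similarity drift check (illustrative assumptions only):
# compare today's chatbot answers against previously approved reference answers
# for a fixed set of probe questions and flag any that drift too far.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
DRIFT_THRESHOLD = 0.85  # assumed minimum acceptable cosine similarity

def check_semantic_drift(reference_answers: dict[str, str],
                         current_answers: dict[str, str]) -> list[str]:
    """Return the probe questions whose answers drifted below the threshold."""
    drifted = []
    for question, reference in reference_answers.items():
        current = current_answers[question]
        score = util.cos_sim(embedder.encode(reference),
                             embedder.encode(current)).item()
        if score < DRIFT_THRESHOLD:
            drifted.append(question)
    return drifted
```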
Example Use Case
The deployed chatbot has been a huge success and reduced the number of required customer service agents, which will lead to the anticipated US$1m in annual savings. The CDAO was elated and sent out an email to announce the accomplishment.
The preceding list of what to monitor reflects what we implemented, but the details of maintenance could be expanded. Pricing changes and discounts/sales are implemented on the first day of every month, but those changes are known two months in advance.
To be prepared, we developed an auto-training system that kicks off two months before a new model is needed. The newly trained models, which incorporate the upcoming sales/discounts, are moved into production as soon as the pricing changes take effect. We have found that the two-month window provides enough time to re-train the model and run the subsequent automated testing.
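As a rough sketch of that scheduling logic (the function name and dates are illustrative assumptions), the kickoff date simply works back two calendar months from the first of the month on which the new pricing takes effect.

```python
# Sketch of the retraining kickoff schedule (illustrative assumptions only):
# pricing changes land on the first of each month and are known two months
# ahead, so retraining for a given effective date starts two months earlier.
from datetime import date

def retraining_kickoff(effective_date: date) -> date:
    """Return the date on which auto-training should start for a pricing change."""
    # Work back two calendar months from the pricing effective date.
    month = effective_date.month - 2
    year = effective_date.year
    if month < 1:
        month += 12
        year -= 1
    return date(year, month, 1)

# Example: pricing that takes effect 2024-07-01 starts retraining on 2024-05-01.
assert retraining_kickoff(date(2024, 7, 1)) == date(2024, 5, 1)
```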
About the author:
Dr. Zachary Elewitz is the Director of Data Science at Wex. He leads data science and AI initiatives that enhance business processes and provide white-glove services to customers. He concurrently serves as the Fractional-CDAO for Mystery on Main Street.
Elewitz holds two AI-related patents, sits on Texas A&M Commerce’s Venture College Board, and serves on several AI-related advisory boards including the National Institute of Standards and Technology Generative AI Public Working Group.