March 13, 2024

Navigating the Generative AI Landscape with Auxiliary LLMs

At HTCD, our foray into generative AI began by harnessing the power of GPT-4 for quick and efficient Proof of Concept (POC) developments, a strategy common among forward-thinking startups. This initial step was crucial for validating our innovative ideas and showcasing the potential of our solutions. Yet, as we delved deeper, the challenges of relying solely on GPT-4 — such as cost, scalability, and reliability — became evident, leading us to explore the potential of Auxiliary Large Language Models (LLMs) to overcome these hurdles.

Challenges of a GPT-only Approach

We quickly recognized that sole dependence on GPT-4 was unsustainable. Despite its impressive capabilities, GPT-4's limitations, such as an error rate of roughly 30% on our workloads and significant cost implications, prompted us to seek a more versatile and sustainable approach.

Embracing Auxiliary LLMs

Our exploration uncovered a vital insight: the strengths of different LLMs can be leveraged for specific tasks, in combinations of primary and auxiliary models. We define Auxiliary LLMs as a collection of smaller language models (SLMs) that, as components of an LLM system design, augment the capabilities of a stronger primary LLM.

By integrating Auxiliary LLMs, we were able to amplify GPT-4’s capabilities, assigning it to tasks that align with its strengths, while delegating others to more specialized models, such as Mistral, known for its exceptional reasoning abilities.

Optimizing with Leading Software Design Practices

We implemented leading practices in software design, assigning specific, well-defined roles to each model or function. This strategy not only boosted our solutions’ efficiency and efficacy but also significantly reduced our reliance on GPT-4, addressing cost and scalability challenges.

Let’s understand this using a sample case study where the problem statement is, “An organization has a collection of unstructured data that has financial information embedded in it in various forms like text, numbers, and even numbers in different currencies. The organization started using GPT-4 for the complete process, and soon it started facing hallucinations as well as incorrect data formats.”

High-level Architecture with and without Auxiliary LLMs

A detailed description of the architecture:

Previous Architecture

The initial setup featured a straightforward process:

  1. Input: Collection of documents
  2. Prompt: GPT-4 is directed to perform three tasks: extract financial figures from the documents (as text, numbers, or both), convert them into United States Dollars (USD), and use the USD figures to create a financial report for a given timestamp
  3. Model: GPT-4
  4. Output: GPT-4 generated financial report

This naïve approach was neither cost-effective nor reliable, and it suffered from higher latency.
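For illustration, a minimal sketch of what this single-prompt flow might look like; the model name, prompt wording, and document handling here are simplified placeholders, not our production code:

```python
# Minimal sketch of the monolithic approach: one prompt asks GPT-4 to
# extract, convert, and report in a single shot.
from openai import OpenAI

client = OpenAI()

def generate_report_monolithic(documents: list[str], timestamp: str) -> str:
    prompt = (
        "From the documents below, extract every financial figure "
        "(written as text or numbers, in any currency), convert each one "
        f"to USD, and produce a financial report for {timestamp}.\n\n"
        + "\n\n".join(documents)
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```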

Present Architecture

The revamped architecture introduces Auxiliary LLMs and specific functions to address challenges:

  1. Input: Collection of documents
  2. Auxiliary Model: Aux LLM1 extracts numbers from documents. This aligns with the model’s capabilities, ensuring higher accuracy and efficiency.
  3. Functions: A simple Python function converts the extracted numbers into USD. This deterministic computational step ensures accuracy in currency conversion, a task not well suited to LLMs (a simplified sketch of the full pipeline follows this list).
  4. Prompt: GPT-4 is given the pre-extracted, converted numbers along with the additional context it requires
  5. Model: GPT-4 for report generation
  6. Output: GPT-4-generated financial report with increased reliability, thanks to the preprocessed, context-enriched input
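To make the three stages concrete, here is a simplified sketch. The `call_aux_llm` helper, the module layout, and the exchange-rate table are illustrative placeholders rather than our production implementation; in practice the auxiliary model is Mistral Medium and exchange rates come from a live source.

```python
# pipeline.py - illustrative three-stage pipeline: auxiliary LLM extraction,
# deterministic Python conversion, GPT-4 synthesis.
import json
from openai import OpenAI

client = OpenAI()

# Placeholder exchange rates (illustrative values only).
USD_RATES = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27, "INR": 0.012}

def call_aux_llm(prompt: str) -> str:
    """Stage 1 transport: call the auxiliary model (e.g. Mistral Medium).
    Left as a stub here; wire it to your provider's client."""
    raise NotImplementedError

def extract_financials(document: str) -> list[dict]:
    """Stage 1: ask the auxiliary LLM for a JSON list of {amount, currency}."""
    raw = call_aux_llm(
        "Extract every financial figure from the text below and return a JSON "
        'list of objects with "amount" (number) and "currency" (ISO code).\n\n'
        + document
    )
    return json.loads(raw)

def convert_to_usd(records: list[dict]) -> list[float]:
    """Stage 2: deterministic conversion to USD; no LLM involved."""
    return [r["amount"] * USD_RATES[r["currency"]] for r in records]

def generate_report(usd_figures: list[float], timestamp: str) -> str:
    """Stage 3: GPT-4 synthesizes the report from pre-processed numbers."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"Using these USD figures {usd_figures}, write a financial "
                f"report for {timestamp}."
            ),
        }],
    )
    return response.choices[0].message.content
```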

From PoC to Production — HTCD’s Insights

  1. The process becomes more reliable because each component is optimized for its specific task: auxiliary LLMs for extraction, deterministic Python functions for conversion and other computations that LLMs handle poorly, and GPT-4 for synthesis. This reduces the error rate and improves the overall accuracy of the output.
  2. This architecture follows the software design principle of separation of concerns, making it easier to maintain and update. Each module or function can be independently updated or replaced without affecting the others.
  3. If we want to use our own self-hosted LLM or a newer LLM in the future, the loosely coupled architecture lets us replace components one by one rather than swapping the base LLM in one shot, which would considerably increase the risk of failure and downtime. With auxiliary LLMs we also have the flexibility to move to smaller models first, giving us room to A/B test them. The biggest advantage, though, is how quickly newly released LLMs can be integrated.
  4. In the previous architecture, changing a single word in the prompt could affect the output of the whole system in ways that are hard to predict, forcing extensive testing even for cases whose prompts were untouched. The current architecture lets you change only the LLMs you need without affecting the prompts of the others, so an engineer can write unit tests focused on the function of each specific auxiliary LLM (an example follows this list).
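As an example of the last point, here is what such a narrowly scoped test might look like, using the illustrative `pipeline` module sketched above; the stubbed response and expected values are made up for the example.

```python
# test_pipeline.py - pytest-style test scoped to the extraction stage only.
import json
from unittest.mock import patch

import pipeline  # the illustrative module sketched above

def test_extract_financials_parses_aux_llm_output():
    # Stub the auxiliary LLM so the test is deterministic and runs offline.
    fake_response = json.dumps([{"amount": 1200.0, "currency": "EUR"}])
    with patch.object(pipeline, "call_aux_llm", return_value=fake_response):
        records = pipeline.extract_financials("Invoice total: EUR 1,200")
    assert records == [{"amount": 1200.0, "currency": "EUR"}]
```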

Why Mistral?

As a company, we've closely followed Mistral over the past few months and selected it for its impressive reasoning capabilities, its complementarity to GPT-4, and its cost efficiency. Its integration into our production environment underscores our commitment to leveraging the best available technologies. Mistral's subsequent release of Mistral Large and its partnership with Microsoft's Azure AI Studio are a welcome bonus.

Cost Implications of Dual LLM Utilization

This has been a constant question since we chose to adopt the auxiliary model approach to designing our system. Will using an additional LLM increase our costs? The answer is no — it decreased our costs.

Prior to the use of auxiliary models, our GPT-4 consumption was approximately 13,500 tokens per transaction (input and output combined). GPT-4 Turbo costs $30 per million output tokens and $10 per million input tokens; Mistral Medium costs $8.1 per million output tokens and $2.7 per million input tokens. The figures below represent an average of our consumption patterns and may vary based on your use case:

Initial Cost (GPT-4)

Input: $0.08 (8,000 tokens)

Output: $0.165 (5,500 tokens)

Total cost: $0.245

Current Cost (GPT-4 and Mistral Medium)

GPT-4

Input: $0.035 (3,500 tokens)

Output: $0.135 (4,500 tokens)

Auxiliary Models

Input: $0.0135 (5,000 tokens)

Output: $0.00405 (500 tokens)

Total cost: $0.1875

As evident, we obtain savings of roughly 23% per transaction, which is significant when serving thousands of customers and millions of LLM calls worldwide.
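For readers who want to reproduce the arithmetic, a quick back-of-the-envelope check using the token counts and per-million-token rates listed above:

```python
# Per-transaction cost comparison using the rates quoted above (USD per token).
GPT4_IN, GPT4_OUT = 10 / 1e6, 30 / 1e6
MISTRAL_IN, MISTRAL_OUT = 2.7 / 1e6, 8.1 / 1e6

initial = 8_000 * GPT4_IN + 5_500 * GPT4_OUT               # ~$0.245
current = (3_500 * GPT4_IN + 4_500 * GPT4_OUT               # GPT-4 share
           + 5_000 * MISTRAL_IN + 500 * MISTRAL_OUT)        # auxiliary share

print(f"initial ${initial:.4f}, current ${current:.4f}, "
      f"savings {(initial - current) / initial:.0%}")       # ~23%
```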

Conclusion

HTCD’s strategic incorporation of Auxiliary LLMs, guided by best software design practices, has successfully addressed the limitations of a single LLM system. Our architecture not only enhances solution reliability and efficiency but also showcases a cost-effective strategy for leveraging generative AI technologies. Our commitment extends beyond our own cost savings, aiming to reduce expenses for our customers.

Join our growing community! Follow us on LinkedIn, X, and Facebook for the latest news and updates from HTCD, and visit our website to find out how we can help you with your cloud security needs.

Subham Kundu

Principal AI Engineer
