Navigating the Real-World Challenges of Building a Retrieval Augmented Generative Chatbot

Yishai Rasowsky
9 min read · Apr 17, 2024


Lessons Learned from Deploying a Production-Ready Question Answering System

I recently successfully completed a year-long assignment developing a chatbot to answer questions based on the knowledge base and documentation of my client, one of the leading finance companies in the market.

Architecture

The basic design was the following. All of the company's documentation was converted into a vector database. When a user submits a query, that query is converted into a vector. Next, the most relevant, semantically similar portions of content are retrieved from the database. Finally, a chat response is formed that best answers the question posed, with the proviso that the answer must be grounded in the document sections matched as most similar.
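To make this concrete, here is a minimal sketch of that flow, assuming the OpenAI Python SDK and a simple in-memory index; the production system used LangChain and a dedicated vector database, but the shape is the same.

```python
# Minimal sketch of the retrieve-then-answer flow described above.
# Assumes the OpenAI Python SDK (openai>=1.0); documents are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index the documentation once.
docs = ["How to open an account ...", "Fee schedule ...", "Refund policy ..."]
doc_vectors = embed(docs)

def answer(query, k=3):
    # 2. Embed the query and retrieve the k most similar chunks.
    q = embed([query])[0]
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])
    # 3. Ask the model to answer *only* from the retrieved context.
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```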

Live and learn

Having designed and developed this bot from the ground up all the way to production, I am satisfied looking back at how many important lessons I learned. Join me as I share and explain many of the tips and tricks, and pitfalls, that I encountered along the way. Together we can learn how to implement superior apps even more efficiently in the future.

Practical Challenges in Deploying RAG-QA Systems


Importance of Robust Data Integration

I faced many challenges integrating diverse data sources. The various forms of documentation included SQL databases, Google Docs, and raw HTML web content. Part of my job was to organize and package all of that into a cohesive knowledge base for the RAG-QA system I was to build.
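As an illustration, here is a sketch of normalizing those three source types into one plain-text corpus. The table name, column names, and file paths are hypothetical, not the client's.

```python
# Hypothetical sketch: pull SQL rows, stripped HTML, and exported Google
# Docs text into a single list of documents for indexing.
import sqlite3
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from raw HTML."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def from_sql(db_path):
    rows = sqlite3.connect(db_path).execute("SELECT title, body FROM articles")
    return [f"{title}\n{body}" for title, body in rows]

def from_html(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    return [" ".join(parser.chunks)]

def from_export(path):
    # Google Docs exported to plain text before ingestion (sketch).
    with open(path, encoding="utf-8") as f:
        return [f.read()]

corpus = from_sql("kb.db") + from_html("<p>FAQ ...</p>") + from_export("guide.txt")
```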

Keeping up with the times

Aside from the routine steps you would expect at the beginning of any standard NLP pipeline (data collection, text cleaning, preprocessing such as tokenization), one difficulty I encountered regarding my client’s web content data was that it needed to be sorted.

Screening out outdated data

The reason was that much of it was outdated and would therefore not be an appropriate basis for question answering. This required teamwork and collaboration between myself and domain experts on staff. Together we would judge which content was invalid, and then selectively incorporate only the legitimate sources to be fed into the bot.

Data format and organization

As anybody who works in practical natural language processing knows, data quality, normalization, and curation all require patience and diligence; they are necessary steps to ensure that the retrieval component provides relevant and accurate information.

A key data-structuring decision: rather than keeping the original Google Docs format, I converted the articles and FAQ content into Google Sheets. This made it much easier for the bot to digest the knowledge and to identify which exact sub-portions of the data were relevant to answering the user's query.
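Here is an illustrative sketch of that flattening step, with invented FAQ content. The real pipeline wrote to Google Sheets; csv stands in here for simplicity.

```python
# Illustrative sketch: flatten FAQ-style documents into one row per
# question/answer pair so the retriever can match a precise sub-portion.
import csv

faq_text = """Q: How do I reset my password?
A: Use the "Forgot password" link on the login page.
Q: What are the trading fees?
A: See the fee schedule in your account settings."""

rows = []
for block in faq_text.split("Q:")[1:]:
    question, _, answer = block.partition("A:")
    rows.append({"question": question.strip(), "answer": answer.strip()})

with open("faq.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer"])
    writer.writeheader()
    writer.writerows(rows)
```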

Model Tuning and Performance Optimization

One valuable insight I gleaned from iterative trial and error with a range of large language models was that I did not necessarily need to engage in fine-tuning to achieve the desired response quality and latency. Often, one-shot or few-shot prompting was enough.
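For example, a few-shot prompt might look like the following sketch. The worked examples are invented placeholders, not the client's actual prompts.

```python
# A few-shot prompt in place of fine-tuning: a couple of worked examples
# in the prompt steer the model's tone and format.
FEW_SHOT = """You answer customer questions from the provided context.

Example 1
Context: Accounts can be funded by bank transfer within 2 business days.
Question: How long does funding take?
Answer: Bank transfers are credited within 2 business days.

Example 2
Context: (no relevant context found)
Question: What is the capital of France?
Answer: I don't have that information in our documentation.
"""

def build_prompt(context, question):
    return f"{FEW_SHOT}\nContext: {context}\nQuestion: {question}\nAnswer:"
```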

Juggling considerations

In the course of the project, I had to balance model complexity, inference speed, and resource constraints for a production-ready system. For example, at first I used LangChain for the whole stack, that is, both knowledge retrieval and QA inference. Then I encountered serious delays in response time. The bottleneck turned out to be the inference itself, driven by the prompt template submitted to OpenAI inside the LangChain infrastructure.

More than one way to skin an LLM

Consequently, I opted to use LangChain only for the initial retrieval of relevant sources from the knowledge base. I was also pleased to design my own homemade system for more efficiently incorporating the appropriate prompt into the API call to OpenAI, which cut the response time by a factor of four. Speaking of runtime, the final version responds in 5–10 seconds, and I believe there is still room to improve that.
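A rough sketch of that split, assuming recent langchain-community and langchain-openai packages, with a FAISS index standing in for the real vector database:

```python
# LangChain handles retrieval only; the completion call goes straight
# to OpenAI instead of through a LangChain chain.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from openai import OpenAI

vectorstore = FAISS.from_texts(["doc one ...", "doc two ..."], OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
client = OpenAI()

def answer(query):
    # Retrieval via LangChain.
    context = "\n\n".join(d.page_content for d in retriever.invoke(query))
    # Generation via a direct, hand-built API call.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the context given."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```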

Fortuitous errors: Unexpected discoveries of erroneous information in our documents

I encountered an unexpected and very valuable bonus when constructing this bot: it helped us reveal erroneous information in my client's knowledge base.

For example, I would ask the bot a question, and the reply would be based on a particular section of a web page or blog post. Later, domain experts checked the bot's response to see whether the information was reliable and the citation was accurate. Several times, the response was logically coherent but factually incorrect!

The reason, as we excitedly discovered in real time, was that there were mistakes in the actual blog post or web page. These errors in our articles and FAQ documentation needed to be corrected, and quickly, in order to ensure reliable responses from our bot in the future.

What a great asset it was to have a way of rooting out weeds, sifting erroneous facts and claims from our knowledge base, all by means of a convenient chatbot with which you can simply converse by asking questions.

Ensuring Transparency and User Trust

One should not be naive and believe that a chatbot will necessarily give you accurate answers. Even if the AI is supposed to have been trained on your knowledge base, it is unwise to rely on it sight unseen.

To the contrary, it is vital, especially in the initial stages (and I would argue throughout the life cycle of one's bot interface), to make the RAG-QA system transparent. This is done by constructing and instructing your bot, both manually and programmatically, to provide clear citations and explanations of the information used in its responses. I consider this perhaps the most significant lesson of the whole project.

This transparency helped build user trust and confidence in the system, especially for mission-critical applications. For example, the bot may give an answer, and the reply might appear logical and coherent. On the other hand, quite the opposite may be true: the bot's reply is indecipherable, or you cannot figure out where the bot got its information.

Out with black box, in with transparency

The reason is that the inference mechanism is just a black box as far as you're concerned. You don't yet have a good handle on how and why the response is being generated from the content in the knowledge base of your specific domain.

For this reason, I implemented a citation and explanation mechanism inside the bot: whenever a response was issued, the API I set up would include, as part of the response, the exact document, the exact portion of it, and the exact rationale for why and how the response was based on that section.
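The response payload looked roughly like the following sketch; the field names are illustrative rather than the exact schema I shipped.

```python
# Illustrative shape of a citation-bearing API response: every answer
# carries the source document, the matched excerpt, and a short rationale.
from dataclasses import dataclass, asdict

@dataclass
class CitedAnswer:
    answer: str
    source_document: str
    excerpt: str
    rationale: str

def respond(answer_text, top_chunk):
    return asdict(CitedAnswer(
        answer=answer_text,
        source_document=top_chunk["doc_id"],
        excerpt=top_chunk["text"][:300],  # the exact portion used
        rationale=f"Matched this section with similarity {top_chunk['score']:.2f}",
    ))
```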

Deployment and Scalability Considerations

As one might expect, there are several technical challenges that one faces in deploying the RAG-QA system at scale. These include: managing infrastructure, handling high traffic, and maintaining reliability.

Part of the infrastructure aspect was that each user, who might naturally be operating a different machine, needed to be identified and segregated, so that their conversations and chat history would not overlap or interfere with those of any other user. Toward this end, I introduced a mechanism in the bot to record logs of chat conversations in separate files, keyed by the identity of the user.
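A minimal sketch of that per-user logging, assuming one JSON-lines file per user; the key here is the client IP, as in my setup, with the caveat discussed next.

```python
# Each user's history goes to its own log file so sessions never mix.
import json
import time
from pathlib import Path

LOG_DIR = Path("chat_logs")
LOG_DIR.mkdir(exist_ok=True)

def log_turn(user_ip, role, text):
    entry = {"ts": time.time(), "role": role, "text": text}
    # One JSON-lines file per user, named after the (sanitized) IP.
    path = LOG_DIR / f"{user_ip.replace(':', '_').replace('.', '_')}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```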

By the way, I personally used the IP address to distinguish users, but this has a potential weakness: at times, users logging in from separate machines will still share the same IP address. So that is a consideration to take into account; adopt whatever criteria for delineating separate users best fit your case.

In terms of handling traffic, I had to determine the basic cost of many people regularly running inferences against the API. Fortunately, it was not expensive per user, but it is still a budget consideration that has to be taken into account carefully.

An app you can count on

Concerning reliability for a scalable project like this, probably the most important aspect for me was the following: in case of failures or errors, I used a PowerShell script to make sure the Python modules and scripts would refresh and restart whenever a failure occurred. This was vital to ensuring the app was never down at any time, day or night, in any time zone, whenever users wanted to chat with the bot.
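My watchdog was a PowerShell script; the sketch below expresses the same restart-on-exit idea in Python, with a hypothetical entry-point name.

```python
# Restart the bot process whenever it exits, for any reason.
import subprocess
import time

while True:
    proc = subprocess.run(["python", "chatbot_server.py"])  # hypothetical entry point
    print(f"Process exited with code {proc.returncode}; restarting in 5s")
    time.sleep(5)
```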

I believe it is worth mentioning, though it is obvious, that in order to provide stability for the development environment, I used several branches in git. Primarily I worked with the development branch, which would be for my experimentation and testing. From this I would create new feature branches whenever I wanted to introduce some new aspect or code modification. And for user acceptance testing, the most stable branch “UAT” was reserved for other parties to experiment with and perform QA tests.

I am pleased with how I addressed these challenges; solving them was vital to ensuring the chatbot could handle real-world usage patterns and requirements.

Iterative Development and Testing

I want to highlight the importance of continuous testing, user feedback, and iterative improvements. I made use of each of these methods to enhance the RAG-QA system based on real-world usage.

For example, at times I would introduce a new feature, an improvement to the code base, or an enhancement to the data processing pipeline. At that stage, I would run a Python module, of my own design and creation, to automatically answer a list of several dozen standard questions. From the quality and coherence of these responses, I could see whether the bot was advancing or regressing, or whether there was some other runtime error I did not want to escape my notice. Running this large batch of test questions was a very efficient way of surfacing the symptoms of any underlying bugs that had initially escaped my attention.
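A sketch of such a regression harness, with invented questions; answer_fn stands for the bot's question-answering entry point.

```python
# Replay a fixed list of standard questions and flag errors or empty answers.
STANDARD_QUESTIONS = [
    "How do I open an account?",
    "What are the trading fees?",
    # ... several dozen more in the real suite
]

def run_suite(answer_fn):
    failures = []
    for q in STANDARD_QUESTIONS:
        try:
            reply = answer_fn(q)
            if not reply.strip():
                failures.append((q, "empty answer"))
        except Exception as exc:  # surface runtime errors per question
            failures.append((q, repr(exc)))
    passed = len(STANDARD_QUESTIONS) - len(failures)
    print(f"{passed}/{len(STANDARD_QUESTIONS)} questions passed")
    return failures
```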

Everyone can use a little feedback

Also critical to the success of this project was user feedback. Many of the users were domain experts who could readily tell me whether the bot was understanding the content of the documentation. The goal was for the responses to reflect the accurate, factual, and reliable authority of the content in the knowledge base on which the bot was supposed to rely.

Several times I was told that something was being misinterpreted, so either the data had to be changed (as I mentioned above) or I had to modify how the bot was interpreting and processing that data and formulating a helpful response.

Let me share a good example of this: sometimes the bot would confidently answer a question even when it could not find any relevant information in our knowledge base, relying instead on its own general world knowledge. For our use case, that was dangerous: it could misrepresent the true intent of our product and, just as bad, mislead the customer and hinder our sales mission.

Admitting ignorance

Therefore, the solution I implemented was to enable the bot to comfortably admit that it didn't know the answer and instead offer to connect the user with a sales representative. Our team of experts is best placed to answer queries when there is no sufficiently relevant information in the knowledge base.
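A minimal sketch of the idea: refuse when the best retrieval score falls below a threshold and hand off to a human instead. The threshold value here is illustrative.

```python
# Guard the generator: if retrieval finds nothing sufficiently relevant,
# admit ignorance and offer a human handoff rather than free-associating.
FALLBACK = ("I couldn't find this in our documentation. "
            "Would you like me to connect you with a sales representative?")

def guarded_answer(query, retrieve, generate, min_score=0.75):
    chunks = retrieve(query)                     # [(score, text), ...] best first
    if not chunks or chunks[0][0] < min_score:
        return FALLBACK
    return generate(query, [text for _, text in chunks])
```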

If you allow me a follow-up question…

Another excellent example of an error I discovered through trial and error was the following. The bot was not always able to answer follow-up questions, because initially only the words of the immediate (i.e., most recent) query were considered when forming the response. So I implemented a method to convert the final query into a standalone query based on the history of the whole conversation thus far. This way, the bot could understand what the user was asking in light of the whole dialog.
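A sketch of that query condensation step, assuming the OpenAI SDK:

```python
# Rewrite the latest turn as a self-contained question before retrieval.
from openai import OpenAI

client = OpenAI()

def condense(history, followup):
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = (f"Given this conversation:\n{transcript}\n\n"
              f"Rewrite this follow-up as a standalone question:\n{followup}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# e.g. a conversation about account fees plus "and for premium accounts?"
# becomes "What are the fees for premium accounts?"
```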

Gradual change

Making every change in small, incremental steps was tremendously helpful to the overall success of the project. On one occasion, I failed to apply a long series of commits to our UAT branch, leaving it quite antiquated until it was refreshed much later. Learn from my mistake; this is best avoided.

I saw much more success when I made small changes, one at a time, so that each could be tested. This approach saves time as well: while the QA team is checking the feature I committed yesterday, I can be working today on a new feature that I will ask them to test tomorrow, making the whole workflow run more smoothly.

Conclusion

I made use of all these methods and agile approaches, which helped me identify and address issues, many of which would not be apparent in a purely academic setting. I wrote this article to emphasize these practical, real-world challenges and lessons learned. I hope it demonstrates the depth of experience and the valuable insights one can gain from building a production-ready RAG-QA chatbot.
