As specialists in dependable customer service chatbots, we put the new GPT-5 through a hands-on test. The goal was to see whether the model meets the high standards required in sensitive areas such as financial services, healthcare or public administration.
In these settings, “good enough” answers fall short. Chatbots must deliver accurate, complete and compliant responses, consistently and reproducibly.
We looked at three core requirements for customer service chatbots:
In environments using retrieval-augmented generation (RAG), chatbots must extract precise content from external sources and process it in context. In practice, GPT-5 showed clear shortcomings here.
Compared to GPT-4 and some competing models, GPT-5 underperformed in several ways:
In real-world service operations, these gaps are critical, especially when precise data is needed for integration or follow-up steps.
Modern multi-agent chatbot setups often include memory components to keep track of complex dialogues and large information sets. Typical use cases include:
Here, it’s not enough to reproduce patterns learned during pre-training. Strong logical reasoning is needed to connect information correctly and process it fully. In our tests, GPT-5 achieved this less often than its predecessors, a disadvantage that can quickly lead to incomplete or incorrect results in production.
Our analysis suggests OpenAI is aware of GPT-5’s current weaknesses. Public statements and developer notes indicate ongoing work on completeness, context understanding and reasoning. We also expect OpenAI to release more tools, interfaces and best-practice guides so companies can get more out of GPT-5.
In customer service, this could mean we’ll see a more stable and precise model in just weeks or months. If the gaps close, GPT-5 could turn from a cautious start into a real milestone for reliable chatbots.
For now, our team advises caution when using GPT-5 in production chatbot systems.
For critical service processes, test the model thoroughly in advance and, if needed, secure it with hybrid architectures or extra validation steps. Keep a close eye on OpenAI’s updates and plan pilots so future improvements can be integrated easily. To track progress objectively, we’ll run the test again in a few weeks and share the results, so you can decide whether GPT-5 has made the leap from a cautious debut to a powerful service model.
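To make the idea of an extra validation step more concrete, here is a minimal sketch in Python. It is not the setup used in our test. It assumes the chatbot hands over its draft answer together with the passages it retrieved, and it only checks one narrow thing: whether every figure in the answer also appears in a source passage. The names `validate_answer`, `respond`, `ValidationResult` and the fallback text are hypothetical and chosen for illustration only.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Illustrative pattern: matches integers and decimals such as "14", "4.50" or "1,200.00".
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


@dataclass
class ValidationResult:
    approved: bool
    unsupported_figures: list[str]


def validate_answer(answer: str, source_passages: list[str]) -> ValidationResult:
    """Flag every figure in the answer that appears in none of the retrieved passages."""
    source_figures: set[str] = set()
    for passage in source_passages:
        source_figures.update(NUMBER_PATTERN.findall(passage))

    unsupported = [
        figure
        for figure in NUMBER_PATTERN.findall(answer)
        if figure not in source_figures
    ]
    return ValidationResult(approved=not unsupported, unsupported_figures=unsupported)


def respond(
    answer: str,
    source_passages: list[str],
    fallback: str = "I will pass this question on to a human colleague.",
) -> str:
    """Return the model answer only if it passes validation, otherwise a safe fallback."""
    result = validate_answer(answer, source_passages)
    return answer if result.approved else fallback


if __name__ == "__main__":
    sources = ["The monthly account fee is 4.50 EUR. Cancellation is possible within 14 days."]
    print(respond("The fee is 4.50 EUR and you can cancel within 14 days.", sources))
    print(respond("The fee is 5.50 EUR.", sources))
```

A check like this is deliberately strict and simple: it catches invented or altered figures, but it says nothing about whether the answer is complete or compliant. In a real deployment it would be one of several validation layers, alongside human handoff for anything it rejects.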
Contact us today to learn how to use large language models safely and efficiently in your customer service.