
GPT-5 in Testing: Why the New AI Model for Chatbots in Customer Service Falls Short

Written by Harald Huber | Aug 11, 2025 1:36:45 PM

As specialists in dependable chatbots for customer service, our experts put the new GPT-5 through a hands-on test. The goal was to see if the model could meet the high standards required in sensitive areas like financial services, healthcare or public administration. 

In these settings, “good enough” answers aren’t enough. Chatbots must deliver accurate, complete and compliant responses—consistently and reproducibly. 

Focus Areas: What a Service Chatbot Must Deliver

We looked at three core requirements for customer service chatbots: 

  • Completeness of answers 
  • Accuracy of information
  • Ability to handle longer dialogues, such as diagnostics or analysis 

In environments using retrieval-augmented generation (RAG), chatbots must extract precise content from external sources and process it in context. In practice, GPT-5 showed clear shortcomings here. 
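
To make this requirement concrete, the sketch below shows the core of such a RAG step in a few lines of Python. The mini knowledge base, the keyword-based retrieval and the prompt wording are simplified assumptions for illustration only; a production system would use vector search over far larger sources. The obligation is the same, however: the model has to answer from the retrieved passages, completely and without inventing content.

```python
# Minimal RAG sketch (illustrative only): retrieve passages from a small
# in-memory knowledge base and build a grounded prompt for the model.
# Knowledge base, scoring and prompt wording are simplified assumptions,
# not USU's production setup.
import re

KNOWLEDGE_BASE = [
    "Premium support is available Monday to Friday, 8:00-18:00 CET.",
    "Password resets require the customer number and a registered e-mail address.",
    "Hardware returns must be registered within 14 days of delivery.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Naive keyword-overlap scoring; real systems use vector search."""
    q = tokens(question)
    scored = sorted(KNOWLEDGE_BASE, key=lambda p: len(q & tokens(p)), reverse=True)
    return [p for p in scored[:top_k] if q & tokens(p)]

def build_prompt(question: str) -> str:
    """The model is instructed to answer only from the retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer the customer question using ONLY the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How do I reset my password?"))
```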


Where GPT-5 Struggles

Compared to GPT-4 and some competing models, GPT-5 underperformed in several ways: 

  • Misjudging task difficulty – GPT-5 scales its “thinking effort” dynamically, but it often misjudges how complex a request actually is (see the sketch below this list).
  • Incomplete answers – When pulling from long texts, GPT-5 writes fluently but leaves out relevant content.
  • Uncertain reasoning – When logical connections are needed, it often stops too soon, leading to wrong conclusions. 
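
The first of these points can at least be mitigated on the integration side: instead of letting the model judge task difficulty on its own, the calling system can pin the reasoning effort per task type. The sketch below does this with OpenAI's Python SDK and the Responses API; the task-to-effort mapping is our own assumption, and the parameter names should be checked against OpenAI's current API reference before use.

```python
# Hedged sketch: instead of letting GPT-5 decide how much "thinking effort"
# to spend, the service integration pins the effort level per task type.
# Parameter names follow OpenAI's Python SDK (Responses API) as currently
# documented; verify them against the current API reference before use.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Our own (assumed) mapping from task type to a fixed reasoning effort.
EFFORT_BY_TASK = {
    "faq_lookup": "minimal",           # simple, single-source answers
    "diagnostics": "high",             # multi-step error analysis
    "parameter_collection": "medium",
}

def ask(task_type: str, prompt: str) -> str:
    response = client.responses.create(
        model="gpt-5",
        reasoning={"effort": EFFORT_BY_TASK.get(task_type, "medium")},
        input=prompt,
    )
    return response.output_text
```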

In real-world service operations, these gaps are critical—especially when precise data is needed for integration or follow-up steps. 


Impact on Multi-Agent Architectures

Modern multi-agent chatbot setups often include memory components to keep track of complex dialogues and large information sets. Typical use cases include: 

  • Technical error diagnostics or needs assessments 
  • Multi-step analyses in technical support 
  • Collecting parameters for third-party systems 

Here, it is not enough to reproduce token patterns learned in pre-training: strong logical reasoning is needed to connect pieces of information correctly and process them in full. In our tests, GPT-5 achieved this less often than its predecessors, a disadvantage that can quickly lead to incomplete or incorrect results in production. 
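
In its simplest form, such a memory component can look like the sketch below; the field names and the support example are hypothetical. The point is that the dialogue state, not the language model, keeps track of which parameters are still missing, so an incomplete model answer cannot silently drop a required value.

```python
# Illustrative sketch of a memory component for multi-step parameter
# collection. Field names and the support scenario are hypothetical; the
# point is that the architecture, not the model, tracks what is missing.
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("customer_id", "device_serial", "error_code")

@dataclass
class DialogueMemory:
    collected: dict[str, str] = field(default_factory=dict)

    def update(self, extracted: dict[str, str]) -> None:
        """Merge fields the model extracted from the latest user turn."""
        self.collected.update({k: v for k, v in extracted.items() if v})

    def missing(self) -> list[str]:
        """Fields that still have to be asked for before a hand-off."""
        return [f for f in REQUIRED_FIELDS if f not in self.collected]

memory = DialogueMemory()
memory.update({"customer_id": "C-1042"})   # turn 1
memory.update({"error_code": "E17"})       # turn 2
print(memory.missing())                    # -> ['device_serial']
```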

Looking Ahead: GPT-5 Is a Work in Progress with Potential for Rapid Improvement

Our analysis suggests OpenAI is aware of GPT-5’s current weaknesses. Public statements and developer notes indicate ongoing work on completeness, context understanding and reasoning. We also expect OpenAI to release more tools, interfaces and best-practice guides so companies can get more out of GPT-5. 

In customer service, this could mean we’ll see a more stable and precise model in just weeks or months. If the gaps close, GPT-5 could turn from a cautious start into a real milestone for reliable chatbots. 

Our Expert Recommendation

For now, our team advises caution when using GPT-5 in production chatbot systems. 

For critical service processes, test the model thoroughly in advance and, if needed, secure it with hybrid architectures or extra validation steps. Keep a close eye on OpenAI’s updates and plan pilots so future improvements can be integrated easily. To track progress objectively, we’ll run the test again in a few weeks and share the results—so you can decide if GPT-5 has made the leap from a cautious debut to a powerful service model. 
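
As one example of such an extra validation step, the following sketch gates a model-produced payload before it reaches a third-party system and escalates instead of passing on incomplete or malformed data. The field names and formats are placeholders; in practice the rules come from the target system's interface contract.

```python
# Hedged sketch of an "extra validation step": check a model-produced payload
# before it reaches a downstream system, and escalate instead of passing on
# incomplete or malformed data. Field names and formats are assumptions.
import re

VALIDATORS = {
    "customer_id": re.compile(r"^C-\d{4,}$"),
    "error_code": re.compile(r"^E\d{1,3}$"),
}

def validate(payload: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the payload may pass."""
    problems = [f"missing field: {k}" for k in VALIDATORS if k not in payload]
    problems += [
        f"bad format for {k}: {payload[k]!r}"
        for k, pattern in VALIDATORS.items()
        if k in payload and not pattern.match(payload[k])
    ]
    return problems

issues = validate({"customer_id": "C-1042", "error_code": "17"})
if issues:
    print("Escalating to a human agent:", issues)  # -> bad format for error_code
```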

Want the Full Test Results and Specific Recommendations for Your Chatbot?

Contact us today to learn how to use large language models safely and efficiently in your customer service. 


Helpful Resources for Customer Service & AI Leaders