Is ChatGPT Getting Worse? Unpacking the Performance Debate

Cindy Candrian
Nov 8, 2023
6 min read

At Delta Labs, one question comes up in almost every workshop or talk we host: Is ChatGPT’s performance getting worse? It’s not just us; the question has been making the rounds on the web and within the AI community too. The whispers grew louder when GPT-4’s architecture was allegedly leaked, followed by claims that OpenAI had dialed back GPT-4’s performance to save on computation time and costs. This wasn’t just a baseless rumor; many AI enthusiasts supported the claim with their own experiences of a perceived degradation in GPT-4's performance. Meanwhile, OpenAI has consistently denied that GPT-4 has decreased in capability.

The debate moved from the corridors of the AI community to the spotlight when a research paper came out, putting data behind the claim that ChatGPT’s performance was declining on certain tasks. Yet the narrative surrounding ChatGPT’s performance is far from straightforward, with contrasting evaluations and discussions cropping up after the paper's publication.

So, here we delve into the murmurings about ChatGPT's diminishing prowess, why the tale isn't as clear-cut as it seems, and what likely lies at the heart of this discourse: Is ChatGPT truly on a downward spiral?

The Decline Claim: The Academic Spotlight on GPT-4's Performance

The chatter around ChatGPT's performance took a more structured form when researchers from Stanford University and the University of California, Berkeley released a paper. Their investigation compared GPT-4's behavior between March and June 2023 across a variety of tasks such as solving math problems, code generation, and responding to sensitive questions. The findings were stark; they indicated a significant dip in GPT-4’s performance in certain areas. For instance, accuracy on one math task, identifying prime numbers, plummeted from 97.6% in March to a mere 2.4% in June 2023.

But the math section wasn’t the only area facing a downgrade. The paper highlighted a decline in code generation competency, while a slight uptick was noted in visual reasoning tasks. The authors also observed a change in the model's response to sensitive queries. Initially, GPT-4 would articulate a more elaborate response, but by June, it opted for a terse, polite refusal to engage with sensitive questions.

These findings fueled the ongoing narrative of a possible deliberate degradation in GPT-4’s performance, aligning with the earlier speculations that OpenAI might have pulled back on GPT-4's capabilities to save on computational resources. The paper didn’t just provide fodder for debates; it ignited a more intensive discussion within and outside the AI community, pushing many to delve deeper into the claims and counterclaims surrounding ChatGPT's performance trajectory.

The study by these reputable institutions brought a more scholarly lens to the anecdotal observations many had shared and gave the concerns that were floating around a structured basis. Its data-backed claims moved the discussion from the periphery to center stage, making it a focal point in forums, workshops, and among AI enthusiasts and experts alike.

Dissecting the Study: A Dive into Methodology and Critique

At the outset, OpenAI's VP of Product, Peter Welinder, took to Twitter to assert, "No, we haven't made GPT-4 dumber. Quite the opposite," setting up a narrative that contrasts with the claims made by the study from Stanford and UC Berkeley. At first glance, the study appeared to be a robust endorsement of the whispers surrounding GPT-4's dwindling performance. Yet a closer look reveals a picture that is less black and white.

When diving into the specifics, some of the evaluation methods in the study raised eyebrows. For instance, in assessing GPT-4's code generation prowess, the focus was on whether the code was immediately executable rather than on its correctness. Arvind Narayanan, a computer science professor at Princeton University, highlighted this as a misstep, noting on Twitter, "The change they report is that the newer GPT-4 adds non-code text to its output...They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it." Such an evaluation overlooks the possibility that the additional text might be aiding user understanding, thereby providing a more comprehensive response.
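To make this critique concrete, here is a minimal Python sketch. It is not the study's evaluation code. It assumes the check boils down to asking whether the raw response parses as Python (the built-in compile serves as a stand-in for "directly executable"), and the helper names is_directly_executable and strip_markdown_fence as well as the two sample responses are our own inventions for illustration.

```python
import re

# Markdown code fence, built programmatically so this example does not
# contain a literal fence inside its own code block.
FENCE = "`" * 3


def is_directly_executable(text: str) -> bool:
    """Return True if the raw text parses as Python source (assumed stand-in check)."""
    try:
        compile(text, "<response>", "exec")
        return True
    except SyntaxError:
        return False


def strip_markdown_fence(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged if there is none."""
    pattern = re.escape(FENCE) + r"[a-zA-Z]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else text


# Hypothetical responses: the same function, once bare and once wrapped in a fence.
bare_response = "def add(a, b):\n    return a + b\n"
fenced_response = FENCE + "python\n" + bare_response + FENCE + "\n"

print("bare response passes:    ", is_directly_executable(bare_response))
print("fenced response passes:  ", is_directly_executable(fenced_response))
print("fenced, after stripping: ",
      is_directly_executable(strip_markdown_fence(fenced_response)))
```

Under these assumptions, identical code flips from passing to failing purely because of the formatting wrapped around it, which is the effect Narayanan describes.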

Further, the mathematical evaluation in the study faced some notable shortcomings, particularly in its approach to primality testing. The researchers chose to test GPT-4's ability to identify prime numbers but limited the test set to only prime numbers. This narrow scope led to a skewed assessment, as it didn't challenge GPT-4 to distinguish between prime and composite numbers.

The critique pointed out that when the models were later tested with composite numbers, the narrative around their performance shifted. It revealed that the models were more or less guessing whether the numbers were prime based on how they were fine-tuned, rather than accurately computing primality.

The March version of GPT-4, for instance, tended to guess that the numbers were prime, while the June version leaned towards guessing they were composite. This behavior wasn't captured in the original study due to the exclusive use of prime numbers in the test set.

By not including composite numbers in the evaluation, the study missed out on providing a fuller picture of GPT-4's mathematical capabilities. It's a crucial oversight that casts doubt on the claim of a significant performance drop, underscoring the importance of a well-rounded evaluation to accurately gauge GPT-4’s mathematical prowess.
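The effect of a primes-only test set can be illustrated with a small sketch. The two "guessers" below are deliberately extreme stand-ins of our own (one always answers "prime", the other always answers "composite"), not the real models, and the number range is arbitrary.

```python
def is_prime(n: int) -> bool:
    """Ground truth via trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(guesser, numbers) -> float:
    """Fraction of numbers for which the guesser's answer matches the ground truth."""
    correct = sum(guesser(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


# Hypothetical stand-ins: neither one computes anything.
def always_prime(n: int) -> bool:
    return True


def always_composite(n: int) -> bool:
    return False


candidates = range(1000, 1200)  # arbitrary range chosen for illustration
primes_only = [n for n in candidates if is_prime(n)]
composites_only = [n for n in candidates if not is_prime(n)]
balanced = primes_only[:20] + composites_only[:20]

for name, test_set in [("primes only", primes_only), ("balanced", balanced)]:
    print(name,
          "| always 'prime':", accuracy(always_prime, test_set),
          "| always 'composite':", accuracy(always_composite, test_set))
```

On the primes-only set, the always-"prime" guesser scores 100% and the always-"composite" guesser scores 0%, even though neither does any math. On the balanced set, both land at 50%. A swing that dramatic can therefore come from a change in the default answer rather than from a change in mathematical ability.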

Although its methodology is not entirely convincing, the study nonetheless resonated with a shared sentiment: something seemed to change about ChatGPT's performance over time. So why does a section of the user base feel that ChatGPT's performance has dulled?

Why Did the Paper Touch a Nerve?

The paper from Stanford and UC Berkeley rang true for many, as it delved into what seemed to be a shared concern among a broad spectrum of ChatGPT users. At its core, the resonance of the study can be traced back to two intertwined concerns: the growing awareness of ChatGPT’s limitations and the ripple effects of behavior drift on user-centric applications.

As users interacted more with ChatGPT, the initial excitement about its capabilities began to wear off, gradually exposing its limitations. This growing awareness wasn't a solitary journey; it was a collective realization that rippled through the community, finding a voice in hushed discussions and online forums. OpenAI's VP of Product echoed this sentiment publicly, suggesting, "Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before." Yet, there's another facet to this narrative that unfolds in parallel.

Simultaneously, the phenomenon of behavior drift emerged as a significant concern. The capability of a model refers to the range of actions it can perform based on its training, while behavior denotes how the model acts in response to specific prompts. This nuanced difference is crucial, as critics argue that the study may have mistaken behavioral changes for a dip in capability.

Chatbots like ChatGPT acquire their capabilities through an intensive process known as pre-training. This foundational phase is both time-consuming and expensive, shaping what the model can potentially do. However, the behavior that users experience while interacting with ChatGPT is molded during a subsequent, less expensive phase known as fine-tuning, which is carried out more frequently to align the model with specific objectives, such as preventing undesirable outputs or enhancing user interaction. In essence, while pre-training equips ChatGPT with a toolkit of capabilities, fine-tuning directs how and when these tools are employed in response to user prompts.

Over time, fine-tuning adjustments altered ChatGPT's behavior, subtly shifting the goalposts for users and developers. Prompting strategies that had once been reliable now yielded inconsistent responses, disrupting established workflows. Although not a degradation in capability per se, these behavior shifts looked like a performance decline in practical use cases, blurring the line between actual performance degradation and perceived inconsistency.

Moreover, the ripple effects of behavior drift extended beyond individual user experiences, reaching the shores of developers and businesses who had built applications atop the GPT API. The need to constantly adapt to ChatGPT's evolving behavior, coupled with an undercurrent of uncertainty surrounding its performance, fueled a narrative of discontent.
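For teams building on top of a model, one way to catch this kind of drift early is a small regression check over a fixed set of prompts. The sketch below is only an illustration under our own assumptions: old_version_stub and new_version_stub are hypothetical stand-ins that return invented outputs (no real API is called), and find_drift and is_bare_yes_no are names we made up for this example.

```python
from typing import Callable, List


def find_drift(
    prompts: List[str],
    old_model: Callable[[str], str],
    new_model: Callable[[str], str],
    check: Callable[[str], bool],
) -> List[str]:
    """Return prompts whose output passed the check before but fails it now."""
    return [p for p in prompts if check(old_model(p)) and not check(new_model(p))]


# Hypothetical stand-ins for two model versions; the outputs are invented.
def old_version_stub(prompt: str) -> str:
    return "yes"


def new_version_stub(prompt: str) -> str:
    return "Sure! The answer is: yes."


def is_bare_yes_no(output: str) -> bool:
    """The format a downstream parser in our imagined application expects."""
    return output.strip().lower() in {"yes", "no"}


prompts = ["Is 7 a prime number? Answer yes or no."]
drifted = find_drift(prompts, old_version_stub, new_version_stub, is_bare_yes_no)
print("prompts with changed behavior:", drifted)
```

In this toy case the new answer is still correct, yet the application breaks because the format changed. That is a shift in behavior rather than capability, but to the developer it feels like a regression.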

Amid high expectations and widespread speculation, the paper acted as a catalyst, amplifying the growing disquiet around ChatGPT's performance. The episode underscores the complex interplay between user perception, actual system performance, and the narrative that research studies can shape under heightened scrutiny.

The Underlying Message

The work of Chen, Zaharia, and Zou might have its flaws, but it strikes a chord that many in the field have felt: measuring the performance of language models with precision is a tricky business. Critics have repeatedly pointed to OpenAI's tight-lipped approach with GPT-4, which has left many in the dark about the sources of training data, the code, the neural network setups, and even the basic blueprint of its architecture. That lack of shared insight and open updates is the real talking point here.

Because of its flaws, the paper doesn’t serve as proof of a downturn in GPT-4's capabilities. It is, however, a stark reminder of the unintended side effects that come with the regular fine-tuning of Large Language Models (LLMs), which can lead to noticeable shifts in behavior across different tasks. In the end, the episode shows how tough it is to put a number on the performance of language models, especially given OpenAI’s closed approach to AI.