ChatGPT can't reliably add 61 numbers in random order, giving 12 different wrong answers despite showing convincing fake work. The AI optimizes for speed over accuracy.

ChatGPT Is Gaslighting You With Math

ChatGPT can write efficient Python code and draft complex SQL queries in seconds. Heck, it’ll brainstorm your entire marketing campaign if you let it. We’re rushing to trust this “genius,” latest models and all, with our legal documents and the code that runs our business operations.

As a Business Intelligence Analyst with over a decade of experience at companies like Amazon and Microsoft, I was curious. I’ve watched this technology go from a toy to a tool that many claim is ready to replace me. But while everyone tests AI with harder problems, I decided to do the opposite.

The results revealed what’s really going on: a major design choice, a cost-versus-performance tradeoff, that governs the entire system. It’s a transparency problem, and it’s a gap big enough to make any data professional skeptical. My test showed that while AI is great at writing code, it can’t be trusted with the most basic building block of my job: simple arithmetic.

And the scariest part? It’s confidently wrong.

The math problem that revealed the trick

My test was a simple list of all the numbers from -3000 to 3000 (at intervals of 100), which came to 61 numbers total. The list was specifically designed so the final, correct answer would be a simple “zero.” This way, I’d know instantly if it was right.
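For anyone who wants to reproduce the setup, here’s a minimal Python sketch of the same list. (This is a reconstruction for illustration, not the literal prompt I pasted in.)

```python
# Build the test list: every multiple of 100 from -3000 to 3000.
numbers = list(range(-3000, 3001, 100))

print(len(numbers))  # 61 values in total
print(sum(numbers))  # 0 -- each number cancels its negative twin
```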

Here’s where the “shortcut” behavior became clear. When I gave it the list sorted in ascending or descending order, it aced the test every time. It correctly recognized the simple pattern: -3000 cancels 3000, -2900 cancels 2900, and so on. This, as it turns out, tested pattern matching, which is a core AI strength.

But what happens when you remove the pattern? I put those exact same numbers into a random order. This broke the simple shortcut and forced the AI to actually calculate.
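Breaking the pattern takes one line. Continuing the sketch above:

```python
import random

random.seed(7)            # any seed works; fixed here so the run is repeatable
random.shuffle(numbers)   # same 61 numbers, no obvious cancellation pairs

print(sum(numbers))       # still 0 -- only the presentation changed
```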

It failed, and not by a small margin.

The “performative” failure

The AI didn’t calculate. It put on a performance of calculation.

It’s like watching Noah Wyle on The Pitt or ER: he plays a convincing doctor, but you should never trust him to perform an actual medical procedure on you. In the same way, the AI goes through the motions of calculation. It replies with a “step-by-step” breakdown that looks perfectly logical, but the final answer is confidently and fundamentally wrong.

When I challenged it, it “double-checked” and got a different wrong answer.

The failure was more fundamental than I first thought. This process involved extensive hand-holding, like asking it to break the list into small groups of 10 numbers. It still got the math wrong on each of those simple, 10-integer sums. Even after I gave it the correct answers for each group, it still failed to correctly add the six group subtotals.
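For the curious, the hand-holding amounted to a decomposition like this (sketched in Python; the exact grouping I dictated may have differed slightly):

```python
# Split the shuffled list into chunks of 10 and sum each chunk --
# the same decomposition I walked ChatGPT through step by step.
chunks = [numbers[i:i + 10] for i in range(0, len(numbers), 10)]
subtotals = [sum(c) for c in chunks]

print(subtotals)       # 61 numbers -> six chunks of 10 plus one leftover
print(sum(subtotals))  # 0 -- the final step it still got wrong
```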

In all, I got 12 different incorrect answers from this back-and-forth, all from the same prompt.

This wasn’t a failure to handle a “complex” list. It was a complete failure to perform basic addition. In the end, it apologized for its “mental calculation errors,” a priceless admission that it was simulating calculation rather than doing it.

This shortcut-first design is the same reason the AI has historically flubbed other simple tasks, like counting letters in a word. Its main goal is finding the cheapest shortcut, not necessarily the correct answer.

Then I found the hidden calculator

So, is ChatGPT just hopelessly bad at math? Not exactly. I went back to that very first wrong answer and, instead of replying, I clicked the “Think Longer” option.

Instantly, this happened:

```python
# Define the list of numbers
numbers = [
    2100, 800, -1800, 2400, 1000, -2400, 1200, ...
]

# Calculate the sum
total_sum = sum(numbers)
total_sum

# Result: 0
```

It got the correct answer (zero) in one second. This is the “gotcha.” It had the calculator all along. It was just choosing not to use it.

What’s actually happening here?

This isn’t a bug. What we’re seeing is a design tradeoff. ChatGPT is optimized for speed first and accuracy second in its default mode.

The default, fast path is text-only. It attempts to solve the problem by relying on its vast training data, which includes learned arithmetic patterns. For simple sums (like 2+2), this is reliable. But for tasks that require real precision across many steps, like our 61-number list, it fails. It’s like a person trying to multiply large numbers in their head: they understand the concept of math, but quickly lose track of the intermediate steps and “carries.” This text-only approach is very fast and, more than that, it’s cheap to run.

The thinking path (using the Python calculator) is perfectly accurate, but it’s expensive to run. Don’t get me wrong, this “slow” path still returned an answer in a split second. But for OpenAI, the system cost is enormous. Instead of just predicting the next word, the AI has to do a lot more work: it must spin up a secure Python interpreter, write the code, execute it, and then read the output.

This is the cost-performance tradeoff in action. It’s the why behind everything we just saw.

Here’s where it gets personal

Look, here’s the real risk for data professionals. We assume the AI is analyzing. It’s not. It’s just simulating what the analysis should look like.

Now, I know what the response will be: this is all part of the new “dynamic” system. OpenAI’s own press releases boast about this, calling it a “real-time router” that “quickly decides” whether to respond quickly or to “think longer” for hard problems. This sounds great on paper. But my test shows this so-called smart router can fail on seemingly simple tasks that lack obvious patterns.

When I gave it a 61-number math problem in random order, its internal logic misjudged the difficulty. It seemed to think this was a simple task it could crush. This tells me the router’s heuristics aren’t tuned to catch this kind of “deceptively simple” problem. It’s probably just looking at query length or whether there are math symbols. So, instead of correctly identifying this as a “hard problem” and automatically engaging its thinking model, it chose the fast, text-only path and proceeded to fail.

That’s the career risk right there. Imagine asking the AI to “check the subtotals on this expense report.” It replies with a confident, text-only “Looks correct!” You pass that report to an executive, who quickly does some mental math and realizes your calculations are wrong.

In that moment, you’ve damaged your credibility by relying on a tool for a task it wasn’t designed to handle reliably. The AI’s failure was that it simulated the act of checking instead of actually calculating. And you, the professional, are left holding the bag.
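The defense is mechanical: recompute before you forward anything. Here’s a sketch with pandas, where the file and column names are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical expense report -- the file and column names are made up.
df = pd.read_csv("expense_report.csv")

# Recompute every department's subtotal from the raw line items.
recomputed = df.groupby("department")["amount"].sum()
print(recomputed)

# Compare these figures against the subtotals printed on the report.
# A text-only "Looks correct!" from a chatbot is not a check.
```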

When AI math actually works

So, it’s important to know when AI is reliable for these tasks. My test was designed to hit a specific vulnerability. AI math is generally trustworthy in a few other places:

  • When it explicitly uses its code interpreter (like the “Think Longer” path).
  • For simple, in-context arithmetic (e.g., “I have 3 apples and buy 2 more, how many do I have?”).
  • For symbolic math and explaining concepts (e.g., “Explain the Pythagorean theorem”).

The risk, as my test shows, is not knowing which mode the AI is in.

What you should actually do about this

This doesn’t mean “don’t use AI.” It means we need to use it like a pro, not a novice. We have to treat it like the “shortcut engine” it is. So here’s my practical guide for data analysts based on my findings:

Look, if you remember nothing else: if it shows you code, you can trust the result. If it just talks at you, be skeptical.

Use “Think Longer” for any “right-or-wrong” answer. Don’t wait for it to fail first.

Use the right tool for the job. For straightforward arithmetic, use Excel. It’s built for it and is infinitely more reliable. Why make a “creative writing” engine do a calculator’s job? However, for generating an analytical workflow or cleaning data before the calculation, using the AI with its code execution on is a powerful and genuinely useful way to get work done.
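To make that division of labor concrete, here’s a minimal sketch: let the AI draft the cleanup logic, but keep the arithmetic in code you run and audit yourself. (The messy values below are invented for illustration.)

```python
# Messy inputs, invented for illustration.
raw = ["$2,100", "800", "(1,800)", " 2400 "]

def clean(value: str) -> int:
    """Normalize a messy currency string to an integer."""
    v = value.strip().replace("$", "").replace(",", "")
    if v.startswith("(") and v.endswith(")"):  # accounting-style negative
        v = "-" + v[1:-1]
    return int(v)

amounts = [clean(v) for v in raw]
print(amounts)       # [2100, 800, -1800, 2400]
print(sum(amounts))  # deterministic arithmetic -- no "mental math" involved
```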

This all comes down to transparency. The AI isn’t flawed because it failed the math; it’s flawed because it hid the failure behind a mask of confidence. It has a perfectly good calculator in its back pocket but defaults to the fast, unreliable method without telling you which one it’s using. As data professionals, that’s just not a foundation we can build on. Until these systems tell us how they’re getting an answer, the rule is simple: if it’s not in a code block, it’s not an answer. It’s just a performance.
