It seems you misunderstood the paper. The point is that to carry out any scientific process, we need to take a few steps in a certain order.
The problem is not that the system can't use a calculator, but the ordering of the process itself: it might hallucinate about when to use the calculator.
The paper showed that LLMs solve every task by this method. If that is true, the model might just come up with random steps to finish the process and use the calculator on random numbers.
What exactly are you going to feed it: all the constants, all the approximations? How the system decides what to use, and when, is little more than a hunch. It might use entirely different steps for a different instance of the same problem, and that's exactly the problem.
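
To make the ordering point concrete, here is a minimal Python sketch. Everything in it is my own illustration and not from the paper: the `calculator` and `run_plan` helpers are hypothetical, and the toy problem (x + 3) * 2 is just an assumption chosen for simplicity. The calculator is always correct here, yet the answer still depends on the order of the steps the planner picks.

```python
import operator

# Hypothetical calculator tool: it never makes an arithmetic mistake.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def calculator(op, a, b):
    """Apply a named arithmetic operation to two numbers."""
    return OPS[op](a, b)

def run_plan(plan, start):
    """Execute a list of (operation, operand) steps in the given order."""
    value = start
    for op, operand in plan:
        value = calculator(op, value, operand)
    return value

# Toy problem (assumed for illustration): compute (x + 3) * 2 for x = 5.
correct_plan = [("add", 3), ("mul", 2)]   # add first, then multiply
shuffled_plan = [("mul", 2), ("add", 3)]  # same tool calls, wrong order

result_correct = run_plan(correct_plan, 5)
result_shuffled = run_plan(shuffled_plan, 5)

print(result_correct, result_shuffled)
```

Both plans use the same calculator calls with the same numbers, but only one order gives the right answer. Giving the model a calculator does nothing to fix a wrong plan.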