Apple study exposes deep cracks in LLMs' "reasoning" capabilities
Oct 28
4 min read
Apple unveiled a new phase in its technological journey by integrating generative AI into its product lineup at the annual Worldwide Developers Conference. The new features, collectively called "Apple Intelligence," include smarter Siri interactions, AI-generated "Genmoji," and personalized user experiences. Apple also announced a partnership with OpenAI to embed ChatGPT's technology into its offerings. However, Apple faces the challenge of balancing its own privacy commitments against public skepticism about OpenAI's data practices.
Apple emphasized privacy, promising that many AI functions will be executed locally on the user's phone to avoid cloud-based risks. This AI push aims to spur iPhone sales and services in an economic climate where consumers hesitate to upgrade their devices. Yet, it also brings regulatory scrutiny and competitive pressure from companies like Nvidia, which recently overtook Apple in market valuation.
Apple CEO Tim Cook noted the company's cautious approach to AI adoption: "Our AI capabilities must be personal and intuitive. They must also reflect privacy from the ground up." While Apple is often slow to adopt emerging technologies, the rapid growth of generative AI has accelerated its efforts to stay competitive with cutting-edge features.
Apple's enhanced AI capabilities allow Siri to behave like a chatbot, answering detailed questions and performing personalized tasks. These improvements enable Siri to, for instance, provide updates about a family member's flight by parsing emails or even convert a picture into a stylized "Genmoji" character. This personalized intelligence will evolve as Siri learns the user's behavior over time, streamlining actions across apps and responding to complex commands.
Cook emphasized the importance of these tools enhancing, not replacing, users: "We see AI's role as empowering people, not replacing them. With AI, we are responsible for integrating it thoughtfully and with respect for privacy."
Despite Apple's investment in AI-powered products, recent research by Apple's AI team underscores a significant limitation: large language models (LLMs) exhibit brittle reasoning when applied to mathematical problems. A team of Apple researchers including Samy Bengio and Oncel Tuzel detailed these limitations in their study, "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models."
The research compared model performance on two datasets: OpenAI's GSM8K, a widely used benchmark of roughly 8,000 grade-school math problems, and Apple's new GSM-Symbolic dataset. GSM-Symbolic replaces the names and numbers in GSM8K questions with fresh values, ensuring that models can't rely on memorized patterns. The results exposed a critical weakness: models that perform well on GSM8K stumble on GSM-Symbolic, with accuracy dropping by as much as 9.2 percent. Many models also showed inconsistent performance across multiple runs of the same question, further underscoring their fragility.
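For readers who want to see the idea concretely, here is a minimal sketch of that templating approach in Python. It illustrates the concept only; it is not the paper's actual generation code, and the template text, name list, and number ranges are all invented for this example:

```python
import random

# Sketch of GSM-Symbolic-style templating (illustrative, not the paper's code):
# turn one GSM8K-style question into many variants by swapping names and
# numbers, so a model can't succeed by recalling a memorized question/answer.

TEMPLATE = (
    "{name} picks {a} kiwis on Friday and {b} kiwis on Saturday. "
    "On Sunday, {name} picks {k} times as many as on Friday. "
    "How many kiwis does {name} have?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Oliver", "Sofia", "Liam", "Mia"])
    a, b, k = rng.randint(20, 60), rng.randint(20, 60), rng.randint(2, 4)
    question = TEMPLATE.format(name=name, a=a, b=b, k=k)
    answer = a + b + k * a  # ground truth computed from the same variables
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, ans = make_variant(rng)
    print(q, "->", ans)
```

Because the ground truth is computed from the same variables used to fill the template, every variant stays solvable while the surface form keeps changing, which is exactly what defeats memorization.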
The research highlights that LLMs excel at pattern matching but lack the deeper logical understanding that reliable reasoning requires. The models can mimic reasoning by recalling similar examples from their training data, yet they falter when faced with unfamiliar variations.
One test highlighted by the researchers is a math problem in which Oliver picks kiwis over three days. On Sunday he picks twice as many as on Friday, but five of the kiwis are noted as "smaller than average." Although this detail is irrelevant to the final count, many LLMs subtract the smaller kiwis anyway, illustrating how these models misinterpret incidental information. Apple's researchers termed such failures "critical flaws," demonstrating that the models replicate patterns rather than reason logically.
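To make the failure concrete: in the version of the problem published in the paper, Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's count on Sunday. The "five smaller kiwis" clause changes nothing, yet a model that subtracts it lands on 185 instead of 190. A few lines of Python make the correct and flawed computations explicit:

```python
friday, saturday = 44, 58   # counts from the paper's example
sunday = 2 * friday         # "double the number he picked on Friday"

correct = friday + saturday + sunday  # 44 + 58 + 88 = 190

# The "five were smaller than average" clause is a distractor: the kiwis are
# still picked, so nothing should be subtracted. Models that treat the clause
# as meaningful instead compute:
flawed = correct - 5                  # 185

print(correct, flawed)  # 190 185
```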
Apple's findings align with broader AI research indicating that LLMs can create the illusion of reasoning. As Apple's results suggest, that illusion breaks down when a problem's structure deviates even slightly from the model's training data. AI expert Gary Marcus argues that the next leap in AI will require genuine symbolic reasoning, in which models manipulate variables abstractly as in algebra; current LLMs are far from achieving that.
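As a rough illustration of what symbolic manipulation means here (a sketch using the SymPy library, not a claim about how future models would actually be built), the kiwi problem can be solved over abstract variables, so the answer holds for any numbers rather than for one memorized instance:

```python
import sympy as sp

# Solve the kiwi problem over abstract variables: a = Friday's count,
# b = Saturday's count, k = Sunday's multiplier of Friday's count.
a, b, k = sp.symbols("a b k", positive=True)

total = a + b + k * a          # the general answer, valid for any values
print(sp.simplify(total))     # a*k + a + b

# Substituting concrete values recovers the specific instance:
print(total.subs({a: 44, b: 58, k: 2}))  # 190
```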
Apple's research offers essential lessons for businesses integrating AI tools. It demonstrates that while LLMs can assist with repetitive or familiar tasks, they are far less reliable for complex decision-making. As Cook puts it, "AI is a supportive tool, but not a replacement for human reasoning." Businesses should ensure human oversight remains central, especially in mission-critical areas.
Apple's transparency about AI limitations reflects a practical approach. For example, despite its AI enhancements, Apple's Vision Pro headset still emphasizes user control and decision-making. Similarly, LLM-powered tools like ChatGPT can aid productivity but require careful monitoring to avoid errors.
The research also warns businesses to be cautious about relying on AI where mathematical accuracy matters. As problem complexity increases, the risk of errors grows, paralleling human tendencies but without the genuine reasoning humans bring. Businesses should view AI as a supplement to human expertise rather than a replacement.
Apple's exploration of AI highlights both the promise and the limits of this transformative technology. By embedding ChatGPT in its services and focusing on privacy-centric AI solutions, Apple aims to enhance the user experience. At the same time, its research reveals the brittleness of LLMs in handling mathematical reasoning and argues for thoughtful AI integration with human oversight.
Apple's ability to recognize the limits of AI reflects a measured, transparent approach to innovation. As the tech landscape evolves, Apple's focus on balancing AI-driven products with realistic expectations will be crucial in building consumer trust. AI may excel in specific areas like photo editing or chat interactions, but the research underscores that human expertise remains irreplaceable for tasks requiring deep reasoning or critical thinking.