Mathematics has long been regarded as a touchstone for artificial intelligence. Once large language models overcome their "inherent deficiencies" (such as weak complex reasoning and inaccurate numerical calculation) and can meet the challenges of mathematical reasoning, artificial intelligence will enter a new era. Enhancing the mathematical reasoning ability of large language models has therefore become a key focus of the global artificial intelligence field.
Exploring the Uncharted Territory of AI Mathematical Reasoning
It has been reported that Xueersi, in collaboration with Google, Jinan University, and other well-known technology companies and universities, has recently organized the AAAI2024 Global Large Model Mathematical Reasoning Competition, leveraging the National New Generation Artificial Intelligence Open Innovation Platform for Smart Education. The competition invites global artificial intelligence experts, developers, and enthusiasts to automatically solve challenging elementary and middle school math problems using large models to explore and address the challenges of artificial intelligence in the field of mathematics.
The AAAI (Association for the Advancement of Artificial Intelligence), founded by computer scientists and AI pioneers Allen Newell, Marvin Minsky, and John McCarthy, is one of the most authoritative and significant associations in the international AI field. The China Computer Federation (CCF) lists the AAAI conference as a Class A conference.
During the competition, participants are required to use large models to generate reasoning steps and answers for given math problems. The organizers will rank the participants based on the accuracy rate by comparing the model-generated answers with the correct answers. The participant with the highest accuracy rate will win the competition.
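The ranking rule described above can be sketched in a few lines of Python. This is only an illustration: the article does not say how the organizers decide that two answers match, so the `normalize` step (treating `0.5` and `1/2` as equal) and all function names here are assumptions, not the competition's actual scoring code.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize an answer so that '0.5', '1/2' and ' 0.50 ' compare equal.

    Assumption: the real matching rules are not published in the article.
    """
    text = answer.strip().replace(" ", "")
    try:
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Not a plain number: fall back to a case-insensitive string match.
        return text.lower()


def accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Share of reference questions answered correctly; missing answers count as wrong."""
    if not references:
        return 0.0
    correct = sum(
        1
        for qid, ref in references.items()
        if qid in predictions and normalize(predictions[qid]) == normalize(ref)
    )
    return correct / len(references)


def rank(submissions: dict[str, dict[str, str]],
         references: dict[str, str]) -> list[tuple[str, float]]:
    """Return (participant, accuracy) pairs, highest accuracy first."""
    scores = [(team, accuracy(preds, references)) for team, preds in submissions.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Because unanswered questions are counted as wrong, a participant cannot raise their accuracy by skipping hard problems.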
To thoroughly explore the mathematical reasoning capabilities of various large models, the competition is divided into two tracks: Chinese Math Problem Solving and English Math Problem Solving. The datasets used in the competition, TAL-SAQ7K-CN and TAL-SAQ6K-EN, are provided by Xueersi. These datasets encompass actual competition questions from multiple primary and secondary school mathematics competitions both domestically and internationally. Each question has been meticulously formatted and includes fields such as content, difficulty level, and a chain of knowledge points ranging from coarse-grained to fine-grained details. Additionally, all mathematical expressions in the TAL-SAQ7K-CN and TAL-SAQ6K-EN datasets have been standardized into LaTeX text format.
The competition consists of two phases. The first phase, the public leaderboard phase, starts today and runs until December 31st. During this phase, the organizers will randomly select 30% of the data from TAL-SAQ7K-CN and TAL-SAQ6K-EN for participants to debug their large models. The second phase, the private leaderboard phase, runs from January 1st to January 10th, 2024. In this phase, participants will use the optimized large models from the first phase to solve the remaining 70% of the dataset. The results from this phase will be considered the final scores of the competition.
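The 30% / 70% division between the public and private leaderboards amounts to a random split of the question set. A minimal sketch, assuming questions are identified by IDs and that a fixed seed is used for reproducibility (the article specifies neither; the function name and parameters are hypothetical):

```python
import random


def split_leaderboards(question_ids, public_fraction=0.3, seed=0):
    """Randomly split question IDs into a public part and a private part.

    Assumption: the organizers' actual sampling procedure is not described;
    this only illustrates a reproducible random 30/70 split.
    """
    ids = sorted(question_ids)           # fixed starting order, independent of input order
    rng = random.Random(seed)            # local generator, leaves global random state alone
    rng.shuffle(ids)
    cut = round(len(ids) * public_fraction)
    return ids[:cut], ids[cut:]          # (public leaderboard, private leaderboard)
```

Every question lands in exactly one of the two parts, so models tuned on the public part are scored on questions they have not been debugged against.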
Furthermore, the organizers have provided three benchmark evaluations for reference: the performance of GPT-3.5, GPT-4, and Xueersi's self-developed math large model MathGPT on the public leaderboard. The specific results are as follows:
Track 1: [results table not included in this text]
Track 2: [results table not included in this text]
Laying the Mathematical Foundation for the Era of Large AI Models
Large models have been one of the hottest areas in artificial intelligence development in recent years, and the emergence of ChatGPT has shown many people the future direction of AI. However, existing large language models exhibit significant shortcomings in solving, explaining, answering, and recommending math problems. For instance, they often make errors when solving math problems and struggle with complex calculations.
As the initiator of this global large model mathematics competition, Xueersi hopes the competition will help explore and address a current weakness: existing models are stronger at humanities-style tasks than at science-related reasoning and calculation. Xueersi is also pursuing its own solutions, such as MathGPT, which combines a large model with a computation engine to tackle three major challenges in mathematics: solving problems correctly, explaining steps clearly, and making content interesting and engaging. For correctness, the large model handles understanding the problem and analyzing it step by step, and invokes the computation engine at appropriate steps to improve accuracy. For clarity, the model is trained on vast amounts of data from expert teachers' problem-solving processes. For engagement, excellent teaching philosophies and methods are introduced into the model.
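The general idea of pairing a language model with a computation engine can be illustrated with a small sketch. This is not MathGPT's implementation, which the article does not describe. Here it is assumed that the model writes a draft solution in which each calculation is marked as `<<expression>>`, and a separate evaluator fills in the exact result. The marker syntax and all names below are hypothetical.

```python
import ast
import operator
import re

# Arithmetic operations the toy engine is willing to perform.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression.strip(), mode="eval").body)


# Hypothetical marker a model might emit where a calculation is needed.
CALL_PATTERN = re.compile(r"<<(.+?)>>")


def fill_in_calculations(draft: str) -> str:
    """Replace every <<expression>> in a draft solution with its computed value."""
    return CALL_PATTERN.sub(lambda m: str(evaluate(m.group(1))), draft)
```

For example, `fill_in_calculations("7 boxes of 12 pencils hold <<12*7>> pencils.")` replaces the marker with the computed product, so the arithmetic comes from the engine instead of from the model's token predictions.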
For example, MathGPT provides answers to sequence problems in three parts: "Analysis," "Detailed Solution," and "Highlight." This approach is more detailed than the rough explanations provided by general large models. "Analysis" offers the problem-solving ideas and thought processes to help users better understand the problem. "Detailed Solution" gives specific calculation methods and answers. Finally, the "Highlight" section points out the key points, difficulties, and critical aspects of the problem, helping users review and reflect on the intention behind the question and apply what they've learned to other problems.
As the first trillion-parameter large model in the field of mathematics in China, MathGPT's mathematical computation abilities cover elementary, middle, and high school levels. It handles various types of questions, such as arithmetic, word problems, and algebra problems, and it can also answer follow-up questions about a given problem. Technical reports show that MathGPT achieved the highest scores on several public mathematical evaluation sets, including CEval-Math, AGIEval-Math, APE5K, CMMLU-Math, Gaokao Math, and Math401. On the C-Eval comprehensive test set for junior and senior high school subjects, MathGPT also performed well.
Additionally, Xueersi has open-sourced MathGPT's model training and testing datasets, TAL-SCQ5K-EN and TAL-SCQ5K-CN (each with 3K training questions and 2K test questions), on platforms such as GitHub and Hugging Face. These datasets consist of multiple-choice questions covering elementary, middle, and high school mathematics, each with detailed solution steps to facilitate chain-of-thought training.
As the organization responsible for building the National New Generation Artificial Intelligence Open Innovation Platform for Smart Education, Xueersi has been actively involved in promoting the development of artificial intelligence technology in China. With the arrival of the large model era, Xueersi aims to draw on its years of experience in mathematics and AI to lay a mathematical foundation for that era, reaching out to math enthusiasts and research institutions worldwide.
This article is reproduced from Pinecone Finance: https://www.163.com/dy/article/IGPG2NS50531KBFR.html