Just two months after the tech world was upended by the DeepSeek-R1 AI model, Alibaba Cloud has introduced QwQ-32B, an open source large language model (LLM).
The Chinese cloud giant describes the new model as “a compact reasoning model” that uses only 32 billion parameters, yet is capable of delivering performance comparable to that of AI models with many more parameters.
On its website, Alibaba Cloud published performance benchmarks which suggest that the new model is comparable to AI models from DeepSeek and OpenAI. These benchmarks include AIME 24 (mathematical reasoning), LiveCodeBench (coding proficiency), LiveBench (test set contamination and objective evaluation), IFEval (instruction-following ability), and BFCL (tool and function-calling capabilities).
Alibaba claimed that, by using continuous reinforcement learning (RL) scaling, the QwQ-32B model demonstrates significant improvements in mathematical reasoning and coding proficiency.
In a blog post, the company said QwQ-32B, which uses 32 billion parameters, achieves performance comparable to DeepSeek-R1, which uses 671 billion parameters. Alibaba said that this shows the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge.
“We have integrated agent-related capabilities into the reasoning model, enabling it to think critically while utilising tools and adapting its reasoning based on environmental feedback,” Alibaba said in the blog post.
Alibaba said QwQ-32B demonstrates the effectiveness of using reinforcement learning (RL) to enhance reasoning capabilities. With this approach to AI training, a reinforcement learning AI agent is able to perceive and interpret its environment, as well as take actions and learn through trial and error. Reinforcement learning is one of several approaches developers use to train machine learning systems. Alibaba used RL to make its model more efficient.
“We have not only witnessed the immense potential of scaled RL, but also recognised the untapped possibilities within pretrained language models,” Alibaba said. “As we work towards developing the next generation of Qwen, we are confident that combining stronger foundation models with RL powered by scaled computational resources will propel us closer to achieving Artificial General Intelligence [AGI].”
Alibaba said it is actively exploring the integration of agents with RL to enable what it describes as “long-horizon reasoning”, which, according to Alibaba, will eventually lead to greater intelligence with inference-time scaling.
The QwQ-32B model was trained using rewards from a general reward model and rule-based verifiers, enhancing its general capabilities. According to Alibaba, these include better instruction-following, alignment with human preferences and improved agent performance.
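To give a sense of what a rule-based verifier can look like, the following is a minimal sketch in Python, not Alibaba's implementation. It assumes a maths task where the reward is 1.0 if the last number in the model's response matches a reference answer and 0.0 otherwise. The function names `extract_final_answer` and `math_verifier_reward` are hypothetical and chosen purely for illustration.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the last number that appears in the response, or None.

    Assumption: the final number in the text is the model's answer.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None


def math_verifier_reward(response: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 for a correct answer, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # no number found, so nothing to verify
    return 1.0 if float(answer) == float(reference) else 0.0


if __name__ == "__main__":
    print(math_verifier_reward("So the total is 42.", "42"))
    print(math_verifier_reward("The answer is 41", "42"))
```

Unlike a learned reward model, a check of this kind is deterministic and cheap to run, but it only works for tasks where correctness can be tested mechanically.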
China’s DeepSeek, which has been generally available since the start of the year, demonstrates the effectiveness of RL in its ability to deliver benchmark results comparable to those of rival US large language models. Its R1 LLM can rival US artificial intelligence without the need to resort to the latest GPU hardware.
The fact that Alibaba’s QwQ-32B model also uses RL is no coincidence. The US has banned the export of high-end AI accelerator chips – such as the Nvidia H100 graphics processor – to China, which means Chinese AI developers have had to look at alternative approaches to making their models work. Using RL does appear to deliver benchmark results comparable to those achieved by models such as OpenAI’s.
What is interesting about the QwQ-32B model is that it uses significantly fewer parameters to achieve similar results to DeepSeek, which effectively means that it should be able to run on less powerful AI acceleration hardware.
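A rough back-of-the-envelope calculation illustrates the point. The sketch below estimates only the memory needed to hold each model's weights, assuming 16-bit precision (two bytes per parameter). That precision is an assumption rather than a figure from Alibaba or DeepSeek, and the estimate ignores activations, caches and any quantisation. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    Assumption: every parameter is stored at the same precision
    (default two bytes, i.e. 16-bit).
    """
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted in the article.
MODELS = {
    "QwQ-32B": 32e9,
    "DeepSeek-R1": 671e9,
}

if __name__ == "__main__":
    for name, params in MODELS.items():
        print(f"{name}: about {weight_memory_gb(params):,.0f} GB of weights")
```

Under these assumptions, QwQ-32B's weights come to roughly 64 GB against about 1,342 GB for DeepSeek-R1, a gap of around 21 times. Actual hardware requirements depend on how each model is deployed.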
 



