Qwen, Alibaba Cloud’s family of open AI models, has been popping up at the top of benchmarks a lot as it’s grown—in power and popularity—over the past year. I’ve been curious about it for a while: open models are a lot more transparent about how they work than the proprietary models you get from OpenAI, Anthropic, and Google. It’s really interesting to get a closer look at how techniques like mixture-of-experts and reasoning are put into action.
I dug into Qwen's latest version, Qwen3, to see how it was trained and how it stacks up against other top-tier AI models. Let's dive in.
What is Qwen3?
Qwen is a family of open AI models from Alibaba Cloud. The latest version is Qwen3, and it’s currently one of the top large language models available.
Here’s what we know about Qwen3:
- All the models are released under an Apache 2.0 license, which means they're suitable for non-commercial, commercial, and research use.
- There are eight models currently available. Two use a mixture-of-experts (MoE) architecture (Qwen3-235B-A22B and Qwen3-30B-A3B), while six use a dense architecture (Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B).
- Qwen3 uses a hybrid approach to problem solving. Instead of releasing two versions of each model, all Qwen3 models have both Thinking Mode (the model takes time to reason through tasks) and Non-Thinking Mode (the model responds quickly with less depth) built in. Users can allocate a reasoning budget for each prompt or task, so the same model can be adapted for performance, speed, or cost (there's a short code sketch of this toggle right after this list).
- Qwen3 models are among the most multilingual models available. They support 119 languages and dialects, including a large variety of non-Indo-European languages.
- Qwen3 has also been optimized for agentic applications. It supports the Model Context Protocol (MCP), so you can use it to connect with and control other applications and tools with the help of a tool like Zapier MCP.
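To make that hybrid thinking setup concrete, here's a minimal sketch using Hugging Face transformers and the smallest checkpoint (Qwen/Qwen3-0.6B). The enable_thinking flag is part of Qwen3's chat template; the prompt and generation settings here are placeholders, so treat this as a starting point rather than production code.

```python
# Minimal sketch of Qwen3's Thinking/Non-Thinking toggle with Hugging Face
# transformers. Assumes the transformers library is installed and uses the
# smallest checkpoint so it can run on modest hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True lets the model reason step by step before answering;
# set it to False for a faster, shallower reply from the same weights.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the reply, including any <think> block).
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```

The Qwen team also documents /think and /no_think soft switches you can drop into a message to flip modes turn by turn, which is handy when only some steps in a conversation need careful reasoning.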
The Qwen3 models
Across the eight models, Qwen3 covers a whole variety of use cases, performance and budget needs, and deployment situations. Here are the main things to know about each model.
Qwen3-235B-A22B
Qwen3-235B-A22B is Qwen3’s flagship model, and it’s currently among the top LLMs in the world. It uses a mixture-of-experts architecture and has 235 billion total parameters with 22 billion active at any time. It has a 128K token context length.
Its performance on benchmarks is within a few points of models like OpenAI o3 and o4-mini, DeepSeek R1, Google Gemini 2.5 Pro and Flash (Reasoning), and Claude 4 Sonnet (Thinking).
As a powerful MoE model with the ability to reason, Qwen3-235B-A22B is best suited for advanced reasoning, math, and coding tasks, as well as agentic applications where its power offsets its complexity and cost to deploy.
Qwen3-30B-A3B
Qwen3-30B-A3B is the other Qwen3 mixture-of-experts model. It has 30 billion total parameters with 3 billion active at any time. It also has a 128K token context length.
Qwen3-30B-A3B's benchmark performance is roughly on par with GPT-4o and Llama 4 Scout. It's suitable for a wide variety of applications where inference budget is a factor, from everyday tasks to more advanced problems and agentic applications that use its reasoning abilities.
Qwen3-32B, Qwen3-14B, and Qwen3-8B
Qwen3-32B, Qwen3-14B, and Qwen3-8B are dense models with 32 billion, 14 billion, and 8 billion parameters, respectively. They have a 128K token context length.
As dense models, Qwen3-32B, Qwen3-14B, and Qwen3-8B are simpler to deploy than the mixture-of-experts models, while still offering high-level performance for their parameter count. They can also use reasoning.
Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B
Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B are dense models with 4 billion, 1.7 billion, and 0.6 billion parameters, respectively. They have a 32K token context length.
Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B's small size makes them a good fit for local, on-device inference. For example, you can get any of them running well on a mid-spec MacBook Pro.
How does Qwen3 work?
The Qwen3 models were pre-trained on a 36 trillion token dataset that included content from 119 languages and dialects. The dataset included almost twice as many tokens as the one used to train Qwen2.5.
To increase the size of the dataset, Qwen2.5 models were used to extract data from PDF-like documents that hadn't previously been included. Like other Chinese-developed AI models, Qwen3 was trained on censored data, so bear in mind that its knowledge may be incomplete on subjects the CCP is uncomfortable with.
The Qwen3 models were pre-trained in three stages.
- Stage 1 used more than 30 trillion tokens and resulted in a model with basic language skills and general knowledge.
- Stage 2 added additional knowledge-intensive data covering STEM, coding, and reasoning tasks to improve performance.
- Stage 3 used high-quality long-context data to create a base model with 32K context length.
The Qwen3 models were post-trained in four stages.
- Stage 1 and Stage 2 used chain-of-thought data to develop reasoning capabilities.
- Stage 3 added non-reasoning data to develop a model that could also respond quickly with fewer resources.
- Stage 4 used reinforcement learning to improve the models' performance, add agentic abilities, and correct undesired behaviors.
At the end of this process, the two foundation models, Qwen3-235B-A22B and Qwen3-32B, were distilled to create Qwen3-30B-A3B and the smaller dense models.
It might not sound like thrilling information, but this kind of transparency is rare. And it’s why I love covering open models: OpenAI, Anthropic, and Google just aren’t sharing these juicy details anymore.
How to get started with Qwen3
There are a few ways you can get started with a Qwen3 model.
Qwen Chat
The two mixture-of-experts models, Qwen3-235B-A22B and Qwen3-30B-A3B, and the highest parameter dense model, Qwen3-32B, are available through the chatbot Qwen Chat.
While it’s a bit rougher around the edges than ChatGPT or Claude, it’s a great way to try out Qwen3. The ability to set your thinking budget using a slider is pretty cool too.
(Reminder that, like all Chinese-made chatbots, Qwen has some data handling and censorship issues.)
APIs
Various Qwen3 models are available as an API through a number of services, including Alibaba Cloud Model Studio, OpenRouter, and Lambda.
Using the API, you can connect Qwen to Zapier, pulling the power of Qwen into the rest of your workflows.
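If you're coming from the OpenAI SDK, most of these services expose an OpenAI-compatible endpoint, so switching is mostly a matter of changing the base URL and model name. Here's a hedged sketch against OpenRouter; the model ID below is an assumption, so check your provider's model list for the exact identifier.

```python
# Hedged sketch: calling a Qwen3 model through an OpenAI-compatible API.
# The base URL is OpenRouter's; the model ID is illustrative, so confirm it
# against the provider's published model list.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",  # placeholder key
)

response = client.chat.completions.create(
    model="qwen/qwen3-30b-a3b",  # assumed model ID; varies by provider
    messages=[
        {"role": "user", "content": "Explain mixture-of-experts in two sentences."}
    ],
)

print(response.choices[0].message.content)
```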
Download and run Qwen3 yourself
Qwen3 is available to download from platforms like Hugging Face and Kaggle. But you’ll need the technical chops to configure it and get it running either locally or on your own server.
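If you do go that route, a minimal first step (assuming the huggingface_hub Python package) is to pull the weights down and then point whatever inference stack you prefer, such as transformers or vLLM, at the local copy.

```python
# Minimal sketch: download a Qwen3 checkpoint from Hugging Face for local use.
# Assumes the huggingface_hub package is installed; swap the repo ID for a
# larger model if your hardware can handle it.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("Qwen/Qwen3-0.6B")
print(f"Model files downloaded to: {local_dir}")
```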
Should you use Qwen3?
Qwen3 really demonstrates that open models can be incredibly powerful. Of course, there are still a few question marks about censorship with all AI models that are developed by Chinese tech companies, but models like Qwen and DeepSeek make it very clear that not all the top models are being developed in America anymore.
While the Qwen3 LLMs are the main Qwen models, you can find older Qwen models (like Qwen2.5), as well as multimodal models, audio models, and other variants built on them, on platforms like Hugging Face.
If this is your first time hearing of Qwen, load up the chatbot, and give it a try.