The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many publicly available chat models on common industry benchmarks.
At its core, Qwen 2.5 is an auto-regressive language model built on an optimized transformer architecture. The Qwen 2.5 collection supports more than 29 languages and offers enhanced role-playing abilities and condition-setting for chatbots.
In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an AWS Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5-Coder and Qwen2.5-Math variants are also supported.
Preparation
Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.
The first time a model is run on Inferentia or Trainium, you compile the model to make sure that you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face, along with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you're using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying, as sketched below. For more information, see Compiling a model for Inferentia or Trainium.
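If compilation is needed, the Optimum Neuron CLI can export the model ahead of time. The following is a minimal sketch; the batch size, sequence length, core count, and output directory are illustrative assumptions that you should tune for your instance:

# Export Qwen2.5-7B-Instruct to a Neuron-compiled artifact (values are illustrative)
optimum-cli export neuron \
  --model Qwen/Qwen2.5-7B-Instruct \
  --batch_size 4 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  qwen2.5-7b-neuron/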
You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.
Option 1: Deploy TGI on Amazon EC2 Inf2
In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)
For this option, you SSH into the instance and create a .env file (where you’ll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you’ll define all of the environment parameters that you’ll need to deploy your model for inference). You can copy the following files for this use case.
- Create a .env file with the following content:
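The following is a minimal example; the model ID, cast type, and batch/token limits are illustrative values to tune for your workload:

# Model to serve and serving limits (values are illustrative)
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096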
- Create a file named docker-compose.yaml with the following content:
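The following is a minimal sketch, assuming the neuronx-tgi image published by Hugging Face, two Neuron cores, and a single Neuron device (/dev/neuron0), as on inf2.xlarge:

services:
  tgi:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8080:80"    # expose TGI on port 8080 of the host
    environment:
      - PORT=80
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
    devices:
      - "/dev/neuron0"    # pass the Neuron device through to the container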
- Use docker compose to deploy the model:
docker compose -f docker-compose.yaml --env-file .env up
- To confirm that the model deployed correctly, send a test prompt to the model:
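For example, using TGI's generate endpoint on the host port mapped above (port 8080 is an assumption carried over from the compose file sketch):

# Send a simple English test prompt to the TGI generate endpoint
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":64}}'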
- To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
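The same endpoint accepts the prompt in Chinese ("什么是深度学习？" translates to "What is deep learning?"):

# Send the same question in Chinese to confirm multilingual output
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"什么是深度学习？","parameters":{"max_new_tokens":128}}'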
Option 2: Deploy TGI on SageMaker
You can also use Hugging Face’s Optimum Neuron library to quickly deploy models to SageMaker directly from the Hugging Face Model Hub, using the instructions provided there.
- From the Qwen 2.5 model card on the Hugging Face Model Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.
- Copy the example code into a SageMaker notebook, then choose Run.
- The notebook you copied will look like the following:
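The following is a sketch of what that generated code looks like; the container version, instance type, and serving parameters are illustrative assumptions rather than the exact snippet the Hub generates:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hub model configuration; batch and token limits are illustrative
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "4000",
    "MAX_TOTAL_TOKENS": "4096",
}

# Create the Hugging Face model backed by the Neuron TGI container
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
    env=hub,
    role=role,
)

# Deploy to an Inferentia2 endpoint; model loading and compilation can take a while
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# Send a test request to the endpoint
print(predictor.predict({"inputs": "What is deep learning?"}))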
Clean up
Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.
Terminate EC2 instances through the AWS Management Console.
Terminate a SageMaker endpoint through the console or with the following commands:
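Using the predictor object from the deployment notebook:

# Delete the deployed model and the SageMaker endpoint
predictor.delete_model()
predictor.delete_endpoint()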
Conclusion
AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.
About the Authors