
How to Run Qwen 2.5 Locally (A Complete Beginner’s Guide)

By Ismail


Running Qwen 2.5 on your own computer can be a great way to explore its capabilities without needing an internet connection.

Whether you’re a developer, a researcher, or simply an AI enthusiast, setting up Qwen 2.5 locally gives you full control over the model and keeps your data on your own machine.

In this guide, we’ll walk you through the process step by step in a simple, easy-to-understand way.

System Requirements

Before you install Qwen 2.5, make sure your computer meets the necessary requirements:

Component         | Minimum Requirement          | Recommended Requirement
Operating System  | Linux (Ubuntu 22.04+)        | Linux; Windows supported via WSL
Processor         | Modern CPU with AVX support  | High-performance CPU
GPU (recommended) | NVIDIA GPU with 8GB VRAM     | NVIDIA GPU with 16GB+ VRAM
Memory (RAM)      | 16GB                         | 32GB+
Storage           | 50GB free space              | 100GB+ on an SSD

If you don’t have a high-end GPU, you can still run Qwen 2.5 on your CPU, but it may be significantly slower.

Step 1: Update Your System

Keeping your system updated ensures compatibility with the required software. Run the following command in your terminal:

sudo apt update && sudo apt upgrade -y

For Windows users, ensure that your WSL and Ubuntu versions are up to date.

Step 2: Install Required Software

To run Qwen 2.5, you need to install Python, Git, and some essential libraries. Use the command below:

sudo apt install -y python3 python3-pip git

If you’re using an NVIDIA GPU, install CUDA and cuDNN for faster processing. You can find installation guides on NVIDIA’s official website.

Step 3: Create a Virtual Environment

A virtual environment keeps the project’s dependencies isolated from the rest of your system. To create and activate one, use the following commands:

python3 -m venv qwen_env
source qwen_env/bin/activate

For Windows users, use:

python -m venv qwen_env
qwen_env\Scripts\activate

Step 4: Install Python Dependencies

Once inside your virtual environment, install the necessary libraries:

pip install torch transformers accelerate sentencepiece

To make future installations easier, save your dependencies:

pip freeze > requirements.txt

Later, you can install them again with:

pip install -r requirements.txt
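
Before downloading the model, it’s worth confirming that the libraries import cleanly and checking whether PyTorch can see your GPU. A quick sanity check:

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # False means the model will run on CPU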

Step 5: Download and Load Qwen 2.5

Now, let’s download and load the Qwen 2.5 model from Hugging Face:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # Choose the appropriate model size

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

print("Model loaded successfully!")

If you don’t have a GPU, modify the loading step:

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu")
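
To confirm where the weights actually ended up, you can print the device map the loader produced (the hf_device_map attribute is set when a model is loaded with a device_map, as above):

# e.g. {"": "cpu"} on CPU, or a per-layer mapping across GPUs
print(model.hf_device_map)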

Step 6: Test the Model

Let’s run a simple test to see if everything is working correctly:

def generate_text(prompt):
    # Send the inputs to the same device as the model (GPU if available, otherwise CPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
print(generate_text("What is the meaning of life?"))

Example output (the exact text will vary between runs):

"The meaning of life is a complex and philosophical question that has been debated for centuries..."

Step 7: Optimize Performance (Optional)

If the model is running slowly, try these optimization methods:

  • Use half-precision (FP16): loading the weights as torch.float16 roughly halves memory usage.
  • Offload the model to disk or CPU: helps if you have limited GPU memory or RAM.
  • Use bitsandbytes quantization or DeepSpeed: these libraries reduce memory usage and can improve speed; see the sketch after this list.
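
A minimal sketch of the first and third options, assuming the bitsandbytes package is installed (pip install bitsandbytes) and an NVIDIA GPU is available for the 4-bit path:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "Qwen/Qwen2.5-7B"

# Option 1: load the weights in half precision (FP16)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Option 2: load the weights quantized to 4 bits via bitsandbytes
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)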

Step 8: Set Up an API for Easy Access (Optional)

If you want to use Qwen 2.5 as a local API, install FastAPI and create a simple API server:

pip install fastapi uvicorn

Create a script named app.py (the uvicorn command below expects this filename):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Reading the prompt from a JSON request body requires a Pydantic model;
# a bare `prompt: str` parameter would be treated as a query parameter instead.
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run the API using: uvicorn app:app --host 0.0.0.0 --port 8000

Testing Your API

Run the following command to send a request:

curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Tell me a joke"}'
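
You can also send the same request from Python; a minimal client, assuming the requests library is installed (pip install requests):

import requests

# Post a prompt to the local endpoint and print the generated text
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Tell me a joke"},
)
print(response.json()["response"])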

Step 9: Troubleshooting Common Issues

Problem: CUDA Out of Memory Error

  • Try a smaller Qwen 2.5 variant, such as Qwen/Qwen2.5-3B or Qwen/Qwen2.5-1.5B.
  • Use torch_dtype=torch.float16 when loading the model.
  • Close unnecessary applications to free up GPU memory.

Problem: Model Not Loading

  • Check that all dependencies are installed correctly.
  • Run pip install --upgrade transformers accelerate and try again.

Problem: Slow Performance

  • Ensure the model is actually running on a GPU rather than falling back to the CPU.
  • Use DeepSpeed or bitsandbytes quantization (see Step 7) to reduce memory pressure.

Final Thoughts

By following this guide, you can successfully install and run Qwen 2.5 on your own system.

Whether you’re using it for AI research, chatbots, or text generation, running it locally gives you full control and better privacy.

With the optional performance tweaks and API deployment, you can customize Qwen 2.5 to meet your needs.

Now, you’re all set to explore the full potential of Qwen 2.5 on your computer!

