Deploy Your First
Open-Source AI Model
A step-by-step guide to running open-source AI models on your own hardware and deploying them to AWS — no API keys, no per-request costs. From Windows or Mac to the cloud.
Getting Set Up
Before running any AI models, let's get your machine ready with the tools you'll need — whether you're on Windows or Mac.
What You'll Need
To run and deploy an open-source AI model, you need just a few tools: Ollama (to download and run models), a terminal (PowerShell on Windows, Terminal on Mac), and the AWS CLI (to interact with Amazon Web Services later). That's it to start.
The best part: you don't need an OpenAI account, an API key, or a credit card to get started. Open-source models run directly on your machine — completely free, completely private.
Installing Ollama
On Windows: Go to ollama.com/download and download the Windows installer. Run it — the installation takes about a minute. Ollama runs as a background service and gives you a command-line tool to manage AI models. After installation, open PowerShell (search for it in the Start menu) and verify it's working.
On Mac: Go to ollama.com/download and download the macOS app, or install via Homebrew with brew install ollama. Open Terminal (Cmd+Space, type Terminal) and verify it's working. On Apple Silicon Macs (M1/M2/M3/M4), Ollama runs models natively on the GPU — no extra setup needed.
ollama --version
Installing the AWS CLI (for Part 4)
On Windows: Download the AWS CLI v2 installer from aws.amazon.com/cli — it's an MSI file that installs with a few clicks. After installation, restart PowerShell and run aws --version to confirm.
On Mac: Install via Homebrew with brew install awscli, or download the .pkg installer from aws.amazon.com/cli. Open a new Terminal window and run aws --version to confirm.
Next, you'll need AWS credentials. Create a free account at aws.amazon.com, go to IAM → Users → Create User, and create a user with programmatic access. Save the Access Key ID and Secret Access Key somewhere safe.
Important: Never share your AWS credentials or commit them to code. These keys can create resources that cost real money.
aws configure
# It will ask for:
# AWS Access Key ID: (paste your key)
# AWS Secret Access Key: (paste your secret)
# Default region name: us-east-1
# Default output format: json
Installing Git & VS Code (Optional)
On Windows: Download Git for Windows from git-scm.com and VS Code from code.visualstudio.com.
On Mac: Git is included with the Xcode Command Line Tools — run xcode-select --install in Terminal to install. Download VS Code from code.visualstudio.com, or install via Homebrew with brew install --cask visual-studio-code.
Neither is strictly required for the core path in this guide, but both are helpful if you want to write code and track changes.
The Open-Source AI Ecosystem
Open models, HuggingFace, model formats, and the tools that tie it all together — the landscape explained.
What is an Open Model?
When you use ChatGPT, your messages go to OpenAI's servers. You're renting access to their model. Open models are AI models whose weights (the trained "brain") are freely downloadable. Companies like Meta (Llama), Mistral, Microsoft (Phi), and Google (Gemma) release these for anyone to use.
This means you can run the model on your own computer or your own server. No API key. No per-request cost. No data leaving your machine. You own the whole stack. The trade-off: you need to provide the compute (your laptop or a cloud server) rather than paying someone else to do it.
What is HuggingFace?
HuggingFace is the GitHub of AI. It's a platform where researchers and companies publish their models, datasets, and demo apps for anyone to download and use. When Meta releases Llama or Google releases Gemma, the model weights go on HuggingFace.
The HuggingFace Hub hosts over 500,000 models — not just text models, but image generators, speech recognizers, translation models, and more. Each model has a page with documentation, benchmarks, a try-it-now demo, and download links. Think of it as an app store for AI models.
HuggingFace also develops Transformers, the most popular open-source library for working with AI models in Python. If you see `from transformers import ...` in someone's code, that's HuggingFace. We won't need Transformers in this guide (Ollama handles everything), but you'll encounter it everywhere once you dig deeper.
How Models Get to Your Machine
There's a pipeline from research lab to your laptop, and understanding it helps you navigate the ecosystem:
1. Training: A company like Meta trains a model on massive GPU clusters for weeks. The result is a set of model weights — a giant file of numbers (tens of GB).
2. Release on HuggingFace: The company uploads the weights in their original format (usually SafeTensors or PyTorch format). This is the "full precision" version — large and requires a powerful GPU to run.
3. Community quantization: Community members convert the weights into compressed formats like GGUF that run on consumer hardware. Quantization shrinks a 16GB model to 4-5GB with minimal quality loss. Popular quantizers like TheBloke and bartowski on HuggingFace do this for nearly every popular model.
4. Ollama library: Ollama packages these quantized models into its own registry. When you run `ollama pull llama3.1`, it downloads a pre-quantized GGUF version optimized for your hardware. This is the easiest path — no manual downloading or conversion needed.
Model Sizes & What You Can Run
Model size is measured in parameters — the number after the name (8B, 70B) means billions of parameters. More parameters generally means smarter, but needs more memory (RAM or VRAM).
An 8B model needs about 5GB of memory and runs well on most modern laptops. A 70B model needs 40+ GB and requires serious hardware. For your first model, we'll use something in the 3-8B range — small enough for any modern PC or Mac, smart enough to be genuinely useful.
Models are often quantized — compressed to use less memory with minimal quality loss. A tag like Q4_K_M means 4-bit quantization with a specific method. Lower numbers = smaller file but slightly less accurate. Q4 is the sweet spot for most people — good quality, fits on consumer hardware.
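To make the memory math concrete, here's a back-of-envelope sketch. The 4.7 bits-per-weight figure for Q4_K_M is an approximation (real GGUF files vary a little with architecture and metadata):

```python
# Back-of-envelope only: real quantized files vary slightly; Q4_K_M
# averages roughly 4.7 bits per weight (an assumption, not a spec).
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-file size: parameter count x bits per weight, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(model_size_gb(8, 16))   # FP16 8B model: 16.0 GB
print(model_size_gb(8, 4.7))  # Q4_K_M 8B model: ~4.7 GB
```

That ~4.7 GB estimate lines up with why the Llama 3.1 8B download is a few gigabytes rather than sixteen.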
Model Formats Explained
You'll see several file formats mentioned in the AI world. Here's what they mean:
GGUF — The format used by Ollama and llama.cpp. Designed to run on CPUs and consumer GPUs. Contains the model weights and all metadata in a single file. This is what you'll use most often.
SafeTensors — The standard format for full-precision models on HuggingFace. Used by the Transformers library and for training. These are the "original" model files before quantization.
ONNX — A cross-platform format for running models in production. Used more in traditional ML than large language models, but good to know about.
PyTorch (.pt / .bin) — The older format you'll still see on some HuggingFace pages. Being replaced by SafeTensors for security reasons (SafeTensors can't contain executable code).
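The security point is concrete: PyTorch's older .pt/.bin files are Python pickles, and unpickling can execute code chosen by whoever made the file. This stdlib-only sketch shows the mechanism with a harmless call (`str.upper` stands in for something malicious like `os.system`):

```python
import pickle

class Payload:
    """Any class can tell pickle to call a function of the file author's choosing."""
    def __reduce__(self):
        # On unpickling, pickle calls str.upper("pwned") -- a malicious model
        # file could specify os.system or anything else here instead.
        return (str.upper, ("pwned",))

loaded = pickle.loads(pickle.dumps(Payload()))
print(loaded)  # "PWNED" -- the call ran during deserialization
```

SafeTensors avoids this entire class of attack because loading it only reads tensor data; there is nothing in the format that can execute.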
Model Licenses — What You Can Do
Not all "open" models are equally open. Pay attention to the license:
Apache 2.0 / MIT — Fully open. Use commercially, modify, redistribute, no restrictions. Models: Mistral 7B, Phi-3, Falcon.
Llama License — Free for most uses including commercial, but Meta requires you to request access and agree to terms. You can't use Llama to train other models that compete with Meta's products. Models: Llama 3.1, Llama 3.2.
Non-commercial / Research only — Some models restrict commercial use. Always check before building a product on top of a model.
Choosing the Right Model
With thousands of models available, how do you pick? Start with your constraints:
Limited hardware (8GB RAM or less)? Try Phi-3 Mini (3.8B) or Gemma 2 2B — they run fast and need minimal resources.
General-purpose on decent hardware? Llama 3.1 8B is the community default for a reason. Great at chat, coding, reasoning, and creative writing.
Code generation? CodeLlama or DeepSeek Coder are fine-tuned specifically for programming tasks.
Want to compare? The Open LLM Leaderboard on HuggingFace benchmarks models across reasoning, math, coding, and knowledge tasks. It's the best way to compare models objectively before downloading one.
Popular Models
Llama 3.1
Meta
Meta's flagship open model. The 8B version runs on most hardware and punches above its weight for coding, reasoning, and conversation.
Phi-3 Mini
Microsoft
Microsoft's small-but-mighty model. Fits on almost any machine and runs fast. Perfect for a first experiment when you want instant results.
Mistral 7B
Mistral AI
From the French AI lab Mistral. Excellent performance-per-parameter ratio. A strong alternative to Llama for general-purpose tasks.
Gemma 2
Google
Google's open model family. The 2B version is one of the smallest usable models — great for resource-constrained environments or as a fast utility model.
Tools & Platforms
HuggingFace Hub
Model Platform
The central hub for open-source AI. Browse, download, and try models. Every major open model is published here first. Think of it as the GitHub + npm of AI.
Ollama
Model Runner
The easiest way to run open models. One command to download and run any supported model. Provides a local API you can call from code — identical to OpenAI's format.
GGUF & llama.cpp
Model Format & Engine
GGUF is the file format, llama.cpp is the engine. Together they let you run large language models on consumer CPUs and GPUs. Ollama is built on top of llama.cpp.
Transformers
HuggingFace Library
The most popular Python library for AI models. Load any model from HuggingFace with a few lines of code. More flexible than Ollama but requires more setup.
Running Your First Model Locally
Time to run an AI model on your own machine — no API key, no internet required after download, no cost per request.
Pulling and Running a Model
With Ollama installed, running a model is a single command. Ollama downloads the model the first time (a few GB), then starts an interactive chat session. It's that simple.
The download only happens once. After that, the model loads from disk in seconds. Try it now — open your terminal (PowerShell on Windows, Terminal on Mac) and run one of the commands below.
# Download and run Llama 3.1 (8B) — about 4.7GB download
ollama run llama3.1
# Or if your machine is lower-spec, try the smaller Phi-3
ollama run phi3
Chatting with Your Model
Once the model loads, you'll see a `>>>` prompt. Type anything and press Enter. The model generates a response right on your machine — no data leaves your computer.
Try asking it to explain something, write a poem, or help with code. To exit, type /bye. To see what models you've downloaded, run ollama list.
# List downloaded models
ollama list
# Pull a model without starting chat
ollama pull mistral
# Remove a model to free disk space
ollama rm phi3
# Show model details (size, parameters, quantization)
ollama show llama3.1
The Ollama API
Ollama doesn't just run in the terminal — it also starts a local API server on http://localhost:11434. Any program on your computer can send requests to it, just like calling a web service.
This is the key insight for deployment: if you can call this API locally, you can run the exact same thing on a cloud server and call it from anywhere in the world.
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "What is cloud computing in one sentence?"}],
"stream": false
}'

# Windows PowerShell equivalent:
Invoke-RestMethod -Uri "http://localhost:11434/api/chat" `
-Method POST `
-ContentType "application/json" `
-Body '{"model": "llama3.1", "messages": [{"role": "user", "content": "What is cloud computing in one sentence?"}], "stream": false}'
Using Python with Ollama (Optional)
If you want to write a Python script that talks to your model, install the openai Python library. Because Ollama's API is OpenAI-compatible, you can use the same code pattern — just point it at localhost instead of OpenAI's servers.
This means code written for Ollama locally works the same way against your cloud deployment later — zero code changes. This works identically on Windows and Mac.
from openai import OpenAI
# Point the client at your local Ollama server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Ollama doesn't need a real key
)
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "user", "content": "Explain AWS EC2 like I'm 10."}
]
)
print(response.choices[0].message.content)
Deploying to AWS
Your model runs locally — now let's put it on an EC2 instance so anyone can reach it over the internet.
Why EC2 for AI Models?
Running an AI model requires a machine with enough RAM and ideally a GPU. AWS EC2 gives you a virtual machine in the cloud that you fully control — think of it as renting a powerful computer by the hour.
We'll use a g4dn.xlarge instance — it has an NVIDIA T4 GPU with 16GB VRAM, enough to run 7-8B models comfortably. It costs about $0.53/hr on-demand. Stop the instance when you're not using it — you only pay while it's running.
Step 1: Launch an EC2 Instance
Go to the AWS Console → EC2 → Launch Instance. Name it something like my-ai-server. For the AMI (operating system), choose Ubuntu 22.04 LTS. For instance type, search for g4dn.xlarge and select it.
Under Key Pair, create a new one and download the .pem file — you'll need it to connect. In Network Settings, allow SSH (port 22) and add a custom rule for port 11434 (Ollama's API) or port 80 (if using Nginx). Click Launch.
Cost warning: GPU instances cost money while running. Always Stop your instance from the Console when done. A stopped instance costs nearly nothing (just storage).
Step 2: Connect to Your Instance
Once the instance is running (1-2 minutes), find its Public IPv4 address in the EC2 Console. Connect using SSH — it's built into both Windows 10/11 (PowerShell) and macOS (Terminal).
# On Mac: fix key file permissions (required on first use)
chmod 400 ~/Downloads/my-key-pair.pem
# Connect to your EC2 instance
ssh -i ~/Downloads/my-key-pair.pem ubuntu@<your-ec2-public-ip>

# On Windows (PowerShell): fix key file permissions (required on first use)
icacls "C:\Users\Will\Downloads\my-key-pair.pem" /inheritance:r /grant:r "$($env:USERNAME):(R)"
# Connect to your EC2 instance
ssh -i "C:\Users\Will\Downloads\my-key-pair.pem" ubuntu@<your-ec2-public-ip>
Step 3: Install Ollama on EC2
Once connected via SSH, you're on the Ubuntu server in the cloud. Install Ollama with a single command — the same tool, just on Linux. Then pull your model. Ollama auto-detects the GPU.
# Install Ollama (one command on Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull your model (downloads to the server)
ollama pull llama3.1
# Quick test — verify the GPU is detected
ollama run llama3.1 "Say hello in exactly 5 words"
Step 4: Expose the API
By default, Ollama only listens on localhost. To make it reachable from the internet, configure it to listen on all network interfaces. The quickest method is changing an environment variable. For a more robust setup, use Nginx as a reverse proxy.
# Option A: Quick — change the bind address
sudo systemctl edit ollama
# Add these two lines in the editor that opens:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
# Save and exit, then restart:
sudo systemctl restart ollama
# Option B: Nginx reverse proxy (recommended)
sudo apt update && sudo apt install nginx -y
sudo tee /etc/nginx/sites-available/ollama << 'NGINX'
server {
listen 80;
server_name _;
location / {
proxy_pass http://localhost:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_read_timeout 300s;
}
}
NGINX
sudo ln -sf /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/default
sudo nginx -t && sudo systemctl restart nginx
Step 5: Test Your Cloud API
Back on your local machine, open your terminal and send a request to your EC2 instance's public IP. If you get a response, congratulations — you're running your own AI model in the cloud, accessible as an API.
# If using Nginx (port 80):
curl http://<your-ec2-ip>/api/chat -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello from the cloud!"}],
"stream": false
}'
# If using direct Ollama (port 11434):
curl http://<your-ec2-ip>:11434/api/chat -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello from the cloud!"}],
"stream": false
}'

# Windows PowerShell equivalents:
# If using Nginx (port 80):
Invoke-RestMethod -Uri "http://<your-ec2-ip>/api/chat" `
-Method POST `
-ContentType "application/json" `
-Body '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello from the cloud!"}], "stream": false}'
# If using direct Ollama (port 11434):
Invoke-RestMethod -Uri "http://<your-ec2-ip>:11434/api/chat" `
-Method POST `
-ContentType "application/json" `
-Body '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello from the cloud!"}], "stream": false}'
The Deployment Flow
Run Locally
Ollama on Windows or Mac
Install Ollama and run a model locally. Chat in the terminal and test the local API on localhost:11434.
Launch EC2 Instance
AWS EC2 + GPU
Launch a g4dn.xlarge GPU instance in the AWS Console running Ubuntu. Download your SSH key pair.
Connect via SSH
SSH from your terminal
SSH into your EC2 instance from PowerShell or Terminal. You're now on a remote Linux server with an NVIDIA T4 GPU.
Install Ollama on EC2
Ollama on Linux
Run the same one-line installer on your EC2 instance. Pull your chosen model. Ollama auto-detects the GPU.
Expose the API
Nginx reverse proxy
Configure Ollama to listen on all interfaces or set up Nginx as a reverse proxy. Open the port in Security Groups.
Call from Anywhere
REST API
Send HTTP requests to your EC2 public IP from any machine. Your own AI API, running your own model, on your own server.
Going Further
You've deployed your own AI model — here's where to take it next.
Saving Money: Spot Instances & Scheduling
On-demand GPU instances add up. Spot Instances let you bid on unused AWS capacity at up to 70% off — a g4dn.xlarge can drop to ~$0.16/hr instead of $0.53/hr. The trade-off: AWS can reclaim the instance with two minutes' notice, which is acceptable for dev and testing.
For a personal project, use AWS Instance Scheduler or a simple cron job to automatically stop your instance at night and start it in the morning. A stopped instance costs nearly nothing.
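To make the savings concrete, here's a quick cost sketch using the prices quoted above (spot prices fluctuate by region and demand, so treat the numbers as illustrative):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(rate_per_hr: float, hours_on_per_day: float = 24) -> float:
    """Monthly bill if the instance runs hours_on_per_day out of every 24."""
    return rate_per_hr * (hours_on_per_day / 24) * HOURS_PER_MONTH

print(f"24/7 on-demand:   ${monthly_cost(0.53):.0f}/mo")    # ~$387
print(f"8h/day on-demand: ${monthly_cost(0.53, 8):.0f}/mo")  # ~$129
print(f"8h/day spot:      ${monthly_cost(0.16, 8):.0f}/mo")  # ~$39
```

Scheduling alone cuts the bill by roughly two thirds; combining it with spot pricing brings a GPU dev box down to tens of dollars a month.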
Adding HTTPS & a Domain
Right now your API is on a bare IP address over HTTP. For production use, register a domain, point it at your EC2 instance with Route 53, and add a free HTTPS certificate with Let's Encrypt / Certbot. This takes about 15 minutes and makes your API trustworthy and shareable.
Trying Bigger Models
The 8B model is a great start, but bigger models are noticeably smarter. A g5.2xlarge (A10G GPU, 24GB VRAM) can run 13B models. For 70B models you need multi-GPU setups like g5.12xlarge (4x A10G). Costs go up, but so does capability.
Alternatively, explore quantized models — Ollama supports different quantization levels out of the box. Lower quantization (like q4) uses less memory with minimal quality loss, letting you fit bigger models on smaller instances.
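As a rough way to reason about the trade-off, here's a sketch that checks whether a q4-quantized model fits a GPU's VRAM. The ~0.6 GB per billion parameters and ~2 GB overhead figures are ballpark assumptions, not guarantees:

```python
# Rule of thumb (an assumption, not a guarantee): q4 weights take roughly
# 0.6 GB per billion parameters, plus ~2 GB overhead for KV cache / runtime.
def fits(params_b: float, vram_gb: float,
         gb_per_b: float = 0.6, overhead_gb: float = 2.0) -> bool:
    """True if a q4-quantized model should fit in the given VRAM."""
    return params_b * gb_per_b + overhead_gb <= vram_gb

print(fits(8, 16))   # 8B on a T4 (g4dn.xlarge, 16 GB VRAM): True
print(fits(70, 24))  # 70B on an A10G (g5.2xlarge, 24 GB VRAM): False
```

Headroom matters: a model that barely fits leaves little room for longer contexts, so treat borderline cases as a signal to pick a larger instance.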
Security Best Practices
Never leave port 11434 open to 0.0.0.0/0 on a real project. Restrict your EC2 Security Group rules to your own IP, add API key authentication via Nginx, or put the instance in a private subnet behind a load balancer.
Consider using AWS Systems Manager Session Manager instead of SSH — no need to open port 22 or manage .pem key files. Enable CloudWatch monitoring to watch for unusual traffic or cost spikes.
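As one concrete way to do the API-key idea above, here's a minimal variation of the Nginx config from Part 4. This is a sketch, not hardened security — the key travels in plain text unless you also add HTTPS, and the `X-API-Key` header name and key value are made-up placeholders:

```nginx
server {
    listen 80;
    server_name _;

    location / {
        # Reject any request that doesn't carry the expected header value
        # (replace the placeholder with a long random string of your own)
        if ($http_x_api_key != "change-me-to-a-long-random-string") {
            return 401;
        }
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}
```

Clients then send the key with each request, e.g. `curl -H "X-API-Key: ..." http://<your-ec2-ip>/api/chat ...`.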
Production Alternatives
Ollama is perfect for learning and small projects. For production workloads with many concurrent users, consider: vLLM (high-throughput serving with continuous batching), text-generation-inference by HuggingFace (optimized for HF models), or SageMaker endpoints (fully managed, auto-scaling).
Each adds complexity but solves real production problems like handling hundreds of simultaneous requests efficiently.
~$0.53/hr
g4dn.xlarge On-Demand (T4 GPU)
~2 sec
First Token Latency (8B, T4 GPU)
$0/mo
When Instance Is Stopped