Build Your Own AI Coding Assistant for Free with Google Colab

Want to use AI for coding, but GPUs are expensive and you're worried about commercial services reading your proprietary source code? Here's a free compromise worth learning.

Why Go Through the Trouble of Self-Hosting a Language Model?

Writing code without AI these days feels like losing a hand. Commercial models like Claude and GPT-4 are powerful, but they share one critical weakness: your code goes to someone else's servers, and control stays with the big corporations.

If you're working on internal company systems, projects with sensitive logic, or simply don't want to hand your work over to a tech giant as training data, self-hosting an open-source model is the most direct way out.

Three reasons self-hosting is great:

Full privacy: Your source code never leaves your control.
Free your wallet: No more counting tokens or watching your bill climb.
Total freedom: Fine-tune or modify however you want, whenever you want.

The Harsh Reality: No GPU, No Party?

The biggest barrier to self-hosting is hardware.

A single NVIDIA RTX 4090 costs upward of NT$50,000, and even then it's barely enough to run a reasonably capable model. Renting cloud GPUs? That monthly bill hurts too.

If you're just working on occasional personal projects and want to experiment, this isn't money worth spending.

The Savior: Google Colab's Free T4 GPU

Thankfully, Google offers a generous free tier. Colab's free plan usually gives you a T4 GPU (16 GB VRAM), though availability isn't guaranteed at busy times. It'll boot you off if you leave it idle too long, but for light usage it's more than enough.

Here's the honest comparison:

Option	Cost	Hardware	Best For
Buy a GPU	NT$50,000+ gone	RTX 4090	Hardcore enthusiasts
Cloud GPU rental	~$50 USD/month	A100-class	Enterprises / GPU rich
Google Colab Free	$0	T4 16GB	Light / experimental use
Colab Pro	~$10.49 USD/month	A100 / High-RAM	People tired of DRAM crashes

Of course, free comes with trade-offs:

Idle too long and it disconnects automatically.
Every restart means downloading and loading the model again (caching the weights on a mounted Google Drive at least skips the download).
Absolutely not suitable as a 24/7 production service.

But for "occasionally opening it while coding on personal projects," it's genuinely good.

Choosing a Brain: Google's Open-Source Gemma Series

Gemma is Google's open-source model — think of it as Gemini's little sibling.

Model variants at a glance:

Model	Parameters	Practical Notes
gemma-2-2b-it	2B	Ultra-lightweight. Runs effortlessly on the free T4.
gemma-2-9b-it	9B	The sweet spot — actually smart. Free Colab's 12 GB DRAM may crash; use 4-bit quantization.
gemma-2-27b-it	27B	Smartest, but the free T4 can't handle it. Requires Pro + A100.
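As a back-of-the-envelope check on the table above, weight memory is roughly parameter count times bytes per parameter. The sketch below uses the nominal sizes from the model names (2B, 9B, 27B) and ignores activations, the KV cache, and quantization overhead, so treat the results as lower bounds. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# Nominal parameter counts taken from the model names (an approximation)
for name, size in [("gemma-2-2b-it", 2), ("gemma-2-9b-it", 9), ("gemma-2-27b-it", 27)]:
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{name}: ~{fp16:.1f} GB in fp16, ~{int4:.1f} GB in 4-bit")
```

By this estimate the 9B model needs about 18 GB in fp16, more than the T4's 16 GB, but only about 4.5 GB in 4-bit. The 27B model still needs about 13.5 GB in 4-bit, which leaves almost no headroom on a T4.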

💡 Quick tip: Models with -it in the name are Instruction-Tuned — already trained to follow instructions and hold a conversation. Never pick a base model by accident; that version just predicts the next token and can't hold a real dialogue.
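To make the difference concrete: an instruction-tuned Gemma model expects each turn wrapped in special turn markers, which a base model was never trained on. The sketch below builds that format by hand purely for illustration. In real code, prefer the tokenizer's `apply_chat_template`, which also takes care of the BOS token. `format_gemma_prompt` is a hypothetical helper, not part of any library.

```python
def format_gemma_prompt(messages: list[dict]) -> str:
    """Render user/assistant turns in Gemma's turn-marker format (BOS token omitted)."""
    parts = []
    for m in messages:
        # Gemma names the assistant role "model"
        role = "model" if m["role"] == "assistant" else m["role"]
        parts.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    # Leave an open model turn so generation continues as the assistant
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

print(format_gemma_prompt([{"role": "user", "content": "Write a hello world in Go."}]))
```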

For this guide, we'll use gemma-2-2b-it — the lightest option and least likely to cause problems.

Let's Build: Running Your Own API Server on Colab

How the system works:

Your local machine (OpenCode)
        ↕ HTTPS
    ngrok public URL (punches through the firewall)
        ↕
  FastAPI server running on Colab
        ↕
   Neural network (Gemma) crunching away on the T4 GPU

Your local coding assistant (OpenCode) connects to Colab in the cloud via a temporary ngrok URL. The FastAPI server on Colab acts as a middleman — it sends your request to Gemma, gets the response, and sends it back.

Step 1: Install the dependencies

Open a new notebook in Colab and run:

!pip install fastapi uvicorn pyngrok nest_asyncio transformers accelerate bitsandbytes

Step 2: Load Gemma with 4-bit quantization

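Set model_id = "google/gemma-2-2b-it" (note: Gemma 2 comes in 2B, 9B, and 27B sizes; there is no 7B variant, so don't mix it up with first-generation Gemma). We load it with bitsandbytes 4-bit quantization to keep both VRAM and DRAM (system RAM) usage low. The 2B model would fit on the T4 even without it, but the same setup carries over if you later try the 9B model, where quantization is what makes the difference.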

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Correct model ID for Gemma 2 small
model_id = "google/gemma-2-2b-it"

# 4-bit quantization config: keeps memory usage low (essential for larger models like 9B)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Accept Gemma's terms of use on Hugging Face first, then paste your token here
HF_TOKEN = "your_HUGGING_FACE_TOKEN"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    token=HF_TOKEN
)

Step 3: Build an OpenAI-compatible API with FastAPI

Why OpenAI's format? Because most AI coding tools, including many VS Code extensions and OpenCode (what we're using today), can talk to any endpoint that speaks OpenAI's API format. We fake one inside Colab and everything connects without extra adapters.

from fastapi import FastAPI, Response
from pydantic import BaseModel
from typing import List, Optional, Any
import json, asyncio

app = FastAPI()

class Message(BaseModel):
    role: str
    content: Optional[str] = None

class CompletionRequest(BaseModel):
    model: str = "gemma-2-2b-it"
    messages: List[Message]
    stream: bool = False

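def generate_response(messages: list) -> str:
    # Grab the most recent user message that has text (content may be missing)
    user_text = next(
        (m["content"] for m in reversed(messages)
         if m["role"] == "user" and m.get("content")),
        "",
    )
    # Wrap it in Gemma's chat template so the instruction-tuned model
    # sees the turn markers it was trained on
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_text}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # The template already includes the BOS token, so don't add it a second time
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
    # Keep only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()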

@app.get("/v1/models")
def list_models():
    return {
        "object": "list",
        "data": [{"id": "gemma-2-2b-it", "object": "model", "owned_by": "google"}]
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: CompletionRequest):
    messages = [m.dict(exclude_none=True) for m in request.messages]
    loop = asyncio.get_event_loop()
    reply = await loop.run_in_executor(None, generate_response, messages)

    message = {"role": "assistant", "content": reply}

    # Simulate streaming output to satisfy frontend plugins
    if request.stream:
        sse  = f"data: {json.dumps({'id':'chatcmpl-1','object':'chat.completion.chunk','model':request.model,'choices':[{'index':0,'delta':message,'finish_reason':None}]})}\n\n"
        sse += f"data: {json.dumps({'id':'chatcmpl-1','object':'chat.completion.chunk','model':request.model,'choices':[{'index':0,'delta':{},'finish_reason':'stop'}]})}\n\n"
        sse += "data: [DONE]\n\n"
        return Response(content=sse, media_type="text/event-stream")

    return {
        "id": "chatcmpl-1",
        "object": "chat.completion",
        "model": request.model,
        "choices": [{"index": 0, "message": message, "finish_reason": "stop"}]
    }
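To see what a streaming client does with the simulated SSE body from the `request.stream` branch, here is a minimal parser sketch. It assumes the same chunk shape the server emits (one `data:` line per chunk, ending with `[DONE]`). `collect_sse_content` is an illustrative helper, and real clients handle more edge cases.

```python
import json

def collect_sse_content(sse_body: str) -> str:
    """Concatenate the delta content from an OpenAI-style SSE stream."""
    pieces = []
    for line in sse_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        pieces.append(delta.get("content") or "")
    return "".join(pieces)

# Hand-written sample in the same shape as the server's simulated stream
sample = (
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", '
    '"content": "Hi there"}, "finish_reason": null}]}\n\n'
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}\n\n'
    "data: [DONE]\n\n"
)
print(collect_sse_content(sample))
```

Because the server sends the whole reply in a single chunk, the client sees the text arrive all at once rather than token by token. The format is valid, but it isn't true incremental streaming.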

Step 4: Use ngrok to punch a public URL through the firewall

Your local machine can't reach Colab directly because it's inside Google's private network. Enter ngrok — a tunnel tool that generates a public URL for us.

from pyngrok import ngrok
import uvicorn, threading, nest_asyncio

nest_asyncio.apply()

# Sign up for a free ngrok account to get your auth token
ngrok.set_auth_token("your_NGROK_TOKEN")
public_url = ngrok.connect(8000)
print(f"🔗 Your personal API URL: {public_url.public_url}")

def run_server():
    uvicorn.run(app, host="0.0.0.0", port=8000)

threading.Thread(target=run_server, daemon=True).start()

After running this, you'll see something like:
🔗 Your personal API URL: https://xxxx-xxxx.ngrok-free.app

Copy it — you'll need it for the local setup below.
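Before wiring up any editor tooling, it can help to confirm the tunnel works with a quick request from your local machine. The sketch below uses only the Python standard library. `build_request` and `ask` are hypothetical helper names, and the URL in the last comment is a placeholder for whatever the Colab cell printed.

```python
import json
import urllib.request

def build_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST to the chat completions endpoint behind the ngrok URL."""
    body = {"model": "gemma-2-2b-it", "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(base_url: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(base_url, prompt), timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Replace with the URL printed by the Colab cell, then uncomment:
# print(ask("https://xxxx-xxxx.ngrok-free.app", "Say hello in one sentence."))
```

The generous timeout is deliberate: the first request after loading the model can take a while on a free T4.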

Local Setup: Connect OpenCode to Your Cloud Brain

OpenCode is a free, open-source terminal AI assistant — think of it as a no-cost Claude Code that lets you swap in any model you want.

1. Install OpenCode locally

Open your terminal (CMD or Terminal) and run the command below. npm ships with Node.js, so install that first if you don't have it:

npm install -g opencode-ai

2. Edit the config file

Find your config at ~/.config/opencode/opencode.jsonc (Windows: C:\Users\your-username\.config\opencode\opencode.jsonc) and set it to:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "gemma": {
      "api": "openai",
      "options": {
        "baseURL": "https://your-ngrok-url/v1", // ⚠️ Don't forget the /v1 at the end!
        "apiKey": "dummy" // Any value works — we didn't set auth
      },
      "models": {
        "gemma-2-2b-it": {
          "id": "gemma-2-2b-it",
          "limit": {
            "context": 8192,
            "output": 1024
          }
        }
      }
    }
  },
  "model": "gemma/gemma-2-2b-it"
}
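A missing /v1 suffix is an easy mistake to make here, so a small sanity check can save some head-scratching. The sketch below strips `//` line comments outside of strings (so the `https://` in the URL survives), parses the result as JSON, and checks the suffix. It is a simplified stand-in for a real JSONC parser and handles neither block comments nor trailing commas. Both helper names are invented for this example.

```python
import json

def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON strings (block comments not handled)."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])  # keep the escaped character as-is
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line, leaving the newline in place
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)

def check_base_url(config_text: str, provider: str = "gemma") -> str:
    """Return the provider's baseURL, raising if it lacks the /v1 suffix."""
    config = json.loads(strip_line_comments(config_text))
    base_url = config["provider"][provider]["options"]["baseURL"]
    if not base_url.rstrip("/").endswith("/v1"):
        raise ValueError(f"baseURL should end with /v1, got: {base_url}")
    return base_url

sample = '''{
  "provider": {
    "gemma": {
      "options": {
        "baseURL": "https://xxxx-xxxx.ngrok-free.app/v1", // tunnel URL
        "apiKey": "dummy"
      }
    }
  }
}'''
print(check_base_url(sample))
```

To check your real config, read the file's text with `pathlib.Path(...).read_text()` and pass it to `check_base_url`.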

Launch and Test

In your local terminal, just type:

opencode

Start chatting. Type "what can you do?" and if the terminal spits back a Gemma response, congratulations: you've successfully built a fully private AI assistant that's entirely your own.

Honest Assessment: Where Does This Setup Fall Short?

After the fun, time for reality. This "completely free" setup has a few genuine limitations:

The 2B model has limited intelligence: gemma-2-2b-it is fast, but on complex code or logic it often starts hallucinating.
No hands (no tool calls): The server above only passes chat text through and ignores tool definitions, so the model can't read or edit files on your machine directly. It can only talk, and you have to copy-paste manually.
Slower inference: On a free T4 with 4-bit quantization, expect responses to be noticeably slower than a paid OpenAI API call, roughly 3–5× as a ballpark.

💡 Where to go from here

If you're tired of the 2B model's short memory and want something that can actually read files, fix bugs autonomously, and think several levels deeper, try swapping the model to Alibaba's open-source Qwen2.5-Coder-7B-Instruct.

That said, running 7B or 9B models will likely require the Colab Pro plan at ~$10.49/month with High-RAM mode enabled — otherwise your DRAM will crash on you constantly.

How to upgrade to Qwen2.5-Coder and get the AI to actually edit your files? We'll cover that next time!