Build a RAG System with Assisters API
Build a production-ready Retrieval-Augmented Generation system with the Assisters API.
TL;DR
Use assisters-embed-v1 for embeddings (1024 dimensions, 100+ languages), assisters-rerank-v1 to improve retrieval quality, and assisters-chat-v1 for generation. Complete RAG pipeline with 3 API calls. Cost: ~$0.05 per 1000 queries.
Retrieval-Augmented Generation (RAG) combines your knowledge base with AI to answer questions accurately. This tutorial shows you how to build a production RAG system using Assisters' proprietary models.
Why Assisters for RAG?
Multilingual Embeddings
assisters-embed-v1 supports 100+ languages with 1024-dimensional vectors optimized for semantic search
Smart Reranking
assisters-rerank-v1 dramatically improves retrieval quality by reordering results by relevance
Advanced Chat
assisters-chat-v1 with 128K context window handles large retrieved contexts
Cost Effective
Embeddings at $0.01/M tokens and reranking at $0.02/M tokens, a fraction of typical competitor pricing
Architecture Overview
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Query    │────▶│ assisters-embed  │────▶│  Vector Search  │
└─────────────┘     └──────────────────┘     └─────────────────┘
                                                      │
                                                      ▼
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Answer    │◀────│ assisters-chat   │◀────│ assisters-rerank│
└─────────────┘     └──────────────────┘     └─────────────────┘

Prerequisites
You'll need an Assisters API key (from assisters.dev/dashboard) and a Python or Node.js environment.
Step 1: Set Up Your Environment
pip install openai chromadb

from openai import OpenAI
# Initialize Assisters client
client = OpenAI(
    api_key="ask_your_api_key",  # Get from assisters.dev/dashboard
    base_url="https://api.assisters.dev/v1"
)

Step 2: Create Embeddings for Your Documents
Use assisters-embed-v1 to convert your documents into vectors:
import chromadb

# Initialize ChromaDB
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")

# Your documents
documents = [
    "Assisters API provides 9 proprietary AI models for developers.",
    "The assisters-chat-v1 model supports 128K context with streaming.",
    "Embeddings are priced at $0.01 per million tokens.",
    "Rate limits start at 10 RPM for free tier, 500 RPM for Startup.",
    # Add your documents here...
]

# Generate embeddings with Assisters
def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="assisters-embed-v1",  # Assisters multilingual embeddings
        input=texts
    )
    return [item.embedding for item in response.data]

# Index documents
embeddings = get_embeddings(documents)
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"Indexed {len(documents)} documents")import OpenAI from 'openai';
import { ChromaClient } from 'chromadb';

const client = new OpenAI({
  apiKey: 'ask_your_api_key',
  baseURL: 'https://api.assisters.dev/v1'
});

const chroma = new ChromaClient();

async function getEmbeddings(texts) {
  const response = await client.embeddings.create({
    model: 'assisters-embed-v1',
    input: texts
  });
  return response.data.map(item => item.embedding);
}

// Index your documents
const documents = [
  "Assisters API provides 9 proprietary AI models for developers.",
  // ... more documents
];

const embeddings = await getEmbeddings(documents);
const collection = await chroma.createCollection({ name: 'knowledge_base' });
await collection.add({
  documents,
  embeddings,
  ids: documents.map((_, i) => `doc_${i}`)
});
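Real documents are usually much longer than these one-line examples, so a common extra step is to split them into overlapping chunks before embedding. A minimal word-based splitter in Python (chunk_size, overlap, and long_documents are illustrative choices, not Assisters defaults):

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Chunk long source texts, then embed the chunks exactly as above
chunks = [c for doc in long_documents for c in chunk_text(doc)]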
Step 3: Retrieve Relevant Documents
Search your vector database using query embeddings:
def retrieve(query: str, top_k: int = 10) -> list[str]:
    # Embed the query with Assisters
    query_embedding = get_embeddings([query])[0]

    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]
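A quick sanity check, assuming the documents indexed above (at this stage the ordering reflects raw vector similarity only, before any reranking):

candidates = retrieve("What are the rate limits?")
print(candidates[0])  # likely the rate-limit document from Step 2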
Step 4: Rerank for Better Quality
Pro Tip: Reranking with assisters-rerank-v1 typically improves answer quality by 15-30% compared to vector search alone. It's the secret weapon for production RAG systems.
import requests

def rerank(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Rerank documents using Assisters Rerank model."""
    response = requests.post(
        "https://api.assisters.dev/v1/rerank",
        headers={
            "Authorization": "Bearer ask_your_api_key",
            "Content-Type": "application/json"
        },
        json={
            "model": "assisters-rerank-v1",
            "query": query,
            "documents": documents,
            "top_n": top_k
        }
    )
    results = response.json()["results"]
    # Return documents sorted by relevance score
    return [documents[r["index"]] for r in results]
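One judgment call worth making before production: what should happen when the rerank call fails. A minimal sketch that degrades gracefully (rerank_safe is a hypothetical wrapper, not part of the API):

def rerank_safe(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Rerank, falling back to vector-search order on any failure."""
    try:
        return rerank(query, documents, top_k)
    except Exception:
        # Fallback: keep the similarity ordering from retrieve()
        return documents[:top_k]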
Step 5: Generate the Answer
Use assisters-chat-v1 to generate answers from retrieved context:
def generate_answer(query: str, context: list[str]) -> str:
    """Generate answer using Assisters Chat model."""
    # Build context string
    context_str = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(context)])

    response = client.chat.completions.create(
        model="assisters-chat-v1",  # Assisters flagship chat model
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions based on the provided context.

Rules:
- Only use information from the context
- If the answer isn't in the context, say "I don't have that information"
- Cite sources using [1], [2], etc."""
            },
            {
                "role": "user",
                "content": f"""Context:
{context_str}

Question: {query}

Answer:"""
            }
        ],
        temperature=0.3,  # Lower temperature for factual answers
        max_tokens=500
    )
    return response.choices[0].message.content
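With assisters-chat-v1's 128K context window, five short documents are no problem, but if you retrieve many long chunks you may still want a budget. A rough character-based guard (max_chars is an illustrative proxy for tokens, not an Assisters limit):

def truncate_context(docs: list[str], max_chars: int = 12000) -> list[str]:
    """Keep whole documents until a rough character budget is used up."""
    kept, used = [], 0
    for doc in docs:
        if used + len(doc) > max_chars:
            break
        kept.append(doc)
        used += len(doc)
    return kept

# Usage: generate_answer(query, truncate_context(docs))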
Step 6: Complete RAG Pipeline
Put it all together:
def rag_query(query: str) -> str:
    """Complete RAG pipeline using Assisters API."""
    # Step 1: Retrieve candidates
    candidates = retrieve(query, top_k=10)

    # Step 2: Rerank for quality
    relevant_docs = rerank(query, candidates, top_k=5)

    # Step 3: Generate answer
    answer = generate_answer(query, relevant_docs)
    return answer

# Example usage
question = "What models does Assisters offer?"
answer = rag_query(question)
print(answer)
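For manual testing, you can wrap rag_query in a small interactive loop (press Enter on an empty line to exit):

# Minimal manual-testing loop
while True:
    q = input("\nAsk a question: ")
    if not q.strip():
        break
    print(rag_query(q))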
Streaming for Better UX
For real-time responses, enable streaming:
def rag_query_streaming(query: str):
    """RAG with streaming responses."""
    candidates = retrieve(query, top_k=10)
    relevant_docs = rerank(query, candidates, top_k=5)
    context_str = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(relevant_docs)])

    # Stream the response
    stream = client.chat.completions.create(
        model="assisters-chat-v1",
        messages=[
            {"role": "system", "content": "Answer based on context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context_str}\n\nQuestion: {query}"}
        ],
        stream=True
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

# Real-time output
rag_query_streaming("How much do embeddings cost?")
Cost Estimation

| Component | Model | Cost per 1000 Queries |
|---|---|---|
| Query Embedding | assisters-embed-v1 | $0.0001 |
| Reranking | assisters-rerank-v1 | $0.001 |
| Generation | assisters-chat-v1 | $0.05 |
| Total | | ~$0.05 |
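To project spend for your own traffic, scale the per-1000-query figures from the table. A quick back-of-envelope helper (rates are the table values above; actual cost varies with prompt and answer length):

# Per-1000-query rates from the table above (USD)
RATES = {"embed": 0.0001, "rerank": 0.001, "generate": 0.05}

def monthly_cost(queries_per_day: int) -> float:
    """Rough monthly cost estimate, assuming 30 days of traffic."""
    per_query = sum(RATES.values()) / 1000
    return queries_per_day * 30 * per_query

print(f"${monthly_cost(10_000):.2f}/month at 10k queries/day")  # ~$15.33/month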
Assisters pricing is 5-10x cheaper than comparable solutions while maintaining quality. See pricing for details.