Build a RAG System with Assisters API
Build a production-ready Retrieval-Augmented Generation system with the Assisters API.
TL;DR
Use assisters-embed-v1 for embeddings (1024 dimensions, 100+ languages), assisters-rerank-v1 to improve retrieval quality, and assisters-chat-v1 for generation. Complete RAG pipeline with 3 API calls. Cost: ~$0.05 per 1000 queries.
Retrieval-Augmented Generation (RAG) combines your knowledge base with AI to answer questions accurately. This tutorial shows you how to build a production RAG system using Assisters' proprietary models.
Why Assisters for RAG?
Multilingual Embeddings
assisters-embed-v1 supports 100+ languages with 1024-dimensional vectors optimized for semantic search
Smart Reranking
assisters-rerank-v1 dramatically improves retrieval quality by reordering results by relevance
Advanced Chat
assisters-chat-v1 with 128K context window handles large retrieved contexts
Cost Effective
Embeddings at $0.01/M tokens and reranking at $0.02/M tokens, a fraction of typical competitor pricing
Architecture Overview
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Query    │────▶│ assisters-embed  │────▶│  Vector Search  │
└─────────────┘     └──────────────────┘     └─────────────────┘
                                                      │
                                                      ▼
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Answer    │◀────│ assisters-chat   │◀────│ assisters-rerank│
└─────────────┘     └──────────────────┘     └─────────────────┘

Prerequisites
You'll need an Assisters API key (from assisters.dev/dashboard) and a Python or Node.js environment.
Step 1: Set Up Your Environment
pip install openai chromadb

from openai import OpenAI
# Initialize Assisters client
client = OpenAI(
    api_key="ask_your_api_key",  # Get from assisters.dev/dashboard
    base_url="https://api.assisters.dev/v1"
)

Step 2: Create Embeddings for Your Documents
Use assisters-embed-v1 to convert your documents into vectors:
import chromadb

# Initialize ChromaDB
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")

# Your documents
documents = [
    "Assisters API provides 9 proprietary AI models for developers.",
    "The assisters-chat-v1 model supports 128K context with streaming.",
    "Embeddings are priced at $0.01 per million tokens.",
    "Rate limits start at 10 RPM for free tier, 500 RPM for Startup.",
    # Add your documents here...
]

# Generate embeddings with Assisters
def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="assisters-embed-v1",  # Assisters multilingual embeddings
        input=texts
    )
    return [item.embedding for item in response.data]

# Index documents
embeddings = get_embeddings(documents)
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"Indexed {len(documents)} documents")import OpenAI from 'openai';
import { ChromaClient } from 'chromadb';

const client = new OpenAI({
  apiKey: 'ask_your_api_key',
  baseURL: 'https://api.assisters.dev/v1'
});

const chroma = new ChromaClient();

async function getEmbeddings(texts) {
  const response = await client.embeddings.create({
    model: 'assisters-embed-v1',
    input: texts
  });
  return response.data.map(item => item.embedding);
}

// Index your documents
const documents = [
  "Assisters API provides 9 proprietary AI models for developers.",
  // ... more documents
];

const embeddings = await getEmbeddings(documents);
const collection = await chroma.createCollection({ name: 'knowledge_base' });
await collection.add({
  documents,
  embeddings,
  ids: documents.map((_, i) => `doc_${i}`)
});
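Real documents are usually much longer than these one-line examples, so a common extra step is to split them into overlapping chunks before embedding. A minimal word-based splitter in Python (chunk_size, overlap, and long_documents are illustrative choices, not Assisters defaults):

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Chunk long source texts, then embed the chunks exactly as above
chunks = [c for doc in long_documents for c in chunk_text(doc)]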
Step 3: Retrieve Relevant Documents
Search your vector database using query embeddings:
def retrieve(query: str, top_k: int = 10) -> list[str]:
    # Embed the query with Assisters
    query_embedding = get_embeddings([query])[0]

    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]
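A quick sanity check, assuming the documents indexed above (at this stage the ordering reflects raw vector similarity only, before any reranking):

candidates = retrieve("What are the rate limits?")
print(candidates[0])  # likely the rate-limit document from Step 2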
Step 4: Rerank for Better Quality
Pro Tip: Reranking with assisters-rerank-v1 typically improves answer quality by 15-30% compared to vector search alone. It's the secret weapon for production RAG systems.
import requests

def rerank(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Rerank documents using Assisters Rerank model."""
    response = requests.post(
        "https://api.assisters.dev/v1/rerank",
        headers={
            "Authorization": "Bearer ask_your_api_key",
            "Content-Type": "application/json"
        },
        json={
            "model": "assisters-rerank-v1",
            "query": query,
            "documents": documents,
            "top_n": top_k
        }
    )
    results = response.json()["results"]
    # Return documents sorted by relevance score
    return [documents[r["index"]] for r in results]
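One judgment call worth making before production: what should happen when the rerank call fails. A minimal sketch that degrades gracefully (rerank_safe is a hypothetical wrapper, not part of the API):

def rerank_safe(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Rerank, falling back to vector-search order on any failure."""
    try:
        return rerank(query, documents, top_k)
    except Exception:
        # Fallback: keep the similarity ordering from retrieve()
        return documents[:top_k]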
Step 5: Generate the Answer
Use assisters-chat-v1 to generate answers from retrieved context:
def generate_answer(query: str, context: list[str]) -> str:
    """Generate answer using Assisters Chat model."""
    # Build context string
    context_str = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(context)])

    response = client.chat.completions.create(
        model="assisters-chat-v1",  # Assisters flagship chat model
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions based on the provided context.

Rules:
- Only use information from the context
- If the answer isn't in the context, say "I don't have that information"
- Cite sources using [1], [2], etc."""
            },
            {
                "role": "user",
                "content": f"""Context:
{context_str}

Question: {query}

Answer:"""
            }
        ],
        temperature=0.3,  # Lower temperature for factual answers
        max_tokens=500
    )
    return response.choices[0].message.content
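With assisters-chat-v1's 128K context window, five short documents are no problem, but if you retrieve many long chunks you may still want a budget. A rough character-based guard (max_chars is an illustrative proxy for tokens, not an Assisters limit):

def truncate_context(docs: list[str], max_chars: int = 12000) -> list[str]:
    """Keep whole documents until a rough character budget is used up."""
    kept, used = [], 0
    for doc in docs:
        if used + len(doc) > max_chars:
            break
        kept.append(doc)
        used += len(doc)
    return kept

# Usage: generate_answer(query, truncate_context(docs))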
Step 6: Complete RAG Pipeline
Put it all together:
def rag_query(query: str) -> str:
    """Complete RAG pipeline using Assisters API."""
    # Step 1: Retrieve candidates
    candidates = retrieve(query, top_k=10)

    # Step 2: Rerank for quality
    relevant_docs = rerank(query, candidates, top_k=5)

    # Step 3: Generate answer
    answer = generate_answer(query, relevant_docs)
    return answer

# Example usage
question = "What models does Assisters offer?"
answer = rag_query(question)
print(answer)
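For manual testing, you can wrap rag_query in a small interactive loop (press Enter on an empty line to exit):

# Minimal manual-testing loop
while True:
    q = input("\nAsk a question: ")
    if not q.strip():
        break
    print(rag_query(q))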
Streaming for Better UX
For real-time responses, enable streaming:
def rag_query_streaming(query: str):
    """RAG with streaming responses."""
    candidates = retrieve(query, top_k=10)
    relevant_docs = rerank(query, candidates, top_k=5)
    context_str = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(relevant_docs)])

    # Stream the response
    stream = client.chat.completions.create(
        model="assisters-chat-v1",
        messages=[
            {"role": "system", "content": "Answer based on context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context_str}\n\nQuestion: {query}"}
        ],
        stream=True
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

# Real-time output
rag_query_streaming("How much do embeddings cost?")
Cost Estimation

| Component | Model | Cost per 1000 Queries |
|---|---|---|
| Query Embedding | assisters-embed-v1 | $0.0001 |
| Reranking | assisters-rerank-v1 | $0.001 |
| Generation | assisters-chat-v1 | $0.05 |
| Total | | ~$0.05 |
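To project spend for your own traffic, scale the per-1000-query figures from the table. A quick back-of-envelope helper (rates are the table values above; actual cost varies with prompt and answer length):

# Per-1000-query rates from the table above (USD)
RATES = {"embed": 0.0001, "rerank": 0.001, "generate": 0.05}

def monthly_cost(queries_per_day: int) -> float:
    """Rough monthly cost estimate, assuming 30 days of traffic."""
    per_query = sum(RATES.values()) / 1000
    return queries_per_day * 30 * per_query

print(f"${monthly_cost(10_000):.2f}/month at 10k queries/day")  # ~$15.33/month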
Assisters pricing is 5-10x cheaper than comparable solutions while maintaining quality. See pricing for details.