Services / 03

Hire a RAG developer to ground LLMs in your company knowledge.

Production RAG (Retrieval-Augmented Generation) systems — grounded in your private documents, with hybrid search, re-ranking and cited answers. Used in PatFace to retrieve prior art before drafting patents. Pinecone, ChromaDB, Weaviate, Qdrant, pgvector.

§ 01

What you get

Production-grade pipelines

Ingestion, chunking, embedding, indexing, query, re-ranking, generation, citation. Every stage observable, every stage swappable as the frontier moves.

Hybrid search

Dense retrieval (embeddings) + BM25 keyword + cross-encoder re-rank. Beats either approach alone on real corpora.

Smart chunking

Semantic chunking tuned to your document type — clause-level for contracts, section-level for research, symbol-level for code.

Cited answers

Every answer links back to source passages. No ungrounded output. Audit trail for regulated industries.

Evaluation harness

Offline: retrieval recall, MRR, answer-quality scoring. Online: feedback collection and query analytics. You know when the RAG is working and when it isn’t.

Self-hosted option

For sensitive data: self-hosted LLMs (Llama, Mistral) plus self-hosted vector DBs (Qdrant, Weaviate). No data leaves your VPC.

§ 02

How I build it

  1. Corpus review
     We look at a sample of your documents: formats, volumes, update cadence, structure. This determines the chunking and storage strategy.
  2. Pipeline v1
     Ingestion, embedding, indexing, retrieval, generation. Minimal UI for you to test queries on real data within a week.
  3. Quality engineering
     Chunking tuning, hybrid search, re-ranking, evaluation suite. The biggest quality gains happen here.
  4. Guardrails
     Citation enforcement, grounded-only generation, refusal behaviour for out-of-corpus queries. No hallucinated answers.
  5. Integration & deploy
     Embedded into your product, Slack bot, helpdesk, internal portal. Monitoring, cost dashboards, handover docs.

§ 03

Stack used

Vector DBs

Pinecone · ChromaDB · Weaviate · Qdrant · pgvector · FAISS · Milvus

Embeddings

OpenAI text-embedding-3-large · Cohere embed-v3 · Voyage · BGE · E5 · self-hosted sentence-transformers

Re-ranking

Cohere rerank · Jina reranker · cross-encoders (MS MARCO) · custom rerankers

Frameworks

LangChain · LlamaIndex · Haystack · DSPy · custom pipelines

Doc parsing

Unstructured · LlamaParse · PyMuPDF · pdfplumber · OCR via Tesseract/Textract

Eval

Ragas · TruLens · LangSmith · custom eval harnesses

§ 04

Engagements

Prototype RAG

$3,000 – $6,000

Single-corpus RAG with basic retrieval and a query UI. 1–2 weeks.

Production RAG

$10,000 – $22,000

Hybrid search, re-ranking, eval suite, citations, dashboard, deployed. 4–6 weeks.

Enterprise RAG

$25,000+

Multi-corpus, per-tenant, self-hosted LLM option, access controls, audit logs. 6–12 weeks.

Fixed-price or monthly retainer. NDA and IP-assignment standard. Hourly available on request.

§ 05

Frequently asked

What is a RAG system?

RAG (Retrieval-Augmented Generation) is an architecture that grounds a large language model in your private documents. At query time, the system retrieves the most relevant passages from your data, feeds them to the LLM as context, and generates an answer with citations. It often removes the need to fine-tune a model on your data.
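
The retrieve-then-generate flow can be sketched in a few lines of pure Python. Everything here is illustrative: the toy 3-dimensional vectors stand in for real embedding-model output, and the LLM call itself is replaced by prompt assembly.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=2):
    """Rank stored passages by similarity to the query embedding."""
    scored = sorted(index, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return scored[:top_k]

def build_prompt(question, passages):
    """Assemble a grounded prompt: the LLM may only use the numbered context."""
    context = "\n".join(f"[{i + 1}] {p['text']}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the context below. Cite passages as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Toy index: in production the vectors come from an embedding model
# and live in a vector database, not a Python list.
index = [
    {"text": "Refunds are issued within 14 days.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Shipping takes 3-5 business days.", "vec": [0.1, 0.9, 0.0]},
]
hits = retrieve([0.8, 0.2, 0.1], index, top_k=1)
prompt = build_prompt("What is the refund window?", hits)
```

The prompt hands the model only retrieved passages, which is what makes the answer attributable to specific sources.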

When should I use RAG instead of fine-tuning?

Use RAG when your data changes often, when you need citations, or when you want to restrict the model’s knowledge to a specific corpus. Fine-tuning is appropriate when you need the model to learn a style or behaviour, not facts. 95% of enterprise AI projects I build are RAG, not fine-tuning.

What vector database should I use: Pinecone, ChromaDB, Weaviate, Qdrant or pgvector?

Pinecone for serverless and operational simplicity. Qdrant or Weaviate for self-hosted with advanced filtering. ChromaDB for local/prototype. pgvector if you already run Postgres and your corpus is under 10M vectors. I make the call based on volume, operational appetite, and cost.

How much does a RAG system cost to build and operate?

A production RAG pipeline over 100k–1M documents typically costs $6,000–$18,000 to build. Operating costs: LLM API (~$0.002–$0.02 per query), vector DB ($70–$500/month depending on provider), embedding refresh for new documents. A small business deployment often runs under $300/month total.

How do you handle chunking for RAG?

Semantic chunking based on document structure, not fixed token windows. For contracts: clause-level. For research papers: section-level with sliding overlap. For code: symbol-level. The chunking strategy is decided per corpus, not globally — it’s the biggest determinant of retrieval quality.
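
A minimal sketch of what clause-level chunking can look like, assuming clauses are numbered like "1." or "2.1." at the start of a line; real contracts need parser-backed structure detection rather than one regex.

```python
import re

def chunk_by_clause(contract_text):
    """Split a contract at clause headings (e.g. '1.', '2.1.') rather than
    at fixed token windows, keeping each clause intact as one chunk."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\.\s)", contract_text)
    return [p.strip() for p in parts if p.strip()]

doc = """1. Definitions. Terms used herein...
2. Term. This agreement runs for 12 months.
2.1. Renewal. It renews automatically.
"""
chunks = chunk_by_clause(doc)  # one chunk per clause, boundaries preserved
```

The zero-width lookahead split keeps the clause heading attached to its body, so no text is lost at the boundaries.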

Do you use hybrid search?

Yes. Dense retrieval alone misses exact-match terms (product SKUs, clause numbers, IDs). I combine BM25 keyword search with dense embeddings and re-rank with a cross-encoder. On real corpora this consistently beats either approach alone.
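
One common way to merge the keyword and dense rankings is Reciprocal Rank Fusion. This sketch uses hypothetical document IDs, and note it is a rank-merging step, not the cross-encoder re-ranker mentioned above (a cross-encoder re-scores each query-passage pair; RRF only combines rank positions).

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists (e.g. BM25 and dense)
    into one ranking without tuning score scales against each other."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["sku-4711", "faq-12", "doc-3"]   # exact-match keyword hits
dense_hits = ["doc-3", "faq-12", "doc-9"]     # semantic neighbours
fused = rrf_fuse([bm25_hits, dense_hits])
```

Documents that appear in both lists rise to the top, while exact-match-only hits like a product SKU still survive into the candidate set handed to the re-ranker.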

How do you evaluate a RAG system?

Offline: retrieval recall, MRR, and LLM-judged answer quality on a held-out set of queries. Online: thumbs-up/down feedback, time-to-answer, escalation rate. Every RAG I ship has an evaluation suite the client can re-run.
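
The offline retrieval metrics are simple to compute; the query set below is a hypothetical example with made-up document IDs.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant passages found in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(queries):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0

evalset = [
    (["d2", "d7", "d1"], {"d1"}),   # first relevant hit at rank 3
    (["d5", "d8", "d9"], {"d5"}),   # first relevant hit at rank 1
]
score = mrr(evalset)  # (1/3 + 1) / 2
```

Answer-quality scoring (the LLM-judged part) sits on top of these retrieval metrics, so regressions can be traced to retrieval or to generation separately.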

Can you build RAG that cites sources?

Yes. Every generated answer includes inline citations linking back to the source document and passage. This is standard — I refuse to ship RAG that can’t cite its evidence.
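
A sketch of what citation enforcement can look like at the formatting layer, with a hypothetical `policy.pdf#p3` source locator: any answer span that arrives without a supporting source is rejected rather than rendered.

```python
def render_with_citations(answer_spans):
    """Render an answer where every claim carries a citation back to its
    source document and passage; spans with no source are refused."""
    lines, sources = [], []
    for text, src in answer_spans:
        if src is None:
            raise ValueError(f"ungrounded claim: {text!r}")
        sources.append(src)
        lines.append(f"{text} [{len(sources)}]")
    refs = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return "\n".join(lines) + "\n\nSources:\n" + refs

out = render_with_citations([
    ("Refunds are issued within 14 days.", "policy.pdf#p3"),
])
```

Rejecting ungrounded spans at render time is the last line of defence; the grounded-only generation prompt upstream is what keeps such spans rare.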

Ready to build?

Same-week start. Email reply within 24 hours. Written enquiries welcome.