How to Organize a Repository for an LLM Agent

Is more complex better? I don’t think so. Here are five tiers of repository organization:

Tier 0 — flat files. Everything in context. Prototypes, configs, small projects under 20 files.

Tier 1 — text search + CLAUDE.md. How every AI coding agent works. Code projects up to 500 files.

Tier 2 — docs-as-code. Structured documentation for teams. Stripe, Kubernetes, Django — no RAG needed.

Tier 3 — LLM wiki, the Karpathy method. An LLM compiles a wiki from raw sources. Hundreds of documents.

Tier 4 — wiki + RAG + knowledge graph. Semantic search and entity relationships. 500+ sources.

Don’t move to the next tier if the current one works.

LLM agents today solve radically different problems. One writes code in a ten-thousand-file repository. Another researches five hundred scientific papers. A third maintains documentation for two hundred people. No single approach to knowledge organization fits all three. I went through all five tiers on my own project (a university course on AI with hundreds of sources, dozens of artifacts, and a single author), and below I’ll explain what works at which scale.

In 2024–2025, while the industry was building complex RAG pipelines and knowledge graphs, Cursor soared to $100M ARR with an approach built on an embedding index and text search over code. Not because text search is better than RAG — but because for code, it’s the right tool. Specifically for code.

In the world of knowledge organization for agents, people make two symmetrical mistakes. Some underinvest: 500 documents plus text search equals chaos — nothing gets found. Others overinvest: as Paul Hoke described, a developer deleted 2,000 lines of RAG code and accuracy jumped to 94%.

There is no “best” way to organize knowledge for an LLM agent. There are five tiers, each the best answer for its type of task and scale. Move to the next one only when the current tier breaks on a specific pain point. Context windows of all major models in 2026 have reached a million tokens and beyond — Gemini, Claude, Llama, GPT — and this shifts the threshold at which search infrastructure is even justified.

Tier 0: Everything Fits in Context — and That’s Great

Google NotebookLM lets you upload up to 50 sources and ask questions about them. Claude Projects from Anthropic is a feature where you add files to a “project” and the agent works with them in their entirety. Tens of millions of users. No RAG, no vector indexes. Just files in context. This isn’t an MVP — it’s a production architecture.

The Core Idea

All files are loaded entirely into the LLM’s context window. No search, no indexing. With 20 files of 200 lines each, that’s roughly 16,000 tokens (assuming about four tokens per line; dense prose or code can run several times higher), or 1.6% of a 1M-token Claude window. As the Ahoi Kapptn team writes: “If your knowledge base is under 200K tokens (~500 pages), include it entirely in the prompt.”
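As a minimal sketch of this tier, the whole "retrieval" step can be a loop that concatenates every text file in the folder and checks a rough size estimate. The helper names, the suffix list, and the four-characters-per-token heuristic are assumptions for illustration, not part of any specific tool; the 50,000-token default mirrors the rough boundary this article gives for Tier 0.

```python
from pathlib import Path

# Assumption: ~4 characters per token is a rough heuristic for English text
# and code; real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text):
    """Return a rough token estimate for a string."""
    return len(text) // CHARS_PER_TOKEN


def build_flat_context(root, suffixes=(".md", ".py", ".yaml", ".txt")):
    """Concatenate every matching file directly under root into one prompt string."""
    parts = []
    for path in sorted(Path(root).iterdir()):
        if path.is_file() and path.suffix in suffixes:
            body = path.read_text(encoding="utf-8", errors="replace")
            # A header per file lets the model say where a fact came from.
            parts.append(f"=== {path.name} ===\n{body}")
    return "\n\n".join(parts)


def fits_budget(context, budget_tokens=50_000):
    """True if the estimated size is within the chosen token budget."""
    return estimate_tokens(context) <= budget_tokens
```

Binary formats such as PDF are skipped here on purpose; they would need text extraction before they can be placed in a prompt.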

Where this works perfectly: load 10 articles and ask questions — get synthesis with zero minutes of setup. A prototype with 5 files — the agent sees everything, accuracy is maximal. 15 infrastructure project configs — full context, zero latency. My AI course started exactly this way: two dozen files, everything fit in context, and the agent found what it needed instantly.

Example Structure

my-project/
  notes.md                 # notes, ideas, drafts
  data-analysis.py         # all code — 3-5 files
  config.yaml
  research-paper-1.pdf     # all sources right in the root
  research-paper-2.pdf

When to Move On

One day you notice the agent starting to “forget” information. Research from Stanford and UC Berkeley (Liu et al., 2023) demonstrated the lost-in-the-middle effect: accuracy drops by 30% or more when relevant information lands in the middle of the context. Another study found that the effective context of all models on complex tasks turned out to be far smaller than advertised. The boundary: roughly 20 files or 50,000 tokens. If you feel this pain — time for the next tier. If not — stay put, you’re in the right place.

Pattern	Anti-pattern
All files in one folder, no nesting	Setting up RAG for 5 documents
Maximally flat structure	Dumping 100 files into context “just in case”
Zero infrastructure, zero setup	Creating a folder hierarchy for 10 files

Tier 1: Text Search + CLAUDE.md — How Every AI Coding Agent Works

Cursor. Claude Code. Windsurf. None of them require developers to spin up a vector database. All use text search as their core infrastructure. As BuildMVPFast writes: “Text search has quietly become the load-bearing infrastructure for how AI writes code.”

The Core Idea

At this tier, the project has a CLAUDE.md (or AGENTS.md, .cursorrules) that explains the codebase structure and conventions to the agent. The agent reads CLAUDE.md and understands the lay of the land — which directories are responsible for what, what naming conventions are in use. When a task arrives, the agent searches by keywords, finds the right files, then reads them in full for complete context. The directory structure itself becomes a navigation map.

At Tier 0, the agent sees everything but doesn’t know what matters. CLAUDE.md provides priorities. Search lets the agent read only the files it needs rather than loading all 500 into context. AGENTS.md is already standardized by the Linux Foundation, supported by OpenAI, Anthropic, Google, AWS, and Bloomberg. Over 60,000 repositories include it. As HumanLayer notes: “A CLAUDE.md written in 30 minutes gives the agent 80% of the context it needs.” To get started — create a CLAUDE.md and describe the architecture, key conventions, and how to run and test the project.

Text search objectively outperforms semantic search for exact matches. As ast-grep notes: ERROR_4532 in vector space is indistinguishable from ERROR_4533 — yet these are completely different errors. My AI course moved to this tier when sources exceeded twenty — search over exported documents was fast and accurate.
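The search-then-read loop can be sketched in a few lines. This is an illustration under assumptions (the function names and the suffix filter are hypothetical, and real agents typically shell out to a tool such as grep or ripgrep rather than scanning in Python), but it shows why exact matching keeps ERROR_4532 and ERROR_4533 apart.

```python
import re
from pathlib import Path


def find_files(root, keyword, suffixes=(".py", ".md")):
    """Return files under root that contain keyword as a whole identifier."""
    # Lookarounds stop ERROR_4532 from matching inside ERROR_45321.
    pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            if pattern.search(text):
                hits.append(path)
    return hits


def read_in_full(paths):
    """Map each hit to its complete contents, the way the agent reads whole files."""
    return {str(p): p.read_text(encoding="utf-8", errors="replace") for p in paths}
```

Calling `read_in_full(find_files("my-repo", "ERROR_4532"))` would return only the files that mention that exact code, ready to be placed in context.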

Example Structure

my-repo/
  CLAUDE.md              # ← instructions for the agent: architecture, conventions
  AGENTS.md              # standardized rules (can be used instead of CLAUDE.md)
  src/                   # project code
  tests/                 # tests alongside the code
  docs/
    architecture.md      # keep documentation next to the code
    adr/
      001-use-postgres.md  # architectural decisions in ADR format

When to Move On

You have 300 code files and search works great. Then a task comes in: find all GDPR requirements across research notes, legal documents, and meeting transcripts. Searching for the word “GDPR” finds 5 out of 20 relevant documents; the rest talk about “personal data”, “privacy regulation”, “data processing”. This is the synonymy (vocabulary mismatch) problem: one concept, dozens of names. You don’t need a better search engine; you need structured navigation. The boundary: roughly 500 files, predominantly code. For non-code knowledge such as PDFs, regulations, and research, this model doesn’t work.

Pattern	Anti-pattern
CLAUDE.md with architecture and conventions	Hoping the agent will “figure it out”
Consistent naming conventions	Different styles in different parts of the project
AGENTS.md + separate .md files per subdirectory	One giant 2,000-line CLAUDE.md
Text search for code and identifiers	Text search for concepts in prose

Tier 2: Docs-as-Code — Structured Documentation for Teams

This tier is for projects where documentation is created by people for people, and the AI agent gets quality navigation for free. Stripe docs, Kubernetes (3,000+ pages), Django, Terraform — they serve millions of developers without RAG and have no plans to switch. As Mintlify notes: “At Stripe, a feature isn’t considered shipped until the documentation is written.”

The Core Idea

Documentation is organized by content type. The Diátaxis framework divides it into 4 types: tutorials, how-to guides, reference, and explanation. When search finds the word “authentication” in 15 files, an agent without content typing has to read all 15. With Diátaxis, it goes straight to how-to/configure-auth.md. The framework is adopted by Cloudflare, Ubuntu, Django, and Gatsby.
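To make the routing step concrete, here is a small sketch that filters search hits by the content type encoded in the path. It assumes the four types appear as directory names, as in the example structure in this section; the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: each Diátaxis type is a directory name somewhere in the path.
DIATAXIS_TYPES = ("tutorials", "how-to", "reference", "explanation")


def content_type(doc_path):
    """Return the Diátaxis type found in a docs path, or None if untyped."""
    for part in PurePosixPath(doc_path).parts:
        if part in DIATAXIS_TYPES:
            return part
    return None


def route(hits, wanted):
    """Keep only the search hits that belong to the wanted content type."""
    return [hit for hit in hits if content_type(hit) == wanted]
```

For a task phrased as "how do I configure X", the agent would call `route(hits, "how-to")` and read one file instead of fifteen.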

The key advantage is a dual audience. A new team member reads the same documents as the AI agent. At Tier 3, the wiki is also human-readable but optimized for agent navigation. Here, there’s a single source of truth for both audiences. Plus, documentation gets indexed by search engines, whereas a wiki behind an LLM or a RAG system is invisible to Google. To get started: sort your documents into the 4 Diátaxis types and add a navigational index.md. One day for an average project.

Example Structure

docs/
  index.md                 # ← navigation hub, start here
  tutorials/
    getting-started.md     # learning material for newcomers
  how-to/
    configure-auth.md      # instructions: "how to do X"
  reference/
    api/                   # reference docs, often generated from code
  explanation/
    architecture.md        # explanations: "why we chose X"
  adr/
    001-use-postgres.md    # architectural decisions in ADR format

When to Move On

Maintenance cost — that’s what breaks this tier. At 200+ documents, classification becomes the bottleneck, and heterogeneous sources — scientific papers, transcripts, regulatory documents — don’t fit into neat templates.

Pattern	Anti-pattern
Diátaxis: 4 content types	A flat docs/ folder with no typing
Build-time link validation	Manually checking “did we break any links”
ADRs for architectural decisions	Decisions in chat, lost within a month
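Build-time link validation does not require a documentation framework. The sketch below is one possible check, assuming inline markdown links of the form `[text](target)` and checking only relative file targets; the function name is hypothetical, and most static site generators ship their own equivalent.

```python
import re
from pathlib import Path

# Matches inline markdown links and captures the target: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(docs_root):
    """Return (source file, target) pairs whose relative target does not exist."""
    root = Path(docs_root)
    problems = []
    for md in sorted(root.rglob("*.md")):
        text = md.read_text(encoding="utf-8", errors="replace")
        for target in LINK_RE.findall(text):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and in-page anchors are not checked here
            file_part = target.split("#", 1)[0]
            if not (md.parent / file_part).exists():
                problems.append((md.relative_to(root).as_posix(), target))
    return problems
```

Run in CI, a non-empty result from `broken_links("docs")` would fail the build before a broken link reaches readers or the agent.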

Tier 3: The Karpathy Method — LLM as Librarian

According to ussumant/llm-wiki-compiler, 383 files became 13 articles — 81x compression. 130 meeting transcripts became a single 244-line digest — 503x compression. And this isn’t lossy summarization: the LLM finds connections between sources that a human would miss. As Karpathy wrote: “With ~100 articles and ~400K words, the LLM’s ability to navigate through summaries and index files is more than sufficient.”

The Core Idea

Three-layer architecture (Andrej Karpathy, April 2026): raw/ — immutable sources (PDFs, transcripts, notes), append-only, no editing; wiki/ — LLM-generated and LLM-maintained pages; index.md — a catalog of all wiki pages with one-line descriptions. The index is the search mechanism: the LLM scans it, finds the right page, reads it.
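In the method described here the LLM itself writes and maintains index.md. As a deterministic stand-in that shows the shape of the catalog, the sketch below builds one line per wiki page. The line format and the convention that the first non-heading line of a page is its description are assumptions for illustration, not something taken from Karpathy's gist.

```python
from pathlib import Path


def one_line_description(page, max_len=120):
    """Assumed convention: the first non-empty, non-heading line summarizes the page."""
    for line in page.read_text(encoding="utf-8", errors="replace").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            return line[:max_len]
    return "(no description)"


def build_index(wiki_dir):
    """Return index.md content: one '- page.md: description' line per wiki page."""
    lines = ["# Index", ""]
    for page in sorted(Path(wiki_dir).glob("*.md")):
        lines.append(f"- {page.name}: {one_line_description(page)}")
    return "\n".join(lines) + "\n"
```

Because every entry is a single line, the catalog stays small enough for the agent to scan in full before deciding which page to open.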

Three operations: Ingest — read a source, write a wiki page, update the index, update 10–15 related pages. Query — find an answer by scanning the index, save good answers as new pages. Lint — detect contradictions, orphaned pages, and outdated claims.
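Contradictions and outdated claims need an LLM to judge, but the structural half of Lint can be checked mechanically. The sketch below assumes index entries look like `- page.md: description` (a hypothetical format) and reports wiki pages missing from the index as well as index entries whose page no longer exists.

```python
import re
from pathlib import Path

# Assumed index line format: "- page-name.md: one-line description"
ENTRY_RE = re.compile(r"^- (\S+\.md):", re.MULTILINE)


def lint_index(index_text, wiki_dir):
    """Compare the pages listed in the index with the pages on disk."""
    listed = set(ENTRY_RE.findall(index_text))
    on_disk = {p.name for p in Path(wiki_dir).glob("*.md")}
    return {
        "orphaned_pages": sorted(on_disk - listed),    # exist but cannot be found via the index
        "dangling_entries": sorted(listed - on_disk),  # indexed but missing from wiki/
    }
```

An orphaned page matters more here than in ordinary documentation: if the index is the only search mechanism, a page that is not listed is effectively invisible to the agent.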

This is paradise for the solo researcher. One person plus one LLM replaces a documentation team. My AI course moved to this tier when sources reached the hundreds — a single maintainer manages the entire knowledge base through a wiki. Lint proactively detects outdated claims — unlike Tier 2 documentation, which goes stale silently. The entire “stack” is markdown in git. According to ussumant/llm-wiki-compiler, the agent starts a session with a compact index (~7.7K tokens) instead of hundreds of files (~47K) — an 84% reduction.

Karpathy’s gist garnered millions of views — it struck a nerve. Full implementations have already appeared: ussumant/llm-wiki-compiler (Claude Code plugin), atomicmemory/llm-wiki-compiler (TypeScript, concept extraction), xoai/sage-wiki (Go, hybrid text + vector search). As MindStudio notes: “If your knowledge base is under 50,000–100,000 tokens, there’s no technical reason to use RAG.”

If you need semantic search over heterogeneous sources but without wiki compilation, you can simply load documents into a local RAG system and get meaning-based search in a single evening. To start with a wiki: create raw/ and wiki/, add a CLAUDE.md with conventions from Karpathy’s gist. Ingest 10–20 documents per session — the wiki grows organically.

Example Structure

knowledge-base/
  CLAUDE.md                # ← schema and conventions from Karpathy's gist
  index.md                 # catalog: one line per wiki page
  log.md                   # operations log (append-only)
  raw/                     # immutable sources
    paper-attention-2017.pdf
    meeting-2026-03-15.txt
    regulation-gdpr.md
  wiki/                    # LLM-generated pages (flat structure)
    transformer-architectures.md
    gdpr-compliance.md     # ← the LLM found a connection to three sources
    team-decisions-q1.md
    # wiki is flat: LLM navigates via index.md, no subdirectories needed

When to Move On

You’re running a research project: 200 papers, 50 meeting transcripts, 30 regulatory documents. The wiki handles it beautifully. Then a request comes in: “find everything related to model fairness evaluation.” But in wiki pages, this topic is called “fairness metrics”; in source files, “bias evaluation”; in regulatory documents, “equity assessment.” The index is a precision tool: it finds what’s listed. Semantic discovery is not its job. At 500+ sources, the index itself exceeds 50,000 tokens and no longer fits in context.

Pattern	Anti-pattern
raw/ append-only, wiki/ maintained by LLM	Editing the wiki by hand (breaks on recompilation)
One index.md with one-line descriptions	Nested indexes “for the future” with fewer than 100 pages
Incremental compilation	Full recompilation of 500 sources every time
Lint after every Ingest	Accumulating 100 sources and compiling them all at once

Tier 4: When the Index Doesn’t Fit in Context — Add Semantics

In my AI course, the Karpathy-method wiki delivered a 7.6x reduction in tool calls and 9 out of 9 on completeness scores. But when I needed to find “everything about AI agents” across Russian-language documents, the wiki index didn’t help. The topic appeared under five different names in fifteen different places. Only semantic search found what text search and the index missed.

The Core Idea

At this tier, the wiki (Tier 3) is supplemented with one or two layers. RAG (vector search) — semantic search via embeddings, finds “equity measures” when you search for “fairness metrics.” Knowledge graph (ontology) — structured relationships between entities: “paper X cites method Y, applied in domain Z.” The wiki remains the foundation — readable, navigable, in git. RAG and the graph are additional search layers on top, with results combined via Reciprocal Rank Fusion.
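Reciprocal Rank Fusion itself is only a few lines: each document scores 1 / (k + rank) in every ranked list it appears in, and the scores are summed. The sketch below uses k = 60, the constant commonly used with this method; the function name and the page names in the usage example are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from three layers for the same query.
wiki_hits = ["fairness-metrics.md", "gdpr-compliance.md"]
vector_hits = ["equity-assessment.md", "fairness-metrics.md"]
graph_hits = ["fairness-metrics.md"]
fused = reciprocal_rank_fusion([wiki_hits, vector_hits, graph_hits])
```

RRF works on ranks rather than raw scores, so the layers do not need comparable scoring scales, which is convenient when one layer is an index scan and another is a vector search.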

The cost isn’t necessarily high. In my course, I use local free tools: Oxigraph (an RDF store for the knowledge graph), mcp-local-rag (local semantic search with no external services) — everything lives in a single git repository, infrastructure cost is zero. For larger-scale tasks, LazyGraphRAG from Microsoft promises order-of-magnitude reductions in indexing costs. LightRAG delivers 70–90% of the quality at a hundredth of the cost.

Three typical setups:

Research library: the wiki compiles literature reviews, RAG finds papers by meaning, the graph tracks citation chains.

Agent knowledge base (my course): wiki for navigation, RAG for bilingual search (Russian and English), ontology on Oxigraph for traceability: “requirement -> lecture -> seminar -> assessment.”

Team knowledge base: three years of accumulated experience in meeting transcripts, project documents, and post-mortems; the wiki provides topic overviews, RAG finds “that time we already solved a similar problem.”

Start with RAG on top of an existing wiki; that takes one evening. Add the graph only when specific relational queries appear.

Example Structure

knowledge-base/
  CLAUDE.md
  index.md                 # wiki index (Tier 3)
  raw/                     # sources
    papers/
      by-topic/            # grouped by topic for convenience
    meeting-notes/
    regulations/
  wiki/                    # LLM-compiled pages
  index/                   # ← RAG index, add this first
  ontology/                # knowledge graph, add when you need relationships
    schema.ttl             # classes and properties (I use Oxigraph)
    store.ttl              # data
    queries/               # SPARQL queries for common questions

When You Need This

You need RAG when	You need a knowledge graph when
Bilingual search (RU and EN)	Multi-hop queries (“papers by author X -> method Y -> domain Z”)
“Find something similar” (fuzzy discovery)	Traceability (requirement -> test -> coverage)
Wiki index exceeds 50,000 tokens	Aggregation (“all papers with no citations”)
Heterogeneous sources	Taxonomies and classifications
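A multi-hop traceability query is easiest to understand on a toy graph. The sketch below uses plain Python triples rather than Oxigraph or SPARQL, and every identifier and predicate name in it is hypothetical; it only illustrates what "requirement -> lecture -> seminar -> assessment" means as a chain of hops.

```python
from collections import defaultdict

# Hypothetical triples; predicate names are illustrative, not a real schema.
TRIPLES = [
    ("req:explain-rag", "coveredBy", "lecture:07"),
    ("lecture:07", "practicedIn", "seminar:07"),
    ("seminar:07", "assessedBy", "exam:q12"),
    ("req:explain-agents", "coveredBy", "lecture:09"),
]


def build_graph(triples):
    """Index triples as subject -> predicate -> list of objects."""
    graph = defaultdict(dict)
    for subject, predicate, obj in triples:
        graph[subject].setdefault(predicate, []).append(obj)
    return graph


def follow(graph, start, predicates):
    """Follow a fixed chain of predicates from start and return the end nodes."""
    frontier = [start]
    for predicate in predicates:
        frontier = [
            obj
            for node in frontier
            for obj in graph.get(node, {}).get(predicate, [])
        ]
    return frontier
```

An empty result is itself informative: a requirement whose chain ends before reaching an assessment is a coverage gap, which is the kind of question neither text search nor vector search answers directly.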

Pattern	Anti-pattern
Wiki as foundation + RAG/graph as layers	RAG instead of wiki (you lose navigation)
Local free tools (Oxigraph, local-rag)	Paying $200/mo for a vector DB to index 100 documents
Adding layers one at a time	Building the entire infrastructure upfront “for growth”
Graph for specific relational queries	Graph “because it looks cool” with no clear use cases

How I Walked This Path

My AI course — hundreds of sources, dozens of artifacts, one maintainer.

I started at Tier 0: two dozen files, everything in context. Quickly outgrew it into Tier 1: search over exported documents. Tried RAG — got 10% precision on Russian-language queries. Tried an ontology — a beautiful schema, zero data.

I implemented Tier 3 — the Karpathy-method wiki: 7.6x reduction in tool calls, 9 out of 9 on completeness across test scenarios. Added RAG for semantic search on bilingual queries — but only after the wiki was working.

The key lesson: I tried to jump from Tier 1 to Tier 4 — and got beautifully empty infrastructure. Only when I went back to Tier 3 as the foundation and layered search on top did the system start working.

How to Determine the Right Structure

The entire selection framework boils down to two questions:

How many sources do you have? (fewer than 20 / 20 to 500 / more than 500)
What is it — code or documentation? (code / documentation for people / research, papers, heterogeneous sources)

Scale \ Content	Code	Documentation for people	Research, heterogeneous
Fewer than 20 files	Tier 0	Tier 0	Tier 0
20–500	Tier 1 (search + CLAUDE.md)	Tier 2 (docs-as-code)	Tier 3 (LLM wiki)
More than 500	Tier 1 + indexed search	Tier 2 (scales to 3,000+)	Tier 3 + 4 (RAG/graph)
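The table can be read as a two-argument lookup. The sketch below encodes it directly; the function name and the short content labels ("code", "docs", "research") are assumptions made for the example.

```python
def choose_tier(n_sources, content):
    """Return the starting tier for a source count and a content type.

    content is one of: "code", "docs" (documentation for people),
    "research" (papers and other heterogeneous sources).
    """
    if content not in ("code", "docs", "research"):
        raise ValueError(f"unknown content type: {content!r}")
    if n_sources < 20:
        return "Tier 0"
    if n_sources <= 500:
        return {"code": "Tier 1", "docs": "Tier 2", "research": "Tier 3"}[content]
    return {
        "code": "Tier 1 + indexed search",
        "docs": "Tier 2",
        "research": "Tier 3 + 4",
    }[content]
```

For a hybrid repository, call it once per content type: `choose_tier(200, "code")` and `choose_tier(50, "research")` give Tier 1 for the code and Tier 3 for the papers.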

Hybrid situations are the norm. “200 code files + 50 research papers” means code at Tier 1 (search + CLAUDE.md), papers at Tier 3 (wiki). Tiers aren’t mutually exclusive — they’re about content type.

Most of You Are at Tier 1. And That’s Fine

Entrepreneur Vamshi Reddy wrote to Karpathy: “Every business has a raw/ directory. Nobody has compiled it yet. There’s the product.”

I myself spent a sprint on a four-layer system with an ontology and SPARQL queries. Beautiful architecture. Graphs, relationships, validation. Then I opened the knowledge graph and discovered it was empty. Zero data. Right next to it sat a 40-line CLAUDE.md through which the agent had already been finding everything it needed for a week.

The right answer depends on the task. Tier 0 remains the best for small projects: NotebookLM serves millions of users without a single vector index. Tier 1 is for code. Tier 2 is for team documentation: Stripe isn’t switching to RAG for its docs and sees no reason to. The Karpathy-method wiki is for researchers with hundreds of heterogeneous sources. And hybrid Tier 4 is justified where the cost of unfound information is measured in lost revenue or patients.

Each tier is not a step on a ladder but the right tool for its scale. A simple rule: if you’re not experiencing a specific pain point at your current tier — you’re in the right place.

Как организовать репозиторий для LLM-агента

Чем сложнее, тем лучше? Не думаю. Пять уровней организации репозитория:

Уровень 0 — плоские файлы. Всё в контексте. Прототипы, конфиги, малые проекты до 20 файлов.

Уровень 1 — текстовый поиск + CLAUDE.md. Так работают все AI-кодинг-агенты. Кодовые проекты до 500 файлов.

Уровень 2 — docs-as-code. Структурированная документация для команд. Stripe, Kubernetes, Django — без RAG.

Уровень 3 — LLM-вики по методу Карпати. LLM компилирует вики из сырых источников. Сотни документов.

Уровень 4 — вики + RAG + граф знаний. Семантический поиск и связи. 500+ источников.

Не переходите на следующий, если хватает текущего.

LLM-агенты сегодня решают радикально разные задачи. Один агент пишет код в репозитории на десять тысяч файлов. Другой исследует пятьсот научных публикаций. Третий поддерживает документацию для двухсот человек. Применять один и тот же подход к организации знаний для всех этих задач — это слишком. Я прошёл все пять уровней на собственном проекте — учебном курсе по AI с сотнями источников, десятками артефактов и одним автором — и дальше расскажу, что работает на каком масштабе.

В 2024–2025 годах, пока индустрия строила сложные RAG-пайплайны и графы знаний, Cursor взлетел до $100M ARR с подходом, в основе которого — индекс эмбеддингов и текстовый поиск по коду. Не потому что текстовый поиск лучше RAG. А потому что для кода это правильный инструмент. Именно для кода.

В мире организации знаний для агентов люди совершают две симметричные ошибки. Одни недоинвестируют: 500 документов и текстовый поиск — хаос, ничего не находится. Другие переинвестируют: как описал Пол Хоук, разработчик удалил 2000 строк RAG-кода, и точность подскочила до 94%.

Нет «лучшего» способа организовать знания для LLM-агента. Есть пять уровней, каждый из которых — лучший ответ для своего типа задачи и масштаба. Переходить на следующий стоит только когда текущий ломается на конкретной болевой точке. Контекстные окна всех основных моделей в 2026 году достигли миллиона токенов и больше — Gemini, Claude, Llama, GPT — и это сдвигает порог, на котором инфраструктура поиска вообще оправдана.

Уровень 0: Всё помещается в контекст — и это прекрасно

Google NotebookLM позволяет загрузить до 50 источников и задавать вопросы по ним. Claude Projects от Anthropic — функция, где вы добавляете файлы в «проект» и агент работает с ними целиком. Десятки миллионов пользователей. Никакого RAG, никаких векторных индексов. Просто файлы в контексте. Это не MVP — это рабочая архитектура.

Суть подхода

Все файлы целиком загружаются в контекстное окно LLM. Никакого поиска, никакой индексации. При 20 файлах по 200 строк это около 16 тысяч токенов — 1.6% окна Claude. Как пишет команда Ahoi Kapptn: «Если ваша база знаний меньше 200 тысяч токенов (около 500 страниц), включите её целиком в промпт».

Где это работает идеально: загрузили 10 статей — задавайте вопросы, получайте синтез за ноль минут настройки. Прототип на 5 файлов — агент видит всё, точность максимальна. 15 конфигов инфраструктурного проекта — полный контекст, нулевая задержка. Мой курс по AI начинался именно так: два десятка файлов, всё помещалось в контекст, и агент находил нужное мгновенно.

Пример структуры

my-project/
  notes.md                 # заметки, идеи, черновики
  data-analysis.py         # весь код — 3-5 файлов
  config.yaml
  research-paper-1.pdf     # все источники прямо в корне
  research-paper-2.pdf

Когда переезжать

Однажды вы замечаете, что агент начинает «забывать» информацию. Исследование Stanford и UC Berkeley (Liu et al., 2023) показало эффект «потери в середине»: точность падает на 30% и больше, когда нужная информация оказывается в середине контекста. Другая работа зафиксировала, что эффективный контекст всех моделей на сложных задачах оказался гораздо меньше рекламируемого. Граница: примерно 20 файлов или 50 тысяч токенов. Если чувствуете эту боль — пора на следующий уровень. Если нет — оставайтесь, вы на правильном месте.

Паттерн	Антипаттерн
Все файлы в одной папке, без вложенности	Настраивать RAG для 5 документов
Максимально плоская структура	Складывать 100 файлов в контекст «про запас»
Ноль инфраструктуры, ноль настройки	Создавать иерархию папок для 10 файлов

Уровень 1: Текстовый поиск + CLAUDE.md — так работают все AI-кодинг-агенты

Cursor. Claude Code. Windsurf. Ни один из них не требует от разработчика поднимать векторную базу данных. Все используют текстовый поиск как основную инфраструктуру. Как пишет BuildMVPFast: «Текстовый поиск тихо стал несущей инфраструктурой для того, как AI пишет код».

Суть подхода

На этом уровне проект имеет CLAUDE.md (или AGENTS.md, .cursorrules), который объясняет агенту структуру и конвенции кодовой базы. Агент читает CLAUDE.md и понимает, где что лежит — какие директории за что отвечают, какие конвенции именования используются. Когда приходит задача, агент ищет по ключевым словам, находит подходящие файлы, а затем зачитывает их целиком, чтобы получить полный контекст. Структура директорий сама по себе становится навигационной картой.

На уровне 0 агент видит всё — но не знает, что важно. CLAUDE.md даёт приоритеты. Поиск позволяет агенту читать только нужные файлы, а не загружать все 500 в контекст. AGENTS.md уже стандартизирован Linux Foundation, поддерживается OpenAI, Anthropic, Google, AWS, Bloomberg. Более 60 тысяч репозиториев включают его. Как отмечает HumanLayer: «CLAUDE.md за 30 минут даёт агенту 80% нужного контекста». Чтобы начать — создайте CLAUDE.md и опишите архитектуру, ключевые конвенции, как запустить и протестировать проект.

Текстовый поиск объективно превосходит семантический для точных совпадений. Как отмечает ast-grep: ERROR_4532 в векторном пространстве неотличим от ERROR_4533 — а это совершенно разные ошибки. Мой курс по AI перешёл на этот уровень, когда источников стало больше двадцати — поиск по экспортированным документам работал быстро и точно.

Пример структуры

my-repo/
  CLAUDE.md              # ← инструкции агенту: архитектура, конвенции
  AGENTS.md              # стандартизованные правила (можно вместо CLAUDE.md)
  src/                   # код проекта
  tests/                 # тесты рядом с кодом
  docs/
    architecture.md      # держите документацию рядом с кодом
    adr/
      001-use-postgres.md  # архитектурные решения в формате ADR

Когда переезжать

У вас 300 файлов кода и поиск работает отлично. Потом приходит задача: найти все требования GDPR в исследовательских заметках, юридических документах и протоколах встреч. Поиск по слову «GDPR» находит 5 из 20 релевантных документов — остальные говорят о «персональных данных», «privacy regulation», «обработке ПДн». Это проблема полисемии: одно понятие, десятки названий. Вам нужна не лучшая поисковая система, а структурированная навигация. Граница: примерно 500 файлов, преимущественно код. Для не-кодовых знаний — PDF, нормативные документы, исследования — эта модель не работает.

Паттерн	Антипаттерн
CLAUDE.md с архитектурой и конвенциями	Надеяться, что агент «сам разберётся»
Единообразные правила именования	Разные стили в разных частях проекта
AGENTS.md + отдельные .md по поддиректориям	Один гигантский CLAUDE.md на 2000 строк
Текстовый поиск для кода и идентификаторов	Текстовый поиск для концепций в прозе

Уровень 2: Docs-as-code — структурированная документация для команд

Этот уровень — для проектов, где документация создаётся людьми для людей, а AI-агент получает качественную навигацию бесплатно. Stripe docs, Kubernetes (3000+ страниц), Django, Terraform — обслуживают миллионы разработчиков без RAG и не собираются переходить. Как отмечает Mintlify: «В Stripe фича не считается выпущенной, пока не написана документация».

Суть подхода

Документация организована по типу контента. Фреймворк Diátaxis делит её на 4 типа — обучение, инструкции, справочник, объяснение. Когда поиск находит слово «authentication» в 15 файлах, агент без типизации вынужден читать все 15. С Diátaxis — сразу идёт в how-to/configure-oauth.md. Фреймворк принят Cloudflare, Ubuntu, Django, Gatsby.

Главное преимущество — двойная аудитория. Новый член команды читает те же документы, что и AI-агент. На уровне 3 вики тоже читаема, но оптимизирована под навигацию агента. Здесь — один источник правды для обеих аудиторий. Плюс документация индексируется поисковиками — вики за LLM или RAG-система для Google невидимы. Чтобы начать: рассортируйте документы по 4 типам Diátaxis, добавьте навигационный index.md. Один день для среднего проекта.

Пример структуры

docs/
  index.md                 # ← навигационный хаб, начните здесь
  tutorials/
    getting-started.md     # обучение для новичков
  how-to/
    configure-auth.md      # инструкции: «как сделать X»
  reference/
    api/                   # справочник, часто генерируется из кода
  explanation/
    architecture.md        # объяснения: «почему мы выбрали X»
  adr/
    001-use-postgres.md    # архитектурные решения в формате ADR

Когда переезжать

Стоимость поддержания — вот что ломает этот уровень. При 200+ документах классификация становится узким местом, а разнородные источники — научные публикации, транскрипты, нормативные документы — не укладываются в аккуратные шаблоны.

Паттерн	Антипаттерн
Diátaxis: 4 типа контента	Плоская папка docs/ без типизации
Валидация ссылок при сборке	Ручная проверка «не сломали ли ссылки»
ADR для архитектурных решений	Решения в чатах, потерянные через месяц

Уровень 3: Метод Карпати — LLM как библиотекарь

По данным ussumant/llm-wiki-compiler, 383 файла превратились в 13 статей — 81-кратная компрессия. 130 транскриптов совещаний стали одним дайджестом на 244 строки — 503-кратное сжатие. И это не выжимка с потерями: LLM находит связи между источниками, которые человек бы пропустил. Как написал Карпати: «При ~100 статьях и ~400K слов способности LLM навигировать через саммари и индексные файлы более чем достаточно».

Суть подхода

Трёхслойная архитектура (Andrej Karpathy, апрель 2026): raw/ — неизменяемые источники (PDF, транскрипты, заметки), только добавление, без редактирования; wiki/ — LLM-сгенерированные и LLM-поддерживаемые страницы; index.md — каталог всех вики-страниц с однострочными описаниями. Индекс — это и есть механизм поиска: LLM сканирует его, находит нужную страницу, читает.

Три операции: Ingest — прочитать источник, написать вики-страницу, обновить индекс, обновить 10–15 связанных страниц. Query — найти ответ через сканирование индекса, сохранить хорошие ответы как новые страницы. Lint — обнаружить противоречия, осиротевшие страницы, устаревшие утверждения.

Это рай для соло-исследователя. Один человек плюс один LLM заменяют документационную команду. Мой курс по AI перешёл на этот уровень, когда источников стало сотни — один мейнтейнер управляет всей базой знаний через вики. Lint обнаруживает устаревшие утверждения проактивно — в отличие от документации уровня 2, которая устаревает молча. Весь «стек» — markdown в git. По данным ussumant/llm-wiki-compiler, агент начинает сессию с компактного индекса (~7.7K токенов) вместо сотен файлов (~47K) — сокращение на 84%.

Гист Карпати набрал миллионы просмотров — он попал в нерв. Уже появились полноценные реализации: ussumant/llm-wiki-compiler (плагин для Claude Code), atomicmemory/llm-wiki-compiler (TypeScript, извлечение концепций), xoai/sage-wiki (Go, гибридный текстовый + векторный поиск). Как отмечает MindStudio: «Если ваша база знаний меньше 50–100 тысяч токенов, нет технической причины использовать RAG».

Если вам нужен семантический поиск по разнородным источникам, но без вики-компиляции — можно просто загрузить документы в локальный RAG и получить поиск по смыслу за один вечер. Чтобы начать с вики: создайте raw/ и wiki/, добавьте CLAUDE.md с конвенциями из гиста Карпати. Загружайте по 10–20 документов за сессию — вики растёт органически.

Пример структуры

knowledge-base/
  CLAUDE.md                # ← схема и конвенции из гиста Карпати
  index.md                 # каталог: одна строка — одна вики-страница
  log.md                   # журнал операций (только дополнение)
  raw/                     # неизменяемые источники
    paper-attention-2017.pdf
    meeting-2026-03-15.txt
    regulation-gdpr.md
  wiki/                    # LLM-сгенерированные страницы (плоская структура)
    transformer-architectures.md
    gdpr-compliance.md     # ← LLM нашёл связь с тремя источниками
    team-decisions-q1.md
    # вики плоская: LLM навигирует через index.md, подпапки не нужны

Когда переезжать

Вы ведёте исследовательский проект: 200 публикаций, 50 протоколов встреч, 30 нормативных документов. Вики отлично справляется. Потом приходит запрос: «найди всё связанное с оценкой справедливости моделей». Но в вики-страницах эта тема называется «метрики справедливости», в исходниках — «bias evaluation», в нормативных документах — «оценка корректности». Индекс — точный инструмент: он находит то, что перечислено. Семантическое обнаружение — не его задача. При 500+ источниках сам индекс превышает 50 тысяч токенов и перестаёт помещаться в контекст.

Pattern	Anti-pattern
raw/ is append-only, wiki/ is maintained by the LLM	Editing the wiki by hand (breaks on recompilation)
A single index.md with one-line descriptions	Nested indexes “for the future” with fewer than 100 pages
Incremental compilation	Full recompilation of 500 sources every time
Lint after every Ingest	Hoarding 100 sources and compiling them all at once
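
Incremental compilation needs some record of what has already been ingested. One possible approach, sketched below, is a manifest of content hashes: anything in raw/ whose hash is missing from the manifest is pending. The manifest file and all function names here are assumptions made for illustration, not part of any specific wiki compiler.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used as a change detector."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_manifest(manifest_path: Path) -> dict[str, str]:
    """Read the {relative path: digest} map, or start empty on the first run."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def pending_sources(raw_dir: Path, manifest_path: Path) -> list[Path]:
    """Return raw files that are new or changed since the last recorded compile."""
    seen = load_manifest(manifest_path)
    files = sorted(p for p in raw_dir.rglob("*") if p.is_file())
    return [
        p for p in files
        if seen.get(p.relative_to(raw_dir).as_posix()) != file_digest(p)
    ]


def mark_compiled(raw_dir: Path, manifest_path: Path, compiled: list[Path]) -> None:
    """Record digests for the sources the LLM has finished ingesting."""
    seen = load_manifest(manifest_path)
    for p in compiled:
        seen[p.relative_to(raw_dir).as_posix()] = file_digest(p)
    manifest_path.write_text(
        json.dumps(seen, indent=2, sort_keys=True), encoding="utf-8"
    )
```

Keep the manifest outside raw/ so it is not picked up as a source itself.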

Tier 4: When the index no longer fits in context, add semantics

In my AI course, the Karpathy-method wiki delivered a 7.6× reduction in tool calls and 9 out of 9 on answer completeness. But when I needed to find “everything about AI agents” across Russian-language documents, the wiki index did not help. The topic appeared under five different names in fifteen different places. Only semantic search found what text search and the index had missed.

The core idea

At this tier, the wiki (Tier 3) is augmented with one or two layers. RAG (vector search) is semantic search over vector representations: it finds “equity measures” when you search for “fairness metrics.” A knowledge graph (ontology) holds structured relationships between entities: “paper X cites method Y, applied in domain Z.” The wiki remains the foundation: readable, navigable, in git. RAG and the graph are additional search layers on top of it, and their results are merged via Reciprocal Rank Fusion.
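
Reciprocal Rank Fusion itself is small enough to show in full. Each retriever contributes 1 / (k + rank) for every document it returns, and the contributions are summed per document. The sketch below uses k = 60, a commonly used default; the page ids and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the wiki index and from the RAG layer.
wiki_hits = ["fairness-metrics", "bias-evaluation", "gdpr-compliance"]
rag_hits = ["bias-evaluation", "equity-measures", "fairness-metrics"]
fused = reciprocal_rank_fusion([wiki_hits, rag_hits])
```

Because only ranks are used, the layers do not need comparable score scales, which is convenient when one layer is an index lookup and the other is a vector search.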

The cost is not necessarily high. In my course I use free local tools: Oxigraph (an RDF store for the knowledge graph) and mcp-local-rag (local semantic search with no external services). Everything lives in one git repository, and the infrastructure cost is zero. For larger-scale tasks, Microsoft's LazyGraphRAG promises to cut indexing cost by orders of magnitude. LightRAG delivers 70–90% of the quality at a hundredth of the price.

Scientific library: the wiki compiles literature reviews, RAG finds publications by meaning, the graph tracks citation chains.
Agent knowledge base (my course): the wiki for navigation, RAG for bilingual search (Russian and English), an Oxigraph ontology for tracing “requirement → lecture → seminar → assessment.”
Team knowledge base: three years of accumulated experience (meeting minutes, design documents, post-mortems); the wiki provides topic overviews, RAG finds “that time we already solved a similar problem.”

Start with RAG on top of the existing wiki; it takes one evening. Add the graph only when concrete relationship queries appear.
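
To make the “requirement → lecture → seminar → assessment” trace concrete, here is a minimal sketch of a multi-hop lookup over subject-predicate-object triples. It is plain Python standing in for what a SPARQL query against an RDF store would do; the predicate names and identifiers are invented for illustration and are not my actual ontology.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)


def trace(triples: list[Triple], start: str, predicates: list[str]) -> list[list[str]]:
    """Follow the predicates in order from `start`; return every complete path."""
    paths = [[start]]
    for predicate in predicates:
        paths = [
            path + [obj]
            for path in paths
            for subj, pred, obj in triples
            if subj == path[-1] and pred == predicate
        ]
    return paths


# Hypothetical course data.
course = [
    ("req:R1", "coveredBy", "lecture:L3"),
    ("lecture:L3", "practicedIn", "seminar:S3"),
    ("seminar:S3", "assessedBy", "assessment:A2"),
    ("req:R2", "coveredBy", "lecture:L4"),  # chain stops here: no seminar yet
]
chain = ["coveredBy", "practicedIn", "assessedBy"]
covered = trace(course, "req:R1", chain)
uncovered = trace(course, "req:R2", chain)
```

An empty result for req:R2 is the useful part: the graph answers not only where something is covered but also which requirements have a broken chain, which text search cannot do.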

Example structure

knowledge-base/
  CLAUDE.md
  index.md                 # wiki index (Tier 3)
  raw/                     # sources
    papers/
      by-topic/            # grouped by topic for convenience
    meeting-notes/
    regulations/
  wiki/                    # LLM-compiled pages
  index/                   # ← RAG index, add this first
  ontology/                # knowledge graph, add when you need relationships
    schema.ttl             # classes and properties (I use Oxigraph)
    store.ttl              # data
    queries/               # SPARQL queries for common questions

When you need it

You need RAG when	You need a knowledge graph when
Bilingual search (RU and EN)	Multi-hop queries (“publications by author X → method Y → domain Z”)
“Find similar” (fuzzy discovery)	Traceability (requirement → test → coverage)
The wiki index exceeds 50 thousand tokens	Aggregation (“all publications with no citations”)
Heterogeneous sources	Taxonomies and classifications

Pattern	Anti-pattern
Wiki as the foundation + RAG/graph as layers	RAG instead of the wiki (you lose navigation)
Free local tools (Oxigraph, local-rag)	A paid vector DB at $200/month for 100 documents
Add layers one at a time	Building the whole infrastructure up front “to grow into”
A graph for concrete relationship queries	A graph “because it looks nice” with no clear tasks

How I walked this path

My AI course: hundreds of sources, dozens of artifacts, one maintainer.

I started at Tier 0: a couple dozen files, everything in context. I quickly outgrew it and moved to Tier 1: search over exported documents. I tried RAG and got 10% accuracy on Russian-language queries. I tried an ontology: a beautiful schema, zero data.

I implemented Tier 3, a Karpathy-method wiki: a 7.6× reduction in tool calls and 9 out of 9 on completeness in the test scenarios. I added RAG for semantic search over bilingual queries, but only after the wiki was working.

The key lesson: I tried to jump from Tier 1 to Tier 4 and ended up with beautiful, empty infrastructure. Only when I went back to Tier 3 as the base and added search layers on top did the system start working.

How to determine the structure you need

The whole selection framework comes down to two questions:

How many sources do you have? (fewer than 20 / 20 to 500 / more than 500)
What is it: code or documentation? (code / documentation for humans / research, publications, heterogeneous sources)

Scale \ Content	Code	Documentation for humans	Research, heterogeneous
Fewer than 20 files	Tier 0	Tier 0	Tier 0
20–500	Tier 1 (search + CLAUDE.md)	Tier 2 (docs-as-code)	Tier 3 (LLM wiki)
More than 500	Tier 1 + indexed search	Tier 2 (up to 3000+)	Tier 3 + 4 (RAG/graph)
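
The table is small enough to encode directly. Below is a sketch of it as a lookup function; the content keys ("code", "docs", "research") and the function name are my own shorthand for the three columns, and the boundaries follow the rows above.

```python
TIER_TABLE = {
    "code": {"mid": "Tier 1 (search + CLAUDE.md)", "large": "Tier 1 + indexed search"},
    "docs": {"mid": "Tier 2 (docs-as-code)", "large": "Tier 2 (up to 3000+)"},
    "research": {"mid": "Tier 3 (LLM wiki)", "large": "Tier 3 + 4 (RAG/graph)"},
}


def recommend_tier(n_sources: int, content: str) -> str:
    """Map (number of sources, content type) to the tier from the table."""
    if content not in TIER_TABLE:
        raise ValueError(f"unknown content type: {content!r}")
    if n_sources < 20:
        return "Tier 0"  # everything fits in context regardless of content type
    band = "mid" if n_sources <= 500 else "large"
    return TIER_TABLE[content][band]
```

For a mixed repository, call it once per content type, for example recommend_tier(200, "code") and recommend_tier(50, "research"), rather than once for the whole repository.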

Hybrid situations are the norm. “200 code files + 50 scientific publications”: the code sits at Tier 1 (search + CLAUDE.md), the publications at Tier 3 (wiki). Tiers are not exclusive; they are about content type.

Most of you are at Tier 1. And that's fine

Entrepreneur Vamshi Reddy wrote to Karpathy: “Every business has a raw/ directory. Nobody has compiled it yet. That's the product.”

I myself spent a sprint on a four-layer system with an ontology and SPARQL queries. Beautiful architecture. Graphs, relationships, validation. Then I opened the knowledge graph and found it empty. Zero data. Next to it sat a 40-line CLAUDE.md through which the agent had been finding everything it needed for a week.

The right answer depends on the task. Tier 0 is still the best choice for small projects: NotebookLM serves millions of users without a single vector index. Tier 1 is for code. Stripe has not moved its documentation to RAG and so far sees no reason to. The Karpathy-method wiki is for researchers with hundreds of heterogeneous sources. And the hybrid Tier 4 is justified where the cost of information not found is measured in lost money or patients.

Each tier is not a rung on a ladder but the right tool for its scale. A simple rule: if you feel no concrete pain at your current tier, you are in the right place.

Best 2025 RAG as a Service tools overview.

As businesses increasingly adopt Retrieval-Augmented Generation (RAG) to power intelligent applications, a specialized market of platforms known as “RAG as a Service” (RaaS) has rapidly matured. These services aim to abstract away the significant engineering challenges involved in building, deploying, and maintaining a production-ready RAG system.

However, the landscape is not limited to commercial, managed services. A vibrant ecosystem of open-source, self-hostable platforms has emerged, offering a compelling alternative for organizations that require greater control, data sovereignty, and deeper customization. These solutions provide a strategic middle ground between building from scratch with frameworks like LangChain and buying a proprietary, “black box” service.

This article provides a comprehensive overview of the modern RAG landscape, comparing leading commercial RaaS providers with their powerful open-source counterparts to help you choose the right path for your project.

Commercial RaaS Platforms: Managed for Speed and Simplicity

Commercial RaaS platforms are designed to deliver value with minimal setup. They offer end-to-end managed services that handle the underlying complexity of data ingestion, vectorization, and secure deployment, allowing development teams to focus on application logic.

🎯 Vectara: The Accuracy-Focused Engine

Product Overview: Vectara is an end-to-end cloud platform that puts a heavy emphasis on minimizing hallucinations and providing verifiable, fact-grounded answers. It operates as a fully managed service, using its own suite of proprietary AI models engineered for retrieval accuracy and factual consistency.

Architectural Approach:

Grounded Generation: A core design principle is forcing generated answers to be based strictly on the provided documents, complete with inline citations to ensure verifiability.
Proprietary Models: It uses specialized models like the HHEM (Hallucination Evaluation Model), which acts as a real-time fact-checker, to improve the reliability of its outputs.
Black Box Design: The platform is intentionally a “black box,” abstracting away the internal components to deliver high accuracy out-of-the-box, at the expense of granular customizability.

Well-Suited For: Enterprise applications where factual precision is a non-negotiable requirement, such as internal policy chatbots, financial reporting tools, or customer support systems dealing with technical information.

🛡️ Nuclia: The Security-First Fortress

Product Overview: Nuclia is an all-in-one RAG platform distinguished by its focus on Security & Governance. Its standout feature is the option for on-premise deployment, which allows enterprises to maintain full control over sensitive data.

Architectural Approach:

Data Sovereignty: The ability to run the entire platform within a company’s own firewall is its main differentiator, making it ideal for data-sensitive environments.
Versatile Data Processing: It is engineered to process a wide range of unstructured data, including video, audio, and complex PDFs, making them fully searchable.
Certified Security: The platform adheres to high security standards like SOC 2 Type II and ISO 27001, providing enterprise-grade assurance.

Well-Suited For: Organizations in highly regulated industries (e.g., finance, legal, healthcare) or those handling sensitive R&D data that cannot be exposed to a public cloud environment.

🚀 Ragie: The Developer-Centric Launchpad

Product Overview: Ragie is a fully-managed RAG platform designed for developer velocity and ease of use. It aims to lower the barrier to entry for building RAG applications by providing simple APIs and a large library of pre-built connectors.

Architectural Approach:

Managed Connectors: A key feature is its library of connectors that automate data syncing from sources like Google Drive, Notion, and Confluence, reducing integration overhead.
Accessible Features: It packages advanced capabilities like multimodal search and reranking into all its plans, including a free tier, to encourage rapid prototyping.
Simplicity over Control: It is designed for ease of use, which means it offers less granular control over internal components like chunking algorithms or underlying LLMs.

Well-Suited For: Startups and development teams that need to build and launch RAG applications quickly and cost-effectively, especially for prototypes, MVPs, or less critical internal tools.

🛠️ Ragu AI: The Modular Workshop

Product Overview: Ragu AI operates more like a flexible framework than a closed system. It emphasizes modularity and control, allowing expert teams to assemble a bespoke RAG pipeline using their own preferred components.

Architectural Approach:

Bring Your Own Components (BYOC): Its core philosophy is integration. Users can plug in their own vector database (e.g., Pinecone), LLMs, and other tools, giving them full control over the stack.
Pipeline Optimization: It provides tools for A/B testing different pipeline configurations, enabling teams to empirically tune the system for their specific needs.
Orchestration Layer: It acts as a managed orchestration layer that connects to a company’s existing infrastructure, avoiding the need for large-scale data migration.

Well-Suited For: Experienced AI/ML teams building sophisticated, custom RAG solutions that require deep integration with existing data stacks or the use of specific, fine-tuned models.

Open-Source RAG Platforms: Built for Control and Customization

Open-source platforms offer a powerful alternative for teams that require full data sovereignty, architectural control, and the ability to customize their RAG pipeline. These are not just libraries; they are complete, deployable application stacks.

🧩 Dify.ai: The Visual AI Application Development Platform

Product Overview: Dify.ai is a comprehensive, open-source LLM application development platform that extends beyond RAG to encompass a wide range of agentic AI applications. Its low-code/no-code visual interface democratizes AI development for a broad audience.

Architectural Approach:

Visual Workflow Builder: Its centerpiece is an intuitive, drag-and-drop canvas for constructing, testing, and deploying complex AI workflows and multi-step agents without extensive coding.
Integrated RAG Engine: Includes a powerful, built-in RAG pipeline that manages the entire lifecycle of knowledge augmentation, from document ingestion and parsing to advanced retrieval strategies.
Backend-as-a-Service (BaaS): Provides a complete set of RESTful APIs, allowing developers to programmatically integrate Dify’s backend into their own custom applications.

Well-Suited For: Cross-functional teams (Product Managers, Developers, Marketers) that need to rapidly build, prototype, and deploy AI-powered applications, including RAG chatbots and complex agents.

📚 RAGFlow: The Deep Document Understanding Engine

Product Overview: RAGFlow is an open-source RAG platform singularly focused on solving “deep document understanding.” Its philosophy is that RAG system performance is limited by the quality of data extraction, especially from complex, unstructured formats.

Architectural Approach:

Template-Based Chunking: A key differentiator is its use of customizable visual templates for document chunking, allowing for more logical and contextually aware segmentation of complex layouts (e.g., multi-column PDFs).
Hybrid Search: Employs a hybrid search approach that combines modern vector search with traditional keyword-based search to enhance accuracy and handle diverse query types.
Graph-Enhanced RAG: Incorporates graph-based retrieval mechanisms to understand the relationships between different parts of a document, providing more contextually relevant answers.

Well-Suited For: Organizations whose primary challenge is extracting knowledge from large volumes of complex, poorly structured, or scanned documents (e.g., in finance, legal, and engineering).

🌐 TrustGraph: The Enterprise GraphRAG Intelligence Platform

Product Overview: TrustGraph is an open-source platform engineered for building enterprise-grade AI applications that demand deep contextual reasoning. It moves “Beyond Basic RAG” by embracing a more advanced GraphRAG architecture.

Architectural Approach:

GraphRAG Engine: Automates the process of building a knowledge graph from ingested data, identifying entities and their relationships. This enables multi-hop reasoning that traditional RAG cannot perform.
Asynchronous Pub/Sub Backbone: Built on Apache Pulsar, ensuring reliability, fault tolerance, and scalability for demanding enterprise environments.
Reusable Knowledge Packages: Stores the processed graph structure and vector embeddings in modular packages, so the computationally expensive data structuring is only performed once.

Well-Suited For: Sophisticated technology teams in complex, regulated industries (e.g., finance, national security, scientific research) needing high-accuracy, explainable AI that can reason over vast, interconnected datasets.

Platform Comparison

The choice between a commercial and open-source platform depends on your organization’s priorities. Here is a comparison grouped by key evaluation criteria.

Platform	Focus	Deployment	Best For	Pricing
Vectara	🎯 Accuracy	☁️ Cloud	Enterprise	💵 Subscription
Nuclia	🛡️ Security	🏢 On-Premise	Regulated	💵 Subscription
Ragie	🚀 Speed	☁️ Cloud	Startups	💵 Subscription
Ragu AI	🛠️ Control	🧩 BYOC	Experts	💵 Subscription
Dify.ai	🎨 Visual Dev	☁️/🏢 Hybrid	All Teams	🎁 Freemium
RAGFlow	📄 Doc Parsing	🏢 Self-Hosted	Data-Heavy	🆓 Open Source
TrustGraph	🌐 GraphRAG	🏢 Self-Hosted	Researchers	🆓 Open Source

Conclusion: A Spectrum of Choice in a Maturing Market

The “build vs. buy” decision for RAG infrastructure has evolved into a more nuanced “build vs. buy vs. adapt” framework. The availability of mature RaaS platforms and powerful open-source alternatives means that building from scratch is often no longer the most efficient path.

The current landscape reflects the diverse needs of the market. The choice is no longer simply whether to buy, but which service philosophy—or open-source architecture—best aligns with a project’s specific goals. Whether the priority is out-of-the-box accuracy, absolute data security, rapid development, or deep architectural control, there is a solution available. This variety empowers teams to select a platform that lets them move beyond infrastructure challenges and focus on creating innovative, data-driven applications that unlock the true value of their knowledge.

Semantic Search Demystified: Architectures, Use Cases, and What Actually Works

🔗 Introduction: From RAG to Foundation

“If RAG is how intelligent systems respond, semantic search is how they understand.”

In our last post, we explored how Retrieval-Augmented Generation (RAG) unlocked the ability for AI systems to answer questions in rich, fluent, contextual language. But how do these systems decide what information even matters?

That’s where semantic search steps in.

Semantic search is the unsung engine behind intelligent systems—helping GitHub Copilot generate 46% of developer code, Shopify drive 700+ orders in 90 days, and healthcare platforms like Tempus AI match patients to life-saving treatments. It doesn’t just find “words”—it finds meaning.

This post goes beyond the buzz. We’ll show what real semantic search looks like in 2025:

Architectures that power enterprise copilots and recommendation systems.
Tools and best practices that go beyond vector search hype.
Lessons from real deployments—from legal tech to e-commerce to support automation.

Just like RAG changed how we write answers, semantic search is changing how systems think. Let’s dive into the practical patterns shaping this transformation.

🧭 Why Keyword Search Fails, and Semantic Search Wins

Most search systems still rely on keyword matching—fast, simple, and well understood. But when relevance depends on meaning, not exact terms, this approach consistently breaks down.

Common Failure Modes

Synonym blindness: Searching for “doctoral candidates” misses pages indexed under “PhD students.”
Multilingual mismatch: A support ticket in Spanish isn’t found by an English-only keyword query—even if translated equivalents exist.
Overfitting to phrasing: Searching legal clauses for “terminate agreement” doesn’t return documents using “contract dissolution,” even if conceptually identical.

These aren’t edge cases—they’re systemic.

A 2024 benchmark study showed enterprises lose an average of $31,754 per employee per year due to inefficient internal search systems. The gap is especially painful in:

Customer support, where unresolved queries escalate due to missed knowledge base hits.
Legal search, where clause discovery depends on phrasing, not legal equivalence.
E-commerce, where product searches fail unless users mirror site taxonomy (“running shoes” vs. “sneakers”).

Semantic search addresses these issues by modeling similarity in meaning—not just words. But that doesn’t mean it always wins. The next section unpacks what it is, how it works, and when it actually makes sense to use.

🧠 What Is Semantic Search? A Practical Model

Semantic search retrieves information based on meaning, not surface words. It relies on transforming text into vectors—mathematical representations that cluster similar ideas together, regardless of how they’re phrased.

Lexical vs. Semantic: A Mental Model

Lexical search finds exact word matches.

Query: “laptop stand”

Misses: “notebook riser”, “portable desk support”

Semantic search maps all these terms into nearby positions in vector space. The system knows they mean similar things, even without shared words.

Core Components

Embeddings: Text is encoded into a dense vector (e.g., 768 to 3072 dimensions), capturing semantic context.
Similarity: Queries are compared to documents using cosine similarity or dot product.
Hybrid Fusion: Combines lexical and semantic scores using techniques like Reciprocal Rank Fusion (RRF) or weighted ensembling.
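
As a concrete reference for the similarity step, here is a minimal cosine similarity in plain Python. Real systems compute this inside the vector store over hundreds or thousands of dimensions; the function below is only meant to show the arithmetic, and returning 0.0 for a zero vector is a choice made here for convenience.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```

When embeddings are normalized to unit length, the denominator is 1 and cosine similarity reduces to the dot product, which is why the two measures are often used interchangeably.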

Evolution of Approaches

Stage	Description	When Used
Keyword-only	Classic full-text search	Simple filters, structured data
Vector-only	Embedding similarity, no text indexing	Small scale, fuzzy lookup
Hybrid Search	Combine lexical + semantic (RRF, CC)	Most production systems
RAG	Retrieve + generate with LLMs	Question answering, chatbots
Agentic Retrieval	Multi-step, context-aware, tool-using AI	Autonomous systems

Semantic search isn’t just “vector lookup.” It’s a design pattern built from embeddings, retrieval logic, scoring strategies, and increasingly—reasoning modules.

🧱 Architectural Building Blocks and Best Practices

Designing a semantic search system means combining several moving parts into a cohesive pipeline—from turning text into vectors to returning ranked results. Below is a working blueprint.

Core Components: What Every System Needs

Let’s walk through the core flow:

Embedding Layer

Converts queries and documents into dense vectors using a model like:

OpenAI text-embedding-3-large (plug-and-play, high quality)
Cohere v3 (multilingual)
BGE-M3 or Mistral-E5 (open-source options)

Vector Store

Indexes embeddings for fast similarity search:

Qdrant (ultra-low latency, good for filtering)
Weaviate (multimodal, plug-in architecture)
pgvector (PostgreSQL extension, ideal for small-scale or internal use)

Retriever Orchestration

Frameworks like:

LangChain (fast prototyping, agent support)
LlamaIndex (good for structured docs)
Haystack (production-grade with observability)

Re-ranker (Precision Layer)

Refines top-N results from the retriever stage using more sophisticated logic:

Cross-Encoder Models: Jointly score query+document pairs with higher accuracy
Heuristic Scorers: Prioritize based on position, title match, freshness, or user profile
Purpose: Suppress false positives and boost the most useful answers
Often used with LLMs for re-ranking in RAG and legal search pipelines

Key Architectural Practices (with Real-World Lessons)

✅ Store embeddings alongside original text and metadata
→ Enables fallback keyword search, filterable results, and traceable audit trails.
Used in: Salesforce Einstein — supports semantic and lexical retrieval in enterprise CRM with user-specific filters.

✅ Log search-click feedback loops
→ Use post-click data to re-rank results over time.
Used in: Shopify — improved precision by learning actual user paths after product search.

✅ Use hybrid search as the default
→ Pure vector often retrieves plausible but irrelevant text.
Used in: Voiceflow AI — combining keyword match with embedding similarity reduced unresolved support cases by 35%.

✅ Re-evaluate embedding models every 3–6 months
→ Models degrade as usage context shifts.
Seen in: GitHub Copilot — regular retraining required as codebase evolves.

✅ Run offline re-ranking experiments
→ Don’t trust similarity scores blindly—test on real query-result pairs.
Used in: Harvey AI — false positives in legal Q&A dropped after introducing graph-based reranking layer.

🧩 Use Case Patterns: Architectures by Purpose

Semantic search isn’t one-size-fits-all. Different problem domains call for different architectural patterns. Below is a compact guide to five proven setups, each aligned with a specific goal and backed by production examples.

Pattern	Architecture	Real Case / Result
Enterprise Search	Hybrid search + user modeling	Salesforce Einstein: −50% click depth in internal CRM search
RAG-based Systems	Dense retriever + LLM generation	GitHub Copilot: 46% of developer code generated via contextual completion
Recommendation Engines	Vector similarity + collaborative signals	Shopify: 700+ orders in 90 days from semantic product search
Monitoring & Support	Real-time semantic + event ranking	Voiceflow AI: 35% drop in unresolved support tickets
Semantic ETL / Indexing	Auto-labeling + semantic clustering	Tempus AI: structure unstructured medical notes for retrieval across 20+ hospitals

🧠 Enterprise Search

Employees often can’t find critical internal information—even when it exists. Hybrid systems help match queries to phrased variations, acronyms, and internal jargon.

Query: “Leads in NY Q2”
Result: Finds “All active prospects in New York during second quarter,” even if phrased differently
Example: Salesforce uses hybrid vector + text with user-specific filters (location, role, permissions)

💬 RAG-based Systems

When search must become language generation, Retrieval-Augmented Generation (RAG) pipelines retrieve semantic matches and feed them into LLMs for synthesis.

Query: “Explain why the user’s API key stopped working”
System: Retrieves changelog, error logs → generates full explanation
Example: GitHub Copilot uses embedding-powered retrieval across billions of code fragments to auto-generate dev suggestions.

🛒 Recommendation Engines

Semantic search improves discovery when users don’t know what to ask—or use unexpected phrasing.

Query: “Gift ideas for someone who cooks”
Matches: “chef knife,” “cast iron pan,” “Japanese cookbook”
Example: Shopify’s implementation led to a direct sales lift—Rakuten saw a +5% GMS boost.

📞 Monitoring & Support

Support systems use semantic matching to find answers in ticket archives, help docs, or logs—even with vague or novel queries.

Query: “My bot isn’t answering messages after midnight”
Matches: archived incidents tagged with “off-hours bug”
Example: Voiceflow AI reduced unresolved queries by 35% using real-time vector retrieval + fallback heuristics.

🧬 Semantic ETL / Indexing

Large unstructured corpora—e.g., medical notes, financial reports—can be semantically indexed to enable fast filtering and retrieval later.

Source: Clinical notes, radiology reports
Process: Auto-split, embed, cluster, label
Example: Tempus AI created semantic indexes of medical data across 65 academic centers, powering search for treatment and diagnosis pathways.

🛠️ Tooling Guide: What to Choose and When

Choosing the right tool depends on scale, latency needs, domain complexity, and whether you’re optimizing for speed, cost, or control. Below is a guide to key categories—embedding models, vector databases, and orchestration frameworks.

Embedding Models

OpenAI text-embedding-3-large

General-purpose, high-quality, plug-and-play
Ideal for teams prioritizing speed over control
Used by: Notion AI for internal semantic document search

Cohere Embed v3

Multilingual (100+ languages), efficient, with compression-aware training
Strong in global support centers or multilingual corpora
Used by: Cohere’s own internal customer support bots

BGE-M3 / Mistral-E5

Open-source, high-performance models, require your own infrastructure
Better suited for teams with GPU resources and need for fine-tuning
Used in: Voiceflow AI for scalable customer support retrieval

Vector Databases

DB	Best For	Weakness	Known Use
Qdrant	Real-time search, metadata filters	Smaller ecosystem	FragranceBuy semantic product search
Pinecone	SaaS scaling, enterprise ops-free	Expensive, less customizable	Harvey AI for legal Q&A retrieval
Weaviate	Multimodal search, LLM integration	Can be memory-intensive	Tempus AI for healthcare document indexing
pgvector	PostgreSQL-native, low-complexity use	Not optimal for >1M vectors	Internal tooling at early-stage startups

Chroma (optional)

Local, dev-focused, great for experimentation
Ideal for prototyping or offline use cases
Used in: R&D pipelines at AI startups and LangChain demos

Frameworks

Tool	Use If…	Avoid If…	Real Use
LangChain	You need fast prototyping and agent support	You require fine-grained performance control	Used in 100+ AI demos and open-source agents
LlamaIndex	Your data is document-heavy (PDFs, tables)	You need sub-200ms response time	Used in enterprise doc Q&A bots
Haystack	You want observability + long-term ops	You’re just testing MVP ideas	Deployed by enterprises using Qdrant and RAG
Semantic Kernel	You’re on Microsoft stack (Azure, Copilot)	You need light, cross-cloud tools	Used by Microsoft in enterprise copilots

🧠 Pro Tip: Mix-and-match works. Many real systems use OpenAI + pgvector for MVP, then migrate to Qdrant + BGE-M3 + Haystack at scale.

🚀 Deployment Patterns and Real Lessons

Most teams don’t start with a perfect architecture. They evolve—from quick MVPs to scalable production systems. Below are two reference patterns grounded in real-world cases.

MVP Phase: Fast, Focused, Affordable

Use Case: Internal search, small product catalog, support KB, chatbot context
Stack:

Embedding: OpenAI text-embedding-3-large (no infra needed)
Vector DB: pgvector on PostgreSQL
Framework: LangChain for simple retrieval and RAG routing

🧪 Real Case: FragranceBuy

A mid-size e-commerce site deployed semantic product search using pgvector and OpenAI
Outcome: 3× conversion growth on desktop, 4× on mobile within 30 days
Cost: Minimal infra; no LLM hosting; latency acceptable for sub-second queries

🔧 What Worked:

Easy to launch, no GPU required
Immediate uplift from replacing brittle keyword filters

⚠️ Watch Out:

Lacks user feedback learning
pgvector indexing slows beyond ~1M vectors

Scale Phase: Hybrid, Observability, Tuning

Use Case: Large support system, knowledge base, multilingual corpora, product discovery
Stack:

Embedding: BGE-M3 or Cohere v3 (self-hosted or API)
Vector DB: Qdrant (filtering, high throughput) or Pinecone (SaaS)
Framework: Haystack (monitoring, pipelines, fallback layers)

🧪 Real Case: Voiceflow AI Support Search

Rebuilt internal help search with hybrid strategy (BM25 + embedding)
Outcome: 35% fewer unresolved support queries
Added re-ranker based on user click logs and feedback

🔧 What Worked:

Fast hybrid retrieval, with semantic fallback when keywords fail
Embedded feedback loop (logs clicks and corrections)

⚠️ Watch Out:

Requires tuning: chunk size, re-ranking rules, hybrid weighting
Embedding updates need versioning (to avoid relevance decay)

These patterns aren’t static—they evolve. But they offer a foundation: start small, then optimize based on user behavior and search drift.

⚠️ Pitfalls, Limitations & Anti-Patterns

Even good semantic search systems can fail—quietly, and in production. Below are common traps that catch teams new to this space, with real-life illustrations.

Overreliance on Vector Similarity (No Re-ranking)

Problem: Relying solely on cosine similarity between vectors often surfaces “vaguely related” content instead of precise answers.
Why: Vectors capture semantic neighborhoods, but not task-specific relevance or user context.
Fix: Use re-ranking—like BM25 + embedding hybrid scoring or learning-to-rank models.

🔎 Real Issue: GitHub Copilot without context filtering would suggest irrelevant completions. Their final system includes re-ranking via neighboring tab usage and intent analysis.

Ignoring GDPR & Privacy Risks

Problem: Embeddings leak information. A vector can retain personal data even if the original text is gone.
Why: Dense vectors are hard to anonymize, and can’t be fully reversed—but can be probed.
Fix: Hash document IDs, store minimal metadata, isolate sensitive domains, avoid user PII in raw embeddings.

🔎 Caution: Healthcare or legal domains must treat embeddings as sensitive. Microsoft Copilot and Tempus AI implement access controls and data lineage for this reason.

Skipping Hybrid Search (Because It Seems “Messy”)

Problem: Many teams disable keyword search to “go all in” on vectors, assuming it’s smarter.
Why: Some queries still require precision that embeddings can’t guarantee.
Fix: Use Reciprocal Rank Fusion (RRF) or weighted ensembles to blend text and vector results.

🔎 Real Result: Voiceflow AI initially used vector-only, but missed exact-matching FAQ queries. Adding BM25 boosted retrieval precision.
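
Reciprocal Rank Fusion is small enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of document IDs and uses `k = 60`, a commonly used default. The document IDs and the two input rankings are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over all lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from a keyword and a vector retriever.
bm25_results = ["faq-12", "doc-4", "doc-9"]
vector_results = ["doc-4", "doc-7", "faq-12"]
fused = reciprocal_rank_fusion([bm25_results, vector_results])
print(fused)
```

Because RRF only looks at ranks, it sidesteps the problem of BM25 and cosine scores living on different scales. Documents that appear in both lists rise to the top, while results found by only one retriever are still kept.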

Not Versioning Embeddings

Problem: Embeddings drift—newer model versions represent meaning differently. If you replace your model without rebuilding the index, quality decays.
Why: Same text → different vector. Queries embedded with the new model are compared against documents embedded with the old one → corrupted retrieval.
Fix: Version each embedding model, regenerate entire index when switching.

🔎 Real Case: An e-commerce site switched its embedding model from OpenAI’s text-embedding-ada-002 to text-embedding-3-large without reindexing, and saw a sudden drop in search quality. Rolling back solved it.
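
One way to enforce this is to tag the index with the model version it was built with and refuse anything that does not match. The sketch below is an in-memory illustration, not any particular vector database's API. The class, the exception, and the version strings (`embed-v1`, `embed-v2`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class EmbeddingVersionMismatch(ValueError):
    """Raised when a vector comes from a different model than the index."""


@dataclass
class VersionedIndex:
    model_version: str
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def _check(self, model_version: str) -> None:
        if model_version != self.model_version:
            raise EmbeddingVersionMismatch(
                f"index built with {self.model_version!r}, got {model_version!r}; "
                "rebuild the index before mixing models"
            )

    def add(self, doc_id: str, vector: List[float], model_version: str) -> None:
        """Store a document vector only if it matches the index's model version."""
        self._check(model_version)
        self.vectors[doc_id] = vector

    def check_query(self, model_version: str) -> None:
        """Call before searching with a query vector from the given model."""
        self._check(model_version)


index = VersionedIndex(model_version="embed-v1")
index.add("doc-1", [0.1, 0.2], model_version="embed-v1")
try:
    index.check_query(model_version="embed-v2")
except EmbeddingVersionMismatch as exc:
    print("blocked:", exc)
```

In a real system the version tag would typically live in the collection name or in per-vector metadata, so a new model gets a fresh index built alongside the old one before traffic is switched over.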

Misusing Dense Retrieval for Structured Filtering

Problem: Some teams try to replace every search filter with semantic matching.
Why: Dense search is approximate. If you want “all files after 2022” or “emails tagged ‘legal’”, use metadata filters, not cosine similarity.
Fix: Combine semantic scores with strict filter logic (like SQL WHERE clauses).

🔎 Lesson: Harvey AI layered dense retrieval with graph-based constraints for legal clause searches—only then did false positives drop.
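
As a minimal sketch of "strict filter first, semantic score second", the function below applies exact metadata conditions and only then ranks the surviving documents by cosine similarity. The record layout (`date`, `tags`, `vector`), the function names, and the sample documents are assumptions for illustration. A production system would push the filter into the database rather than loop in Python.

```python
import math
from datetime import date
from typing import Any, Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: List[float],
    docs: List[Dict[str, Any]],
    after: Optional[date] = None,
    tag: Optional[str] = None,
    top_k: int = 3,
) -> List[str]:
    """Apply hard metadata filters, then rank the remaining docs by similarity."""
    candidates = [
        d for d in docs
        if (after is None or d["date"] > after) and (tag is None or tag in d["tags"])
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


# Hypothetical documents with toy 2-dimensional vectors.
docs = [
    {"id": "contract-2021", "date": date(2021, 5, 1), "tags": {"legal"}, "vector": [1.0, 0.0]},
    {"id": "contract-2023", "date": date(2023, 3, 1), "tags": {"legal"}, "vector": [0.8, 0.6]},
    {"id": "memo-2023", "date": date(2023, 6, 1), "tags": {"hr"}, "vector": [0.9, 0.1]},
    {"id": "nda-2024", "date": date(2024, 1, 15), "tags": {"legal"}, "vector": [0.6, 0.8]},
]
hits = filtered_search([1.0, 0.0], docs, after=date(2022, 12, 31), tag="legal")
print(hits)
```

The 2021 contract is the closest vector to the query, yet it never appears in the results, because the date condition is a hard constraint rather than something the similarity score is asked to approximate.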

🧪 Bonus Tip: Monitor What Users Click, Not Just What You Return

Embedding quality is hard to evaluate offline. Use logs of real searches and which results users clicked. Over time, these patterns train re-rankers and highlight drift.
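
One simple way to turn click logs into a number you can track is mean reciprocal rank (MRR) of the clicked result. The sketch below assumes a hypothetical log format: each entry holds the list of returned IDs and the ID the user clicked, or `None` if they clicked nothing. A falling MRR week over week is one possible signal of drift.

```python
from typing import List, Optional, Tuple

ClickLog = List[Tuple[List[str], Optional[str]]]


def mean_reciprocal_rank(click_log: ClickLog) -> float:
    """Average of 1 / rank of the clicked result; searches with no click count as 0."""
    if not click_log:
        return 0.0
    total = 0.0
    for returned_ids, clicked_id in click_log:
        if clicked_id in returned_ids:
            total += 1.0 / (returned_ids.index(clicked_id) + 1)
    return total / len(click_log)


# Hypothetical log of three searches.
log: ClickLog = [
    (["a", "b", "c"], "a"),   # clicked the top result
    (["a", "b", "c"], "c"),   # had to scroll to the third result
    (["a", "b"], None),       # abandoned the search
]
mrr = mean_reciprocal_rank(log)
print(round(mrr, 3))
```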

📌 Summary & Strategic Recommendations

Semantic search isn’t just another search plugin—it’s becoming the default foundation for AI systems that need to understand, not just retrieve.

Here’s what you should take away:

Use Semantic Search Where Meaning > Keywords

Complex catalogs (“headphones” vs. “noise-cancelling audio gear”)
Legal, medical, financial documents where synonyms are unpredictable
Internal enterprise search where wording varies by department or region

🧪 Real ROI: $31,754 per employee/year saved in enterprise productivity
🧪 Example: Harvey AI reached 94.8% accuracy in legal document Q&A only after semantic + custom graph fusion

Default to Hybrid, Unless Latency Is Critical

BM25 + embeddings outperform either alone in most cases
If real-time responses aren’t required, hybrid gives the best coverage and robustness

🧪 Real Case: Voiceflow AI improved ticket resolution by combining semantic ranking with keyword fallback

Choose Tools by Scale × Complexity × Control

Need	Best Tooling Stack
Fast MVP	OpenAI + pgvector + LangChain
Production RAG	Cohere or BGE-M3 + Qdrant + Haystack
Microsoft-native	OpenAI + Semantic Kernel + Azure
Heavy structure	LlamaIndex + metadata filters

🧠 Don’t get locked into your first tool—plan for embedding upgrades and index regeneration.

Treat Semantic Indexing as AI Infrastructure

Search, RAG, chatbots, agents—they all start with high-quality indexing.

Poor chunking → irrelevant answers
Wrong embeddings → irrelevant documents
Missing metadata → unfilterable output

🧪 Example: Salesforce Einstein used user-role metadata in its index to cut irrelevant clicks by 50%.
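
To illustrate how chunking and metadata fit together at indexing time, here is a small sketch that splits text into overlapping word windows and copies document-level metadata onto every chunk. The chunk size, overlap, and metadata fields (such as `role`) are hypothetical. Real pipelines often split on tokens or sentences instead of whitespace-separated words.

```python
from typing import Any, Dict, List


def chunk_with_metadata(
    text: str,
    metadata: Dict[str, Any],
    chunk_size: int = 200,
    overlap: int = 40,
) -> List[Dict[str, Any]]:
    """Split text into overlapping word windows, each carrying the doc metadata."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[Dict[str, Any]] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({"text": " ".join(window), "chunk_index": len(chunks), **metadata})
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical document: ten placeholder words, tiny windows for readability.
text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_with_metadata(text, {"doc_id": "kb-1", "role": "support"}, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk["chunk_index"], chunk["text"], chunk["role"])
```

Because every chunk carries the same metadata as its parent document, a later query can filter by a field like `role` before any similarity scoring happens.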

📈 What’s Coming

Multimodal Search: text + image + audio embeddings (e.g., Titan, CLIP)
Agentic Retrieval: query breakdown, multi-step search, tool use
Self-Adaptive Indexes: auto-retraining, auto-chunking, drift tracking