| title | Chapter 12 — Limitations, Open Problems, Future Work |
|---|---|
| chapter | 12 |
| lang | en |
| license | CC-BY-NC-4.0 |
| last_modified_at | 2026-04-22T03:40:36Z |
The honest chapter. What we haven't solved, what we might reverse, what's coming.
At ~100k embeddings per tenant, the pgvector HNSW index keeps P95 latency under 120 ms. Past 5M vectors, performance degrades. If any tenant reaches 5M, evaluate a migration to Qdrant or Milvus, or sharding.
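A minimal sketch of how the 5M trigger could be watched, assuming per-tenant vector counts are already available as a dictionary. The function name, the `warn_fraction` margin, and the tenant names are hypothetical; only the 5M figure comes from this chapter.

```python
MIGRATION_THRESHOLD = 5_000_000  # vector count past which this chapter reports degradation


def tenants_needing_review(vector_counts, warn_fraction=0.8):
    """Return tenant ids whose vector count is at or above warn_fraction of the threshold.

    vector_counts: dict mapping tenant id -> number of stored vectors.
    warn_fraction: hypothetical early-warning margin, not a measured value.
    """
    limit = MIGRATION_THRESHOLD * warn_fraction
    return sorted(t for t, n in vector_counts.items() if n >= limit)


counts = {"tenant_a": 100_000, "tenant_b": 4_200_000, "tenant_c": 5_500_000}
flagged = tenants_needing_review(counts)
```

Flagging before the threshold leaves time to evaluate Qdrant, Milvus, or sharding rather than reacting after latency has already degraded.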
zh_parser (SCWS-based) misses new words, brand names, and product names. We patch the gaps with synonym dictionaries, which is an ongoing maintenance burden. An experimental alternative is LLM-time tokenization: better accuracy, but +100 ms latency and higher cost.
Today the system handles text only. Real knowledge is mixed:
- Product photo + text description
- Construction workflow diagram
- Cosmetic SDS PDF with tables and chemical structures
CLIP-style multimodal embedding is experimental, targeted for 2026 Q3.
We deploy only in AWS Tokyo. EU compliance requires an EU region, and the current Docker Compose architecture doesn't support multi-region deployment; that needs a K8s refactor.
The LLM-authored Wiki has systematic biases: Western-centric examples, inconsistent transliteration, and knowledge that is stale past the training cutoff. Our mitigations (a strict "chunks-only" instruction, a cross-chunk consistency lint) partially help but don't fix the root cause.
The RRF paper suggests k = 60 but gives no theoretical justification. We haven't run enough A/B tests to validate that value for Chinese. Worth revisiting.
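The constant in question is the k in Reciprocal Rank Fusion (RRF), where each document scores the sum of 1 / (k + rank) over the ranked lists it appears in. The sketch below is a generic RRF implementation, not our production code; the function name and the toy document ids are hypothetical. It shows why k matters: a small k rewards a single top hit, while a large k rewards agreement across lists.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; larger values flatten the gap between ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# "b" is ranked 4th by both retrievers; "a" is ranked 1st by only one of them.
vector_hits = ["a", "x", "y", "b"]
keyword_hits = ["p", "q", "r", "b"]

fused_k60 = rrf_fuse([vector_hits, keyword_hits], k=60)
fused_k1 = rrf_fuse([vector_hits, keyword_hits], k=1)
```

With k = 60 the document both retrievers agree on comes first; with k = 1 the single top hit wins. An A/B test over k is effectively a test of which behavior suits the corpus.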
GPT-4o-mini misclassifies the intent of vague openings ("I was wondering..."). When a knowledge question is routed to smalltalk, the customer gets a polite non-answer. Fix direction: an expanded training set, a confidence threshold, and dual-path handling on low confidence.
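One way the confidence threshold and dual-path idea could fit together, as a sketch only. It assumes the classifier returns a label plus a confidence score, and it reads "dual-path" as running both pipelines when confidence is low. The `IntentResult` type, the function name, and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntentResult:
    label: str         # e.g. "knowledge" or "smalltalk"
    confidence: float  # classifier confidence in [0, 1]


def choose_paths(result: IntentResult, threshold: float = 0.8) -> list[str]:
    """Return the pipelines to run for one incoming message.

    threshold is a placeholder value; it would need tuning on real traffic.
    """
    if result.confidence >= threshold:
        return [result.label]
    # Low confidence: run both paths so a knowledge question is not silently
    # answered as smalltalk.
    return ["knowledge", "smalltalk"]


confident = choose_paths(IntentResult("smalltalk", 0.95))
unsure = choose_paths(IntentResult("smalltalk", 0.55))
```

The trade-off is cost: every low-confidence message pays for retrieval even when it turns out to be smalltalk.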
English NLI (DeBERTa-v3-NLI) is excellent. Chinese NLI quality varies; we use mDeBERTa-multi + human audit at ~85% accuracy. Production-grade Chinese NLI is an open problem.
We currently price by message count, but actual cost per message varies:
- Simple CS ask: USD 0.001
- PIF regulatory citation: USD 0.02
- NLI + Rerank high-precision: USD 0.05
High-precision tenants underpay; low-precision tenants overpay. Precision-tier pricing is planned for 2026 Q3.
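The mismatch is easy to see with arithmetic. The sketch below uses the three per-message costs listed above; the flat price of USD 0.01 per message is a hypothetical placeholder, not our actual price, and the dictionary keys are illustrative names.

```python
# Per-message costs from this chapter, in USD.
COST_PER_MESSAGE_USD = {
    "simple_cs": 0.001,
    "pif_citation": 0.02,
    "nli_rerank": 0.05,
}


def margin_per_message(flat_price_usd):
    """Margin (price minus cost) for each message type under one flat price."""
    return {kind: flat_price_usd - cost for kind, cost in COST_PER_MESSAGE_USD.items()}


# Hypothetical flat price, chosen only to illustrate the spread.
margins = margin_per_message(0.01)
for kind, margin in margins.items():
    print(f"{kind}: {margin:+.3f} USD per message")
```

Under any single flat price between the cheapest and the most expensive message type, some tenants are subsidized by others, which is the case for tiering by precision.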
Shared infrastructure is wonderful, but usage such as "GEO-triggered RAG repair token usage" is hard to attribute. Currently, GEO API calls count against the RAG tenant's quota, which is financially imprecise.
Upgrading the embedding model (text-embedding-3-small → text-embedding-3-large) requires a full re-embed. For a large tenant that costs USD 2,000+. We've deferred such upgrades, so tech debt accumulates.
How often should the Wiki be compiled?
- Daily: waste (most pages unchanged)
- Monthly: stale for regulated domains
- Event-driven: "change" itself is hard to define
Current practice: fingerprint + weekly lint + manual trigger. There is no clean theory behind it.
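A sketch of what the fingerprint step could look like, assuming each Wiki page is compiled from a set of source chunks and that the last compiled fingerprint is stored per page. The function names, the data shapes, and the choice to ignore chunk order are all assumptions for illustration, not a description of our implementation.

```python
import hashlib


def fingerprint(chunks):
    """Stable SHA-256 over a page's source chunks, insensitive to chunk order."""
    h = hashlib.sha256()
    for chunk in sorted(chunks):
        data = chunk.encode("utf-8")
        # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def pages_to_recompile(sources, stored):
    """Return page ids whose current fingerprint differs from the stored one.

    sources: page id -> list of chunk strings.
    stored: page id -> fingerprint recorded at the last compile.
    """
    return sorted(p for p, chunks in sources.items() if stored.get(p) != fingerprint(chunks))


stored = {"pricing": fingerprint(["plan A", "plan B"])}
sources = {"pricing": ["plan B", "plan A"], "faq": ["how to reset password"]}
stale = pages_to_recompile(sources, stored)
```

A raw hash treats any edit, including a whitespace fix, as a change. That is one concrete form of the "change is hard to define" problem noted in the list above.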
Customer says "your website says CEO is Bob." RAG Wiki says "Alice." Who wins?
- Trust RAG: maybe the customer was spoofed
- Trust customer: maybe our system is stale
This is a trust chain problem with no engineering answer yet.
With context windows of 200k tokens (Claude) and 2M tokens (Gemini), it is tempting to "stuff everything in the prompt." Our position: RAG doesn't die, it mutates.
- Cost: 200k input tokens per query → USD 0.5+, unsustainable
- "Lost in the middle": LLMs lose focus in ultra-long contexts
- Permission control: a multi-tenant system can't put every tenant's data into one prompt
L1 Wiki becomes the tool to align LLM attention precisely, rather than a substitute for vector retrieval.
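The cost point can be checked with a back-of-the-envelope calculation. The sketch below assumes an input price of USD 2.50 per million tokens, which is simply the rate implied by "200k input → USD 0.5" and not a quoted vendor price. The 4k-token retrieval prompt is likewise a hypothetical comparison point.

```python
def input_cost_usd(input_tokens, usd_per_million_input_tokens):
    """Input-side cost of one query; output tokens are ignored."""
    return input_tokens / 1_000_000 * usd_per_million_input_tokens


# Assumed rate: back-solved from this chapter's USD 0.5 figure for 200k tokens.
RATE = 2.5

stuffed = input_cost_usd(200_000, RATE)   # whole corpus in the prompt
retrieved = input_cost_usd(4_000, RATE)   # hypothetical retrieval-sized prompt
ratio = stuffed / retrieved
```

Under these assumptions the stuffed prompt costs about 50 times more per query than the retrieval-sized one, before accounting for latency or the attention problems listed above.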
Text Wiki is natural. What is a Wiki for images / video / audio?
- Formula photo → OCR + visual description?
- Construction video → timeline of events?
- Meeting recording → summary + speaker separation?
No unified answer.
Tentative (subject to market feedback):
| Quarter | Item | Priority |
|---|---|---|
| 2026 Q2 | Multimodal embedding (CLIP-style) | High |
| 2026 Q2 | Rerank default-on evaluation | Medium |
| 2026 Q2 | GEO ↔ RAG Wiki patch API launch | High |
| 2026 Q3 | Precision-tier pricing | High |
| 2026 Q3 | EU region deployment (K8s) | Medium |
| 2026 Q3 | Japanese NLI self-training | Medium |
| 2026 Q4 | Long-context + Wiki hybrid strategy | Medium |
| 2026 Q4 | Self-hosted edition | Low |
This book is a living document: minor versions each quarter, major annually. GitHub Issues capture reader feedback. Updates in CHANGELOG.md.
- pgvector needs migration plan past 5M vectors
- Chinese tokenization, multimodal, multi-region are main engineering gaps
- LLM bias in Wiki compile, RRF k=60 empirical, Chinese NLI quality are algorithmic gaps
- Pricing ↔ cost, cross-product attribution, breaking changes are commercial gaps
- Long-context LLMs reshape, don't end, RAG
- 2026 roadmap centers on multimodal, Wiki patch API, precision-tier pricing