Commit Graph

4 Commits

Author SHA1 Message Date
hailin 73dee93d19 feat(docling): persist model cache via Docker volume
- Add docling_models volume mounted at /models in container
- Set HF_HOME=/models/huggingface at runtime (via docker-compose env)
- Models download once → persist in volume → survive container rebuilds
- Build-time preload uses || so a failed download (e.g. no network) does not block the build
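
As a rough illustration of the HF_HOME setting above, the following sketch shows how a service might check which cache root is in effect and whether it already holds files. The helper names (resolve_hf_home, cache_is_populated) are hypothetical and not part of this commit; the default path assumes XDG_CACHE_HOME is unset.

```python
import os
from pathlib import Path

# Typical Hugging Face default when HF_HOME is not set
# (assumption: XDG_CACHE_HOME is also unset).
DEFAULT_HF_HOME = Path.home() / ".cache" / "huggingface"


def resolve_hf_home(env=os.environ) -> Path:
    """Return the cache root implied by HF_HOME, or the default."""
    value = env.get("HF_HOME")
    return Path(value) if value else DEFAULT_HF_HOME


def cache_is_populated(root: Path) -> bool:
    """True if the directory exists and contains at least one entry."""
    return root.is_dir() and any(root.iterdir())


if __name__ == "__main__":
    root = resolve_hf_home()
    print(f"model cache root: {root}, populated: {cache_is_populated(root)}")
```

With HF_HOME=/models/huggingface set by docker-compose, the resolved root sits inside the mounted volume, so a populated cache would be expected to survive container rebuilds.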

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 07:18:14 -08:00
hailin 764613bd86 fix(docling): use standalone script for model pre-download
The inline Python one-liner had syntax errors (a try/except/finally
statement cannot be written on a single line). Move the logic to
scripts/preload_models.py for reliable execution.
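
The syntax problem can be reproduced without Docling. The sketch below is only an illustration: the two source strings are made-up examples, not the contents of the original one-liner or of scripts/preload_models.py.

```python
# A compound try statement squeezed onto one line with semicolons.
ONE_LINER = "try: x = 1; except Exception: pass"

# The same logic written as a normal multi-line block.
MULTI_LINE = (
    "try:\n"
    "    x = 1\n"
    "except Exception:\n"
    "    pass\n"
    "finally:\n"
    "    done = True\n"
)


def compiles(source: str) -> bool:
    """Return True if Python accepts the source, False on SyntaxError."""
    try:
        compile(source, "<preload>", "exec")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print("one-liner compiles:", compiles(ONE_LINER))
    print("multi-line compiles:", compiles(MULTI_LINE))
```

Only simple statements can be chained with semicolons, which is why a script file is the more reliable home for this kind of error handling.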

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 07:16:20 -08:00
hailin d725864cd6 fix(docling): pre-download models during Docker build
The DocumentConverter() constructor only sets up config; models are
downloaded lazily on the first convert() call. Fix by running an actual
PDF conversion during the build to trigger the HuggingFace model
download and populate the cache.
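
The lazy-loading behaviour described above can be pictured with a toy stand-in. LazyConverter and fake_download are hypothetical names for illustration only; this is not Docling's implementation.

```python
class LazyConverter:
    """Toy stand-in: construction is cheap, the first convert() loads models."""

    def __init__(self, loader):
        self._loader = loader  # stored, but not called yet
        self._models = None

    def convert(self, source: str) -> str:
        if self._models is None:
            # First real conversion triggers the (simulated) download.
            self._models = self._loader()
        return f"converted {source} with {self._models}"


calls = []


def fake_download() -> str:
    """Simulated model download that records each invocation."""
    calls.append("download")
    return "models"
```

Constructing LazyConverter(fake_download) leaves calls empty; only a convert() call appends to it, which mirrors why the build step has to run a real conversion rather than just instantiate the converter.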

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 07:13:54 -08:00
hailin 57d21526a5 feat(knowledge): add Docling document parsing microservice
Add IBM Docling as a Python FastAPI microservice for high-quality document
parsing with table structure recognition (TableFormer ~94% accuracy) and
OCR support, replacing pdf-parse/mammoth as the primary text extractor.

Architecture:
- New docling-service (Python FastAPI, port 3007) in Docker network
- knowledge-service calls docling-service via HTTP POST multipart/form-data
- Graceful fallback: if Docling fails, knowledge-service falls back to pdf-parse/mammoth
- Text/Markdown files skip Docling (no benefit for plain text)
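
The routing and fallback rules above can be sketched as follows. The real logic lives in text-extraction.service.ts (TypeScript); this Python version is only an approximation, and the function name, the callable parameters, and the suffix set are assumptions made for illustration.

```python
from pathlib import PurePath
from typing import Callable

# Assumption: which suffixes count as plain text is not stated in the commit.
PLAIN_TEXT_SUFFIXES = {".txt", ".md"}

Extractor = Callable[[str, bytes], str]


def extract_text(filename: str, data: bytes,
                 via_docling: Extractor, legacy: Extractor) -> str:
    """Skip Docling for plain text; otherwise try Docling, then fall back."""
    if PurePath(filename).suffix.lower() in PLAIN_TEXT_SUFFIXES:
        # No benefit from document parsing for plain text.
        return data.decode("utf-8", errors="replace")
    try:
        return via_docling(filename, data)
    except Exception:
        # Any Docling failure falls through to the legacy extractor.
        return legacy(filename, data)
```

Passing the extractors in as callables keeps the sketch independent of HTTP details; in the service described here, via_docling would correspond to the HTTP POST to docling-service and legacy to pdf-parse/mammoth.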

Changes:
- New: packages/services/docling-service/ (main.py, Dockerfile, requirements.txt)
- docker-compose.yml: add docling-service, wire DOCLING_SERVICE_URL to
  knowledge-service, add missing FILE_SERVICE_URL to conversation-service
- text-extraction.service.ts: inject ConfigService, add extractViaDocling()
  with automatic fallback to legacy extractors
- .env.example: add FILE_SERVICE_PORT/URL and DOCLING_SERVICE_PORT/URL

Inter-service communication map:
  conversation-service → file-service (FILE_SERVICE_URL, attachments)
  conversation-service → knowledge-service (KNOWLEDGE_SERVICE_URL, RAG)
  knowledge-service → docling-service (DOCLING_SERVICE_URL, document parsing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 05:24:10 -08:00