Add IBM Docling as a Python FastAPI microservice for high-quality document
parsing with table structure recognition (TableFormer ~94% accuracy) and
OCR support, replacing pdf-parse/mammoth as the primary text extractor.
Architecture:
- New docling-service (Python FastAPI, port 3007) in Docker network
- knowledge-service calls docling-service via HTTP POST multipart/form-data
- Graceful fallback: if Docling fails, falls back to pdf-parse/mammoth
- Text/Markdown files skip Docling (no benefit for plain text)
Changes:
- New: packages/services/docling-service/ (main.py, Dockerfile, requirements.txt)
- docker-compose.yml: add docling-service, wire DOCLING_SERVICE_URL to
knowledge-service, add missing FILE_SERVICE_URL to conversation-service
- text-extraction.service.ts: inject ConfigService, add extractViaDocling()
with automatic fallback to legacy extractors
- .env.example: add FILE_SERVICE_PORT/URL and DOCLING_SERVICE_PORT/URL
Inter-service communication map:
conversation-service → file-service (FILE_SERVICE_URL, attachments)
conversation-service → knowledge-service (KNOWLEDGE_SERVICE_URL, RAG)
knowledge-service → docling-service (DOCLING_SERVICE_URL, document parsing)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>