DocuMind AI

An end-to-end Computer Vision and RAG pipeline that transforms static documents into actionable intelligence using YOLOv8 and Llama 3.

Vision: YOLOv8 / OpenCV

Intelligence: Llama 3 / RAG

Vector DB: FAISS / Qdrant

Backend: FastAPI / Docker

System Architecture

The pipeline uses a decoupled architecture: **FastAPI** handles high-concurrency requests, while the **CV Pipeline** processes documents asynchronously, so a request returns immediately instead of blocking on document processing.
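The decoupling described above can be sketched with nothing but the standard library. This is a minimal illustration, not the project's actual code: it uses an `asyncio.Queue` as a stand-in for the hand-off between the FastAPI request handler and the CV pipeline, and the names `JOBS`, `submit_document`, `run_cv_pipeline`, and `worker` are all hypothetical.

```python
import asyncio
import uuid

# Hypothetical in-memory job store; a real deployment would more likely
# use a database or a message broker.
JOBS: dict[str, dict] = {}


async def run_cv_pipeline(document: bytes) -> dict:
    """Placeholder for the slow CV stage (layout analysis, OCR, tables)."""
    await asyncio.sleep(0.05)  # simulate processing time
    return {"num_bytes": len(document)}


async def worker(queue: asyncio.Queue) -> None:
    """Pull queued jobs and process them independently of the request path."""
    while True:
        job_id, document = await queue.get()
        JOBS[job_id]["status"] = "processing"
        try:
            JOBS[job_id]["result"] = await run_cv_pipeline(document)
            JOBS[job_id]["status"] = "done"
        except Exception as exc:  # record the failure instead of killing the worker
            JOBS[job_id]["status"] = "failed"
            JOBS[job_id]["error"] = str(exc)
        finally:
            queue.task_done()


async def submit_document(queue: asyncio.Queue, document: bytes) -> str:
    """What a request handler would do: enqueue the job and return its ID."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "result": None}
    await queue.put((job_id, document))
    return job_id


async def demo() -> tuple[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue))

    job_id = await submit_document(queue, b"%PDF-1.7 example bytes")
    status_at_submit = JOBS[job_id]["status"]  # caller gets control back here

    await queue.join()  # wait for the background work (demo only)
    worker_task.cancel()
    await asyncio.gather(worker_task, return_exceptions=True)
    return status_at_submit, JOBS[job_id]["status"]


if __name__ == "__main__":
    print(asyncio.run(demo()))  # ('queued', 'done')
```

The caller receives a job ID while the document is still only queued, and would poll for the result later. In a FastAPI service the same shape is commonly built with background tasks or an external task queue. Which mechanism DocuMind AI uses is not stated in the source.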

Phase 1: Visual Perception

1. Layout Analysis

YOLOv8 detects bounding boxes for headers, paragraphs, and tables to preserve the document hierarchy.

2. Intelligent OCR

PaddleOCR extracts raw text while keeping each text span's spatial coordinates, so the text can be mapped back to the layout region it came from.

3. Table Structuring

Custom algorithms convert visual table grids into clean, structured JSON for LLM ingestion.
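The three steps above can be sketched end to end on toy data. This is an illustrative sketch under stated assumptions, not the project's custom algorithm: layout regions and OCR words are hard-coded here rather than produced by YOLOv8 or PaddleOCR, each word is assigned to the region containing its center, and table rows are recovered by clustering words on their vertical position. The names `Box`, `assign_words`, `table_to_records`, and `row_tolerance` are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

    def contains(self, point: tuple[float, float]) -> bool:
        x, y = point
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


def assign_words(regions, words):
    """Attach each OCR word to the first layout region containing its center."""
    grouped = {i: [] for i in range(len(regions))}
    for text, box in words:
        for i, (_label, region_box) in enumerate(regions):
            if region_box.contains(box.center()):
                grouped[i].append((text, box))
                break
    return grouped


def table_to_records(cells, row_tolerance=10.0):
    """Cluster cells into rows by y-center, sort each row left to right,
    and treat the first row as the header."""
    rows = []
    for text, box in sorted(cells, key=lambda c: c[1].center()[1]):
        y = box.center()[1]
        if rows and abs(y - rows[-1]["y"]) <= row_tolerance:
            rows[-1]["cells"].append((text, box))
        else:
            rows.append({"y": y, "cells": [(text, box)]})
    grid = [[t for t, _ in sorted(r["cells"], key=lambda c: c[1].x1)] for r in rows]
    if not grid:
        return []
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


# Toy stand-ins for detector output (label, box) and OCR output (text, box).
regions = [("header", Box(0, 0, 200, 30)), ("table", Box(0, 40, 200, 120))]
words = [
    ("Invoice", Box(10, 5, 80, 25)),
    ("Price", Box(110, 45, 160, 60)),
    ("Item", Box(10, 45, 60, 60)),
    ("1.50", Box(110, 70, 150, 85)),
    ("Pen", Box(10, 70, 50, 85)),
    ("Ink", Box(10, 95, 50, 110)),
    ("4.00", Box(110, 95, 150, 110)),
]

grouped = assign_words(regions, words)
records = table_to_records(grouped[1])
print(json.dumps(records))
# [{"Item": "Pen", "Price": "1.50"}, {"Item": "Ink", "Price": "4.00"}]
```

Row clustering by a fixed tolerance only works for simple, axis-aligned grids. Merged cells, multi-line cells, or skewed scans would need a more careful approach.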

Phase 2: Semantic Retrieval

Embedding Model: Sentence-Transformers

Vector Storage: FAISS / Qdrant

LLM Engine: Llama 3 (8B) / GPT-4 API

Note: The system adds a Cross-Encoder re-ranking step after the initial vector search, which the project reports brings retrieval relevance to 95%+ for complex user queries.
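The retrieve-then-re-rank pattern can be illustrated with a small, dependency-free sketch. Everything below is an assumption made for illustration: the embeddings are hand-written toy vectors rather than Sentence-Transformers output, a plain dictionary stands in for FAISS or Qdrant, and `token_overlap` is a toy stand-in for a real cross-encoder, which would score each (query, passage) pair with a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k):
    """Stage 1: cheap vector search returning the k nearest document IDs."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


def rerank(query, candidates, texts, score_pair, top_n):
    """Stage 2: re-score each (query, passage) pair and keep the best top_n."""
    ranked = sorted(candidates, key=lambda d: score_pair(query, texts[d]), reverse=True)
    return ranked[:top_n]


def token_overlap(query, passage):
    """Toy stand-in for a cross-encoder: share of query tokens in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


# Hypothetical corpus with hand-written toy embeddings.
texts = {
    "a": "total amount due on the invoice",
    "b": "invoice payment terms and late fees",
    "c": "company holiday schedule",
}
index = {"a": [0.8, 0.6, 0.0], "b": [0.9, 0.4, 0.1], "c": [0.0, 0.1, 0.9]}

query = "invoice total amount"
query_vec = [1.0, 0.3, 0.0]

candidates = retrieve(query_vec, index, k=2)
best = rerank(query, candidates, texts, token_overlap, top_n=1)
print(candidates, best)  # ['b', 'a'] ['a']
```

In this toy run the vector search ranks passage `b` first, and the pairwise scorer promotes `a`, which actually answers the query. That reordering is the role the re-ranking step plays. The sketch says nothing about the relevance figure quoted above.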

User Interface (React + Tailwind)
