DocuMind AI

An end-to-end Computer Vision and RAG pipeline that transforms static documents into actionable intelligence using YOLOv8 and Llama 3.

Vision: YOLOv8 / OpenCV
Intelligence: Llama 3 / RAG
Vector DB: FAISS / Qdrant
Backend: FastAPI / Docker

System Architecture

The pipeline uses a decoupled architecture: **FastAPI** handles concurrent requests, while the **CV Pipeline** processes documents asynchronously, so requests are not blocked while a document is being processed.
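To make the decoupling concrete, here is a minimal sketch using only Python's standard `asyncio` library rather than FastAPI itself. The names `jobs`, `submit`, `worker`, and `process_document` are hypothetical and chosen for illustration; the project's actual queueing mechanism is not described here and may differ.

```python
import asyncio
import uuid

# Hypothetical in-memory job store; a real deployment would likely use a
# persistent queue or database instead.
jobs: dict[str, dict] = {}


async def process_document(job_id: str, payload: bytes) -> None:
    """Stand-in for the CV pipeline (layout analysis, OCR, table parsing)."""
    jobs[job_id]["status"] = "processing"
    await asyncio.sleep(0.01)  # placeholder for the real, slower work
    jobs[job_id]["result"] = {"num_bytes": len(payload)}
    jobs[job_id]["status"] = "done"


async def submit(payload: bytes, queue: asyncio.Queue) -> str:
    """What an upload handler would do: enqueue the job and return its ID."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    await queue.put((job_id, payload))
    return job_id


async def worker(queue: asyncio.Queue) -> None:
    """Drains the queue independently of the request handlers."""
    while True:
        job_id, payload = await queue.get()
        try:
            await process_document(job_id, payload)
        finally:
            queue.task_done()


async def demo() -> tuple[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    job_id = await submit(b"fake document bytes", queue)
    status_at_submit = jobs[job_id]["status"]  # submit returned before processing
    await queue.join()  # wait until the worker has finished the job
    task.cancel()
    return status_at_submit, jobs[job_id]["status"]
```

In this sketch the caller gets a job ID back while the job is still queued, and would poll or be notified once the status changes. Processing time is hidden from the request path, not eliminated.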

Phase 1: Visual Perception

  1. Layout Analysis

     YOLOv8 detects bounding boxes for headers, paragraphs, and tables to preserve document hierarchy.

  2. Intelligent OCR

     PaddleOCR extracts raw text while maintaining spatial coordinates for accurate data mapping.

  3. Table Structuring

     Custom algorithms convert visual table grids into clean, structured JSON for LLM ingestion.
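The table-structuring step can be illustrated with a small sketch. It assumes each OCR cell arrives as a dict with a `text` string and a `box` of `(x0, y0, x1, y1)` pixel coordinates; that shape, the function names, and the `y_tol` row tolerance are assumptions for illustration, not the project's actual data model or algorithm.

```python
import json


def cells_to_rows(cells, y_tol=10.0):
    """Group OCR cells into rows by vertical centre, then order each row left to right."""
    ordered = sorted(cells, key=lambda c: (c["box"][1] + c["box"][3]) / 2)
    rows = []
    for cell in ordered:
        y_mid = (cell["box"][1] + cell["box"][3]) / 2
        # A cell joins the current row if its centre is within y_tol pixels
        # of the centre of the first cell placed in that row.
        if rows and abs(y_mid - rows[-1]["y"]) <= y_tol:
            rows[-1]["cells"].append(cell)
        else:
            rows.append({"y": y_mid, "cells": [cell]})
    return [
        [c["text"] for c in sorted(r["cells"], key=lambda c: c["box"][0])]
        for r in rows
    ]


def table_to_json(cells, y_tol=10.0):
    """Treat the first row as headers and emit one JSON record per body row."""
    rows = cells_to_rows(cells, y_tol)
    if not rows:
        return "[]"
    header, *body = rows
    # zip() silently truncates rows whose length differs from the header;
    # merged or missing cells would need extra handling.
    records = [dict(zip(header, row)) for row in body]
    return json.dumps(records, indent=2)
```

A row-clustering heuristic like this only handles simple grids. Tables with merged cells, multi-line cells, or skewed scans would need column alignment based on the detected grid lines as well.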

Phase 2: Semantic Retrieval

Embedding Model: Sentence-Transformers
Vector Storage: FAISS / Qdrant
LLM Engine: Llama 3 (8B) / GPT-4 API

Note: A Cross-Encoder re-ranking step re-scores the retrieved candidates against the query, targeting 95%+ retrieval relevance for complex user queries.
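The retrieve-then-re-rank pattern can be sketched in plain Python. This is an illustration under stated assumptions: the brute-force cosine search stands in for a FAISS or Qdrant lookup, and `score_fn` is a hypothetical pluggable scorer. In practice it could wrap a cross-encoder model that scores each (query, passage) pair jointly, for example via the `CrossEncoder.predict` method in Sentence-Transformers.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float],
             doc_vecs: Sequence[Sequence[float]],
             k: int) -> List[int]:
    """Stage 1: indices of the k passages closest to the query embedding.

    Brute-force search, standing in for a vector index lookup.
    """
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]


def rerank(query: str,
           docs: Sequence[str],
           candidate_ids: Sequence[int],
           score_fn: Callable[[str, str], float],
           top_n: int) -> List[int]:
    """Stage 2: re-score each (query, passage) pair and keep the best top_n."""
    scored = [(score_fn(query, docs[i]), i) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_n]]


def token_overlap(query: str, passage: str) -> float:
    """Toy scorer used only for illustration; NOT a cross-encoder."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))
```

The first stage is cheap and recall-oriented, so it can over-fetch (say, the top 20 to 50 passages). The second stage is slower per pair but sees the query and passage together, which is why it is applied only to the shortlist.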

User Interface (React + Tailwind)

Dashboard Preview