
Docs Indexer: documentation indexing tool

A Python utility that downloads documentation pages, cleans them, splits the content into semantic chunks, attaches metadata, and exports the result as structured JSON for LLM and RAG systems.

Python, Requests, BeautifulSoup, CLI

Project overview

Docs Indexer automates documentation processing in four steps: it downloads pages, cleans the raw HTML, splits the content into semantic blocks, and enriches each block with metadata for later indexing.
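The clean, split and enrich steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses the standard-library html.parser as a stand-in for BeautifulSoup so that it runs without extra dependencies, it skips the download step, and the names clean_html, split_into_chunks, build_records, the max_chars limit and the metadata fields are all hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and drops script, style and nav content."""

    SKIP = {"script", "style", "nav"}
    BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "li", "pre", "section"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n\n")  # block elements start a new paragraph

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html):
    """Return the page text as paragraphs separated by blank lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse whitespace inside each paragraph and drop empty ones.
    paragraphs = [" ".join(p.split()) for p in raw.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)


def split_into_chunks(text, max_chars=500):
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole (assumption).
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def build_records(url, html, max_chars=500):
    """Attach simple, hypothetical metadata to every chunk of one page."""
    text = clean_html(html)
    return [
        {"url": url, "chunk_index": i, "char_count": len(chunk), "text": chunk}
        for i, chunk in enumerate(split_into_chunks(text, max_chars))
    ]
```

Packing whole paragraphs keeps each chunk readable on its own, which is one simple approximation of "semantic" splitting; a real indexer might split on headings instead.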

Tech stack

Goal

Build a flexible documentation indexer for LLM/RAG processing.

How to run

python -m docs_indexer.main

# URLs file:
data/urls.txt

# Output:
data/output/pages.json
data/output/meta.json
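
The input and output files above could be handled along these lines. This is only a sketch under stated assumptions: the one-URL-per-line format with optional # comments, the function names read_urls and write_outputs, and the JSON layout of pages.json and meta.json are all hypothetical, since the page does not describe the real schema.

```python
import json
from pathlib import Path


def read_urls(path):
    """Return URLs from a text file, one per line (assumed format).

    Blank lines and lines starting with '#' are ignored.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


def write_outputs(records, output_dir):
    """Write pages.json and meta.json into output_dir.

    The schema here is illustrative only: pages.json holds the text per
    URL, meta.json holds a small summary.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    pages = [{"url": r["url"], "text": r["text"]} for r in records]
    meta = {"page_count": len(pages), "urls": [r["url"] for r in records]}

    pages_path = out / "pages.json"
    meta_path = out / "meta.json"
    pages_path.write_text(
        json.dumps(pages, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    meta_path.write_text(
        json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return pages_path, meta_path
```

With the paths from this section, a call might look like write_outputs(records, "data/output") after urls = read_urls("data/urls.txt").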

What this project shows about me