A Python utility that downloads documentation pages, cleans them, splits content into semantic chunks, attaches metadata and exports everything into a structured format for LLM and RAG systems.
Docs Indexer automates documentation processing by downloading pages, cleaning the raw HTML, splitting content into semantic blocks and enriching it with metadata for further indexing.
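The cleaning and splitting steps described above can be illustrated with a minimal sketch. It is not the actual Docs Indexer implementation: the names `_TextExtractor` and `chunk_page`, the set of skipped tags, the choice to split on `h1`-`h3` headings, and the metadata fields are all assumptions made for illustration, using only the Python standard library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text grouped by the preceding heading."""

    SKIP = {"script", "style", "nav", "footer"}   # assumed "noise" tags
    HEADINGS = {"h1", "h2", "h3"}                 # assumed chunk boundaries

    def __init__(self):
        super().__init__()
        self.blocks = []            # finished blocks: [heading, [text parts]]
        self._current = ["", []]    # block being built
        self._skip_depth = 0
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()           # a new heading starts a new block
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse runs of whitespace
        if not text or self._skip_depth:
            return
        if self._in_heading:
            self._current[0] = (self._current[0] + " " + text).strip()
        else:
            self._current[1].append(text)

    def _flush(self):
        if self._current[0] or self._current[1]:
            self.blocks.append(self._current)
        self._current = ["", []]

    def close(self):
        super().close()
        self._flush()               # keep the last block


def chunk_page(url, html):
    """Clean one HTML page and return a list of chunk dicts with metadata."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    chunks = []
    for index, (heading, parts) in enumerate(parser.blocks):
        text = " ".join(parts)
        chunks.append({
            "url": url,             # metadata fields here are illustrative only
            "heading": heading,
            "index": index,
            "text": text,
            "chars": len(text),
        })
    return chunks
```

Splitting on headings is only one way to approximate "semantic" blocks; a real indexer might also cap chunk size or split on paragraphs.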
Build a flexible documentation indexer for LLM/RAG processing.
# Run:
python -m docs_indexer.main
# URLs file:
data/urls.txt
# Output:
data/output/pages.json
data/output/meta.json
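
As a rough illustration of how the input and output files listed above might be handled, the sketch below reads a URLs file and writes `pages.json` and `meta.json`. The function names `read_urls` and `write_output`, the one-URL-per-line format with `#` comment lines, and the JSON layout are assumptions for illustration; the actual schema written by Docs Indexer may differ.

```python
import json
from pathlib import Path


def read_urls(path):
    """Return unique, non-empty lines from a URLs file (assumed: one URL per line)."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if line and not line.startswith("#") and line not in urls:
            urls.append(line)
    return urls


def write_output(out_dir, pages):
    """Write pages.json and meta.json into out_dir and return the metadata dict."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical metadata layout: a page count plus the list of source URLs.
    meta = {"page_count": len(pages), "urls": [page["url"] for page in pages]}
    (out / "pages.json").write_text(
        json.dumps(pages, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    (out / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

With the paths above, this would be called as `read_urls("data/urls.txt")` and `write_output("data/output", pages)`, where `pages` is a list of dicts that each carry a `url` key.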