Docs Indexer — Python documentation indexing tool

Project overview

Docs Indexer automates documentation processing: it downloads pages, cleans the raw HTML, splits the content into semantic chunks, and enriches each chunk with metadata for downstream indexing.

📥 Downloads documentation pages from URLs.
🧹 Strips unnecessary elements from the HTML.
✂️ Splits content into semantic chunks.
🧩 Adds metadata (title, url, section, index).
📦 Exports to JSON/Markdown for LLM pipelines.
⚙️ Run via CLI: python -m docs_indexer
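
The splitting and metadata steps listed above could look roughly like the sketch below. It assumes the cleaned content is Markdown-like text and that a chunk boundary is any heading line; the project's actual chunking rule is not described here, so the Chunk class and split_into_chunks function are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str    # page title
    url: str      # source URL
    section: str  # heading the chunk falls under
    index: int    # position of the chunk within the page
    text: str     # chunk body


def split_into_chunks(text: str, title: str, url: str) -> list[Chunk]:
    """Split Markdown-like text on heading lines and attach metadata (assumed strategy)."""
    chunks: list[Chunk] = []
    section = title  # text before the first heading is filed under the page title
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk, skipping empty blocks.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(title, url, section, len(chunks), body))
        buffer.clear()

    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the previous section before starting a new one
            section = match.group(1).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Calling split_into_chunks(page_text, "Docs", "https://example.com/docs") returns one Chunk per non-empty section, each carrying the four metadata fields named in the list (title, url, section, index).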

Tech stack

Python 3.12
Requests for downloading HTML
BeautifulSoup for parsing
lxml / re for cleaning
CLI runner via python -m
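
To make the cleaning step concrete, here is a minimal sketch. It uses the standard library's html.parser as a dependency-free stand-in for BeautifulSoup, so it is not the project's own code; the SKIP_TAGS list, the TextExtractor class and the clean_html function are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose content is dropped entirely (an assumed list, not the project's own).
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text while skipping everything inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0          # > 0 while inside a skipped element
        self._parts: list[str] = []   # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML page, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With BeautifulSoup the same idea is usually expressed by calling decompose() on the unwanted tags and then get_text() on what remains; the stdlib version is shown only so the sketch runs without extra installs.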

Goal

Build a flexible documentation indexer for LLM/RAG processing.

Type: Dev tool / automation
Role: Python Developer (solo)

How to run

python -m docs_indexer.main

# URLs file:
data/urls.txt

# Output:
data/output/pages.json
data/output/meta.json
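
As an illustration of the input and output files named above, the sketch below reads a URL list and writes a JSON file. The file formats are assumptions: one URL per line with blank lines and "#" comments ignored, and JSON written as UTF-8 with indentation. read_urls and write_json are hypothetical helper names, not functions confirmed to exist in the project.

```python
import json
from pathlib import Path


def read_urls(path: Path) -> list[str]:
    """Return the URLs in the file, skipping blank lines and '#' comments (assumed format)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]


def write_json(path: Path, payload) -> None:
    """Write payload as pretty-printed UTF-8 JSON, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
```

A run could then call read_urls(Path("data/urls.txt")), process each URL, and pass the results to write_json(Path("data/output/pages.json"), pages).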

What this project shows about me

I can build tools for preparing data for LLM/RAG pipelines.
I understand HTML parsing, scraping and content processing.
I design practical Python CLI utilities.
I structure code cleanly and modularly.