zstmfhy/docs-to-notebooklm
Overview
This skill automates bulk scraping of technical documentation sites and synchronizes the results to Google NotebookLM. It supports common static doc frameworks (VitePress, Docusaurus, GitBook, VuePress) and handles dynamic pages via Playwright. The tool extracts navigation links, converts pages to Markdown, and uploads the documents in batches that respect NotebookLM's per-notebook source limit.
How this skill works
The skill crawls a documentation site starting from a root URL, extracts sidebar/navigation links, and saves them as structured link files with progress metadata. Playwright loads each page, including JavaScript-rendered content; the page is then stripped of site chrome (navigation and other non-content elements) and converted to Markdown. The Markdown files are uploaded to NotebookLM using the CLI, with automatic batching (max 50 sources per notebook), progress tracking, and retry lists for failures.
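The link-extraction step might look roughly like the sketch below. It is an illustration only, not the skill's actual extractor: it assumes the sidebar lives inside `<nav>` or `<aside>` elements, uses only the Python standard library, and the names `NavLinkParser` and `extract_nav_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class NavLinkParser(HTMLParser):
    """Collect hrefs from <a> tags nested inside <nav> or <aside> elements."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.depth = 0  # number of nav/aside containers currently open
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("nav", "aside"):
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop #fragments so each page appears once.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                if absolute not in self.links:
                    self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag in ("nav", "aside") and self.depth > 0:
            self.depth -= 1


def extract_nav_links(html: str, base_url: str) -> list[str]:
    """Return de-duplicated absolute URLs found in the page's navigation."""
    parser = NavLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup: two sidebar pages plus one link outside the sidebar.
sample = """
<aside><a href="/guide/">Guide</a><a href="/guide/#intro">Intro</a>
<a href="api.html">API</a></aside>
<main><a href="https://example.com/elsewhere">Not in the sidebar</a></main>
"""
print(extract_nav_links(sample, "https://docs.example.com/"))
# → ['https://docs.example.com/guide/', 'https://docs.example.com/api.html']
```

Real frameworks differ in their sidebar markup, so a production extractor would likely need per-framework selectors and would run against the HTML returned by Playwright rather than a static string.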
When to use it
- You need to import an entire documentation site into NotebookLM for AI analysis or search.
- You maintain internal docs behind authentication and want periodic incremental syncs.
- You want automated, repeatable exports from VitePress/Docusaurus/GitBook/VuePress sites.
- You have many files and need automatic batching to respect NotebookLM source limits.
- You require resumable downloads/uploads to avoid rework after interruptions.
Best practices
- Set a conservative delay (1–2s) between page loads to avoid rate limits and incomplete JS rendering.
- Run Playwright in headed mode (headless disabled) during initial debugging so you can watch pages render and spot rendering issues.
- Use the progress JSON files to implement incremental syncs and safe retries instead of re-downloading everything.
- Limit concurrency when downloading large sites to reduce memory and network spikes.
- Inspect _failed_uploads.txt and re-run uploads only for failed items to save time.
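The delay and incremental-sync practices above can be combined in one loop. The sketch below is a minimal illustration under stated assumptions: the progress file is assumed to be JSON of the form `{"done": [...]}` (the skill's real schema is not documented here), and `download_pending` and the `fetch` callback are hypothetical names standing in for whatever actually loads and saves a page.

```python
import json
import time
from pathlib import Path
from typing import Callable


def download_pending(
    urls: list[str],
    progress_path: Path,
    fetch: Callable[[str], None],
    delay_seconds: float = 1.5,
) -> list[str]:
    """Fetch only URLs not yet marked done; return the URLs that failed this run."""
    done: set[str] = set()
    if progress_path.exists():
        # Assumed schema: {"done": ["url1", "url2", ...]}
        done = set(json.loads(progress_path.read_text())["done"])

    failed: list[str] = []
    for url in urls:
        if url in done:
            continue  # already downloaded in an earlier run
        try:
            fetch(url)
            done.add(url)
        except Exception:
            failed.append(url)
        # Persist after every page so an interruption loses at most one item.
        progress_path.write_text(json.dumps({"done": sorted(done)}, indent=2))
        time.sleep(delay_seconds)  # conservative pause between page loads
    return failed
```

Running the function a second time with the same progress file skips everything already recorded and retries only the pages that failed or were never reached.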
Example use cases
- Migrate a public VitePress API reference into NotebookLM for team Q&A and embeddings.
- Periodically sync protected internal docs (using cookie auth) to a private NotebookLM notebook for searchable knowledge.
- Bulk-import a large product docs site; for example, the tool automatically splits 120 files across multiple notebooks of at most 50 sources each.
- Convert a GitBook site to a Markdown archive and then upload it in batches for vector search experiments.
- Resume a stalled crawl: use extract_progress.json and download_progress.json to continue where it left off.
FAQ
What happens if I have more than 50 files?
The skill automatically creates multiple notebooks. It names them sequentially (e.g., "Name", "Name (2)") and distributes the files so that no notebook holds more than 50 sources.
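As a rough illustration of that batching and naming scheme, here is a minimal sketch. It assumes notebooks are filled in order (the first 50 files go to the first notebook, and so on), which the description above does not confirm; `plan_notebooks` is a hypothetical helper, not part of the skill.

```python
def plan_notebooks(
    files: list[str], base_name: str, max_sources: int = 50
) -> dict[str, list[str]]:
    """Map notebook names to batches of at most max_sources files each."""
    plan: dict[str, list[str]] = {}
    for start in range(0, len(files), max_sources):
        index = start // max_sources + 1
        # First notebook keeps the plain name; later ones get " (2)", " (3)", ...
        name = base_name if index == 1 else f"{base_name} ({index})"
        plan[name] = files[start:start + max_sources]
    return plan


# Hypothetical input: 120 Markdown files.
files = [f"page_{n:03d}.md" for n in range(120)]
for name, batch in plan_notebooks(files, "Product Docs").items():
    print(name, len(batch))
# Product Docs 50
# Product Docs (2) 50
# Product Docs (3) 20
```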
Can it handle sites requiring login?
Yes. You can pass authentication cookies to the extractor so Playwright loads authenticated pages; subsequent downloads and uploads will use the same saved link set.
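One way cookie authentication could be wired into Playwright is sketched below. How the skill itself accepts cookies is not specified here, so the `"name=value; name2=value2"` input format and both function names are assumptions; the Playwright calls used (`sync_playwright`, `new_context`, `add_cookies`, `goto`, `content`) are part of its Python sync API.

```python
from urllib.parse import urlparse


def cookies_for_playwright(cookie_header: str, site_url: str) -> list[dict]:
    """Turn 'a=1; b=2' into the list of dicts that BrowserContext.add_cookies accepts."""
    domain = urlparse(site_url).hostname
    cookies = []
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if name and sep:
            # Playwright needs either "url" or both "domain" and "path" per cookie.
            cookies.append({"name": name, "value": value, "domain": domain, "path": "/"})
    return cookies


def fetch_authenticated_html(url: str, cookie_header: str) -> str:
    """Load a protected page with the given cookies and return its rendered HTML."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        context.add_cookies(cookies_for_playwright(cookie_header, url))
        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```

Session cookies expire, so a periodic sync would need a fresh cookie value each time the old one lapses.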
How do I recover from failed uploads?
Failed file paths are recorded in _failed_uploads.txt. Re-run the upload script pointing at those files or the same directory; the tool supports retries and continues progress tracking.
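A retry pass over the failure list could be sketched as follows. This assumes `_failed_uploads.txt` holds one file path per line, which is a guess at the format; `retry_failed_uploads` is a hypothetical helper, and `upload` is a placeholder for whatever actually sends a file to NotebookLM.

```python
from pathlib import Path
from typing import Callable


def retry_failed_uploads(
    failed_list: Path, upload: Callable[[Path], None]
) -> list[Path]:
    """Re-upload each listed path; rewrite the list with whatever still fails."""
    if not failed_list.exists():
        return []
    # Assumed format: one file path per line, blank lines ignored.
    paths = [
        Path(line.strip())
        for line in failed_list.read_text().splitlines()
        if line.strip()
    ]
    still_failing: list[Path] = []
    for path in paths:
        try:
            upload(path)
        except Exception:
            still_failing.append(path)
    # An empty file after this call means every retry succeeded.
    failed_list.write_text("".join(f"{p}\n" for p in still_failing))
    return still_failing
```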