From Chaos to Pipeline: Refactoring 5000 Articles Out of a Notion Fortress

I needed 5,000+ Chinese articles out of a Notion workspace. No API access. No export button. The workspace belonged to someone else — I was a paying subscriber with view-only browser access.

Over the past few months, I built extraction scripts. Lots of them. By the time I looked at my infra/ folder last week, I had 21 scripts, 19 JSON files, 4 Chrome profiles with cryptic names, and no idea which script to run for what.

It worked — kind of. I'd extracted about 80% of the articles. But every time I needed to grab a new batch, I'd spend 20 minutes just figuring out which script was the "right" one.

What Went Wrong (And Why It Always Does)

The chaos did not come from being disorganized. It came from the fact that extraction is inherently iterative, and each iteration leaves artifacts behind.

Here is what the natural progression looks like:

v1: Simple script. Hard-coded dates. Works for January 2024.
v2: Copy v1, change the dates. Works for February.
v3: Add some logic to handle edge cases discovered in March.
v4: New approach entirely because v3 broke on a different year.
v5-v7: Each adds one feature (date picker, scrolling, save-server).
Suddenly: 7 scripts, each slightly different, no single source of truth.

Sound familiar? The same pattern plagues most data extraction projects. The scripts pile up, but nobody consolidates them.

The root cause: I was solving each month's problem independently instead of building a system that handles all months.

The Refactoring: One CLI, One DB, Parallel Instances

The fix was straightforward once I saw the pattern. I needed three things:

1. Single Entry Point

All 21 scripts became one: bsxf.js

node bsxf.js status                           # Where are we?
node bsxf.js scrape --year 2025               # Phase 1: collect URLs
node bsxf.js fetch --year 2024 --year 2021    # Phase 2: download articles
node bsxf.js retry --year 2024                # Fix failures

A handful of named commands. That is it. No more "which script do I run?"
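
A minimal sketch of how a single entry point like this could dispatch its commands, assuming Node 18.3 or later for util.parseArgs. The handler bodies and the run function are placeholders for illustration, not the actual contents of bsxf.js.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical command table. Each handler receives the requested years
// and returns a description of what it would do.
const commands = {
  status: (years) => `status for ${years.join(', ') || 'all years'}`,
  scrape: (years) => `collect URLs for ${years.join(', ')}`,
  fetch: (years) => `download articles for ${years.join(', ')}`,
  retry: (years) => `retry failures for ${years.join(', ')}`,
};

function run(argv) {
  const { values, positionals } = parseArgs({
    args: argv,
    // multiple: true lets --year be repeated: --year 2024 --year 2021
    options: { year: { type: 'string', multiple: true } },
    allowPositionals: true,
  });
  const [name] = positionals;
  if (!Object.hasOwn(commands, name)) {
    const known = Object.keys(commands).join(', ');
    throw new Error(`Unknown command "${name}". Try one of: ${known}`);
  }
  const years = (values.year ?? []).map(Number);
  return commands[name](years);
}
```

In a real script the last line of the file would be something like run(process.argv.slice(2)), so that adding a new year is an argument and not a new file.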

2. State in SQLite, Not in My Head

Before: "I think I already got January... let me check the JSON... wait, which JSON?"

After: one database tracks every article — block ID, date, title, fetch status, error message, run history. node bsxf.js status gives me the full picture in 2 seconds.

In the status output, an X means missing and a ! means partial, and the DB tells me exactly which articles failed and why.
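
To make the X and ! markers concrete, here is a rough sketch of how a status view could be derived from article rows. It assumes the markers are computed per month, that rows carry an ISO date string, and that a fully fetched article has fetch_status 'done'. Those three details and the '.' symbol for a complete month are guesses for illustration; only the fetch_status column and its 'pending' and 'failed' values come from the article.

```javascript
// Decide the marker for one month's rows.
function monthSymbol(rows) {
  const done = rows.filter((r) => r.fetch_status === 'done').length; // 'done' is assumed
  if (done === 0) return 'X';          // nothing fetched: missing
  if (done < rows.length) return '!';  // some fetched, some pending or failed: partial
  return '.';                          // hypothetical symbol for a complete month
}

// Group rows by 'YYYY-MM' and return one line per month, sorted by date.
function statusGrid(rows) {
  const byMonth = new Map();
  for (const row of rows) {
    const month = row.date.slice(0, 7);
    if (!byMonth.has(month)) byMonth.set(month, []);
    byMonth.get(month).push(row);
  }
  return [...byMonth.keys()]
    .sort()
    .map((month) => `${month} ${monthSymbol(byMonth.get(month))}`);
}
```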

3. Parallel Without Collision

The breakthrough was stupid-simple: one Chrome profile per year.

Each year gets its own Playwright browser context. No session conflicts. No cookie collisions. I can fetch 2024, 2021, and 2025 simultaneously from one command.

JavaScript's Promise.all starts them all at once and waits for every year to finish. Three Chrome windows pop up, each cruising through its year's articles independently. The terminal output interleaves with [2024], [2021], [2025] prefixes so I can tell them apart.
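
The fan-out can be sketched roughly as follows. The fetchYears name, the worker signature, and the result shape are assumptions for illustration. In the real tool the worker would presumably open a Playwright persistent context on that year's profile directory; here it is passed in as a function so the sketch stays self-contained.

```javascript
// Run one worker per year concurrently, prefixing every log line with the year.
// fetchYear(year, log) is a hypothetical worker that resolves to an article count.
async function fetchYears(years, fetchYear, write = console.log) {
  const jobs = years.map(async (year) => {
    const log = (msg) => write(`[${year}] ${msg}`);
    try {
      const count = await fetchYear(year, log);
      return { year, ok: true, count };
    } catch (err) {
      // Catch per year so one failing year does not reject the whole Promise.all.
      log(`failed: ${err.message}`);
      return { year, ok: false, error: err.message };
    }
  });
  return Promise.all(jobs);
}
```

Catching inside each job is a design choice: a bare Promise.all rejects as soon as any one promise rejects, which would hide the results of the years that succeeded.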

What Made It Difficult

Notion fights automation. No API for this workspace. Headless browsers get blocked. The DOM virtualizes — only ~40 rows render at a time, so you cannot just scroll and scrape.

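The solution was a collapse/expand strategy: collapse all month groups, then expand one at a time, extract its articles, collapse it, move to the next. As long as a single month stays under the render limit, every row in the expanded group is in the DOM at once, so the virtualization ceiling never comes into play.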
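
The control flow of that strategy, reduced to a sketch. The ops object and its five methods (collapseAll, listGroups, expand, extractVisible, collapse) are hypothetical stand-ins for whatever Playwright calls actually drive the Notion page. The selectors are not known from this article, so they are left out.

```javascript
// Walk the groups one at a time so that only one group is ever expanded.
// ops is a hypothetical adapter around the page; every method is async.
async function extractByGroup(ops) {
  await ops.collapseAll();
  const seen = new Map(); // keyed by article id so re-runs do not duplicate
  for (const group of await ops.listGroups()) {
    await ops.expand(group);
    try {
      for (const article of await ops.extractVisible(group)) {
        seen.set(article.id, { ...article, group });
      }
    } finally {
      // Collapse even if extraction throws, so the next group starts clean.
      await ops.collapse(group);
    }
  }
  return [...seen.values()];
}
```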

Date filters are React inputs, not normal HTML forms. You cannot just set input.value — React ignores it. You have to simulate actual keystrokes with pressSequentially() and click in the right order (end date first, then start date, because Notion auto-adjusts ranges).
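
A sketch of that ordering, assuming the two date fields are already available as Playwright locators. click, clear, pressSequentially, and press are real Locator methods. The setDateRange name, the clear step, the Enter keypress to commit, and the date string format are assumptions about what Notion needs.

```javascript
// Fill the end date first, then the start date, typing real keystrokes
// so that React's change handlers fire.
async function setDateRange(startInput, endInput, start, end, delay = 50) {
  const steps = [
    [endInput, end],     // end date first: Notion auto-adjusts ranges
    [startInput, start],
  ];
  for (const [input, value] of steps) {
    await input.click();
    await input.clear();                             // assumed: remove the old value
    await input.pressSequentially(value, { delay }); // one key event per character
    await input.press('Enter');                      // assumed: commits the value
  }
}
```

Usage would look something like setDateRange(page.locator(startSelector), page.locator(endSelector), '2024/01/01', '2024/01/31'), with both selectors and the date format adjusted to whatever the page expects.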

Same-title articles exist. The author sometimes publishes two articles on the same day with the exact same title. The filesystem can only save one YYMMDD Title.md. Fix: detect the collision, append (2) to the filename.
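
One way the collision check could look, as a minimal sketch. The uniqueArticlePath name is made up, and placing the counter before the .md extension is an assumption. The loop keeps counting past (2) in case a third article shares the title.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Return a path in dir that does not exist yet: "Name.md", then "Name (2).md",
// then "Name (3).md", and so on.
function uniqueArticlePath(dir, baseName) {
  let candidate = path.join(dir, `${baseName}.md`);
  for (let n = 2; fs.existsSync(candidate); n += 1) {
    candidate = path.join(dir, `${baseName} (${n}).md`);
  }
  return candidate;
}
```

Checking for existence and then writing is not atomic, so this sketch assumes that no two workers write the same date and title at the same moment. One worker per year makes that a reasonable assumption here.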

What I Actually Learned

Scripts Are Debt, CLIs Are Assets

A one-off script solves today's problem. A CLI with named commands solves the category of problems. The marginal cost of adding --year 2025 to an existing CLI is near zero. The cost of writing scraper-v8-2025.js from scratch is hours.

State Management Makes Everything Easier

The moment I added SQLite, three problems vanished:

"Did I already fetch this?" — Query the DB
"What failed?" — WHERE fetch_status = 'failed'
"What is left?" — WHERE fetch_status = 'pending'

I was tracking this in my head before. My head is not a database.

Parallel = Profile Isolation, Not Thread Safety

I did not need mutexes or message queues. Just separate Chrome profiles. The browser sessions are completely independent: different cookies, different local storage, different everything. The only shared resource is the SQLite database. All the years run inside one Node process, and better-sqlite3 calls are synchronous, so each write finishes before the next one starts.

Archive, Do Not Delete

All 21 old scripts went into archive/. I did not delete them. They contain institutional knowledge — edge cases, workarounds, and Notion DOM quirks that I might need again. Storage is cheap; rediscovery is expensive.

Where This Pattern Applies Next

This was not just about articles. The pattern — single CLI + state DB + parallel profiles — works for any gated-platform extraction:

WeChat Official Account archives — same session-gating problem
Paid course platforms — video/document extraction with login requirements
Any Notion workspace — the collapse/expand strategy is reusable
Multi-account scraping — one profile per account, parallel fetch

The key insight: if you are building extraction script number 3 for the same platform, stop. Consolidate into a CLI. You will thank yourself by script number 5.

From 21 scripts and 80% coverage to 1 CLI and 97% coverage, in one session. Sometimes the best engineering is the engineering you do on your own tools.

Adrian Gan is the CEO of Mipos Sdn Bhd, and spends an unreasonable amount of time building systems to extract wisdom from Chinese internet authors.