🐛 fix · run statsmerged → main

A counter that counted
past the finish line

Why a scraping run could report 602 successes out of 600 tasks — and the one-line-of-thinking fix that stops it.

→ / space / click to advance · ← to go back

The symptom

successful

602

/

total

600

Production runs occasionally showed successful > total. Physically impossible — a run has 600 tasks, it can't succeed 602 times.

The number wasn't wrong data. It was the same task counted twice.

Where the counter lives

One task → one `+1` on the run row

customer→ gateway→ conveyor worker→ scrape URL

when a task finishes, the worker does:

UpdateRunStats(completed += 1, successful += 1)

A bare, commutative ADD on DynamoDB. Fast, lock-free — and assumes each task settles exactly once.

The crack · SQS at-least-once

A message can be delivered twice

received

⟳ redelivered — worker still busy!

worker A picks it uptimeout expires mid-scrape →worker B picks up the SAME message

If a scrape takes longer than the visibility timeout (slow target site), SQS assumes A died and hands the message to B. Now two workers hold the same task.

The race · before the fix

Both workers settle. Both add `+1`.

1A reads task → pending · not terminal → proceed

2B reads task → processing · "resuming" → proceed

3A scrape ok → PutTask(successful) → ADD +1

4B scrape ok → PutTask(successful) → ADD +1

successful

2

for

one task

1

Why the existing guard missed it

The check and the write weren't atomic

// HandleMessage — the "guard" that was supposed to dedupe
switch task.Status {
case Pending:                 // first delivery  → proceed
case Processing:              // redelivery      → proceed (!)
case Successful, Failed:      // already done    → drop ✓
    return nil
}
                    ▲ read here …
        … gateway call (slow) …
                    ▼ write terminal + ADD here — nothing held in between

The terminal drop only catches a sequential redelivery (A fully done before B reads). Two concurrent deliveries both read "not terminal" and sail through.

The fix · one idea

Make the write itself
the gatekeeper.

Instead of "read status, then blindly overwrite", the terminal write becomes a conditional write: "set me to successful — but only if I'm still processing."

// new store primitive
PutTaskIfStatus(task, allowed=[processing]) → ErrConflict if not

Evaluated server-side inside the single DynamoDB PutItem. No version field, no extra read.

The race · after the fix

Only the CAS winner counts

1A PutTaskIfStatus(processing→successful) WINS → ADD +1

2B PutTaskIfStatus(processing→successful) ErrConflict (status already successful)

3B sees conflict → returns without the ADD ✓

successful

1

=

one task

1

Both still make one gateway call — that's inherent to at-least-once. Only the counting is now exactly-once.

Two guard points, same primitive

Claim and settle are both conditional

① Claim

PutTaskIfStatus(
→ processing,
allowed=[pending,processing])

Stops a duplicate from reverting an already-terminal row back to processing — which would re-open the whole window.

② Settle

PutTaskIfStatus(
→ successful / failed,
allowed=[processing])

Gates the ADD (and the retry re-enqueue) on winning. Exactly one of N duplicates counts.

Alternatives considered

Why a status CAS — not the other tools

✗
Version field on the task row. Optimistic lock ⇒ a read-then-write on every settle. The spec explicitly rejected this cost; task rows stay unversioned (result keys are deterministic anyway).
✗
SQS FIFO / content-dedup queue. Dedup window is 5 min; a slow scrape outlives it. Also loses per-run queue throughput.
✗
Separate idempotency table. Another row + TTL to maintain, for a fact the task row already knows: its own status.
✓
Conditional PutItem on status. Zero extra reads, zero new rows — the write we already do simply refuses to run twice.

Edge cases covered

What else the CAS quietly fixes

↻
Retry path. A duplicate that hits a transient failure won't re-enqueue a second attempt or double-count spend — it loses the same CAS.
⟲
Terminal-row revert. B can no longer clobber A's successful row back to processing and lose the result key.
=
Sequential redelivery. Still handled by the original terminal-status drop — the CAS is purely the concurrent-window backstop.
∑
Run completion. Already safe — it used a versioned CAS. Only the counters were unguarded.

Reproduced, then regression-locked

Real worker · DynamoDB Local · 5× concurrent dup

▲ before

total       = 1
completed   = 5
successful  = 5
→ FAIL: 5 > 1

▼ after

total       = 1
completed   = 1
successful  = 1
→ PASS ✓

go test -tags=integration -race -run DoubleCount · success + failure paths · full suite green

The bug wasn't the ADD.
It was who was allowed to run it.

4 files +98 lines no version field no extra read exactly-once counting

github.com/ZenRows/conveyor · PR #166 · branch fix/cor-stats-double-count

A counter that countedpast the finish line

One task → one +1 on the run row

A message can be delivered twice

Both workers settle. Both add +1.

The check and the write weren't atomic

Make the write itselfthe gatekeeper.

Only the CAS winner counts

Claim and settle are both conditional

① Claim

② Settle

Why a status CAS — not the other tools

What else the CAS quietly fixes

Real worker · DynamoDB Local · 5× concurrent dup

▲ before

▼ after

The bug wasn't the ADD.It was who was allowed to run it.

A counter that counted
past the finish line

One task → one `+1` on the run row

Both workers settle. Both add `+1`.

Make the write itself
the gatekeeper.

The bug wasn't the ADD.
It was who was allowed to run it.