conveyor · PR #166
🐛 fix · run statsmerged → main

A counter that counted
past the finish line

Why a scraping run could report 602 successes out of 600 tasks — and the one-line-of-thinking fix that stops it.

→ / space / click to advance  ·  ← to go back

The symptom
successful
602
/
total
600

Production runs occasionally showed successful > total. Physically impossible — a run has 600 tasks, it can't succeed 602 times.

The number wasn't wrong data. It was the same task counted twice.

Where the counter lives

One task → one +1 on the run row

customer gateway conveyor worker scrape URL
when a task finishes, the worker does:
UpdateRunStats(completed += 1, successful += 1)

A bare, commutative ADD on DynamoDB. Fast, lock-free — and assumes each task settles exactly once.

The crack · SQS at-least-once

A message can be delivered twice

received
⟳ redelivered — worker still busy!
worker A picks it uptimeout expires mid-scrape →worker B picks up the SAME message

If a scrape takes longer than the visibility timeout (slow target site), SQS assumes A died and hands the message to B. Now two workers hold the same task.

The race · before the fix

Both workers settle. Both add +1.

1A  reads task → pending · not terminal → proceed
2B  reads task → processing · "resuming" → proceed
3A  scrape ok → PutTask(successful)ADD +1
4B  scrape ok → PutTask(successful)ADD +1
successful
2
for
one task
1
Why the existing guard missed it

The check and the write weren't atomic

// HandleMessage — the "guard" that was supposed to dedupe
switch task.Status {
case Pending:                 // first delivery  → proceed
case Processing:              // redelivery      → proceed (!)
case Successful, Failed:      // already done    → drop ✓
    return nil
}
                    ▲ read here …
        … gateway call (slow) …
                    ▼ write terminal + ADD here — nothing held in between

The terminal drop only catches a sequential redelivery (A fully done before B reads). Two concurrent deliveries both read "not terminal" and sail through.

The fix · one idea

Make the write itself
the gatekeeper.

Instead of "read status, then blindly overwrite", the terminal write becomes a conditional write: "set me to successful — but only if I'm still processing."

// new store primitive
PutTaskIfStatus(task, allowed=[processing]) → ErrConflict if not

Evaluated server-side inside the single DynamoDB PutItem. No version field, no extra read.

The race · after the fix

Only the CAS winner counts

1A  PutTaskIfStatus(processing→successful)  WINS → ADD +1
2B  PutTaskIfStatus(processing→successful)  ErrConflict (status already successful)
3B  sees conflict → returns without the ADD
successful
1
=
one task
1

Both still make one gateway call — that's inherent to at-least-once. Only the counting is now exactly-once.

Two guard points, same primitive

Claim and settle are both conditional

① Claim

PutTaskIfStatus(
  → processing,
  allowed=[pending,processing])

Stops a duplicate from reverting an already-terminal row back to processing — which would re-open the whole window.

② Settle

PutTaskIfStatus(
  → successful / failed,
  allowed=[processing])

Gates the ADD (and the retry re-enqueue) on winning. Exactly one of N duplicates counts.

Alternatives considered

Why a status CAS — not the other tools

Edge cases covered

What else the CAS quietly fixes

Reproduced, then regression-locked

Real worker · DynamoDB Local · 5× concurrent dup

▲ before

total       = 1
completed   = 5
successful  = 5
→ FAIL: 5 > 1

▼ after

total       = 1
completed   = 1
successful  = 1
→ PASS ✓

go test -tags=integration -race -run DoubleCount  ·  success + failure paths  ·  full suite green

The bug wasn't the ADD.
It was who was allowed to run it.

4 files +98 lines no version field no extra read exactly-once counting

github.com/ZenRows/conveyor · PR #166 · branch fix/cor-stats-double-count

← → · space · click