TODO / Roadmap
A living roadmap for improving the cart actor system. Focus areas:
- Reliability & correctness
- Simplicity of mutation & ownership flows
- Developer experience (DX)
- Operability (observability, tracing, metrics)
- Performance & scalability
- Security & multi-tenant readiness
1. Immediate Next Steps (High-Leverage)
| Priority | Task | Goal | Effort | Owner | Notes |
|---|---|---|---|---|---|
| P0 | Add mutation registry coverage test | Ensure no unregistered mutations silently fail | S | | Failing fast in CI |
| P0 | Add decodeJSON helper + 400 mapping for EOF | Reduce noisy 500 logs | S | | Improves client API clarity |
| P0 | Regenerate protos & prune unused messages (CreateCheckoutOrder, Checkout RPC remnants) | Eliminate dead types | S | | Avoid confusion |
| P0 | Add integration test: multi-node ownership negotiation | Validate quorum logic | M | | Spin up 2–3 nodes ephemeral |
| P1 | Export Prometheus metrics for per-mutation counts & latency | Operability | M | | Wrap registry handlers |
| P1 | Add graceful shutdown ordering (Closing → wait for acks → stop gRPC) | Reduce in-flight mutation failures | S | | Add context cancellation |
| P1 | Add coverage for InitializeCheckout / OrderCreated flows | Checkout reliability | S | | Simulate Klarna stub |
| P2 | Add optional batching client (apply multiple mutations locally then persist) | Performance | M | | Only if needed |
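
The decodeJSON task in the table above could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: the signature, the addItemPayload type, and the statusForBody demo helper are all assumptions. The idea is that an empty or truncated body (io.EOF, io.ErrUnexpectedEOF) and malformed JSON are client errors and should map to 400 rather than being logged as 500s.

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"net/http"
	"strings"
)

// addItemPayload is a hypothetical request body used only for this sketch.
type addItemPayload struct {
	Sku      string `json:"sku"`
	Quantity int    `json:"quantity"`
}

// decodeJSON (assumed signature) decodes the request body into dst and returns
// the HTTP status the handler should respond with.
func decodeJSON(r *http.Request, dst any) (int, error) {
	dec := json.NewDecoder(r.Body)
	dec.DisallowUnknownFields()
	err := dec.Decode(dst)
	if err == nil {
		return http.StatusOK, nil
	}

	var syntaxErr *json.SyntaxError
	var typeErr *json.UnmarshalTypeError
	switch {
	case errors.Is(err, io.EOF), errors.Is(err, io.ErrUnexpectedEOF):
		return http.StatusBadRequest, errors.New("request body is empty or truncated")
	case errors.As(err, &syntaxErr), errors.As(err, &typeErr):
		return http.StatusBadRequest, err
	case strings.HasPrefix(err.Error(), "json: unknown field"):
		// encoding/json has no typed error for unknown fields, so match the message.
		return http.StatusBadRequest, err
	default:
		// Anything else (for example a failed body read) stays a server error.
		return http.StatusInternalServerError, err
	}
}

// statusForBody is a demo helper that runs a raw body through decodeJSON.
func statusForBody(body string) int {
	req, err := http.NewRequest(http.MethodPost, "/cart/add", strings.NewReader(body))
	if err != nil {
		return 0
	}
	var payload addItemPayload
	status, _ := decodeJSON(req, &payload)
	return status
}
```

A handler would call decodeJSON first and write the returned status on error, so only genuine server-side failures reach the 500 path.
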
2. Simplification Opportunities
A. RemoteGrain Proxy Mapping
Current: manual switch building each RPC call.
Simplify by:
- Generating a thin client adapter from proto RPC descriptors (codegen).
- Or using a registry similar to mutation registry but for “outbound call constructors”.
Benefit: adding a new mutation = add proto + register server handler + register outbound invoker (no switch edits).
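
One way to picture the registry option is the sketch below. Every name in it (outboundRegistry, registerOutbound, dispatch, and the AddItem / RemoveItem stand-ins for generated proto messages) is hypothetical, and a real invoker would also take a context.Context and call the gRPC client; both are omitted here to keep the example short.

```go
package main

import (
	"fmt"
	"reflect"
)

// AddItem and RemoveItem stand in for generated proto messages (assumption).
type AddItem struct{ Sku string }
type RemoveItem struct{ Sku string }

// outboundInvoker sends one mutation payload to the remote owner of a cart.
type outboundInvoker func(cartID string, payload any) error

// outboundRegistry maps a payload's pointer type to its invoker.
type outboundRegistry map[reflect.Type]outboundInvoker

// registerOutbound stores a typed invoker under *T, replacing one switch case.
func registerOutbound[T any](r outboundRegistry, invoke func(cartID string, msg *T) error) {
	r[reflect.TypeOf((*T)(nil))] = func(cartID string, payload any) error {
		return invoke(cartID, payload.(*T))
	}
}

// dispatch looks up the invoker by the payload's dynamic type and fails
// loudly when nothing is registered.
func (r outboundRegistry) dispatch(cartID string, payload any) error {
	invoke, ok := r[reflect.TypeOf(payload)]
	if !ok {
		return fmt.Errorf("no outbound invoker registered for %T", payload)
	}
	return invoke(cartID, payload)
}
```

With this shape, a missing registration surfaces as an explicit error at dispatch time, which is the case a registry coverage test in CI would catch.
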
B. Ownership Negotiation
Current: ad hoc quorum rule in SyncedPool.
Simplify:
- Introduce explicit OwnershipLease{holder, expiresAt, version}.
- Use monotonic version increment—reject stale ConfirmOwner replies.
- Optional: add randomized backoff to reduce thundering herd on contested cart ids.
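
A minimal sketch of the lease idea follows, assuming the three fields named above. The acceptConfirm method, the error value, and the default TTL are hypothetical. The point is that a ConfirmOwner reply is applied only when it carries a strictly newer version.

```go
package main

import (
	"errors"
	"time"
)

// defaultLeaseTTL is an arbitrary value chosen for this sketch.
const defaultLeaseTTL = 30 * time.Second

// OwnershipLease mirrors the fields proposed above.
type OwnershipLease struct {
	Holder    string
	ExpiresAt time.Time
	Version   uint64
}

var errStaleLease = errors.New("stale ownership version")

// acceptConfirm applies a ConfirmOwner reply only if its version is strictly
// greater than the one already recorded; equal or older replies are rejected.
func (l *OwnershipLease) acceptConfirm(holder string, version uint64, ttl time.Duration) error {
	if version <= l.Version {
		return errStaleLease
	}
	l.Holder = holder
	l.Version = version
	l.ExpiresAt = time.Now().Add(ttl)
	return nil
}

// expired reports whether the lease deadline has passed.
func (l *OwnershipLease) expired() bool {
	return !time.Now().Before(l.ExpiresAt)
}
```

Rejecting equal versions as well as older ones means a delayed duplicate reply cannot silently extend or reassign a lease.
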
C. CartId Handling
Current: ephemeral 16-byte array with trimmed string semantics.
Simplify:
- Use ULID / UUIDv7 (time-ordered, collision-resistant) for easier external correlation.
- Provide a helper NewCartIdString() and keep the internal fixed-size representation if still desired.
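
As a rough illustration, a UUIDv7-style id still fits the existing 16-byte representation. The sketch below builds one with the standard library only. The CartId type shown here and the (string, error) return of NewCartIdString are assumptions, not the current code.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"time"
)

// CartId is assumed to remain a fixed 16-byte array.
type CartId [16]byte

// NewCartId fills the id UUIDv7-style: a 48-bit Unix millisecond timestamp,
// then random bits, with the version and variant fields set.
func NewCartId() (CartId, error) {
	var id CartId
	if _, err := rand.Read(id[:]); err != nil {
		return id, err
	}
	ms := uint64(time.Now().UnixMilli())
	id[0] = byte(ms >> 40)
	id[1] = byte(ms >> 32)
	id[2] = byte(ms >> 24)
	id[3] = byte(ms >> 16)
	id[4] = byte(ms >> 8)
	id[5] = byte(ms)
	id[6] = (id[6] & 0x0f) | 0x70 // version 7
	id[8] = (id[8] & 0x3f) | 0x80 // variant bits 10
	return id, nil
}

// String renders the canonical 8-4-4-4-12 hex form.
func (id CartId) String() string {
	var buf [36]byte
	hex.Encode(buf[0:8], id[0:4])
	buf[8] = '-'
	hex.Encode(buf[9:13], id[4:6])
	buf[13] = '-'
	hex.Encode(buf[14:18], id[6:8])
	buf[18] = '-'
	hex.Encode(buf[19:23], id[8:10])
	buf[23] = '-'
	hex.Encode(buf[24:36], id[10:16])
	return string(buf[:])
}

// NewCartIdString returns a fresh id in string form for external correlation.
func NewCartIdString() (string, error) {
	id, err := NewCartId()
	if err != nil {
		return "", err
	}
	return id.String(), nil
}
```

Because the timestamp occupies the leading bytes, ids created later sort after earlier ones at millisecond granularity, which helps when correlating carts in logs.
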
D. Mutation Signatures
Current: registry assumes func(*CartGrain, *T) error.
Extension option: allow pure transforms that return a delta struct (for audit/logging).
Implement this only if auditing or event sourcing is reintroduced.
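
To make the extension option concrete, here is one possible shape. It is a sketch under assumptions: the Delta struct, the simplified CartGrain and AddItem types, and the adapt helper are all invented for illustration. A pure transform computes a delta without touching the grain, and an adapter converts it back to the existing func(*CartGrain, *T) error signature.

```go
package main

import "errors"

// CartGrain and AddItem are simplified stand-ins for the real types.
type CartGrain struct{ Quantities map[string]int }
type AddItem struct {
	Sku      string
	Quantity int
}

// Delta is a hypothetical record of one change, suitable for audit logging.
type Delta struct {
	Sku           string
	Before, After int
}

// addItemDelta is a pure transform: it reads the grain and returns the delta.
func addItemDelta(g *CartGrain, msg *AddItem) (Delta, error) {
	if msg.Quantity <= 0 {
		return Delta{}, errors.New("quantity must be positive")
	}
	before := g.Quantities[msg.Sku]
	return Delta{Sku: msg.Sku, Before: before, After: before + msg.Quantity}, nil
}

// applyDelta is the only place that mutates the grain.
func applyDelta(g *CartGrain, d Delta) {
	if g.Quantities == nil {
		g.Quantities = map[string]int{}
	}
	g.Quantities[d.Sku] = d.After
}

// adapt wraps a pure transform so it matches the registry's current signature
// and optionally reports each applied delta to an audit callback.
func adapt[T any](m func(*CartGrain, *T) (Delta, error), audit func(Delta)) func(*CartGrain, *T) error {
	return func(g *CartGrain, msg *T) error {
		d, err := m(g, msg)
		if err != nil {
			return err
		}
		applyDelta(g, d)
		if audit != nil {
			audit(d)
		}
		return nil
	}
}
```
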
3. Developer Experience Improvements
| Task | Rationale | Approach |
|---|---|---|
| Makefile targets: make run-single, make run-multi N=3 | Faster local cluster spin-up | Docker compose or background “mini cluster” scripts |
| Template for new mutation (generator) | Reduce boilerplate | go:generate scanning proto for new RPCs |
| Lint config (golangci-lint) | Catch subtle issues early | Add .golangci.yml |
| Pre-commit hook for proto regeneration check | Avoid stale generated code | Script compares git diff after make protogen |
| Example client (Go + curl snippets auto-generated) | Onboarding | Codegen a markdown from proto comments |
4. Observability / Metrics / Tracing
| Area | Metric / Trace | Notes |
|---|---|---|
| Mutation registry | cart_mutations_total{type,success}; duration histogram | Wrap handler |
| Ownership negotiation | cart_ownership_attempts_total{result} | result=accepted,rejected,timeout |
| Remote latency | cart_remote_mutation_seconds{method} | Use client interceptors |
| Pings | cart_remote_missed_pings_total{host} | Already count, expose |
| Checkout flow | checkout_attempts_total, checkout_failures_total | Differentiate Klarna vs internal errors |
| Tracing | Span: HTTP handler → SyncedPool.Apply → (Remote?) gRPC → mutation handler | Add OpenTelemetry instrumentation |
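
The "wrap handler" note for the mutation registry can be sketched with the standard library alone. The in-memory mutationStats type below is a stand-in for a Prometheus counter and histogram, and all names (instrument, mutationStats, the simplified CartGrain and AddItem) are assumptions. Only the wrapping pattern is the point.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// CartGrain and AddItem are simplified stand-ins for the real types.
type CartGrain struct{}
type AddItem struct{ Quantity int }

// statKey corresponds to the {type,success} label pair.
type statKey struct {
	mutationType string
	success      bool
}

// mutationStats is an in-memory stand-in for a metrics backend.
type mutationStats struct {
	mu       sync.Mutex
	counts   map[statKey]int
	duration map[statKey]time.Duration
}

func newMutationStats() *mutationStats {
	return &mutationStats{counts: map[statKey]int{}, duration: map[statKey]time.Duration{}}
}

func (s *mutationStats) observe(mutationType string, success bool, d time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := statKey{mutationType, success}
	s.counts[k]++
	s.duration[k] += d
}

func (s *mutationStats) count(mutationType string, success bool) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[statKey{mutationType, success}]
}

// instrument wraps a handler so every call records its outcome and latency.
func instrument[T any](s *mutationStats, mutationType string, h func(*CartGrain, *T) error) func(*CartGrain, *T) error {
	return func(g *CartGrain, msg *T) error {
		start := time.Now()
		err := h(g, msg)
		s.observe(mutationType, err == nil, time.Since(start))
		return err
	}
}

// addItem is a sample handler that fails on a non-positive quantity.
func addItem(g *CartGrain, msg *AddItem) error {
	if msg.Quantity <= 0 {
		return errors.New("quantity must be positive")
	}
	return nil
}
```

Applying the wrapper once at registration time keeps individual handlers free of metrics code.
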
5. Performance & Scalability
| Concern | Idea | Trade-Off |
|---|---|---|
| High mutation rate on single cart | Introduce optional mutation queue (serialize explicitly) | Slight latency increase per op |
| Remote call overhead | Add client-side gRPC pooling & per-host circuit breaker | Complexity vs resilience |
| TTL purge efficiency | Use min-heap or timing wheel instead of slice scan | More code, better big-N performance |
| Batch network latency | Add BatchMutate RPC (list of mutations applied atomically) | Lost single-op simplicity |
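
For the TTL purge row, a min-heap keyed on expiry time lets a purge pass stop at the first entry that has not yet expired instead of scanning every cart. The sketch below uses container/heap. The expiry type, the schedule helper, and the use of Unix seconds are assumptions made for illustration.

```go
package main

import "container/heap"

// expiry pairs a cart id with its deadline in Unix seconds.
type expiry struct {
	cartID string
	at     int64
}

// expiryHeap implements heap.Interface, ordered by earliest deadline.
type expiryHeap []expiry

func (h expiryHeap) Len() int           { return len(h) }
func (h expiryHeap) Less(i, j int) bool { return h[i].at < h[j].at }
func (h expiryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x any)        { *h = append(*h, x.(expiry)) }
func (h *expiryHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// schedule records that cartID expires at the given Unix second.
func schedule(h *expiryHeap, cartID string, at int64) {
	heap.Push(h, expiry{cartID: cartID, at: at})
}

// popExpired removes and returns every cart whose deadline is at or before
// now, earliest first, and stops at the first entry still alive.
func popExpired(h *expiryHeap, now int64) []string {
	var ids []string
	for h.Len() > 0 && (*h)[0].at <= now {
		ids = append(ids, heap.Pop(h).(expiry).cartID)
	}
	return ids
}
```

This sketch does not handle a cart whose TTL is extended after it was scheduled. A real version would either update the entry in place with heap.Fix or re-check the cart's current deadline before purging.
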
6. Reliability Features
| Feature | Description | Priority |
|---|---|---|
| Lease fencing token | Include ownership_version in all remote mutate requests | M |
| Retry policy | Limited retry for transient network errors (idempotent mutations only) | L |
| Dead host reconciliation | On host removal, proactively attempt re-acquire of its carts | M |
| Drain mode | Node marks itself “draining” → refuses new ownership claims | M |
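
The retry policy row could be approximated as below. This is a sketch under assumptions: errTransient stands in for whatever classification of transient network errors the project adopts, and retryIdempotent is a hypothetical helper. Non-idempotent mutations get exactly one attempt, and non-transient errors are never retried.

```go
package main

import (
	"errors"
	"time"
)

// errTransient marks errors considered safe to retry (assumed classification).
var errTransient = errors.New("transient network error")

// retryIdempotent calls fn up to maxAttempts times when the mutation is
// idempotent and the failure is transient, doubling the backoff each time.
func retryIdempotent(idempotent bool, maxAttempts int, backoff time.Duration, fn func() error) error {
	attempts := 1
	if idempotent && maxAttempts > 1 {
		attempts = maxAttempts
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil || !errors.Is(err, errTransient) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(backoff << i)
		}
	}
	return err
}
```
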
7. Security & Hardening
| Area | Next Step | Detail |
|---|---|---|
| Transport | mTLS on gRPC | Use SPIFFE IDs or simple CA |
| AuthN/AuthZ | Interceptor enforcing service token | Inject metadata header |
| Input validation | Strengthen JSON decode responses | Disallow unknown fields globally |
| Rate limiting | Per-IP / per-cart throttling | Guard hotspot abuse |
| Multi-tenancy | Tenant id dimension in cart id or metadata | Partition metrics & ownership |
8. Testing Strategy Enhancements
| Gap | Improvement |
|---|---|
| No multi-node integration test in CI | Spin ephemeral in-process servers on randomized ports |
| Mutation regression | Table-driven tests auto-discover handlers via registry |
| Ownership race | Stress test: concurrent Apply on same new cart id from N goroutines |
| Checkout external dependency | Klarna mock server (httptest) + deterministic responses |
| Fuzzing | Fuzz BuildCheckoutOrderPayload & mutation handlers for panics |
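
The Klarna mock row could start from something like the following. The /orders path, the order_id field, and both function names are invented for this sketch and do not describe Klarna's real API. The stub simply returns a fixed response so checkout tests stay deterministic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newKlarnaStub starts a local server that answers every POST with the same
// order id. The response shape is hypothetical.
func newKlarnaStub(orderID string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]string{"order_id": orderID})
	}))
}

// createOrder is a stand-in for the checkout client code under test.
func createOrder(baseURL string) (string, error) {
	resp, err := http.Post(baseURL+"/orders", "application/json", strings.NewReader(`{}`))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var body struct {
		OrderID string `json:"order_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return "", err
	}
	return body.OrderID, nil
}
```

A test would point the checkout client at stub.URL and close the stub when done. Error paths can be covered by adding a second stub that always returns a failure status.
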
9. Cleanup / Tech Debt
| Item | Action |
|---|---|
| Remove deprecated proto remnants (CreateCheckoutOrder, Checkout RPC) | Delete & regenerate |
| Consolidate duplicate tax computations | Single helper with tax config |
| Delivery price hard-coded (4900) | Config or pricing strategy interface |
| Mixed naming (camel vs snake JSON historically) | Provide stable external API doc; accept old forms if needed |
| Manual remote mutation switch (if still present) | Replace with generated outbound registry |
| Mixed error responses (string bodies) | Standardize JSON: { "error": "...", "code": 400 } |
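
The standardized error body in the last row might be produced by a single helper, sketched below. The names writeError, errorBody, and renderError are assumptions. Only the JSON shape comes from the table.

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
)

// errorBody matches the proposed shape: { "error": "...", "code": 400 }.
type errorBody struct {
	Error string `json:"error"`
	Code  int    `json:"code"`
}

// writeError sends the status code and the JSON error body in one place.
func writeError(w http.ResponseWriter, code int, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(errorBody{Error: msg, Code: code})
}

// renderError shows how a unit test can capture the response with httptest.
func renderError(code int, msg string) (int, string) {
	rec := httptest.NewRecorder()
	writeError(rec, code, msg)
	return rec.Code, rec.Body.String()
}
```

Routing every handler's failure path through one helper is what keeps the string-body responses from creeping back in.
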
10. Potential Future Features
| Feature | Value | Complexity |
|---|---|---|
| Streaming WatchState RPC | Real-time cart updates for clients | Medium |
| Event sourcing / audit log | Replay, analytics, debugging | High |
| Promotion / coupon engine plugin | Business extensibility | Medium |
| Partial cart reservation / inventory lock | Stock accuracy under concurrency | High |
| Multi-currency pricing | Globalization | Medium |
| GraphQL facade | Client flexibility | Medium |
11. Suggested Prioritized Backlog (Condensed)
- Coverage test + decode error mapping (P0)
- Proto regeneration & cleanup (P0)
- Metrics wrapper for registry (P1)
- Multi-node ownership integration test (P1)
- Delivery pricing abstraction (P2)
- Lease version in remote RPCs (P2)
- BatchMutate evaluation (P3)
- TLS / auth hardening (P3) if going multi-tenant/public
- Event sourcing (Evaluate after stability) (P4)
12. Simplifying the Developer Workflow
| Pain | Simplifier |
|---|---|
| Manual mutation boilerplate | Code generator for registry stubs |
| Forgetting totals | Enforce WithTotals lint: fail if mutation touches items/deliveries without flag |
| Hard to inspect remote ownership | /internal/ownership debug endpoint (JSON of local + remoteIndex) |
| Hard to see mutation timings | Add ?debug=latency query parameter to return per-mutation durations |
| Cookie dev confusion (Secure flag) | Env var: DEV_INSECURE_COOKIES=1 |
13. Example: Mutation Codegen Sketch (Future)
Input: cart_actor.proto
Output: mutation_auto.go
- Detect messages used in RPC wrappers (e.g., AddItemRequest → payload field).
- Generate RegisterMutation template if handler not found.
- Mark with // TODO implement logic.
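
The steps above could be prototyped with text/template before investing in a full proto-descriptor generator. In the sketch below, the message names are passed in as plain strings rather than parsed from cart_actor.proto, and the RegisterMutation call in the template assumes the func(*CartGrain, *T) error handler shape. Both are assumptions.

```go
package main

import (
	"bytes"
	"text/template"
)

// stubTemplate renders one registration stub per message name.
var stubTemplate = template.Must(template.New("stub").Parse(`func init() {
	RegisterMutation(func(g *CartGrain, m *{{.}}) error {
		// TODO implement logic
		return nil
	})
}
`))

// missingStubs returns the messages that have no registered handler yet,
// preserving the input order.
func missingStubs(messages []string, registered map[string]bool) []string {
	var missing []string
	for _, m := range messages {
		if !registered[m] {
			missing = append(missing, m)
		}
	}
	return missing
}

// renderStub produces the Go source for one message's stub.
func renderStub(message string) (string, error) {
	var buf bytes.Buffer
	if err := stubTemplate.Execute(&buf, message); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

A generator built this way would write the rendered stubs into mutation_auto.go and run them through gofmt. Discovering message names from the proto file is left out of the sketch.
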
14. Risk / Impact Matrix (Abbreviated)
| Change | Risk | Mitigation |
|---|---|---|
| Replace remote switch with registry | Possible missing registration → runtime error | Coverage test gating CI |
| Lease introduction | Split-brain if version mishandled | Increment + assert monotonic; test race |
| BatchMutate | Large atomic operations starving others | Size limits & fair scheduling |
| Event sourcing | Storage + replay complexity | Start with append-only log + compaction job |
15. Contributing Workflow (Proposed)
- Add / modify proto → run make protogen
- Implement mutation logic → add RegisterMutation invocation
- Add/Update tests (unit + integration)
- Run make verify (lint, test, coverage, proto diff)
- Open PR (template auto-checklist referencing this TODO)
- Merge requires green CI + coverage threshold
16. Open Questions
| Question | Notes |
|---|---|
| Do we need sticky sessions for HTTP layer scaling? | Currently cart id routing suffices |
| Should deliveries prune invalid line references on SetCartRequest? | Inconsistency risk; add optional cleanup |
| Is checkout idempotency strict enough? | Multiple create vs update semantics |
| Add version field to CartState for optimistic concurrency? | Could enable external CAS writes |
17. Tracking
Mark any completed tasks with [x]:
Last updated: roadmap draft – refine after first metrics & scaling test run.