# TODO / Roadmap A living roadmap for improving the cart actor system. Focus areas: 1. Reliability & correctness 2. Simplicity of mutation & ownership flows 3. Developer experience (DX) 4. Operability (observability, tracing, metrics) 5. Performance & scalability 6. Security & multi-tenant readiness --- ## 1. Immediate Next Steps (High-Leverage) | Priority | Task | Goal | Effort | Owner | Notes | |----------|------|------|--------|-------|-------| | P0 | Add mutation registry coverage test | Ensure no unregistered mutations silently fail | S | | Failing fast in CI | | P0 | Add decodeJSON helper + 400 mapping for EOF | Reduce noisy 500 logs | S | | Improves client API clarity | | P0 | Regenerate protos & prune unused messages (CreateCheckoutOrder, Checkout RPC remnants) | Eliminate dead types | S | | Avoid confusion | | P0 | Add integration test: multi-node ownership negotiation | Validate quorum logic | M | | Spin up 2–3 nodes ephemeral | | P1 | Export Prometheus metrics for per-mutation counts & latency | Operability | M | | Wrap registry handlers | | P1 | Add graceful shutdown ordering (Closing → wait for acks → stop gRPC) | Reduce in-flight mutation failures | S | | Add context cancellation | | P1 | Add coverage for InitializeCheckout / OrderCreated flows | Checkout reliability | S | | Simulate Klarna stub | | P2 | Add optional batching client (apply multiple mutations locally then persist) | Performance | M | | Only if needed | --- ## 2. Simplification Opportunities ### A. RemoteGrain Proxy Mapping Current: manual switch building each RPC call. Simplify by: - Generating a thin client adapter from proto RPC descriptors (codegen). - Or using a registry similar to mutation registry but for “outbound call constructors”. Benefit: adding a new mutation = add proto + register server handler + register outbound invoker (no switch edits). ### B. Ownership Negotiation Current: ad hoc quorum rule in `SyncedPool`. Simplify: - Introduce explicit `OwnershipLease{holder, expiresAt, version}`. - Use monotonic version increment—reject stale ConfirmOwner replies. - Optional: add randomized backoff to reduce thundering herd on contested cart ids. ### C. CartId Handling Current: ephemeral 16-byte array with trimmed string semantics. Simplify: - Use ULID / UUIDv7 (time-ordered, collision-resistant) for easier external correlation. - Provide helper `NewCartIdString()` and keep internal fixed-size if still desired. ### D. Mutation Signatures Current: registry assumes `func(*CartGrain, *T) error`. Extension option: allow pure transforms returning a delta struct (for audit/logging): ``` type MutationResult struct { Changed bool Events []interface{} } ``` Only implement if auditing/event-sourcing reintroduced. --- ## 3. Developer Experience Improvements | Task | Rationale | Approach | |------|-----------|----------| | Makefile targets: `make run-single`, `make run-multi N=3` | Faster local cluster spin-up | Docker compose or background “mini cluster” scripts | | Template for new mutation (generator) | Reduce boilerplate | `go:generate` scanning proto for new RPCs | | Lint config (golangci-lint) | Catch subtle issues early | Add `.golangci.yml` | | Pre-commit hook for proto regeneration check | Avoid stale generated code | Script compares git diff after `make protogen` | | Example client (Go + curl snippets auto-generated) | Onboarding | Codegen a markdown from proto comments | --- ## 4. Observability / Metrics / Tracing | Area | Metric / Trace | Notes | |------|----------------|-------| | Mutation registry | `cart_mutations_total{type,success}`; duration histogram | Wrap handler | | Ownership negotiation | `cart_ownership_attempts_total{result}` | result=accepted,rejected,timeout | | Remote latency | `cart_remote_mutation_seconds{method}` | Use client interceptors | | Pings | `cart_remote_missed_pings_total{host}` | Already count, expose | | Checkout flow | `checkout_attempts_total`, `checkout_failures_total` | Differentiate Klarna vs internal errors | | Tracing | Span: HTTP handler → SyncedPool.Apply → (Remote?) gRPC → mutation handler | Add OpenTelemetry instrumentation | --- ## 5. Performance & Scalability | Concern | Idea | Trade-Off | |---------|------|-----------| | High mutation rate on single cart | Introduce optional mutation queue (serialize explicitly) | Slight latency increase per op | | Remote call overhead | Add client-side gRPC pooling & per-host circuit breaker | Complexity vs resilience | | TTL purge efficiency | Use min-heap or timing wheel instead of slice scan | More code, better big-N performance | | Batch network latency | Add `BatchMutate` RPC (list of mutations applied atomically) | Lost single-op simplicity | --- ## 6. Reliability Features | Feature | Description | Priority | |---------|-------------|----------| | Lease fencing token | Include `ownership_version` in all remote mutate requests | M | | Retry policy | Limited retry for transient network errors (idempotent mutations only) | L | | Dead host reconciliation | On host removal, proactively attempt re-acquire of its carts | M | | Drain mode | Node marks itself “draining” → refuses new ownership claims | M | --- ## 7. Security & Hardening | Area | Next Step | Detail | |------|-----------|--------| | Transport | mTLS on gRPC | Use SPIFFE IDs or simple CA | | AuthN/AuthZ | Interceptor enforcing service token | Inject metadata header | | Input validation | Strengthen JSON decode responses | Disallow unknown fields globally | | Rate limiting | Per-IP / per-cart throttling | Guard hotspot abuse | | Multi-tenancy | Tenant id dimension in cart id or metadata | Partition metrics & ownership | --- ## 8. Testing Strategy Enhancements | Gap | Improvement | |-----|------------| | No multi-node integration test in CI | Spin ephemeral in-process servers on randomized ports | | Mutation regression | Table-driven tests auto-discover handlers via registry | | Ownership race | Stress test: concurrent Apply on same new cart id from N goroutines | | Checkout external dependency | Klarna mock server (HTTptest) + deterministic responses | | Fuzzing | Fuzz `BuildCheckoutOrderPayload` & mutation handlers for panics | --- ## 9. Cleanup / Tech Debt | Item | Action | |------|--------| | Remove deprecated proto remnants (CreateCheckoutOrder, Checkout RPC) | Delete & regenerate | | Consolidate duplicate tax computations | Single helper with tax config | | Delivery price hard-coded (4900) | Config or pricing strategy interface | | Mixed naming (camel vs snake JSON historically) | Provide stable external API doc; accept old forms if needed | | Manual remote mutation switch (if still present) | Replace with generated outbound registry | | Mixed error responses (string bodies) | Standardize JSON: `{ "error": "...", "code": 400 }` | --- ## 10. Potential Future Features | Feature | Value | Complexity | |---------|-------|------------| | Streaming `WatchState` RPC | Real-time cart updates for clients | Medium | | Event sourcing / audit log | Replay, analytics, debugging | High | | Promotion / coupon engine plugin | Business extensibility | Medium | | Partial cart reservation / inventory lock | Stock accuracy under concurrency | High | | Multi-currency pricing | Globalization | Medium | | GraphQL facade | Client flexibility | Medium | --- ## 11. Suggested Prioritized Backlog (Condensed) 1. Coverage test + decode error mapping (P0) 2. Proto regeneration & cleanup (P0) 3. Metrics wrapper for registry (P1) 4. Multi-node ownership integration test (P1) 5. Delivery pricing abstraction (P2) 6. Lease version in remote RPCs (P2) 7. BatchMutate evaluation (P3) 8. TLS / auth hardening (P3) if going multi-tenant/public 9. Event sourcing (Evaluate after stability) (P4) --- ## 12. Simplifying the Developer Workflow | Pain | Simplifier | |------|------------| | Manual mutation boilerplate | Code generator for registry stubs | | Forgetting totals | Enforce WithTotals lint: fail if mutation touches items/deliveries without flag | | Hard to inspect remote ownership | `/internal/ownership` debug endpoint (JSON of local + remoteIndex) | | Hard to see mutation timings | Add `?debug=latency` header to return per-mutation durations | | Cookie dev confusion (Secure flag) | Env var: `DEV_INSECURE_COOKIES=1` | --- ## 13. Example: Mutation Codegen Sketch (Future) Input: cart_actor.proto Output: `mutation_auto.go` - Detect messages used in RPC wrappers (e.g., `AddItemRequest` → payload field). - Generate `RegisterMutation` template if handler not found. - Mark with `// TODO implement logic`. --- ## 14. Risk / Impact Matrix (Abbreviated) | Change | Risk | Mitigation | |--------|------|-----------| | Replace remote switch with registry | Possible missing registration → runtime error | Coverage test gating CI | | Lease introduction | Split-brain if version mishandled | Increment + assert monotonic; test race | | BatchMutate | Large atomic operations starving others | Size limits & fair scheduling | | Event sourcing | Storage + replay complexity | Start with append-only log + compaction job | --- ## 15. Contributing Workflow (Proposed) 1. Add / modify proto → run `make protogen` 2. Implement mutation logic → add `RegisterMutation` invocation 3. Add/Update tests (unit + integration) 4. Run `make verify` (lint, test, coverage, proto diff) 5. Open PR (template auto-checklist referencing this TODO) 6. Merge requires green CI + coverage threshold --- ## 16. Open Questions | Question | Notes | |----------|-------| | Do we need sticky sessions for HTTP layer scaling? | Currently cart id routing suffices | | Should deliveries prune invalid line references on SetCartRequest? | Inconsistency risk; add optional cleanup | | Is checkout idempotency strict enough? | Multiple create vs update semantics | | Add version field to CartState for optimistic concurrency? | Could enable external CAS writes | --- ## 17. Tracking Mark any completed tasks with `[x]`: - [ ] Coverage test - [ ] Decode helper + 400 mapping - [ ] Proto cleanup - [ ] Registry metrics instrumentation - [ ] Ownership multi-node test - [ ] Lease versioning - [ ] Delivery pricing abstraction - [ ] TLS/mTLS internal - [ ] BatchMutate design doc --- _Last updated: roadmap draft – refine after first metrics & scaling test run._