Files
go-cart-actor/TODO.md
matst80 716f1121aa
Some checks failed
Build and Publish / BuildAndDeploy (push) Successful in 3m7s
Build and Publish / BuildAndDeployAmd64 (push) Has been cancelled
even more refactoring
2025-10-10 11:46:19 +00:00

10 KiB
Raw Blame History

TODO / Roadmap

A living roadmap for improving the cart actor system. Focus areas:

  1. Reliability & correctness
  2. Simplicity of mutation & ownership flows
  3. Developer experience (DX)
  4. Operability (observability, tracing, metrics)
  5. Performance & scalability
  6. Security & multi-tenant readiness

1. Immediate Next Steps (High-Leverage)

Priority Task Goal Effort Owner Notes
P0 Add mutation registry coverage test Ensure no unregistered mutations silently fail S Failing fast in CI
P0 Add decodeJSON helper + 400 mapping for EOF Reduce noisy 500 logs S Improves client API clarity
P0 Regenerate protos & prune unused messages (CreateCheckoutOrder, Checkout RPC remnants) Eliminate dead types S Avoid confusion
P0 Add integration test: multi-node ownership negotiation Validate quorum logic M Spin up 23 nodes ephemeral
P1 Export Prometheus metrics for per-mutation counts & latency Operability M Wrap registry handlers
P1 Add graceful shutdown ordering (Closing → wait for acks → stop gRPC) Reduce in-flight mutation failures S Add context cancellation
P1 Add coverage for InitializeCheckout / OrderCreated flows Checkout reliability S Simulate Klarna stub
P2 Add optional batching client (apply multiple mutations locally then persist) Performance M Only if needed

2. Simplification Opportunities

A. RemoteGrain Proxy Mapping

Current: manual switch building each RPC call. Simplify by:

  • Generating a thin client adapter from proto RPC descriptors (codegen).
  • Or using a registry similar to mutation registry but for “outbound call constructors”. Benefit: adding a new mutation = add proto + register server handler + register outbound invoker (no switch edits).

B. Ownership Negotiation

Current: ad hoc quorum rule in SyncedPool. Simplify:

  • Introduce explicit OwnershipLease{holder, expiresAt, version}.
  • Use monotonic version increment—reject stale ConfirmOwner replies.
  • Optional: add randomized backoff to reduce thundering herd on contested cart ids.

C. CartId Handling

Current: ephemeral 16-byte array with trimmed string semantics. Simplify:

  • Use ULID / UUIDv7 (time-ordered, collision-resistant) for easier external correlation.
  • Provide helper NewCartIdString() and keep internal fixed-size if still desired.

D. Mutation Signatures

Current: registry assumes func(*CartGrain, *T) error. Extension option: allow pure transforms returning a delta struct (for audit/logging):

type MutationResult struct {
  Changed bool
  Events  []interface{}
}

Only implement if auditing/event-sourcing reintroduced.


3. Developer Experience Improvements

Task Rationale Approach
Makefile targets: make run-single, make run-multi N=3 Faster local cluster spin-up Docker compose or background “mini cluster” scripts
Template for new mutation (generator) Reduce boilerplate go:generate scanning proto for new RPCs
Lint config (golangci-lint) Catch subtle issues early Add .golangci.yml
Pre-commit hook for proto regeneration check Avoid stale generated code Script compares git diff after make protogen
Example client (Go + curl snippets auto-generated) Onboarding Codegen a markdown from proto comments

4. Observability / Metrics / Tracing

Area Metric / Trace Notes
Mutation registry cart_mutations_total{type,success}; duration histogram Wrap handler
Ownership negotiation cart_ownership_attempts_total{result} result=accepted,rejected,timeout
Remote latency cart_remote_mutation_seconds{method} Use client interceptors
Pings cart_remote_missed_pings_total{host} Already count, expose
Checkout flow checkout_attempts_total, checkout_failures_total Differentiate Klarna vs internal errors
Tracing Span: HTTP handler → SyncedPool.Apply → (Remote?) gRPC → mutation handler Add OpenTelemetry instrumentation

5. Performance & Scalability

Concern Idea Trade-Off
High mutation rate on single cart Introduce optional mutation queue (serialize explicitly) Slight latency increase per op
Remote call overhead Add client-side gRPC pooling & per-host circuit breaker Complexity vs resilience
TTL purge efficiency Use min-heap or timing wheel instead of slice scan More code, better big-N performance
Batch network latency Add BatchMutate RPC (list of mutations applied atomically) Lost single-op simplicity

6. Reliability Features

Feature Description Priority
Lease fencing token Include ownership_version in all remote mutate requests M
Retry policy Limited retry for transient network errors (idempotent mutations only) L
Dead host reconciliation On host removal, proactively attempt re-acquire of its carts M
Drain mode Node marks itself “draining” → refuses new ownership claims M

7. Security & Hardening

Area Next Step Detail
Transport mTLS on gRPC Use SPIFFE IDs or simple CA
AuthN/AuthZ Interceptor enforcing service token Inject metadata header
Input validation Strengthen JSON decode responses Disallow unknown fields globally
Rate limiting Per-IP / per-cart throttling Guard hotspot abuse
Multi-tenancy Tenant id dimension in cart id or metadata Partition metrics & ownership

8. Testing Strategy Enhancements

Gap Improvement
No multi-node integration test in CI Spin ephemeral in-process servers on randomized ports
Mutation regression Table-driven tests auto-discover handlers via registry
Ownership race Stress test: concurrent Apply on same new cart id from N goroutines
Checkout external dependency Klarna mock server (HTTptest) + deterministic responses
Fuzzing Fuzz BuildCheckoutOrderPayload & mutation handlers for panics

9. Cleanup / Tech Debt

Item Action
Remove deprecated proto remnants (CreateCheckoutOrder, Checkout RPC) Delete & regenerate
Consolidate duplicate tax computations Single helper with tax config
Delivery price hard-coded (4900) Config or pricing strategy interface
Mixed naming (camel vs snake JSON historically) Provide stable external API doc; accept old forms if needed
Manual remote mutation switch (if still present) Replace with generated outbound registry
Mixed error responses (string bodies) Standardize JSON: { "error": "...", "code": 400 }

10. Potential Future Features

Feature Value Complexity
Streaming WatchState RPC Real-time cart updates for clients Medium
Event sourcing / audit log Replay, analytics, debugging High
Promotion / coupon engine plugin Business extensibility Medium
Partial cart reservation / inventory lock Stock accuracy under concurrency High
Multi-currency pricing Globalization Medium
GraphQL facade Client flexibility Medium

11. Suggested Prioritized Backlog (Condensed)

  1. Coverage test + decode error mapping (P0)
  2. Proto regeneration & cleanup (P0)
  3. Metrics wrapper for registry (P1)
  4. Multi-node ownership integration test (P1)
  5. Delivery pricing abstraction (P2)
  6. Lease version in remote RPCs (P2)
  7. BatchMutate evaluation (P3)
  8. TLS / auth hardening (P3) if going multi-tenant/public
  9. Event sourcing (Evaluate after stability) (P4)

12. Simplifying the Developer Workflow

Pain Simplifier
Manual mutation boilerplate Code generator for registry stubs
Forgetting totals Enforce WithTotals lint: fail if mutation touches items/deliveries without flag
Hard to inspect remote ownership /internal/ownership debug endpoint (JSON of local + remoteIndex)
Hard to see mutation timings Add ?debug=latency header to return per-mutation durations
Cookie dev confusion (Secure flag) Env var: DEV_INSECURE_COOKIES=1

13. Example: Mutation Codegen Sketch (Future)

Input: cart_actor.proto
Output: mutation_auto.go

  • Detect messages used in RPC wrappers (e.g., AddItemRequest → payload field).
  • Generate RegisterMutation template if handler not found.
  • Mark with // TODO implement logic.

14. Risk / Impact Matrix (Abbreviated)

Change Risk Mitigation
Replace remote switch with registry Possible missing registration → runtime error Coverage test gating CI
Lease introduction Split-brain if version mishandled Increment + assert monotonic; test race
BatchMutate Large atomic operations starving others Size limits & fair scheduling
Event sourcing Storage + replay complexity Start with append-only log + compaction job

15. Contributing Workflow (Proposed)

  1. Add / modify proto → run make protogen
  2. Implement mutation logic → add RegisterMutation invocation
  3. Add/Update tests (unit + integration)
  4. Run make verify (lint, test, coverage, proto diff)
  5. Open PR (template auto-checklist referencing this TODO)
  6. Merge requires green CI + coverage threshold

16. Open Questions

Question Notes
Do we need sticky sessions for HTTP layer scaling? Currently cart id routing suffices
Should deliveries prune invalid line references on SetCartRequest? Inconsistency risk; add optional cleanup
Is checkout idempotency strict enough? Multiple create vs update semantics
Add version field to CartState for optimistic concurrency? Could enable external CAS writes

17. Tracking

Mark any completed tasks with [x]:

  • Coverage test
  • Decode helper + 400 mapping
  • Proto cleanup
  • Registry metrics instrumentation
  • Ownership multi-node test
  • Lease versioning
  • Delivery pricing abstraction
  • TLS/mTLS internal
  • BatchMutate design doc

Last updated: roadmap draft refine after first metrics & scaling test run.