# TODO / Roadmap

A living roadmap for improving the cart actor system. Focus areas:
1. Reliability & correctness
2. Simplicity of mutation & ownership flows
3. Developer experience (DX)
4. Operability (observability, tracing, metrics)
5. Performance & scalability
6. Security & multi-tenant readiness

---

## 1. Immediate Next Steps (High-Leverage)

| Priority | Task | Goal | Effort | Owner | Notes |
|----------|------|------|--------|-------|-------|
| P0 | Add mutation registry coverage test | Ensure no unregistered mutations silently fail | S |  | Fail fast in CI |
| P0 | Add decodeJSON helper + 400 mapping for EOF | Reduce noisy 500 logs | S |  | Improves client API clarity |
| P0 | Regenerate protos & prune unused messages (CreateCheckoutOrder, Checkout RPC remnants) | Eliminate dead types | S |  | Avoid confusion |
| P0 | Add integration test: multi-node ownership negotiation | Validate quorum logic | M |  | Spin up 2–3 ephemeral nodes |
| P1 | Export Prometheus metrics for per-mutation counts & latency | Operability | M |  | Wrap registry handlers |
| P1 | Add graceful shutdown ordering (Closing → wait for acks → stop gRPC) | Reduce in-flight mutation failures | S |  | Add context cancellation |
| P1 | Add coverage for InitializeCheckout / OrderCreated flows | Checkout reliability | S |  | Simulate Klarna stub |
| P2 | Add optional batching client (apply multiple mutations locally then persist) | Performance | M |  | Only if needed |
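
As a minimal sketch of the `decodeJSON` helper from the table above: the signature, the returned status code, and the `statusFor` demo function are assumptions for illustration, not the project's existing API. The idea is that an empty or truncated body is a client error, so it should surface as 400 rather than a logged 500.

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"net/http"
	"strings"
)

// decodeJSON (hypothetical signature) decodes body into dst and returns the
// HTTP status the handler should respond with.
func decodeJSON(body io.Reader, dst any) (int, error) {
	err := json.NewDecoder(body).Decode(dst)
	switch {
	case err == nil:
		return http.StatusOK, nil
	case errors.Is(err, io.EOF):
		// Empty body: a client mistake, not a server fault.
		return http.StatusBadRequest, errors.New("request body is empty")
	case errors.Is(err, io.ErrUnexpectedEOF):
		// The JSON document ended in the middle of a value.
		return http.StatusBadRequest, errors.New("request body is truncated")
	default:
		// Malformed JSON or mismatched types are also reported as 400.
		return http.StatusBadRequest, err
	}
}

// statusFor is a small demo: it reports the status chosen for a raw body.
func statusFor(body string) int {
	var v map[string]any
	status, _ := decodeJSON(strings.NewReader(body), &v)
	return status
}
```

A handler would call `decodeJSON(r.Body, &req)` and write the returned status together with the error message.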

---

## 2. Simplification Opportunities

### A. RemoteGrain Proxy Mapping
Current: manual switch building each RPC call.
Simplify by:
- Generating a thin client adapter from proto RPC descriptors (codegen).
- Or using a registry similar to mutation registry but for “outbound call constructors”.
Benefit: adding a new mutation = add proto + register server handler + register outbound invoker (no switch edits).
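
The registry option could look roughly like the sketch below. `Invoker`, `RegisterOutbound`, `Dispatch`, and the `AddItem` message are hypothetical names; in the real system each invoker would wrap a generated gRPC client call instead of a plain function.

```go
package main

import (
	"fmt"
	"reflect"
)

// Invoker sends one mutation payload to the remote owner of a cart.
type Invoker func(cartID string, payload any) error

// outbound maps a payload pointer type to the invoker that can send it.
var outbound = map[reflect.Type]Invoker{}

// RegisterOutbound associates payload type T with a typed invoker.
func RegisterOutbound[T any](inv func(cartID string, payload *T) error) {
	outbound[reflect.TypeOf((*T)(nil))] = func(cartID string, p any) error {
		return inv(cartID, p.(*T))
	}
}

// Dispatch replaces the manual switch: it looks the invoker up by type.
func Dispatch(cartID string, payload any) error {
	inv, ok := outbound[reflect.TypeOf(payload)]
	if !ok {
		return fmt.Errorf("no outbound invoker registered for %T", payload)
	}
	return inv(cartID, payload)
}

// AddItem stands in for a generated proto message.
type AddItem struct{ Sku string }
```

With this shape, a missing registration becomes a runtime error at dispatch time, which is why a registration coverage test in CI matters.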

### B. Ownership Negotiation
Current: ad hoc quorum rule in `SyncedPool`.
Simplify:
- Introduce explicit `OwnershipLease{holder, expiresAt, version}`.
- Increment the version monotonically and reject stale `ConfirmOwner` replies.
- Optional: add randomized backoff to reduce thundering herd on contested cart ids.
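
A minimal sketch of the lease and the stale-reply check, assuming Go types for the three fields named above. The `ConfirmOwner` struct and the `Accept` and `Expired` methods are illustrative stand-ins, not the existing `SyncedPool` API.

```go
package main

import "time"

// OwnershipLease carries the fields proposed above; the types are assumptions.
type OwnershipLease struct {
	Holder    string
	ExpiresAt time.Time
	Version   uint64
}

// ConfirmOwner stands in for a reply received from a remote node.
type ConfirmOwner struct {
	Holder  string
	Version uint64
}

// Accept applies a reply only if it carries a strictly newer version.
// Replies with an equal or older version are stale and are rejected.
func (l *OwnershipLease) Accept(reply ConfirmOwner, now time.Time, ttl time.Duration) bool {
	if reply.Version <= l.Version {
		return false
	}
	l.Holder = reply.Holder
	l.Version = reply.Version
	l.ExpiresAt = now.Add(ttl)
	return true
}

// Expired reports whether the lease may be contested again.
func (l *OwnershipLease) Expired(now time.Time) bool {
	return !now.Before(l.ExpiresAt)
}
```

The caller is assumed to serialize access to the lease, for example by holding the pool's lock.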

### C. CartId Handling
Current: ephemeral 16-byte array with trimmed string semantics.
Simplify:
- Use ULID / UUIDv7 (time-ordered, collision-resistant) for easier external correlation.
- Provide helper `NewCartIdString()` and keep internal fixed-size if still desired.
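
One way to get a time-ordered id while keeping the fixed 16-byte internal form is to fill the array using the UUIDv7 bit layout. The sketch below uses only the standard library; the `CartId` type and `NewCartId` function are assumptions, and ids created within the same millisecond are not ordered relative to each other.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"encoding/hex"
	"time"
)

// CartId keeps the fixed-size internal representation.
type CartId [16]byte

// NewCartId fills the id using the UUIDv7 layout: a 48-bit Unix millisecond
// timestamp followed by random bits, with version and variant bits set.
func NewCartId() (CartId, error) {
	var id CartId
	if _, err := rand.Read(id[:]); err != nil {
		return id, err
	}
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], uint64(time.Now().UnixMilli()))
	copy(id[:6], ts[2:])          // low 48 bits of the timestamp
	id[6] = (id[6] & 0x0f) | 0x70 // version 7
	id[8] = (id[8] & 0x3f) | 0x80 // RFC variant bits
	return id, nil
}

// String renders the canonical 8-4-4-4-12 hexadecimal form.
func (id CartId) String() string {
	var buf [36]byte
	hex.Encode(buf[0:8], id[0:4])
	buf[8] = '-'
	hex.Encode(buf[9:13], id[4:6])
	buf[13] = '-'
	hex.Encode(buf[14:18], id[6:8])
	buf[18] = '-'
	hex.Encode(buf[19:23], id[8:10])
	buf[23] = '-'
	hex.Encode(buf[24:36], id[10:16])
	return string(buf[:])
}

// NewCartIdString is the string helper suggested above.
func NewCartIdString() (string, error) {
	id, err := NewCartId()
	if err != nil {
		return "", err
	}
	return id.String(), nil
}
```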

### D. Mutation Signatures
Current: registry assumes `func(*CartGrain, *T) error`.
Extension option: allow pure transforms returning a delta struct (for audit/logging):
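```go
// MutationResult describes what a pure mutation transform changed.
type MutationResult struct {
	Changed bool  // true if the cart state was modified
	Events  []any // domain events emitted, for audit/logging
}
```
Only implement this if auditing or event sourcing is reintroduced.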
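
For reference, the current `func(*CartGrain, *T) error` shape can be expressed with generics, and the same registry can back the P0 coverage test. This is a sketch under assumptions: `RegisterMutation`, `Apply`, and `Unregistered` are illustrative, and `CartGrain`, `AddItem`, and `RemoveItem` are reduced stand-ins for the real types.

```go
package main

import (
	"fmt"
	"reflect"
)

// CartGrain is a reduced stand-in for the real grain.
type CartGrain struct{ Items []string }

type handler func(g *CartGrain, payload any) error

var registry = map[reflect.Type]handler{}

// RegisterMutation stores a typed handler under its payload pointer type.
func RegisterMutation[T any](fn func(*CartGrain, *T) error) {
	registry[reflect.TypeOf((*T)(nil))] = func(g *CartGrain, p any) error {
		return fn(g, p.(*T))
	}
}

// Apply fails loudly instead of silently ignoring an unregistered mutation.
func Apply(g *CartGrain, payload any) error {
	h, ok := registry[reflect.TypeOf(payload)]
	if !ok {
		return fmt.Errorf("unregistered mutation %T", payload)
	}
	return h(g, payload)
}

// Unregistered lists the expected payload types that have no handler.
// A coverage test can assert that the result is empty.
func Unregistered(expected ...any) []string {
	var missing []string
	for _, p := range expected {
		if _, ok := registry[reflect.TypeOf(p)]; !ok {
			missing = append(missing, fmt.Sprintf("%T", p))
		}
	}
	return missing
}

// AddItem and RemoveItem stand in for generated proto messages.
type AddItem struct{ Sku string }
type RemoveItem struct{ Sku string }
```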

---

## 3. Developer Experience Improvements

| Task | Rationale | Approach |
|------|-----------|----------|
| Makefile targets: `make run-single`, `make run-multi N=3` | Faster local cluster spin-up | Docker compose or background “mini cluster” scripts |
| Template for new mutation (generator) | Reduce boilerplate | `go:generate` scanning proto for new RPCs |
| Lint config (golangci-lint) | Catch subtle issues early | Add `.golangci.yml` |
| Pre-commit hook for proto regeneration check | Avoid stale generated code | Script compares git diff after `make protogen` |
| Example client (Go + curl snippets auto-generated) | Onboarding | Codegen a markdown from proto comments |

---

## 4. Observability / Metrics / Tracing

| Area | Metric / Trace | Notes |
|------|----------------|-------|
| Mutation registry | `cart_mutations_total{type,success}`; duration histogram | Wrap handler |
| Ownership negotiation | `cart_ownership_attempts_total{result}` | result=accepted,rejected,timeout |
| Remote latency | `cart_remote_mutation_seconds{method}` | Use client interceptors |
| Pings | `cart_remote_missed_pings_total{host}` | Already counted internally; expose as a metric |
| Checkout flow | `checkout_attempts_total`, `checkout_failures_total` | Differentiate Klarna vs internal errors |
| Tracing | Span: HTTP handler → SyncedPool.Apply → (Remote?) gRPC → mutation handler | Add OpenTelemetry instrumentation |
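
The "wrap handler" approach for the mutation registry can be sketched without committing to a metrics library. The `Stats` type below is an in-memory stand-in for `cart_mutations_total{type,success}` and a latency sum; a real implementation would presumably use Prometheus counters and histograms instead. All names here are assumptions.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Handler is a mutation handler with its arguments already bound.
type Handler func() error

// Stats collects per-mutation counts and summed latency in memory.
type Stats struct {
	mu       sync.Mutex
	Total    map[string]int           // key: type + "|" + success
	Duration map[string]time.Duration // summed latency per mutation type
}

func NewStats() *Stats {
	return &Stats{Total: map[string]int{}, Duration: map[string]time.Duration{}}
}

// Instrument wraps next so every call is counted and timed.
func (s *Stats) Instrument(mutationType string, next Handler) Handler {
	return func() error {
		start := time.Now()
		err := next()
		elapsed := time.Since(start)

		success := "true"
		if err != nil {
			success = "false"
		}
		s.mu.Lock()
		defer s.mu.Unlock()
		s.Total[mutationType+"|"+success]++
		s.Duration[mutationType] += elapsed
		return err
	}
}

// ErrInvalidQuantity is a sample domain error used to show the failure path.
var ErrInvalidQuantity = errors.New("invalid quantity")
```

Because the wrapper is applied at registration time, individual handlers stay free of metrics code.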

---

## 5. Performance & Scalability

| Concern | Idea | Trade-Off |
|---------|------|-----------|
| High mutation rate on single cart | Introduce optional mutation queue (serialize explicitly) | Slight latency increase per op |
| Remote call overhead | Add client-side gRPC pooling & per-host circuit breaker | Complexity vs resilience |
| TTL purge efficiency | Use min-heap or timing wheel instead of slice scan | More code, better big-N performance |
| Batch network latency | Add `BatchMutate` RPC (list of mutations applied atomically) | Lost single-op simplicity |
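
The min-heap option for TTL purging could be sketched with `container/heap` as follows. The `Purger` type and its methods are hypothetical, and expiry is kept as Unix seconds for brevity. A production version would also need to handle carts whose TTL is refreshed after being tracked, for example by checking the current expiry before evicting.

```go
package main

import "container/heap"

type entry struct {
	cartID    string
	expiresAt int64 // Unix seconds
}

// expiryHeap is a min-heap ordered by earliest expiry.
type expiryHeap []entry

func (h expiryHeap) Len() int           { return len(h) }
func (h expiryHeap) Less(i, j int) bool { return h[i].expiresAt < h[j].expiresAt }
func (h expiryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *expiryHeap) Pop() any {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// Purger tracks cart expiry times.
type Purger struct{ h expiryHeap }

// Track schedules a cart for purging at expiresAt.
func (p *Purger) Track(cartID string, expiresAt int64) {
	heap.Push(&p.h, entry{cartID: cartID, expiresAt: expiresAt})
}

// Purge removes and returns every cart that expired at or before now.
// It costs O(k log n) for k expired carts instead of scanning all n.
func (p *Purger) Purge(now int64) []string {
	var expired []string
	for p.h.Len() > 0 && p.h[0].expiresAt <= now {
		expired = append(expired, heap.Pop(&p.h).(entry).cartID)
	}
	return expired
}
```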

---

## 6. Reliability Features

| Feature | Description | Priority |
|---------|-------------|----------|
| Lease fencing token | Include `ownership_version` in all remote mutate requests | Medium |
| Retry policy | Limited retries for transient network errors (idempotent mutations only) | Low |
| Dead host reconciliation | On host removal, proactively attempt to re-acquire its carts | Medium |
| Drain mode | Node marks itself “draining” → refuses new ownership claims | Medium |
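
Drain mode is the simplest of these to illustrate. In the sketch below, the `Node` type, `Claim`, `Drain`, and `ErrDraining` are hypothetical names; the assumed behavior is that a draining node keeps serving carts it already owns but refuses new claims.

```go
package main

import (
	"errors"
	"sync"
)

// ErrDraining is returned when a draining node is asked to take a new cart.
var ErrDraining = errors.New("node is draining")

// Node tracks locally owned carts and the drain flag.
type Node struct {
	mu       sync.Mutex
	draining bool
	owned    map[string]struct{}
}

func NewNode() *Node {
	return &Node{owned: map[string]struct{}{}}
}

// Drain marks the node as draining; existing ownership is left untouched.
func (n *Node) Drain() {
	n.mu.Lock()
	n.draining = true
	n.mu.Unlock()
}

// Claim registers ownership of cartID unless the node is draining.
// Carts that are already owned continue to be accepted.
func (n *Node) Claim(cartID string) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if _, ok := n.owned[cartID]; ok {
		return nil
	}
	if n.draining {
		return ErrDraining
	}
	n.owned[cartID] = struct{}{}
	return nil
}
```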

---

## 7. Security & Hardening

| Area | Next Step | Detail |
|------|-----------|--------|
| Transport | mTLS on gRPC | Use SPIFFE IDs or simple CA |
| AuthN/AuthZ | Interceptor enforcing service token | Inject metadata header |
| Input validation | Strengthen JSON decode responses | Disallow unknown fields globally |
| Rate limiting | Per-IP / per-cart throttling | Guard hotspot abuse |
| Multi-tenancy | Tenant id dimension in cart id or metadata | Partition metrics & ownership |
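
For the input validation row, `encoding/json` already supports rejecting unknown fields through `Decoder.DisallowUnknownFields`. The sketch below shows the call on a hypothetical `AddItemRequest`; the helper name `decodeStrict` and the field names are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// AddItemRequest is a hypothetical request body.
type AddItemRequest struct {
	Sku      string `json:"sku"`
	Quantity int    `json:"quantity"`
}

// decodeStrict decodes data into dst and fails on fields dst does not declare.
func decodeStrict(data []byte, dst any) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	return dec.Decode(dst)
}
```

Enabling this globally will also reject misspelled field names that are silently ignored today, so existing clients should be checked before it is rolled out.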

---

## 8. Testing Strategy Enhancements

| Gap | Improvement |
|-----|------------|
| No multi-node integration test in CI | Spin ephemeral in-process servers on randomized ports |
| Mutation regression | Table-driven tests auto-discover handlers via registry |
| Ownership race | Stress test: concurrent Apply on same new cart id from N goroutines |
| Checkout external dependency | Klarna mock server (`httptest`) + deterministic responses |
| Fuzzing | Fuzz `BuildCheckoutOrderPayload` & mutation handlers for panics |
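
The Klarna mock could be as small as the following `net/http/httptest` sketch. The `/orders` path, the `order_id` field, and the `createOrder` client function are placeholders for illustration and are not taken from the Klarna API or the existing checkout code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newKlarnaStub starts a local server that always returns the same order.
func newKlarnaStub() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusCreated)
		_ = json.NewEncoder(w).Encode(map[string]string{"order_id": "test-order-1"})
	})
	return httptest.NewServer(mux)
}

// createOrder stands in for the code under test: it posts to the stub and
// returns the order id from the response.
func createOrder(baseURL string) (string, error) {
	resp, err := http.Post(baseURL+"/orders", "application/json", http.NoBody)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var out struct {
		OrderID string `json:"order_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.OrderID, nil
}
```

In a test, the stub's `URL` field is passed to the checkout client in place of the real base URL, and the server is closed when the test ends.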

---

## 9. Cleanup / Tech Debt

| Item | Action |
|------|--------|
| Remove deprecated proto remnants (CreateCheckoutOrder, Checkout RPC) | Delete & regenerate |
| Consolidate duplicate tax computations | Single helper with tax config |
| Delivery price hard-coded (4900) | Config or pricing strategy interface |
| Mixed naming (camel vs snake JSON historically) | Provide stable external API doc; accept old forms if needed |
| Manual remote mutation switch (if still present) | Replace with generated outbound registry |
| Mixed error responses (string bodies) | Standardize JSON: `{ "error": "...", "code": 400 }` |
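
A minimal sketch of the standardized error body from the last row, assuming a single helper that all HTTP handlers would call. The names `apiError`, `errorBody`, and `writeError` are hypothetical.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// apiError mirrors the proposed shape: { "error": "...", "code": 400 }.
type apiError struct {
	Error string `json:"error"`
	Code  int    `json:"code"`
}

// errorBody renders the JSON payload for an error response.
func errorBody(code int, msg string) ([]byte, error) {
	return json.Marshal(apiError{Error: msg, Code: code})
}

// writeError sends the standardized JSON error to the client.
func writeError(w http.ResponseWriter, code int, msg string) {
	body, err := errorBody(code, msg)
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_, _ = w.Write(body)
}
```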

---

## 10. Potential Future Features

| Feature | Value | Complexity |
|---------|-------|------------|
| Streaming `WatchState` RPC | Real-time cart updates for clients | Medium |
| Event sourcing / audit log | Replay, analytics, debugging | High |
| Promotion / coupon engine plugin | Business extensibility | Medium |
| Partial cart reservation / inventory lock | Stock accuracy under concurrency | High |
| Multi-currency pricing | Globalization | Medium |
| GraphQL facade | Client flexibility | Medium |

---

## 11. Suggested Prioritized Backlog (Condensed)

1. Coverage test + decode error mapping (P0)
2. Proto regeneration & cleanup (P0)
3. Multi-node ownership integration test (P0)
4. Metrics wrapper for registry (P1)
5. Delivery pricing abstraction (P2)
6. Lease version in remote RPCs (P2)
7. BatchMutate evaluation (P3)
8. TLS / auth hardening (P3) if going multi-tenant/public
9. Event sourcing (Evaluate after stability) (P4)

---

## 12. Simplifying the Developer Workflow

| Pain | Simplifier |
|------|------------|
| Manual mutation boilerplate | Code generator for registry stubs |
| Forgetting totals | Enforce WithTotals lint: fail if mutation touches items/deliveries without flag |
| Hard to inspect remote ownership | `/internal/ownership` debug endpoint (JSON of local + remoteIndex) |
| Hard to see mutation timings | Add `?debug=latency` query parameter to return per-mutation durations |
| Cookie dev confusion (Secure flag) | Env var: `DEV_INSECURE_COOKIES=1` |
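
The cookie toggle from the last row could be implemented along these lines. The cookie name `cartid` and both function names are assumptions; only the `DEV_INSECURE_COOKIES=1` variable comes from the table.

```go
package main

import (
	"net/http"
	"os"
)

// secureCookies reports whether the Secure flag should be set. The getenv
// parameter makes the decision testable without touching the environment.
func secureCookies(getenv func(string) string) bool {
	return getenv("DEV_INSECURE_COOKIES") != "1"
}

// newCartCookie builds the cart cookie; Secure is dropped only when the
// development override is set.
func newCartCookie(cartID string) *http.Cookie {
	return &http.Cookie{
		Name:     "cartid",
		Value:    cartID,
		Path:     "/",
		HttpOnly: true,
		SameSite: http.SameSiteLaxMode,
		Secure:   secureCookies(os.Getenv),
	}
}
```

Defaulting to secure cookies means a missing or misspelled variable fails safe in production.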

---

## 13. Example: Mutation Codegen Sketch (Future)

Input: `cart_actor.proto`  
Output: `mutation_auto.go`
- Detect messages used in RPC wrappers (e.g., `AddItemRequest` → payload field).
- Generate `RegisterMutation` template if handler not found.
- Mark with `// TODO implement logic`.

---

## 14. Risk / Impact Matrix (Abbreviated)

| Change | Risk | Mitigation |
|--------|------|-----------|
| Replace remote switch with registry | Possible missing registration → runtime error | Coverage test gating CI |
| Lease introduction | Split-brain if version mishandled | Increment + assert monotonic; test race |
| BatchMutate | Large atomic operations starving others | Size limits & fair scheduling |
| Event sourcing | Storage + replay complexity | Start with append-only log + compaction job |

---

## 15. Contributing Workflow (Proposed)

1. Add / modify proto → run `make protogen`
2. Implement mutation logic → add `RegisterMutation` invocation
3. Add/Update tests (unit + integration)
4. Run `make verify` (lint, test, coverage, proto diff)
5. Open PR (template auto-checklist referencing this TODO)
6. Merge requires green CI + coverage threshold

---

## 16. Open Questions

| Question | Notes |
|----------|-------|
| Do we need sticky sessions for HTTP layer scaling? | Currently cart id routing suffices |
| Should deliveries prune invalid line references on SetCartRequest? | Inconsistency risk; add optional cleanup |
| Is checkout idempotency strict enough? | Multiple create vs update semantics |
| Add version field to CartState for optimistic concurrency? | Could enable external CAS writes |
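
To make the last question concrete, a version field would allow compare-and-set writes along these lines. This is a sketch only: the reduced `CartState`, the `CompareAndApply` method, and `ErrVersionConflict` are hypothetical, and the caller is assumed to serialize access to the state (as the owning grain would).

```go
package main

import "errors"

// ErrVersionConflict signals that the caller worked from an outdated state.
var ErrVersionConflict = errors.New("cart version conflict")

// CartState is a reduced stand-in with the proposed version field.
type CartState struct {
	Version uint64
	Items   []string
}

// CompareAndApply runs mutate only if the caller saw the current version,
// then bumps the version so concurrent writers based on the old one fail.
func (s *CartState) CompareAndApply(expected uint64, mutate func(*CartState)) error {
	if s.Version != expected {
		return ErrVersionConflict
	}
	mutate(s)
	s.Version++
	return nil
}
```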

---

## 17. Tracking

Mark any completed tasks with `[x]`:

- [ ] Coverage test
- [ ] Decode helper + 400 mapping
- [ ] Proto cleanup
- [ ] Registry metrics instrumentation
- [ ] Ownership multi-node test
- [ ] Lease versioning
- [ ] Delivery pricing abstraction
- [ ] TLS/mTLS internal
- [ ] BatchMutate design doc

---

_Last updated: roadmap draft – refine after first metrics & scaling test run._