Co-authored-by: matst80 <mats.tornberg@gmail.com> Reviewed-on: https://git.tornberg.me/mats/go-cart-actor/pulls/4 Co-authored-by: Mats Törnberg <mats@tornberg.me> Co-committed-by: Mats Törnberg <mats@tornberg.me>
245 lines
10 KiB
Markdown
245 lines
10 KiB
Markdown
# TODO / Roadmap
|
||
|
||
A living roadmap for improving the cart actor system. Focus areas:
|
||
1. Reliability & correctness
|
||
2. Simplicity of mutation & ownership flows
|
||
3. Developer experience (DX)
|
||
4. Operability (observability, tracing, metrics)
|
||
5. Performance & scalability
|
||
6. Security & multi-tenant readiness
|
||
|
||
---
|
||
|
||
## 1. Immediate Next Steps (High-Leverage)
|
||
|
||
| Priority | Task | Goal | Effort | Owner | Notes |
|
||
|----------|------|------|--------|-------|-------|
|
||
| P0 | Add mutation registry coverage test | Ensure no unregistered mutations silently fail | S | | Failing fast in CI |
|
||
| P0 | Add decodeJSON helper + 400 mapping for EOF | Reduce noisy 500 logs | S | | Improves client API clarity |
|
||
| P0 | Regenerate protos & prune unused messages (CreateCheckoutOrder, Checkout RPC remnants) | Eliminate dead types | S | | Avoid confusion |
|
||
| P0 | Add integration test: multi-node ownership negotiation | Validate quorum logic | M | | Spin up 2–3 nodes ephemeral |
|
||
| P1 | Export Prometheus metrics for per-mutation counts & latency | Operability | M | | Wrap registry handlers |
|
||
| P1 | Add graceful shutdown ordering (Closing → wait for acks → stop gRPC) | Reduce in-flight mutation failures | S | | Add context cancellation |
|
||
| P1 | Add coverage for InitializeCheckout / OrderCreated flows | Checkout reliability | S | | Simulate Klarna stub |
|
||
| P2 | Add optional batching client (apply multiple mutations locally then persist) | Performance | M | | Only if needed |
|
||
|
||
---
|
||
|
||
## 2. Simplification Opportunities
|
||
|
||
### A. RemoteGrain Proxy Mapping
|
||
Current: manual switch building each RPC call.
|
||
Simplify by:
|
||
- Generating a thin client adapter from proto RPC descriptors (codegen).
|
||
- Or using a registry similar to mutation registry but for “outbound call constructors”.
|
||
Benefit: adding a new mutation = add proto + register server handler + register outbound invoker (no switch edits).
|
||
|
||
### B. Ownership Negotiation
|
||
Current: ad hoc quorum rule in `SyncedPool`.
|
||
Simplify:
|
||
- Introduce explicit `OwnershipLease{holder, expiresAt, version}`.
|
||
- Use monotonic version increment—reject stale ConfirmOwner replies.
|
||
- Optional: add randomized backoff to reduce thundering herd on contested cart ids.
|
||
|
||
### C. CartId Handling
|
||
Current: ephemeral 16-byte array with trimmed string semantics.
|
||
Simplify:
|
||
- Use ULID / UUIDv7 (time-ordered, collision-resistant) for easier external correlation.
|
||
- Provide helper `NewCartIdString()` and keep internal fixed-size if still desired.
|
||
|
||
### D. Mutation Signatures
|
||
Current: registry assumes `func(*CartGrain, *T) error`.
|
||
Extension option: allow pure transforms returning a delta struct (for audit/logging):
|
||
```
|
||
type MutationResult struct {
|
||
Changed bool
|
||
Events []interface{}
|
||
}
|
||
```
|
||
Only implement if auditing/event-sourcing reintroduced.
|
||
|
||
---
|
||
|
||
## 3. Developer Experience Improvements
|
||
|
||
| Task | Rationale | Approach |
|
||
|------|-----------|----------|
|
||
| Makefile targets: `make run-single`, `make run-multi N=3` | Faster local cluster spin-up | Docker compose or background “mini cluster” scripts |
|
||
| Template for new mutation (generator) | Reduce boilerplate | `go:generate` scanning proto for new RPCs |
|
||
| Lint config (golangci-lint) | Catch subtle issues early | Add `.golangci.yml` |
|
||
| Pre-commit hook for proto regeneration check | Avoid stale generated code | Script compares git diff after `make protogen` |
|
||
| Example client (Go + curl snippets auto-generated) | Onboarding | Codegen a markdown from proto comments |
|
||
|
||
---
|
||
|
||
## 4. Observability / Metrics / Tracing
|
||
|
||
| Area | Metric / Trace | Notes |
|
||
|------|----------------|-------|
|
||
| Mutation registry | `cart_mutations_total{type,success}`; duration histogram | Wrap handler |
|
||
| Ownership negotiation | `cart_ownership_attempts_total{result}` | result=accepted,rejected,timeout |
|
||
| Remote latency | `cart_remote_mutation_seconds{method}` | Use client interceptors |
|
||
| Pings | `cart_remote_missed_pings_total{host}` | Already count, expose |
|
||
| Checkout flow | `checkout_attempts_total`, `checkout_failures_total` | Differentiate Klarna vs internal errors |
|
||
| Tracing | Span: HTTP handler → SyncedPool.Apply → (Remote?) gRPC → mutation handler | Add OpenTelemetry instrumentation |
|
||
|
||
---
|
||
|
||
## 5. Performance & Scalability
|
||
|
||
| Concern | Idea | Trade-Off |
|
||
|---------|------|-----------|
|
||
| High mutation rate on single cart | Introduce optional mutation queue (serialize explicitly) | Slight latency increase per op |
|
||
| Remote call overhead | Add client-side gRPC pooling & per-host circuit breaker | Complexity vs resilience |
|
||
| TTL purge efficiency | Use min-heap or timing wheel instead of slice scan | More code, better big-N performance |
|
||
| Batch network latency | Add `BatchMutate` RPC (list of mutations applied atomically) | Lost single-op simplicity |
|
||
|
||
---
|
||
|
||
## 6. Reliability Features
|
||
|
||
| Feature | Description | Priority |
|
||
|---------|-------------|----------|
|
||
| Lease fencing token | Include `ownership_version` in all remote mutate requests | M |
|
||
| Retry policy | Limited retry for transient network errors (idempotent mutations only) | L |
|
||
| Dead host reconciliation | On host removal, proactively attempt re-acquire of its carts | M |
|
||
| Drain mode | Node marks itself “draining” → refuses new ownership claims | M |
|
||
|
||
---
|
||
|
||
## 7. Security & Hardening
|
||
|
||
| Area | Next Step | Detail |
|
||
|------|-----------|--------|
|
||
| Transport | mTLS on gRPC | Use SPIFFE IDs or simple CA |
|
||
| AuthN/AuthZ | Interceptor enforcing service token | Inject metadata header |
|
||
| Input validation | Strengthen JSON decode responses | Disallow unknown fields globally |
|
||
| Rate limiting | Per-IP / per-cart throttling | Guard hotspot abuse |
|
||
| Multi-tenancy | Tenant id dimension in cart id or metadata | Partition metrics & ownership |
|
||
|
||
---
|
||
|
||
## 8. Testing Strategy Enhancements
|
||
|
||
| Gap | Improvement |
|
||
|-----|------------|
|
||
| No multi-node integration test in CI | Spin ephemeral in-process servers on randomized ports |
|
||
| Mutation regression | Table-driven tests auto-discover handlers via registry |
|
||
| Ownership race | Stress test: concurrent Apply on same new cart id from N goroutines |
|
||
| Checkout external dependency | Klarna mock server (HTTptest) + deterministic responses |
|
||
| Fuzzing | Fuzz `BuildCheckoutOrderPayload` & mutation handlers for panics |
|
||
|
||
---
|
||
|
||
## 9. Cleanup / Tech Debt
|
||
|
||
| Item | Action |
|
||
|------|--------|
|
||
| Remove deprecated proto remnants (CreateCheckoutOrder, Checkout RPC) | Delete & regenerate |
|
||
| Consolidate duplicate tax computations | Single helper with tax config |
|
||
| Delivery price hard-coded (4900) | Config or pricing strategy interface |
|
||
| Mixed naming (camel vs snake JSON historically) | Provide stable external API doc; accept old forms if needed |
|
||
| Manual remote mutation switch (if still present) | Replace with generated outbound registry |
|
||
| Mixed error responses (string bodies) | Standardize JSON: `{ "error": "...", "code": 400 }` |
|
||
|
||
---
|
||
|
||
## 10. Potential Future Features
|
||
|
||
| Feature | Value | Complexity |
|
||
|---------|-------|------------|
|
||
| Streaming `WatchState` RPC | Real-time cart updates for clients | Medium |
|
||
| Event sourcing / audit log | Replay, analytics, debugging | High |
|
||
| Promotion / coupon engine plugin | Business extensibility | Medium |
|
||
| Partial cart reservation / inventory lock | Stock accuracy under concurrency | High |
|
||
| Multi-currency pricing | Globalization | Medium |
|
||
| GraphQL facade | Client flexibility | Medium |
|
||
|
||
---
|
||
|
||
## 11. Suggested Prioritized Backlog (Condensed)
|
||
|
||
1. Coverage test + decode error mapping (P0)
|
||
2. Proto regeneration & cleanup (P0)
|
||
3. Metrics wrapper for registry (P1)
|
||
4. Multi-node ownership integration test (P1)
|
||
5. Delivery pricing abstraction (P2)
|
||
6. Lease version in remote RPCs (P2)
|
||
7. BatchMutate evaluation (P3)
|
||
8. TLS / auth hardening (P3) if going multi-tenant/public
|
||
9. Event sourcing (Evaluate after stability) (P4)
|
||
|
||
---
|
||
|
||
## 12. Simplifying the Developer Workflow
|
||
|
||
| Pain | Simplifier |
|
||
|------|------------|
|
||
| Manual mutation boilerplate | Code generator for registry stubs |
|
||
| Forgetting totals | Enforce WithTotals lint: fail if mutation touches items/deliveries without flag |
|
||
| Hard to inspect remote ownership | `/internal/ownership` debug endpoint (JSON of local + remoteIndex) |
|
||
| Hard to see mutation timings | Add `?debug=latency` header to return per-mutation durations |
|
||
| Cookie dev confusion (Secure flag) | Env var: `DEV_INSECURE_COOKIES=1` |
|
||
|
||
---
|
||
|
||
## 13. Example: Mutation Codegen Sketch (Future)
|
||
|
||
Input: cart_actor.proto
|
||
Output: `mutation_auto.go`
|
||
- Detect messages used in RPC wrappers (e.g., `AddItemRequest` → payload field).
|
||
- Generate `RegisterMutation` template if handler not found.
|
||
- Mark with `// TODO implement logic`.
|
||
|
||
---
|
||
|
||
## 14. Risk / Impact Matrix (Abbreviated)
|
||
|
||
| Change | Risk | Mitigation |
|
||
|--------|------|-----------|
|
||
| Replace remote switch with registry | Possible missing registration → runtime error | Coverage test gating CI |
|
||
| Lease introduction | Split-brain if version mishandled | Increment + assert monotonic; test race |
|
||
| BatchMutate | Large atomic operations starving others | Size limits & fair scheduling |
|
||
| Event sourcing | Storage + replay complexity | Start with append-only log + compaction job |
|
||
|
||
---
|
||
|
||
## 15. Contributing Workflow (Proposed)
|
||
|
||
1. Add / modify proto → run `make protogen`
|
||
2. Implement mutation logic → add `RegisterMutation` invocation
|
||
3. Add/Update tests (unit + integration)
|
||
4. Run `make verify` (lint, test, coverage, proto diff)
|
||
5. Open PR (template auto-checklist referencing this TODO)
|
||
6. Merge requires green CI + coverage threshold
|
||
|
||
---
|
||
|
||
## 16. Open Questions
|
||
|
||
| Question | Notes |
|
||
|----------|-------|
|
||
| Do we need sticky sessions for HTTP layer scaling? | Currently cart id routing suffices |
|
||
| Should deliveries prune invalid line references on SetCartRequest? | Inconsistency risk; add optional cleanup |
|
||
| Is checkout idempotency strict enough? | Multiple create vs update semantics |
|
||
| Add version field to CartState for optimistic concurrency? | Could enable external CAS writes |
|
||
|
||
---
|
||
|
||
## 17. Tracking
|
||
|
||
Mark any completed tasks with `[x]`:
|
||
|
||
- [ ] Coverage test
|
||
- [ ] Decode helper + 400 mapping
|
||
- [ ] Proto cleanup
|
||
- [ ] Registry metrics instrumentation
|
||
- [ ] Ownership multi-node test
|
||
- [ ] Lease versioning
|
||
- [ ] Delivery pricing abstraction
|
||
- [ ] TLS/mTLS internal
|
||
- [ ] BatchMutate design doc
|
||
|
||
---
|
||
|
||
_Last updated: roadmap draft – refine after first metrics & scaling test run._ |