Files
go-cart-actor/TODO.md
Mats Törnberg f5014fe906
All checks were successful
Build and Publish / Metadata (push) Successful in 11s
Build and Publish / BuildAndDeployAmd64 (push) Successful in 1m14s
Build and Publish / BuildAndDeployArm64 (push) Successful in 3m54s
Complete refactor to new grpc control plane and only http proxy for carts (#4)
Co-authored-by: matst80 <mats.tornberg@gmail.com>
Reviewed-on: https://git.tornberg.me/mats/go-cart-actor/pulls/4
Co-authored-by: Mats Törnberg <mats@tornberg.me>
Co-committed-by: Mats Törnberg <mats@tornberg.me>
2025-10-14 22:31:12 +02:00

245 lines
10 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# TODO / Roadmap
A living roadmap for improving the cart actor system. Focus areas:
1. Reliability & correctness
2. Simplicity of mutation & ownership flows
3. Developer experience (DX)
4. Operability (observability, tracing, metrics)
5. Performance & scalability
6. Security & multi-tenant readiness
---
## 1. Immediate Next Steps (High-Leverage)
| Priority | Task | Goal | Effort | Owner | Notes |
|----------|------|------|--------|-------|-------|
| P0 | Add mutation registry coverage test | Ensure no unregistered mutations silently fail | S | | Failing fast in CI |
| P0 | Add decodeJSON helper + 400 mapping for EOF | Reduce noisy 500 logs | S | | Improves client API clarity |
| P0 | Regenerate protos & prune unused messages (CreateCheckoutOrder, Checkout RPC remnants) | Eliminate dead types | S | | Avoid confusion |
| P0 | Add integration test: multi-node ownership negotiation | Validate quorum logic | M | | Spin up 23 nodes ephemeral |
| P1 | Export Prometheus metrics for per-mutation counts & latency | Operability | M | | Wrap registry handlers |
| P1 | Add graceful shutdown ordering (Closing → wait for acks → stop gRPC) | Reduce in-flight mutation failures | S | | Add context cancellation |
| P1 | Add coverage for InitializeCheckout / OrderCreated flows | Checkout reliability | S | | Simulate Klarna stub |
| P2 | Add optional batching client (apply multiple mutations locally then persist) | Performance | M | | Only if needed |
---
## 2. Simplification Opportunities
### A. RemoteGrain Proxy Mapping
Current: manual switch building each RPC call.
Simplify by:
- Generating a thin client adapter from proto RPC descriptors (codegen).
- Or using a registry similar to mutation registry but for “outbound call constructors”.
Benefit: adding a new mutation = add proto + register server handler + register outbound invoker (no switch edits).
### B. Ownership Negotiation
Current: ad hoc quorum rule in `SyncedPool`.
Simplify:
- Introduce explicit `OwnershipLease{holder, expiresAt, version}`.
- Use monotonic version increment—reject stale ConfirmOwner replies.
- Optional: add randomized backoff to reduce thundering herd on contested cart ids.
### C. CartId Handling
Current: ephemeral 16-byte array with trimmed string semantics.
Simplify:
- Use ULID / UUIDv7 (time-ordered, collision-resistant) for easier external correlation.
- Provide helper `NewCartIdString()` and keep internal fixed-size if still desired.
### D. Mutation Signatures
Current: registry assumes `func(*CartGrain, *T) error`.
Extension option: allow pure transforms returning a delta struct (for audit/logging):
```
type MutationResult struct {
Changed bool
Events []interface{}
}
```
Only implement if auditing/event-sourcing reintroduced.
---
## 3. Developer Experience Improvements
| Task | Rationale | Approach |
|------|-----------|----------|
| Makefile targets: `make run-single`, `make run-multi N=3` | Faster local cluster spin-up | Docker compose or background “mini cluster” scripts |
| Template for new mutation (generator) | Reduce boilerplate | `go:generate` scanning proto for new RPCs |
| Lint config (golangci-lint) | Catch subtle issues early | Add `.golangci.yml` |
| Pre-commit hook for proto regeneration check | Avoid stale generated code | Script compares git diff after `make protogen` |
| Example client (Go + curl snippets auto-generated) | Onboarding | Codegen a markdown from proto comments |
---
## 4. Observability / Metrics / Tracing
| Area | Metric / Trace | Notes |
|------|----------------|-------|
| Mutation registry | `cart_mutations_total{type,success}`; duration histogram | Wrap handler |
| Ownership negotiation | `cart_ownership_attempts_total{result}` | result=accepted,rejected,timeout |
| Remote latency | `cart_remote_mutation_seconds{method}` | Use client interceptors |
| Pings | `cart_remote_missed_pings_total{host}` | Already count, expose |
| Checkout flow | `checkout_attempts_total`, `checkout_failures_total` | Differentiate Klarna vs internal errors |
| Tracing | Span: HTTP handler → SyncedPool.Apply → (Remote?) gRPC → mutation handler | Add OpenTelemetry instrumentation |
---
## 5. Performance & Scalability
| Concern | Idea | Trade-Off |
|---------|------|-----------|
| High mutation rate on single cart | Introduce optional mutation queue (serialize explicitly) | Slight latency increase per op |
| Remote call overhead | Add client-side gRPC pooling & per-host circuit breaker | Complexity vs resilience |
| TTL purge efficiency | Use min-heap or timing wheel instead of slice scan | More code, better big-N performance |
| Batch network latency | Add `BatchMutate` RPC (list of mutations applied atomically) | Lost single-op simplicity |
---
## 6. Reliability Features
| Feature | Description | Priority |
|---------|-------------|----------|
| Lease fencing token | Include `ownership_version` in all remote mutate requests | M |
| Retry policy | Limited retry for transient network errors (idempotent mutations only) | L |
| Dead host reconciliation | On host removal, proactively attempt re-acquire of its carts | M |
| Drain mode | Node marks itself “draining” → refuses new ownership claims | M |
---
## 7. Security & Hardening
| Area | Next Step | Detail |
|------|-----------|--------|
| Transport | mTLS on gRPC | Use SPIFFE IDs or simple CA |
| AuthN/AuthZ | Interceptor enforcing service token | Inject metadata header |
| Input validation | Strengthen JSON decode responses | Disallow unknown fields globally |
| Rate limiting | Per-IP / per-cart throttling | Guard hotspot abuse |
| Multi-tenancy | Tenant id dimension in cart id or metadata | Partition metrics & ownership |
---
## 8. Testing Strategy Enhancements
| Gap | Improvement |
|-----|------------|
| No multi-node integration test in CI | Spin ephemeral in-process servers on randomized ports |
| Mutation regression | Table-driven tests auto-discover handlers via registry |
| Ownership race | Stress test: concurrent Apply on same new cart id from N goroutines |
| Checkout external dependency | Klarna mock server (HTTptest) + deterministic responses |
| Fuzzing | Fuzz `BuildCheckoutOrderPayload` & mutation handlers for panics |
---
## 9. Cleanup / Tech Debt
| Item | Action |
|------|--------|
| Remove deprecated proto remnants (CreateCheckoutOrder, Checkout RPC) | Delete & regenerate |
| Consolidate duplicate tax computations | Single helper with tax config |
| Delivery price hard-coded (4900) | Config or pricing strategy interface |
| Mixed naming (camel vs snake JSON historically) | Provide stable external API doc; accept old forms if needed |
| Manual remote mutation switch (if still present) | Replace with generated outbound registry |
| Mixed error responses (string bodies) | Standardize JSON: `{ "error": "...", "code": 400 }` |
---
## 10. Potential Future Features
| Feature | Value | Complexity |
|---------|-------|------------|
| Streaming `WatchState` RPC | Real-time cart updates for clients | Medium |
| Event sourcing / audit log | Replay, analytics, debugging | High |
| Promotion / coupon engine plugin | Business extensibility | Medium |
| Partial cart reservation / inventory lock | Stock accuracy under concurrency | High |
| Multi-currency pricing | Globalization | Medium |
| GraphQL facade | Client flexibility | Medium |
---
## 11. Suggested Prioritized Backlog (Condensed)
1. Coverage test + decode error mapping (P0)
2. Proto regeneration & cleanup (P0)
3. Metrics wrapper for registry (P1)
4. Multi-node ownership integration test (P1)
5. Delivery pricing abstraction (P2)
6. Lease version in remote RPCs (P2)
7. BatchMutate evaluation (P3)
8. TLS / auth hardening (P3) if going multi-tenant/public
9. Event sourcing (Evaluate after stability) (P4)
---
## 12. Simplifying the Developer Workflow
| Pain | Simplifier |
|------|------------|
| Manual mutation boilerplate | Code generator for registry stubs |
| Forgetting totals | Enforce WithTotals lint: fail if mutation touches items/deliveries without flag |
| Hard to inspect remote ownership | `/internal/ownership` debug endpoint (JSON of local + remoteIndex) |
| Hard to see mutation timings | Add `?debug=latency` header to return per-mutation durations |
| Cookie dev confusion (Secure flag) | Env var: `DEV_INSECURE_COOKIES=1` |
---
## 13. Example: Mutation Codegen Sketch (Future)
Input: cart_actor.proto
Output: `mutation_auto.go`
- Detect messages used in RPC wrappers (e.g., `AddItemRequest` → payload field).
- Generate `RegisterMutation` template if handler not found.
- Mark with `// TODO implement logic`.
---
## 14. Risk / Impact Matrix (Abbreviated)
| Change | Risk | Mitigation |
|--------|------|-----------|
| Replace remote switch with registry | Possible missing registration → runtime error | Coverage test gating CI |
| Lease introduction | Split-brain if version mishandled | Increment + assert monotonic; test race |
| BatchMutate | Large atomic operations starving others | Size limits & fair scheduling |
| Event sourcing | Storage + replay complexity | Start with append-only log + compaction job |
---
## 15. Contributing Workflow (Proposed)
1. Add / modify proto → run `make protogen`
2. Implement mutation logic → add `RegisterMutation` invocation
3. Add/Update tests (unit + integration)
4. Run `make verify` (lint, test, coverage, proto diff)
5. Open PR (template auto-checklist referencing this TODO)
6. Merge requires green CI + coverage threshold
---
## 16. Open Questions
| Question | Notes |
|----------|-------|
| Do we need sticky sessions for HTTP layer scaling? | Currently cart id routing suffices |
| Should deliveries prune invalid line references on SetCartRequest? | Inconsistency risk; add optional cleanup |
| Is checkout idempotency strict enough? | Multiple create vs update semantics |
| Add version field to CartState for optimistic concurrency? | Could enable external CAS writes |
---
## 17. Tracking
Mark any completed tasks with `[x]`:
- [ ] Coverage test
- [ ] Decode helper + 400 mapping
- [ ] Proto cleanup
- [ ] Registry metrics instrumentation
- [ ] Ownership multi-node test
- [ ] Lease versioning
- [ ] Delivery pricing abstraction
- [ ] TLS/mTLS internal
- [ ] BatchMutate design doc
---
_Last updated: roadmap draft refine after first metrics & scaling test run._