even more refactoring
Some checks failed
Build and Publish / BuildAndDeploy (push) Successful in 3m7s
Build and Publish / BuildAndDeployAmd64 (push) Has been cancelled

This commit is contained in:
matst80
2025-10-10 11:46:19 +00:00
parent 12d87036f6
commit 716f1121aa
32 changed files with 3857 additions and 953 deletions

245
TODO.md Normal file
View File

@@ -0,0 +1,245 @@
# TODO / Roadmap
A living roadmap for improving the cart actor system. Focus areas:
1. Reliability & correctness
2. Simplicity of mutation & ownership flows
3. Developer experience (DX)
4. Operability (observability, tracing, metrics)
5. Performance & scalability
6. Security & multi-tenant readiness
---
## 1. Immediate Next Steps (High-Leverage)
| Priority | Task | Goal | Effort | Owner | Notes |
|----------|------|------|--------|-------|-------|
| P0 | Add mutation registry coverage test | Ensure no unregistered mutations silently fail | S | | Failing fast in CI |
| P0 | Add decodeJSON helper + 400 mapping for EOF | Reduce noisy 500 logs | S | | Improves client API clarity |
| P0 | Regenerate protos & prune unused messages (CreateCheckoutOrder, Checkout RPC remnants) | Eliminate dead types | S | | Avoid confusion |
| P0 | Add integration test: multi-node ownership negotiation | Validate quorum logic | M | | Spin up 23 nodes ephemeral |
| P1 | Export Prometheus metrics for per-mutation counts & latency | Operability | M | | Wrap registry handlers |
| P1 | Add graceful shutdown ordering (Closing → wait for acks → stop gRPC) | Reduce in-flight mutation failures | S | | Add context cancellation |
| P1 | Add coverage for InitializeCheckout / OrderCreated flows | Checkout reliability | S | | Simulate Klarna stub |
| P2 | Add optional batching client (apply multiple mutations locally then persist) | Performance | M | | Only if needed |
---
## 2. Simplification Opportunities
### A. RemoteGrain Proxy Mapping
Current: manual switch building each RPC call.
Simplify by:
- Generating a thin client adapter from proto RPC descriptors (codegen).
- Or using a registry similar to mutation registry but for “outbound call constructors”.
Benefit: adding a new mutation = add proto + register server handler + register outbound invoker (no switch edits).
### B. Ownership Negotiation
Current: ad hoc quorum rule in `SyncedPool`.
Simplify:
- Introduce explicit `OwnershipLease{holder, expiresAt, version}`.
- Use monotonic version increment—reject stale ConfirmOwner replies.
- Optional: add randomized backoff to reduce thundering herd on contested cart ids.
### C. CartId Handling
Current: ephemeral 16-byte array with trimmed string semantics.
Simplify:
- Use ULID / UUIDv7 (time-ordered, collision-resistant) for easier external correlation.
- Provide helper `NewCartIdString()` and keep internal fixed-size if still desired.
### D. Mutation Signatures
Current: registry assumes `func(*CartGrain, *T) error`.
Extension option: allow pure transforms returning a delta struct (for audit/logging):
```
type MutationResult struct {
Changed bool
Events []interface{}
}
```
Only implement if auditing/event-sourcing reintroduced.
---
## 3. Developer Experience Improvements
| Task | Rationale | Approach |
|------|-----------|----------|
| Makefile targets: `make run-single`, `make run-multi N=3` | Faster local cluster spin-up | Docker compose or background “mini cluster” scripts |
| Template for new mutation (generator) | Reduce boilerplate | `go:generate` scanning proto for new RPCs |
| Lint config (golangci-lint) | Catch subtle issues early | Add `.golangci.yml` |
| Pre-commit hook for proto regeneration check | Avoid stale generated code | Script compares git diff after `make protogen` |
| Example client (Go + curl snippets auto-generated) | Onboarding | Codegen a markdown from proto comments |
---
## 4. Observability / Metrics / Tracing
| Area | Metric / Trace | Notes |
|------|----------------|-------|
| Mutation registry | `cart_mutations_total{type,success}`; duration histogram | Wrap handler |
| Ownership negotiation | `cart_ownership_attempts_total{result}` | result=accepted,rejected,timeout |
| Remote latency | `cart_remote_mutation_seconds{method}` | Use client interceptors |
| Pings | `cart_remote_missed_pings_total{host}` | Already count, expose |
| Checkout flow | `checkout_attempts_total`, `checkout_failures_total` | Differentiate Klarna vs internal errors |
| Tracing | Span: HTTP handler → SyncedPool.Apply → (Remote?) gRPC → mutation handler | Add OpenTelemetry instrumentation |
---
## 5. Performance & Scalability
| Concern | Idea | Trade-Off |
|---------|------|-----------|
| High mutation rate on single cart | Introduce optional mutation queue (serialize explicitly) | Slight latency increase per op |
| Remote call overhead | Add client-side gRPC pooling & per-host circuit breaker | Complexity vs resilience |
| TTL purge efficiency | Use min-heap or timing wheel instead of slice scan | More code, better big-N performance |
| Batch network latency | Add `BatchMutate` RPC (list of mutations applied atomically) | Lost single-op simplicity |
---
## 6. Reliability Features
| Feature | Description | Priority |
|---------|-------------|----------|
| Lease fencing token | Include `ownership_version` in all remote mutate requests | M |
| Retry policy | Limited retry for transient network errors (idempotent mutations only) | L |
| Dead host reconciliation | On host removal, proactively attempt re-acquire of its carts | M |
| Drain mode | Node marks itself “draining” → refuses new ownership claims | M |
---
## 7. Security & Hardening
| Area | Next Step | Detail |
|------|-----------|--------|
| Transport | mTLS on gRPC | Use SPIFFE IDs or simple CA |
| AuthN/AuthZ | Interceptor enforcing service token | Inject metadata header |
| Input validation | Strengthen JSON decode responses | Disallow unknown fields globally |
| Rate limiting | Per-IP / per-cart throttling | Guard hotspot abuse |
| Multi-tenancy | Tenant id dimension in cart id or metadata | Partition metrics & ownership |
---
## 8. Testing Strategy Enhancements
| Gap | Improvement |
|-----|------------|
| No multi-node integration test in CI | Spin ephemeral in-process servers on randomized ports |
| Mutation regression | Table-driven tests auto-discover handlers via registry |
| Ownership race | Stress test: concurrent Apply on same new cart id from N goroutines |
| Checkout external dependency | Klarna mock server (HTTptest) + deterministic responses |
| Fuzzing | Fuzz `BuildCheckoutOrderPayload` & mutation handlers for panics |
---
## 9. Cleanup / Tech Debt
| Item | Action |
|------|--------|
| Remove deprecated proto remnants (CreateCheckoutOrder, Checkout RPC) | Delete & regenerate |
| Consolidate duplicate tax computations | Single helper with tax config |
| Delivery price hard-coded (4900) | Config or pricing strategy interface |
| Mixed naming (camel vs snake JSON historically) | Provide stable external API doc; accept old forms if needed |
| Manual remote mutation switch (if still present) | Replace with generated outbound registry |
| Mixed error responses (string bodies) | Standardize JSON: `{ "error": "...", "code": 400 }` |
---
## 10. Potential Future Features
| Feature | Value | Complexity |
|---------|-------|------------|
| Streaming `WatchState` RPC | Real-time cart updates for clients | Medium |
| Event sourcing / audit log | Replay, analytics, debugging | High |
| Promotion / coupon engine plugin | Business extensibility | Medium |
| Partial cart reservation / inventory lock | Stock accuracy under concurrency | High |
| Multi-currency pricing | Globalization | Medium |
| GraphQL facade | Client flexibility | Medium |
---
## 11. Suggested Prioritized Backlog (Condensed)
1. Coverage test + decode error mapping (P0)
2. Proto regeneration & cleanup (P0)
3. Metrics wrapper for registry (P1)
4. Multi-node ownership integration test (P1)
5. Delivery pricing abstraction (P2)
6. Lease version in remote RPCs (P2)
7. BatchMutate evaluation (P3)
8. TLS / auth hardening (P3) if going multi-tenant/public
9. Event sourcing (Evaluate after stability) (P4)
---
## 12. Simplifying the Developer Workflow
| Pain | Simplifier |
|------|------------|
| Manual mutation boilerplate | Code generator for registry stubs |
| Forgetting totals | Enforce WithTotals lint: fail if mutation touches items/deliveries without flag |
| Hard to inspect remote ownership | `/internal/ownership` debug endpoint (JSON of local + remoteIndex) |
| Hard to see mutation timings | Add `?debug=latency` header to return per-mutation durations |
| Cookie dev confusion (Secure flag) | Env var: `DEV_INSECURE_COOKIES=1` |
---
## 13. Example: Mutation Codegen Sketch (Future)
Input: cart_actor.proto
Output: `mutation_auto.go`
- Detect messages used in RPC wrappers (e.g., `AddItemRequest` → payload field).
- Generate `RegisterMutation` template if handler not found.
- Mark with `// TODO implement logic`.
---
## 14. Risk / Impact Matrix (Abbreviated)
| Change | Risk | Mitigation |
|--------|------|-----------|
| Replace remote switch with registry | Possible missing registration → runtime error | Coverage test gating CI |
| Lease introduction | Split-brain if version mishandled | Increment + assert monotonic; test race |
| BatchMutate | Large atomic operations starving others | Size limits & fair scheduling |
| Event sourcing | Storage + replay complexity | Start with append-only log + compaction job |
---
## 15. Contributing Workflow (Proposed)
1. Add / modify proto → run `make protogen`
2. Implement mutation logic → add `RegisterMutation` invocation
3. Add/Update tests (unit + integration)
4. Run `make verify` (lint, test, coverage, proto diff)
5. Open PR (template auto-checklist referencing this TODO)
6. Merge requires green CI + coverage threshold
---
## 16. Open Questions
| Question | Notes |
|----------|-------|
| Do we need sticky sessions for HTTP layer scaling? | Currently cart id routing suffices |
| Should deliveries prune invalid line references on SetCartRequest? | Inconsistency risk; add optional cleanup |
| Is checkout idempotency strict enough? | Multiple create vs update semantics |
| Add version field to CartState for optimistic concurrency? | Could enable external CAS writes |
---
## 17. Tracking
Mark any completed tasks with `[x]`:
- [ ] Coverage test
- [ ] Decode helper + 400 mapping
- [ ] Proto cleanup
- [ ] Registry metrics instrumentation
- [ ] Ownership multi-node test
- [ ] Lease versioning
- [ ] Delivery pricing abstraction
- [ ] TLS/mTLS internal
- [ ] BatchMutate design doc
---
_Last updated: roadmap draft refine after first metrics & scaling test run._