# gRPC Migration Plan

File: GRPC-MIGRATION-PLAN.md
Author: (Generated plan)
Status: Draft for review
Target Release: Next major version (breaking change – no mixed compatibility)

---

## 1. Overview

This document describes the full migration of the current custom TCP frame-based protocol (both the cart mutation/state channel on port `1337` and the control plane on port `1338`) to gRPC. We will remove all legacy packet framing (`FrameWithPayload`, `RemoteGrain`, `GenericListener` handlers for these two ports) and replace them with two gRPC services:

1. Cart Actor Service (mutations + state retrieval)
2. Control Plane Service (cluster membership, negotiation, ownership change, lifecycle)

We intentionally keep:
- Internal `CartGrain` logic, message storage format, disk persistence, and JSON cart serialization.
- Existing message type numeric mapping for backward compatibility with persisted event logs.
- HTTP/REST API layer unchanged (it still consumes JSON state from the local/remote grain pipeline).

We do NOT implement mixed-version compatibility; migration occurs atomically (cluster restart with new image).

---

## 2. Goals

- Remove custom binary frame protocol & simplify maintenance.
- Provide clearer, strongly defined interfaces via `.proto` schemas.
- Improve observability via gRPC interceptors (metrics & tracing hooks).
- Reduce per-call overhead compared with the current manual connection pooling + handwritten framing (HTTP/2 multiplexing + connection reuse).
- Prepare groundwork for future enhancements (streaming, typed state, event streaming) without rewriting again.

---

## 3. Non-Goals (Phase 1)

- Converting the cart state payload from JSON to a strongly typed proto.
- Introducing authentication / mTLS (may be added later).
- Changing persistence or replay format.
- Changing the HTTP API contract.
- Implementing streaming watchers or push updates.

---

## 4. Architecture After Migration

Ports:
- `:1337` → gRPC CartActor service.
- `:1338` → gRPC ControlPlane service.

Each node:
- Runs both gRPC services. They can share a single listener or use two separate listeners; we will keep two ports initially to minimize operational surprise, but they could be merged later.
- Maintains a pool of `*grpc.ClientConn` objects keyed by remote hostname. A `ClientConn` dials a single target, so while the services stay on separate ports this means one connection per remote host and port; once the ports are merged, one connection per host can be reused for both services.

Call Flow (Mutation):
1. HTTP request hits `PoolServer`.
2. `SyncedPool.getGrain(cartId)`:
   - Local: direct invocation.
   - Remote: uses `RemoteGrainGRPC` (new) which invokes `CartActor.Mutate`.
3. Response JSON returned unchanged.

Control Plane Flow:
- Discovery (K8s watch) still triggers `AddRemote(host)`.
- Instead of sending custom `Ping`, `Negotiate`, etc. via frames, call gRPC methods on the `ControlPlane` service.
- Ownership changes use the `ConfirmOwner` RPC.

---

## 5. Proto Design

### 5.1 Cart Actor Proto (Envelope Pattern)

We keep an envelope with `bytes payload` holding the serialized underlying cart mutation proto (existing types in `messages.proto`). This minimizes churn.
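To make the mutation call flow and the envelope idea concrete, here is a minimal, stdlib-only Go sketch. Everything in it is an assumption made for illustration: the reduced `Grain` interface, the `pool`, `localGrain`, `remoteGrain`, `addItem`, and `newAddItem` names are hypothetical, JSON stands in for the protobuf-serialized payload, and the `call` field stands in for the generated `CartActor.Mutate` client method.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// MutationType reuses the legacy numeric codes (only one is shown here).
type MutationType int32

const MutationAddItem MutationType = 2

// MutationRequest is a hand-written stand-in for the generated envelope type.
type MutationRequest struct {
	CartID  string
	Type    MutationType
	Payload []byte // serialized mutation, opaque to the transport
}

// addItem is a hypothetical mutation body; JSON stands in for protobuf here.
type addItem struct {
	Sku string `json:"sku"`
	Qty int    `json:"qty"`
}

// Grain is an assumed, reduced form of the interface shared by local and remote grains.
type Grain interface {
	HandleMessage(req MutationRequest) ([]byte, error)
}

// localGrain applies mutations in-process and returns the JSON cart state.
type localGrain struct{ items map[string]int }

func (g *localGrain) HandleMessage(req MutationRequest) ([]byte, error) {
	if req.Type != MutationAddItem {
		return nil, errors.New("unsupported mutation type")
	}
	var m addItem
	if err := json.Unmarshal(req.Payload, &m); err != nil {
		return nil, err
	}
	g.items[m.Sku] += m.Qty
	return json.Marshal(g.items)
}

// remoteGrain forwards the envelope unchanged; call stands in for CartActor.Mutate.
type remoteGrain struct {
	call func(MutationRequest) ([]byte, error)
}

func (g *remoteGrain) HandleMessage(req MutationRequest) ([]byte, error) {
	return g.call(req)
}

// pool maps cart ids to their local or remote grain.
type pool struct {
	local  map[string]*localGrain
	remote map[string]*remoteGrain
}

// getGrain prefers a local grain and falls back to the remote owner.
func (p *pool) getGrain(cartID string) (Grain, error) {
	if g, ok := p.local[cartID]; ok {
		return g, nil
	}
	if g, ok := p.remote[cartID]; ok {
		return g, nil
	}
	return nil, errors.New("no owner known for cart " + cartID)
}

// newAddItem wraps an addItem mutation in the envelope.
func newAddItem(cartID, sku string, qty int) MutationRequest {
	payload, _ := json.Marshal(addItem{Sku: sku, Qty: qty}) // cannot fail for this struct
	return MutationRequest{CartID: cartID, Type: MutationAddItem, Payload: payload}
}

func main() {
	// The owning node holds the grain; this node only knows a remote handle.
	owner := &localGrain{items: map[string]int{}}
	p := &pool{
		local:  map[string]*localGrain{},
		remote: map[string]*remoteGrain{"cart-1": {call: owner.HandleMessage}},
	}

	g, err := p.getGrain("cart-1")
	if err != nil {
		panic(err)
	}
	state, err := g.HandleMessage(newAddItem("cart-1", "sku-1", 2))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(state)) // prints {"sku-1":2}
}
```

The caller never inspects the payload: it only picks a grain and passes the envelope along, which is why swapping the transport from frames to gRPC should not need to touch the mutation types themselves.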
Indented code block (proto sketch):

    syntax = "proto3";

    package cart;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    enum MutationType {
      MUTATION_TYPE_UNSPECIFIED = 0;
      MUTATION_ADD_REQUEST = 1;
      MUTATION_ADD_ITEM = 2;
      MUTATION_REMOVE_ITEM = 4;
      MUTATION_REMOVE_DELIVERY = 5;
      MUTATION_CHANGE_QUANTITY = 6;
      MUTATION_SET_DELIVERY = 7;
      MUTATION_SET_PICKUP_POINT = 8;
      MUTATION_CREATE_CHECKOUT_ORDER = 9;
      MUTATION_SET_CART_ITEMS = 10;
      MUTATION_ORDER_COMPLETED = 11;
    }

    message MutationRequest {
      string cart_id = 1;
      MutationType type = 2;
      bytes payload = 3;          // Serialized specific mutation proto
      int64 client_timestamp = 4; // Optional; server fills if zero
    }

    message MutationReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state or error string
    }

    message StateRequest {
      string cart_id = 1;
    }

    message StateReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state
    }

    service CartActor {
      rpc Mutate(MutationRequest) returns (MutationReply);
      rpc GetState(StateRequest) returns (StateReply);
    }

### 5.2 Control Plane Proto

    syntax = "proto3";

    package control;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    message Empty {}

    message PingReply {
      string host = 1;
      int64 unix_time = 2;
    }

    message NegotiateRequest {
      repeated string known_hosts = 1;
    }

    message NegotiateReply {
      repeated string hosts = 1; // Healthy hosts returned
    }

    message CartIdsReply {
      repeated string cart_ids = 1;
    }

    message OwnerChangeRequest {
      string cart_id = 1;
      string new_host = 2;
    }

    message OwnerChangeAck {
      bool accepted = 1;
      string message = 2;
    }

    message ClosingNotice {
      string host = 1;
    }

    service ControlPlane {
      rpc Ping(Empty) returns (PingReply);
      rpc Negotiate(NegotiateRequest) returns (NegotiateReply);
      rpc GetCartIds(Empty) returns (CartIdsReply);
      rpc ConfirmOwner(OwnerChangeRequest) returns (OwnerChangeAck);
      rpc Closing(ClosingNotice) returns (OwnerChangeAck);
    }

---

## 6. Message Type Mapping

| Legacy Constant         | Numeric | New Enum Value                 |
|-------------------------|---------|--------------------------------|
| AddRequestType          | 1       | MUTATION_ADD_REQUEST           |
| AddItemType             | 2       | MUTATION_ADD_ITEM              |
| RemoveItemType          | 4       | MUTATION_REMOVE_ITEM           |
| RemoveDeliveryType      | 5       | MUTATION_REMOVE_DELIVERY       |
| ChangeQuantityType      | 6       | MUTATION_CHANGE_QUANTITY       |
| SetDeliveryType         | 7       | MUTATION_SET_DELIVERY          |
| SetPickupPointType      | 8       | MUTATION_SET_PICKUP_POINT      |
| CreateCheckoutOrderType | 9       | MUTATION_CREATE_CHECKOUT_ORDER |
| SetCartItemsType        | 10      | MUTATION_SET_CART_ITEMS        |
| OrderCompletedType      | 11      | MUTATION_ORDER_COMPLETED       |

Persisted events keep original numeric codes; reconstruction simply casts to `MutationType`.

---

## 7. Components To Remove / Replace

Remove (after migration complete):
- `remote-grain.go`
- `rpc-server.go`
- Any packet/frame-specific types solely used by the above (search: `FrameWithPayload`, `RemoteHandleMutation`, `RemoteGetState` where not reused by disk or internal logic).
- The constants representing network frame types in `synced-pool.go` (RemoteNegotiate, AckChange, etc.) replaced by gRPC calls.
- netpool usage for remote cart channel (control plane also no longer needs `Connection` abstraction).

Retain (until reworked or optionally cleaned later):
- `message.go` (for persistence)
- `message-handler.go`
- `cart-grain.go`
- `messages.proto` (underlying mutation messages)
- HTTP API server and REST handlers.

---

## 8. New / Modified Components

New files (planned):
- `proto/cart_actor.proto`
- `proto/control_plane.proto`
- `grpc/cart_actor_server.go` (server impl)
- `grpc/cart_actor_client.go` (client adapter implementing `Grain`)
- `grpc/control_plane_server.go`
- `grpc/control_plane_client.go`
- `grpc/interceptors.go` (metrics, logging, optional tracing hooks)
- `remote_grain_grpc.go` (adapter bridging existing interfaces)
- `control_plane_adapter.go` (replaces frame handlers in `SyncedPool`)

Modified:
- `synced-pool.go` (remote host management now uses gRPC clients; negotiation logic updated)
- `main.go` (initialize both gRPC services on startup)
- `go.mod` (add `google.golang.org/grpc`)

---

## 9. Step-by-Step Migration Plan

1. Add proto files and generate Go code (`protoc --go_out --go-grpc_out`).
2. Implement `CartActorServer`:
   - Translate `MutationRequest` to `Message`.
   - Use existing handler registry for payload encode/decode.
   - Return JSON cart state.
3. Implement `CartActorClient` wrapper (`RemoteGrainGRPC`) implementing:
   - `HandleMessage`: Build envelope, call `Mutate`.
   - `GetCurrentState`: Call `GetState`.
4. Implement `ControlPlaneServer` with methods:
   - `Ping`: returns host + time.
   - `Negotiate`: merge host lists; emulate old logic.
   - `GetCartIds`: iterate local grains.
   - `ConfirmOwner`: replicate quorum flow (accept always; error path for future).
   - `Closing`: schedule remote removal.
5. Implement `ControlPlaneClient` used inside `SyncedPool.AddRemote`.
6. Refactor `SyncedPool`:
   - Replace frame handlers registration with gRPC client calls.
   - Replace `Server.AddHandler(...)` start-up with launching gRPC server.
   - Implement periodic health checks using `Ping`.
7. Remove old connection constructs for 1337/1338.
8. Metrics:
   - Add unary interceptor capturing duration and status.
   - Replace packet counters with `cart_grpc_mutate_calls_total`, `cart_grpc_control_calls_total`, histograms for latency.
9. Update `main.go` to start:
   - gRPC server(s).
   - HTTP server as before.
10. Delete legacy files & update README build instructions.
11. Load testing & profiling on Raspberry Pi hardware (or ARM emulation).
12. Final cleanup & dead code removal (search for now-unused constants & structs).
13. Tag release.

---

## 10. Performance Considerations (Raspberry Pi Focus)

- One `*grpc.ClientConn` per remote endpoint (HTTP/2 multiplexing) to reduce file descriptor and handshake overhead.
- Add small keepalive pings (optional) only if connections are observed to drop; the defaults may suffice.
- Avoid reflection / dynamic dispatch in the hot path: pre-build a mapping from `MutationType` to handler function.
- Reuse byte buffers:
  - Implement a `sync.Pool` for mutation serialization to reduce GC pressure.
- Enforce per-RPC deadlines (e.g. 300–400 ms) to avoid pile-ups.
- Backpressure:
  - Before dispatch: if the local grain pool is at capacity and the target grain is remote, abort early with 503 to the caller (optional).
- Disable gRPC compression for small payloads (mutation messages are small). Enable compression only when the payload exceeds a threshold (e.g. 8 KB).
- Compile with `-ldflags="-s -w"` in production to reduce binary size (optional).
- Tune `GOMAXPROCS` to the CPU core count only if profiling calls for it; the Pi often does best with the default, but monitor.
- Use histograms with limited buckets to keep the number of Prometheus time series low.

---

## 11. Testing Strategy

Unit:
- Message type mapping tests (legacy -> enum).
- Envelope roundtrip: Original proto -> payload -> gRPC -> server decode -> internal Message.

Integration:
- Two-node cluster simulation:
  - Mutate cart on Node A, ownership moves, verify remote access from Node B.
  - Quorum failure simulation (temporarily reject `ConfirmOwner`).
- Control plane negotiation: start nodes in staggered order, assert final membership.

Load/Perf:
- Benchmark local mutation vs remote mutation latency.
- High concurrency test (N goroutines each performing X mutations).
- Memory profiling (ensure no large buffer retention).
Failure Injection:
- Kill a node mid-mutation; the client call should time out and not corrupt local state.
- Simulated network partition: drop `Ping` replies; ensure the host removal path triggers.

---

## 12. Rollback Strategy

Because no mixed-version compatibility is provided, rollback = redeploy the previous version containing the legacy protocol:

1. Stop all new-version pods.
2. Deploy old version cluster-wide.
3. No data migration needed (event persistence unaffected).

Note: Avoid partial upgrades. Because new nodes cannot talk to old nodes, restart the whole cluster in one pass and keep the window short to prevent split-brain.

---

## 13. Risks & Mitigations

| Risk | Description | Mitigation |
|------|-------------|------------|
| Full-cluster restart required | No mixed compatibility | Schedule maintenance window |
| gRPC adds CPU overhead | Envelope + marshaling cost | Buffer reuse, keep small messages uncompressed |
| Ownership race | Timing differences after refactor | Add explicit logs + tests around `RequestOwnership` path |
| Hidden dependency on frame-level status codes | Some code may assume `FrameWithPayload` fields | Wrap gRPC responses into minimal compatibility structs until fully removed |
| Memory growth | Connection reuse & pooled buffers not implemented initially | Add `sync.Pool` & track memory via pprof early |

---

## 14. Logging & Observability

- Structured log entries for:
  - Ownership changes
  - Negotiation rounds
  - Remote spawn events
  - Mutation failures (with cart id, mutation type)
- Metrics:
  - `cart_grpc_mutate_duration_seconds` (histogram)
  - `cart_grpc_mutate_errors_total`
  - `cart_grpc_control_duration_seconds`
  - `cart_remote_hosts` (gauge)
  - Retain existing grain counts.
- Optional future: OpenTelemetry tracing (span per remote mutation).

---

## 15. Future Enhancements (Post-Migration)

- Replace JSON state with `CartState` proto and provide streaming watch API.
- mTLS between nodes (certificate rotation via K8s Secret or SPIRE).
- Distributed tracing integration.
- Ownership leasing with TTL and optimistic renewal.
- Delta replication or CRDT-based conflict resolution for experimentation.

---

## 16. Task Breakdown & Estimates

| Task | Estimate |
|------|----------|
| Proto definitions & generation | 0.5d |
| CartActor server/client | 1.0d |
| ControlPlane server/client | 1.0d |
| SyncedPool refactor | 1.0d |
| Metrics & interceptors | 0.5d |
| Remove legacy code & cleanup | 0.5d |
| Tests (unit + integration) | 1.5d |
| Benchmark & tuning | 0.5–1.0d |
| Total | ~6.5–7d |

---

## 17. Open Questions (Confirm Before Implementation)

1. Combine both services on a single port (simplify ops) or keep dual-port first? (Default here: keep dual, but easy to merge.)
2. Minimum Go version remains 1.24.x. Is it acceptable to add the latest `google.golang.org/grpc`?
3. Accept adding `sync.Pool` micro-optimizations in the first pass or postpone?

---

## 18. Acceptance Criteria

- All previous integration tests (adjusted to gRPC) pass.
- Cart operations (add, remove, delivery, checkout) function across at least a 2-node cluster.
- Control plane negotiation forms a consistent host list.
- Latency for a remote mutation does not degrade beyond an acceptable threshold (define baseline before merge).
- Legacy networking code fully removed.

---

## 19. Next Steps (If Approved)

1. Implement proto files and commit.
2. Scaffold server & client code.
3. Refactor `SyncedPool` and `main.go`.
4. Add metrics and tests.
5. Run benchmark on target Pi hardware.
6. Review & merge.

---

End of Plan.