go-cart-actor/GRPC-MIGRATION-PLAN.md
2025-10-10 06:45:23 +00:00

# gRPC Migration Plan
File: GRPC-MIGRATION-PLAN.md
Author: (Generated plan)
Status: Draft for review
Target Release: Next major version (breaking change; no mixed-version compatibility)
---
## 1. Overview
This document describes the full migration of the current custom TCP frame-based protocol (both the cart mutation/state channel on port `1337` and the control plane on port `1338`) to gRPC. We will remove all legacy packet framing (`FrameWithPayload`, `RemoteGrain`, `GenericListener` handlers for these two ports) and replace them with two gRPC services:
1. Cart Actor Service (mutations + state retrieval)
2. Control Plane Service (cluster membership, negotiation, ownership change, lifecycle)
We intentionally keep:
- Internal `CartGrain` logic, message storage format, disk persistence, and JSON cart serialization.
- Existing message type numeric mapping for backward compatibility with persisted event logs.
- HTTP/REST API layer unchanged (it still consumes JSON state from the local/remote grain pipeline).
We do NOT implement mixed-version compatibility; migration occurs atomically (cluster restart with new image).
---
## 2. Goals
- Remove custom binary frame protocol & simplify maintenance.
- Provide clearer, strongly defined interfaces via `.proto` schemas.
- Improve observability via gRPC interceptors (metrics & tracing hooks).
- Reduce per-call overhead compared with the current manual connection pooling + handwritten framing (HTTP/2 multiplexing + connection reuse).
- Prepare groundwork for future enhancements (streaming, typed state, event streaming) without rewriting again.
---
## 3. Non-Goals (Phase 1)
- Converting the cart state payload from JSON to a strongly typed proto.
- Introducing authentication / mTLS (may be added later).
- Changing persistence or replay format.
- Changing the HTTP API contract.
- Implementing streaming watchers or push updates.
---
## 4. Architecture After Migration
Ports:
- `:1337` → gRPC CartActor service.
- `:1338` → gRPC ControlPlane service.
Each node:
- Runs one gRPC server with both services registered. A single listener could serve both, but we keep two ports initially to minimize operational surprise; they could be merged later.
- Maintains a pool of `*grpc.ClientConn` objects keyed by remote host. A `ClientConn` targets a single address, so while the two ports are kept each remote host needs one connection per port; one connection per host can serve both services only after the ports are merged.
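The connection cache described above could be sketched roughly as follows. This is a minimal illustration using only the standard library: `Conn` stands in for `*grpc.ClientConn`, and `ConnPool`, `NewConnPool`, and the `dial` hook are hypothetical names, not part of the existing code base. Keys are full target addresses (`host:port`), which is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a stand-in for *grpc.ClientConn so the sketch builds with the
// standard library only.
type Conn struct{ Target string }

// ConnPool caches one connection per target address.
type ConnPool struct {
	mu    sync.Mutex
	conns map[string]*Conn
	dial  func(target string) (*Conn, error) // hypothetical dial hook
}

func NewConnPool(dial func(string) (*Conn, error)) *ConnPool {
	return &ConnPool{conns: make(map[string]*Conn), dial: dial}
}

// Get returns the cached connection for target, dialing on first use.
func (p *ConnPool) Get(target string) (*Conn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[target]; ok {
		return c, nil
	}
	c, err := p.dial(target)
	if err != nil {
		return nil, fmt.Errorf("dial %s: %w", target, err)
	}
	p.conns[target] = c
	return c, nil
}

// Remove forgets the connection for target, e.g. after the host leaves.
func (p *ConnPool) Remove(target string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.conns, target)
}

func main() {
	dials := 0
	pool := NewConnPool(func(t string) (*Conn, error) {
		dials++
		return &Conn{Target: t}, nil
	})
	a, _ := pool.Get("node-b:1337")
	b, _ := pool.Get("node-b:1337")
	fmt.Println(a == b, dials) // true 1
}
```

A real pool would also close the underlying connection in `Remove`; that step is left out because the stand-in type has nothing to close.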
Call Flow (Mutation):
1. HTTP request hits `PoolServer`.
2. `SyncedPool.getGrain(cartId)`:
   - Local: direct invocation.
   - Remote: uses `RemoteGrainGRPC` (new) which invokes `CartActor.Mutate`.
3. Response JSON returned unchanged.
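To make the remote branch of this flow concrete, here is a hedged sketch of what the `RemoteGrainGRPC` adapter might look like. The struct fields, the trimmed-down `cartActorClient` interface, the `HandleMessage` signature, and the convention that status `200` means success are all assumptions for illustration; the real generated client and the existing `Grain` interface may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// MutationRequest and MutationReply mirror the envelope messages of the
// proto sketch; real code would use the generated types instead.
type MutationRequest struct {
	CartID  string
	Type    int32
	Payload []byte
}

type MutationReply struct {
	StatusCode int32
	Payload    []byte
}

// cartActorClient is a hypothetical, trimmed-down view of the generated client.
type cartActorClient interface {
	Mutate(ctx context.Context, req *MutationRequest) (*MutationReply, error)
}

// RemoteGrainGRPC forwards mutations for one cart to the node that owns it.
type RemoteGrainGRPC struct {
	cartID string
	client cartActorClient
}

// HandleMessage wraps the serialized mutation in an envelope and returns the
// JSON cart state. Status 200 as "success" is an assumption of this sketch.
func (g *RemoteGrainGRPC) HandleMessage(mutationType int32, payload []byte) ([]byte, error) {
	// A real adapter would derive a context with a deadline here.
	reply, err := g.client.Mutate(context.Background(), &MutationRequest{
		CartID: g.cartID, Type: mutationType, Payload: payload,
	})
	if err != nil {
		return nil, fmt.Errorf("mutate cart %s: %w", g.cartID, err)
	}
	if reply.StatusCode != 200 {
		return nil, fmt.Errorf("mutate cart %s: status %d: %s", g.cartID, reply.StatusCode, reply.Payload)
	}
	return reply.Payload, nil
}

// fakeClient answers locally so the sketch runs without a network.
type fakeClient struct{ status int32 }

func (f fakeClient) Mutate(_ context.Context, req *MutationRequest) (*MutationReply, error) {
	if req.CartID == "" {
		return nil, errors.New("missing cart id")
	}
	return &MutationReply{StatusCode: f.status, Payload: []byte(`{"id":"` + req.CartID + `"}`)}, nil
}

func main() {
	g := &RemoteGrainGRPC{cartID: "c1", client: fakeClient{status: 200}}
	state, err := g.HandleMessage(2, nil)
	fmt.Println(string(state), err)
}
```

Keeping the adapter behind a small client interface means the HTTP layer does not need to know whether a grain is local or remote, which matches the "response JSON returned unchanged" step.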
Control Plane Flow:
- Discovery (K8s watch) still triggers `AddRemote(host)`.
- Instead of custom `Ping`, `Negotiate`, etc. via frames, call gRPC methods on `ControlPlane` service.
- Ownership changes use `ConfirmOwner` RPC.
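As an illustration of the host-list merging that `Negotiate` needs, the sketch below computes a sorted, de-duplicated union of host lists. `mergeHosts` is a hypothetical helper name, and the sketch deliberately ignores health filtering; how the existing negotiation logic decides which hosts are healthy is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeHosts returns the sorted, de-duplicated union of the given host
// lists, skipping empty entries.
func mergeHosts(lists ...[]string) []string {
	seen := make(map[string]struct{})
	for _, list := range lists {
		for _, h := range list {
			if h != "" {
				seen[h] = struct{}{}
			}
		}
	}
	merged := make([]string, 0, len(seen))
	for h := range seen {
		merged = append(merged, h)
	}
	sort.Strings(merged)
	return merged
}

func main() {
	local := []string{"node-a", "node-b"}
	remote := []string{"node-b", "node-c", ""}
	fmt.Println(mergeHosts(local, remote)) // [node-a node-b node-c]
}
```

Sorting the result keeps replies deterministic, which makes negotiation rounds easier to compare in logs and tests.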
---
## 5. Proto Design
### 5.1 Cart Actor Proto (Envelope Pattern)
We keep an envelope with `bytes payload` holding the serialized underlying cart mutation proto (existing types in `messages.proto`). This minimizes churn.
Indented code block (proto sketch):

    syntax = "proto3";

    package cart;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    enum MutationType {
      MUTATION_TYPE_UNSPECIFIED = 0;
      MUTATION_ADD_REQUEST = 1;
      MUTATION_ADD_ITEM = 2;
      MUTATION_REMOVE_ITEM = 4;
      MUTATION_REMOVE_DELIVERY = 5;
      MUTATION_CHANGE_QUANTITY = 6;
      MUTATION_SET_DELIVERY = 7;
      MUTATION_SET_PICKUP_POINT = 8;
      MUTATION_CREATE_CHECKOUT_ORDER = 9;
      MUTATION_SET_CART_ITEMS = 10;
      MUTATION_ORDER_COMPLETED = 11;
    }

    message MutationRequest {
      string cart_id = 1;
      MutationType type = 2;
      bytes payload = 3;          // Serialized specific mutation proto
      int64 client_timestamp = 4; // Optional; server fills if zero
    }

    message MutationReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state or error string
    }

    message StateRequest {
      string cart_id = 1;
    }

    message StateReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state
    }

    service CartActor {
      rpc Mutate(MutationRequest) returns (MutationReply);
      rpc GetState(StateRequest) returns (StateReply);
    }

### 5.2 Control Plane Proto

    syntax = "proto3";

    package control;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    message Empty {}

    message PingReply {
      string host = 1;
      int64 unix_time = 2;
    }

    message NegotiateRequest {
      repeated string known_hosts = 1;
    }

    message NegotiateReply {
      repeated string hosts = 1; // Healthy hosts returned
    }

    message CartIdsReply {
      repeated string cart_ids = 1;
    }

    message OwnerChangeRequest {
      string cart_id = 1;
      string new_host = 2;
    }

    message OwnerChangeAck {
      bool accepted = 1;
      string message = 2;
    }

    message ClosingNotice {
      string host = 1;
    }

    service ControlPlane {
      rpc Ping(Empty) returns (PingReply);
      rpc Negotiate(NegotiateRequest) returns (NegotiateReply);
      rpc GetCartIds(Empty) returns (CartIdsReply);
      rpc ConfirmOwner(OwnerChangeRequest) returns (OwnerChangeAck);
      rpc Closing(ClosingNotice) returns (OwnerChangeAck);
    }

---
## 6. Message Type Mapping
| Legacy Constant         | Numeric | New Enum Value                 |
|-------------------------|---------|--------------------------------|
| AddRequestType          | 1       | MUTATION_ADD_REQUEST           |
| AddItemType             | 2       | MUTATION_ADD_ITEM              |
| RemoveItemType          | 4       | MUTATION_REMOVE_ITEM           |
| RemoveDeliveryType      | 5       | MUTATION_REMOVE_DELIVERY       |
| ChangeQuantityType      | 6       | MUTATION_CHANGE_QUANTITY       |
| SetDeliveryType         | 7       | MUTATION_SET_DELIVERY          |
| SetPickupPointType      | 8       | MUTATION_SET_PICKUP_POINT      |
| CreateCheckoutOrderType | 9       | MUTATION_CREATE_CHECKOUT_ORDER |
| SetCartItemsType        | 10      | MUTATION_SET_CART_ITEMS        |
| OrderCompletedType      | 11      | MUTATION_ORDER_COMPLETED       |

Persisted events keep original numeric codes; reconstruction simply casts to `MutationType`.
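The cast from a persisted numeric code could be guarded along the lines below, so that unassigned values (such as 3, which has no row in the table) are rejected rather than silently accepted. The Go constant names, the `uint16` width of the legacy code, and the `mutationTypeFromLegacy` helper are assumptions of this sketch; the generated enum would use names like `MutationType_MUTATION_ADD_ITEM`.

```go
package main

import "fmt"

// MutationType mirrors the proto enum; values match the legacy codes.
type MutationType int32

const (
	MutationUnspecified         MutationType = 0
	MutationAddRequest          MutationType = 1
	MutationAddItem             MutationType = 2
	MutationRemoveItem          MutationType = 4
	MutationRemoveDelivery      MutationType = 5
	MutationChangeQuantity      MutationType = 6
	MutationSetDelivery         MutationType = 7
	MutationSetPickupPoint      MutationType = 8
	MutationCreateCheckoutOrder MutationType = 9
	MutationSetCartItems        MutationType = 10
	MutationOrderCompleted      MutationType = 11
)

// knownMutations lists every assigned code with its proto enum name.
var knownMutations = map[MutationType]string{
	MutationAddRequest:          "MUTATION_ADD_REQUEST",
	MutationAddItem:             "MUTATION_ADD_ITEM",
	MutationRemoveItem:          "MUTATION_REMOVE_ITEM",
	MutationRemoveDelivery:      "MUTATION_REMOVE_DELIVERY",
	MutationChangeQuantity:      "MUTATION_CHANGE_QUANTITY",
	MutationSetDelivery:         "MUTATION_SET_DELIVERY",
	MutationSetPickupPoint:      "MUTATION_SET_PICKUP_POINT",
	MutationCreateCheckoutOrder: "MUTATION_CREATE_CHECKOUT_ORDER",
	MutationSetCartItems:        "MUTATION_SET_CART_ITEMS",
	MutationOrderCompleted:      "MUTATION_ORDER_COMPLETED",
}

// mutationTypeFromLegacy casts a persisted code and rejects unassigned values.
func mutationTypeFromLegacy(code uint16) (MutationType, error) {
	t := MutationType(code)
	if _, ok := knownMutations[t]; !ok {
		return MutationUnspecified, fmt.Errorf("unknown legacy message type %d", code)
	}
	return t, nil
}

func main() {
	for _, code := range []uint16{4, 3} {
		t, err := mutationTypeFromLegacy(code)
		fmt.Println(knownMutations[t], err)
	}
}
```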
---
## 7. Components To Remove / Replace
Remove (after migration complete):
- `remote-grain.go`
- `rpc-server.go`
- Any packet/frame-specific types solely used by the above (search: `FrameWithPayload`, `RemoteHandleMutation`, `RemoteGetState` where not reused by disk or internal logic).
- The constants representing network frame types in `synced-pool.go` (RemoteNegotiate, AckChange, etc.) replaced by gRPC calls.
- netpool usage for remote cart channel (control plane also no longer needs `Connection` abstraction).
Retain (until reworked or optionally cleaned later):
- `message.go` (for persistence)
- `message-handler.go`
- `cart-grain.go`
- `messages.proto` (underlying mutation messages)
- HTTP API server and REST handlers.
---
## 8. New / Modified Components
New files (planned):
- `proto/cart_actor.proto`
- `proto/control_plane.proto`
- `grpc/cart_actor_server.go` (server impl)
- `grpc/cart_actor_client.go` (client adapter implementing `Grain`)
- `grpc/control_plane_server.go`
- `grpc/control_plane_client.go`
- `grpc/interceptors.go` (metrics, logging, optional tracing hooks)
- `remote_grain_grpc.go` (adapter bridging existing interfaces)
- `control_plane_adapter.go` (replaces frame handlers in `SyncedPool`)
Modified:
- `synced-pool.go` (remote host management now uses gRPC clients; negotiation logic updated)
- `main.go` (initialize both gRPC services on startup)
- `go.mod` (add `google.golang.org/grpc`)
---
## 9. Step-by-Step Migration Plan
1. Add proto files and generate Go code with `protoc` (using the `--go_out` and `--go-grpc_out` plugin flags).
2. Implement `CartActorServer`:
   - Translate `MutationRequest` to `Message`.
   - Use existing handler registry for payload encode/decode.
   - Return JSON cart state.
3. Implement `CartActorClient` wrapper (`RemoteGrainGRPC`) implementing:
   - `HandleMessage`: Build envelope, call `Mutate`.
   - `GetCurrentState`: Call `GetState`.
4. Implement `ControlPlaneServer` with methods:
   - `Ping`: returns host + time.
   - `Negotiate`: merge host lists; emulate old logic.
   - `GetCartIds`: iterate local grains.
   - `ConfirmOwner`: replicate the quorum flow (always accept for now; the rejection path is reserved for future use).
   - `Closing`: schedule remote removal.
5. Implement `ControlPlaneClient` used inside `SyncedPool.AddRemote`.
6. Refactor `SyncedPool`:
   - Replace frame handlers registration with gRPC client calls.
   - Replace `Server.AddHandler(...)` start-up with launching gRPC server.
   - Implement periodic health checks using `Ping`.
7. Remove old connection constructs for 1337/1338.
8. Metrics:
   - Add unary interceptor capturing duration and status.
   - Replace packet counters with `cart_grpc_mutate_calls_total`, `cart_grpc_control_calls_total`, histograms for latency.
9. Update `main.go` to start:
   - gRPC server(s).
   - HTTP server as before.
10. Delete legacy files & update README build instructions.
11. Load testing & profiling on Raspberry Pi hardware (or ARM emulation).
12. Final cleanup & dead code removal (search for now-unused constants & structs).
13. Tag release.
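For step 2, the server side has to route each envelope to the right mutation handler. One way to do that without reflection is a pre-built lookup table, sketched below. `Dispatcher`, `HandlerFunc`, and `ErrUnknownMutation` are hypothetical names; the existing handler registry may already provide something equivalent.

```go
package main

import (
	"errors"
	"fmt"
)

// MutationType mirrors the proto enum; only two values are needed here.
type MutationType int32

const (
	MutationAddItem    MutationType = 2
	MutationRemoveItem MutationType = 4
)

// HandlerFunc applies one decoded mutation and returns the JSON cart state.
type HandlerFunc func(cartID string, payload []byte) ([]byte, error)

// ErrUnknownMutation is returned for types without a registered handler.
var ErrUnknownMutation = errors.New("unknown mutation type")

// Dispatcher routes envelopes through a table built once at start-up.
type Dispatcher struct {
	handlers map[MutationType]HandlerFunc
}

// NewDispatcher copies the table so later changes by the caller have no effect.
func NewDispatcher(handlers map[MutationType]HandlerFunc) *Dispatcher {
	table := make(map[MutationType]HandlerFunc, len(handlers))
	for t, h := range handlers {
		table[t] = h
	}
	return &Dispatcher{handlers: table}
}

// Dispatch looks up the handler for t and invokes it.
func (d *Dispatcher) Dispatch(t MutationType, cartID string, payload []byte) ([]byte, error) {
	h, ok := d.handlers[t]
	if !ok {
		return nil, fmt.Errorf("%w: %d", ErrUnknownMutation, t)
	}
	return h(cartID, payload)
}

func main() {
	d := NewDispatcher(map[MutationType]HandlerFunc{
		MutationAddItem: func(cartID string, _ []byte) ([]byte, error) {
			return []byte(`{"id":"` + cartID + `","items":1}`), nil
		},
	})
	state, err := d.Dispatch(MutationAddItem, "c1", nil)
	fmt.Println(string(state), err)

	_, err = d.Dispatch(MutationRemoveItem, "c1", nil)
	fmt.Println(errors.Is(err, ErrUnknownMutation)) // true
}
```

Because the table is read-only after construction, concurrent `Dispatch` calls need no locking.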
---
## 10. Performance Considerations (Raspberry Pi Focus)
- Single `*grpc.ClientConn` per remote target (HTTP/2 multiplexing) to reduce file descriptor and handshake overhead.
- Enable keepalive pings only if idle connections are observed to drop; the defaults may suffice.
- Avoid reflection / dynamic dispatch in hot path: pre-build a mapping from `MutationType` to handler function.
- Reuse byte buffers:
  - Implement a `sync.Pool` for mutation serialization to reduce GC pressure.
- Enforce per-RPC deadlines (e.g. 300-400 ms) to avoid pile-ups.
- Backpressure:
  - Before dispatch: if local grain pool at capacity and target grain is remote, abort early with 503 to caller (optional).
- Disable gRPC compression for small payloads (mutation messages are small). Enable compression only when the payload exceeds a threshold (e.g. 8 KB).
- Compile with `-ldflags="-s -w"` in production to reduce binary size (optional).
- Leave `GOMAXPROCS` at its default on the Pi unless monitoring shows a benefit from tuning it.
- Use histograms with limited buckets to reduce Prometheus cardinality.
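The per-RPC deadline from the list above can be illustrated with the standard `context` package, which is also how gRPC clients in Go receive deadlines. In this sketch `callWithDeadline`, `slowRPC`, `timedOut`, and `rootCtx` are hypothetical names, and the remote call is simulated with a timer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rootCtx stands in for the incoming HTTP request context.
var rootCtx = context.Background()

// callWithDeadline runs call with a context that expires after timeout.
func callWithDeadline(parent context.Context, timeout time.Duration, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // always release the timer
	return call(ctx)
}

// slowRPC simulates a remote call that takes d and honors cancellation.
func slowRPC(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// timedOut reports whether err came from an expired deadline. With grpc-go
// the client would check status.Code(err) == codes.DeadlineExceeded instead.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	err := callWithDeadline(rootCtx, 50*time.Millisecond, slowRPC(time.Second))
	fmt.Println(timedOut(err)) // true

	err = callWithDeadline(rootCtx, 400*time.Millisecond, slowRPC(time.Millisecond))
	fmt.Println(err) // <nil>
}
```

Deriving the call context from the request context means a cancelled HTTP request also cancels the remote mutation, which helps with the pile-up concern.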
---
## 11. Testing Strategy
Unit:
- Message type mapping tests (legacy -> enum).
- Envelope roundtrip: Original proto -> payload -> gRPC -> server decode -> internal Message.
Integration:
- Two-node cluster simulation:
  - Mutate cart on Node A, ownership moves, verify remote access from Node B.
  - Quorum failure simulation (temporarily reject `ConfirmOwner`).
- Control plane negotiation: start nodes in staggered order, assert final membership.
Load/Perf:
- Benchmark local mutation vs remote mutation latency.
- High concurrency test (N goroutines each performing X mutations).
- Memory profiling (ensure no large buffer retention).
Failure Injection:
- Kill a node mid-mutation; client call should time out and not corrupt local state.
- Simulated network partition: drop `Ping` replies; ensure host removal path triggers.
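The host removal path exercised by the partition test could be driven by a small failure counter like the one below. This is a sketch under assumptions: `HealthTracker`, the consecutive-failure threshold, and `errPingTimeout` are hypothetical, and the plan does not specify how many missed `Ping` replies should trigger removal.

```go
package main

import (
	"errors"
	"fmt"
)

// errPingTimeout stands in for a failed or timed-out Ping RPC.
var errPingTimeout = errors.New("ping timeout")

// HealthTracker counts consecutive failed pings per host.
// It is not safe for concurrent use; callers must serialize access.
type HealthTracker struct {
	maxFailures int
	failures    map[string]int
}

func NewHealthTracker(maxFailures int) *HealthTracker {
	return &HealthTracker{maxFailures: maxFailures, failures: make(map[string]int)}
}

// Observe records one ping result and reports whether the host has now
// failed maxFailures times in a row and should be removed.
func (h *HealthTracker) Observe(host string, pingErr error) (remove bool) {
	if pingErr == nil {
		delete(h.failures, host) // any success resets the streak
		return false
	}
	h.failures[host]++
	if h.failures[host] >= h.maxFailures {
		delete(h.failures, host)
		return true
	}
	return false
}

func main() {
	tracker := NewHealthTracker(3)
	results := []error{errPingTimeout, nil, errPingTimeout, errPingTimeout, errPingTimeout}
	for i, r := range results {
		fmt.Println(i, tracker.Observe("node-b", r))
	}
}
```

In a test, dropping `Ping` replies then amounts to feeding errors into `Observe` and asserting that removal is signalled on the configured attempt.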
---
## 12. Rollback Strategy
Because no mixed-version compatibility is provided, rollback = redeploy previous version containing legacy protocol:
1. Stop all new-version pods.
2. Deploy old version cluster-wide.
3. No data migration needed (event persistence unaffected).
Note: Avoid partial upgrades; complete the full restart quickly to prevent split-brain (new nodes won't talk to old nodes).
---
## 13. Risks & Mitigations
| Risk | Description | Mitigation |
|------|-------------|------------|
| Full-cluster restart required | No mixed compatibility | Schedule maintenance window |
| gRPC adds CPU overhead | Envelope + marshaling cost | Buffer reuse, keep small messages uncompressed |
| Ownership race | Timing differences after refactor | Add explicit logs + tests around `RequestOwnership` path |
| Hidden dependency on frame-level status codes | Some code may assume `FrameWithPayload` fields | Wrap gRPC responses into minimal compatibility structs until fully removed |
| Memory growth | Connection reuse & pooled buffers not implemented initially | Add `sync.Pool` & track memory via pprof early |
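
The `sync.Pool` mitigation for memory growth in the table above might look like the following sketch. `encodeMutation` is a toy encoder standing in for proto marshaling, and `maxPooledCap` is an assumed cut-off chosen for illustration, not a measured value.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// maxPooledCap is an assumed cut-off: larger buffers are dropped instead of
// pooled so one oversized message does not pin memory indefinitely.
const maxPooledCap = 64 << 10

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func getBuffer() *bytes.Buffer {
	return bufPool.Get().(*bytes.Buffer)
}

// putBuffer resets buf and returns it to the pool unless it grew too large.
func putBuffer(buf *bytes.Buffer) {
	if buf.Cap() > maxPooledCap {
		return // let the garbage collector reclaim it
	}
	buf.Reset()
	bufPool.Put(buf)
}

// encodeMutation is a toy encoder (cart id, a colon, then the payload) that
// stands in for proto marshaling into a pooled buffer.
func encodeMutation(cartID string, payload []byte) []byte {
	buf := getBuffer()
	defer putBuffer(buf)
	buf.WriteString(cartID)
	buf.WriteByte(':')
	buf.Write(payload)
	// Copy out: the buffer's backing array goes back to the pool.
	return append([]byte(nil), buf.Bytes()...)
}

func main() {
	a := encodeMutation("cart-1", []byte("add"))
	b := encodeMutation("cart-2", []byte("remove"))
	fmt.Printf("%s %s\n", a, b) // cart-1:add cart-2:remove
}
```

The copy on return is what keeps callers safe from later reuse of the pooled buffer; whether the saved allocations outweigh that copy is something the pprof runs should confirm.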
---
## 14. Logging & Observability
- Structured log entries for:
  - Ownership changes
  - Negotiation rounds
  - Remote spawn events
  - Mutation failures (with cart id, mutation type)
- Metrics:
  - `cart_grpc_mutate_duration_seconds` (histogram)
  - `cart_grpc_mutate_errors_total`
  - `cart_grpc_control_duration_seconds`
  - `cart_remote_hosts` (gauge)
  - Retain existing grain counts.
- Optional future: OpenTelemetry tracing (span per remote mutation).
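A unary interceptor that feeds the duration and error metrics could follow the pattern below. To stay runnable without external modules, `UnaryHandler` is a local stand-in shaped like `grpc.UnaryHandler`, and the method name is passed as a plain string where `grpc.UnaryServerInterceptor` would receive a `*grpc.UnaryServerInfo` with a `FullMethod` field. `Metrics` and `CallStats` are hypothetical; a real build would update Prometheus collectors instead of an in-memory map.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// UnaryHandler is a local stand-in shaped like grpc.UnaryHandler.
type UnaryHandler func(ctx context.Context, req any) (any, error)

// CallStats aggregates what a duration histogram and an error counter
// would record for one method.
type CallStats struct {
	Calls  int
	Errors int
	Total  time.Duration
}

// Metrics collects per-method stats in memory.
type Metrics struct {
	mu       sync.Mutex
	byMethod map[string]CallStats
}

func NewMetrics() *Metrics {
	return &Metrics{byMethod: make(map[string]CallStats)}
}

// Intercept times one unary call and records its outcome.
func (m *Metrics) Intercept(ctx context.Context, req any, fullMethod string, handler UnaryHandler) (any, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	elapsed := time.Since(start)

	m.mu.Lock()
	s := m.byMethod[fullMethod]
	s.Calls++
	s.Total += elapsed
	if err != nil {
		s.Errors++
	}
	m.byMethod[fullMethod] = s
	m.mu.Unlock()
	return resp, err
}

// Stats returns a copy of the stats recorded for fullMethod.
func (m *Metrics) Stats(fullMethod string) CallStats {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.byMethod[fullMethod]
}

// Demo handlers standing in for generated service methods.
func okHandler(_ context.Context, req any) (any, error) { return req, nil }

func failHandler(_ context.Context, _ any) (any, error) {
	return nil, errors.New("grain unavailable")
}

func main() {
	const method = "/cart.CartActor/Mutate"
	ctx := context.Background()
	m := NewMetrics()
	m.Intercept(ctx, "req", method, okHandler)
	m.Intercept(ctx, "req", method, failHandler)
	s := m.Stats(method)
	fmt.Printf("calls=%d errors=%d\n", s.Calls, s.Errors) // calls=2 errors=1
}
```

Labeling by full method name keeps the label set bounded by the number of RPCs, which fits the goal of limiting Prometheus cardinality.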
---
## 15. Future Enhancements (Post-Migration)
- Replace JSON state with `CartState` proto and provide streaming watch API.
- mTLS between nodes (certificate rotation via K8s Secret or SPIRE).
- Distributed tracing integration.
- Ownership leasing with TTL and optimistic renewal.
- Delta replication or CRDT-based conflict resolution for experimentation.
---
## 16. Task Breakdown & Estimates
| Task | Estimate |
|------|----------|
| Proto definitions & generation | 0.5d |
| CartActor server/client | 1.0d |
| ControlPlane server/client | 1.0d |
| SyncedPool refactor | 1.0d |
| Metrics & interceptors | 0.5d |
| Remove legacy code & cleanup | 0.5d |
| Tests (unit + integration) | 1.5d |
| Benchmark & tuning | 0.5-1.0d |
| Total | ~6-7d |
---
## 17. Open Questions (Confirm Before Implementation)
1. Combine both services on a single port (simplify ops) or keep dual-port first? (Default here: keep dual, but easy to merge.)
2. Minimum Go version remains 1.24.x. Is it acceptable to add the latest `google.golang.org/grpc`?
3. Accept adding `sync.Pool` micro-optimizations in first pass or postpone?
---
## 18. Acceptance Criteria
- All previous integration tests (adjusted to gRPC) pass.
- Cart operations (add, remove, delivery, checkout) function across at least a 2-node cluster.
- Control plane negotiation forms consistent host list.
- Latency for a remote mutation does not degrade beyond an acceptable threshold (define baseline before merge).
- Legacy networking code fully removed.
---
## 19. Next Steps (If Approved)
1. Implement proto files and commit.
2. Scaffold server & client code.
3. Refactor `SyncedPool` and `main.go`.
4. Add metrics and tests.
5. Run benchmark on target Pi hardware.
6. Review & merge.
---
End of Plan.