complete rewrite to grpc
396
GRPC-MIGRATION-PLAN.md
Normal file
@@ -0,0 +1,396 @@

# gRPC Migration Plan

File: GRPC-MIGRATION-PLAN.md
Author: (Generated plan)
Status: Draft for review
Target Release: Next major version (breaking change – no mixed compatibility)

---

## 1. Overview

This document describes the full migration of the current custom TCP frame-based protocol to gRPC. The protocol covers two channels: the cart mutation/state channel on port `1337` and the control plane on port `1338`. We will remove all legacy packet framing (`FrameWithPayload`, `RemoteGrain`, and the `GenericListener` handlers for these two ports) and replace it with two gRPC services:

1. Cart Actor Service (mutations + state retrieval)
2. Control Plane Service (cluster membership, negotiation, ownership change, lifecycle)

We intentionally keep:

- Internal `CartGrain` logic, message storage format, disk persistence, and JSON cart serialization.
- Existing message type numeric mapping for backward compatibility with persisted event logs.
- The HTTP/REST API layer, unchanged (it still consumes JSON state from the local/remote grain pipeline).

We do NOT implement mixed-version compatibility; migration occurs atomically (cluster restart with new image).

---

## 2. Goals

- Remove the custom binary frame protocol and simplify maintenance.
- Provide clearer, well-defined interfaces via `.proto` schemas.
- Improve observability via gRPC interceptors (metrics & tracing hooks).
- Reduce per-call overhead compared with the current manual connection pooling and handwritten framing, by relying on HTTP/2 multiplexing and connection reuse.
- Lay the groundwork for future enhancements (streaming, typed state, event streaming) without another rewrite.

---

## 3. Non-Goals (Phase 1)

- Converting the cart state payload from JSON to a strongly typed proto.
- Introducing authentication / mTLS (may be added later).
- Changing persistence or replay format.
- Changing the HTTP API contract.
- Implementing streaming watchers or push updates.

---

## 4. Architecture After Migration

Ports:

- `:1337` → gRPC CartActor service.
- `:1338` → gRPC ControlPlane service.

Each node:

- Runs both gRPC services. They could share a single listener, but we keep two ports (two listeners) initially to minimize operational surprise; they can be merged later.
- Maintains a pool of `*grpc.ClientConn` objects keyed by remote hostname, one per remote host. A single connection targets one address, so it can be reused for both services only once they share a port; with the dual-port layout this means one connection per remote host per port.

Call Flow (Mutation):

1. HTTP request hits `PoolServer`.
2. `SyncedPool.getGrain(cartId)`:
   - Local: direct invocation.
   - Remote: uses `RemoteGrainGRPC` (new), which invokes `CartActor.Mutate`.
3. Response JSON returned unchanged.

Control Plane Flow:

- Discovery (K8s watch) still triggers `AddRemote(host)`.
- Instead of sending custom `Ping`, `Negotiate`, etc. frames, nodes call gRPC methods on the `ControlPlane` service.
- Ownership changes use the `ConfirmOwner` RPC.

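The connection reuse described above can be sketched with the standard library alone. This is a minimal illustration, not the planned implementation: `conn`, `connCache`, and the injected `dial` function are hypothetical names, and `conn` merely stands in for `*grpc.ClientConn` (in real code `dial` would wrap something like `grpc.NewClient`). The cache is keyed by `host:port`, which matches the dual-port layout.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for *grpc.ClientConn.
type conn struct{ target string }

// connCache hands out one shared connection per target address.
type connCache struct {
	mu    sync.Mutex
	conns map[string]*conn
	dial  func(target string) (*conn, error)
	dials int // number of successful dials, kept only to make reuse observable
}

func newConnCache(dial func(string) (*conn, error)) *connCache {
	return &connCache{conns: make(map[string]*conn), dial: dial}
}

// get returns the cached connection for target, dialing on first use.
// The lock is held while dialing, which keeps the sketch simple but
// serializes concurrent first-time dials.
func (c *connCache) get(target string) (*conn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cc, ok := c.conns[target]; ok {
		return cc, nil
	}
	cc, err := c.dial(target)
	if err != nil {
		return nil, err
	}
	c.dials++
	c.conns[target] = cc
	return cc, nil
}

// remove forgets a target, for example when discovery reports the host gone.
func (c *connCache) remove(target string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.conns, target)
}

func main() {
	cache := newConnCache(func(t string) (*conn, error) { return &conn{target: t}, nil })
	a, _ := cache.get("node-b:1337")
	b, _ := cache.get("node-b:1337") // reused
	c, _ := cache.get("node-b:1338") // different port, separate connection
	fmt.Println(a == b, a == c, cache.dials) // true false 2
}
```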
---

## 5. Proto Design

### 5.1 Cart Actor Proto (Envelope Pattern)

We keep an envelope with `bytes payload` holding the serialized underlying cart mutation proto (existing types in `messages.proto`). This minimizes churn.

Proto sketch:

    syntax = "proto3";
    package cart;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    enum MutationType {
      MUTATION_TYPE_UNSPECIFIED = 0;
      MUTATION_ADD_REQUEST = 1;
      MUTATION_ADD_ITEM = 2;
      MUTATION_REMOVE_ITEM = 4;
      MUTATION_REMOVE_DELIVERY = 5;
      MUTATION_CHANGE_QUANTITY = 6;
      MUTATION_SET_DELIVERY = 7;
      MUTATION_SET_PICKUP_POINT = 8;
      MUTATION_CREATE_CHECKOUT_ORDER = 9;
      MUTATION_SET_CART_ITEMS = 10;
      MUTATION_ORDER_COMPLETED = 11;
    }

    message MutationRequest {
      string cart_id = 1;
      MutationType type = 2;
      bytes payload = 3; // Serialized specific mutation proto
      int64 client_timestamp = 4; // Optional; server fills if zero
    }

    message MutationReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state or error string
    }

    message StateRequest {
      string cart_id = 1;
    }

    message StateReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state
    }

    service CartActor {
      rpc Mutate(MutationRequest) returns (MutationReply);
      rpc GetState(StateRequest) returns (StateReply);
    }


### 5.2 Control Plane Proto

    syntax = "proto3";
    package control;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    message Empty {}

    message PingReply {
      string host = 1;
      int64 unix_time = 2;
    }

    message NegotiateRequest {
      repeated string known_hosts = 1;
    }
    message NegotiateReply {
      repeated string hosts = 1; // Healthy hosts returned
    }

    message CartIdsReply {
      repeated string cart_ids = 1;
    }

    message OwnerChangeRequest {
      string cart_id = 1;
      string new_host = 2;
    }
    message OwnerChangeAck {
      bool accepted = 1;
      string message = 2;
    }

    message ClosingNotice {
      string host = 1;
    }

    service ControlPlane {
      rpc Ping(Empty) returns (PingReply);
      rpc Negotiate(NegotiateRequest) returns (NegotiateReply);
      rpc GetCartIds(Empty) returns (CartIdsReply);
      rpc ConfirmOwner(OwnerChangeRequest) returns (OwnerChangeAck);
      rpc Closing(ClosingNotice) returns (OwnerChangeAck);
    }

---

## 6. Message Type Mapping

| Legacy Constant         | Numeric | New Enum Value                 |
|-------------------------|---------|--------------------------------|
| AddRequestType          | 1       | MUTATION_ADD_REQUEST           |
| AddItemType             | 2       | MUTATION_ADD_ITEM              |
| RemoveItemType          | 4       | MUTATION_REMOVE_ITEM           |
| RemoveDeliveryType      | 5       | MUTATION_REMOVE_DELIVERY       |
| ChangeQuantityType      | 6       | MUTATION_CHANGE_QUANTITY       |
| SetDeliveryType         | 7       | MUTATION_SET_DELIVERY          |
| SetPickupPointType      | 8       | MUTATION_SET_PICKUP_POINT      |
| CreateCheckoutOrderType | 9       | MUTATION_CREATE_CHECKOUT_ORDER |
| SetCartItemsType        | 10      | MUTATION_SET_CART_ITEMS        |
| OrderCompletedType      | 11      | MUTATION_ORDER_COMPLETED       |

Persisted events keep original numeric codes; reconstruction simply casts to `MutationType`.

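The cast from a persisted numeric code to `MutationType` can be sketched as follows. The constants mirror the table; the validity check is an addition for illustration, since a bare cast would silently accept `0` and the unassigned code `3`. `fromLegacy` is a hypothetical helper, and `uint32` as the legacy code type is an assumption.

```go
package main

import "fmt"

// MutationType mirrors the proto enum; values equal the legacy numeric codes.
type MutationType int32

const (
	MutationAddRequest          MutationType = 1
	MutationAddItem             MutationType = 2
	MutationRemoveItem          MutationType = 4
	MutationRemoveDelivery      MutationType = 5
	MutationChangeQuantity      MutationType = 6
	MutationSetDelivery         MutationType = 7
	MutationSetPickupPoint      MutationType = 8
	MutationCreateCheckoutOrder MutationType = 9
	MutationSetCartItems        MutationType = 10
	MutationOrderCompleted      MutationType = 11
)

// known lists every assigned code; 0 and 3 are deliberately absent.
var known = map[MutationType]string{
	MutationAddRequest:          "MUTATION_ADD_REQUEST",
	MutationAddItem:             "MUTATION_ADD_ITEM",
	MutationRemoveItem:          "MUTATION_REMOVE_ITEM",
	MutationRemoveDelivery:      "MUTATION_REMOVE_DELIVERY",
	MutationChangeQuantity:      "MUTATION_CHANGE_QUANTITY",
	MutationSetDelivery:         "MUTATION_SET_DELIVERY",
	MutationSetPickupPoint:      "MUTATION_SET_PICKUP_POINT",
	MutationCreateCheckoutOrder: "MUTATION_CREATE_CHECKOUT_ORDER",
	MutationSetCartItems:        "MUTATION_SET_CART_ITEMS",
	MutationOrderCompleted:      "MUTATION_ORDER_COMPLETED",
}

// fromLegacy casts a persisted code and rejects codes with no enum value.
func fromLegacy(code uint32) (MutationType, error) {
	t := MutationType(code)
	if _, ok := known[t]; !ok {
		return 0, fmt.Errorf("unknown legacy message type %d", code)
	}
	return t, nil
}

func main() {
	t, err := fromLegacy(4)
	fmt.Println(known[t], err) // MUTATION_REMOVE_ITEM <nil>
	_, err = fromLegacy(3)
	fmt.Println(err) // unknown legacy message type 3
}
```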
---

## 7. Components To Remove / Replace

Remove (after migration complete):

- `remote-grain.go`
- `rpc-server.go`
- Any packet/frame-specific types used solely by the above (search for `FrameWithPayload`, `RemoteHandleMutation`, `RemoteGetState`, and keep anything still reused by disk or internal logic).
- The constants representing network frame types in `synced-pool.go` (`RemoteNegotiate`, `AckChange`, etc.), which are replaced by gRPC calls.
- `netpool` usage for the remote cart channel (the control plane also no longer needs the `Connection` abstraction).

Retain (until reworked or optionally cleaned later):

- `message.go` (for persistence)
- `message-handler.go`
- `cart-grain.go`
- `messages.proto` (underlying mutation messages)
- HTTP API server and REST handlers.

---

## 8. New / Modified Components

New files (planned):

- `proto/cart_actor.proto`
- `proto/control_plane.proto`
- `grpc/cart_actor_server.go` (server impl)
- `grpc/cart_actor_client.go` (client adapter implementing `Grain`)
- `grpc/control_plane_server.go`
- `grpc/control_plane_client.go`
- `grpc/interceptors.go` (metrics, logging, optional tracing hooks)
- `remote_grain_grpc.go` (adapter bridging existing interfaces)
- `control_plane_adapter.go` (replaces frame handlers in `SyncedPool`)

Modified:

- `synced-pool.go` (remote host management now uses gRPC clients; negotiation logic updated)
- `main.go` (initialize both gRPC services on startup)
- `go.mod` (add `google.golang.org/grpc`)

---

## 9. Step-by-Step Migration Plan

1. Add proto files and generate Go code (e.g. `protoc --go_out=. --go-grpc_out=. proto/*.proto`; requires the `protoc-gen-go` and `protoc-gen-go-grpc` plugins).
2. Implement `CartActorServer`:
   - Translate `MutationRequest` to `Message`.
   - Use existing handler registry for payload encode/decode.
   - Return JSON cart state.
3. Implement `CartActorClient` wrapper (`RemoteGrainGRPC`) implementing:
   - `HandleMessage`: Build envelope, call `Mutate`.
   - `GetCurrentState`: Call `GetState`.
4. Implement `ControlPlaneServer` with methods:
   - `Ping`: returns host + time.
   - `Negotiate`: merge host lists; emulate old logic.
   - `GetCartIds`: iterate local grains.
   - `ConfirmOwner`: replicate quorum flow (always accept for now; the error path is reserved for future use).
   - `Closing`: schedule remote removal.
5. Implement `ControlPlaneClient` used inside `SyncedPool.AddRemote`.
6. Refactor `SyncedPool`:
   - Replace frame handler registration with gRPC client calls.
   - Replace `Server.AddHandler(...)` start-up with launching the gRPC server.
   - Implement periodic health checks using `Ping`.
7. Remove old connection constructs for 1337/1338.
8. Metrics:
   - Add unary interceptor capturing duration and status.
   - Replace packet counters with `cart_grpc_mutate_calls_total`, `cart_grpc_control_calls_total`, and latency histograms.
9. Update `main.go` to start:
   - gRPC server(s).
   - HTTP server as before.
10. Delete legacy files & update README build instructions.
11. Load testing & profiling on Raspberry Pi hardware (or ARM emulation).
12. Final cleanup & dead code removal (search for now-unused constants & structs).
13. Tag release.

---

## 10. Performance Considerations (Raspberry Pi Focus)

- Single `*grpc.ClientConn` per remote host (HTTP/2 multiplexing) to reduce file descriptor and handshake overhead.
- Configure keepalive pings (optional) only if idle connections are observed to drop; the defaults may suffice.
- Avoid reflection / dynamic dispatch in the hot path: pre-build a mapping from `MutationType` to handler function.
- Reuse byte buffers:
  - Implement a `sync.Pool` for mutation serialization to reduce GC pressure.
- Enforce per-RPC deadlines (e.g. 300–400ms) to avoid pile-ups.
- Backpressure:
  - Before dispatch: if the local grain pool is at capacity and the target grain is remote, abort early and return 503 to the caller (optional).
- Disable gRPC compression for small payloads (mutation messages are small). Compress only when the payload exceeds a threshold (e.g. 8KB).
- Compile with `-ldflags="-s -w"` in production to reduce binary size (optional).
- Leave `GOMAXPROCS` at its default (the number of available CPU cores) unless profiling shows a benefit from changing it; the default usually works well on a Pi, but monitor.
- Use histograms with a limited number of buckets to keep the Prometheus series count low.

---

## 11. Testing Strategy

Unit:

- Message type mapping tests (legacy -> enum).
- Envelope roundtrip: original proto -> payload -> gRPC -> server decode -> internal `Message`.

Integration:

- Two-node cluster simulation:
  - Mutate cart on Node A, ownership moves, verify remote access from Node B.
  - Quorum failure simulation (temporarily reject `ConfirmOwner`).
- Control plane negotiation: start nodes in staggered order, assert final membership.

Load/Perf:

- Benchmark local mutation vs remote mutation latency.
- High concurrency test (N goroutines each performing X mutations).
- Memory profiling (ensure no large buffer retention).

Failure Injection:

- Kill a node mid-mutation; the client call should time out and not corrupt local state.
- Simulated network partition: drop `Ping` replies; ensure the host removal path triggers.

---

## 12. Rollback Strategy

Because no mixed-version compatibility is provided, rollback = redeploy the previous version containing the legacy protocol:

1. Stop all new-version pods.
2. Deploy old version cluster-wide.
3. No data migration needed (event persistence unaffected).

Note: Avoid partial upgrades. New nodes won’t talk to old nodes, so any period in which both versions run risks split-brain; if a rolling restart is used instead of a full stop and start, complete it as quickly as possible.

---

## 13. Risks & Mitigations

| Risk | Description | Mitigation |
|------|-------------|------------|
| Full-cluster restart required | No mixed compatibility | Schedule maintenance window |
| gRPC adds CPU overhead | Envelope + marshaling cost | Buffer reuse, keep small messages uncompressed |
| Ownership race | Timing differences after refactor | Add explicit logs + tests around `RequestOwnership` path |
| Hidden dependency on frame-level status codes | Some code may assume `FrameWithPayload` fields | Wrap gRPC responses into minimal compatibility structs until fully removed |
| Memory growth | Connection reuse & pooled buffers not implemented initially | Add `sync.Pool` & track memory via pprof early |

---

## 14. Logging & Observability

- Structured log entries for:
  - Ownership changes
  - Negotiation rounds
  - Remote spawn events
  - Mutation failures (with cart id, mutation type)
- Metrics:
  - `cart_grpc_mutate_duration_seconds` (histogram)
  - `cart_grpc_mutate_errors_total`
  - `cart_grpc_control_duration_seconds`
  - `cart_remote_hosts` (gauge)
  - Retain existing grain counts.
- Optional future: OpenTelemetry tracing (span per remote mutation).

---

## 15. Future Enhancements (Post-Migration)

- Replace JSON state with `CartState` proto and provide streaming watch API.
- mTLS between nodes (certificate rotation via K8s Secret or SPIRE).
- Distributed tracing integration.
- Ownership leasing with TTL and optimistic renewal.
- Delta replication or CRDT-based conflict resolution for experimentation.

---

## 16. Task Breakdown & Estimates

| Task                           | Estimate |
|--------------------------------|----------|
| Proto definitions & generation | 0.5d     |
| CartActor server/client        | 1.0d     |
| ControlPlane server/client     | 1.0d     |
| SyncedPool refactor            | 1.0d     |
| Metrics & interceptors         | 0.5d     |
| Remove legacy code & cleanup   | 0.5d     |
| Tests (unit + integration)     | 1.5d     |
| Benchmark & tuning             | 0.5–1.0d |
| Total                          | ~6.5–7d  |

---

## 17. Open Questions (Confirm Before Implementation)

1. Combine both services on a single port (simplify ops) or keep dual-port first? (Default here: keep dual, but easy to merge.)
2. Minimum Go version remains 1.24.x; is it acceptable to add the latest `google.golang.org/grpc`?
3. Accept adding `sync.Pool` micro-optimizations in the first pass, or postpone?

---

## 18. Acceptance Criteria

- All previous integration tests (adjusted to gRPC) pass.
- Cart operations (add, remove, delivery, checkout) function across at least a 2‑node cluster.
- Control plane negotiation converges on a consistent host list across nodes.
- Latency for a remote mutation does not degrade beyond an acceptable threshold (define baseline before merge).
- Legacy networking code fully removed.

---

## 19. Next Steps (If Approved)

1. Implement proto files and commit.
2. Scaffold server & client code.
3. Refactor `SyncedPool` and `main.go`.
4. Add metrics and tests.
5. Run benchmark on target Pi hardware.
6. Review & merge.

---

End of Plan.