complete rewrite to grpc

matst80
2025-10-10 06:45:23 +00:00
parent f735540c3d
commit 4c973b239f
31 changed files with 3080 additions and 1816 deletions

GRPC-MIGRATION-PLAN.md Normal file

@@ -0,0 +1,396 @@
# gRPC Migration Plan
File: GRPC-MIGRATION-PLAN.md
Author: (Generated plan)
Status: Draft for review
Target Release: Next major version (breaking change; no mixed compatibility)
---
## 1. Overview
This document describes the full migration of the current custom TCP frame-based protocol (both the cart mutation/state channel on port `1337` and the control plane on port `1338`) to gRPC. We will remove all legacy packet framing (`FrameWithPayload`, `RemoteGrain`, `GenericListener` handlers for these two ports) and replace them with two gRPC services:
1. Cart Actor Service (mutations + state retrieval)
2. Control Plane Service (cluster membership, negotiation, ownership change, lifecycle)
We intentionally keep:
- Internal `CartGrain` logic, message storage format, disk persistence, and JSON cart serialization.
- Existing message type numeric mapping for backward compatibility with persisted event logs.
- HTTP/REST API layer unchanged (it still consumes JSON state from the local/remote grain pipeline).
We do NOT implement mixed-version compatibility; migration occurs atomically (cluster restart with new image).
---
## 2. Goals
- Remove custom binary frame protocol & simplify maintenance.
- Provide clearer, strongly defined interfaces via `.proto` schemas.
- Improve observability via gRPC interceptors (metrics & tracing hooks).
- Reduce per-call overhead compared with the current manual connection pooling + handwritten framing (HTTP/2 multiplexing + connection reuse).
- Prepare groundwork for future enhancements (streaming, typed state, event streaming) without rewriting again.
---
## 3. Non-Goals (Phase 1)
- Converting the cart state payload from JSON to a strongly typed proto.
- Introducing authentication / mTLS (may be added later).
- Changing persistence or replay format.
- Changing the HTTP API contract.
- Implementing streaming watchers or push updates.
---
## 4. Architecture After Migration
Ports:
- `:1337` → gRPC CartActor service.
- `:1338` → gRPC ControlPlane service.
Each node:
- Runs both gRPC services. They can share a single listener or use two separate listeners; we will keep two ports initially to minimize operational surprise, but they could be merged later.
- Maintains a connection pool of `*grpc.ClientConn` objects keyed by remote hostname (one per remote host, reused for both services).
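
The per-host connection pool described above could look roughly like the following sketch. It is not the project's implementation: `remoteConn`, `dialFunc`, `connPool`, and `fakeConn` are hypothetical names, and `remoteConn` stands in for `*grpc.ClientConn` so the example builds with the standard library only.

```go
package main

import (
	"fmt"
	"sync"
)

// remoteConn is a stand-in for *grpc.ClientConn (assumption for this sketch).
type remoteConn interface {
	Close() error
}

// dialFunc is a hypothetical hook; real code would dial the remote host here.
type dialFunc func(host string) (remoteConn, error)

// connPool caches exactly one connection per remote hostname.
type connPool struct {
	mu    sync.Mutex
	conns map[string]remoteConn
	dial  dialFunc
}

func newConnPool(dial dialFunc) *connPool {
	return &connPool{conns: make(map[string]remoteConn), dial: dial}
}

// get returns the cached connection for host, dialing only on first use.
func (p *connPool) get(host string) (remoteConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[host]; ok {
		return c, nil
	}
	c, err := p.dial(host)
	if err != nil {
		return nil, fmt.Errorf("dial %s: %w", host, err)
	}
	p.conns[host] = c
	return c, nil
}

// remove closes and forgets the connection for host, if one exists.
func (p *connPool) remove(host string) error {
	p.mu.Lock()
	c, ok := p.conns[host]
	delete(p.conns, host)
	p.mu.Unlock()
	if !ok {
		return nil
	}
	return c.Close()
}

// fakeConn is a demo-only connection.
type fakeConn struct{ host string }

func (c *fakeConn) Close() error { return nil }

func main() {
	dials := 0
	pool := newConnPool(func(host string) (remoteConn, error) {
		dials++
		return &fakeConn{host: host}, nil
	})
	for _, host := range []string{"cart-0", "cart-0", "cart-1"} {
		if _, err := pool.get(host); err != nil {
			fmt.Println("dial failed:", err)
		}
	}
	fmt.Println("dials:", dials) // prints "dials: 2"
}
```

Holding the mutex while dialing keeps the sketch simple; a production version might dial outside the lock if dialing can block.
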
Call Flow (Mutation):
1. HTTP request hits `PoolServer`.
2. `SyncedPool.getGrain(cartId)`:
- Local: direct invocation.
- Remote: uses `RemoteGrainGRPC` (new) which invokes `CartActor.Mutate`.
3. Response JSON returned unchanged.
Control Plane Flow:
- Discovery (K8s watch) still triggers `AddRemote(host)`.
- Instead of custom `Ping`, `Negotiate`, etc. via frames, call gRPC methods on `ControlPlane` service.
- Ownership changes use `ConfirmOwner` RPC.
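
As an illustration of how `Ping` could drive host removal in the control plane flow above, here is a minimal sketch. The `pinger` interface is a hypothetical stand-in for the generated `ControlPlane` client, and the "remove after N consecutive failures" rule is an assumption, not something this plan specifies.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
	"time"
)

// pinger is a hypothetical stand-in for the generated ControlPlane client.
type pinger interface {
	Ping(ctx context.Context) error
}

// healthChecker tracks consecutive Ping failures per remote host.
type healthChecker struct {
	remotes     map[string]pinger
	failures    map[string]int
	maxFailures int           // assumed threshold before a host is dropped
	timeout     time.Duration // per-Ping deadline
}

func newHealthChecker(maxFailures int, timeout time.Duration) *healthChecker {
	return &healthChecker{
		remotes:     make(map[string]pinger),
		failures:    make(map[string]int),
		maxFailures: maxFailures,
		timeout:     timeout,
	}
}

func (h *healthChecker) add(host string, p pinger) { h.remotes[host] = p }

// checkOnce pings every remote once and returns the (sorted) hosts removed
// because they reached maxFailures consecutive failures.
func (h *healthChecker) checkOnce(ctx context.Context) []string {
	var removed []string
	for host, p := range h.remotes {
		pingCtx, cancel := context.WithTimeout(ctx, h.timeout)
		err := p.Ping(pingCtx)
		cancel()
		if err == nil {
			h.failures[host] = 0
			continue
		}
		h.failures[host]++
		if h.failures[host] >= h.maxFailures {
			delete(h.remotes, host) // deleting during range is safe in Go
			delete(h.failures, host)
			removed = append(removed, host)
		}
	}
	sort.Strings(removed)
	return removed
}

// stubPinger is a demo-only remote that always returns err.
type stubPinger struct{ err error }

func (s stubPinger) Ping(ctx context.Context) error { return s.err }

// runDemo performs two rounds against one healthy and one failing host.
func runDemo() (first, second []string) {
	h := newHealthChecker(2, 100*time.Millisecond)
	h.add("cart-0", stubPinger{})
	h.add("cart-1", stubPinger{err: errors.New("unreachable")})
	ctx := context.Background()
	first = h.checkOnce(ctx)
	second = h.checkOnce(ctx)
	return first, second
}

func main() {
	first, second := runDemo()
	fmt.Println(first, second) // prints "[] [cart-1]"
}
```

In the real pool, `checkOnce` would presumably run on a ticker, and a removed host would also have its client connection closed.
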
---
## 5. Proto Design
### 5.1 Cart Actor Proto (Envelope Pattern)
We keep an envelope with `bytes payload` holding the serialized underlying cart mutation proto (existing types in `messages.proto`). This minimizes churn.
Indented code block (proto sketch):
    syntax = "proto3";
    package cart;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    enum MutationType {
      MUTATION_TYPE_UNSPECIFIED = 0;
      MUTATION_ADD_REQUEST = 1;
      MUTATION_ADD_ITEM = 2;
      // 3 is not assigned in the legacy mapping (see section 6).
      MUTATION_REMOVE_ITEM = 4;
      MUTATION_REMOVE_DELIVERY = 5;
      MUTATION_CHANGE_QUANTITY = 6;
      MUTATION_SET_DELIVERY = 7;
      MUTATION_SET_PICKUP_POINT = 8;
      MUTATION_CREATE_CHECKOUT_ORDER = 9;
      MUTATION_SET_CART_ITEMS = 10;
      MUTATION_ORDER_COMPLETED = 11;
    }

    message MutationRequest {
      string cart_id = 1;
      MutationType type = 2;
      bytes payload = 3;          // Serialized specific mutation proto
      int64 client_timestamp = 4; // Optional; server fills if zero
    }

    message MutationReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state or error string
    }

    message StateRequest {
      string cart_id = 1;
    }

    message StateReply {
      int32 status_code = 1;
      bytes payload = 2; // JSON cart state
    }

    service CartActor {
      rpc Mutate(MutationRequest) returns (MutationReply);
      rpc GetState(StateRequest) returns (StateReply);
    }
### 5.2 Control Plane Proto
    syntax = "proto3";
    package control;
    option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

    message Empty {}

    message PingReply {
      string host = 1;
      int64 unix_time = 2;
    }

    message NegotiateRequest {
      repeated string known_hosts = 1;
    }

    message NegotiateReply {
      repeated string hosts = 1; // Healthy hosts returned
    }

    message CartIdsReply {
      repeated string cart_ids = 1;
    }

    message OwnerChangeRequest {
      string cart_id = 1;
      string new_host = 2;
    }

    message OwnerChangeAck {
      bool accepted = 1;
      string message = 2;
    }

    message ClosingNotice {
      string host = 1;
    }

    service ControlPlane {
      rpc Ping(Empty) returns (PingReply);
      rpc Negotiate(NegotiateRequest) returns (NegotiateReply);
      rpc GetCartIds(Empty) returns (CartIdsReply);
      rpc ConfirmOwner(OwnerChangeRequest) returns (OwnerChangeAck);
      rpc Closing(ClosingNotice) returns (OwnerChangeAck);
    }
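
To make the `Negotiate` exchange concrete, the sketch below merges the caller's `known_hosts` with the locally known hosts and keeps only the healthy ones, matching the "Healthy hosts returned" comment on `NegotiateReply`. Treating the merge as a plain de-duplicated union is an assumption (the legacy negotiation may apply further rules), and `mergeHosts` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeHosts returns the sorted, de-duplicated union of local and known
// hosts that pass the healthy predicate. A nil predicate accepts every host.
func mergeHosts(local, known []string, healthy func(host string) bool) []string {
	seen := make(map[string]struct{}, len(local)+len(known))
	merged := make([]string, 0, len(local)+len(known))
	for _, list := range [][]string{local, known} {
		for _, host := range list {
			if host == "" {
				continue // ignore empty entries
			}
			if _, dup := seen[host]; dup {
				continue
			}
			seen[host] = struct{}{}
			if healthy == nil || healthy(host) {
				merged = append(merged, host)
			}
		}
	}
	sort.Strings(merged)
	return merged
}

func main() {
	local := []string{"cart-0", "cart-1"}
	known := []string{"cart-1", "cart-2", "cart-3"}
	down := map[string]bool{"cart-3": true}
	hosts := mergeHosts(local, known, func(h string) bool { return !down[h] })
	fmt.Println(hosts) // prints "[cart-0 cart-1 cart-2]"
}
```

Sorting the result is optional; it only makes replies deterministic, which helps when asserting final membership in tests.
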
---
## 6. Message Type Mapping
| Legacy Constant | Numeric | New Enum Value |
|-----------------|---------|-----------------------------|
| AddRequestType | 1 | MUTATION_ADD_REQUEST |
| AddItemType | 2 | MUTATION_ADD_ITEM |
| RemoveItemType | 4 | MUTATION_REMOVE_ITEM |
| RemoveDeliveryType | 5 | MUTATION_REMOVE_DELIVERY |
| ChangeQuantityType | 6 | MUTATION_CHANGE_QUANTITY |
| SetDeliveryType | 7 | MUTATION_SET_DELIVERY |
| SetPickupPointType | 8 | MUTATION_SET_PICKUP_POINT |
| CreateCheckoutOrderType | 9 | MUTATION_CREATE_CHECKOUT_ORDER |
| SetCartItemsType | 10 | MUTATION_SET_CART_ITEMS |
| OrderCompletedType | 11 | MUTATION_ORDER_COMPLETED |
Persisted events keep original numeric codes; reconstruction simply casts to `MutationType`.
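
A bare cast would also accept codes that are not in the table (for example `3`), so a small validation step may be worth having during replay. The sketch below uses a hand-written `MutationType` in place of the generated enum; `fromLegacy` is a hypothetical helper, and `uint32` as the legacy code type is an assumption.

```go
package main

import "fmt"

// MutationType mirrors the proto enum by hand for this sketch; the generated
// Go code would provide its own type and constants.
type MutationType int32

// mutationNames lists every assigned code from the mapping table.
var mutationNames = map[MutationType]string{
	1:  "MUTATION_ADD_REQUEST",
	2:  "MUTATION_ADD_ITEM",
	4:  "MUTATION_REMOVE_ITEM",
	5:  "MUTATION_REMOVE_DELIVERY",
	6:  "MUTATION_CHANGE_QUANTITY",
	7:  "MUTATION_SET_DELIVERY",
	8:  "MUTATION_SET_PICKUP_POINT",
	9:  "MUTATION_CREATE_CHECKOUT_ORDER",
	10: "MUTATION_SET_CART_ITEMS",
	11: "MUTATION_ORDER_COMPLETED",
}

// String returns the enum name, or a placeholder for unassigned codes.
func (m MutationType) String() string {
	if name, ok := mutationNames[m]; ok {
		return name
	}
	return fmt.Sprintf("MUTATION_UNKNOWN(%d)", int32(m))
}

// fromLegacy casts a persisted numeric code and rejects unassigned values
// such as 0, 3, or anything above 11.
func fromLegacy(code uint32) (MutationType, error) {
	m := MutationType(code)
	if _, ok := mutationNames[m]; !ok {
		return 0, fmt.Errorf("unknown legacy message type %d", code)
	}
	return m, nil
}

func main() {
	for _, code := range []uint32{2, 3, 11} {
		m, err := fromLegacy(code)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(code, "->", m)
	}
}
```
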
---
## 7. Components To Remove / Replace
Remove (after migration complete):
- `remote-grain.go`
- `rpc-server.go`
- Any packet/frame-specific types solely used by the above (search: `FrameWithPayload`, `RemoteHandleMutation`, `RemoteGetState` where not reused by disk or internal logic).
- The constants representing network frame types in `synced-pool.go` (RemoteNegotiate, AckChange, etc.) replaced by gRPC calls.
- netpool usage for remote cart channel (control plane also no longer needs `Connection` abstraction).
Retain (until reworked or optionally cleaned later):
- `message.go` (for persistence)
- `message-handler.go`
- `cart-grain.go`
- `messages.proto` (underlying mutation messages)
- HTTP API server and REST handlers.
---
## 8. New / Modified Components
New files (planned):
- `proto/cart_actor.proto`
- `proto/control_plane.proto`
- `grpc/cart_actor_server.go` (server impl)
- `grpc/cart_actor_client.go` (client adapter implementing `Grain`)
- `grpc/control_plane_server.go`
- `grpc/control_plane_client.go`
- `grpc/interceptors.go` (metrics, logging, optional tracing hooks)
- `remote_grain_grpc.go` (adapter bridging existing interfaces)
- `control_plane_adapter.go` (replaces frame handlers in `SyncedPool`)
Modified:
- `synced-pool.go` (remote host management now uses gRPC clients; negotiation logic updated)
- `main.go` (initialize both gRPC services on startup)
- `go.mod` (add `google.golang.org/grpc`)
---
## 9. Step-by-Step Migration Plan
1. Add proto files and generate Go code (for example `protoc --go_out=. --go-grpc_out=. proto/cart_actor.proto proto/control_plane.proto`).
2. Implement `CartActorServer`:
- Translate `MutationRequest` to `Message`.
- Use existing handler registry for payload encode/decode.
- Return JSON cart state.
3. Implement `CartActorClient` wrapper (`RemoteGrainGRPC`) implementing:
- `HandleMessage`: Build envelope, call `Mutate`.
- `GetCurrentState`: Call `GetState`.
4. Implement `ControlPlaneServer` with methods:
- `Ping`: returns host + time.
- `Negotiate`: merge host lists; emulate old logic.
- `GetCartIds`: iterate local grains.
- `ConfirmOwner`: replicate quorum flow (accept always; error path for future).
- `Closing`: schedule remote removal.
5. Implement `ControlPlaneClient` used inside `SyncedPool.AddRemote`.
6. Refactor `SyncedPool`:
- Replace frame handlers registration with gRPC client calls.
- Replace `Server.AddHandler(...)` start-up with launching gRPC server.
- Implement periodic health checks using `Ping`.
7. Remove old connection constructs for 1337/1338.
8. Metrics:
- Add unary interceptor capturing duration and status.
- Replace packet counters with `cart_grpc_mutate_calls_total`, `cart_grpc_control_calls_total`, histograms for latency.
9. Update `main.go` to start:
- gRPC server(s).
- HTTP server as before.
10. Delete legacy files & update README build instructions.
11. Load testing & profiling on Raspberry Pi hardware (or ARM emulation).
12. Final cleanup & dead code removal (search for now-unused constants & structs).
13. Tag release.
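
Step 2 (translating a `MutationRequest` and dispatching through a handler registry) could be sketched as below. The request and reply structs are hand-written stand-ins for the generated types, `newDemoServer` and its toy handler are hypothetical, and the 200/400/500 status codes are assumptions; the plan does not define the `status_code` values.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// MutationType mirrors the proto enum; only two values are needed here.
type MutationType int32

const (
	MutationAddItem    MutationType = 2
	MutationRemoveItem MutationType = 4
)

// mutationRequest and mutationReply stand in for the generated messages.
type mutationRequest struct {
	CartID  string
	Type    MutationType
	Payload []byte
}

type mutationReply struct {
	StatusCode int32
	Payload    []byte
}

// handler decodes one mutation payload, applies it, and returns JSON state.
type handler func(cartID string, payload []byte) ([]byte, error)

// cartActorServer holds a dispatch table that is built once at start-up.
type cartActorServer struct {
	handlers map[MutationType]handler
}

// mutate finds the handler with a single map lookup (no reflection).
func (s *cartActorServer) mutate(req mutationRequest) mutationReply {
	if req.CartID == "" {
		return mutationReply{StatusCode: 400, Payload: []byte("missing cart_id")}
	}
	h, ok := s.handlers[req.Type]
	if !ok {
		msg := fmt.Sprintf("unknown mutation type %d", req.Type)
		return mutationReply{StatusCode: 400, Payload: []byte(msg)}
	}
	state, err := h(req.CartID, req.Payload)
	if err != nil {
		return mutationReply{StatusCode: 500, Payload: []byte(err.Error())}
	}
	return mutationReply{StatusCode: 200, Payload: state}
}

// newDemoServer registers one toy handler that treats the payload as a SKU.
func newDemoServer() *cartActorServer {
	addItem := func(cartID string, payload []byte) ([]byte, error) {
		if len(payload) == 0 {
			return nil, errors.New("empty payload")
		}
		return json.Marshal(struct {
			CartID string `json:"cartId"`
			Sku    string `json:"sku"`
		}{cartID, string(payload)})
	}
	return &cartActorServer{handlers: map[MutationType]handler{MutationAddItem: addItem}}
}

func main() {
	srv := newDemoServer()
	reply := srv.mutate(mutationRequest{CartID: "abc", Type: MutationAddItem, Payload: []byte("sku-1")})
	fmt.Println(reply.StatusCode, string(reply.Payload)) // prints: 200 {"cartId":"abc","sku":"sku-1"}
}
```

In the real server the handler would unmarshal the payload into the matching message from `messages.proto` and forward it to the grain rather than echo a SKU.
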
---
## 10. Performance Considerations (Raspberry Pi Focus)
- Single `*grpc.ClientConn` per remote host (HTTP/2 multiplexing) to reduce file descriptor and handshake overhead.
- Use small keepalive pings (optional) only if connections drop; default may suffice.
- Avoid reflection / dynamic dispatch in hot path: pre-build a mapping from `MutationType` to handler function.
- Reuse byte buffers:
- Implement a `sync.Pool` for mutation serialization to reduce GC pressure.
- Enforce per-RPC deadlines (e.g. 300-400ms) to avoid pile-ups.
- Backpressure:
- Before dispatch: if local grain pool at capacity and target grain is remote, abort early with 503 to caller (optional).
- Disable gRPC compression for small payloads (mutation messages are small). Enable compression only when the payload exceeds a threshold (e.g. 8KB).
- Compile with `-ldflags="-s -w"` in production to reduce binary size (optional).
- Leave `GOMAXPROCS` at its default (the number of CPU cores) unless profiling suggests otherwise; the default usually works well on a Pi, but monitor.
- Use histograms with limited buckets to reduce Prometheus cardinality.
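
Two of the items above, buffer reuse via `sync.Pool` and per-RPC deadlines, can be sketched with the standard library alone. The helper names (`encodePooled`, `callWithDeadline`, `slowRemote`, `runDemo`) are hypothetical, and the 20ms deadline is chosen only to keep the demo fast; the plan's suggested range is 300-400ms.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// bufPool recycles scratch buffers used while serializing mutations.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// encodePooled runs write against a pooled buffer and returns a copy of the
// bytes, so the buffer can be returned to the pool safely.
func encodePooled(write func(buf *bytes.Buffer) error) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	if err := write(buf); err != nil {
		return nil, err
	}
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}

// callWithDeadline bounds a single remote call with its own timeout.
func callWithDeadline(parent context.Context, timeout time.Duration, call func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return call(ctx)
}

// slowRemote simulates a node that answers after delay unless ctx ends first.
func slowRemote(delay time.Duration) func(ctx context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(delay):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runDemo encodes one payload, then makes one slow and one fast call.
func runDemo() (payload string, slowTimedOut bool, fastErr error) {
	data, err := encodePooled(func(buf *bytes.Buffer) error {
		_, werr := buf.WriteString("sku-1")
		return werr
	})
	if err != nil {
		return "", false, err
	}
	ctx := context.Background()
	slowErr := callWithDeadline(ctx, 20*time.Millisecond, slowRemote(time.Second))
	fastErr = callWithDeadline(ctx, 300*time.Millisecond, slowRemote(time.Millisecond))
	return string(data), errors.Is(slowErr, context.DeadlineExceeded), fastErr
}

func main() {
	payload, slowTimedOut, fastErr := runDemo()
	fmt.Println(payload, slowTimedOut, fastErr)
}
```

The copy in `encodePooled` trades one allocation for safety; the pool still avoids repeatedly growing the scratch buffer.
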
---
## 11. Testing Strategy
Unit:
- Message type mapping tests (legacy -> enum).
- Envelope roundtrip: Original proto -> payload -> gRPC -> server decode -> internal Message.
Integration:
- Two-node cluster simulation:
- Mutate cart on Node A, ownership moves, verify remote access from Node B.
- Quorum failure simulation (temporarily reject `ConfirmOwner`).
- Control plane negotiation: start nodes in staggered order, assert final membership.
Load/Perf:
- Benchmark local mutation vs remote mutation latency.
- High concurrency test (N goroutines each performing X mutations).
- Memory profiling (ensure no large buffer retention).
Failure Injection:
- Kill a node mid-mutation; client call should timeout and not corrupt local state.
- Simulated network partition: drop `Ping` replies; ensure host removal path triggers.
---
## 12. Rollback Strategy
Because no mixed-version compatibility is provided, rollback = redeploy previous version containing legacy protocol:
1. Stop all new-version pods.
2. Deploy old version cluster-wide.
3. No data migration needed (event persistence unaffected).
Note: Avoid partial upgrades; perform full rolling restart quickly to prevent split-brain (new nodes won't talk to old nodes).
---
## 13. Risks & Mitigations
| Risk | Description | Mitigation |
|------|-------------|------------|
| Full-cluster restart required | No mixed compatibility | Schedule maintenance window |
| gRPC adds CPU overhead | Envelope + marshaling cost | Buffer reuse, keep small messages uncompressed |
| Ownership race | Timing differences after refactor | Add explicit logs + tests around `RequestOwnership` path |
| Hidden dependency on frame-level status codes | Some code may assume `FrameWithPayload` fields | Wrap gRPC responses into minimal compatibility structs until fully removed |
| Memory growth | Connection reuse & pooled buffers not implemented initially | Add `sync.Pool` & track memory via pprof early |
---
## 14. Logging & Observability
- Structured log entries for:
- Ownership changes
- Negotiation rounds
- Remote spawn events
- Mutation failures (with cart id, mutation type)
- Metrics:
- `cart_grpc_mutate_duration_seconds` (histogram)
- `cart_grpc_mutate_errors_total`
- `cart_grpc_control_duration_seconds`
- `cart_remote_hosts` (gauge)
- Retain existing grain counts.
- Optional future: OpenTelemetry tracing (span per remote mutation).
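
The latency metrics above would normally come from a metrics library wired into a gRPC unary interceptor. As a library-agnostic illustration of "duration plus error count with a small, fixed set of buckets", here is a standard-library sketch; `latencyHistogram` and its methods are hypothetical, and the counts are per-bucket rather than cumulative as in Prometheus.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// latencyHistogram counts observations in a few fixed buckets (in seconds).
type latencyHistogram struct {
	mu       sync.Mutex
	bounds   []float64 // ascending upper bounds, in seconds
	counts   []uint64  // len(bounds)+1; the last slot is the overflow bucket
	failures uint64
}

func newLatencyHistogram(bounds ...float64) *latencyHistogram {
	return &latencyHistogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe records one call duration and whether the call failed.
func (h *latencyHistogram) observe(seconds float64, failed bool) {
	idx := len(h.bounds) // default to the overflow bucket
	for i, b := range h.bounds {
		if seconds <= b {
			idx = i
			break
		}
	}
	h.mu.Lock()
	h.counts[idx]++
	if failed {
		h.failures++
	}
	h.mu.Unlock()
}

// timeCall wraps a call the way a unary interceptor would wrap a handler.
func (h *latencyHistogram) timeCall(call func() error) error {
	start := time.Now()
	err := call()
	h.observe(time.Since(start).Seconds(), err != nil)
	return err
}

// snapshot returns a copy of the bucket counts and the failure count.
func (h *latencyHistogram) snapshot() ([]uint64, uint64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	return append([]uint64(nil), h.counts...), h.failures
}

func main() {
	h := newLatencyHistogram(0.005, 0.05, 0.5)
	if err := h.timeCall(func() error { return nil }); err != nil {
		fmt.Println("call failed:", err)
	}
	h.observe(0.2, true) // a slow, failed call recorded directly
	counts, failures := h.snapshot()
	fmt.Println(counts, failures)
}
```
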
---
## 15. Future Enhancements (Post-Migration)
- Replace JSON state with `CartState` proto and provide streaming watch API.
- mTLS between nodes (certificate rotation via K8s Secret or SPIRE).
- Distributed tracing integration.
- Ownership leasing with TTL and optimistic renewal.
- Delta replication or CRDT-based conflict resolution for experimentation.
---
## 16. Task Breakdown & Estimates
| Task | Estimate |
|------|----------|
| Proto definitions & generation | 0.5d |
| CartActor server/client | 1.0d |
| ControlPlane server/client | 1.0d |
| SyncedPool refactor | 1.0d |
| Metrics & interceptors | 0.5d |
| Remove legacy code & cleanup | 0.5d |
| Tests (unit + integration) | 1.5d |
| Benchmark & tuning | 0.5-1.0d |
| Total | ~6-7d |
---
## 17. Open Questions (Confirm Before Implementation)
1. Combine both services on a single port (simplify ops) or keep dual-port first? (Default here: keep dual, but easy to merge.)
2. Minimum Go version remains 1.24.x; is it acceptable to add the latest `google.golang.org/grpc`?
3. Accept adding `sync.Pool` micro-optimizations in first pass or postpone?
---
## 18. Acceptance Criteria
- All previous integration tests (adjusted to gRPC) pass.
- Cart operations (add, remove, delivery, checkout) function across at least a 2-node cluster.
- Control plane negotiation forms consistent host list.
- Latency for a remote mutation does not degrade beyond an acceptable threshold (define baseline before merge).
- Legacy networking code fully removed.
---
## 19. Next Steps (If Approved)
1. Implement proto files and commit.
2. Scaffold server & client code.
3. Refactor `SyncedPool` and `main.go`.
4. Add metrics and tests.
5. Run benchmark on target Pi hardware.
6. Review & merge.
---
End of Plan.