go-cart-actor/GRPC-MIGRATION-PLAN.md
2025-10-10 06:45:23 +00:00

gRPC Migration Plan

File: GRPC-MIGRATION-PLAN.md
Author: (Generated plan)
Status: Draft for review
Target Release: Next major version (breaking change; no mixed compatibility)


1. Overview

This document describes the full migration of the current custom TCP frame-based protocol (both the cart mutation/state channel on port 1337 and the control plane on port 1338) to gRPC. We will remove all legacy packet framing (FrameWithPayload, RemoteGrain, GenericListener handlers for these two ports) and replace them with two gRPC services:

  1. Cart Actor Service (mutations + state retrieval)
  2. Control Plane Service (cluster membership, negotiation, ownership change, lifecycle)

We intentionally keep:

  • Internal CartGrain logic, message storage format, disk persistence, and JSON cart serialization.
  • Existing message type numeric mapping for backward compatibility with persisted event logs.
  • HTTP/REST API layer unchanged (it still consumes JSON state from the local/remote grain pipeline).

We do NOT implement mixed-version compatibility; migration occurs atomically (cluster restart with new image).


2. Goals

  • Remove custom binary frame protocol & simplify maintenance.
  • Provide clearer, strongly defined interfaces via .proto schemas.
  • Improve observability via gRPC interceptors (metrics & tracing hooks).
  • Reduce per-call overhead compared with the current manual connection pooling + handwritten framing (HTTP/2 multiplexing + connection reuse).
  • Prepare groundwork for future enhancements (streaming, typed state, event streaming) without rewriting again.

3. Non-Goals (Phase 1)

  • Converting the cart state payload from JSON to a strongly typed proto.
  • Introducing authentication / mTLS (may be added later).
  • Changing persistence or replay format.
  • Changing the HTTP API contract.
  • Implementing streaming watchers or push updates.

4. Architecture After Migration

Ports:

  • :1337 → gRPC CartActor service.
  • :1338 → gRPC ControlPlane service.

Each node:

  • Runs both gRPC services. They could share a single listener (one server with both services registered), but we keep two separate ports initially to minimize operational surprise; the ports can be merged later.
  • Maintains a connection pool of *grpc.ClientConn objects keyed by remote host and port. A *grpc.ClientConn targets a single address, so one connection per remote host can serve both services only after the ports are merged.

Call Flow (Mutation):

  1. HTTP request hits PoolServer.
  2. SyncedPool.getGrain(cartId):
    • Local: direct invocation.
    • Remote: uses RemoteGrainGRPC (new) which invokes CartActor.Mutate.
  3. Response JSON returned unchanged.
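The routing step in the flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the `Grain` method signature, the `localGrain`, `remoteGrain`, and `pool` types, and the `mutate` callback are hypothetical stand-ins, and the place where `CartActor.Mutate` would be invoked is stubbed with a plain function.

```go
package main

import (
	"errors"
	"fmt"
)

// Grain is a hypothetical, simplified form of the interface that both local
// grains and the planned RemoteGrainGRPC adapter would satisfy.
type Grain interface {
	HandleMessage(mutationType int32, payload []byte) ([]byte, error)
}

// localGrain applies the mutation in-process and returns JSON state.
type localGrain struct{ cartID string }

func (g *localGrain) HandleMessage(mutationType int32, payload []byte) ([]byte, error) {
	return []byte(fmt.Sprintf(`{"id":%q,"source":"local"}`, g.cartID)), nil
}

// remoteGrain stands in for RemoteGrainGRPC. The mutate field marks the spot
// where the CartActor.Mutate RPC would be called.
type remoteGrain struct {
	cartID string
	mutate func(cartID string, mutationType int32, payload []byte) ([]byte, error)
}

func (g *remoteGrain) HandleMessage(mutationType int32, payload []byte) ([]byte, error) {
	if g.mutate == nil {
		return nil, errors.New("remote grain has no transport configured")
	}
	return g.mutate(g.cartID, mutationType, payload)
}

// pool is a hypothetical stand-in for the ownership lookup in SyncedPool.
type pool struct {
	local  map[string]*localGrain
	remote map[string]*remoteGrain
}

// getGrain returns the local grain when this node owns the cart and the
// remote adapter otherwise.
func (p *pool) getGrain(cartID string) (Grain, error) {
	if g, ok := p.local[cartID]; ok {
		return g, nil
	}
	if g, ok := p.remote[cartID]; ok {
		return g, nil
	}
	return nil, fmt.Errorf("no owner known for cart %q", cartID)
}

func main() {
	p := &pool{
		local: map[string]*localGrain{"cart-a": &localGrain{cartID: "cart-a"}},
		remote: map[string]*remoteGrain{"cart-b": &remoteGrain{
			cartID: "cart-b",
			mutate: func(id string, _ int32, _ []byte) ([]byte, error) {
				return []byte(fmt.Sprintf(`{"id":%q,"source":"remote"}`, id)), nil
			},
		}},
	}
	for _, id := range []string{"cart-a", "cart-b"} {
		g, err := p.getGrain(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		state, _ := g.HandleMessage(2, nil)
		fmt.Println(string(state))
	}
}
```

Because both paths return the same JSON bytes through one interface, the HTTP layer does not need to know whether the grain was local or remote.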

Control Plane Flow:

  • Discovery (K8s watch) still triggers AddRemote(host).
  • Instead of custom Ping, Negotiate, etc. via frames, call gRPC methods on ControlPlane service.
  • Ownership changes use ConfirmOwner RPC.

5. Proto Design

5.1 Cart Actor Proto (Envelope Pattern)

We keep an envelope with bytes payload holding the serialized underlying cart mutation proto (existing types in messages.proto). This minimizes churn.

Proto sketch:

syntax = "proto3";
package cart;
option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

enum MutationType {
  MUTATION_TYPE_UNSPECIFIED = 0;
  MUTATION_ADD_REQUEST = 1;
  MUTATION_ADD_ITEM = 2;
  MUTATION_REMOVE_ITEM = 4;
  MUTATION_REMOVE_DELIVERY = 5;
  MUTATION_CHANGE_QUANTITY = 6;
  MUTATION_SET_DELIVERY = 7;
  MUTATION_SET_PICKUP_POINT = 8;
  MUTATION_CREATE_CHECKOUT_ORDER = 9;
  MUTATION_SET_CART_ITEMS = 10;
  MUTATION_ORDER_COMPLETED = 11;
}

message MutationRequest {
  string cart_id = 1;
  MutationType type = 2;
  bytes payload = 3;          // Serialized specific mutation proto
  int64 client_timestamp = 4; // Optional; server fills if zero
}

message MutationReply {
  int32 status_code = 1;
  bytes payload = 2; // JSON cart state or error string
}

message StateRequest {
  string cart_id = 1;
}

message StateReply {
  int32 status_code = 1;
  bytes payload = 2; // JSON cart state
}

service CartActor {
  rpc Mutate(MutationRequest) returns (MutationReply);
  rpc GetState(StateRequest) returns (StateReply);
}
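On the server side, the envelope has to be validated and turned into the internal message before it reaches the grain. A minimal sketch of that translation follows, assuming hand-written mirrors of the generated types. `MutationRequest` here is not the protoc output, `Message` and its `uint16` type field are hypothetical, and the timestamp is assumed to be Unix seconds (the plan does not state the unit).

```go
package main

import (
	"errors"
	"fmt"
)

// MutationRequest mirrors the fields of the proto message sketched above.
// In real code this type would be generated by protoc.
type MutationRequest struct {
	CartID          string
	Type            int32
	Payload         []byte
	ClientTimestamp int64
}

// Message is a hypothetical stand-in for the internal message type.
type Message struct {
	Type      uint16
	TimeStamp int64
	Payload   []byte
}

// toMessage validates the envelope and fills in the timestamp when the
// client left it at zero ("server fills if zero").
func toMessage(req *MutationRequest, nowUnix int64) (Message, error) {
	if req == nil {
		return Message{}, errors.New("nil request")
	}
	if req.CartID == "" {
		return Message{}, errors.New("cart_id is required")
	}
	if req.Type <= 0 || req.Type > 65535 {
		return Message{}, fmt.Errorf("unspecified or out-of-range mutation type %d", req.Type)
	}
	ts := req.ClientTimestamp
	if ts == 0 {
		ts = nowUnix
	}
	return Message{Type: uint16(req.Type), TimeStamp: ts, Payload: req.Payload}, nil
}

func main() {
	req := &MutationRequest{CartID: "cart-1", Type: 2, Payload: []byte("serialized AddItem")}
	msg, err := toMessage(req, 1700000000)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("type=%d timestamp=%d payload=%d bytes\n", msg.Type, msg.TimeStamp, len(msg.Payload))
}
```

Passing the current time in as a parameter keeps the function deterministic, which makes the envelope round-trip easy to unit test.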

5.2 Control Plane Proto

syntax = "proto3";
package control;
option go_package = "git.tornberg.me/go-cart-actor/proto;proto";

message Empty {}

message PingReply {
  string host = 1;
  int64 unix_time = 2;
}

message NegotiateRequest {
  repeated string known_hosts = 1;
}
message NegotiateReply {
  repeated string hosts = 1; // Healthy hosts returned
}

message CartIdsReply {
  repeated string cart_ids = 1;
}

message OwnerChangeRequest {
  string cart_id = 1;
  string new_host = 2;
}
message OwnerChangeAck {
  bool accepted = 1;
  string message = 2;
}

message ClosingNotice {
  string host = 1;
}

service ControlPlane {
  rpc Ping(Empty) returns (PingReply);
  rpc Negotiate(NegotiateRequest) returns (NegotiateReply);
  rpc GetCartIds(Empty) returns (CartIdsReply);
  rpc ConfirmOwner(OwnerChangeRequest) returns (OwnerChangeAck);
  rpc Closing(ClosingNotice) returns (OwnerChangeAck);
}
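The plan does not spell out how the legacy negotiation merges host lists, so the following is only one plausible reading of what a Negotiate handler might do with `known_hosts`: return the de-duplicated union of the caller's hosts and the local ones. The `mergeHosts` helper is hypothetical, and it does not filter by health, which the `NegotiateReply` comment implies the real handler would do.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeHosts returns the sorted union of the hosts this node already knows
// and the hosts reported by a peer, skipping duplicates and empty entries.
func mergeHosts(local, reported []string) []string {
	seen := make(map[string]struct{}, len(local)+len(reported))
	merged := make([]string, 0, len(local)+len(reported))
	for _, list := range [][]string{local, reported} {
		for _, h := range list {
			if h == "" {
				continue
			}
			if _, dup := seen[h]; dup {
				continue
			}
			seen[h] = struct{}{}
			merged = append(merged, h)
		}
	}
	sort.Strings(merged)
	return merged
}

func main() {
	fmt.Println(mergeHosts([]string{"cart-1", "cart-0"}, []string{"cart-2", "cart-1"}))
	// Prints: [cart-0 cart-1 cart-2]
}
```

Sorting the result is a design choice made for this sketch: it makes replies deterministic, so membership assertions in integration tests do not depend on arrival order.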

6. Message Type Mapping

| Legacy Constant | Numeric | New Enum Value |
| --- | --- | --- |
| AddRequestType | 1 | MUTATION_ADD_REQUEST |
| AddItemType | 2 | MUTATION_ADD_ITEM |
| RemoveItemType | 4 | MUTATION_REMOVE_ITEM |
| RemoveDeliveryType | 5 | MUTATION_REMOVE_DELIVERY |
| ChangeQuantityType | 6 | MUTATION_CHANGE_QUANTITY |
| SetDeliveryType | 7 | MUTATION_SET_DELIVERY |
| SetPickupPointType | 8 | MUTATION_SET_PICKUP_POINT |
| CreateCheckoutOrderType | 9 | MUTATION_CREATE_CHECKOUT_ORDER |
| SetCartItemsType | 10 | MUTATION_SET_CART_ITEMS |
| OrderCompletedType | 11 | MUTATION_ORDER_COMPLETED |

Persisted events keep original numeric codes; reconstruction simply casts to MutationType.
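A bare cast would also accept codes that have no enum value (for example 3, which is unused in the mapping). The sketch below shows one way to guard the cast during replay. The constant names are shortened for readability and are not the identifiers protoc would generate, and the `uint16` width of the legacy code is an assumption.

```go
package main

import "fmt"

// MutationType mirrors the proto enum. The names are illustrative only.
type MutationType int32

const (
	MutationUnspecified         MutationType = 0
	MutationAddRequest          MutationType = 1
	MutationAddItem             MutationType = 2
	MutationRemoveItem          MutationType = 4
	MutationRemoveDelivery      MutationType = 5
	MutationChangeQuantity      MutationType = 6
	MutationSetDelivery         MutationType = 7
	MutationSetPickupPoint      MutationType = 8
	MutationCreateCheckoutOrder MutationType = 9
	MutationSetCartItems        MutationType = 10
	MutationOrderCompleted      MutationType = 11
)

// knownMutations lists every value that has a legacy counterpart.
var knownMutations = map[MutationType]string{
	MutationAddRequest:          "MUTATION_ADD_REQUEST",
	MutationAddItem:             "MUTATION_ADD_ITEM",
	MutationRemoveItem:          "MUTATION_REMOVE_ITEM",
	MutationRemoveDelivery:      "MUTATION_REMOVE_DELIVERY",
	MutationChangeQuantity:      "MUTATION_CHANGE_QUANTITY",
	MutationSetDelivery:         "MUTATION_SET_DELIVERY",
	MutationSetPickupPoint:      "MUTATION_SET_PICKUP_POINT",
	MutationCreateCheckoutOrder: "MUTATION_CREATE_CHECKOUT_ORDER",
	MutationSetCartItems:        "MUTATION_SET_CART_ITEMS",
	MutationOrderCompleted:      "MUTATION_ORDER_COMPLETED",
}

// fromLegacy casts a persisted numeric code to MutationType and rejects
// codes that never had a legacy constant.
func fromLegacy(code uint16) (MutationType, error) {
	mt := MutationType(code)
	if _, ok := knownMutations[mt]; !ok {
		return MutationUnspecified, fmt.Errorf("unknown legacy message type %d", code)
	}
	return mt, nil
}

func main() {
	for _, code := range []uint16{2, 3, 11} {
		mt, err := fromLegacy(code)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%d -> %s\n", code, knownMutations[mt])
	}
}
```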


7. Components To Remove / Replace

Remove (after migration complete):

  • remote-grain.go
  • rpc-server.go
  • Any packet/frame-specific types solely used by the above (search: FrameWithPayload, RemoteHandleMutation, RemoteGetState where not reused by disk or internal logic).
  • The constants representing network frame types in synced-pool.go (RemoteNegotiate, AckChange, etc.) replaced by gRPC calls.
  • netpool usage for remote cart channel (control plane also no longer needs Connection abstraction).

Retain (until reworked or optionally cleaned later):

  • message.go (for persistence)
  • message-handler.go
  • cart-grain.go
  • messages.proto (underlying mutation messages)
  • HTTP API server and REST handlers.

8. New / Modified Components

New files (planned):

  • proto/cart_actor.proto
  • proto/control_plane.proto
  • grpc/cart_actor_server.go (server impl)
  • grpc/cart_actor_client.go (client adapter implementing Grain)
  • grpc/control_plane_server.go
  • grpc/control_plane_client.go
  • grpc/interceptors.go (metrics, logging, optional tracing hooks)
  • remote_grain_grpc.go (adapter bridging existing interfaces)
  • control_plane_adapter.go (replaces frame handlers in SyncedPool)

Modified:

  • synced-pool.go (remote host management now uses gRPC clients; negotiation logic updated)
  • main.go (initialize both gRPC services on startup)
  • go.mod (add google.golang.org/grpc)

9. Step-by-Step Migration Plan

  1. Add proto files and generate Go code (protoc with the --go_out and --go-grpc_out options).
  2. Implement CartActorServer:
    • Translate MutationRequest to Message.
    • Use existing handler registry for payload encode/decode.
    • Return JSON cart state.
  3. Implement CartActorClient wrapper (RemoteGrainGRPC) implementing:
    • HandleMessage: Build envelope, call Mutate.
    • GetCurrentState: Call GetState.
  4. Implement ControlPlaneServer with methods:
    • Ping: returns host + time.
    • Negotiate: merge host lists; emulate old logic.
    • GetCartIds: iterate local grains.
    • ConfirmOwner: replicate quorum flow (accept always; error path for future).
    • Closing: schedule remote removal.
  5. Implement ControlPlaneClient used inside SyncedPool.AddRemote.
  6. Refactor SyncedPool:
    • Replace frame handlers registration with gRPC client calls.
    • Replace Server.AddHandler(...) start-up with launching gRPC server.
    • Implement periodic health checks using Ping.
  7. Remove old connection constructs for 1337/1338.
  8. Metrics:
    • Add unary interceptor capturing duration and status.
    • Replace packet counters with cart_grpc_mutate_calls_total, cart_grpc_control_calls_total, histograms for latency.
  9. Update main.go to start:
    • gRPC server(s).
    • HTTP server as before.
  10. Delete legacy files & update README build instructions.
  11. Load testing & profiling on Raspberry Pi hardware (or ARM emulation).
  12. Final cleanup & dead code removal (search for now-unused constants & structs).
  13. Tag release.
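Step 6 calls for periodic health checks using Ping. One possible shape for that loop is sketched below, assuming a host is dropped after a fixed number of consecutive failed pings. The `pinger` interface, `fakePinger`, `healthTracker`, and the failure threshold are all hypothetical; the real check would call the generated ControlPlane client instead.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pinger is a hypothetical narrow view of the ControlPlane client.
type pinger interface {
	Ping(ctx context.Context) error
}

// fakePinger simulates a peer that is either healthy or failing.
type fakePinger struct{ fail bool }

func (f *fakePinger) Ping(ctx context.Context) error {
	if f.fail {
		return errors.New("ping failed")
	}
	return ctx.Err()
}

// healthTracker counts consecutive failed pings per host.
type healthTracker struct {
	maxFailures int
	timeout     time.Duration
	failures    map[string]int
}

func newHealthTracker(maxFailures int, timeout time.Duration) *healthTracker {
	return &healthTracker{maxFailures: maxFailures, timeout: timeout, failures: map[string]int{}}
}

// check pings one host with a deadline. It returns true once the host has
// failed maxFailures times in a row; a successful ping resets the counter.
func (h *healthTracker) check(host string, p pinger) (remove bool) {
	ctx, cancel := context.WithTimeout(context.Background(), h.timeout)
	defer cancel()
	if err := p.Ping(ctx); err != nil {
		h.failures[host]++
		return h.failures[host] >= h.maxFailures
	}
	delete(h.failures, host)
	return false
}

func main() {
	tracker := newHealthTracker(3, 200*time.Millisecond)
	peer := &fakePinger{fail: true}

	// A ticker drives the periodic check; the interval is shortened for the demo.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		if tracker.check("cart-1", peer) {
			fmt.Println("removing unhealthy host cart-1")
			return
		}
	}
}
```

Requiring several consecutive failures before removal is a choice made for this sketch; it avoids dropping a host because of a single slow reply.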

10. Performance Considerations (Raspberry Pi Focus)

  • Single *grpc.ClientConn per remote host (HTTP/2 multiplexing) to reduce file descriptor and handshake overhead.
  • Add lightweight keepalive pings (optional) only if idle connections are observed to drop; the defaults may suffice.
  • Avoid reflection / dynamic dispatch in hot path: pre-build a mapping from MutationType to handler function.
  • Reuse byte buffers:
    • Implement a sync.Pool for mutation serialization to reduce GC pressure.
  • Enforce per-RPC deadlines (e.g. 300-400 ms) to avoid pile-ups.
  • Backpressure:
    • Before dispatch: if local grain pool at capacity and target grain is remote, abort early with 503 to caller (optional).
  • Disable gRPC compression for small payloads (mutation messages are small). Enable compression only when the payload exceeds a threshold (e.g. 8 KB).
  • Compile with -ldflags="-s -w" in production to reduce binary size (optional).
  • Tune GOMAXPROCS to the number of CPU cores only if profiling shows a benefit; the Pi usually does well with the default, but monitor it.
  • Use histograms with limited buckets to reduce Prometheus cardinality.
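The per-RPC deadline bullet above can be illustrated with the standard context package. This is a sketch with a simulated call rather than a real gRPC client: `callWithDeadline`, `simulatedRPC`, and `timedOut` are hypothetical helpers, and the 350 ms value is just one point in the suggested range. With a real generated client, the context returned by `context.WithTimeout` would be passed as the first argument of the RPC method.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithDeadline runs call with a context that expires after timeout, so a
// slow peer cannot hold the caller indefinitely.
func callWithDeadline(parent context.Context, timeout time.Duration, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return call(ctx)
}

// simulatedRPC stands in for a remote mutation that takes latency to finish
// unless the context ends first.
func simulatedRPC(latency time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		timer := time.NewTimer(latency)
		defer timer.Stop()
		select {
		case <-timer.C:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// timedOut reports whether a simulated call with the given latency exceeds
// the given deadline (both in milliseconds).
func timedOut(latencyMs, timeoutMs int) bool {
	err := callWithDeadline(
		context.Background(),
		time.Duration(timeoutMs)*time.Millisecond,
		simulatedRPC(time.Duration(latencyMs)*time.Millisecond),
	)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("fast call timed out:", timedOut(5, 350))
	fmt.Println("slow call timed out:", timedOut(500, 350))
}
```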

11. Testing Strategy

Unit:

  • Message type mapping tests (legacy -> enum).
  • Envelope roundtrip: Original proto -> payload -> gRPC -> server decode -> internal Message.

Integration:

  • Two-node cluster simulation:
    • Mutate cart on Node A, ownership moves, verify remote access from Node B.
    • Quorum failure simulation (temporarily reject ConfirmOwner).
  • Control plane negotiation: start nodes in staggered order, assert final membership.

Load/Perf:

  • Benchmark local mutation vs remote mutation latency.
  • High concurrency test (N goroutines each performing X mutations).
  • Memory profiling (ensure no large buffer retention).

Failure Injection:

  • Kill a node mid-mutation; the client call should time out and not corrupt local state.
  • Simulated network partition: drop Ping replies; ensure host removal path triggers.

12. Rollback Strategy

Because no mixed-version compatibility is provided, rollback = redeploy previous version containing legacy protocol:

  1. Stop all new-version pods.
  2. Deploy old version cluster-wide.
  3. No data migration needed (event persistence unaffected).

Note: Avoid partial upgrades; perform the full restart quickly to prevent split-brain (new nodes won't talk to old nodes).


13. Risks & Mitigations

| Risk | Description | Mitigation |
| --- | --- | --- |
| Full-cluster restart required | No mixed compatibility | Schedule maintenance window |
| gRPC adds CPU overhead | Envelope + marshaling cost | Buffer reuse, keep small messages uncompressed |
| Ownership race | Timing differences after refactor | Add explicit logs + tests around RequestOwnership path |
| Hidden dependency on frame-level status codes | Some code may assume FrameWithPayload fields | Wrap gRPC responses into minimal compatibility structs until fully removed |
| Memory growth | Connection reuse & pooled buffers not implemented initially | Add sync.Pool & track memory via pprof early |

14. Logging & Observability

  • Structured log entries for:
    • Ownership changes
    • Negotiation rounds
    • Remote spawn events
    • Mutation failures (with cart id, mutation type)
  • Metrics:
    • cart_grpc_mutate_duration_seconds (histogram)
    • cart_grpc_mutate_errors_total
    • cart_grpc_control_duration_seconds
    • cart_remote_hosts (gauge)
    • Retain existing grain counts.
  • Optional future: OpenTelemetry tracing (span per remote mutation).
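The duration and error metrics listed above would typically be recorded in a unary interceptor. The sketch below shows the same wrap-and-measure pattern using only locally defined types, so it runs without the gRPC or Prometheus modules. `unaryHandler`, `callStats`, `echo`, `reject`, and `invoke` are hypothetical stand-ins; a real implementation would use `grpc.UnaryServerInterceptor` and Prometheus histograms and counters instead of in-memory maps.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// unaryHandler is a local stand-in for a gRPC unary handler.
type unaryHandler func(ctx context.Context, req any) (any, error)

// callStats accumulates per-method counters in memory.
type callStats struct {
	mu       sync.Mutex
	calls    map[string]int
	failures map[string]int
	total    map[string]time.Duration
}

func newCallStats() *callStats {
	return &callStats{
		calls:    map[string]int{},
		failures: map[string]int{},
		total:    map[string]time.Duration{},
	}
}

// wrap returns a handler that times next and records call count, error
// count, and cumulative duration under the given method name.
func (s *callStats) wrap(method string, next unaryHandler) unaryHandler {
	return func(ctx context.Context, req any) (any, error) {
		start := time.Now()
		resp, err := next(ctx, req)
		elapsed := time.Since(start)

		s.mu.Lock()
		defer s.mu.Unlock()
		s.calls[method]++
		s.total[method] += elapsed
		if err != nil {
			s.failures[method]++
		}
		return resp, err
	}
}

// echo and reject are trivial handlers used to exercise the wrapper.
func echo(_ context.Context, req any) (any, error)  { return req, nil }
func reject(_ context.Context, _ any) (any, error) { return nil, errors.New("rejected") }

// invoke calls h with a background context.
func invoke(h unaryHandler, req any) (any, error) {
	return h(context.Background(), req)
}

func main() {
	stats := newCallStats()
	mutate := stats.wrap("Mutate", echo)
	confirm := stats.wrap("ConfirmOwner", reject)

	invoke(mutate, "payload")
	invoke(confirm, nil)

	fmt.Println(stats.calls["Mutate"], stats.failures["Mutate"], stats.failures["ConfirmOwner"])
	// Prints: 1 0 1
}
```

Keying the counters by method name only, and not by cart id or host, keeps label cardinality bounded.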

15. Future Enhancements (Post-Migration)

  • Replace JSON state with CartState proto and provide streaming watch API.
  • mTLS between nodes (certificate rotation via K8s Secret or SPIRE).
  • Distributed tracing integration.
  • Ownership leasing with TTL and optimistic renewal.
  • Delta replication or CRDT-based conflict resolution for experimentation.

16. Task Breakdown & Estimates

| Task | Estimate |
| --- | --- |
| Proto definitions & generation | 0.5d |
| CartActor server/client | 1.0d |
| ControlPlane server/client | 1.0d |
| SyncedPool refactor | 1.0d |
| Metrics & interceptors | 0.5d |
| Remove legacy code & cleanup | 0.5d |
| Tests (unit + integration) | 1.5d |
| Benchmark & tuning | 0.5-1.0d |
| Total | ~6-7d |

17. Open Questions (Confirm Before Implementation)

  1. Combine both services on a single port (simplify ops) or keep dual-port first? (Default here: keep dual, but easy to merge.)
  2. Minimum Go version remains 1.24.x. Is it acceptable to add the latest google.golang.org/grpc?
  3. Accept adding sync.Pool micro-optimizations in first pass or postpone?

18. Acceptance Criteria

  • All previous integration tests (adjusted to gRPC) pass.
  • Cart operations (add, remove, delivery, checkout) function across at least a 2-node cluster.
  • Control plane negotiation forms consistent host list.
  • Latency for a remote mutation does not degrade beyond an acceptable threshold (define baseline before merge).
  • Legacy networking code fully removed.

19. Next Steps (If Approved)

  1. Implement proto files and commit.
  2. Scaffold server & client code.
  3. Refactor SyncedPool and main.go.
  4. Add metrics and tests.
  5. Run benchmark on target Pi hardware.
  6. Review & merge.

End of Plan.