gce: dynamically refresh GCE managed zones on node informer events#1164
arvindbr8 wants to merge 2 commits into
Currently, the CCM GCE cloud provider caches `managedZones` exactly once at startup and never refreshes it. If a GKE cluster spins up a new node pool in a newly enabled zone in the region, node lookups and persistent volume operations fail in that zone because GCE CCM's cache is stale.

This change introduces a thread-safe dynamic zone refresh mechanism:

1. Guard the `managedZones` cache with a `sync.RWMutex` to prevent data races.
2. Refactor the codebase to route all read accesses through a thread-safe getter (`getManagedZones()`).
3. Implement `refreshManagedZones()` using the mockable `ListZonesInRegion()` interface.
4. Trigger the refresh on-demand in `updateNodeZones` when a node registers in an unmanaged zone.
5. Add a comprehensive unit test `TestUpdateNodeZonesDynamicRefresh` to verify the dynamic zone refresh pipeline.
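For readers skimming the change, steps 1 to 3 can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `zoneCache`, `zoneLister`, `fakeLister`, and the `ListZonesInRegion(region string) ([]string, error)` signature are assumptions made for the example, and the real provider types differ.

```go
package main

import (
	"fmt"
	"sync"
)

// zoneLister is a hypothetical stand-in for the provider's zone-listing
// interface. The method name comes from the PR text; the signature is assumed.
type zoneLister interface {
	ListZonesInRegion(region string) ([]string, error)
}

// zoneCache is an illustrative type, not the real GCE cloud struct.
type zoneCache struct {
	region string
	lister zoneLister

	mu           sync.RWMutex // guards managedZones
	managedZones []string
}

// getManagedZones returns a copy under the read lock so callers never
// share the cached slice's backing array.
func (c *zoneCache) getManagedZones() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]string(nil), c.managedZones...)
}

// refreshManagedZones lists zones without holding the lock, then swaps the
// result in under the write lock. On error the old cache is kept.
func (c *zoneCache) refreshManagedZones() error {
	zones, err := c.lister.ListZonesInRegion(c.region)
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.managedZones = zones
	c.mu.Unlock()
	return nil
}

// fakeLister is a test double that returns a copy of a fixed zone list.
type fakeLister struct{ zones []string }

func (f *fakeLister) ListZonesInRegion(string) ([]string, error) {
	return append([]string(nil), f.zones...), nil
}

func main() {
	l := &fakeLister{zones: []string{"us-central1-a"}}
	c := &zoneCache{region: "us-central1", lister: l}
	_ = c.refreshManagedZones()

	// A new zone is enabled in the region; the cache is stale until refreshed.
	l.zones = append(l.zones, "us-central1-b")
	fmt.Println(c.getManagedZones()) // [us-central1-a]

	_ = c.refreshManagedZones()
	fmt.Println(c.getManagedZones()) // [us-central1-a us-central1-b]
}
```

Returning a copy from the getter is one possible design choice here; it trades a small allocation for not having to reason about callers mutating the cached slice.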
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: arvindbr8
Hi @arvindbr8. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test.
/ok-to-test
What this PR does / why we need it
This PR implements a dynamic zone refresh mechanism for the GCE cloud provider in the Cloud Controller Manager (CCM).
Currently, CCM caches the GCE zones it manages (`managedZones`) exactly once at startup and never refreshes them. When a new zone is enabled in the GCP region (e.g., when adding a node pool in a new zone), all zonal VM/node lookups (`getInstanceByName`) and persistent volume operations (`GetDiskByNameUnknownZone`) fail in the new zone because the cache is stale. Previously, this required a manual Kubernetes Control Plane (KCP) restart to get CCM to recognize the new zone.

How it solves it
We introduce a zero-downtime, thread-safe, on-demand zone refresh:
- Added a `sync.RWMutex` (`managedZonesLock`) to the `GCECloud` struct and refactored all read occurrences to use a thread-safe getter (`getManagedZones()`).
- Implemented `refreshManagedZones()` on top of the `ListZonesInRegion` interface instead of the legacy raw `getZonesForRegion` API, allowing standard GCE mock testing.
- Hooked the refresh into the node informer path (`updateNodeZones`). When a node registers in a zone that is missing from CCM's cache, CCM dynamically triggers `refreshManagedZones()` to fetch the expanded zone list.
- Refactored `updateNodeZones` to compute local state and manually release the `nodeZonesLock` before triggering any GCE API calls.

Special notes for your reviewer
Added `TestUpdateNodeZonesDynamicRefresh` in `gce_instances_test.go`, which simulates the node informer events pipeline and verifies the dynamic cache refresh.

cc: @YifeiZhuang @zhaoqsh @gnossen