gce: dynamically refresh GCE managed zones on node informer events#1164

Open
arvindbr8 wants to merge 2 commits into kubernetes:master from arvindbr8:fix-ccm-zone-refresh

@arvindbr8
Contributor

What this PR does / why we need it

This PR implements a dynamic zone refresh mechanism for the GCE cloud provider in the Cloud Controller Manager (CCM).
Currently, CCM caches the GCE zones it manages (managedZones) exactly once at startup and never refreshes them. When a new zone is enabled in the GCP region (e.g., when adding a node pool in a new zone), all zonal VM/node lookups (getInstanceByName) and persistent volume operations (GetDiskByNameUnknownZone) fail in the new zone because the cache is stale. Until now, the only fix was a manual Kubernetes Control Plane (KCP) restart so that CCM would recognize the new zone.

How it solves it

We introduce a zero-downtime, thread-safe, on-demand zone refresh:

  1. Thread Safety: Added sync.RWMutex (managedZonesLock) to the GCE Cloud struct and refactored all read occurrences to use a thread-safe getter (getManagedZones()).
  2. Mock-Friendly API: Refactored the zone listing refresh to use the modern ListZonesInRegion interface instead of the legacy raw getZonesForRegion API, allowing standard GCE mock testing.
  3. Event-Driven Trigger: Hooked the refresh into the node informer (updateNodeZones). When a node registers in a zone that is missing from CCM's cache, CCM dynamically triggers refreshManagedZones() to fetch the expanded zone list.
  4. Deadlock Prevention: Restructured updateNodeZones to compute its local state and explicitly release nodeZonesLock before triggering any GCE API calls.
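
For reviewers who want the shape of the change without opening the diff, the four points above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `zoneLister` interface, the `ListZoneNames` method, the `staticLister` fake, and the `(nodeName, zone string)` signature of `updateNodeZones` are all assumptions made to keep the example self-contained. Only the lock and function names (`managedZonesLock`, `nodeZonesLock`, `getManagedZones`, `refreshManagedZones`, `updateNodeZones`) come from the description.

```go
package main

import "sync"

// zoneLister is a hypothetical stand-in for the zone-listing call.
// The PR uses ListZonesInRegion; its real signature is not reproduced here.
type zoneLister interface {
	ListZoneNames(region string) ([]string, error)
}

// staticLister is a hypothetical fake that counts how often it is called.
type staticLister struct {
	zones []string
	calls int
}

func (s *staticLister) ListZoneNames(string) ([]string, error) {
	s.calls++
	return append([]string(nil), s.zones...), nil
}

// cloud is a heavily simplified stand-in for the GCE Cloud struct.
type cloud struct {
	region string
	lister zoneLister

	managedZonesLock sync.RWMutex
	managedZones     []string

	nodeZonesLock sync.Mutex
	nodeZones     map[string]map[string]struct{} // zone -> set of node names
}

// getManagedZones returns a copy so callers never share the cached slice.
func (c *cloud) getManagedZones() []string {
	c.managedZonesLock.RLock()
	defer c.managedZonesLock.RUnlock()
	return append([]string(nil), c.managedZones...)
}

// refreshManagedZones calls the API without holding any lock, then swaps
// the cached list under the write lock.
func (c *cloud) refreshManagedZones() error {
	zones, err := c.lister.ListZoneNames(c.region)
	if err != nil {
		return err // the previous cache contents are left untouched
	}
	c.managedZonesLock.Lock()
	c.managedZones = zones
	c.managedZonesLock.Unlock()
	return nil
}

// updateNodeZones records the node, releases nodeZonesLock, and only then
// refreshes the zone cache if the node's zone is not yet managed.
func (c *cloud) updateNodeZones(nodeName, zone string) error {
	c.nodeZonesLock.Lock()
	if c.nodeZones == nil {
		c.nodeZones = make(map[string]map[string]struct{})
	}
	if c.nodeZones[zone] == nil {
		c.nodeZones[zone] = make(map[string]struct{})
	}
	c.nodeZones[zone][nodeName] = struct{}{}
	c.nodeZonesLock.Unlock() // released before any API call

	for _, z := range c.getManagedZones() {
		if z == zone {
			return nil // known zone: no API call
		}
	}
	return c.refreshManagedZones()
}
```

In this sketch a node event for an already-managed zone never reaches the lister, and neither lock is held while the (potentially slow) listing call is in flight.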

Special notes for your reviewer

  • The node informer was chosen as the trigger because GKE volume provisioning requires a node to exist in a zone before volumes can be targeted there. Because the hook is event-driven, it makes no GCE API calls while no new zones appear.
  • I added a unit test, TestUpdateNodeZonesDynamicRefresh, in gce_instances_test.go that simulates the node informer event pipeline and verifies the dynamic cache refresh.
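
To give a feel for what such a test can exercise, here is a minimal, hypothetical harness that delivers node events concurrently against a guarded zone cache. None of these names (`fakeRegion`, `zoneCache`, `onNodeEvent`, `simulateEvents`) exist in the PR; they are assumptions for illustration, and the real test works against the GCE mock rather than this in-memory fake.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// fakeRegion is a hypothetical in-memory stand-in for the zone-listing API.
type fakeRegion struct {
	mu    sync.Mutex
	zones []string
	calls int64
}

// enable simulates a new zone becoming available in the region.
func (f *fakeRegion) enable(zone string) {
	f.mu.Lock()
	f.zones = append(f.zones, zone)
	f.mu.Unlock()
}

// list returns a copy of the region's zones and counts the call.
func (f *fakeRegion) list() []string {
	atomic.AddInt64(&f.calls, 1)
	f.mu.Lock()
	defer f.mu.Unlock()
	return append([]string(nil), f.zones...)
}

func (f *fakeRegion) callCount() int64 { return atomic.LoadInt64(&f.calls) }

// zoneCache mimics a managedZones cache guarded by an RWMutex.
type zoneCache struct {
	mu    sync.RWMutex
	zones map[string]bool
	api   *fakeRegion
}

func (c *zoneCache) has(zone string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.zones[zone]
}

// onNodeEvent refreshes the cache only when the node's zone is unknown.
func (c *zoneCache) onNodeEvent(zone string) {
	if c.has(zone) {
		return
	}
	fresh := c.api.list() // no lock held during the API call
	next := make(map[string]bool, len(fresh))
	for _, z := range fresh {
		next[z] = true
	}
	c.mu.Lock()
	c.zones = next // swap the whole map under the write lock
	c.mu.Unlock()
}

// simulateEvents delivers one event per zone from separate goroutines.
func simulateEvents(c *zoneCache, zones []string) {
	var wg sync.WaitGroup
	for _, z := range zones {
		wg.Add(1)
		go func(z string) {
			defer wg.Done()
			c.onNodeEvent(z)
		}(z)
	}
	wg.Wait()
}
```

Running a harness like this under `go test -race` is one way to check the locking. Note that in this sketch two concurrent misses for the same new zone may each call `list()`; whether the real implementation deduplicates such refreshes is not stated in the description.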

cc: @YifeiZhuang @zhaoqsh @gnossen

Currently, the CCM GCE cloud provider caches `managedZones` exactly once at
startup and never refreshes it. If a GKE cluster spins up a new node pool
in a newly enabled zone in the region, node lookups and persistent volume
operations fail in that zone because GCE CCM's cache is stale.

This change introduces a thread-safe dynamic zone refresh mechanism:
1. Guard the `managedZones` cache with a `sync.RWMutex` to prevent data races.
2. Refactor the codebase to route all read accesses through a thread-safe
   getter (`getManagedZones()`).
3. Implement `refreshManagedZones()` using the mockable `ListZonesInRegion()`
   interface.
4. Trigger the refresh on-demand in `updateNodeZones` when a node registers
   in an unmanaged zone.
5. Add a comprehensive unit test `TestUpdateNodeZonesDynamicRefresh` to
   verify the dynamic zone refresh pipeline.
@k8s-ci-robot k8s-ci-robot requested review from cici37 and jpbetz May 26, 2026 18:24
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: arvindbr8
Once this PR has been reviewed and has the lgtm label, please assign bowei for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 26, 2026
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 26, 2026
@k8s-ci-robot
Contributor

Hi @arvindbr8. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 26, 2026
@YifeiZhuang
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2026
Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

3 participants