The “no healthy upstream” error in vCenter usually hits at the worst time — often right after a reboot or during maintenance, and almost always paired with a 503 Service Unavailable response.
At its core, the error means the Envoy proxy can’t reach a backend service like vpxd, vsphere-ui, or vapi-endpoint. The cause is usually one of two things: disk space exhaustion or an expired certificate.
This guide walks you through five troubleshooting steps to find the root cause and get your vSphere environment back online quickly.
Why vCenter Shows “No Healthy Upstream”
Here are the most common technical causes for this vCenter no healthy upstream error.
1. Expired Certificates (The #1 Culprit)
Certificate expiration is the most frequent cause of this error. Within the VCSA, two certificates are vital for internal communication:
- STS Certificate: If the Security Token Service certificate expires, SSO cannot validate tokens, halting all internal service authentication.
- Machine SSL Certificate: If this certificate is expired, the reverse proxy cannot establish a secure connection to backend services, resulting in the 503 error.
2. Disk Space Exhaustion
VCSA partitions are sensitive to storage limits. When a partition reaches 100%, services cannot write logs or temporary files and will crash immediately. The most common culprits are:
- /storage/log: Saturated by excessive log growth.
- /storage/seat: Fills with Stats, Events, Alarms, and Tasks data, often causing the vpxd service to fail.
3. Service Failures (vsphere-ui and vpxd)
The “upstream” refers to the internal services handling your requests. If these are stopped or hung, the proxy has no destination for your traffic:
- vsphere-ui: The web console service required to render the client interface.
- vpxd: The core vCenter engine. If it crashes due to database issues, the entire management plane becomes unavailable. When core services fail to start, you may also see symptoms such as vCenter losing synchronization with its ESXi hosts, which often points to a deeper issue with the management network or the underlying database integrity.
4. Memory Pressure and Resource Starvation
VCSA requires strict RAM reservations, especially after upgrades.
- OOM Killer: Newer versions (7.0/8.0) require significantly more memory (12 GB or more). If the appliance is under-provisioned, the Linux Out-Of-Memory (OOM) killer will terminate critical Java processes to protect the kernel, preventing services from reaching a “healthy” state.
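If you suspect the OOM killer, the kernel log usually records the event. A minimal sketch for checking from the VCSA shell (or any systemd-based Linux); the service names that appear in the output will vary by build:

```shell
#!/bin/sh
# Look for evidence of OOM kills in the kernel ring buffer.
dmesg 2>/dev/null | grep -iE "out of memory|oom-killer" \
  || echo "No OOM events found in the ring buffer"

# Show current memory headroom available to the Java services:
free -h
```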
5. DNS Resolution Failures
vCenter depends on a functional Domain Name System (DNS) to identify its own endpoints via its Fully Qualified Domain Name (FQDN).
- Resolution Failure: If the VCSA cannot resolve its own FQDN, or if the reverse DNS (PTR) record is missing, the service startup sequence fails, breaking the communication chain between the proxy and the internal services.
How to Fix No Healthy Upstream Error: Troubleshooting Steps
Once you understand the underlying reasons for the vCenter no healthy upstream error, you can take a systematic approach to remediate the services. This error generally indicates that the Envoy proxy cannot find a functional backend service to route your request to.
Whether you are experiencing no healthy upstream after reboot or during normal operations, following these technical steps will help you identify and resolve the failure.
- Before troubleshooting, enable SSH on the VCSA or access it via the ESXi DCUI.
- After a reboot or service restart, *no healthy upstream* can be normal—wait 10–15 minutes for services to fully initialize before continuing.
Step 1: Check Disk Space
Before modifying any configurations, ensure the VCSA has the “breathing room” to run its processes. Disk exhaustion is a primary reason services fail to initialize.
- SSH into the VCSA and enter the shell.
- Run the following command:
df -h
What to look for: Closely examine the output for any partition at 100% capacity. In a VCSA environment, /storage/log and /storage/core are the most frequent culprits.
The Fix: If /storage/log is full, you may need to clear old compressed logs manually. If the environment has outgrown its current allocation, expand the virtual disk in the ESXi settings. Afterward, run the following to trigger an internal resize and service refresh:
service-control --stop --all && service-control --start --all
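The disk check and cleanup above can be sketched as a small helper. This is a minimal sketch, not VMware tooling: `full_partitions` is a hypothetical helper name, and the `/storage/log` cleanup commands are commented out so you can review what would be deleted first:

```shell
#!/bin/sh
# full_partitions: print "mountpoint use%" for filesystems at or above
# a usage threshold (default 90%), parsed from standard df output.
full_partitions() {
  threshold="${1:-90}"
  df -h | awk -v t="$threshold" 'NR > 1 && ($5 + 0) >= t { print $6, $5 }'
}

full_partitions 90   # list anything 90% full or worse

# If /storage/log is the offender, preview rotated logs before deleting:
# find /storage/log -name "*.gz" -mtime +7 -print    # preview
# find /storage/log -name "*.gz" -mtime +7 -delete   # remove after review
```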
Step 2: Verify Certificate Validity
Certificate expiration is the most common technical failure behind the vCenter no healthy upstream message. If the Security Token Service (STS) or Machine SSL certificates are invalid, services cannot mutually authenticate.
Always take a file-based backup or a VM snapshot before performing a certificate reset. For a more secure and efficient backup solution, consider using i2Backup.
Run this command to check expiration dates across all VCSA stores:
for i in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list); do echo "STORE: $i"; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $i --text | grep -ie "Not After"; done
The Fix: If any certificates are listed as expired, you need to launch the VMware Certificate Manager utility:
/usr/lib/vmware-vmca/bin/certificate-manager
In most “No Healthy Upstream” scenarios caused by total certificate failure, Option 8 (Reset all Certificates) is the most effective choice for a full recovery.
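To dig into a single certificate before resetting everything, you can export it and decode it with openssl. A hedged sketch: `check_cert_expiry` is a hypothetical helper, and the vecs-cli export shown in the comment should be verified against your build:

```shell
#!/bin/sh
# check_cert_expiry: print a PEM certificate's subject and expiry date,
# and warn if it expires within 30 days (2592000 seconds).
# On the VCSA, the Machine SSL cert can typically be exported first with:
#   /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert \
#     --store MACHINE_SSL_CERT --alias __MACHINE_CERT > /tmp/machine-ssl.pem
check_cert_expiry() {
  pem="$1"
  openssl x509 -in "$pem" -noout -subject -enddate
  if openssl x509 -in "$pem" -noout -checkend 2592000 >/dev/null; then
    echo "OK: valid for at least 30 more days"
  else
    echo "WARNING: expires within 30 days (or already expired)"
  fi
}
```

Usage: `check_cert_expiry /tmp/machine-ssl.pem`.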
Step 3: Check Service Status
If disks are clear and certificates are valid, the “upstream” (the service itself) may simply be stopped or crashed.
To identify which specific service is unhealthy:
service-control --status --all
What to look for: Check if vsphere-ui (the HTML5 client) or vpxd (the core vCenter service) is in a “Stopped” state.
The Fix: If critical services are stopped, attempt a clean manual restart of the entire service stack:
service-control --stop --all
service-control --start --all
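`service-control --status --all` groups service names under `Running:` and `Stopped:` headings, so a short awk filter can surface only the stopped ones. A sketch; `stopped_services` is a hypothetical helper, and you should confirm the heading format on your build:

```shell
#!/bin/sh
# stopped_services: read service-control output on stdin and print only
# the service names listed under the "Stopped:" heading.
stopped_services() {
  awk '/^Stopped:/ { grab = 1; next } /^Running:/ { grab = 0 } grab { print }'
}

# On the VCSA:
#   service-control --status --all | stopped_services
```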
Step 4: Verify DNS and Time Sync
vSphere services rely heavily on precise timing and name resolution. If the VCSA cannot “see” itself or if its clock has drifted, tokens will be rejected.
- Check DNS: Ensure the vCenter can resolve its own FQDN:
nslookup <vcenter-fqdn>
- Check Time: Verify the current appliance time matches your infrastructure’s NTP source or Domain Controller:
date
The Fix: If the time is off by more than a few minutes, the Security Token Service will reject authentication requests, triggering the upstream error. Correct the time via the vCenter Management Interface (VAMI) or the date command and ensure NTP is synchronized.
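The DNS and time checks above can be combined in one pass, adding the reverse (PTR) lookup mentioned earlier. A sketch, where `vcenter.example.com` is a placeholder for your actual FQDN:

```shell
#!/bin/sh
# Verify forward and reverse DNS for the appliance FQDN, then show the clock.
FQDN="vcenter.example.com"   # placeholder: replace with your vCenter FQDN

IP="$(getent hosts "$FQDN" 2>/dev/null | awk '{ print $1; exit }')" || true
echo "Forward: $FQDN -> ${IP:-UNRESOLVED}"
if [ -n "$IP" ]; then
  echo "Reverse: $(getent hosts "$IP" || echo "no PTR record")"
fi

# Clock and NTP sync state (the VCSA is systemd-based):
date -u
timedatectl 2>/dev/null | grep -i "synchronized" || true
```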
Step 5: Examine the UI Logs (The Final Clue)
If your services appear to be running according to service-control, but you still encounter the “no healthy upstream” error, the issue likely lies within the Java runtime of the vSphere Client. When the service starts but cannot fully initialize, the “Final Clue” is hidden in the wrapper logs.
Access the shell and tail the vSphere Client log:
tail -n 100 /var/log/vmware/vsphere-ui/logs/vsphere_client_virgo.log
What to look for:
- java.lang.OutOfMemoryError: Confirms that the VCSA does not have enough physical RAM allocated to support the Java heap size required by the UI service. This is a common culprit for vCenter no healthy upstream after reboot following an upgrade.
- java.lang.NullPointerException: Often indicates a corrupted plugin or a failure to communicate with the Lookup Service.
The Fix: If you identify an OutOfMemoryError, you need to shut down the VCSA and increase the memory allocation in the vSphere settings.
If you see plugin-related exceptions, you may need to clear the Serenity database or unregister stale extensions using the Managed Object Browser (MOB).
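The log scan above can be scripted so only the tell-tale exceptions surface. A minimal sketch; `scan_ui_log` is a hypothetical helper, and the log path matches the tail command shown earlier:

```shell
#!/bin/sh
# scan_ui_log: pull the two exceptions discussed above out of a virgo log,
# keeping only the 20 most recent matches.
scan_ui_log() {
  grep -nE "OutOfMemoryError|NullPointerException" "$1" | tail -n 20
}

LOG=/var/log/vmware/vsphere-ui/logs/vsphere_client_virgo.log
if [ -f "$LOG" ]; then
  scan_ui_log "$LOG"
else
  echo "Log not found at $LOG (run this on the VCSA)"
fi
```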
FAQ
Q1: Why does vCenter show “no healthy upstream” even after restarting services?
A: Most likely, the root cause (expired certificates, low RAM, or DNS issues) wasn’t fixed. Check the logs or run the quick troubleshooting commands to find the issue.
Q2: Can I fix vCenter no healthy upstream without SSH access?
A: Yes. Use the ESXi DCUI to access the VCSA shell, or use the VAMI to fix time/DNS issues. For certificate resets, SSH is easier but not always required.
Q3: Is it safe to use a VM snapshot to fix the error?
A: Snapshot rollback can work for temporary fixes, but it’s not recommended long-term. Always take a file-based backup before rolling back, and fix the root cause (e.g., expired certificates) afterward.
Q4: Why does df -h show disk space normal but I still get the error?
A: The issue is likely DNS, time sync, or a service deadlock. Check DNS resolution and time first, then restart all services.
Q5: How do I check if Envoy proxy is running in vCenter 8.0?
A: Run service-control --status vmware-envoy (the Envoy proxy runs as the vmware-envoy service on VCSA 7.0 and later). If it’s stopped, restart it with service-control --start vmware-envoy.
Q6: Do I need to restart vCenter after fixing certificates?
A: Yes. After resetting certificates, restart all services with service-control --stop --all && service-control --start --all to apply the changes.
Q7: Why does the error happen only after a vCenter upgrade?
A: Upgrades often increase RAM requirements or don’t auto-renew certificates. Check RAM allocation and certificate expiration first.
Q8: Can plugin issues cause vCenter no healthy upstream?
A: Yes. Corrupted or outdated plugins can crash the vsphere-ui service. Check the UI logs for NullPointerException and unregister stale plugins.
Q9: What’s the minimum RAM for vCenter 8.0 to avoid this error?
A: 16GB RAM is recommended. 12GB is the minimum, but it’s more likely to trigger OOM Killer and the no healthy upstream error.
Q10: How long should I wait after restarting vCenter before troubleshooting?
A: Wait 10–20 minutes. Services can take time to fully start, especially after a reboot or upgrade. If the error still shows after 20 minutes, start troubleshooting.
Conclusion
The vCenter no healthy upstream error is essentially a “smoke signal” indicating that the VCSA’s core services are unable to communicate. While the 503 message is a generic response from the reverse proxy, identifying the root cause (expired certificates, exhausted disk space, or a service crash) is straightforward when using a systematic CLI approach.
To fix the no healthy upstream error effectively, always prioritize checking your STS certificates and partition health first. By maintaining proper resource overhead and monitoring your service logs, you can prevent no healthy upstream after reboot scenarios and keep your vSphere environment stable. Always remember to take a file-based backup or VM snapshot before performing significant certificate or service remediation.