Today I had to troubleshoot an IBM Blade system. The customer was complaining that all servers, except one, weren’t able to communicate with the rest of the network. The blade system contains two Nortel switches. Each Nortel switch is connected with a 3 Gbps LACP channel to separate HP switches. The HP switches are the core switches of the network and have VRRP configured between them. The servers have two network card, which are configured in an active / standby team configuration.
I started troubleshooting by simply pinging between the different servers in the blade system. The servers were able to ping each other. Next I tried to ping the default gateway. Only the working servers could ping the default gateway, the other servers couldn’t.
Looking at the active / standby team configuration, I noticed that the active NIC communicates with the Nortel switch connected to the VRRP slave switch. So the servers weren’t able to ping the VRRP master switch (default gateway), but they were able to ping the VRRP slave switch, but the VRRP master switch and VRRP slave switch could ping each other.
I look at the VLAN tagging configuration on the Nortel and HP switches, but all the ports had the correct VLAN tagging, so this couldn’t be the problem. I changed the teaming and made the secondary NIC the active one. Now all the servers were able to communicate with the rest of the network. I switched everything back to the previous configuration and the problem returned again.
Looking at these symptoms I could only point out the LACP channel as the cause of all the problems. Maybe something went wrong when establishing the LACP channel. I guess the load balancing algorithm used is MAC based, maybe destination MAC based. So all packets to the default gateway or another VLAN would use the MAC address of the VRRP master switch and these packets would be lost in a UDLD link. So I decided to disable to ports on the HP switch and only leave one port enabled.
After that all the switches could communicate with the rest of the network. I decided to disable that port and enable another single port. The servers were still able to communicate with the rest of the network. I tried using the last port and still everything was working perfectly. I decided to add the other two ports to the LACP channel. This time, by having the 3 Gbps LACP channel active, every was working perfectly.
In my opinion something went wrong during the establishment of the LACP channel. I found it difficult to troubleshoot the environment, because there aren’t a lot of troubleshooting methods for the HP switches and especially for the Nortel blade switches.
I have had different discussions with different customers about the load-balancing algorithms between a Cisco switch, configured with a port-channel and a VMware ESX server using multiple NICs. Our VMware consultants always choose Route based on IP hashes as load-balancing algorithm. This means that load-balancing happens on layer 3 of the OSI model (source-destination-IP).
In my opinion, the switch should be configured the same way. Depending on the model switch, you can have different default load-balancing algoritmhs. For example, the Cisco Catalyst 3750 uses src-mac load-balancing and the Cisco Catalyst 6500 use src-dst-ip load-balancing. You can check the configured load-balancing algorithm with the following command:
show etherchannel load-balancing
If you would like you change the load-balancing algorithm you can use the global configuration command:
port-channel load-balancing <option>
Be aware that this is a global configuration command, so it affects all the configured port-channels on the switch.
To check the load-balancing between the different NICs, you should have a tool to look at real-time bandwidth statistics. I normally use the tool SNMP Traffic Grapher to monitor the different switch ports. On the ESX console you can check the load-balancing with the commands:
The load should be spread fairly even across the different switch ports en vmnics.