Today I had to troubleshoot an IBM Blade system. The customer was complaining that all servers, except one, weren’t able to communicate with the rest of the network. The blade system contains two Nortel switches. Each Nortel switch is connected with a 3 Gbps LACP channel to a separate HP switch. The HP switches are the core switches of the network and have VRRP configured between them. The servers have two network cards, which are configured in an active / standby team configuration.
I started troubleshooting by simply pinging between the different servers in the blade system. The servers were able to ping each other. Next I tried to ping the default gateway. Only the one working server could ping the default gateway; the other servers couldn’t.
Looking at the active / standby team configuration, I noticed that the active NIC communicated with the Nortel switch connected to the VRRP slave switch. So the servers weren’t able to ping the VRRP master switch (the default gateway), but they were able to ping the VRRP slave switch. The VRRP master switch and the VRRP slave switch could also ping each other.
I looked at the VLAN tagging configuration on the Nortel and HP switches, but all the ports had the correct VLAN tagging, so this couldn’t be the problem. I changed the teaming and made the secondary NIC the active one. Now all the servers were able to communicate with the rest of the network. I switched everything back to the previous configuration and the problem returned.
Looking at these symptoms, I could only point to the LACP channel as the cause of all the problems. Maybe something went wrong when establishing the LACP channel. I guess the load balancing algorithm used is MAC based, maybe destination MAC based. In that case all packets to the default gateway or to another VLAN would be addressed to the MAC address of the VRRP master switch, would always be sent over the same member link, and would be lost if that link were unidirectional (the kind of fault UDLD is meant to detect). So I decided to disable two ports on the HP switch and leave only one port enabled.
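To make that guess more concrete, here is a minimal sketch of how a destination MAC based hash could black-hole traffic to a single address. It is only an illustration: the hash (low-order byte modulo the number of links), the function names and all MAC addresses are assumptions, not what the Nortel or HP switches actually do.

```python
def select_link(dst_mac: str, link_count: int) -> int:
    """Pick a member link from the destination MAC.

    Hypothetical hash: low-order byte of the MAC modulo the link count.
    Real switches use vendor-specific algorithms.
    """
    mac_bytes = bytes.fromhex(dst_mac.replace(":", "").replace("-", ""))
    return mac_bytes[-1] % link_count


def delivered(dst_mac: str, links_ok: list) -> bool:
    """Return True if the frame lands on a healthy member link."""
    return links_ok[select_link(dst_mac, len(links_ok))]


# Assumed state: three member links, the middle one silently dropping frames.
links_ok = [True, False, True]

# Illustrative addresses only (not taken from the customer's network).
gateway_mac = "00:00:5e:00:01:01"  # VRRP virtual MAC format, VRID 1 assumed
backup_mac = "00:11:22:33:44:56"   # made-up MAC for the VRRP slave switch

print(delivered(gateway_mac, links_ok))  # False: hashed onto the bad link
print(delivered(backup_mac, links_ok))   # True: hashed onto a good link
```

Under these assumptions every frame for the gateway always picks the same bad link, while frames for other addresses get through, which would match the symptoms I saw.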
After that all the servers could communicate with the rest of the network. I decided to disable that port and enable another single port. The servers were still able to communicate with the rest of the network. I tried using the last port and still everything was working perfectly. I decided to add the other two ports back to the LACP channel. This time, with the 3 Gbps LACP channel active, everything was working perfectly.
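The test order I followed can be written down as a small checklist generator. This is just a sketch of the procedure; the port names are hypothetical and nothing here talks to a real switch.

```python
def isolation_plan(ports):
    """Build the test steps: each port enabled alone, then all ports together."""
    steps = []
    for port in ports:
        steps.append({
            "enable": [port],
            "disable": [p for p in ports if p != port],
        })
    # Final step: rebuild the full channel with every member link enabled.
    steps.append({"enable": list(ports), "disable": []})
    return steps


# Hypothetical names for the three member ports of the LACP channel.
for number, step in enumerate(isolation_plan(["A1", "A2", "A3"]), start=1):
    print(number, "enable:", step["enable"], "disable:", step["disable"])
```

After each step, ping the default gateway from one of the affected servers before moving on to the next one.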
In my opinion something went wrong during the establishment of the LACP channel. I found it difficult to troubleshoot the environment, because there aren’t a lot of troubleshooting methods for the HP switches and especially for the Nortel blade switches.