Link State Tracking

Last week a friend called me and told me he was having serious problems with his network. A complete blade environment wasn’t able to communicate with the rest of the network. I asked what had changed in the network and he told me that he had added a VLAN to a trunk allowed list.

Because he is a friend, I dialed in and checked the configuration of the switch. I noticed that all ports on the switch were err-disabled. What had happened here, that all switch ports were err-disabled?! Looking closer, I noticed a link state tracking configuration on all ports.
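As a side note, on a Catalyst switch you can quickly list which ports are err-disabled and, once the underlying cause has been fixed, bounce a port by hand. A minimal sketch, with a placeholder hostname and interface:

switch# show interfaces status err-disabled
switch# configure terminal
switch(config)# interface GigabitEthernet0/10
switch(config-if)# shutdown
switch(config-if)# no shutdown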

Link-state tracking, also known as trunk failover, is a feature that binds the link state of multiple interfaces. Link-state tracking provides redundancy in the network when used with server network interface card (NIC) adapter teaming. When the server network adapters are configured in a primary or secondary relationship known as teaming and the link is lost on the primary interface, connectivity transparently changes to the secondary interface.

At first I was skeptical about the link state configuration and asked my friend why it was used. He couldn’t give me an answer, because he hadn’t configured the switch himself. It was hard for me to find a reason why link state tracking was used, because I wasn’t familiar with the network, so I removed the link state configuration from the switch. All ports changed back to a normal state. I then noticed that the uplink (port-channel) configuration wasn’t correct: the VLAN had been added to the trunk allowed list on a member port and not on the port-channel interface.
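For reference, on a Cisco switch the trunk allowed list has to be changed on the logical port-channel interface so it applies to the whole bundle, not on a physical member port. A minimal sketch with made-up interface and VLAN numbers:

! Wrong: only touches one physical member and can cause a mismatch within the bundle
interface GigabitEthernet0/23
 switchport trunk allowed vlan add 100
!
! Right: configure the logical interface, the member ports inherit it
interface Port-channel1
 switchport trunk allowed vlan add 100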

After helping my friend and sleeping on it for a couple of days, I started thinking about the Link State Tracking feature again. I tried to figure out why someone had configured the feature in my friend’s environment. Eventually, after racking my brain a bit, I found the answer. Let’s look at the following example environment.

[Figure: Link State Tracking example topology]

The figure shows one ESX server with two NICs. One NIC is connected to switch bl-sw01 and the other NIC is connected to switch bl-sw02. The ESX server uses the load-balancing algorithm “Route based on Virtual Port ID”.

Now let’s assume the link between bl-sw02 and dis-sw02 goes down. Because the ESX server still has a link to bl-sw02, it keeps sending packets that way. Switch bl-sw02 no longer has an uplink to the rest of the network, so those packets get dropped.

When using Link State Tracking, the connection between the ESX server and switch bl-sw02 will also be brought down when the uplink between bl-sw02 and dis-sw02 is lost. The ESX server will then only use the connection with switch bl-sw01 to reach the rest of the network. Link State Tracking works with upstream and downstream interfaces. In the example, the port on bl-sw02 that connects to dis-sw02 would be configured as an upstream port, and the port to the ESX server would be configured as a downstream port. The downstream port is put in err-disabled state when the upstream port loses its connection. This is exactly what you would like to accomplish.

The first step is to enable Link State Tracking globally on the switch:

bl-sw02(config)# link state track 1

The next step is configuring the upstream and downstream interfaces.

interface GigabitEthernet0/16
 description switch-uplink
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport nonegotiate
 link state group 1 upstream
 spanning-tree link-type point-to-point
!
interface GigabitEthernet0/10
 description ESX01
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport nonegotiate
 link state group 1 downstream
 spanning-tree portfast trunk

You can check the status of the Link State Group with the following command:

bl-sw02# show link state group detail

Link State Group: 1     Status: Enabled, Up
Upstream Interfaces   : Gi0/16(Up)
Downstream Interfaces : Gi0/10(Up)

In the future I will use Link State Tracking, especially in blade environments. At least in blade environments with multiple switches that don’t support some kind of stacking technology, and servers with multiple NICs.

IBM Blade with Nortel and HP switches

Today I had to troubleshoot an IBM Blade system. The customer was complaining that all servers except one weren’t able to communicate with the rest of the network. The blade system contains two Nortel switches. Each Nortel switch is connected with a 3 Gbps LACP channel to a separate HP switch. The HP switches are the core switches of the network and have VRRP configured between them. The servers have two network cards, which are configured in an active/standby team.
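For readers less familiar with VRRP: both core switches share a single virtual gateway address and only the VRRP master answers for it. The HP CLI differs, but in Cisco IOS notation (with made-up addresses and VLAN) the idea looks roughly like this:

! VRRP master (higher priority)
interface Vlan10
 ip address 10.0.10.2 255.255.255.0
 vrrp 10 ip 10.0.10.1
 vrrp 10 priority 110
!
! VRRP slave (default priority 100), on the other core switch
interface Vlan10
 ip address 10.0.10.3 255.255.255.0
 vrrp 10 ip 10.0.10.1

The servers use 10.0.10.1 as their default gateway, so all routed traffic goes through whichever switch is master.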

I started troubleshooting by simply pinging between the different servers in the blade system. The servers were able to ping each other. Next I tried to ping the default gateway. Only the working server could ping the default gateway; the other servers couldn’t.

Looking at the active/standby team configuration, I noticed that the active NIC communicates through the Nortel switch connected to the VRRP slave switch. So the servers weren’t able to ping the VRRP master switch (the default gateway), but they were able to ping the VRRP slave switch, and the VRRP master and slave switches could ping each other.

I looked at the VLAN tagging configuration on the Nortel and HP switches, but all the ports had the correct VLAN tagging, so this couldn’t be the problem. I changed the teaming and made the secondary NIC the active one. Now all the servers were able to communicate with the rest of the network. When I switched everything back to the previous configuration, the problem returned.

Looking at these symptoms, I could only point to the LACP channel as the cause of all the problems. Maybe something went wrong when establishing the LACP channel. I guessed that the load balancing algorithm used was MAC based, maybe destination MAC based. In that case all packets to the default gateway or another VLAN would use the MAC address of the VRRP master switch, hash onto the same member link, and be lost if that link was faulty (for example unidirectional). So I decided to disable two of the ports on the HP switch and leave only one port enabled.
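On a Cisco switch you would verify and, if needed, change the channel hash with the commands below; the HP and Nortel CLIs differ, so this is only meant to illustrate the idea (hostname is a placeholder):

core-sw# show etherchannel load-balance
core-sw# configure terminal
core-sw(config)# port-channel load-balance src-dst-mac

A source-destination hash spreads the traffic towards the gateway over more member links than a pure destination MAC hash, which would put all routed traffic on the same (possibly faulty) link.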

After that, all the servers could communicate with the rest of the network. I then disabled that port and enabled another single port. The servers were still able to communicate with the rest of the network. I tried the last port and still everything was working perfectly. I decided to add the other two ports back to the LACP channel. This time, with the full 3 Gbps LACP channel active, everything was working perfectly.

In my opinion something went wrong during the establishment of the LACP channel. I found it difficult to troubleshoot the environment, because there aren’t a lot of troubleshooting methods for the HP switches and especially for the Nortel blade switches.

HP Blade Switch Development

Maybe old news for some of you, but Cisco has developed switches for the HP BladeSystem servers. The Cisco Catalyst Blade Switch 3120G and 3120X provide stacking functionality. This improves the functionality of the switches by creating a single virtual switch from multiple physical switches.

Source: The Cisco Catalyst Blade Switch 3120 Series Switches are specifically designed to meet the rigors of the blade server based application infrastructure and provide HP BladeSystem customers with the ability to stack up to nine switches into a single virtual switch.

The creation of a stack helps improve the availability and load balancing of connections between the HP Blade environment and the physical network environment. More information about the new switches can be found here.
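As an example of what stacking adds: with two stacked 3120 switches you can build a cross-stack EtherChannel, bundling one uplink from each physical switch into a single logical uplink, so losing one switch or one link no longer isolates the blade environment. A minimal sketch, with hypothetical interface numbers:

interface GigabitEthernet1/0/25
 description uplink via stack member 1
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode active
!
interface GigabitEthernet2/0/25
 description uplink via stack member 2
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode active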