Fixing Erratic Behavior on Hyper-V with Network Load Balancers

For years, I’d never heard of this problem. Then, suddenly, I was seeing it everywhere. It’s not easy to outline a precise symptom tree for you. Networked applications will behave oddly. Remote desktop sessions may skip or hang. Some network traffic will not pass at all; other traffic will behave erratically. So, rather than attempt a thorough symptom tree, I’ll just describe the setup that this article addresses: you’re using Hyper-V with a third-party network load balancer and experiencing network-related problems.

Acknowledgements

Before I ever encountered it, the problem was described to me by one of my readers. Check out our Complete Guide to Hyper-V Networking article and look in the comments section for Jahn’s input. I had a different experience, but that conversation helped me reach a resolution much more quickly.

Problem Reproduction Instructions

The problem may appear under other conditions, but should always occur under these:

  • The network adapters that host the Hyper-V virtual switch are configured in a team
    • Load-balancing algorithm: Dynamic
    • Teaming mode: Switch Independent (likely occurs with switch-embedded teaming as well)
  • Traffic to/from affected virtual machines passes through a third-party load-balancer
    • Load balancer uses a MAC-based system for load balancing and source verification
      • Citrix NetScaler calls its feature “MAC based forwarding”
      • F5 load balancers call it “auto last hop”
    • The load balancer’s “internal” IP address is on the same subnet as the virtual machine’s
  • Sufficient traffic must be exiting the virtual machine for Hyper-V to load balance some of it to a different physical adapter

I’ll go into more detail later. This list should help you determine whether this article applies to your environment.
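
If you want to check the teaming half of that list quickly, the built-in LBFO cmdlets will show it directly. This is a read-only check; the vulnerable combination is SwitchIndependent paired with Dynamic:

# Show each team's teaming mode and load-balancing algorithm
Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm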

Resolution

Fixing the problem is very easy and can be done without downtime. I’ll show the options in order of preference and explain the differences between them later.

Option 1: Change the Load-Balancing Algorithm

Your best bet is to change the load-balancing algorithm to “Hyper-V port”. You can change it in the lbfoadmin.exe graphical interface if your management operating system is GUI-mode Windows Server. To change it with PowerShell (assuming only one team):

Get-NetLbfoTeam | Set-NetLbfoTeam -LoadBalancingAlgorithm HyperVPort

There will be a brief interruption of networking while the change is made. It won’t be as bad as the network problems that you’re already experiencing.
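
If the host has more than one team, target the affected one by name instead of piping all teams through. “HostTeam” below is just a placeholder for your own team’s name:

# Change only the named team (replace HostTeam with your team's actual name)
Set-NetLbfoTeam -Name 'HostTeam' -LoadBalancingAlgorithm HyperVPort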

Option 2: Change the Teaming Mode

Your second option is to change your teaming mode. It’s more involved because you’ll also need to update your physical infrastructure to match. I’ve always been able to do that without downtime as long as I changed the physical switch first, but I can’t promise the same for anyone else.

Decide if you want to use Static teaming or LACP teaming. Configure your physical switch accordingly.

Change your Hyper-V host to use the same mode. If your Hyper-V system’s management operating system is Windows Server GUI, you can use lbfoadmin.exe. To change it in PowerShell (assuming only one team):

Get-NetLbfoTeam | Set-NetLbfoTeam -TeamingMode Static

or

Get-NetLbfoTeam | Set-NetLbfoTeam -TeamingMode Lacp

In this context, it makes no difference whether you pick static or LACP. If you want more information, read our article on the teaming modes.
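
Whichever mode you pick, it’s worth confirming afterward that the team came back up and that every physical member rejoined it. A quick read-only check:

# Confirm the new teaming mode and overall team health
Get-NetLbfoTeam | Format-Table Name, TeamingMode, Status
# Confirm each physical member is active rather than faulted
Get-NetLbfoTeamMember | Format-Table Name, OperationalStatus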

Option 3: Disable the Feature on the Load Balancer

You could tell the load balancer to stop trying to be clever. In general, I would choose that option last.

An Investigation of the Problem

So, what’s going on? What caused all this? If you’ve got an environment that matches the one that I described, then you’ve unintentionally created the perfect conditions for a storm.

Whose fault is it? In this case, I don’t really think that it’s fair to assign fault. Everyone involved is trying to make your network traffic go faster. They sometimes do that by playing fast and loose in that gray area between Ethernet and TCP/IP. We have lots of standards that govern each individually, but not so many that apply to the ways that they can interact. The problem arises because Microsoft is playing one game while your load balancer plays another. The games have different rules, and neither side is aware that another game is afoot.

Traffic Leaving the Virtual Machine

We’ll start on the Windows guest side (also applies to Linux). Your application inside your virtual machine wants to send some data to another computer. That goes something like this:

  1. Application: “Network, send this data to computer www.altaro.com on port 443”.
  2. Network: “DNS server, get me the IP for www.altaro.com”
  3. Network: “IP layer, determine if the IP address for www.altaro.com is on the same subnet”
  4. Network: “IP layer, send this packet to the gateway”
  5. IP layer passes downward for packaging in an Ethernet frame
  6. Ethernet layer transfers the frame

The part to understand: your application and your operating system don’t really care about the Ethernet part. Whatever happens down there just happens. In particular, nothing above that layer cares at all about the source MAC address.
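
You can watch those upper-layer steps from inside the guest without a MAC address ever appearing. A small illustration, using the same example name from the list above:

# Step 2: name resolution
Resolve-DnsName www.altaro.com
# Steps 3 and 4: the TCP/IP stack picks the route and next hop on its own
Test-NetConnection www.altaro.com -Port 443

Neither command asks you for, or tells you about, a source MAC; that detail lives entirely below the layers that the application sees.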

[Figure: traffic leaving the virtual machine]

Traffic Crossing the Hyper-V Virtual Switch

Because this particular Ethernet frame is coming out of a Hyper-V virtual machine, the first thing that it encounters is the Hyper-V virtual switch. In our scenario, the Hyper-V virtual switch rests atop a team of network adapters. As you’ll recall, that team is configured to use the Dynamic load balancing algorithm in Switch Independent mode. The algorithm decides if load balancing can be applied. The teaming mode decides which pathway to use and if it needs to repackage the outbound frame.

Switch independent mode means that the physical switch doesn’t know anything about a team. It only knows about two or more Ethernet endpoints connected in standard access mode. A port in that mode can “host” any number of MAC addresses; the physical switch’s capability defines the limit. However, the same MAC address cannot appear on multiple access ports simultaneously. Allowing that would cause all sorts of problems.

[Figure: broken traffic flow in switch independent mode, with the same MAC appearing on multiple ports]

So, if the team wants to load balance traffic coming out of a virtual machine, it needs to ensure that the traffic has a source MAC address that won’t cause the physical switch to panic. For traffic going out any adapter other than the one where the virtual adapter’s MAC is registered, it substitutes the MAC address of that physical adapter.

[Figure: corrected traffic flow in switch independent mode, with source MAC substitution on the secondary adapter]

So, no matter how many physical adapters the team owns, one of two things will happen for each outbound frame:

  • The team will choose to use the physical adapter that the virtual machine’s network adapter is registered on. The Ethernet frame will travel as-is. That means that its source MAC address will be exactly the same as the virtual network adapter’s (meaning, not repackaged)
  • The team will choose to use an adapter other than the one that the virtual machine’s network adapter is registered on. The Ethernet frame will be altered. The source MAC address will be replaced with the MAC address of the physical adapter

Note: The visualization does not cover all scenarios. A virtual network adapter might be affinitized to the second physical adapter. If so, its load-balanced packets would travel out of the shown “pNIC1” and use that physical adapter’s MAC as a source.
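
If you’d like to see both sets of MAC addresses involved, you can list them from the management operating system. Any outbound frame carrying a source MAC from the first list traveled as-is; a source MAC from the second list indicates a load-balanced, repackaged frame:

# MAC addresses of the virtual network adapters (what the guests believe they send with)
Get-VMNetworkAdapter -All | Format-Table Name, MacAddress
# MAC addresses of the physical team members (what substituted frames carry instead)
Get-NetAdapter -Physical | Format-Table Name, MacAddress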

Traffic Crossing the Load Balancer

So, our frame arrives at the load balancer. The load balancer has a really crummy job. It needs to make traffic go faster, not slower. And, it acts like a TCP/IP router. Routers need to unpackage inbound Ethernet frames, look at their IP information, and make decisions on how to transmit them. That requires compute power and time.

[Figure: load balancer doing full routing work on every frame]

If it needed too much time to do all of this, people would prefer to live without the load balancer. That would mean the load balancer’s manufacturer sells no units, makes no money, and goes out of business. So, manufacturers come up with all sorts of tricks to make traffic move faster. One way to do that is to do less work on each Ethernet frame. This is a gross oversimplification, but you get the idea:

[Figure: load balancer taking the MAC-based shortcut]

Essentially, the load balancer only needs to remember which MAC address sent which frame, and then it doesn’t need to worry so much about all that IP nonsense (it’s really more complicated than that, but this is close enough).
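
If it helps to see that shortcut spelled out, here’s a purely illustrative sketch of the “remember the last hop” idea. This is not any vendor’s actual implementation; the flow keys and MAC addresses are made up:

# Illustrative only: the load balancer records which MAC each flow arrived from
$lastHop = @{}
function Set-LastHop($flowKey, $sourceMac) { $lastHop[$flowKey] = $sourceMac }
function Get-ReturnMac($flowKey) { return $lastHop[$flowKey] }

Set-LastHop '10.0.0.20:49152' '00-15-5D-01-0A-01'   # frame arrived bearing the vNIC's MAC
Set-LastHop '10.0.0.20:49153' '00-1B-21-AA-BB-01'   # load-balanced frame arrived bearing a pNIC's MAC
Get-ReturnMac '10.0.0.20:49153'                     # the "response" now targets the physical NIC

The second entry is exactly the trap described in the next section: the load balancer faithfully remembered a MAC address that the virtual machine never owned.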

The Hyper-V/Load Balancer Collision

Now we’ve arrived at the core of the problem: Hyper-V sends some of a virtual machine’s traffic using source MAC addresses that don’t belong to that virtual machine; they belong to the physical NICs. When the load balancer associates that traffic with a physical NIC’s MAC address, everything breaks.

Trying to be helpful (remember that), the load balancer attempts to return what it deems “response” traffic to the MAC that initiated the conversation. That MAC, in this case, belongs directly to the second physical NIC. Nothing on the Hyper-V side was expecting the traffic that’s now coming in, so the frame is silently discarded.

That happens because:

  • The Windows Server network teaming load balancing algorithms are send-only; they will not perform reverse translations. There are lots of reasons for that, and they are all good, so don’t get upset with Microsoft. Besides, it’s not like anyone else does things differently.
  • Because the inbound Ethernet frame is not reverse-translated, its destination MAC belongs to a physical NIC. The Hyper-V virtual switch will not deliver an Ethernet frame to a virtual network adapter unless that adapter owns the destination MAC.
  • In typical system-to-system communications, the “responding” system would have sent its traffic to the IP address of the virtual machine. Through the normal course of typical networking, that traffic’s destination MAC would always belong to the virtual machine. It’s only because your load balancer is trying to speed things along that the frame is being sent to the physical NIC’s MAC address. Otherwise, the source MAC of the original frame would have been little more than trivia.

Stated a bit more simply: Windows Server network teaming doesn’t know that anyone cares about its frames’ source MAC addresses and the load balancer doesn’t know that anyone is lying about their MAC addresses.

Why Hyper-V Port Mode Fixes the Problem

When you select the Hyper-V port load balancing algorithm in combination with the switch independent teaming mode, each virtual network adapter’s MAC address is registered on a single physical network adapter. That’s the same behavior that Dynamic uses. However, no load balancing is done for any given virtual network adapter; all traffic entering and exiting any given virtual adapter will always use the same physical adapter. The team achieves load balancing by distributing the virtual network adapters across its physical members in round-robin fashion.

[Figure: switch independent mode with the Hyper-V port load balancing algorithm]

Source MACs will always be those of their respective virtual adapters, so there’s nothing to get confused about.

I like this mode as a solution because it does a good job addressing the issue without making any other changes to your infrastructure. The drawback would be if you only had a few virtual network adapters and weren’t getting the best distribution. For a 10GbE system, I wouldn’t worry.

Why Static and LACP Fix the Problem

Static and LACP teaming involve your Windows Server system and the physical switch agreeing on a single logical pathway that consists of multiple physical pathways. All MAC addresses are registered on that logical pathway. Therefore, the Windows Server team never needs to perform any source MAC substitution, regardless of the load balancing algorithm that you choose.

[Figure: static/LACP teaming presenting a single logical pathway to the physical switch]

Since no MAC substitution occurs here, there’s nothing to confuse the load balancer.

I don’t like this method as much. It means modifying your physical infrastructure. I’ve noticed that some physical switches don’t like the LACP failover process very much. I’ve encountered some that need a minute or more to notice that a physical link was down and react accordingly. With every physical switch that I’ve used or heard of, the switch independent mode fails over almost instantly.

That said, using a static or LACP team will allow you to continue using the Dynamic load balancing algorithm. All else being equal, you’ll get a more even load balancing distribution with Dynamic than you will with Hyper-V port mode.
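
If you’re curious how evenly traffic actually spreads across the members under either algorithm, the per-adapter counters give a rough, read-only view; compare the totals for your team’s physical NICs over time:

# Compare cumulative sent/received bytes across the physical members
Get-NetAdapterStatistics | Format-Table Name, SentBytes, ReceivedBytes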

Why You Should Let the Load Balancer Do Its Job

The third listed resolution suggests disabling the related feature on your load balancer. I don’t like that option, personally. I don’t have much experience with the Citrix product, but I know that F5 buries its “Auto Last Hop” feature fairly deeply. Also, both manufacturers enable the feature by default, so it won’t be obvious to a maintainer that you’ve made the change.

However, your situation might dictate that disabling the load balancer’s feature causes fewer problems than changing the Hyper-V or physical switch configuration. Do what works best for you.

Using a Different Internal Router Also Addresses the Issue

In all of these scenarios, the load balancer performs routing. Actually, these types of load balancers always perform routing, because they present a single IP address for the service to the outside world and translate internally to the back-end systems.

However, nothing states that the internal source IP address of the load balancer must exist in the same subnet as the back-end virtual machines. You might do that for performance reasons; as I said above, routing incurs overhead. However, this is all a known quantity, and modern routers are pretty good at what they do. If any router sits between the load balancer and the back-end virtual machines, then the MAC address issue will sort itself out regardless of your load balancing and teaming mode selections.

Have You Experienced this Phenomenon?

If so, I’d love to hear from you. On what system did you experience it? How did you resolve the situation (if you were able)? Perhaps you’ve just encountered it and arrived here for a solution; if so, let me know whether this explanation was helpful or if you need further assistance with your particular environment. The comment section below awaits.


9 thoughts on "Fixing Erratic Behavior on Hyper-V with Network Load Balancers"

    • Shannon Yates says:

      Hey There,

      I have been working with Cisco, HPE, and Microsoft to get to the bottom of this, and you’re the first one I have come across that has found it. We have hit this pretty little issue in a major way while migrating Hyper-V chassis into Cisco ACI. ACI has a similar feature enabled on the Bridge Domain, called Endpoint Dataplane Learning, where effectively it is being clever and learning the MAC/IP from the data plane, and as a result sees both the virtual and redundant physical MAC. This causes MAJOR problems with ACI, equivalent to a host flap in the old days, followed by critical alarms in the fabric.

      So if you’re deploying Cisco ACI any time soon, I hope you find this before you migrate in.

      Great post!
      Shannon Yates

  • Have seen this ever since Windows Server 2012 Hyper-V, but never in previous versions, so we had a clue it was due to the new teaming. Struggled with it for our webserver farms. Luckily, I had a coworker who was very knowledgeable about the NetScaler side and could tell me what was happening there. Researching and switching to Hyper-V port mode corrected the problem, and it has been the standard for our logical switch configurations ever since. Good article. Say hi to Andy if you cross paths.

  • mikis says:

    We have been struggling with this magic in an environment with F5 BIG-IPs and Windows Hyper-V 2012. It was observed only on a certain type of traffic, mainly DNS/UDP outgoing from the VMs.
    We have chosen to disable Auto Last Hop on the F5 as the least painful and quickest solution for now. It looks like it helped.

  • Robert Jones says:

    This post is super helpful. I have been trying to get to the bottom of this for days. I was about to switch my NIC teaming to Hyper-V port before seeing this, but the article solidified it, and your explanation is perfect. Thanks so much.

  • Aaron says:

    Hello, we are experiencing a weird issue that I hoped was the one described, but after changing teaming to “Hyper-V Port” it remains the same.

    We see packets being dropped on all nodes, all VMs, almost always at the same time, even on Hyper-V nodes belonging to different clusters. We do have a physical load balancer (FortiADC) balancing VMs in Hyper-V.

    We are clueless about what is happening! Any ideas?

    • Eric Siron says:

      Is your only symptom dropped packets? I’m not sure that’s so weird. Anything else going on?
      Does your tool show that the source of the dropped packets is the load balancer?
