Hyper-V Troubleshooting & Performance Articles - Altaro DOJO | Hyper-V
https://www.altaro.com/hyper-v | Hyper-V guides, how-tos, tips, and expert advice for system admins and IT professionals

How to repair a broken NIC configuration in Azure
https://www.altaro.com/hyper-v/repair-broken-nic-azure/ (Fri, 23 Oct 2020)
Having a problem with a broken NIC configuration in Azure? This detailed solution, based on best practices, will help you repair it.

Following a customer call a few weeks ago, I thought it would be good to document how to fix a broken NIC configuration on an Azure VM.

The first question that comes to mind is: how is it possible to break a NIC configuration in Azure?

To be honest, it’s actually pretty simple. Let’s say you change the adapter properties from inside the virtual machine instead of via the Azure Portal: that will completely disconnect your VM from the Microsoft global network backend. If this happens, the following steps show how to recover such a VM.

For this guide, I’m using a Windows Server 2019 VM running on a DS1 v2 size with a private NIC configuration. I connect to the VM using Azure Bastion, which is already a pretty robust way of connecting to a VM.

Nothing fancy and no special Microsoft Magic up to here.

Step 1: Break the VM connection to the Microsoft Azure Backend

You connect to the VM and change the NIC configuration as shown below. This is a common mistake network admins or Azure newbies make when trying to set a static IP for a VM. You can read a guide here on how you should set a static IP for an Azure VM.
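
If you prefer PowerShell over the portal, the supported approach is to make the private IP static on the Azure fabric side rather than inside the guest. Below is a minimal sketch with the Az module; the resource group and NIC names (rg-altaro, Altaro1) and the address are placeholders for your own values:

    # Pin the private IP on the Azure side; the guest keeps using DHCP and receives this address
    $nic = Get-AzNetworkInterface -ResourceGroupName "rg-altaro" -Name "Altaro1"
    $nic.IpConfigurations[0].PrivateIpAllocationMethod = "Static"
    $nic.IpConfigurations[0].PrivateIpAddress = "10.0.0.10"
    $nic | Set-AzNetworkInterface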

Afterwards, a connection error will appear and you will no longer be able to connect to the VM.

At this point, most people working with Azure would just go and recover their work from a backup taken last night, but there is a much easier way to recover the virtual machine.

But before we can recover, we need to do some preparations.

Step 2: The preparation

The first preparation step is very important. If you have a public IP assigned to your virtual machine, you need to change that IP from dynamic to static. We will shut down the VM during the recovery process, and without that change, the assigned public IP would be lost and you would have to request a new one.
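
You can make that change in the portal, or with a short Az PowerShell sketch like the one below (the resource group and public IP names are placeholders):

    # Switch the public IP allocation from Dynamic to Static so it survives deallocation
    $pip = Get-AzPublicIpAddress -ResourceGroupName "rg-altaro" -Name "vm-altaro-pip"
    $pip.PublicIpAllocationMethod = "Static"
    $pip | Set-AzPublicIpAddress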

Please be aware that it is never a good idea, and not best practice, to assign a public IP directly to an Azure virtual machine. Always put at least an Azure Load Balancer in front of a VM; that improves your DDoS protection and overall security. If you use VMs in production, always follow the Azure security guidelines.

Now disassociate the public IP from the NIC of your VM.

After you change the public IP configuration, create a new network interface from the Azure Marketplace.

Now associate the public IP with the new network interface. In my case, I named it Altaro2, but your network interface name should follow your Azure naming and governance strategy.
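
If you would rather script the preparation, here is a hedged Az PowerShell sketch that creates the new NIC in the same virtual network and attaches the now-static public IP in one step; the virtual network name, subnet index, and resource names are assumptions you will need to adjust:

    # Create the replacement NIC on the same subnet and associate the existing public IP
    $vnet   = Get-AzVirtualNetwork -ResourceGroupName "rg-altaro" -Name "vnet-altaro"
    $pip    = Get-AzPublicIpAddress -ResourceGroupName "rg-altaro" -Name "vm-altaro-pip"
    $newNic = New-AzNetworkInterface -ResourceGroupName "rg-altaro" -Name "Altaro2" `
                -Location $vnet.Location -SubnetId $vnet.Subnets[0].Id -PublicIpAddressId $pip.Id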

Now we are done with the preparation work. Let’s start with recovering the Virtual Machine.

Step 3: Recovering the Virtual Machine

For the next step, we need to stop and deallocate the VM. To do so, stop the Azure VM via the Azure Portal or PowerShell.

For everyone now thinking, “but I need a maintenance window to shut down the VM”: remember, the VM is offline anyway, so every second spent on that thought is a second of future uptime lost … and yes, I have met such people. 😉

After the VM is shut down, go to the Networking configuration, attach the newly created network interface, and detach the old one. Exchanging a NIC is only possible while the VM is shut down and deallocated; you cannot exchange a NIC on a running VM.

Now detach the old Network Interface.

After the exchange, you can start your virtual machine with the new network interface.
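
The whole stop/swap/start sequence can also be scripted. The sketch below uses the Az module; the resource group, VM, and NIC names are placeholders, and it assumes the new NIC from the preparation step already exists:

    # Stop and deallocate, swap the NICs, then start the VM again
    $rg = "rg-altaro"; $vmName = "vm-altaro"
    Stop-AzVM -ResourceGroupName $rg -Name $vmName -Force            # stops and deallocates the VM
    $vm     = Get-AzVM -ResourceGroupName $rg -Name $vmName
    $newNic = Get-AzNetworkInterface -ResourceGroupName $rg -Name "Altaro2"
    $oldNic = Get-AzNetworkInterface -ResourceGroupName $rg -Name "Altaro1"
    Add-AzVMNetworkInterface    -VM $vm -Id $newNic.Id -Primary      # attach the new NIC
    Remove-AzVMNetworkInterface -VM $vm -NetworkInterfaceIDs $oldNic.Id
    Update-AzVM -ResourceGroupName $rg -VM $vm                       # commit the change
    Start-AzVM  -ResourceGroupName $rg -Name $vmName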

After the virtual machine has started, you should be able to connect to it using RDP, SSH, or Bastion, depending on what you prefer.

Step 4: Validation

As already explained, I use Azure Bastion to connect to my virtual machines. After exchanging the NICs, I am able to connect to the VM again, and I also have a fresh DHCP configuration for my Azure virtual machine.

After you have brought your virtual machine back into business, you should run a cleanup process to ensure you don’t waste resources or money, and don’t lose track of the changes.

Step 5: Cleanup

The last step of our recovery is the cleanup. I would suggest the following steps during the cleanup process:

  • Delete the old Network Interface
  • Set the public IP back to dynamic (both steps are shown in the sketch after this list)
  • Change the configuration of the new Network Interface to your preferred DNS, IP, and firewall settings
  • Document the changes in your change management and documentation
  • Start a Backup for the Virtual Machine
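
For the first two cleanup items, a hedged Az PowerShell sketch follows (resource names are placeholders again):

    # Remove the orphaned NIC and return the public IP to dynamic allocation
    Remove-AzNetworkInterface -ResourceGroupName "rg-altaro" -Name "Altaro1" -Force
    $pip = Get-AzPublicIpAddress -ResourceGroupName "rg-altaro" -Name "vm-altaro-pip"
    $pip.PublicIpAllocationMethod = "Dynamic"
    $pip | Set-AzPublicIpAddress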

Why does this solution work?

Now you may ask yourself why this works, or whether it is Azure-specific behaviour. The answer is simple: the recovery method is not related to Azure and can be used with other hypervisors and hyperscalers too.

I only used standard behaviour that is implemented in every Windows or Linux operating system. Operating systems normally key their network configuration to the MAC address and hardware identification number. When you create a new network interface, whether virtual or physical, it gets a new MAC address and hardware identification number.

When the operating system detects the new MAC and hardware ID, it has no stored configuration for it and falls back to the defaults. In most operating systems the default is DHCP, and that’s where the magic happens: the network interface is back in its default mode, connects to the Microsoft Azure backend network, and receives the IP and virtual network settings that we destroyed while playing with the NIC inside the operating system.

Closing notes

It sometimes happens that you wreck a NIC configuration, and the reasons for that could be:

  • You unknowingly change the NIC configuration or Firewall settings
  • You were working with Azure networks for the first time and tried to do things the way you did them aeons ago
  • You wrote a configuration script that trashed your settings
  • Someone tried a new “security approach”

These things happen often, and I hope this guide helps you get everything back on track. 🙂

As a funny closing note, I wanted to share something that happened to my customer and me during a live configuration session for Azure Virtual WAN. A coworker of my customer’s changed the IP on the Windows Server NIC of the terminal server we were working on. That gave us 30 minutes of extra fun during the session. 😉

How to Quickly Recover and Restore Windows Server Hyper-V Backups
https://www.altaro.com/hyper-v/recover-restore-windows-server/ (Thu, 20 Aug 2020)
This article explains the best practices for how to recover data from a backup and bring your services back online as fast as possible.

Perhaps the only thing worse than having a disaster strike your datacenter is the stress of recovering your data and services as quickly as possible. Most businesses need to operate 24 hours a day and any service outage will upset customers and your business will lose money. According to a 2016 study by the Ponemon Institute, the average datacenter outage costs enterprises over $750,000 and lasts about 85 minutes, losing the businesses roughly $9,000 per minute. While your organization may be operating at a smaller scale, any service downtime or data loss is going to hurt your reputation and may even jeopardize your career. This blog is going to give you the best practices for how to recover your data from a backup and bring your services online as fast as possible.

Automation to Decrease your Recovery Time Objective (RTO)

Automation is key when it comes to decreasing your Recovery Time Objective (RTO) and minimizing your downtime. Any time you have a manual step in the process, it is going to create a bottleneck. If the outage is caused by a natural disaster, relying on human intervention is particularly risky, as the datacenter may be inaccessible or remote connections may not be available. As you learn about the best practices of detection, alerting, recovery, startup, and verification below, consider how you could implement each of these steps in a fully automated fashion.

Detect Outages Faster

The first way to optimize your recovery speed is to detect the outage as quickly as possible. If you have an enterprise monitoring solution like System Center Operations Manager (SCOM), it will continually check the health of your application and its infrastructure, looking for errors or other problems. Even if you have developed an in-house application and do not have access to enterprise tools, you can use Windows Task Scheduler to set up tasks that automatically check system health by scanning event logs and then trigger recovery actions. There are also many free monitoring tools, such as Uptime Robot, which alert you any time your website goes offline.
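
As a rough illustration, a health-check script like the following could run every few minutes under Task Scheduler; the log filter, mail settings, and recovery script path are all placeholders rather than part of any particular product:

    # Scan the last five minutes of the System log for errors and react if any are found
    $recent = Get-WinEvent -FilterHashtable @{ LogName = 'System'; Level = 2; StartTime = (Get-Date).AddMinutes(-5) } -ErrorAction SilentlyContinue
    if ($recent) {
        Send-MailMessage -From 'monitor@contoso.com' -To 'oncall@contoso.com' `
            -Subject "Health check: $(@($recent).Count) error(s) in the System log" -SmtpServer 'smtp.contoso.com'
        & 'C:\Scripts\Start-Recovery.ps1'    # hypothetical script that kicks off the recovery workflow
    }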

Initiate the Recovery Process

Once the administrators have been alerted, immediately begin the recovery process.  Meanwhile, you should run a secondary health check on the system to make sure that you did not receive a false alert. This is a great background task to continually run during the recovery process to make sure that something like a cluster failover or transient network failure does not force your system into restarting if it is actually healthy. If the outage was indeed a false positive, then have a task prepared which will terminate the recovery process so that it does not interfere with the now-healthy system.

Select the Optimal Backup

If you restore your service and determine that there was data loss, you will need to decide whether to accept that loss or attempt to recover from the last good backup, which causes further downtime during the restoration. Make sure you can automatically determine whether you need to restore a full backup, or whether a differential backup is sufficient to give you a faster recovery time. By comparing the timestamp of the outage to the timestamp on your backup(s), you can determine which option will minimize the impact on your business. This can be done with a simple PowerShell script, but make sure that you know how to get this information from your backup provider and pass it into your script.
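
A trivial sketch of that comparison logic is shown below; the timestamps are hard-coded for illustration, but in practice they would come from your backup product’s API or reports:

    # Decide whether the differential backup is usable or only the full backup applies
    $outageTime     = Get-Date "2020-08-20 03:15"
    $fullBackupTime = Get-Date "2020-08-19 23:00"
    $diffBackupTime = Get-Date "2020-08-20 02:00"
    if ($diffBackupTime -gt $fullBackupTime -and $diffBackupTime -lt $outageTime) {
        "Restore the full backup, then apply the differential taken at $diffBackupTime"
    } else {
        "Restore only the full backup taken at $fullBackupTime"
    }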

Prioritize Backup Network Traffic

Once you have identified the best backup, you then need to copy it to your production system as fast as possible. A lot of organizations deprioritize their backup network since it is only used a few times a day or week. This may be acceptable during the backup process, but the network needs to be optimized during recovery. If you do need to restore a backup, consider running a script that prioritizes this traffic, such as by changing the quality of service (QoS) settings or disabling other traffic that uses the same network.
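
One way to do that on Windows is a temporary QoS policy that tags traffic to the backup repository with a high-priority DSCP value. This is only a sketch: the destination prefix is a placeholder, and your switches must honour DSCP marking for it to have any effect.

    # Tag restore traffic to the backup server (placeholder address) with DSCP 46 (Expedited Forwarding)
    New-NetQosPolicy -Name "RestoreTraffic" -IPDstPrefixMatchCondition "10.0.0.50/32" -DSCPAction 46
    # Remove the policy again once the restore has finished
    # Remove-NetQosPolicy -Name "RestoreTraffic" -Confirm:$false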

Provision Fast Disks for Recovery

Next, consider the storage media to which the backup is copied before the restoration happens. Try to use your fastest SSD disks to maximize the speed at which the backup is restored. If you decided to back up your data to a tape drive, you will likely have high copy speeds during restoration. However, tape drives usually require manual intervention to find and mount the right tape, which should generally be avoided if you want a fully automated process. You can learn more about the tradeoffs of using tape drives and other media here.

Restart Services and Applications

Once your backup has been restored, you need to restart the services and applications. If you are restoring to a virtual machine (VM), you can optimize its startup time by maximizing the memory allocated to it during startup and operation. You can also configure VM prioritization to ensure that this critical VM starts first in case it is competing with other VMs to launch on a host which has recently crashed. Enable QoS on your virtual network adapters to ensure that traffic flows through to the guest operating system as quickly as possible, which will speed up the time to restore a backup within the VM and also help clients reconnect faster. Whether you are running the application within a VM or on bare metal, you can also use Task Manager to raise the priority of the important processes.
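
On Hyper-V, those tweaks map to a few PowerShell settings. The sketch below assumes a VM named SQL01 and a virtual switch configured for weight-based bandwidth management; treat the numbers as examples, not recommendations:

    # Give the critical VM generous startup memory, start it automatically, and weight its vNIC higher
    Set-VMMemory         -VMName "SQL01" -StartupBytes 16GB
    Set-VM               -Name   "SQL01" -AutomaticStartAction Start -AutomaticStartDelay 0
    Set-VMNetworkAdapter -VMName "SQL01" -MinimumBandwidthWeight 80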

Verify that the Recovery Worked

Now verify that your backup was restored correctly and your application is functioning as expected by running some quick test cases.  If you feel confident that those tests worked, then you can allow users to reconnect.  If those tests fail, then work backward through the workflow to try to determine the bottleneck, or simply roll back to the next “good” backup and try the process again.

Regularly Test Backup and Recovery

Anytime you need to restore from a backup, it will be a frustrating experience, which is why testing throughout your application development lifecycle is critical.  Any single point of failure can cause your backup or recovery to fail, which is why this needs to be part of your regular business operations.  Once your systems have been restored, always make sure your IT department does a thorough investigation into what caused the outage, what worked well in the recovery, and what areas could be improved.  Review the time each step took to complete and ask yourself whether any of these should be optimized.  It is also a good best practice to write up a formal report which can be saved and referred to in the future, even if you have moved on to a different company.

Top backup software providers like Altaro can help you throughout the process by offering backup solutions for Hyper-V, Azure, O365, and PCs, with the Altaro API interface available for backup automation.

No matter how well you can prepare your datacenter, disasters can happen, so make sure that you have done all you can to try to recover your data – so that you can save your company!

RTO and RPO: Understanding Disaster Recovery Times
https://www.altaro.com/hyper-v/rto-rpo-disaster-recovery/ (Tue, 07 Apr 2020)
Everything you need in order to understand and plan your disaster recovery times, including Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

You will focus much of your disaster recovery planning (and rightly so) on the data that you need to capture. The best way to find out if your current strategy does this properly is to try our acid test. However, backup coverage only accounts for part of a proper overall plan. Your larger design must include a thorough model of recovery goals, specifically Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Ideally, a restore process would contain absolutely everything. Practically, expect that to never happen. This article explains the risks and options of when and how quickly operations can and should resume following systems failure.

Table of Contents

Disaster Recovery Time in a Nutshell

What is Recovery Time Objective?

What is Recovery Point Objective?

Challenges Against Short RTOs and RPOs

RTO Challenges

RPO Challenges

Outlining Organizational Desires

Considering the Availability and Impact of Solutions

Instant Data Replication

Short Interval Data Replication

Ransomware Considerations for Replication

Short Interval Backup

Long Interval Backup

Ransomware Considerations for Backup

Using Multiple RTOs and RPOs

Leveraging Rotation and Retention Policies

Minimizing Rotation Risks

Coalescing into a Disaster Recovery Plan

Disaster Recovery Time in a Nutshell

If a catastrophe strikes that requires recovery from backup media, most people will first ask: “How long until we can get up and running?” That’s an important question, but not the only time-oriented problem that you face. Additionally, and perhaps more importantly, you must ask: “How much already-completed operational time can we afford to lose?” The business-continuity industry represents the answers to those questions in the acronyms RTO and RPO, respectively.

What is Recovery Time Objective?

Your Recovery Time Objective (RTO) sets the expectation for the answer to, “How long until we can get going again?” Just break the words out into a longer sentence: “It is the objective for the amount of time between the data loss event and recovery.”

Of course, we would like to make all of our recovery times instant. But, we also know that will not happen. So, you need to decide in advance how much downtime you can tolerate, and strategize accordingly. Do not wait until the midst of a calamity to declare, “We need to get online NOW!” By that point, it will be too late. Your organization needs to build up those objectives in advance. Budgets and capabilities will define the boundaries of your plan. Before we investigate that further, let’s consider the other time-based recovery metric.

What is Recovery Point Objective?

We don’t just want to minimize the time that we lose; we also want to minimize the amount of data that we lose. Often, we frame that in terms of retention policies — how far back in time we need to be able to access data. However, failures usually cause a loss of systems during run time. Unless all of your systems continually duplicate data as it enters the system, you will lose something. Because backups generally operate on a timer of some sort, you can often describe that potential loss in a time unit, just as you can with recovery times. We refer to the maximum total acceptable amount of lost time as a Recovery Point Objective (RPO).

As with RTOs, shorter RPOs are better. The shorter the amount of time since a recovery point, the less overall data lost. Unfortunately, reduced RPOs take a heavier toll on resources. You will need to balance what you can achieve against what your business units want. Allow plenty of time for discussions on this subject.

Challenges Against Short RTOs and RPOs

First, you need to understand what will prevent you from achieving instant RTOs and RPOs. More importantly, you need to ensure that the critical stakeholders in your organization understand it. These objectives mean setting reasonable expectations for your managers and users at least as much as they mean setting goals for your IT staff.

RTO Challenges

We can define a handful of generic obstacles to quick recovery times:

  • Time to acquire, configure, and deploy replacement hardware
  • Effort and time to move into new buildings
  • Need to retrieve or connect to backup media and sources
  • Personnel effort
  • Vendor engagement

You may also face some barriers specific to your organization, such as:

  • Prerequisite procedures
  • Involvement of key personnel
  • Regulatory reporting

Make sure to clearly document all known conditions that add time to recovery efforts. They can help you to establish a recovery checklist. When someone requests a progress report during an outage, you can indicate the current point in the documentation. That will save time and reduce frustration.

RPO Challenges

We could create a similar list for RPO challenges as we did for RTO challenges. Instead, we will use one sentence to summarize them all: “The backup frequency establishes the minimum RPO”. In order to take more frequent backups, you need a fast backup system with adequate amounts of storage. So, your ability to bring resources to bear on the problem directly affects RPO length. You have a variety of solutions to choose from that can help.

Outlining Organizational Desires

Before expending much effort figuring out what you can do, find out what you must do. Unless you happen to run everything, you will need input from others. Start broadly with the same type of questions that we asked above: “How long can you tolerate downtime during recovery?” and “How far back from a catastrophic event can you re-enter data?” Explain RTOs and RPOs. Ensure that everyone understands that RPO refers to a loss of recent data, not long-term historical data.

These discussions may require a fair bit of time and multiple meetings. Suggest that managers work with their staff on what-if scenarios. They can even simulate operations without access to systems. For your part, you might need to discover the costs associated with solutions that can meet different RPO and RTO levels. You do not need to provide exact figures, but you should be ready and able to answer ballpark questions. You should also know the options available at different spend levels.

Considering the Availability and Impact of Solutions

To some degree, the amount that you spend controls the length of your RTOs and RPOs. That has limits; not all vendors provide the same value per dollar spent. But, some institutions set out to spend as close to nothing as possible on backup. While most backup software vendors do offer a free level of their product, none of them make their best features available at no charge. Organizations that try to spend nothing on their backup software will have high RTOs and RPOs and may encounter unexpected barriers. Even if you find a free solution that does what you need, no one makes storage space and equipment available for free. You need to find a balance between cost and capability that your company can accept.

To help you understand your choices, we will consider different tiers of data protection.

Instant Data Replication

For the lowest RPO, only real-time replication will suffice. In real-time replication, every write to live storage is also written to backup storage. You can achieve this in many ways, but the most reliable involves dedicated hardware. You will spend a lot, but you can reduce your RPO effectively to zero. Even a real-time replication system can drop active transactions, so never expect a complete shield against data loss.

Real-time replication systems have a very high associated cost. For the most reliable protection, they will need to span geography as well. If you just replicate to another room down the hall and a fire destroys the entire building, your replication system will not save you. So, you will need multiple locations, very high speed interconnects, and capable storage systems.

Short Interval Data Replication

If you can sustain a few minutes of lost information, then you usually find much lower price tags for short-interval replication technology. Unlike real-time replication, software can handle the load of delayed replication, so you will find more solutions. As an example, Altaro VM Backup offers Continuous Data Protection (CDP), which cuts your RPO to as low as five minutes.

As with instant replication, you want your short-interval replication to span geographic locations if possible. But, you might not need to spend as much on networking, as the delays in transmission give transfers more time to complete.

Ransomware Considerations for Replication

You always need to worry about data corruption in replication. Ransomware adds a new twist but presents the same basic problem. Something damages your real-time data. None the wiser, your replication system makes a faithful copy of that corrupted data. The corruption or ransomware has turned both your live data and your replicated data into useless jumbles of bits.

Anti-malware and safe computing practices present your strongest front-line protection against ransomware. However, you cannot rely on them alone. The upshot: you cannot rely on replication systems alone for backup. A secondary implication: even though replication provides very short RPOs, you cannot guarantee them.

Short Interval Backup

You can use most traditional backup software in short intervals. Sometimes, those intervals can be just, or nearly, as short as short-term replication intervals. The real difference between replication and backup is the number of possible copies of duplicated data. Replication usually provides only one copy of live data — perhaps two or three at the most — and no historical copies. Backup programs differ in how many unique simultaneous copies that they will make, but all will make multiple historical copies. Even better, historical copies can usually exist offline.

You do not need to set a goal of only a few minutes for short interval backups. To balance protection and costs, you might space them out in terms of hours. You can also leverage delta, incremental, and differential backups to reduce total space usage. Sometimes, your technologies have built-in solutions that can help. As an example, SQL administrators commonly use transaction log backups on a short rotation to make short backups to a local disk. They perform a full backup each night that their regular backup system captures. If a failure occurs during the day that does not wipe out storage, they can restore the previous night’s full backup and replay the available transaction log backups.
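
As an illustration of that SQL pattern, the SqlServer PowerShell module can take the log backups on a schedule; the instance, database name, and path below are placeholders:

    # Hourly transaction log backup to local disk, replayable on top of last night's full backup
    Import-Module SqlServer
    Backup-SqlDatabase -ServerInstance "localhost" -Database "Orders" -BackupAction Log `
        -BackupFile "D:\SQLLogBackups\Orders_$(Get-Date -Format yyyyMMdd_HHmm).trn"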

Long Interval Backup

At the “lowest” tier, we find the oldest solution: the reliable nightly backup. This usually costs the least in terms of software licenses and hardware. Perhaps counter-intuitively, it also provides the most resilient solution. With longer intervals, you also get longer term storage choices. You get three major benefits from these backups: historical data preservation, protection against data corruption, and offline storage. We will explore each in the upcoming sections.

Ransomware Considerations for Backup

Because we use a backup to create distinct copies, it has some built-in protection against data corruption, including ransomware. As long as the ransomware has no access to a backup copy, it cannot corrupt that copy. First and foremost, that means that you need to maintain offline backups. Replication requires essentially constant connectivity to its replicas, so only backup can work under this restriction. Second, it means that you need to exercise caution when you execute restore procedures. Some ransomware authors have made their malware aware of several common backup applications, and they will hijack it to corrupt backups whenever possible. You can only protect your offline data copies by attaching them to known-safe systems.

Using Multiple RTOs and RPOs

You will need to structure your systems into multiple RTO and RPO categories. Some outages will not require much time to recover from; some will require different solutions. Even though we tend to think primarily in terms of data during disaster recovery planning, you must consider equipment as well. For instance, if your sales division prints its own monthly flyers and you lose a printer, then you need to establish RTOs, RPOs, downtime procedures, and recovery processes just for those print devices.

You also need to establish multiple levels for your data, especially when you have multiple protection systems. For example, if you have both replication and backup technologies in operation, then you will set one RTO/RPO pair for times when the replication works, and another RTO/RPO pair for when you must resort to long-term backup. That could happen due to ransomware or some other data corruption event, but it can also happen if someone accidentally deletes something important.

To start this planning, establish “Best Case” and “Worst Case” plans and processes for your individual systems.

Leveraging Rotation and Retention Policies

For your final exercise in time-based disaster recovery designs, we will look at rotation and retention policies. “Rotation” comes from the days of tape backups, when we would decide how often to overwrite old copies of data. Now that high-capacity external disks have reached a low-cost point, many businesses have moved away from tape. You may not overwrite media anymore, or at least not at the same frequency. Retention policies dictate how long you must retain at least one copy of a given piece of information. These two policies directly relate to each other.

In today’s terms, think of “rotation” more in terms of unique copies of data. Backup systems have used “differential” and “incremental” backups for a very long time. The former is a complete record of changes since the last full backup; the latter is a record of changes since the last backup of any kind. Newer backup products add “delta” and deduplication capabilities. A “delta” backup operates like a differential or incremental backup, but within files or blocks. Deduplication keeps only one copy of a block of bits, regardless of how many times it appears within an entire backup set. These technologies reduce backup time and storage space needs… at a cost.

Minimizing Rotation Risks

These speed-enhancing and space-reducing improvements have one major cost: they reduce the total number of available unique backup copies. As long as nothing goes wrong with your media, this will never cause you a problem. However, if one of the full backups suffers damage, that invalidates all dependent partial backups. You must balance the number of full backups that you take against the amount of time and bandwidth necessary to capture them.

As one minimizing strategy, target your full backup operations to occur during your organization’s quietest periods. If you do not operate 24 hours per day, that might allow for nightly full backups. If you have low volume weekends, you might take full backups on Saturdays or Sundays. You can intersperse full backups on holidays.

Recovery Time Actual (RTA) and Recovery Point Actual (RPA)

You may encounter a pair of uncommon terms: recovery time actual (RTA) and recovery point actual (RPA). These refer to measurements taken during testing or live recovery operations. They have some value, but beware falling into a needless circular trap.

Recovery time actual matches up with recovery time objective. Your RTO establishes the acceptable amount of time to lose due to a problem. RTA measures how long it took to return to operation after a failure (whether real or for testing purposes). Similarly, recovery point actual measures how well the restoration process aligned with your RPO.

While RTA and RPA metrics provide some value, they should not feature prominently in your plan. During testing, they can help determine if you have set reasonable objectives or if you need to alter your plan. They have much less applicability after a true failure, though. You can include them in typical “post-mortem” and “lessons learned” write-ups. Beyond that, you won’t find much use.

I read one article on RTA that compares RTO to an outage with your Internet provider. The author points out that we care less about how long the provider expects an outage to last and a lot more about how long it actually lasts. While true, this analogy has no accurate comparison to a valid RTO. We will always have a desire for zero downtime and, failing that, a desire for zero recovery time. An RTO should not set the expected downtime, but the acceptable downtime. In the private consumer Internet space, we mostly suffer at the mercy of the company’s ability. When establishing RTOs for our own enterprise, we can control purchases and processes that affect the viability of short RTOs. When recovering from a failure, the nature of the problem will always impact RTA, meaning that today’s RTA may have absolutely no bearing on the next RTA.

Few other authors go into any depth on RTA or RPA. RPA has even less to talk about than RTA. Either you achieved your RPO or you did not. If you got lucky, then maybe you did not lose all the data changes between the latest backup and the failure. Maybe you can glean something valuable from the experience, but the RPA itself doesn’t mean much of anything.

When performing a recovery, whether test or real, document what you can. You will need some numbers to report afterward. Mostly, you want to measure how well you achieved your objectives and note anything that you can change in your plan for the better.

Certifications and Regulatory Compliance in Highly Regulated Industries

Most organizations can follow an internally guided process for documenting and performing their backup and disaster recovery processes. Some must meet higher external standards.

First, work with industry-focused legal counsel. Finance, healthcare, insurance, and other highly regulated businesses may need to meet certain regulatory compliance requirements. Regional rules, such as GDPR, will impact all businesses within their jurisdiction regardless of size or function.

Second, consider following outside certified practice guides. These can help to ensure that you meet all legal expectations, even those that you might not know about. Two highly regarded institutions provide publications: the International Organization for Standardization’s ISO/IEC 27031, “Guidelines for information and communication technology readiness for business continuity”, and the National Institute of Standards and Technology’s Special Publication 800-34, “Contingency Planning Guide for Federal Information Systems”.

Coalescing into a Disaster Recovery Plan

As you design your disaster recovery plan, review the sections in this article as necessary. Remember that all operations require time, equipment, and personnel. Faster backup and restore operations always require a trade-off of expense and/or resilience. Modest lengthening of allowable RTOs and RPOs can result in major cost and effort savings. Make certain that the key members of your organization understand how all of these numbers will impact them and their operations during an outage.

If you need some help defining RTO and RPO in your organization, let me know in the comments section below and I will help you out!

The Acid Test for Your Backup Strategy
https://www.altaro.com/hyper-v/test-backup-strategy/ (Mon, 30 Mar 2020)
Everything you need to ensure the best backup and restore strategy for your media in case of unexpected disasters or failures.

For the first several years that I supported server environments, I spent most of my time working with backup systems. I noticed that almost everyone did their due diligence in performing backups. Most people took an adequate responsibility to verify that their scheduled backups ran without error. However, almost no one ever checked that they could actually restore from a backup — until disaster struck. I gathered a lot of sorrowful stories during those years. I want to use those experiences to help you avert a similar tragedy.

Successful Backups Do Not Guarantee Successful Restores

Fortunately, a lot of the problems that I dealt with in those days have almost disappeared due to technological advancements. But, that only means that you have better odds of a successful restore, not that you have a zero chance of failure. Restore failures typically mean that something unexpected happened to your backup media. Things that I’ve encountered:

  • Staff inadvertently overwrote a full backup copy with an incremental or differential backup
  • No one retained the necessary decryption information
  • Media was lost or damaged
  • Media degraded to uselessness
  • Staff did not know how to perform a restore — sometimes with disastrous outcomes

I’m sure that some of you have your own horror stories.

These risks apply to all organizations. Sometimes we manage to convince ourselves that we have immunity to some or all of them, but you can’t get there without extra effort. Let’s break down some of these line items.

People Represent the Weakest Link

We would all like to believe that our staff will never make errors and that the people who need to operate the backup system have the ability to do so. However, as part of your disaster recovery planning, you must assume that you cannot predict the state or availability of any individual. If only a few people know how to use your backup application, then those people become part of your risk profile.

You have a few simple ways to address these concerns:

  • Periodically test the restore process
  • Document the restore process and keep the documentation updated
  • Non-IT personnel need knowledge and practice with backup and restore operations
  • Non-IT personnel need to know how to get help with the application

It’s reasonable to expect that you would call your backup vendor for help in the event of an emergency that prevented your best people from performing restores. However, in many organizations without a proper disaster recovery plan, no one outside of IT even knows who to call. The knowledge inside any company naturally tends to arrange itself in silos, but you must make sure to spread at least the bare minimum information.

Technology Does Fail

I remember many shock and horror reactions when a company owner learned that we could not read the data from their backup tapes. A few times, these turned into grief and loss counselling sessions as they realized that they were facing a critical — or even complete — data loss situation. Tape has its own particular risk profile, and lots of businesses have stopped using it in favour of on-premises disk-based storage or cloud-based solutions. However, all backup storage technologies present some kind of risk.

In my experience, data degradation occurred most frequently. You might see this called other things, my favourite being “bit rot”. Whatever you call it, it all means the same thing: the data currently on the media is not the same data that you recorded. That can happen just because magnetic storage devices have susceptibilities. That means that no one made any mistakes — the media just didn’t last. For all media types, we can establish an average for failure rates. But, we have absolutely no guarantees on the shelf life for any individual unit. I have seen data pull cleanly off decade-old media; I have seen week-old backups fail miserably.

Unexpectedly, newer technology can make things worse. In our race to cut costs, we frequently employ newer ways to save space and time. In the past, we had only compression and incremental/differential solutions. Now, we have tools that can deduplicate across several backup sets and at multiple levels. We often put a lot of reliance on the single copy of a bit.

How to Test your Backup Strategy

The best way to identify problems is to break-test your environment and find its weaknesses. Leveraging test restores will help you identify backup reliability issues and solve them. Simply put, you cannot know that you have a good backup unless you can perform a good restore. You cannot know that your staff can perform a restore unless they perform a restore. For maximum effect, you need to plan tests to occur on a regular basis.

Some tools, like Altaro VM Backup, have built-in tools to make tests easy. Altaro VM Backup provides a “Test & Verify Backups” wizard to help you perform on-demand tests and a “Schedule Test Drills” feature to help you automate the process.

If your tool does not have such a feature, you can still use it to make certain that your data will be there when you need it. It should have some way to restore a separate or redirected copy. So, instead of overwriting your live data, you can create a duplicate in another place where you can safely examine and verify it.

Test Restore Scenario

In the past, we would often simply restore some data files to a shared location and use a simple comparison tool. Now that we use virtual machines for so much, we can do a great deal more. I’ll show one example of a test that I use. In my system, all of these are Hyper-V VMs. You’ll have to adjust accordingly for other technologies.

Using your tool, restore copies of:

  • A domain controller
  • A SQL server
  • A front-end server dependent on the SQL server

On the host that you restored those VMs to, create a private virtual switch. Connect each virtual machine to it. Spin up the copied domain controller, then the copied SQL server, then the copied front-end. Use the VM connect console to verify that all of them work as expected.
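
If you want to script that drill, a hedged Hyper-V sketch follows; the restored VM names and the startup delay are placeholders for your own environment:

    # Isolate the restored copies on a private switch and boot them in dependency order
    New-VMSwitch -Name "RestoreTest" -SwitchType Private
    "DC01-restore","SQL01-restore","WEB01-restore" | ForEach-Object {
        Connect-VMNetworkAdapter -VMName $_ -SwitchName "RestoreTest"
    }
    Start-VM -Name "DC01-restore"
    Start-Sleep -Seconds 180              # give the domain controller time to settle
    Start-VM -Name "SQL01-restore"
    Start-VM -Name "WEB01-restore"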

Create test restore scenarios of your own! Make sure that they match a real-world scenario that your organization would rely on after a disaster.

Remote Work: Top 5 Challenges (and Solutions) for IT Admins
https://www.altaro.com/hyper-v/remote-work-it-admins/ (Thu, 26 Mar 2020)
Discover the top 5 challenges facing IT admins managing remote operations, alongside best-practice solutions from Microsoft and other service providers.

The devastating effects of the COVID-19 pandemic are plain for all to see. Our sympathies go out to everyone struggling to cope with this tragic situation. We are truly indebted to the health workers around the world who are working tirelessly to protect us. But with most countries now restricting movements and public gatherings, perhaps the unsung heroes of the current situation from an economic perspective are the IT admins who are keeping the world’s workforce productive.

Virtually every industry has been impacted by the global pandemic in some way, but the businesses that are fortunate enough to be able to continue operations remotely have to adapt – and quickly. Larger enterprises have been doing this for years and will already have a robust remote work infrastructure in place. But for those organizations which are transitioning to this new work dynamic, it introduces many new challenges which IT departments are expected to solve. This article will review the top 5 challenges facing IT admins managing remote operations, along with solutions from Microsoft and other service providers.

1. Drops in Productivity and Business Continuity 

One of the first things that your company will see is a drop in productivity as business operations slow down. This will likely be a combination of remote work challenges due to technology and personal reasons. When people are at home, they are generally more distracted, and they may even be taking care of their children during business hours.

Companies should expect these challenges and realize that this could also change work habits. For example, productivity could spike in the early mornings and evenings when children are usually sleeping. Encourage managers to ask their staff about any changes in employee behaviour so that the IT department can be prepared. Monitoring Office 365 activity is possible via usage reports:

Monitoring user activity in Microsoft 365

Office 365 Usage Executive Summary

All your staff should have laptops with your company’s line of business applications and productivity suite, which could include Microsoft Office 365 or Google G Suite. While most of your employees will be familiar with products like Word and Excel, explore whether using SharePoint for content collaboration or OneNote for shared workspaces will help different business units.

It probably goes without saying, but make sure that you are using a reliable virtual communication or meeting tool, such as Microsoft Teams or Google Hangouts. Encourage managers to move all their regular in-person meetings to the virtual format to keep their team on track.

If you haven’t already, provide a password-protected webpage which employees can visit, containing a current version of all remote work guidelines, documentation, and service notifications, plus a place to submit help requests. This is particularly important during a transition period with a lot of new remote users. Make sure the guidelines are properly followed, as this is one of the most important stages at which to be fully aware of vulnerability issues and to minimize risk.

Microsoft System Center Service Manager (SCSM) can provide this as an ITSM solution with an online help portal. Most IT organizations will see an increase in support requests, so consider redirecting more of your staff towards customer support. IT will be critical in keeping the business running, so a quick response time to support requests is essential.

It is important to set clear expectations with your business stakeholders that there may be disruptions to IT services during the transition phase and encourage teams to have backup communication plans. This could be as simple as sharing each other’s phone numbers to ensure business continuity in case of an unplanned outage. You want to make the transition as easy as possible so that you can be a hero of your company.

2. Managing Network and Connectivity Issues

Since the coronavirus outbreak started, cities throughout the world have seen a 10 to 40% increase in Internet traffic. This is due to more people working from home, plus an increased number of children and others in isolation viewing more internet content during business hours from the same building. This has caused many disruptions throughout the internet; however, Internet Service Providers (ISPs) have been scaling up their infrastructure to support this growth.

While you cannot change these internet-wide problems we are all facing, you can provide your workforce with some best practices to help them address connectivity issues. Remind them that any other network usage within their building, including streaming TV, video games, or music, will reduce their available bandwidth.

Some ISPs and cellular providers are offering their residential and commercial customers discounted service upgrades and no data overage charges – check as soon as possible whether your company’s internet provider offers this. Furthermore, compile a list of these organizations for your employees. If regular Internet access becomes too slow, remind them that they may be able to set up a mobile hotspot and tether their laptop to their cell phone’s Internet connection.

Once your staff has remotely connected to your infrastructure, there are different ways that you can optimize the network traffic. First, scale out your network hardware. If you are running your services in a public or private cloud, you can take advantage of network virtualization and network function virtualization (NFV) by deploying virtualized routers, switches, and load balancers.

If your datacenter is using physical networking hardware, you may want to invest in additional equipment, but be wary of delays if you have to ship anything internationally. If you are expecting an increase in remote users, also consider that you may need to increase the capacity of your entire infrastructure, including virtual machines and storage network throughput.

You should also prioritize network traffic so that requests for business-critical services or VoIP communications are more likely to go through. Network prioritization, or Quality of Service (QoS), can be controlled from various points in your physical and virtual networks. If you are using Hyper-V, you can use storage QoS to prioritize disk access to important data. You can also learn additional virtual networking best practices from Altaro.
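
As a quick illustration of Hyper-V storage QoS, the sketch below caps a low-priority VM and reserves IOPS for a critical one; the VM names, controller locations, and IOPS figures are placeholders:

    # Limit a noisy VM's virtual disk and guarantee a floor for the business-critical VM
    Set-VMHardDiskDrive -VMName "FileSync01" -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 0 -MaximumIOPS 500
    Set-VMHardDiskDrive -VMName "VoIP01"     -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 0 -MinimumIOPS 1000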

Make sure you have a backup plan in place for different types of outages. Some organizations have deployed a redundant copy of their services in a secondary site or public cloud, so you may want to consider turning these resources on to support the extra demand. However, cloud providers are not immune to outages (more about that in Protecting User Data), so you should always have a backup plan in place, especially for communications services: a PABX (e.g. 3CX) for chat and video outages, and email continuity solutions (like those of Proofpoint and Mimecast) for email outages.

If you do experience a service drop, it may not be immediately reported by the service provider as they often prefer to have an official reason and solution in place before they communicate it. Downdetector.com is a great way to quickly check if it is a genuine service drop or it’s just your connection.

3. Enabling Access to Remote Resources 

Once your users have network access, you want to restrict or grant access to specific resources. Windows Server Active Directory for role-based access control (RBAC) is just the start – you should create groups in Active Directory and assign different users to each. Configure and manage security policies at the group level to simplify administration when users join or leave the organization. This is also where you can implement VPN restrictions (read more in Ensuring High Security Standards below).
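
A minimal sketch of that group-based approach with the ActiveDirectory module is shown below; the group, OU, and user names are purely illustrative:

    # One security group per remote-access role; VPN and policy scoping then target the group
    New-ADGroup -Name "RemoteWorkers-Finance" -GroupScope Global -GroupCategory Security `
        -Path "OU=Groups,DC=contoso,DC=com"
    Add-ADGroupMember -Identity "RemoteWorkers-Finance" -Members "jsmith","mjones"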

Some of your users may need to access the operating system of a server, which can be provided through Microsoft Remote Desktop (RDP). This application allows the user to connect directly to a workstation or virtual machine to get access to all the services running on it. Make sure that you provide guidance, ideally with graphical step-by-step instructions, for any new processes. Ensure that all employees have a clear escalation path to tech support.

4. Ensuring High Security Standards

In addition to using Active Directory with RBAC, you should take time to deploy security best practices for remote access. This starts with education, making sure that employees are using private Internet connections or connecting via a VPN if they access your services via a public network. Also, ensure that you are following any industry-specific security requirements if you are working with sensitive data or in a regulated industry.

You should already have deployed a strong firewall which you can manage remotely. If the firewall is virtualized as a network function virtualization (NFV) device, then it can be dynamically updated and scaled. By default, disable all inbound and outbound traffic with a “zero trust” security model, and only allow access to specific protocols and ports that you are intentionally using. It is essential to turn Multi-Factor Authentication on for Office 365 access to ensure that remote users go through extra validation via clicking on an email link or entering a code sent to their mobile device.  
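
For the Windows-level part of that default-deny posture, a hedged sketch looks like this; port 443 stands in for whatever services you intentionally expose:

    # Block inbound traffic by default on all profiles, then allow only what you explicitly need
    Set-NetFirewallProfile -Profile Domain,Private,Public -DefaultInboundAction Block
    New-NetFirewallRule -DisplayName "Allow HTTPS inbound" -Direction Inbound -Protocol TCP -LocalPort 443 -Action Allow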

For advanced control, use Azure Conditional Access, which lets IT departments dynamically grant and revoke access based on different variables. A good example of this is using conditional policies to restrict access from certain countries if you know you shouldn’t have any users logging on from those regions.

Take advantage of any native security tools offered by your cloud provider, such as Azure Security Center. Portals like this provide an IT department with best practices and reports and use AI-based analytics to find anomalies or unusual traffic patterns.

For more information about Azure Security Center, watch our on-demand webinar Azure Security Center: How to Protect Your Datacenter with Next Generation Security

5. Protecting User Data

Now that all of your employees are working remotely from laptops, you want to ensure that their data is backed up and can quickly be recovered. The easiest option is to have your employees place their files on cloud storage, such as Microsoft OneDrive for Business, Microsoft SharePoint, or Dropbox. This helps by providing a centralized location so that even if the user loses their laptop, a copy of the data is preserved.

However, these copies are effectively redundant copies which are not a replacement for backup. These redundant copies enable shared documents to function with approved users able to access the document for collaboration. But this data is constantly rewritten and thus not a genuine alternative to stored backup copies.

If your company is using Microsoft O365 for email, then one copy of that user’s mailbox is available in the cloud. Keep in mind that while the email is always accessible, only one copy of all O365 data is stored by Microsoft. Again, however, this is effectively redundant data. Thus, the native backup provided by Microsoft will not be enough for most companies, considering that with more people working from home, the chances of software outages and blackouts increase due to the strain of supporting the wave of new users.

This was all too real to users who experienced widespread outages of Teams and Exchange Online on Tuesday 17 March 2020.

Microsoft Teams service health

Office 365 Service Health on Tuesday 17 March 2020  

One way to prepare for these events is to have alternative options for the key apps to continue operations and business continuity (discussed above). However, these events also prove that Microsoft is not infallible, and proper backup is essential to ensure you don’t lose the vital data your company relies upon. Therefore, you should also be using a solution such as Altaro Office 365 Backup to create a reliable and secure backup of every O365 mailbox, SharePoint document and OneDrive for Business file.

Start your free trial of Altaro Office 365 Backup

If your users are not automatically storing their files in the cloud, then at least make sure that their files on their laptops are being regularly and automatically backed up. You can configure global backup settings across your organization using Group Policy Management. This will force backups to be taken on every laptop at certain intervals for certain essential business applications. Consider having the backups automatically copied to a remote file storage location during off-hours.
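
One simple way to implement that on each laptop is a scheduled off-hours copy job; the share path, folders, and schedule below are placeholders, and in practice you would deploy the task through Group Policy rather than by hand:

    # Nightly mirror of the user's Documents folder to a per-computer folder on a file share
    $action  = New-ScheduledTaskAction -Execute "cmd.exe" `
        -Argument '/c robocopy "%USERPROFILE%\Documents" "\\backupsrv\laptops\%COMPUTERNAME%" /MIR /R:2 /W:5'
    $trigger = New-ScheduledTaskTrigger -Daily -At 2am
    Register-ScheduledTask -TaskName "LaptopDocBackup" -Action $action -Trigger $trigger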

You now know how to prepare for the top five challenges you will face as your employees start working from home. If this sounds too complicated for your IT team’s skills, consider finding a managed service provider (MSP) or Cloud Service Provider (CSP) to help you through the transition.

Keep in mind that these best practices only cover the needs of your employees. If your business offers technology services, then consider the impact of the home workforce on your line-of-business applications, and scale them up or down as appropriate. Using these tips, you’ll get ahead of the challenges that your organization will face as more of the staff works remotely. These are testing times for all of us, but as an IT admin, you can now become one of your company’s heroes by keeping your business running in this new and challenging world.

Good luck! 

6 Hardware Tweaks that will Skyrocket your Hyper-V Performance
https://www.altaro.com/hyper-v/hardware-tweaks-hyper-v-performance/ (Wed, 09 Oct 2019)
Simple changes to your hardware settings can drastically improve Hyper-V performance. Find out what you need to change to optimize your Hyper-V environment.

The post 6 Hardware Tweaks that will Skyrocket your Hyper-V Performance appeared first on Altaro DOJO | Hyper-V.

]]>

Few Hyper-V topics burn up the Internet quite like “performance”. No matter how fast it goes, we always want it to go faster. If you search even a little, you’ll find many articles with long lists of ways to improve Hyper-V’s performance. The less focused articles start with general Windows performance tips and sprinkle some Hyper-V-flavored spice on them. I want to use this article to tighten the focus down on Hyper-V hardware settings only. That means it won’t be as long as some others; I’ll just think of that as wasting less of your time. So in the name of speed, let’s get right into it!

Upgrade your system

I would prefer if everyone just knew this upfront. Unfortunately, it seems like I need to frequently remind folks that hardware cannot exceed its capabilities. So, every performance article I write will always include this point front-and-center. Each piece of hardware has an upper limit on maximum speed. Where that speed barrier lies in comparison to other hardware in the same category almost always correlates directly with the cost. You cannot tweak a go-cart to outrun a Corvette without spending at least as much money as just buying a Corvette — and that’s without considering the time element. If you bought slow hardware, then you will have a slow Hyper-V environment.

Fortunately, this point has a corollary: don’t panic. Production systems, especially server-class systems, almost never experience demand levels that compare to the stress tests that admins put on new equipment. If typical load levels were that high, it’s doubtful that virtualization would have caught on so quickly. We use virtualization for so many reasons nowadays, we forget that “cost savings through better utilization of under-loaded server equipment” was one of the primary drivers of early virtualization adoption.

BIOS Settings for Hyper-V Performance

Don’t neglect your BIOS! It contains some of the most important settings for Hyper-V.

  • C States. Disable C States! Few things impact Hyper-V performance quite as strongly as C States! Names and locations will vary, so look in areas related to Processor/CPU, Performance, and Power Management. If you can’t find anything that specifically says C States, then look for settings that disable/minimize power management. C1E is usually the worst offender for Live Migration problems, although other modes can cause issues.
  • Virtualization support: A number of features have popped up through the years, but most BIOS manufacturers have since consolidated them all into a global “Virtualization Support” switch, or something similar. I don’t believe that current versions of Hyper-V will even run if these settings aren’t enabled. Here are some individual component names, for those special BIOSs that break them out:
    • Virtual Machine Extensions (VMX)
    • AMD-V — AMD CPUs/mainboards. Be aware that Hyper-V can’t (yet?) run nested virtual machines on AMD chips
    • VT-x, or sometimes just VT — Intel CPUs/mainboards. Required for nested virtualization with Hyper-V in Windows 10/Server 2016
  • Data Execution Prevention: DEP means less for performance and more for security. It’s also a requirement. But, we’re talking about your BIOS settings and you’re in your BIOS, so we’ll talk about it. Just make sure that it’s on. If you don’t see it under the DEP name, look for:
    • No Execute (NX) — AMD CPUs/mainboards
    • Execute Disable (XD) — Intel CPUs/mainboards
  • Second Level Address Translation: I list this feature primarily for the sake of completeness. It’s been many years since any system was built new without SLAT support. If you have one, following every point in this post to the letter still won’t make that system fast. Starting with Windows 8 and Server 2016, you cannot use Hyper-V without SLAT support. Names that you will see SLAT under:
    • Nested Page Tables (NPT)/Rapid Virtualization Indexing (RVI) — AMD CPUs/mainboards
    • Extended Page Tables (EPT) — Intel CPUs/mainboards
  • Disable power management. This goes hand-in-hand with C States. Just turn off power management altogether. Get your energy savings via consolidation. You can also buy lower wattage systems.
  • Use Hyperthreading. I’ve seen a tiny handful of claims that Hyperthreading causes problems on Hyper-V. I’ve heard more convincing stories about space aliens. I’ve personally seen the same number of space aliens as I’ve seen Hyperthreading problems with Hyper-V (that would be zero). If you’ve legitimately encountered a problem that was fixed by disabling Hyperthreading AND you can prove that it wasn’t a bad CPU, that’s great! Please let me know. But remember, you’re still in a minority of a minority of a minority. The rest of us will run Hyperthreading. If you want to minimize your exposure to Spectre and similar cache side-channel attacks, then upgrade to at least 2016 and employ the core scheduler.
  • Disable SCSI BIOSs. Unless you plan to boot your host from a SAN, kill the BIOSs on your SCSI adapters. A SCSI card’s BIOS doesn’t do anything good or bad for a running Hyper-V host, but it slows down physical boot times.
  • Disable BIOS-set VLAN IDs on physical NICs. Some network adapters support VLAN tagging through boot-up interfaces. If you then bind a Hyper-V virtual switch to one of those adapters, you could encounter all sorts of network nastiness.

Storage Settings for Hyper-V Performance

I wish the IT world would learn to accept that rotating hard disks do not move data very quickly. If you just can’t cope with that, buy a gigantic lot of them and make big RAID 10 arrays. Or, you could get a stack of SSDs. Don’t get six or so spinning disks and get sad that they “only” move data at a few hundred megabytes per second. That’s how the tech works.

Performance tips for storage:

  • Learn to live with the fact that storage is slow.
  • Remember that speed tests do not reflect real-world load and that a simple file copy tests little more than file copy performance.
  • Learn to live with Hyper-V’s I/O scheduler. If you want a computer system to have 100% access to storage bandwidth, start by checking your assumptions. Just because a single file copy doesn’t go as fast as you think it should, does not mean that the system won’t perform its production role adequately. If you’re certain that a system must have total and complete storage speed, then do not virtualize it. A VM cannot achieve that level of speed without stealing I/O from other guests.
  • Enable read caches
  • Carefully consider the potential risks of write caching. If acceptable, enable write caches. If your internal disks, DAS, SAN, or NAS has a battery backup system that can guarantee clean cache flushes on a power outage, write caching is generally safe. Internal batteries that report their status and/or automatically disable caching are best. UPS-backed systems are sometimes OK, but they are not foolproof.
  • Prefer few arrays with many disks over many arrays with few disks.
  • Unless you’re going to store VMs on a remote system, do not create an array just for Hyper-V. By that, I mean that if you’ve got six internal bays, do not create a RAID-1 for Hyper-V and a RAID-x for the virtual machines. That’s a Microsoft SQL Server 2000 design. This is 2019 and you’re building a Hyper-V server. Use all the bays in one big array.
  • Do not architect your storage to make the hypervisor/management operating system go fast. I can’t believe how many times I read on forums that Hyper-V needs lots of disk speed. After boot-up, it needs almost nothing. The hypervisor remains resident in memory. Unless you’re doing something questionable in the management OS, it won’t even page to disk very often. Architect storage speed in favor of your virtual machines.
  • Set your fiber channel SANs to use very tight WWN masks. Live Migration requires a handoff from one system to another, and the looser the mask, the longer that takes. With 2016 the guests shouldn’t crash, but the hand-off might be noticeable.
  • Keep iSCSI/SMB networks clear of other traffic. I see a lot of recommendations to put each and every iSCSI NIC on a system into its own VLAN and/or layer-3 network. I’m on the fence about that advice. On one hand, a network storm in one iSCSI network might justify it. However, keeping those networks quiet would go a long way on its own. For clustered systems, multi-channel SMB needs each adapter to be on a unique layer 3 network (according to the docs; from what I can tell, it works even with same-net configurations).
  • If using gigabit, try to physically separate iSCSI/SMB from your virtual switch. Meaning, don’t make that traffic endure the overhead of virtual switch processing if you can help it.
  • Round robin MPIO might not be the best, although it’s the most recommended. If you have one of the aforementioned network storms, Round Robin will negate some of the benefits of VLAN/layer 3 segregation. I like least queue depth, myself.
  • MPIO and SMB multi-channel are much faster than the best teaming.
  • If you must run MPIO or SMB traffic across a team, create multiple virtual or logical NICs. That will give the teaming implementation more opportunities to create balanced streams.
  • Use jumbo frames for iSCSI/SMB connections if everything supports it (host adapters, switches, and back-end storage). You’ll improve the header-to-payload bit ratio by a meaningful amount (see the sketch after this list).
  • Enable RSS on SMB-carrying adapters. If you have RDMA-capable adapters, absolutely enable that.
  • Use dynamically expanding VHDX, but not dynamically expanding VHD. I still see people recommending fixed VHDX for operating system VHDXs, which is just absurd. Fixed VHDX is good for high-volume databases, but mostly because they’ll probably expand to use all the space anyway. Dynamic VHDX enjoys higher average write speeds because it completely ignores zero writes. No defined pattern has yet emerged that declares a winner on read rates, but people who say that fixed always wins are making demonstrably false assumptions.
  • Do not use pass-through disks. The performance is sometimes a little bit better, but sometimes it’s worse, and it almost always causes some other problems elsewhere. The trade-off is not worth it. Just add one spindle to your array to make up for any perceived speed deficiencies. If you insist on using pass-through for performance reasons, then I want to see the performance traces of production traffic that prove it.
  • Don’t let fragmentation keep you up at night. Fragmentation is a problem for single-spindle desktops/laptops, “admins” that never should have been promoted above first-line help desk, and salespeople selling defragmentation software. If you’re here to disagree, you better have a URL to performance traces that I can independently verify before you even bother entering a comment. I have plenty of Hyper-V systems of my own on storage ranging from 3-spindle up to >100-spindle, and the first time I even feel compelled to run a defrag (much less get anything out of it) I’ll be happy to issue a mea culpa. For those keeping track, we’re at 8 years and counting.
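
As a sketch of the jumbo frame and RSS/RDMA items above (the adapter names and storage target address are hypothetical, and the accepted *JumboPacket value varies by NIC vendor):

# Enable jumbo frames on the iSCSI-facing adapters
Set-NetAdapterAdvancedProperty -Name 'iSCSI1','iSCSI2' -RegistryKeyword '*JumboPacket' -RegistryValue 9014
# Verify end-to-end with a do-not-fragment ping to the storage target
ping 192.168.50.10 -f -l 8000
# Enable RSS, and RDMA where the hardware supports it, on the SMB-carrying adapters
Enable-NetAdapterRss -Name 'SMB1','SMB2'
Enable-NetAdapterRdma -Name 'SMB1','SMB2'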

Memory Settings for Hyper-V Performance

There isn’t much that you can do for memory. Buy what you can afford and, for the most part, don’t worry about it.

  • Buy and install your memory chips optimally. Multi-channel memory is somewhat faster than single-channel. Your hardware manufacturer will be able to help you with that.
  • Don’t over-allocate memory to guests. Just because your file server had 16GB before you virtualized it does not mean that it has any use for 16GB.
  • Use Dynamic Memory unless you have a system that expressly forbids it. It’s better to stretch your memory dollar farther than to wring your hands about whether or not Dynamic Memory is a good thing. Until directly proven otherwise for a given server, it’s a good thing.
  • Don’t worry so much about NUMA. I’ve read volumes and volumes on it. I even spent a lot of time configuring it on a high-load system. Wrote some about it. I never got any of that time back. I’ve had some interesting conversations with people that really did need to tune NUMA. They constitute… oh, I’d say about .1% of all the conversations that I’ve ever had about Hyper-V. The rest of you should leave NUMA enabled at defaults and walk away.

Network Settings for Hyper-V Performance

Networking configuration can make a real difference to Hyper-V performance.

  • Learn to live with the fact that gigabit networking is “slow” and that 10GbE networking often has barriers to reaching 10Gbps for a single test. Most networking demands don’t even bog down gigabit. It’s just not that big of a deal for most people.
  • Learn to live with the fact that a) your four-spindle disk array can’t fill up even one 10GbE pipe, much less the pair that you assigned to iSCSI, and that b) it’s not Hyper-V’s fault. I know this doesn’t apply to everyone, but wow, do I see lots of complaints about how Hyper-V can’t magically pull or push bits across a network faster than a disk subsystem can read and/or write them.
  • Disable VMQ on gigabit adapters. I think some manufacturers are finally coming around to the fact that they have a problem. Too late, though. The purpose of VMQ is to redistribute inbound network processing for individual virtual NICs away from CPU 0, core 0 to the other cores. Current-model CPUs are fast enough to handle many gigabit adapters.
  • If you use a Hyper-V virtual switch on a network team and you’ve disabled VMQ on the physical NICs, disable it on the team adapter as well. I’ve been saying that since shortly after 2012 came out and people are finally discovering that I’m right, so, yay? Anyway, do it (see the sketch after this list).
  • Don’t worry so much about vRSS. RSS is like VMQ, only for non-VM traffic. vRSS, then, is the projection of VMQ down into the virtual machine. Basically, with traditional VMQ, the VMs’ inbound traffic is separated across pNICs in the management OS, but then each guest still processes its own data on vCPU 0. vRSS splits traffic processing across vCPUs inside the guest once it gets there. The “drawback” is that distributing processing and then redistributing processing costs more processing. So, you have a nicely distributed load, but you also have more overall processing. The upshot: almost no one will care either way. Set it or don’t set it, you probably can’t detect the difference in production. If you’re new to all of this, then you’ll find an “RSS” setting on the network adapter inside the guest. If that’s on in the guest (off by default) and VMQ is on and functioning in the host, then you have vRSS. woohoo.
  • Don’t blame Hyper-V for your networking ills. I mention this in the context of performance because your time has value. I’m constantly called upon to troubleshoot Hyper-V “networking problems” because someone is sharing MACs or IPs or trying to get traffic from the dark side of the moon over a Cat-3 cable with three broken strands. Hyper-V is also frequently blamed by people that just don’t have a functional understanding of TCP/IP. More wasted time that I’ll never get back.
  • Use one virtual switch. Multiple virtual switches cause processing overhead without providing returns. This is a guideline, not a rule, but you need to be prepared to provide an unflinching, sure-footed defense for every virtual switch in a host after the first.
  • Don’t mix gigabit with 10 gigabit in a team. Teaming will not automatically select 10GbE over the gigabit. 10GbE is so much faster than gigabit that it’s best to just kill gigabit and converge on the 10GbE.
  • Ten one-gigabit cards do not equal a single 10GbE card. I’m all for only using 10GbE when you can justify it with usage statistics, but gigabit just cannot compete.
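
Here is a minimal sketch of the VMQ items above, assuming gigabit adapters named GB1 and GB2 and a team interface named vSwitchTeam (all hypothetical names):

# Disable VMQ on the gigabit physical adapters and on the team adapter bound to the virtual switch
Disable-NetAdapterVmq -Name 'GB1','GB2','vSwitchTeam'
# Confirm the result
Get-NetAdapterVmq | Select-Object -Property Name, Enabled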

Maintenance Best Practices

Don’t neglect your systems once they’re deployed!

  • Take a performance baseline when you first deploy a system and save it (see the sketch after this list).
  • Take and save another performance baseline when your system reaches a normative load level (basically, once you’ve reached its expected number of VMs).
  • Keep drivers reasonably up-to-date. Verify that settings aren’t lost after each update.
  • Monitor hardware health. The Windows Event Log often provides early warning symptoms, if you have nothing else.
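
For the baseline items above, a minimal sketch using the built-in Windows PowerShell counter cmdlets (the counter list, sample counts, and output path are placeholders; pick whatever matters for your workload, and create the output folder first):

# Collect a roughly 10-minute baseline at 5-second intervals and save it for later comparison
$counters = '\Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time',
            '\PhysicalDisk(_Total)\Avg. Disk sec/Read',
            '\PhysicalDisk(_Total)\Avg. Disk sec/Write',
            '\Memory\Available MBytes'
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 120 |
    Export-Counter -Path 'C:\Baselines\HostBaseline.blg' -FileFormat BLG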

 

Further reading

If you carry out all (or as many as possible) of the above hardware adjustments, you will witness a considerable jump in your Hyper-V performance. That I can guarantee. However, for those who don’t have the time or patience, or aren’t prepared to make the necessary investment in some cases, Altaro has developed an e-book just for you. Find out more about it here: Supercharging Hyper-V Performance for the time-strapped admin.

Note: This article was originally published in August 2017. It has been fully updated to be relevant as of October 2019.

The post 6 Hardware Tweaks that will Skyrocket your Hyper-V Performance appeared first on Altaro DOJO | Hyper-V.

]]>
https://www.altaro.com/hyper-v/hardware-tweaks-hyper-v-performance/feed/ 22
What is the Hyper-V Core Scheduler? https://www.altaro.com/hyper-v/hyper-v-core-scheduler/ https://www.altaro.com/hyper-v/hyper-v-core-scheduler/#comments Thu, 25 Jul 2019 16:51:33 +0000 https://www.altaro.com/hyper-v/?p=17790 The core scheduler provides a strong inter-virtual machine barrier against cache side-channel attacks eg Spectre variants. This guide explains how to use it

The post What is the Hyper-V Core Scheduler? appeared first on Altaro DOJO | Hyper-V.

]]>

In the past few years, sophisticated attackers have targeted vulnerabilities in CPU acceleration techniques. Cache side-channel attacks represent a significant danger. They magnify on a host running multiple virtual machines. One compromised virtual machine can potentially retrieve information held in cache for a thread owned by another virtual machine. To address such concerns, Microsoft developed its new “HyperClear” technology pack. HyperClear implements multiple mitigation strategies. Most of them work behind the scenes and require no administrative effort or education. However, HyperClear also includes the new “core scheduler”, which might need you to take action.

The Classic Scheduler

Now that Hyper-V has all-new schedulers, its original has earned the “classic” label. I wrote an article on that scheduler some time ago. The advanced schedulers do not replace the classic scheduler so much as they hone it. So, you need to understand the classic scheduler in order to understand the core scheduler. A brief recap of the earlier article:

  • You assign a specific number of virtual CPUs to a virtual machine. That sets the upper limit on how many threads the virtual machine can actively run.
  • When a virtual machine assigns a thread to a virtual CPU, Hyper-V finds the next available logical processor to operate it.

To keep it simple, imagine that Hyper-V assigns threads in round-robin fashion. Hyper-V does engage additional heuristics, such as trying to keep a thread with its owned memory in the same NUMA node. It also knows about simultaneous multi-threading (SMT) technologies, including Intel’s Hyper-Threading and AMD’s recent advances. That means that the classic scheduler will try to place threads where they can get the most processing power. Frequently, a thread shares a physical core with a completely unrelated thread — perhaps from a different virtual machine.

Risks with the Classic Scheduler

The classic scheduler poses a cross-virtual machine data security risk. It stems from the architectural nature of SMT: a single physical core can run two threads but has only one cache.

Classic Scheduler

In my research, I discovered several attacks in which one thread reads cached information belonging to the other. I did not find any examples of one thread polluting the other’s data. I also did not see anything explicitly preventing that sort of assault.

On a physically installed operating system, you can mitigate these risks with relative ease by leveraging antimalware and following standard defensive practices. Software developers can make use of fencing techniques to protect their threads’ cached data. Virtual environments make things harder because the guest operating systems and binary instructions have no influence on where the hypervisor places threads.

The Core Scheduler

The core scheduler makes one fairly simple change to close the vulnerability of the classic scheduler: it never assigns threads from more than one virtual machine to any physical core. If it can’t assign a second thread from the same VM to the second logical processor, then the scheduler leaves it empty. Even better, it allows the virtual machine to decide which threads can run together.

Hyper-V Core Scheduler

We will walk through implementing the scheduler before discussing its impact.

Implementing Hyper-V’s Core Scheduler

The core scheduler has two configuration points:

  1. Configure Hyper-V to use the core scheduler
  2. Configure virtual machines to use two threads per virtual core

Many administrators miss that second step. Without it, a VM will always use only one logical processor on its assigned cores. Each virtual machine has its own independent setting.

We will start by changing the scheduler. You can change the scheduler at a command prompt (cmd or PowerShell) or by using Windows Admin Center.

How to Use the Command Prompt to Enable and Verify the Hyper-V Core Scheduler

For Windows and Hyper-V Server 2019, you do not need to do anything at the hypervisor level. You still need to set the virtual machines. For Windows and Hyper-V Server 2016, you must manually switch the scheduler type.

You can make the change at an elevated command prompt (PowerShell prompt is fine):

bcdedit /set hypervisorschedulertype core

Note: if bcdedit does not accept the setting, ensure that you have patched the operating system.

Reboot the host to enact the change. If you want to revert to the classic scheduler, use “classic” instead of “core”. You can also select the “root” scheduler, which is intended for use with Windows 10 and will not be discussed further here.

To verify the scheduler, just run bcdedit by itself and look at the last line:

bcdedit

bcdedit will show the scheduler type by name. It will always appear, even if you disable SMT in the host’s BIOS/UEFI configuration.

How to Use Windows Admin Center to Enable the Hyper-V Core Scheduler

Alternatively, you can use Windows Admin Center to change the scheduler.

  1. Use Windows Admin Center to open the target Hyper-V host.
  2. At the lower left, click Settings. In most browsers, it will hide behind any URL tooltip you might have visible. Move your mouse to the lower left corner and it should reveal itself.
  3. Under Hyper-V Host Settings sub-menu, click General.
  4. Underneath the path options, you will see Hypervisor Scheduler Type. Choose your desired option. If you make a change, WAC will prompt you to reboot the host.

windows admin center

Note: If you do not see an option to change the scheduler, check that:

  • You have a current version of Windows Admin Center
  • The host has SMT enabled
  • The host runs at least Windows Server 2016

The scheduler type can change even if SMT is disabled on the host. However, you will need to use bcdedit to see it (see previous sub-section).

Implementing SMT on Hyper-V Virtual Machines

With the core scheduler enabled, virtual machines can no longer depend on Hyper-V to make the choice to use a core’s second logical processor. Hyper-V will expect virtual machines to decide when to use the SMT capabilities of a core. So, you must enable or disable SMT capabilities on each virtual machine just like you would for a physical host.

Because of the way this technology developed, the defaults and possible settings may seem unintuitive. New in 2019, newly-created virtual machines can automatically detect the SMT status of the host and hypervisor and use that topology. Basically, they act like a physical host that ships with Hyper-Threaded CPUs — they automatically use it. Virtual machines from previous versions need a bit more help.

Every virtual machine has a setting named “HwThreadsPerCore”. The property belongs to the Msvm_ProcessorSettingData CIM class, which connects to the virtual machine via its Msvm_Processor associated instance. You can drill down through the CIM API using the following PowerShell (don’t forget to change the virtual machine name):

Get-CimInstance -Namespace root/virtualization/v2 -ClassName Msvm_ComputerSystem -Filter 'ElementName="svdc01"' |
    Get-CimAssociatedInstance -Namespace root/virtualization/v2 -ResultClassName Msvm_Processor |
    Get-CimAssociatedInstance -Namespace root/virtualization/v2 -ResultClassName Msvm_ProcessorSettingData |
    Select-Object -Property HwThreadsPerCore

The output of the cmdlet will present one line per virtual CPU. If you’re worried that you can only access them via this verbose technique hang in there! I only wanted to show you where this information lives on the system. You have several easier ways to get to and modify the data. I want to finish the explanation first.

The HwThreadsPerCore setting can have three values:

  • 0 means inherit from the host and scheduler topology — limited applicability
  • 1 means 1 thread per core
  • 2 means 2 threads per core

The setting has no other valid values.

A setting of 0 makes everything nice and convenient, but it only works in very specific circumstances. Use the following to determine defaults and setting eligibility:

  • VM config version < 8.0
    • Setting is not present
    • Defaults to 1 if upgraded to VM version 8.x
    • Defaults to 0 if upgraded to VM version 9.0+
  • VM config version 8.x
    • Defaults to 1
    • Cannot use a 0 setting (cannot inherit)
    • Retains its setting if upgraded to VM version 9.0+
  • VM config version 9.x
    • Defaults to 0

I will go over the implications after we talk about checking and changing the setting.

You can see a VM’s configuration version in Hyper-V Manager and PowerShell’s Get-VM:

Hyper-V Manager

The version does affect virtual machine mobility. I will come back to that topic toward the end of the article.

How to Determine a Virtual Machine’s Threads Per Core Count

Fortunately, the built-in Hyper-V PowerShell module provides direct access to the value via the *-VMProcessor cmdlet family. As a bonus, it simplifies the input and output to a single value. Instead of the above, you can simply enter:

Get-VMProcessor -VMName svdc01 | Select-Object -Property HwThreadCountPerCore

If you want to see the value for all VMs:

Get-VMProcessor -VMName * | Select-Object -Property VMName, HwThreadCountPerCore

You can leverage positional parameters and aliases to simplify these for on-the-fly queries:

Get-VMProcessor * | select VMName, HwThreadCountPerCore

You can also see the setting in recent version of Hyper-V Manager (Windows Server 2019 and current versions of Windows 10). Look on the NUMA sub-tab of the Processor tab. Find the Hardware threads per core setting:

settings

In Windows Admin Center, access a virtual machine’s Processor tab in its settings. Look for Enable Simultaneous Multithreading (SMT).

processors

If the setting does not appear, then the host does not have SMT enabled.

How to Set a Virtual Machine’s Threads Per Core Count

You can easily change a virtual machine’s hardware thread count. For either the GUI or the PowerShell commands, remember that the virtual machine must be off and you must use one of the following values:

  • 0 = inherit, and only works on 2019+ and current versions of Windows 10 and Windows Server SAC
  • 1 = one thread per hardware core
  • 2 = two threads per hardware core
  • All values above 2 are invalid

To change the setting in the GUI or Windows Admin Center, access the relevant tab as shown in the previous section’s screenshots and modify the setting there. Remember that Windows Admin Center will hide the setting if the host does not have SMT enabled. Windows Admin Center does not allow you to specify a numerical value. If unchecked, it will use a value of 1. If checked, it will use a value of 2 for version 8.x VMs and 0 for version 9.x VMs.

To change the setting in PowerShell:

Set-VMProcessor -VMName svdc01 -HwThreadCountPerCore 2

To change the setting for all VMs in PowerShell:

Set-VMProcessor -VMName * -HwThreadCountPerCore 2

Note on the cmdlet’s behavior: If the target virtual machine is off, the setting will work silently with any valid value. If the target machine is on and the setting would have no effect, the cmdlet behaves as though it made the change. If the target machine is on and the setting would have made a change, PowerShell will error. You can include the -PassThru parameter to receive the modified vCPU object:

Set-VMProcessor -VMName * -HwThreadCountPerCore 2 -Passthru | select VMName, HwThreadCountPerCore

Considerations for Hyper-V’s Core Scheduler

I recommend using the core scheduler in any situation that does not explicitly forbid it. I will not ask you to blindly take my advice, though. The core scheduler’s security implications matter, but you also need to think about scalability, performance, and compatibility.

Security Implications of the Core Scheduler

This one change instantly nullifies several exploits that could cross virtual machines, most notably in the Spectre category. Do not expect it to serve as a magic bullet, however. In particular, remember that an exploit running inside a virtual machine can still try to break other processes in the same virtual machine. By extension, the core scheduler cannot protect against threats running in the management operating system. It effectively guarantees that these exploits cannot cross partition boundaries.

For the highest level of virtual machine security, use the core scheduler in conjunction with other hardening techniques, particularly Shielded VMs.

Scalability Impact of the Core Scheduler

I have spoken with one person who was left with the impression that the core scheduler does not allow for oversubscription. They called into Microsoft support, and the engineer agreed with that assessment. I reviewed Microsoft’s public documentation as it was at the time, and I understand how they reached that conclusion. Rest assured that you can continue to oversubscribe CPU in Hyper-V. The core scheduler prevents threads owned by separate virtual machines from running simultaneously on the same core. When it starts a thread from a different virtual machine on a core, the scheduler performs a complete context switch.

You will have some reduced scalability due to the performance impact, however.

Performance Impact of the Core Scheduler

On paper, the core scheduler presents severe deleterious effects on performance. It reduces the number of possible run locations for any given thread. Synthetic benchmarks also show a noticeable performance reduction when compared to the classic scheduler. A few points:

  • Generic synthetic CPU benchmarks drive hosts to abnormal levels using atypical loads. In simpler terms, they do not predict real-world outcomes.
  • Physical hosts with low CPU utilization will experience no detectable performance hits.
  • Running the core scheduler on a system with SMT enabled will provide better performance than the classic scheduler on the same system with SMT disabled

Your mileage will vary. No one can accurately predict how a general-purpose system will perform after switching to the core scheduler. Even a heavily-laden processor might not lose anything. Remember that, even in the best case, an SMT-enabled core will not provide more than about a 25% improvement over the same core with SMT disabled. In practice, expect no more than a 10% boost. In the simplest terms: switching from the classic scheduler to the core scheduler might reduce how often you enjoy a 10% boost from SMT’s second logical processor. I expect few systems to lose much by switching to the core scheduler.

Some software vendors provide tools that can simulate a real-world load. Where possible, leverage those. However, unless you dedicate an entire host to guests that only operate that software, you still do not have a clear predictor.

Compatibility Concerns with the Core Scheduler

As you saw throughout the implementation section, a virtual machine’s ability to fully utilize the core scheduler depends on its configuration version. That impacts Hyper-V Replica, Live Migration, Quick Migration, virtual machine import, backup, disaster recovery, and anything else that potentially involves hosts with mismatched versions.

Microsoft drew a line with virtual machine version 5.0, which debuted with Windows Server 2012 R2 (and Windows 8.1). Any newer Hyper-V host can operate virtual machines of its version all the way down to version 5.0. On any system, run Get-VMHostSupportedVersion to see what it can handle. From a 2019 host:

So, you can freely move version 5.0 VMs between a 2012 R2 host and a 2016 host and a 2019 host. But, a VM must be at least version 8.0 to use the core scheduler at all. So, when a v5.0 VM lands on a host running the core scheduler, it cannot use SMT. I did not uncover any problems when testing an SMT-disabled guest on an SMT-enabled host or vice versa. I even set up two nodes in a cluster, one with Hyper-Threading on and the other with Hyper-Threading off, and moved SMT-enabled and SMT-disabled guests between them without trouble.
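
If you want to see which configuration versions a host supports and which versions your guests currently use, a quick sketch (the VM name is a placeholder; remember that a version upgrade is one-way and requires the VM to be off):

Get-VMHostSupportedVersion
Get-VM | Select-Object -Property Name, Version
# Upgrade a specific guest once you are sure it no longer needs to land on older hosts
Update-VMVersion -Name 'svdc01'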

The final compatibility verdict: running old virtual machine versions on core-scheduled systems means that you lose a bit of density, but they will operate.

Summary of the Core Scheduler

This is a lot of information to digest, so let’s break it down to its simplest components. The core scheduler provides a strong inter-virtual machine barrier against cache side-channel attacks, such as the Spectre variants. Its implementation requires an overall reduction in the ability to use simultaneous multi-threaded (SMT) cores. Most systems will not suffer a meaningful performance penalty. Virtual machines have their own ability to enable or disable SMT when running on a core-scheduled system. All virtual machine versions prior to 8.0 (WS2016/W10 Anniversary) will only use one logical processor per core when running on a core-scheduled host.

The post What is the Hyper-V Core Scheduler? appeared first on Altaro DOJO | Hyper-V.

]]>
https://www.altaro.com/hyper-v/hyper-v-core-scheduler/feed/ 3
What To Do When Live Migration Fails On Hosts With The Same CPU https://www.altaro.com/hyper-v/live-migration-fails-same-cpu/ https://www.altaro.com/hyper-v/live-migration-fails-same-cpu/#comments Tue, 06 Nov 2018 10:39:18 +0000 https://www.altaro.com/hyper-v/?p=16820 Attempted to Live Migrate a Hyper-V VM to a host with the same CPU, but Hyper-V states incompatibilities between the two CPUs? Here's how to fix it.

The post What To Do When Live Migration Fails On Hosts With The Same CPU appeared first on Altaro DOJO | Hyper-V.

]]>

Symptom: You attempt to Live Migrate a Hyper-V virtual machine to a host that has the same CPU as the source, but Hyper-V complains about incompatibilities between the two CPUs. Additionally, Live Migration between these two hosts likely worked in the past.

The event ID is 21502. The full text of the error message reads:

“Live migration of ‘Virtual Machine VMName‘ failed.

Virtual machine migration operation for ‘VMNAME‘ failed at migration destination ‘DESTINATION_HOST‘. (Virtual machine ID VMID)

The virtual machine ‘VMNAME‘ is using processor-specific features not supported on physical computer ‘DESTINATION_HOST‘. To allow for the migration of this virtual machine to physical computers with different processors, modify the virtual machine settings to limit the processor features used by the virtual machine. (Virtual machine ID VMID)

Live Migration of 'Virtual Machine svdcadmt' failed

Why Live Migration Might Fail Across Hosts with the Same CPU

Ordinarily, this problem surfaces when hosts use CPUs that expose different feature sets — just like the error message states. You can use a tool such as CPU-Z to identify those. We have an article that talks about the effect of CPU feature differences on Hyper-V.

In this discussion, we only want to talk about cases where the CPUs have the same feature set: the CPU identifiers reveal the same family, model, stepping, and revision numbers. And yet, Hyper-V says that they need compatibility mode.

Cause 1: Spectre Mitigations

The Spectre mitigations make enough of a change to prevent Live Migrations, but that might not be obvious to anyone that doesn’t follow BIOS update notes. To see if that might be affecting you, check the BIOS update level on the hosts. You can do that quickly by asking PowerShell to check WMI with Get-WmiObject -ClassName Win32_BIOS or Get-CimInstance -ClassName Win32_BIOS, or, at its simplest, gwmi win32_bios:

A difference in BIOS versions might tell the entire story if you look at their release notes. When updates were released to address the first wave of Spectre-class CPU vulnerabilities, they included microcode that altered the way that CPUs process instructions. So, the CPU’s feature sets didn’t change per se, but its functionality did.
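
To compare BIOS levels across both nodes in a single query, a minimal sketch (the host names are placeholders, and it assumes WinRM/CIM remoting is available):

Get-CimInstance -ClassName Win32_BIOS -ComputerName 'HV-NODE1','HV-NODE2' |
    Select-Object -Property PSComputerName, Manufacturer, SMBIOSBIOSVersion, ReleaseDate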

Spectre Updates to BIOS Don’t Always Cause Live Migration Failures

You may have had a few systems that received the hardware updates that did not prevent Live Migration. There’s quite a bit going on in all of these updates that amount to a lot of moving parts:

  • These updates require a cold boot of the physical system to fully apply. Most modern systems from larger manufacturers have the ability to self-cold boot after a BIOS update, but not all do. It is possible that you have a partially-applied update waiting for a cold boot.
  • These updates require the host operating system to be fully patched. Your host(s) might be awaiting installation or post-patch reboot.
  • These updates require the guests to be cold-booted from a patched host. Some clusters have been running so well for so long that we have no idea when any given guest was last cold booted. If it wasn’t from a patched host, then it won’t have the mitigations and won’t care if it moves to an unpatched host. They’ll also happily move back to an unpatched host.
  • You may have registry settings that block the mitigations, which would have a side effect of preventing them from interfering with Live Migration (see the sketch after this list).
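
To check where a given host actually stands on the mitigations and the registry overrides mentioned above, one option is Microsoft’s SpeculationControl module from the PowerShell Gallery. A sketch, assuming installing Gallery modules is acceptable in your environment:

Install-Module -Name SpeculationControl -Scope CurrentUser
Get-SpeculationControlSettings
# Read-only look at the documented registry overrides that can block the mitigations
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management' -Name FeatureSettingsOverride, FeatureSettingsOverrideMask -ErrorAction SilentlyContinue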

I have found only one “foolproof” combination that always prevents Live Migration:

  • Source host fully patched — BIOS and Windows
  • Virtual machine operating system fully patched
  • Registry settings allow mitigation for the host, guest, and virtual machine
  • The guest was cold booted from the source host
  • Destination host is missing at least the proper BIOS update

Because Live Migration will work more often than not, it’s not easy to predict when a Live Migration will succeed across mismatched hosts.


Correcting a Live Migration Block Caused by Spectre

Your first, best choice is to bring all hosts, host operating systems, and guest operating systems up to date and ensure that no registry settings impede their application. Performance and downtime concerns are real, of course, but not as great as the threat of a compromise.  Furthermore, if you’re in the position where this article applies to you, then you already have at least one host up to date. Might as well put it to use.

You have a number of paths to solve this problem. I chose the route that would result in the least disruptions. To that end:

  • I patched all of the guests but did not allow them to reboot
  • I brought one host up to current BIOS and patch levels
  • I filled it up with all the VMs that it could take; in two node clusters, that should mean all of the guests
  • I performed a full shut down and startup of those VMs; that allowed them to apply the patch and utilize the host’s update status in one boot cycle. It also locked them to that node, so watch out for that.
  • I moved through the remaining hosts in the cluster. In larger clusters, that also meant opportunistically cold booting additional VMs

That way, each virtual machine was cold booted only one time and I did not run into any Live Migration failures. Make certain that you check your work at every point before moving on — there is much work to be done here and a missed step will likely result in additional reboots.

Note: Enabling the CPU compatibility feature will probably not help you overcome the Live Migration problem — but it might. It does not appear to affect everyone identically, likely due to fundamental differences between processor generations.

Automating the Spectre Mitigation Rollout

I opted not to script this process. The normal patching automation processes cover reboots, not cold boots, and working up a foolproof script to properly address everything that might occur did not seem worth the effort to me. These are disruptive patches anyway, so I wanted to be hands-on where possible. If patch processes like this become a regular event (and it seems that it might), I may rethink that. If I had dozens or more systems to cope with, I would have scripted it. I was lucky enough that a human-driven response worked well enough to suit. However, I did leverage bulk tools that I had available.

  • I used GPOs to change my patching behavior to prevent reboots
  • I used GPOs to selectively filter mitigation application until I was ready
  • To easily cold boot all VMs on a host, try Get-VM | Stop-VM -Passthru | Start-VM. Watch for any VMs that don’t want to stop — I deliberately chose not to force them.
  • I could have used Cluster Aware Updating to rollout my BIOS patches. I chose to manually stage the BIOS updates in this case and then allowed the planned patch reboot to handle the final application.

Overall, I did little myself other than manually time the guest power cycles.

Cause 2: Hypervisor Version Mismatch

I like to play with Windows Server Insider builds on one of my lab hosts. I keep its cluster partner at 2016 so that I can study and write articles. Somewhere along the way, I started getting the CPU feature set errors trying to move between them. Enabling the CPU compatibility feature does overcome the block in this case. Hopefully, no one is using Windows Server Insider builds in production, much less mixing them with 2016 hosts in a cluster.

It would stand to reason that this mismatch block will be corrected before 2019 goes RTM. If not, Cluster Rolling Upgrade won’t function with 2016 and 2019 hosts.

Correcting a Live Migration Block Caused by Mixed Host Versions

I hope that if you got yourself into this situation that you know how to get out of it. In my lab, I usually shut the VMs down and move them manually. They are lab systems just like the cluster, so that’s harmless. For the ones that I prefer to keep online, I have CPU compatibility mode enabled.

Do you have a Hyper-V Problem To Tackle?

These common Hyper-V troubleshooting posts have proved quite popular with you guys, but if you think there is something I’ve missed so far and should be covering let me know in the comments below and it could be the topic for my next blog post! Thanks for reading!

The post What To Do When Live Migration Fails On Hosts With The Same CPU appeared first on Altaro DOJO | Hyper-V.

]]>
https://www.altaro.com/hyper-v/live-migration-fails-same-cpu/feed/ 4
Hyper-V Quick Tip: Nested Hyper-V VM Fails to Load Network and Mouse https://www.altaro.com/hyper-v/nested-hyper-v-vm-fail-load/ https://www.altaro.com/hyper-v/nested-hyper-v-vm-fail-load/#comments Tue, 02 Oct 2018 08:52:32 +0000 https://www.altaro.com/hyper-v/?p=16727 Q: I enabled nested virtualization for a HV VM, but it lost network connectivity and I cannot use the mouse inside the console. How can I fix it?

The post Hyper-V Quick Tip: Nested Hyper-V VM Fails to Load Network and Mouse appeared first on Altaro DOJO | Hyper-V.

]]>

Q: I enabled nested virtualization for a previously functional Hyper-V virtual machine, but now it doesn’t have network connectivity and I cannot use the mouse inside the console. What happened and how can I fix it?

So, you’ve decided to kick the tires on nested virtualization, but it seems like just the act of enabling the processor feature caused the guest to break. You know that the virtual machine worked just fine prior to making that switch, and you haven’t even enabled Hyper-V in the guest yet!

virtual machine connection - mouse not captured in remote desktop session

Known symptoms:

  • Synthetic network adapters do not function; the virtual machine has no network connectivity
  • You cannot use the mouse in the Hyper-V console
  • Enhanced session services do not function
  • In Device Manager, the status for “Microsoft Hyper-V Virtual Machine Bus” has a problem. The exact code may vary. I have seen:
    • Code 10 (This device cannot start)
    • Code 39 (Windows cannot load the device driver for this hardware. The driver may be corrupted or missing)

Microsoft Hyper-v Virtual Machine Bus Properties

All of these symptoms point to a single problem: VMBus won’t start. Without VMBus, a Hyper-V virtual machine becomes a minimally-capable system. No synthetic devices, no mouse, etc.

A. Ensure that your Hyper-V host operating system upgrade level is at least as recent as your virtual machine.

I am equivocating somewhat on the usage of the term “upgrade level”. You do not necessarily need to be at the same build level. Your host just needs to be able to expose virtualization feature sets that are new enough to not cause problems for the guest’s VMBus. As much as I’d like to give you a certain way to predict that, I’m not sure that any foolproof way exists (other than for Microsoft’s teams to include something in patch notes).

Regardless of any other condition, the underlying cause for all of these symptoms is that Hyper-V in the host cannot properly communicate with the integration services/components in the guest. In the past, we’d have you check that you had installed the integration services and reinstall if necessary. Now, of course, all supported operating systems ship with at least a basic level of these components and usually have at least basic acceleration features.

However, the onset of rapid release cycles and open membership in the Insiders’ programs have changed things, sometimes dramatically. Usually, we still get enough functionality to prevent the above symptoms, even when versions don’t match well. However, enabling exposure of virtualization features makes a more fundamental change that the guest’s VMBus may not tolerate when the host is older than the guest. In my case, I was running an Insider build of Windows 10 (build 17127) on top of the normal LTSC Windows Server 2016. It worked fine when I started it on a host running a recent Insider build of Windows Server (17709).
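
A quick way to compare build levels and confirm the nested virtualization setting, as a sketch (the VM name is a placeholder):

# On the host
(Get-CimInstance -ClassName Win32_OperatingSystem).BuildNumber
Get-VMProcessor -VMName 'svdc01' | Select-Object -Property VMName, ExposeVirtualizationExtensions
# Inside the guest
(Get-CimInstance -ClassName Win32_OperatingSystem).BuildNumber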

Note: You can Live Migrate a virtual machine in this condition, but it will not automatically correct the problem. The normal techniques used to start a driver in a running Windows session did not work for me, either. You will still need to power cycle the virtual machine.

Have you experienced this problem but found a different solution? Let us know in the comments below!

More Hyper-V Quick Tips from Eric

Safely Shutdown a Guest with Unresponsive Mouse

How Many Cluster Networks Should I Use?

How to Choose a Live Migration Performance Solution

How to Enable Nested Virtualization

The post Hyper-V Quick Tip: Nested Hyper-V VM Fails to Load Network and Mouse appeared first on Altaro DOJO | Hyper-V.

]]>
https://www.altaro.com/hyper-v/nested-hyper-v-vm-fail-load/feed/ 1
Hyper-V Quick Tip: Safely Shutdown a Guest with Unresponsive Mouse https://www.altaro.com/hyper-v/shutdown-guest-no-mouse/ https://www.altaro.com/hyper-v/shutdown-guest-no-mouse/#respond Tue, 11 Sep 2018 13:30:05 +0000 https://www.altaro.com/hyper-v/?p=16778 Found yourself without mouse control over a Hyper-V guest? It's still possible to safely shutdown only using a keyboard - but not a simple as you think..

The post Hyper-V Quick Tip: Safely Shutdown a Guest with Unresponsive Mouse appeared first on Altaro DOJO | Hyper-V.

]]>

Q: A virtual machine running Windows under Hyper-V does not respond to mouse commands in VMConnect. How do I safely shut it down?

A: Use a combination of VMConnect’s key actions and native Windows key sequences to shut down.

Ordinarily, you would use one of Hyper-V’s various “Shut Down” commands to instruct a virtual machine to gracefully shut down the guest operating system. Otherwise, you can use the guest operating system’s native techniques for shutting down. In Windows guests running the full desktop experience, the mouse provides the easiest way. However, any failure of the guest operating system renders the mouse inoperable. The keyboard continues to work, of course.

Shutting Down a Windows Guest of Hyper-V Using the Keyboard

Your basic goal is to reach a place where you can issue the shutdown command.

Tip: Avoid using the mouse on the VMConnect window at all. It will bring up the prompt about the mouse each time unless you disable it. Clicking on VMConnect’s title bar will automatically set focus so that it will send most keypresses into the guest operating system. You cannot send system key combinations or anything involving the physical Windows key (otherwise, these directions would be a lot shorter).

  1. First, you need to log in. Windows 10/Windows Server 2016 and later no longer require any particular key sequence to bring up a login prompt — pressing any key while VMConnect has focus should show a login prompt. Windows 8.1/Windows Server 2012 R2 and earlier all require a CTRL+ALT+DEL sequence prior to making log in available. For those, click Action on VMConnect’s menu bar, then click Ctrl+Alt+Delete. If your VMConnect session is running locally, you can press the CTRL+ALT+END sequence on your physical keyboard instead. However, that won’t work within a remote desktop session.

    You can also press the related button on VMConnect’s button bar immediately below the text menu. It’s the button with three small boxes. In the screenshot above, look directly to the left of the highlighted text.
  2. Log in with valid credentials. Your virtual machine’s network likely does not work either, so you may need to use local credentials.
  3. Use the same sequences from step 1 to send a CTRL+ALT+DEL sequence to the guest.
  4. In the overlay that appears, use the physical down or up arrow key until Task Manager is selected, then press Enter. The screen will look different on versions prior to 10/2016 but will function the same.
  5. Task Manager should appear as the top-most window. If it does, proceed to step 6.
    If it does not, then you might be out of luck. If you can see enough of Task Manager to identify the window that obscures it, or if you’re just plain lucky, you can close the offending program. If you want, you can just proceed to step 6 and try to run these steps blindly.

    1. Press the TAB key. That will cause Task Manager to switch focus to its processes list.
    2. Press the up or down physical arrow keys to cycle through the running processes.
    3. Press Del to close a process.
  6. Press ALT+F to bring up Task Manager’s file menu. Press Enter or N for Run new task (wording is different on earlier versions of Windows).
  7. In the Create new task dialog, type shutdown /s /t 0. If your display does not distinguish, that’s a zero at the end, not the letter O. Shutting down from within the console typically does not require administrative access, but if you’d like, you can press Tab to set focus to the Create this task with administrative privileges box and then press the Spacebar to check it. Press Enter to run the command (or Tab to the OK button and press Enter).

Once you’ve reached step 7, you have other options. You can enter cmd to bring up a command prompt or powershell for a PowerShell prompt. If you want to tinker with different options for the shutdown command, you can do that as well. If you would like to get into Device Manager to see if you can sort out whatever ails the integration services, run devmgmt.msc (use the administrative privileges checkbox for best results).
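
For reference, a few common variations you could type in that same dialog (standard shutdown.exe switches):

  • shutdown /s /t 0 (shut down immediately)
  • shutdown /r /t 0 (restart instead of shutting down)
  • shutdown /s /f /t 0 (force running applications to close first)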

Be aware that this generally will not fix anything. Whatever prevented the integration services from running will likely continue. However, your guest won’t suffer any data loss. So, you could connect its VHDX file(s) to a healthy virtual machine for troubleshooting. Or, if the problem is environmental, you can safely relocate the virtual machine to another host.

More Hyper-V Quick Tips

How Many Cluster Networks Should I Use?

How to Choose a Live Migration Performance Solution

How to Enable Nested Virtualization

 

Have you run into this issue yourself? Were you able to navigate around it? What was your solution? Let us know in the comments section below!

The post Hyper-V Quick Tip: Safely Shutdown a Guest with Unresponsive Mouse appeared first on Altaro DOJO | Hyper-V.

]]>
https://www.altaro.com/hyper-v/shutdown-guest-no-mouse/feed/ 0