Best Practices for Backing up Exchange Server

Email may be the world’s single most mission-critical application. Nearly every company uses it and depends on it heavily for day-to-day operations. Given the business-critical nature of email, it can cause various problems if a user accidentally deletes an email message containing important company information.

Thankfully, unless the user performs a hard delete, the message is not actually gone. Instead, it is placed into the user’s Deleted Items folder, and the user can get it back at any time within the folder’s retention period.

But what if a user DOES permanently delete an important message, or doesn’t realize that a message was deleted until after it has already been purged from the Deleted Items folder? That’s where backups come in. In this article, we’ll dive into best practices and considerations for backing up an on-premises Exchange server.

NOTE: This article does NOT cover data protection strategies for Exchange Online hosted in Microsoft 365 (O365/M365); Altaro covers protection options for that use case separately.

Is a Backup Really Necessary for Exchange Server?

The subject of whether or not Exchange Server backups are really required has been debated for years. Even as far back as 2012, IT pros were discussing a backup abandonment philosophy that was sometimes referred to as zero backup. The basic philosophy behind this concept was that if an organization has enough redundancy in place, there should be no need for traditional backups.

Even today, Microsoft seems to be embracing this philosophy for Exchange Server. According to Microsoft, the “preferred architecture for Exchange Server leverages a concept known as Exchange Native Data Protection. Exchange Native Data Protection relies on native Exchange features to protect your mailbox data, without the use of traditional backups”.

Although this approach can theoretically ensure that Exchange mailboxes remain online at all times, it does little, if anything, to provide point-in-time recovery capabilities. Hence, going against the grain and actually backing up Exchange should be regarded as an important best practice. Organizations must consider what would happen if they needed to revert a mailbox database to an earlier point in time because of a malware infection, database corruption, or some other catastrophic event. The idea of forgoing traditional backups is short-sighted and needlessly puts an organization’s mailbox data at risk. As such, organizations should look for a backup solution that can protect mailbox data in a way that meets the organization’s operational needs while also addressing any compliance requirements.

The Need for Single Item Recovery with an Exchange Server

The most important capability to look for in an Exchange Server backup solution is the ability to recover individual mailboxes and individual items within mailboxes. To those who may be new to Exchange Server, this might seem like a basic feature that should exist in any Exchange Server data protection solution. However, single item recovery has always been somewhat elusive in Exchange Server environments.

Exchange Server first gained mainstream popularity back in the days of Exchange 5.x. At that time, Active Directory had not yet been invented, so Exchange had its own built-in directory database that worked in conjunction with the public and private information store databases. Because Exchange Server was still rather primitive then, there really wasn’t a good way to perform single item recovery. A few vendors offered what were then called brick-level recovery capabilities, but, subjectively speaking, those recovery techniques seemed to fail almost as often as they worked.

Back then, the only guaranteed way to recover a mailbox item was for an organization to keep two spare physical servers on hand. Both of these servers were connected to an isolated network segment. If the organization needed to recover the contents of a user’s mailbox, the administrator would restore a domain controller backup to one of the servers. An Exchange Server backup from the same period of time was then restored to the other server. This was a full restoration, so the server hardware had to be nearly identical to the organization’s production Exchange box.

Once both restorations were complete, the administrator would log into the newly restored domain controller on the isolated network. From there, they would change the password of the user whose data needed to be recovered, thereby gaining the ability to log in as that user. The administrator would then install Windows and Outlook onto a PC on the isolated network segment. Finally, the administrator could log in as the user, locate the deleted item, and export it to a PST file. This PST file would then be presented to the user who had requested the recovery so that the user could import its contents into their mailbox. Needless to say, single item recovery for Exchange Server was not a trivial matter. The recovery process was time-consuming and labor-intensive.

Exchange Server and Modern Single Item Recovery

Over time Exchange Server became more advanced, and Microsoft began giving administrators better options for recovering Exchange mailbox data. In fact, Exchange Server now includes native single item recovery capabilities that do not depend on traditional backups. Even so, these capabilities come with some significant caveats.

The biggest issue with using Exchange Server’s single item recovery capabilities is that they are disabled by default (https://docs.microsoft.com/en-us/exchange/recipients/user-mailboxes/recover-deleted-messages?view=exchserver-2019). You can only recover an item if single item recovery was enabled before the item was deleted. Therefore, another best practice would be to enable single item recovery right now if it has not already been enabled.

Single item recovery is enabled on a per-mailbox basis, and you can check to see if single item recovery has been enabled for a mailbox by using the following command:

Get-Mailbox <Name> | Format-List SingleItemRecoveryEnabled,RetainDeletedItemsFor

The command listed above will tell you whether or not single item recovery is enabled and will also give you the retention period for deleted items. This brings up an important point. By default, deleted items are only retained for 14 days (https://docs.microsoft.com/en-us/exchange/recipients/user-mailboxes/single-item-recovery?view=exchserver-2019). After that, they are permanently deleted. You can, however, set a custom retention period. If, for example, you wanted to set the retention period to 30 days while also enabling single item recovery for all mailboxes, you could do so by using this command:

Get-Mailbox -ResultSize unlimited -Filter {(RecipientTypeDetails -eq 'UserMailbox')} | Set-Mailbox -SingleItemRecoveryEnabled $true -RetainDeletedItemsFor 30

It is also worth noting that you cannot use the EAC to restore recovered items. Instead, you will have to use a pair of PowerShell commands like these:

Search-Mailbox "<User Name>" -SearchQuery "from:'<Sender’s name>' AND <keyword>" -TargetMailbox "Discovery Search Mailbox" -TargetFolder "<recovery folder name>" -LogLevel Full

Search-Mailbox "Discovery Search Mailbox" -SearchQuery "from: ‘<sender’s name>' AND <keyword>" -TargetMailbox "<recipient’s name>" -TargetFolder "Recovered Messages" -LogLevel Full -DeleteContent

Needless to say, using PowerShell to search for and recover messages can be a very tedious process.

The Exchange Server Recovery Database

Even though Microsoft tends to encourage Exchange Native Data Protection and PowerShell based recovery, Exchange Server does include a recovery database feature that allows a traditional backup application to be used.

A recovery database is a special purpose mailbox database that exists to allow administrators to extract restored mailbox data. A backup can be restored to the recovery database, and then items can be extracted on an as-needed basis.

Windows Server Backup has the ability to restore an Exchange Server database to a recovery database. The problem, however, is that Windows Server Backup only supports file-level recovery, not application-level recovery (https://docs.microsoft.com/en-us/exchange/high-availability/disaster-recovery/restore-data-using-recovery-dbs?view=exchserver-2019). This means that when the restoration process completes, the database will be in a dirty shutdown state, which makes it unusable. Because of this, an administrator would then have to use the time-consuming and sometimes problematic ESEUTIL tool to force the database into a clean state. Only then will the administrator be able to mount the database and then use the New-MailboxRestoreRequest cmdlet to restore a mailbox.
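
As a rough sketch of the workflow just described, the following Exchange Management Shell and ESEUTIL commands show one way it can look. The server name (EX01), the paths under E:\RecoveryDB, the log prefix (E00), and the mailbox names are assumptions for illustration only.

# Create a recovery database on an Exchange server (name, server, and paths are hypothetical)
New-MailboxDatabase -Recovery -Name "RDB01" -Server "EX01" -EdbFilePath "E:\RecoveryDB\RDB01.edb" -LogFolderPath "E:\RecoveryDB\Logs"

# After restoring the database files into E:\RecoveryDB, check whether the database is in a clean or dirty shutdown state
eseutil /mh "E:\RecoveryDB\RDB01.edb"

# If it reports a dirty shutdown, replay the transaction logs (soft recovery); E00 is the assumed log prefix
eseutil /R E00 /l "E:\RecoveryDB\Logs" /d "E:\RecoveryDB"

# Mount the recovery database and restore a single mailbox from it into the production mailbox
Mount-Database "RDB01"
New-MailboxRestoreRequest -SourceDatabase "RDB01" -SourceStoreMailbox "Jane Doe" -TargetMailbox "jane.doe" -TargetRootFolder "Recovered Items"

# Check restore progress
Get-MailboxRestoreRequest | Get-MailboxRestoreRequestStatistics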

The Bottom Line

Native Exchange Server data recovery processes leave a lot to be desired. Using Windows Server Backup to restore data to a recovery database is a tedious, time-consuming, and potentially error-prone process. Furthermore, administrators must resort to using PowerShell to perform the actual mailbox recovery, and single item recovery is not directly supported. As previously noted, Exchange Server does support single item recovery, but only if the capability is enabled, and the item that needs to be restored has not exceeded its retention period. Even then, the recovery process involves using a series of messy PowerShell commands.

When it comes to something as important as data recovery, it is better to use a third-party backup solution, such as Altaro. A third-party provider like Altaro provides a much more intuitive way of recovering Exchange Server data (including single item recovery) and does not force administrators to work through the recovery process using a series of complicated and obscure PowerShell commands. That is the last thing you want to be worried about when you’re in the process of attempting to recover important company data.

Understanding Backup and Recovery with CSV Disks for Failover

If you have deployed a Windows Server Failover Cluster in the past decade, you have probably used Cluster Shared Volumes (CSV).  CSV is a type of shared disk which allows multiple simultaneous write operations, yet they happen in a coordinated fashion to avoid disk corruption.  It was not an easy journey for CSV to be widely adopted as the recommended disk configuration for clustered virtual machines (VMs) or Scale-Out File Servers (SOFS).  The technology faced many challenges to keep up with the constantly evolving Windows Server OS, its File Server, and industry storage enhancements.

Software partners, particularly backup and antivirus providers, continually struggled to support the latest versions of CSV.  Now Cluster Shared Volumes and its partner ecosystem are thriving as millions of virtual machines worldwide use this technology.

This blog post will provide an overview of how CSV works so that you can understand how to optimize your backup and recovery process.

Virtual Machine Challenges with Traditional Cluster Disks

When Windows Server Failover Clustering was in its infancy, Hyper-V did not yet exist.  Clusters were usually small in size and hosted a few workloads.  Each workload required a dedicated disk on shared storage, which was managed by the host that ran the workload.  If a clustered application failed over to a different node, the ownership of that disk also moved, and its read and write operations were then managed by that new host.

However, this paradigm no longer worked once virtualization became mainstream as clusters could now support hundreds of VMs. This meant that admins needed to deploy hundreds of disks, causing a storage management nightmare.  Some applications and storage vendors even required a dedicated drive letter to be assigned to each disk, arbitrarily limiting the number of disks (and workloads) to 25 or fewer per cluster.

While it is possible to deploy multiple VMs and store their virtual hard disks (VHDXs) on the same cluster disk, this meant that all the VMs had to reside on the same node.  If one of the VMs had to fail over to a different node, then its disk had to be moved and remounted on the new node.  All the VMs running from that disk had to be saved and moved, causing downtime (this was before the days of live migration).  Cluster Shared Volumes (CSV) was born out of necessity to support Hyper-V. It was an exciting time for me to be on the cluster engineering team at Microsoft to help launch this revolutionary technology.

Cluster Shared Volumes (CSV) Fundamentals

Cluster Shared Volumes were designed to support the following scenarios:

  • Multiple VHDs could be stored on a single shared disk, which was used by multiple VMs.
  • The VMs could simultaneously run on any node in the cluster.
  • All Hyper-V features would be supported, such as live migration.
  • Disk traffic to VHDs could be rerouted across redundant networks for greater resiliency.
  • A single node would coordinate access to that shared disk to avoid corruption.

Even with the emergence of this new technology, there was still an important principle that remained unchanged – all traffic must be written to the disk in a coordinated fashion.  If multiple VMs write to the same part of a disk at the same time, it can cause corruption, so write access still had to be carefully managed.  The way that CSV handles this is by splitting storage traffic into two classes: direct writes from the VM to blocks on the disk and file system metadata updates.

The metadata is a type of storage traffic that changes the structure or identifiers of the blocks of data on disk, such as:

  • Starting a VM
  • Extending a file
  • Shrinking a disk
  • Renaming a file path

Any changes to the disk’s metadata must be carefully coordinated, and all applications writing to that disk need to know about this change.

When any type of metadata change request is made, the node coordinating access to that disk will:

  1. Temporarily pause all other disk traffic.
  2. Make the changes to the file system.
  3. Notify the other nodes of the changes to the file system structure.
  4. Resume the traffic.

This “coordinator node” is responsible for controlling the distributed access to a single disk from across multiple nodes.  There is one coordinator node for each CSV disk, and since the coordinators require additional CPU cycles to process and manage all of the traffic, the coordinators are usually balanced across the cluster nodes.  The coordinator is also highly-available, so it can move around the cluster to healthy nodes, just like any other clustered resource.

Data traffic, on the other hand, is simply classified as standard writes to a file or block, known as Direct I/O.  Provided that a disk does not incur any metadata updates, the location of each VM’s virtual hard disk (VHDX) on the shared disk remains static.  This means that multiple VMs can write to multiple VHDs on a single disk without the risk of corruption because they are always writing to separate parts of that same disk.  Whenever a metadata change is requested (a VHDX size increase, for example), all the VMs will:

  1. Pause their Direct I/O traffic.
  2. Wait for the changes to the file system.
  3. Synchronize their updated disk blocks for their respective VHDs.
  4. Resume their Direct I/O traffic to the new location on the disk.

Another benefit of using CSV is to increase resiliency in the cluster in the event that there is a transient failure of the storage connection between a VM and its VHD.  Previously, if a VM lost access to its disk, it would failover to another node and then try to reconnect to the storage.  With CSV, it can reroute its storage traffic through the coordinator node to its VHD, known as Redirected I/O.  Once the connection between the VM and VHD is restored, it will automatically revert back to Direct I/O.  The rerouting process is significantly faster and less disruptive than a failover.
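
If you want to see these concepts on a live cluster, the FailoverClusters PowerShell module exposes them directly. This is a minimal sketch, assuming a CSV named "Cluster Disk 1" and a node named "Node2" (both hypothetical):

# Requires the FailoverClusters PowerShell module on a cluster node
Import-Module FailoverClusters

# List each CSV, the node currently acting as its coordinator, and its state
Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State

# Show whether traffic to a CSV is flowing as Direct I/O or Redirected I/O on each node
Get-ClusterSharedVolumeState -Name "Cluster Disk 1"

# Move the coordinator role for a CSV to another node (for example, ahead of a backup window)
Move-ClusterSharedVolume -Name "Cluster Disk 1" -Node "Node2"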

The Cluster Shared Volumes feature has since expanded its support from only Hyper-V workloads to also providing distributed access to files with a Scale-Out File Server (SOFS) and certain configurations of SQL Server; however, the details of these are beyond the scope of this blog.

Backup and Recovery using Cluster Shared Volumes (CSV)

There are different ways in which you can effectively back up your VMs and their VHDs from a CSV disk.  You can use either a Microsoft backup solution such as Windows Server Backup or System Center Data Protection Manager (DPM) or a third-party solution such as Altaro VM Backup. Aside from CSV being supported only for specific workloads and requiring the NTFS or ReFS file system, few additional restrictions are placed on the backup. When a backup is initiated on a CSV disk, the backup requestor will locate the coordinator node for that particular disk and manage the backup or recovery from there.

Volume-Level Backups

Taking a copy of the entire CSV disk that contains your virtual machines is the easiest solution, yet it is an all-or-nothing operation that prevents VM-specific backups: you must back up (and recover) all the VMs on that disk at once.  You may be able to successfully back up a single VM on the CSV disk, but you will see an error in the event log because this is technically unsupported.

When the backup request is initiated, the coordinator node will temporarily suspend any metadata updates and take a copy of the shared disk; Direct I/O is still permitted. The backup is crash-consistent, so it should always be recoverable. However, since it does not verify the state of the VMs, it is not application-consistent, meaning that each VM will be restored in the exact same state it was in when the backup was taken.  If a VM had crashed or was corrupt on the disk, it will be in the same bad state upon recovery.

Application-Level Backups

A better solution is to install CSV-aware backup software (a VSS writer) on each node, which allows you to back up and recover a specific VM.  If the VSS writer is not present, the coordinator resource can be moved to a different node that has the VSS writer to initiate the backup.  During the backup, the coordinator node will suspend any metadata updates to the disk until the backup is complete.  This allows for an application-consistent and crash-consistent backup.

It is also recommended that you restore any VMs to the same node to maintain application consistency, although most vendors now support cluster-wide restoration. Most backup software providers will also allow you to back up at the file level or the block level, letting you make the tradeoff between faster block-level backups and better file-level recovery options. You should still be able to back up multiple VMs simultaneously, so application-level backups are generally recommended over volume-level backups.

In Summary

Now that you understand how CSV works, you can hopefully appreciate why it requires special consideration from your backup vendor to ensure that any backups are taken in a coordinated fashion.  Before selecting a backup provider, you should check that their solution explicitly supports Cluster Shared Volumes.  Next, make sure that you have the latest software patches for both CSV and your backup provider, and do some quick online research to see if there are any known issues.

Make sure that you test both backup and recovery thoroughly before you deploy your cluster into production with a CSV disk.  If you encounter an error, I also recommend looking it up online before spending extensive time troubleshooting it, as CSV issues are well-documented by vendors and Microsoft support.  Now you should have the knowledge to understand how CSV works with your backup provider.

Windows Server Failover Clustering (ClusDb) Backup and Recovery

Today, businesses must offer their services twenty-four hours a day to remain competitive, which means that their applications need to be highly available.  If an organization runs its workloads in a public cloud, its services usually stay online because the cloud provider manages the highly available infrastructure.  Enterprises that run their own servers need to provide high availability (HA) for their applications and virtual machines (VMs) themselves, which is usually done by clustering groups of physical servers (nodes), a practice called server clustering.  The clustering services monitor the health of each node and automatically restart or move workloads within the group of servers to ensure that they are always running.

Failover Clustering is the built-in HA solution for Windows Server and Hyper-V.  To create this distributed system, administrators need to deploy and manage networks, shared storage, servers, operating systems, virtual machines, and applications (see Altaro’s How to set up and manage a Hyper-V Failover Cluster Step by Step).

Since clusters are business-critical and fairly complex, it is important to back up not only the applications and VMs but also the configuration of the cluster so that it can be quickly redeployed in the event of a disaster. This blog post will review the best practices for a failover cluster configuration database backup and recovery, which I learned from spending four years designing clusters while on the product team at Microsoft.

If you’re simply looking for further information on doing backups, guidance on backing up data, operating systems, virtual machines, and applications is available from Altaro.

Understanding the Server Clustering Database (ClusDB)

Although a cluster is a collection of distributed services, it needs to function as a unified system.  It needs to understand the state and properties of every workload on every node, such as which host is managing a particular VM and whether that VM is online.  To accomplish this, Windows Server Failover Clustering maintains a database containing this information on every node, known as ClusDB.  This database is stored in the cluster’s registry to remove other dependencies and ensure that its operating system access is prioritized. ClusDB is continually updated whenever there is a change to any component that the cluster is managing, for example, when a new service is deployed, a property changes, or an application fails over to another node.

A key trait about this database is that it must be identical on every host so that each node has a consistent view of the state of every clustered object.  This is to ensure that there is a single owner of each workload in the event that nodes cannot communicate with each other.

This is important because if there were a clustered SQL server, and two or more hosts started simultaneously writing to a single database in an uncoordinated fashion, then it could certainly cause disk corruption.  The Cluster Database is critical to ensure that every service operates correctly, so this database must be synchronized across every cluster node.  Any time a workload is added or removed, brought online or taken offline, or if any of its dozens of properties changes, the cluster will immediately update ClusDB on every node.

How to Back Up a Server Clustering Database

It is a good best practice to regularly back up the cluster’s database, in addition to the host and application data for each clustered workload or VM.  This should be done before and after changing the cluster’s configuration, applications, or properties.

Since clusters operate differently than standalone servers, it is important to ensure that your backup provider is “cluster-aware” so that it follows the proper steps.  This includes validating that a cluster is active, healthy, and can offer a complete and current copy of the Cluster Database.  The backup provider should also identify the best node to create the backup to maintain service availability and minimize disruption on other workloads.  The built-in Windows Server Backup is cluster-aware, along with Altaro’s offerings.

Once the backup provider has determined the optimal cluster node, it will call the Volume Shadow Copy Service (VSS), which is the built-in backup framework for Windows Server. The Cluster Service VSS Writer then performs a series of tasks to ensure that the ClusDb backup is complete and consistent. Since this database is stored in the registry, additional considerations will be made by VSS while creating the backup so that this data is injected back into the registry when an image is restored.
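
As a hedged illustration of the above, the built-in Windows Server Backup command line can capture the system state, which on a cluster node includes ClusDb via the Cluster Service VSS writer. The target volume E: is an assumption.

# Confirm the cluster and its nodes are healthy before taking the backup
Get-Cluster
Get-ClusterNode | Select-Object Name, State

# Take a system state backup on a cluster node; on a clustered server this includes
# the cluster configuration database (ClusDb) via the Cluster Service VSS writer
wbadmin start systemstatebackup -backupTarget:E: -quiet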

How to Restore a Server Clustering Database

If you are just repairing a single node within a cluster, and the rest of the cluster is still operational, then you do not actually need to restore the cluster database.  Simply repair the operating system on that faulty node, restart it, and make sure that it rejoins the cluster.  This node will then synchronize with the rest of the cluster and receive a current version of the ClusDb, then add it into its own registry.

If you need to restore the entire cluster, first ensure that all the host operating systems are repaired and functioning correctly.  Next, you must stop the cluster service on every node, which means that all of your clustered workloads will incur downtime.  Using your cluster-aware backup provider, you will restore the ClusDb by forcing the cluster to use this older database version through an “authoritative restore.”  The node which was selected for the authoritative restore will receive the new ClusDb, inject it into its registry, and start the cluster service on that node.  The other nodes will then come online and synchronize their cluster databases with this restored version.  Finally, the clustered workloads will apply the states and properties defined in the ClusDb, which usually means they will come online.  After the recovery, make sure that you validate that all the nodes and services are operational and accessible to your customers.
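
The paragraph above describes the authoritative restore at a high level; here is a minimal sketch of what that flow can look like with the built-in Windows Server Backup tooling. Your backup vendor's procedure may differ, and the version identifier shown is hypothetical.

# Stop the cluster service across all nodes (this takes every clustered workload offline)
Stop-Cluster

# On the node chosen for the authoritative restore, list the available backup versions
wbadmin get versions

# Restore the system state (including ClusDb) from the chosen version identifier;
# follow the tool's prompts, and expect a restart before the node comes back
wbadmin start systemstaterecovery -version:01/01/2021-23:00

# Once the restored node is up, start the cluster; the remaining nodes synchronize
# their copies of ClusDb from the restored (authoritative) node
Start-Cluster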

To recap the two options: a single failed node needs only an operating system repair and a cluster rejoin, with no ClusDb restore, while a full-cluster recovery requires stopping the cluster service everywhere and performing an authoritative restore of ClusDb on one node, which the other nodes then synchronize from.

Backing up and restoring the cluster database is not too challenging if the backup provider is cluster-aware, even though the underlying process of server clustering has some complexities.  Keep in mind that you are only saving the state and properties of the cluster itself. The data for the operating systems, applications, and VMs must also be backed up via some other method, such as Altaro VM Backup.

Best Practices for Quicker VM Failover and Recovery after a Disaster

Downtime is inevitable for most enterprise services, whether for maintenance, patching, a malicious attack, a mistake, or a natural disaster. No matter the cause, you will need to recover your services as quickly as possible (RTO) while minimizing any data loss (RPO).  The most common reasons for extended downtime during a VM failover are unsuccessful failure detection, poor testing, and a dependence on people.

This blog post will give you the best practices for how to configure your datacenter and applications to failover and restart as quickly as possible. However, in order to follow many of these recommendations, we will assume that you are running a highly-virtualized datacenter. Our examples will use Windows Server Hyper-V; however, these best practices are also applicable to VMware environments.

Note that we’ll be covering failover options in order from quickest recovery to slowest.

Automatic Monitoring and Restarting Before VM Failover

In modern datacenters, we rely on monitoring tools to let us know when there is an issue, which reduces the burden on staff, so automated alerting is critical.  As soon as a service goes offline, your management tools should detect that it is unavailable, log the error, notify the administrator, and automatically try to restart the service or virtual machine (VM) on the same host.  Resuming a VM on the same host is fastest because resources have already been assigned to it, whereas a remote failover requires remounting a disk and allocating memory and CPU.

Local VM Failover First

Deploy a highly-available host cluster for your virtual machines so that if the application cannot start on the original host, it can still failover and restart on a different server in the same local cluster.  While this process will take longer than restarting it on the same node, it provides a great level of resiliency and reduces service downtime.

Failover clusters usually have configurable settings that can adjust how quickly a failure is detected. This setting is based on an intra-cluster health check between the nodes.  These settings include the frequency that a health check is run (SameSubnetDelay and CrossSubnetDelay) and how many missed health checks can occur before failover is triggered (SameSubnetThreshold and CrossSubnetThreshold).  When a failure is detected, the virtual machine will move to a different node and restart from the last valid configuration saved to its shared storage.
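
As a sketch of how these settings can be inspected and tuned with the FailoverClusters PowerShell module (the values shown are illustrative, not recommendations):

# View the current heartbeat (health check) settings for the cluster
Get-Cluster | Format-List *Subnet*

# Example: between sites, send a heartbeat every 2 seconds and tolerate 10 missed
# heartbeats before declaring the node down
(Get-Cluster).CrossSubnetDelay = 2000
(Get-Cluster).CrossSubnetThreshold = 10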

Replicate to a Disaster Recovery Site to Enable VM Failover

Sadly, this is the stage of DR and failover that many organizations take for granted.

If you have tried local failover, but the entire datacenter is offline, then you need to restore your services at a second site using a disaster recovery procedure.  If your organization does not have an alternative location, there are disaster recovery solutions that can use the Microsoft Azure public cloud, so many of these best practices can still be followed.

Assuming that your services need to use some type of centralized database, it is important to remember that this information needs to be available and replicated between both sites. This means that the distance between the datacenters will affect latency, copy speed, health checks, and failover time.  This distance could also influence whether you are using synchronous or asynchronous replication as synchronous solutions usually limit the maximum distance between sites.

Replication could happen between storage targets at the block level or at the file level, taking advantage of built-in features like Hyper-V Replica or third-party solutions that enable failover, such as Altaro VM Backup.  The sooner the data is available at the second site, the quicker a service can come online after a failover.  Also note that if you’re using a cross-site failover cluster, be sure to consider the frequency (*SubnetDelay) and failure tolerance (*SubnetThreshold) of your health checks, as you do not want to trigger a false failover because of bandwidth issues if the sites are too far apart and cannot communicate effectively.
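
As one hedged example of built-in replication, here is what a basic Hyper-V Replica setup can look like in PowerShell. The host and VM names, storage path, and five-minute frequency are assumptions.

# On the replica (DR) host, allow it to receive replication traffic over Kerberos/HTTP
Set-VMReplicationServer -ReplicationEnabled $true -AllowedAuthenticationType Kerberos -ReplicationAllowedFromAnyServer $true -DefaultStorageLocation "D:\ReplicaVMs"

# On the primary host, enable replication of a VM to the DR host every 5 minutes
Enable-VMReplication -VMName "APP-VM01" -ReplicaServerName "dr-host01.contoso.local" -ReplicaServerPort 80 -AuthenticationType Kerberos -ReplicationFrequencySec 300

# Send the initial copy (for large VMs this can also be scheduled for off-hours or exported to media)
Start-VMInitialReplication -VMName "APP-VM01"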

Retain the IP Address after VM Failover If Possible

Often the most difficult part of failover is getting your users connected to the workload again once it is running in the failover location. It’s best to retain the same IP address; let’s briefly discuss why.

When a VM fails over to a different host, it is recommended to keep the same IP address so that once its applications come online the end-users can immediately connect to it without reconfiguration of endpoints.  This means that the application or VM should use the same subnet across all hosts so that the same IP address can be used.  This is relatively easy when the cluster is in a single location, but this becomes complex if there are multiple sites.

The most common ways to keep a static IP address on the same subnet after a cross-site failover are to stretch a virtual LAN (VLAN) across sites or to use network virtualization to abstract the physical networks of both sites.

Optimize Client Reconnection Speed During VM Failover with Different IPs

If you have to change the IP address after the failover, there are a few ways to reduce the time it takes for the client to receive the new IP address and reconnect to the VM.  If you are using DHCP, then the application will likely get a new and random IP address. You can edit the HostRecordTTL cluster property of the workload to reduce the Time To Live (TTL) of its DNS record.  This controls how quickly the DNS record expires, which forces the client to request a new record with the new IP address so it can try to reconnect.  Most Windows Server applications have a default HostRecordTTL of 1200 seconds (20 minutes), which means that after a cross-subnet failover, it can take up to 20 minutes for the end-user to get an updated DNS record and reconnect to their application.  Most critical applications recommend TTL values of 5 minutes or shorter.

If you are using known static IP addresses at both sites, then you can configure the application or VM to try to connect with either IP address.  By setting the cluster property RegisterAllProvidersIP to TRUE, every IP address that DNS has registered for that resource will be presented to the client, which should automatically try to connect to each of them, iterating through the list.  The IP address must have been previously registered in DNS by that workload, which means that the application must be failed over and brought online on each subnet during deployment or testing.  Not all clustered workloads support the RegisterAllProvidersIP property, so verify that it can be used with your application.
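
A minimal sketch of adjusting both settings with the FailoverClusters PowerShell module, assuming a network name resource called "AppNetworkName" (hypothetical):

# Identify the network name resource for the clustered workload
Get-ClusterResource | Where-Object { $_.ResourceType -eq "Network Name" }

# Reduce the DNS record's Time To Live to 5 minutes (300 seconds)
Get-ClusterResource "AppNetworkName" | Set-ClusterParameter -Name HostRecordTTL -Value 300

# Register every provider IP so clients can try all known addresses after a cross-subnet failover
Get-ClusterResource "AppNetworkName" | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 1

# Take the resource (and its dependents) offline and back online so the new parameters take effect
Stop-ClusterResource "AppNetworkName"
Start-ClusterResource "AppNetworkName"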

Always Take Backups

Even if you have decided to replicate your data between sites, it is essential that you still take regular backups and test the recovery process.  Backup providers usually offer a broader feature set than replication and are more resilient to different types of failures.  Since backups are usually stored offline, they remain unaffected when “bad” data is replicated from the primary VM to the replica.

For example, if a virtual machine is infected with a virus or becomes corrupt, replicating it to a second site simply means that this “bad” data now exists in both locations.  In this case, a healthy backup must be used in order to properly recover. So in addition to optimizing your backup recovery process, make sure that you have accessible backups at both your primary site and your disaster recovery site.  Remember that backups will generally take longer to recover after a crash as the file needs to be located, mounted, restored, and tested before a service can come online. As a rule, backup restorations do take longer than a DR failover, so plan accordingly in this situation.

In Summary

When a disaster strikes, ensure that you are ready to bring your services online as fast as possible by optimizing your hardware and software as we’ve discussed above.

  • By automatically monitoring VM failover within a cluster, you can bring your local services online faster.
  • To prevent the loss of an entire datacenter, configure a disaster recovery site (or use the public cloud) for replication.
  • Ideally, retain the IP address after failover and help your clients reconnect quicker.
  • When all else fails, make sure you have a complete backup solution for both sites.
  • Most importantly, make sure that you regularly test your processes and train your staff in these best practices.

Protection with SQL Server Maintenance Plans

Organizations know that they need to protect their essential SQL databases.  Administrators can leverage Microsoft SQL Server’s maintenance plan feature to automatically back up their databases and automate other tasks that keep the database performing as fast as possible.  It provides significant protection with its ease of use and no requirement for additional software or licenses.

However, enterprise administrators must follow best practices to ensure that their live data remains consistent and their backups recoverable.  This article will help you understand how to put SQL Server’s native tools and third-party solutions to work for you.

SQL Backups with the SQL Server Maintenance Plan

SQL Server allows admins to schedule a series of tasks that optimize, back up, and keep databases consistent.  These tasks can operate independently or as part of a workflow.  The Maintenance Plan Wizard utility guides users through configuration.  The available tasks are:

  • Check Database Integrity – runs an internal consistency check against the data and data pages. This is an important operation and should be run regularly, but it is resource-intensive and should not run at the same time as a backup.
  • Shrink Database – reduces the size consumed by data files by removing empty database pages. While this operation reduces the size of the backup on disk, it incurs a performance hit and causes fragmentation.  Unless you work with sparse databases or have recently deleted a lot of data, you will likely not reclaim enough space to justify the operation. Do not run shrink in a backup plan workflow. Make sure to reorganize the index after shrinking.
  • Reorganize Index – sorts indexes and reduces index fragmentation to make queries operate more efficiently. Fast and can run in a daily plan.
  • Rebuild Index – drops and recreates indexes as new, completely unfragmented entities. Slower and more resource-intensive than Reorganize Index and can only run online in the Enterprise Edition. Run no more than weekly. 
  • Update Statistics – provides the latest information about how the data is distributed which will make queries faster. It should be used immediately after Reorganize Index, preferably in the same workflow. Rebuild Index updates statistics automatically, by default.
  • History Cleanup – deletes old metadata from maintenance plan tasks, such as the SQL backup history. Based on the compliance needs of your industry, you may need to retain the proof of your backup history for several years, so you may or may not wish to use this task. Does not impact backed-up SQL data.
  • Execute SQL Server Agent Job – lets you trigger any previously-created SQL Server Agent job(s) or T-SQL statement(s) on a schedule. This enables you to add custom steps during maintenance workflows.
  • Back Up Database (Full) – triggers the built-in SQL Server backup. It calls Volume Shadow Copy Service (VSS) to quiesce the data, flush any existing transactions, and take a completely consistent backup. Run full backups regularly.
  • Back Up Database (Differential) – triggers a partial backup which saves changes made since the last full backup. Differential backups run more quickly than a full backup but take longer to restore since multiple files need to merge during recovery.  Use your recovery time objectives (RTO) and recovery point objectives (RPO) to decide whether to use the differential mode.
  • Back Up Database (Transaction Log) – protects only the transaction logs of a SQL database in Full Recovery mode. These logs enable a granular rebuild of a corrupted database. The transaction log allows you to roll through transactions to troubleshoot problems or restore to a very specific point.  Transaction backups finish quickly and can provide very short RPOs, so use them frequently between full backups.
  • Maintenance Cleanup – run at the end of any maintenance to delete unneeded files.

SQL Backups with Third-Party Providers

SQL maintenance plans can serve as an effective backup strategy but present some challenges at scale. You may prefer third-party backup tools that offer a universal view of the backup schedule across all databases and the rest of the datacenter.  Maintenance plans also lack business logic, so they may run inefficient operations.  Native SQL backup also has limitations on backup storage types; for example, it doesn’t support remote tape systems.

Third-party providers usually separate the backup management server from the SQL server to reduce the impact on database performance.  Many offer advanced security features like backup encryption.  Some even offer item-level recovery. However, deep integration can tie your database backup into these third-party tools in a way that might challenge your comfort level.

As a compromise, you can combine native SQL backup with your third-party application. Use a maintenance plan to write .bak files to a location where your backup and replication programs can find them. That won’t give you some of the fancier capabilities, but you’ll never have to worry about having an unsupported configuration.
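
A minimal sketch of that compromise using the SqlServer PowerShell module's Backup-SqlDatabase cmdlet; the instance name, database name, and UNC path are assumptions:

# Requires the SqlServer PowerShell module (Install-Module SqlServer)
Import-Module SqlServer

# Full backup written to a share that the third-party backup or replication tool also protects
Backup-SqlDatabase -ServerInstance "SQL01" -Database "SalesDB" -BackupFile "\\backupsrv\sqlbackups\SalesDB_Full.bak" -BackupAction Database

# Frequent transaction log backups between full backups keep the recovery point (RPO) short
Backup-SqlDatabase -ServerInstance "SQL01" -Database "SalesDB" -BackupFile "\\backupsrv\sqlbackups\SalesDB_Log.trn" -BackupAction Log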

Scheduling SQL Backup Plans

SQL Server permits scheduling through either a maintenance plan or a third-party backup provider.  If you need more time to perform full backups or want to reduce their performance impact, you can intermix them with differential backups. At the other end, take transaction log backups frequently to achieve tighter RPOs; you can schedule them as often as every 15 minutes.  The best practices for scheduling your SQL backups include:

  • Control the order of maintenance plan workflows by placing tasks that touch the same data within the same schedule. This allows you to manage the tasks within a single interface.
  • Run tasks during off-hours, especially those which are resource-intensive like index rebuilds or checking the database integrity.
  • Stagger the schedules of workflows so that they run at different times. This balances the workload across the backup infrastructure.
  • Run a Back Up Database (Full) task before a differential or transaction log backup. These two types depend on the latest full backup.
  • Have the native tool save backups to disk, then have your regular backup tool capture the files.
  • Schedule regular deletion of on-disk backups that you no longer need to conserve disk space. Administrators commonly keep the last two full backups and the transaction logs and differentials that depend on them.

Regardless of which backup tool(s) you use, make sure that the plan accounts for the unique needs of each database.  Consider acceptable levels of data loss to determine the backup frequency (RPO).  Make sure you know the size and growth rate of database and backup files.  To plan your recovery time objective (RTO), think about your storage media. Leverage scheduling to minimize cost and maximize recovery speed. Remember the difficulty and amount of time necessary to transport data to a recovery site. Familiarize your staff with operational procedures. Most importantly, make sure that you regularly test recovery to ensure consistency of your data.

Ransomware: Best Practices for Protecting Backups

Criminals devastate organizations around the globe by locking up computers and encrypting data, then demanding thousands of dollars for the decryption keys. This type of malware, known as “ransomware”, represents one of the greatest security threats to technology infrastructure. It has caused the complete failure of some organizations. This article teaches you the best practices for protection from ransomware and allows you to educate your staff, harden your Windows ecosystem, and protect your backups from infection.

Educate Yourself for Protection from Ransomware

Ransomware usually enters an organization when unsuspecting non-technical users download compromised files. Once activated, it runs with the security permissions of the account that opened it. The malware quickly spreads through the network, planting more copies of its executable as traps for other users. Then, it encrypts every file that it can access. Unfortunately, that includes backups and system recovery files.

Ransomware authors create new variants rapidly, making them difficult for antivirus software to detect. You only have two viable choices for recovery: pay the ransom or recover from a backup. Both options carry risk; many ransomware distributors will not provide keys after payment, and you might not have any useful backups that escaped the malicious encryption.

Educate Your Staff for Protection from Ransomware

Ransomware attacks usually start the same way as other types of malware: through a user that opens an infected file. User education is your most effective tool in ransomware mitigation. A few ways to approach staff training:

  • Hold training sessions and create and share material that explains social engineering, email scams, baiting, and phishing attacks.
  • Teach users about the proliferation of infected files through download sites and e-mail.

The National Institute of Standards and Technology (United States) maintains a list of free and low-cost cybersecurity educational material. It includes a section on “Employee Awareness Training”.

Protect User Devices

Ransomware, like most other malware, enters through user devices. Focus the bulk of your technological ransomware protections there.

  • Employ policies and programs to discourage or prevent users from accessing suspicious files. Filter emails with executable attachments and block users from enabling Microsoft Office macros.
  • Enforce regular endpoint operating system and software patching.
  • Deploy and maintain antimalware programs such as antivirus and intrusion prevention.
  • Remove Adobe Flash and take steps to secure software that can run otherwise non-executable files, such as Java and web browsers.
  • Prevent applications from running from the AppData, LocalAppData, or Temp special folders.
  • Install commercial web filtering tools.
  • Block SMB shares on non-server systems.

Group Policy can help with some of these tasks. 
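
A few of these endpoint-hardening steps can also be scripted with built-in PowerShell modules. This is a hedged sketch rather than a complete policy; the share name and folder path are assumptions, and the controlled folder access setting supports the protected-folders recommendation later in this article.

# Disable the legacy SMBv1 protocol on an endpoint
Set-SmbServerConfiguration -EnableSMB1Protocol $false -Force

# List non-default shares on a non-server system and remove any that are unnecessary (share name is hypothetical)
Get-SmbShare -Special $false
Remove-SmbShare -Name "Scans" -Force

# Turn on Microsoft Defender's controlled folder access so untrusted processes cannot write to protected folders
Set-MpPreference -EnableControlledFolderAccess Enabled
Add-MpPreference -ControlledFolderAccessProtectedFolders "C:\Users\Public\Documents"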

Prevent Ransomware from Spreading

A proper defensive stance anticipates infection. Take these steps to proactively harden your datacenter to impede ransomware’s movement.

  • Restrict administrative accounts to the fewest possible individuals and require them to use standard accounts for non-administrative functions.
  • Create allow lists for known-good applications and block other executables.
  • Firewall traffic into your datacenter.
  • Shut down or block all unnecessary file shares.
  • Disable user access to the volume shadow copy service (VSS).
  • Audit and constrain users’ write permissions on file servers.
  • Require users to store important documents in protected folders.
  • Deploy intrusion detection tools.
  • Do not map network shares to drive letters.
  • Disable RDP, VNC, and other easily compromised remote access methods.
  • Train administrative staff on remote management tools, such as PowerShell Remoting, which can perform most tasks (including file copy) through a secured keyhole on port 5985 or 5986, as sketched after this list.
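
To illustrate that last point, here is a minimal PowerShell Remoting sketch; the computer name and file paths are assumptions.

# Open a remoting session to a server over WinRM (TCP 5985/5986) instead of RDP
$session = New-PSSession -ComputerName "FS01"

# Run an administrative command remotely
Invoke-Command -Session $session -ScriptBlock { Get-Service -Name "wuauserv" }

# Copy a file to the server without mapping a drive or exposing an SMB share
Copy-Item -Path "C:\Patches\update.msu" -Destination "C:\Temp\" -ToSession $session

Remove-PSSession $session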

Protection from Ransomware for Backups

Properly protected backups will save your organization if ransomware strikes. Follow these best practices to secure them:

  • Create and implement a thorough disaster recovery plan that includes regular full backups. Check out The Backup Bible for much more information on that topic.
  • Use a Managed Service Account, not a user account, to operate backup software.
  • Limit backup location ownership and write access to the backup application’s service account.
  • Create a dedicated network for backup, isolated from user networks.
  • If a vendor requires antivirus exclusions, ensure that their unprotected locations do not contain any data and remove any line of sight or access from their systems to vital network shares.
  • Encrypt backups using a key saved to a location that ransomware cannot access.
  • Capture frequent full offline backups.
  • Frequently transfer full backups to cloud storage using a method that ransomware cannot hijack (such as a two-factor-protected vault).
  • Regularly test your ability to restore from backup.

Group Policy can help with several of these points. Your backup software, such as Altaro VM Backup, can help with offline storage, offsite transmission, and encryption. Remember that some sophisticated ransomware knows how to operate backup programs, so you must always maintain offline backups.

You have multiple options for isolating networks. Most commonly, users must access server resources through firewalls and routers. Servers operate on their own network(s). You can then create a specific network just for the backup devices. Use your routers, firewalls, and access permissions to lock down ingress traffic for that network.

Protect Backups from Ransomware

No single “right way” exists for isolating a network. At the extreme, you can completely isolate your backup network, known as an “air gap”. This requires every system that participates in backup to have its own presence directly on the backup network, and the backup network itself has no gateway or other connection to the rest of the network. A proper full air gap typically requires a fully virtualized datacenter, because it lets the hypervisor hosts sit on the isolated backup network while the virtual machines they protect continue to serve users on the production network; a physical server cannot serve users and participate in the air-gapped backup network from the same operating system instance.

Complete air gaps require the hypervisor and backup systems to have no external network connectivity of any kind, which can make maintenance, patching, and offsite transmissions difficult. Every compromise that you make to accommodate these problems reduces the effectiveness of the air gap.

Surviving a Ransomware Attack

These best practices can help you to mitigate your risks and minimize the spread of an attack. 

  • Immediately disconnect affected systems from the network, including wireless and Bluetooth.
  • If the ransomware has a timer that increases the bounty price or counts down to a full lockout, rolling back the BIOS clock may delay the trigger.
  • Research the specific malware afflicting you. Many victims have shared their keys to older or well-known strains.

Protection from Ransomware and the Ongoing Journey 

Nothing can fully protect you from ransomware, and this war will never end. Remain vigilant, keep watch on CVEs, and mind your backups and maintenance cycles. The approaches presented here will improve your odds of making it safely through an assault.

Backup Strategy – Best Practices: Planning and Scheduling

The constant threats of ransomware and data breaches place even greater urgency on the vital practice of backup strategy.  Permanent data loss or exposure can destroy a business.  To properly plan your backup schedule, you need to know your loss tolerance (RPO) and the amount of downtime that you can withstand (RTO); see Altaro’s article on Defining the Recovery Time (RTO / RTA) and Recovery Point (RPO / RPA) for your Business, as well as the Backup Bible. You won’t find any “one size fits all” solution to data protection due to the uniqueness of datasets, even within the same company.  This article covers several considerations and activities to optimize backup strategy, planning, and scheduling.

Best Practices for Backup Infrastructure Design

Your backup infrastructure exists inextricably among the rest of your systems. Sometimes, that obscures its unique identity. Not understanding backup as a distinct component can lead to architectural mistakes that cost time, money, and sometimes data. Visualize your backup infrastructure as a standalone design:

Backup Server Infrastructure

Think of the systems you back up as “endpoints”, following the categorization used by your backup system. A “File Endpoint” label could refer to a standard file server, or it could mean any application server that works with VSS and requires no other special handling. A “Client Endpoint” almost falls into the “File Endpoint” category, except that you can never guarantee its availability. Some systems, such as e-mail, place more particular requirements on your backup application; set those apart. In all other cases, use the most generic category possible. Whether you create a diagram matters less than your ability to understand your infrastructure from the perspective of backup.
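To make that categorization concrete, here is a minimal Python sketch of a hypothetical endpoint inventory. The endpoint names, categories, and RPO values are illustrative assumptions rather than output from any particular backup product; the point is simply to record each system's backup category and requirements in one place, whether or not you also draw a diagram.

    # Hypothetical backup endpoint inventory - names and values are examples only.
    from dataclasses import dataclass

    @dataclass
    class Endpoint:
        name: str
        category: str      # "file", "client", or "special" (e.g., e-mail)
        rpo_hours: float   # agreed loss tolerance for this system
        notes: str = ""

    INVENTORY = [
        Endpoint("FS01", "file", rpo_hours=24),
        Endpoint("APP02", "file", rpo_hours=24, notes="VSS-aware, no special handling"),
        Endpoint("LAPTOP-SALES", "client", rpo_hours=72, notes="availability not guaranteed"),
        Endpoint("EXCH01", "special", rpo_hours=1, notes="needs application-aware backup"),
    ]

    if __name__ == "__main__":
        # Group endpoints by category so the backup design stays generic where possible.
        for category in ("file", "client", "special"):
            members = [e.name for e in INVENTORY if e.category == category]
            print(f"{category:>8}: {', '.join(members) or '(none)'}")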

Consider these best practices while designing your backup infrastructure:

  • Automate every possible point. Over time, repetitive manual tasks tend to be skipped, forgotten, or performed improperly.
  • Add high availability aspects. When possible, avoid single points of failure in your backup infrastructure. Your data becomes excessively vulnerable whenever the backup system fails, so keep its uptime as close to 100% as your situation allows.
  • Allow for growth. If you already perform trend analysis on your data and systems, you can leverage that to predict backup needs into the future. If you don’t have enough data for calculated estimates, then make your best guess. You do not want to face a long capital request process, or worse, denial, if you run out of capacity.
  • Remember the network. Most networks operate at a low level of utilization. If that applies to you, you won’t need to architect anything special to carry your backup data. Do not make assumptions; have an understanding of your total bandwidth requirements during backup windows. Consider your Internet and WAN link speeds when preparing for cloud or offsite backup.
  • Mind your RPOs and RTOs. RPOs dictate how much storage capacity your backup infrastructure requires. RTOs control its speed and resiliency needs.
  • Configure alerts. You need to know immediately if a backup process fails. Ideally, you can automate the backup system to continue trying when it encounters a problem. However, even when it successfully retries, it still must make you aware of the failure so that you can identify and correct small problems before they turn into show-stoppers. Optimally, you want a daily report; that way, if the expected message never arrives, you know that your backup system has a failure. At a minimum, configure error notifications (a minimal sketch of an independent daily-report check appears after this list). The notification configuration in Altaro VM Backup looks like this:

Altaro VM Backup

  • Employ backup replication. If one copy of your data is good, two is better. If the primary data and its replica exist in substantially distant geographic regions, they help protect against natural disasters and other major threats to physical systems.
  • Document everything. Your design will make so much sense when you build it that you’ll wonder how anyone could struggle to understand it. But six months later, some of it will puzzle even you, and even if it doesn’t fool you, it will fool someone else. Write it all down. Keep records of hardware and software. Note installation directions. Keep copies of RTO and RPO decisions.
  • Schedule refreshes. Your environment changes and equipment ages. Prepare yourself by scheduling regular reviews. Timing depends on how quickly your organization changes, but review at least annually.
  • Design and schedule tests. If you have not restored your backup data, then you do not know whether you can restore it. Even if you know the routine, that does not guarantee the validity of the data. Test restores help to uncover problems in media and misjudgments in RTOs and RPOs. Remember that a restore test goes beyond your backup application’s green checkmark: start up the restored system(s) and verify that they operate as expected. Automate these processes wherever possible (a simple restore-verification sketch also follows this list).
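As promised in the “Configure alerts” item, the following is a minimal sketch of an independent daily-report check, so that silence itself raises an alarm. The log directory, file naming, success marker, and SMTP details are all assumptions; adapt them to whatever your backup application actually writes, and schedule the script with Task Scheduler or cron.

    # Minimal daily backup-report check - paths, addresses, and marker text are assumptions.
    import datetime
    import pathlib
    import smtplib
    from email.message import EmailMessage

    LOG_DIR = pathlib.Path(r"D:\BackupLogs")   # hypothetical log location
    MARKER = "BACKUP SUCCESS"                  # hypothetical success marker text
    SMTP_HOST = "smtp.example.internal"
    ALERT_TO = "backup-admins@example.com"

    def todays_log_has_success() -> bool:
        today = datetime.date.today().isoformat()
        log = LOG_DIR / f"backup-{today}.log"  # assumes one log file per day
        return log.exists() and MARKER in log.read_text(errors="ignore")

    def send_alert(body: str) -> None:
        msg = EmailMessage()
        msg["Subject"] = "Backup report missing or failed"
        msg["From"] = "backup-monitor@example.com"
        msg["To"] = ALERT_TO
        msg.set_content(body)
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        if not todays_log_has_success():
            # No success marker today: treat silence as a failure and raise the alarm.
            send_alert("No successful backup report was found for today. Investigate immediately.")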
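And for the “Design and schedule tests” item, a restore test should prove more than a green checkmark. One application-agnostic technique is to record file hashes at backup time and compare them against the files you restore during a test. The manifest format and paths below are assumptions for illustration; this check catches media and copy problems, but it does not replace starting the restored systems and confirming that their services work.

    # Compare restored files against a SHA-256 manifest captured at backup time.
    # Manifest format (assumption): one "<sha256>  <relative path>" pair per line.
    import hashlib
    import pathlib
    import sys

    def sha256_of(path: pathlib.Path) -> str:
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_restore(manifest: pathlib.Path, restore_root: pathlib.Path) -> int:
        failures = 0
        for line in manifest.read_text().splitlines():
            expected, rel_path = line.split("  ", 1)
            restored = restore_root / rel_path
            if not restored.exists():
                print(f"MISSING : {rel_path}")
                failures += 1
            elif sha256_of(restored) != expected:
                print(f"MISMATCH: {rel_path}")
                failures += 1
        return failures

    if __name__ == "__main__":
        # Usage: python verify_restore.py manifest.txt X:\RestoreTest
        bad = verify_restore(pathlib.Path(sys.argv[1]), pathlib.Path(sys.argv[2]))
        print("Restore verification passed." if bad == 0 else f"{bad} problem(s) found.")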

Best Practices for Datacenter Design to Reduce Dependence on Backup

Remember that backup serves as your last line of defense. Backups don’t always complete, and restores don’t always work. Even when nothing goes wrong, recovery still takes time. While designing your datacenter, work to reduce the odds of ever needing to restore from backup.

  • Implement fault tolerance. “Fault tolerance” means the ability to continue operating without interruption in the event of a failure. The most common and affordable fault tolerant technologies protect power and storage systems. You can purchase or install local and network storage with multiple disks in redundant configurations. Some other systems have fault tolerant capabilities, such as network devices with redundant pathways.
  • Implement high availability. Where you find fault tolerance too difficult or expensive to implement, you may discover acceptable alternatives that achieve high availability. Clusters perform this task most commonly. A few technologies operate in an active/active configuration, which provides some measure of fault tolerance. Most clusters use an active/passive design: in the event of a failure, they transfer operations to another member of the cluster with little, and sometimes effectively no, service interruption.
  • Remember the large data chunks that will traverse your network. As mentioned in the previous section, networks tend to operate well below capacity, but backup might strain that. On a congested network, backups might not complete within an acceptable timeframe, or they might choke out other vital operations. The best way to address such problems is network capacity expansion; if you can’t do that, leverage QoS. If you employ internal firewalls, ensure that the flood of backup data does not overwhelm them, which might require more powerful hardware or specially tuned exclusions. (See the quick throughput calculation after this list.)
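To put rough numbers on the network question, the following sketch estimates the sustained throughput a backup window demands. Every figure in it is an illustrative assumption; substitute your own data sizes, window lengths, and overhead factor.

    # Back-of-the-envelope check: can the nightly backup fit in its window?
    def required_mbps(data_gb: float, window_hours: float, overhead: float = 1.2) -> float:
        """Sustained megabits per second needed, padded by an overhead factor for protocol and retries."""
        megabits = data_gb * 8 * 1000          # GB -> megabits (decimal units for simplicity)
        return megabits * overhead / (window_hours * 3600)

    if __name__ == "__main__":
        # Example assumption: 2 TB of changed data must cross the wire in a 6-hour window.
        needed = required_mbps(data_gb=2000, window_hours=6)
        print(f"Sustained throughput required: {needed:,.0f} Mbps")
        # Roughly 890 Mbps here - enough to saturate a 1 GbE link, so plan for 10 GbE, QoS, or a longer window.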

Best Practices for Backing Up Different Services

While determining your RTOs and RPOs, you will naturally learn the priority of the data and systems that support your organization. Use that knowledge to balance your backup technology and scheduling.

  • Utilize specialized technology where appropriate. Most modern backup applications, such as Altaro VM Backup, recognize Microsoft Exchange and provide highly granular backup and recovery. In the event of small problems (e.g., an accidentally deleted e-mail), you can make small, targeted recoveries instead of rebuilding the entire system. Do not overuse these tools, however. They add complexity to backup and recovery. If you have a simplistic implementation of a specialized tool that won’t benefit from advanced protections, apply a regular backup strategy. As an example, if you have a tiny SQL database that changes infrequently, don’t trouble yourself with transaction log backups or continuous data protection.
  • Space backups to fit RPOs. This best practice is partly axiomatic; a backup that doesn’t meet RPO is insufficient. However, if a system has a long RPO and you configure its backup for a short interval, it will consume more backup space than strictly necessary. If that would cause your organization a hardship in storage funding, then reconfigure to align with the RPO.
  • Pay attention to dependencies. Services often require support from other services. Your web front-ends depend on your database back-ends, for example. If the order of backup or recovery matters, time them appropriately.
  • Minimize resource contention. Multiple backup endpoints bring scheduling conflicts. If your backup system tries to do too much at once, it might fail or cause problems for other systems. Target quieter times for larger backups, and stagger disparate systems so that they do not run concurrently.
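One low-tech way to act on the staggering advice above is to compute start times rather than pick them ad hoc. The sketch below spreads hypothetical jobs evenly across a nightly window; the job names, window, and even spacing are assumptions, and a real schedule should also weigh job size and priority.

    # Spread backup jobs across a nightly window so they do not all start at once.
    from datetime import datetime, timedelta

    WINDOW_START = datetime(2021, 1, 1, 22, 0)   # 22:00 - the date is just a placeholder
    WINDOW_HOURS = 6
    JOBS = ["FS01", "APP02", "SQL01", "EXCH01"]  # hypothetical endpoints

    def staggered_schedule(jobs, window_start, window_hours):
        # Evenly space start times across the window.
        gap = timedelta(hours=window_hours) / max(len(jobs), 1)
        return {job: window_start + i * gap for i, job in enumerate(jobs)}

    if __name__ == "__main__":
        for job, start in staggered_schedule(JOBS, WINDOW_START, WINDOW_HOURS).items():
            print(f"{job:<8} starts at {start:%H:%M}")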

Best Practices for Optimizing Backup Storage

If you work within a data-focused organization, your quantity of data might grow faster than your IT budget.  Backups naturally incur cost by consuming space. Follow these best practices to balance storage utilization:

  • Constrain usage of SSD and other high-speed technology as backup storage to control costs. Consider systems that require a short RTO first. After that, look at any system that churns a high quantity of data and has a short RPO, as it might need SSD to meet the requirement. Remember that RPO alone does not justify fast storage; if a more cost-effective solution meets your needs, use it.
  • Move backups from fast to slow storage where appropriate. It makes sense to place recent backups of your short RTO/RPO systems on fast disk. It doesn’t make sense to leave them there. If you ever need to restore one of those systems from an older point, then either you have suffered a major storage failure, or you have no failure at all (e.g., someone needs to see a specific piece of data that no longer exists in live storage). In those cases, you can keep costs down by sacrificing quick recovery. Move older data to slower, less expensive storage (see the tiering sketch after this list).
  • Use archival disconnected storage. Recovering from disconnected storage implies that a catastrophe-level event has occurred. Such events happen rarely enough that you can justify media that maximizes capacity per dollar. Manufacturers often market such storage as “archival”. They expect customers to use those products to hold offline data, so they typically offer a higher data survival rate for shelved units. In comparison, some technologies, such as SSD, can lose data if left unpowered for long periods of time. Your offline data must reside on media built for offline data.
  • Carefully craft retention policies. In a few cases, regulations set your retention policies. In all other areas, think through the possibilities before deciding on a retention policy. You could keep your Active Directory for ten years, but what would you do with a ten-year-old directory? You could keep backups of your customer database for five years, but does your database already contain all data stretching back further than that? If you keep retention policies to their legal and logical minimums, you reduce your storage expense.
  • Leverage space-saving backup features. Ideally, we would make full backups of everything, since more distinct copies make our data safer. Realistically, we cannot hold or use very many full backups. Put features such as incremental and differential backups, deduplication, and compression to use. Remember that all of these except compression depend on a full backup, and spacing out full backups introduces some risk and extends recovery time.
  • Learn your backup software’s cleanup mechanism. Even in the days of all-tape backup, we had to remember to periodically clean up backup metadata. With disk-based backup, we need to reclaim space from old data. Most modern software (such as Altaro VM Backup) will automatically time cleanups to align with retention policies. However, they also should include a way for you to clear space on demand. You may use this option when retiring systems.
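As a companion to the “Move backups from fast to slow storage” item above, the sketch below relocates backup files older than a cutoff from a fast repository to an archive share. The paths, file pattern, and age threshold are assumptions, and moving files out from under a backup application can break its catalog, so prefer the product’s own tiering or archive features where they exist and treat this only as an illustration of the idea.

    # Move backup files older than AGE_DAYS from fast storage to an archive location.
    import pathlib
    import shutil
    import time

    FAST_TIER = pathlib.Path(r"E:\Backups\Fast")       # hypothetical SSD-backed repository
    ARCHIVE_TIER = pathlib.Path(r"\\archive\backups")  # hypothetical slower, cheaper share
    AGE_DAYS = 30
    PATTERN = "*.vhdx"                                 # adjust to your backup file format

    def tier_old_backups() -> None:
        cutoff = time.time() - AGE_DAYS * 86400
        ARCHIVE_TIER.mkdir(parents=True, exist_ok=True)
        for item in FAST_TIER.glob(PATTERN):
            if item.stat().st_mtime < cutoff:
                # shutil.move copies across volumes, then removes the source.
                shutil.move(str(item), str(ARCHIVE_TIER / item.name))
                print(f"Archived {item.name}")

    if __name__ == "__main__":
        tier_old_backups()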

Optimizing Backup Storage

Best Practices for Backup Security

Every business is responsible for protecting its client and employee data, and those in regulated industries may have additional compliance requirements. A full security policy exceeds the scope of a blog article, so perform due diligence by researching and implementing proper security practices. Some best practices that apply to backup:

  • Follow standard security best practices to protect data at rest and during transmission.  
  • Encrypt your backup data, either by using automatically encrypted storage or the encryption feature of your backup application (a minimal client-side encryption sketch follows this list).
  • Implement digital access control. Apply file and folder permissions so that only the account that operates your backup application can write to backup storage. Restrict reads to the backup application and authorized users. Use firewalls and system access rules to keep out the curious and malicious. For extreme security, you can isolate backup systems almost completely.
  • Implement physical access control. A security axiom: anyone who can physically access your systems or media can access their data. Encryption might slow someone down, but sufficient time and determination will break any lock. Keep unnecessary people out of your datacenter and away from your backup systems. Create a chain-of-custody procedure for your backup media. Degauss or physically destroy media that you intend to retire.
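If neither your storage nor your backup application can encrypt for you, you can encrypt archives yourself before they leave the building. The sketch below uses the third-party cryptography package (pip install cryptography) for symmetric encryption; the archive name is an example, the whole file is read into memory, and real deployments need careful key management, which this sketch deliberately leaves out.

    # Encrypt a backup archive with a symmetric key before copying it offsite.
    # Requires: pip install cryptography
    import pathlib
    from cryptography.fernet import Fernet

    def encrypt_file(source: pathlib.Path, key: bytes) -> pathlib.Path:
        # Reads the whole file into memory; fine for modest archives, not for huge ones.
        token = Fernet(key).encrypt(source.read_bytes())
        target = source.parent / (source.name + ".enc")
        target.write_bytes(token)
        return target

    if __name__ == "__main__":
        key = Fernet.generate_key()                        # store this securely, apart from the backups!
        archive = pathlib.Path("backup-2021-10-21.zip")    # example archive name
        encrypted = encrypt_file(archive, key)
        print(f"Wrote {encrypted}; keep the key safe or the data is unrecoverable.")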

Protect Your Data with Standard Best Practices

If you follow the backup strategy best practices listed above, then you will have a strong backup plan with an optimized schedule for each service and system. Continue your protection activities by following the best practices that apply to all types of systems: patch, repair, upgrade, and refresh operating systems, software, and hardware as appropriate.

Most importantly, periodically review your backup strategy. Everything changes. Don’t let even the smallest alteration leave your data unprotected!

The post Backup Strategy – Best Practices: Planning and Scheduling appeared first on Altaro DOJO | Backup & DR.

Defining Recovery Time (RTO / RTA) and Recovery Point (RPO / RPA) https://www.altaro.com/backup-dr/rto-rta-rpo-rpa/ https://www.altaro.com/backup-dr/rto-rta-rpo-rpa/#respond Wed, 21 Oct 2020 07:18:17 +0000 https://www.altaro.com/backup-dr/?p=676 The main concepts that we will discuss in this blog are the recovery point and recovery time. The Recovery Point Objective (RPO) can be used to quantify the amount of data that is acceptable to lose in a disaster, while the Recovery Point Actual (RPA) measures the real data loss when this happens, usually based on the data lost between the failure and when the last backup was made.


Most modern businesses need to operate twenty-four hours a day and every day of the week to generate revenue and keep their customers happy.  No matter how well these organizations plan to maintain continuous service availability, downtime is inevitable and can happen for many reasons, including security attacks, IT errors, internal sabotage, or even natural disasters that could destroy a power grid or shut down a datacenter. Having local backups within your site may not be sufficient in the event that you lose access to the entire datacenter, so having a secondary location for disaster recovery is always recommended.  The IT department must be proactive in developing a plan to detect and recover from a variety of unexpected events as a part of their business continuity planning, first within the local datacenter and then across sites. The main concepts that we will discuss in this blog post are the recovery point and recovery time.

The Recovery Point Objective (RPO) quantifies the amount of data that is acceptable to lose in a disaster, while the Recovery Point Actual (RPA) measures the real data loss when a disaster happens, usually based on the data written between the last backup and the failure.

The Recovery Time Objective (RTO) is a goal for how long it should take from when an outage begins until it is detected and the service is restored, and the Recovery Time Actual (RTA) is the real time that it takes to fail over the service or restore a backup and bring services online after a failure.  While the goal is to get both of these values as close to zero as possible, some data loss will usually happen, and recovery time is generally measured in minutes or hours.  Every service in every business is different, so these values will vary based on the importance of each application to that business.

The following diagram shows the recovery timeline when a failure happens.  Backups are taken at regular intervals, and the time between the last backup and the failure is the Recovery Point Actual (RPA).  The time between the failure and when the services are back online is the Recovery Time Actual (RTA).  The overall duration between the most recent recovery point and the end of the recovery time shows the overall failure impact.

 

Figure 1: Understanding the Recovery Timeline after a Failure
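As a concrete, entirely hypothetical illustration of the timeline above, the short sketch below derives the RPA and RTA from three timestamps: the last successful backup, the moment of failure, and the moment service was restored.

    # Worked example: derive RPA and RTA from timestamps on the recovery timeline.
    from datetime import datetime

    last_backup = datetime(2021, 10, 21, 2, 0)    # 02:00 - last successful backup (assumed)
    failure     = datetime(2021, 10, 21, 9, 40)   # 09:40 - outage begins
    restored    = datetime(2021, 10, 21, 11, 10)  # 11:10 - service back online

    rpa = failure - last_backup    # data written after the last backup is lost
    rta = restored - failure       # elapsed downtime, from outage to restored service

    print(f"RPA: {rpa}  (compare against the agreed RPO)")
    print(f"RTA: {rta}  (compare against the agreed RTO)")
    print(f"Total impact window: {restored - last_backup}")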

Defining the Recovery Point Objective

Since each industry, company, and application has different needs for data availability and integrity, those needs drive the RPO for each service.  For example, a stock trading company would find it unacceptable to lose any financial transaction data, so its RPO is zero, whereas its backend service that creates monthly reports may tolerate losing a few hours of data, giving that service an RPO of 4 hours.

Sometimes the RPO will vary based on the time of day, the day of the week, or even the month, such as when your services are only active during weekday trading hours or you run a seasonal business. The IT department has to collaborate with other business units to define the RPO, but remember that this is only a goal; the data actually lost in a real recovery is measured by the Recovery Point Actual (RPA).

If no data loss is acceptable, then the organization likely has to find a solution with synchronous replication between the primary data source and a mirrored instance.  This means that as soon as any data is written to the primary location, it is immediately copied to a secondary location, and an acknowledgment is returned to the primary before the data is committed, ensuring that the information is consistent at both sites.  These solutions are expensive, and the distance between sites is usually limited to a few miles to prevent a performance impact.

If some data loss is acceptable, then there are multiple solutions available, including asynchronous replication and traditional disk backups.  Asynchronous replication means that data is sent to the secondary site at regular intervals; popular solutions include Microsoft’s Hyper-V Replica and VMware’s vSphere Replication.  If you only have a single datacenter, do not worry; you can still back up to a secondary location by taking advantage of the public cloud with technologies like Microsoft’s Azure Site Recovery or Azure Backup.

If the solution involves taking backups, which is most common, then the frequency of the backup is the most critical component.  If backups are taken every hour, then up to an hour of data could be lost, depending on when the crash happens.  While you may think that you should just take backups every minute (or second) to reduce the RPO and data loss, keep in mind that there is a cost associated with taking more backups.

Each backup consumes more disk space and server resources, so your virtualization hosts may have to run fewer VMs than they could at full capacity during a backup cycle.  Backup vendors have developed technologies to address these challenges, such as deduplication to save storage space and incremental or partial backups to minimize the performance hit when backups are taken.  For example, Altaro VM Backup helps organizations optimize how their backups are taken and stored by using the very technologies mentioned above.
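To see how tightening the backup interval trades data loss against storage, the rough model below estimates repository size for different schedules. The full backup size, per-increment size, and retention period are assumptions to plug your own numbers into; in practice each increment shrinks as the interval shrinks, so treat the output as an upper bound that shows the direction of the trade-off.

    # Rough model: repository size for one full backup plus retained incrementals.
    def repository_gb(full_gb: float, increment_gb: float, interval_hours: float, retention_days: int) -> float:
        restore_points = retention_days * 24 / interval_hours
        return full_gb + restore_points * increment_gb

    if __name__ == "__main__":
        # Assumptions: 500 GB full, ~10 GB per incremental after dedup/compression, 14-day retention.
        for hours in (24, 4, 1):
            total = repository_gb(full_gb=500, increment_gb=10, interval_hours=hours, retention_days=14)
            print(f"Backups every {hours:>2}h -> worst-case loss {hours:>2}h, ~{total:,.0f} GB repository")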

Defining the Recovery Time Objective

When an organization defines its RTO, it needs to consider the entire duration it takes to discover the outage and bring services back online.  This includes the time to detect that the service is offline or that data is being lost, the time to begin the recovery, the time to test that the recovery worked and the data is consistent, the time to restart any services in the correct dependency order, and the time it takes for clients to reconnect.

It is a best practice to automate as many of these steps as possible, as any manual tasks performed by humans will slow down the recovery.  If human intervention is required, then you must also consider the time it takes to alert the staff (including overnight, on weekends, or during holidays) and for them to drive to the datacenter and execute the recovery tasks.  Also consider that datacenter access may not be possible or safe during a natural disaster like a hurricane or flood, so providing a remote access solution is a critical part of the plan.
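Because detection time counts against the RTO, it pays to automate it rather than wait for a user complaint. The sketch below is a minimal availability probe that records the moment a TCP service stops answering; the host, port, and polling interval are assumptions, and in production this job belongs in a proper monitoring system rather than a bare loop.

    # Minimal availability probe: note the timestamp when a service stops answering.
    import socket
    import time
    from datetime import datetime

    HOST, PORT = "app01.example.internal", 443   # hypothetical service endpoint
    INTERVAL_SECONDS = 30

    def is_up(host: str, port: int, timeout: float = 5.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        while True:
            if not is_up(HOST, PORT):
                # The detection timestamp feeds directly into the RTA measurement.
                print(f"{datetime.now():%Y-%m-%d %H:%M:%S} DOWN: {HOST}:{PORT} - start the recovery runbook")
            time.sleep(INTERVAL_SECONDS)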

The two most common reasons why disaster recovery plans fail are a lack of testing and a dependency on humans, so it is critical that you test your recovery so that you can calibrate the RTO against the Recovery Time Actual (RTA).  You can optimize your RTO through a variety of infrastructure improvements, such as fast recovery disks, high-bandwidth recovery networks that prioritize that traffic using quality of service (QoS), and restoring from the local site first before recovering from a remote site.  Additionally, many software solutions speed up recovery time through high availability, automatically detecting a crash and restarting the service; examples include Windows Server Failover Clustering and VMware HA.

Ongoing Recovery Management

Disaster recovery planning must be a regular task for every organization, a part of its standard operating procedure, to ensure that it meets its RPO and RTO. Each time you change or update your applications, servers, networks, or storage, the change can impact service availability and alter your RPA and RTA.

Also, make sure that the recovery process is well documented so that the knowledge (Recovery Time Objectives, Recovery Point Objectives, and so on) can be passed on to future staff. Some well-recognized standards that provide further guidance include the International Organization for Standardization’s ISO/IEC 27031, “Guidelines for information and communication technology readiness for business continuity”, and the National Institute of Standards and Technology’s Special Publication 800-34, “Contingency Planning Guide for Federal Information Systems”. Stay tuned for another Altaro blog post where we will provide best practices for scheduling backups to optimize your RPO and RTO.

The post Defining Recovery Time (RTO / RTA) and Recovery Point (RPO / RPA) appeared first on Altaro DOJO | Backup & DR.
