DR Event Recovery Procedures

Eyeglass Uncontrolled Failover Recovery Procedures





Overview

Use this procedure to recover from a real DR failure in which an Eyeglass uncontrolled failover was executed.


The two use cases covered in this document are:

  1. Complete loss of the source production cluster, after which a new replacement cluster is installed to become the primary production cluster again.

  2. Site network loss or partial data center impact that left the production cluster disconnected from the DR cluster for a long period of time, so that the production cluster still holds stale data. This cluster will be updated from DR and turned into the production cluster again.



NOTE: This is an example procedure. Many recovery variations exist, including resyncing two clusters at the same site and cluster moves. This procedure can be modified according to your specific case.


Use Case #1

Scenario

The Isilon source cluster has been lost in a disaster (e.g. destroyed) and needs to be replaced with a new cluster. Eyeglass is configured for Access Zone Failover.

The new Isilon cluster is configured with the same cluster name and IP addresses, but it will have a new cluster GUID.

Procedures

  1. When a disaster on the primary site (where the Source Cluster resides) destroys the Source Cluster, we need to perform an uncontrolled failover with Eyeglass to fail over to the Target Cluster.

  2. Execute the Eyeglass uncontrolled failover to the Target Cluster (leave “Controlled Failover” unchecked).



  1. Once the uncontrolled failover has completed, clients can access the data on the Target Cluster as normal. Verify that clients are able to access the data (R/W).


  1. On the Eyeglass:

    1. Remove the existing Source Cluster from the Eyeglass Inventory (right-click the Source Cluster in the list of Managed Devices in the Eyeglass Inventory View window, and click Delete).



  1. While the primary site is being recovered, prepare and bring up the new Isilon Source Cluster. Configure the base configuration with the same settings as the original Source Cluster.


  1. Refer to the Eyeglass Cluster Report to recreate the settings:

    1. Cluster name.

    2. Internal and External Interfaces - same IP addresses as the original settings.

    3. DNS.

    4. SmartConnect.

    5. Licenses.

    6. Create required directories for:

      1. Access zones

      2. Parent directories that will be replicated with SyncIQ policies (as defined in SyncIQ Policies)

    7. Create non-system access zones with basic configuration (e.g. just the path).

    8. Additional network subnets and pools.

    9. Configure additional network pools with the same settings as the original Source Cluster, including the access zone setting for each pool. For data access pools, create the SmartConnect zone name with the format igls-original-<source cluster SmartConnect zone name>. For example, if the original Source Cluster SmartConnect zone name is cluster20-z01.ad1.test, change the name to igls-original-cluster20-z01.ad1.test. This is needed because cluster20-z01.ad1.test is currently in use as an alias on the Target Cluster, where clients use it to access the data. For the zone that will be used for SyncIQ replication, we can keep the original Source Cluster SmartConnect zone name (e.g. cluster20-z02.ad1.test).

    10. Also re-create the zone aliases:

      1. Zones for Data Access with format  igls-<any-string>.

      2. Zone for SyncIQ Replication with format igls-ignore.

    11. Configure Authentication Providers.

    12. Create the eyeglass user with the required permissions, and also create the eyeglass admin group.

      1. Update the Isilon sudoers file with the settings required for Eyeglass.

    13. Configure other non-replicated configuration settings (e.g. enable NFSv4).
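The naming convention for the temporary data access zone names can be sketched as a small shell helper. This is only an illustration: the function names temp_zone_name and original_zone_name are hypothetical and not part of Eyeglass or OneFS, and the helper only prints names. It does not change any cluster setting.

```shell
#!/bin/sh
# Hypothetical helper (not an Eyeglass or OneFS command): derive the temporary
# data access zone name for the rebuilt Source Cluster from the original
# SmartConnect zone name.
temp_zone_name() {
  printf 'igls-original-%s\n' "$1"
}

# Reverse mapping: strip the igls-original- prefix to get the original name.
# Names without the prefix are printed unchanged.
original_zone_name() {
  case "$1" in
    igls-original-*) printf '%s\n' "${1#igls-original-}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# Example with the zone name used in this document.
temp_zone_name cluster20-z01.ad1.test
```

The zone used for SyncIQ replication (e.g. cluster20-z02.ad1.test) keeps its original name, so it would not be passed through this helper.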


  1. On the Target Cluster:

    1. Disable and delete the existing mirror policies. Example:

      1. isi sync policies disable synciq-01_mirror

      2. isi sync policies delete synciq-01_mirror

    2. Break the existing local target policies. Example:

      1. isi sync target break synciq-01

    3. Create new temporary SyncIQ policies. These temporary policies are for the transition state only and replicate data from the Target Cluster to the new Source Cluster. They are the same as the original SyncIQ policies but in the reverse direction (from the Target Cluster to the new Source Cluster). We can use a different SyncIQ policy name, e.g. temp-synciq-01. Set the schedule to manual.

    4. Run the initial SyncIQ replication to update the new Source Cluster with data from the Target Cluster.

    5. Schedule downtime (to prevent users from accessing the data on the Target Cluster).

    6. Delete the zone alias names that contain the Source Cluster name. For example, delete the zone alias cluster20-z01.ad1.test.

    7. Run the SyncIQ replication again to copy the latest changes.

    8. Verify that the SyncIQ replication has completed successfully without any errors.
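When there are many SyncIQ policies, the Target Cluster commands above can be generated per policy. The following is a minimal dry-run sketch: it only prints the commands for review and does not run them. The function names are hypothetical, and the temp-<policy> naming follows the temp-synciq-01 example above. Creating the temporary policies is not covered here, because their settings must be taken from the original policies.

```shell
#!/bin/sh
# Dry-run sketch: prints isi commands for review, it does not execute them.
# Policy names are passed as arguments (e.g. synciq-01).

# Remove each mirror policy, then break the local target association.
print_cleanup_cmds() {
  for policy in "$@"; do
    printf 'isi sync policies disable %s_mirror\n' "$policy"
    printf 'isi sync policies delete %s_mirror\n' "$policy"
    printf 'isi sync target break %s\n' "$policy"
  done
}

# Start the temporary reverse policy (assumed name: temp-<policy>).
# The same command applies to the initial run and to the final run
# during the scheduled downtime.
print_temp_sync_cmds() {
  for policy in "$@"; do
    printf 'isi sync jobs start temp-%s\n' "$policy"
  done
}

print_cleanup_cmds synciq-01
print_temp_sync_cmds synciq-01
```

Printing the commands first allows the list to be checked against the Eyeglass Cluster Report before anything is changed on the Target Cluster.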


  1. On the Eyeglass:

    1. Add the new Source Cluster to the Eyeglass Inventory (Add new managed device).

    2. Wait for the next Eyeglass discovery cycle, after which the Source Cluster is displayed in the Managed Device list and the SyncIQ Policies are displayed in the Eyeglass Job Definitions.

    3. Enable the Eyeglass configuration replication and Access Zone configuration replication jobs, then run the Eyeglass jobs to replicate those configurations from the Target Cluster to the Source Cluster.

    4. Verify that the Eyeglass configuration replication has no errors. Also verify on the new Source Cluster that these configuration settings have been replicated (e.g. the NFS export settings).

    5. Disable all of these Configuration Replication Jobs.


  1. On the Target Cluster:

    1. Disable and delete the temporary SyncIQ replication policies.


  1. On the new Source Cluster:

    1. Create SyncIQ Policies with the same settings as the original SyncIQ Policies (to replicate from the Source Cluster to the Target Cluster). Set the schedule to Manual for the initial replication.

    2. Because at this stage the Target Cluster holds mostly the same data as the new Source Cluster, we can configure the initial replication from the new Source Cluster to the Target Cluster as a differential sync to reduce network traffic during the initial baseline replication.

To enable SyncIQ differential sync mode, we can use this command:

# isi sync policies modify <policy> --target-compare-initial-sync on

    3. Verify that these SyncIQ initial replication jobs run with no errors.

    4. If Differential Sync Mode was enabled for the initial replication, disable it once the initial replication jobs have completed successfully. Command:

# isi sync policies modify <policy> --target-compare-initial-sync off

    5. Set the SyncIQ Policies sync schedule as per the original settings.
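The differential sync sequence for one policy can be sketched as follows. This is a dry-run illustration only: the function name is hypothetical, the commands are printed rather than executed, and the job is started with isi sync jobs start on the assumption that the policy schedule is Manual.

```shell
#!/bin/sh
# Dry-run sketch: prints the command sequence for one policy, in order.
print_initial_sync_cmds() {
  policy="$1"
  # Enable differential sync for the initial replication.
  printf 'isi sync policies modify %s --target-compare-initial-sync on\n' "$policy"
  # Start the initial replication job manually.
  printf 'isi sync jobs start %s\n' "$policy"
  # Run this last command only after the job above has completed successfully.
  printf 'isi sync policies modify %s --target-compare-initial-sync off\n' "$policy"
}

print_initial_sync_cmds synciq-01
```

The last command must wait for the initial job to finish, so in practice the three lines would be run one at a time with a check of the job result in between.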



  1. On the Eyeglass:

    1. From the Jobs Definitions window, enable the Configuration Replication Jobs.

    2. Run the Configuration Replication Jobs and verify that there are no errors.

    3. Enable and run the Failover Readiness Job. Verify that there are no errors.

    4. In the DR Dashboard, check that the Zone Readiness Status is OK.

    5. Test failover and failback and verify that there are no errors. We can use the Eyeglass Runbook Robot Access Zone failover and failback test for this.


  1. On the new Source Cluster:

    1. Rename the data access zone from igls-original-<original name> back to the original name. For example, from igls-original-cluster20-z01.ad1.test to cluster20-z01.ad1.test.


  1. Verify that clients are able to access the data from the new Source Cluster.

    1. Refresh the mounts on the clients so that they point back to the Source Cluster, and verify that the clients can access the data (R/W).



Use Case #2

Scenario

The Isilon source cluster was unreachable (e.g. network problem, connectivity problem) during the uncontrolled failover. In this use case, the source cluster data was not destroyed and remained intact on this cluster. After the uncontrolled failover, the target cluster (DR) received filesystem updates.


This use case covers Eyeglass uncontrolled DFS failover for SMB shares.


Procedures

Pre-Uncontrolled Failover

  1. The SyncIQ mirror policy has not been created yet, so during the failback process we need to run resync prep first.

  2. The Source Cluster is inaccessible.


Uncontrolled Failover

  1. On the Eyeglass:

    1. Perform uncontrolled DFS failover.

    2. Verify that uncontrolled DFS failover has completed successfully.

  2. From Client:

    1. Verify that clients are able to access data (R/W) on the Target Cluster:

      1. Test with dfsutil cache referral.

      2. Test with dfsutil diag viewdfspath \\dfs-namespace-path.

      3. Test writing data to the share.

  3. On the Eyeglass:

    1. Delete the Source Cluster from the Eyeglass Inventory’s Managed Device list (right-click the Source Cluster in the list of Managed Devices in the Eyeglass Inventory View window, and click Delete).


Failback Process

There are two options, which differ in how we prevent clients from accessing the outdated data on the Source Cluster:


  1. Option #1: temporarily set the DFS folder targets that refer to the Source Cluster offline.


  1. Option #2: leave the status of the DFS folder targets unchanged.



Option #1

With Option #1, we temporarily disable the DFS folder targets that refer to the Source Cluster. This prevents clients from accessing DFS shares on the Source Cluster during the period between bringing the Source Cluster back online with outdated data and deleting the SMB shares on the Source Cluster.


Procedures

  1. Before we bring the Source Cluster back online, we temporarily disable the DFS folder targets that refer to the Source Cluster.

We can use Windows PowerShell cmdlets to manage the DFS folder target status. These PowerShell commands can be run from an admin machine (Windows Server 2012 / Windows 8 or newer) with the DFS Management Tools installed.


Example:

To check the status of the target folders:

Get-DfsnFolderTarget -Path \\ad1.test\t1492\z02-smb01


Path                    TargetPath              State                   ReferralPriorityClass   ReferralPriorityRank
----                    ----------              -----                   ---------------------   --------------------
\\ad1.test\t1492\z02... \\cluster07-z02.ad1.... Online                  global-high             0
\\ad1.test\t1492\z02... \\cluster08-z02.ad1.... Online                  global-low              0


To set the status (Offline disables the folder target; Online enables it), we can use the PowerShell cmdlet Set-DfsnFolderTarget. Run this command as a user that has the required permission (e.g. delegated as the DFS management user for that namespace) from a machine with the DFS management tools installed.


Set-DfsnFolderTarget -Path \\ad1.test\t1492\z02-smb01 -TargetPath \\cluster07-z02.ad1.test\z02-smb01 -State Offline


Path                    TargetPath              State                   ReferralPriorityClass   ReferralPriorityRank
----                    ----------              -----                   ---------------------   --------------------
\\ad1.test\t1492\z02... \\cluster07-z02.ad1.... Offline                 global-high             0
\\ad1.test\t1492\z02... \\cluster08-z02.ad1.... Online                  global-low              0




To handle a large number of DFS shares, we can create a script to set the Offline status for all DFS folder targets.
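One possible approach, assuming a plain-text list of DFS folder and folder target pairs is available, is to generate the Set-DfsnFolderTarget lines from that list and then review and run them in PowerShell. The sketch below is a hypothetical shell helper (the function name and the input format are assumptions, not part of Eyeglass or the DFS tools). It only prints the commands, and it assumes that the paths contain no spaces.

```shell
#!/bin/sh
# Sketch: read "<DFS folder path> <folder target path>" pairs from stdin and
# print one Set-DfsnFolderTarget line per pair for the requested state.
# The output is meant to be reviewed and then run in PowerShell.
print_dfs_state_cmds() {
  state="$1"   # Offline or Online
  # read -r keeps the backslashes in the UNC paths intact.
  while read -r folder target; do
    # Skip blank lines in the input list.
    [ -n "$folder" ] || continue
    printf 'Set-DfsnFolderTarget -Path %s -TargetPath %s -State %s\n' \
      "$folder" "$target" "$state"
  done
}

# Example with the folder target used in this document.
print_dfs_state_cmds Offline <<'EOF'
\\ad1.test\t1492\z02-smb01 \\cluster07-z02.ad1.test\z02-smb01
EOF
```

The same list can be reused with the Online state when the folder targets are re-enabled later in the procedure.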


  1. Bring the Source Cluster back online.


  1. On the Source Cluster:

    1. Delete ALL SMB shares on the Source Cluster, because those SMB share names are currently used to point clients to the Target Cluster.

Example to delete the share name ‘z02-smb01’ from the access zone ‘zone02’:

# isi smb shares delete z02-smb01 --zone zone02


  1. Run resync prep on the Source Cluster so that it becomes the target of SyncIQ policy replication.

Example:

# isi sync recovery resync-prep synciq-01


  1. Verify that the resync-prep process has completed with no errors.


  1. On the Target Cluster:

    1. Verify that the mirror policy has been created successfully.

Example:

# isi sync policies list

Name             Path                       Action  Enabled  Target
--------------------------------------------------------------------------
synciq-01_mirror /ifs/data/zone02/z02-smb01 sync    Yes      172.16.80.151
--------------------------------------------------------------------------


  1. On the Source Cluster:

    1. Delete ALL quotas on the Source Cluster to ensure that failback and SyncIQ operations from the Secondary (Target) cluster to the Primary (Source) cluster can complete.

Example:

# isi quota quotas delete --path /ifs/data/zone02/z02-smb01 --type directory


  1. On the Target Cluster:

    1. Replicate data to the Source Cluster manually.

Example:

# isi sync jobs start synciq-01_mirror


  1. Verify that this mirror policy replication job has completed successfully.
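The isi commands from the steps above can be collected into one ordered checklist per policy. The following dry-run sketch only prints the commands, each prefixed with the cluster on which it is run. The function name and the source:/target: prefixes are hypothetical conventions for this illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints the resync command sequence in procedure order.
# Arguments: <policy> <access zone> <share> <share path>
print_resync_cmds() {
  policy="$1"; zone="$2"; share="$3"; path="$4"
  # Remove the stale share, then prepare the Source Cluster as SyncIQ target.
  printf 'source: isi smb shares delete %s --zone %s\n' "$share" "$zone"
  printf 'source: isi sync recovery resync-prep %s\n' "$policy"
  # Check that <policy>_mirror now exists on the Target Cluster.
  printf 'target: isi sync policies list\n'
  # Remove the quota on the replicated path, then replicate back.
  printf 'source: isi quota quotas delete --path %s --type directory\n' "$path"
  printf 'target: isi sync jobs start %s_mirror\n' "$policy"
}

print_resync_cmds synciq-01 zone02 z02-smb01 /ifs/data/zone02/z02-smb01
```

Each command should be run only after the previous step has been verified, so the output is a checklist rather than a script to execute in one pass.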


  1. On the Eyeglass:

    1. Add the Source Cluster back to the Eyeglass Inventory’s Managed Device List.

    2. Wait for the next Eyeglass discovery cycle, after which the Source Cluster is displayed in the Managed Device list and the SyncIQ Policies are displayed in the Eyeglass Job Definitions.

    3. Enable Eyeglass DFS mode for Mirror Policies and enable these Configuration Replication Jobs for Mirror Policies.

  1. Enable the DFS folder targets that refer to the Source Cluster. We can use the PowerShell cmdlet Set-DfsnFolderTarget with the -State Online option.

Example:


Set-DfsnFolderTarget -Path \\ad1.test\t1492\z02-smb01 -TargetPath \\cluster07-z02.ad1.test\z02-smb01 -State Online


  1. On the Target Cluster, prepare for failback (e.g. disable the SMB service to prevent clients from updating data during the failback process).

  1. On the Eyeglass - Perform Failback:

    1. Run the Failover Readiness job. Verify - OK.

    2. Verify Eyeglass DR Dashboard => DFS Readiness : No error.

    3. Perform Eyeglass DFS controlled Failover from Target Cluster to Source Cluster (Failback).

    4. Verify that this Eyeglass DFS controlled failover (failback) has completed with no error.


  1. Verify from DFS client:

    1. dfsutil cache referral.

    2. dfsutil diag viewdfspath \\dfs-namespace-path.

    3. Verify that data can be read from and written to the shares.


  1. On the Eyeglass:

    1. Enable the Configuration Replication Job for SyncIQ Policies.


Option #2

For Option #2, we do not need to change the status of the DFS folder targets. However, we need to prevent clients (especially clients trying to establish a new connection to a DFS share) from accessing the outdated data and from updating data in the shares on the Source Cluster, using one of the following methods:

  1. Schedule a maintenance window and inform clients not to access the shares during this time. In particular, clients must not establish new connections to the DFS shares, because these would be directed to the Source Cluster (during the period after the Source Cluster is back online and before we temporarily disable the SMB service on it).

  2. Temporarily disconnect from the public network the Source Cluster interfaces whose SmartConnect zone names are registered as DFS folder targets, to make them inaccessible. If the DFS folder targets use SmartConnect zone names on the Isilon management interface, we can temporarily connect that interface to a private network that DFS clients cannot reach, i.e. a port VLAN that is not connected to the external network.

  3. Temporarily change the Source Cluster SmartConnect Service IP address in the DNS server to an unused IP address.


Procedures

  1. Before bringing the Source Cluster back online, we need to ensure that clients will not access the shares on the primary cluster until it is ready to be accessed (refer to methods 1, 2 and 3 above).


  1. Bring the Source Cluster back online.


  1. Temporarily disable the SMB service on the Source Cluster.

  1. Once the SMB service has been disabled, we can end the maintenance window (method 1), reconnect the Source Cluster interfaces used for the DFS folder targets to the public network (method 2), or change the DNS entry for the Source Cluster SmartConnect Service IP address back (method 3).


  1. On the Source Cluster:

    1. Delete ALL SMB shares on the Source Cluster, because those SMB share names are currently used to point clients to the Target Cluster.

Example to delete the share name ‘z02-smb01’ from access zone ‘zone02’:

# isi smb shares delete z02-smb01 --zone zone02


  1. After the SMB shares have been deleted, we can re-enable the SMB service on the Source Cluster.
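The order of the SMB steps on the Source Cluster can be sketched as a dry-run helper that prints the commands for one access zone. The function name is hypothetical. The isi services smb disable and isi services smb enable commands are an assumption about how the SMB service is toggled, since this procedure does not name the command, so verify them against the OneFS version in use.

```shell
#!/bin/sh
# Dry-run sketch: prints the SMB-related commands in procedure order.
# Arguments: <access zone> <share> [<share> ...]
print_share_cleanup_cmds() {
  zone="$1"; shift
  # Assumed command for disabling the SMB service (verify for your OneFS version).
  printf 'isi services smb disable\n'
  # Delete every stale share in the zone while SMB is disabled.
  for share in "$@"; do
    printf 'isi smb shares delete %s --zone %s\n' "$share" "$zone"
  done
  # Re-enable SMB only after all shares have been deleted.
  printf 'isi services smb enable\n'
}

print_share_cleanup_cmds zone02 z02-smb01
```

Keeping the enable command last reflects the requirement that no outdated share is reachable when the SMB service comes back.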


  1. Run resync prep on the Source Cluster so that it becomes the target of SyncIQ policy replication.

Example:

# isi sync recovery resync-prep synciq-01


  1. Verify that the resync-prep process has completed with no errors.

  1. On the Target Cluster:

    1. Verify that the mirror policy has been created successfully.

Example:

# isi sync policies list

Name             Path                       Action  Enabled  Target
--------------------------------------------------------------------------
synciq-01_mirror /ifs/data/zone02/z02-smb01 sync    Yes      172.16.80.151
--------------------------------------------------------------------------


  1. On the Source Cluster:

    1. Delete ALL quotas on the Source Cluster to ensure that failback and SyncIQ operations from the Secondary (Target) cluster to the Primary (Source) cluster can complete.

Example:

# isi quota quotas delete --path /ifs/data/zone02/z02-smb01 --type directory


  1. On the Target Cluster:

    1. Replicate data to the Source Cluster manually.

Example:

# isi sync jobs start synciq-01_mirror


  1. Verify that this mirror policy replication job has completed successfully.


  1. On the Eyeglass:

    1. Add the Source Cluster back to the Eyeglass Inventory’s Managed Device List.

    2. Wait for the next Eyeglass discovery cycle, after which the Source Cluster is displayed in the Managed Device list and the SyncIQ Policies are displayed in the Eyeglass Job Definitions.

    3. Enable Eyeglass DFS mode for Mirror Policies and enable these Configuration Replication Jobs for Mirror Policies.


  1. On the Target Cluster, prepare for failback (e.g. disable the SMB service to prevent clients from updating data during the failback process).


  1. On the Eyeglass - Perform Failback:

    1. Run the Failover Readiness job. Verify - OK.

    2. Verify Eyeglass DR Dashboard => DFS Readiness : No error.

    3. Perform Eyeglass DFS controlled Failover from Target Cluster to Source Cluster (Failback).

    4. Verify that this Eyeglass DFS controlled failover (failback) has completed with no error.


  1. Verify from DFS client:

    1. dfsutil cache referral.

    2. dfsutil diag viewdfspath \\dfs-namespace-path.

    3. Verify that data can be read from and written to the shares.


  1. On the Eyeglass:

    1. Enable the Configuration Replication Job for SyncIQ Policies.