Eyeglass Simulated Disaster Event Test Procedure




Introduction

This document outlines the Eyeglass simulated disaster test scenarios for customers who want to perform the following during a scheduled maintenance window:

  1. Controlled failover with Production SyncIQ policies, and uncontrolled failover with a DFS mode Test SyncIQ policy, without impacting production data or exposing it to data loss or resync risks.

  2. Controlled failover of the Production access zone, and uncontrolled failover with the EyeglassRunbookRobot access zone, without impacting production data or exposing it to data loss or resync risks.

  3. Optional: this procedure can also be used to perform only the simulated failover, without failing over any production data within the same maintenance window.

  4. Open a case with Superna support to review this document prior to attempting the procedures within this document.

Support Statement on Use of this Procedure

  1. This procedure is the only supported process. Any variant of the process that uses uncontrolled failover on production data is unsupported, including disrupting Eyeglass connections or simulating failures that prevent the software from performing as expected. If a support request is raised for a test that intentionally used uncontrolled failover on production data, the customer will have to take responsibility for recovery steps using documentation, without support assistance.

  2. Before attempting this procedure, review the EULA and support agreement, located here.

  3. This procedure requires that the planning guide be followed in order to maintain support, as per the support contract for planned failovers. The planning guide will be requested for validation by Superna support when any case is opened regarding failover. http://documentation.superna.net/eyeglass-isilon-edition/plan/failover-planning-guide-and-checklist


Important Note

Note: Production SyncIQ policies should target and protect directories in access zones other than the EyeglassRunbookRobot access zone or the test DFS SyncIQ policy.

Initial Environment Setup

If you have already configured the Eyeglass RunbookRobot feature in your environment for Access Zone or DFS Continuous DR Testing, you may skip this initial environment setup section and proceed to section Verify Environment Setup.

ONLY 1 RUNBOOK ROBOT SUPPORTED PER EYEGLASS

  1. Both (DFS, Access Zone): On both Cluster1 and Cluster2, create an Access Zone with name format "EyeglassRunbookRobot-XXXX", where XXXX is a string or number of your choice.

  2. Both (DFS, Access Zone): On both Cluster1 and Cluster2, as best practice, create an IP pool dedicated for SyncIQ data replication.

    1. Note that the Replication IP pool should be in the System Access Zone.  While configuring your production or test SyncIQ policies, for the “Restrict Source Nodes” option, make sure to select the second option that says “Run the policy only on nodes in specified subnet and pool”, then select the IP pool that you have dedicated to SyncIQ data replication.

    2. For the Replication IP pool, make sure to configure a SmartConnect Zone Alias with syntax igls-ignore-xxxx, where xxxx is a string that makes the alias (as a whole string) unique across your infrastructure.

  3. Both (DFS, Access Zone): On both Cluster1 and Cluster2, create an IP pool for clients to access data in the EyeglassRunbookRobot-XXXX Access Zone.

    1. The IP pool for DFS client access will be configured with a SmartConnect zone alias of the format igls-ignore-xxxx.

    2. The Access Zone failover logic pool will be configured with a SmartConnect zone alias of the format igls-aaaa-bbbb, where aaaa is the same string on cluster1 and cluster2, and bbbb makes the alias (as a whole string) unique across your infrastructure. Eyeglass uses the first two sections (igls-aaaa) to map the pool from cluster1 to cluster2.
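
The alias convention above can be illustrated with a minimal Python sketch. This is not Eyeglass code: the function names and the sample aliases are hypothetical, and the sketch assumes only what is stated here, namely that the first two sections (igls-aaaa) of an igls-aaaa-bbbb alias form the mapping key and that the whole alias must be unique. Treating igls-ignore-xxxx aliases as excluded from the mapping is an assumption drawn from the alias name.

```python
def pool_mapping_key(alias):
    """Return the 'igls-aaaa' mapping key, or None if the alias is not mapped."""
    parts = alias.split("-")
    if len(parts) < 3 or parts[0] != "igls":
        return None  # not an igls-aaaa-bbbb style alias
    if parts[1] == "ignore":
        return None  # assumption: igls-ignore-xxxx pools are left out of the mapping
    return "-".join(parts[:2])


def pools_are_mapped(alias_cluster1, alias_cluster2):
    """True when both aliases share a mapping key and differ as whole strings."""
    key1 = pool_mapping_key(alias_cluster1)
    key2 = pool_mapping_key(alias_cluster2)
    return key1 is not None and key1 == key2 and alias_cluster1 != alias_cluster2
```

For example, the hypothetical aliases igls-robot-c1 and igls-robot-c2 would share the key igls-robot, while igls-ignore-repl1 would yield no key at all.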


See below a sample pool mapping for the DFS and Access Zone as seen from Eyeglass Zone Readiness:

  1. Both (DFS, Access Zone): Create a Test SyncIQ policy on cluster1 with name format "EyeglassRunbookRobot-yyyy", where yyyy is a number or string of your choice.

Note that you can only have one SyncIQ policy within each EyeglassRunbookRobot-XXXX access zone.

    1. Make the test SyncIQ policy target host the SmartConnect zone fqdn of the dedicated SyncIQ replication IP Pool on cluster2.

    2. Restrict Source Nodes option should be selected when configuring the test SyncIQ policy.

    3. The test SyncIQ policy source directory path should be below the EyeglassRunbookRobot Access Zone base directory.

    4. Run the policy from the OneFS UI once it has been created.

  1. Both (DFS, Access Zone): On cluster1 EyeglassRunbookRobot Access Zone, create test SMB shares (and quotas if required) at folder paths below the Test SyncIQ policy source folder.

    1. DFS Mode: In the DFS snap-in for AD, configure the DFS folder targets using the FQDN(s) of the cluster1 and cluster2 IP pools for client access.

Example formats for DFS folder targets are: <\\fqdn-of-DFS-testdata-IPpool-on-cluster1\SMBsharename>, and <\\fqdn-of-DFS-testdata-IPpool-on-cluster2\SMBsharename>. The Windows client will mount this DFS share using the path <\\AD domain name\dfsrootname\dfsfoldername>.
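
As a small illustration of the path formats above, the following Python sketch assembles the two folder target paths and the namespace path from their parts. The function names and every value passed to them are hypothetical placeholders; substitute the names used in your environment.

```python
def dfs_folder_targets(fqdn_cluster1, fqdn_cluster2, share_name):
    """Build the two DFS folder target UNC paths, one per cluster IP pool."""
    return ["\\\\" + fqdn + "\\" + share_name
            for fqdn in (fqdn_cluster1, fqdn_cluster2)]


def dfs_namespace_path(ad_domain, dfs_root, dfs_folder):
    """Build the path that the Windows client uses to mount the DFS share."""
    return "\\\\" + "\\".join([ad_domain, dfs_root, dfs_folder])
```

Both folder targets must point at the same SMB share name so that the client can switch targets after failover.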

  1. DFS Mode: On Eyeglass, change the configuration replication job associated with Test DFS policy to DFS mode. (See product documentation for details)

  2. Both (DFS, Access Zone): In order to have a Test SyncIQ mirror-policy in place (and also as a best practice), use Eyeglass to perform an initial controlled failover (cluster1 to cluster2) of the EyeglassRunbookRobot Access Zone or Test DFS SyncIQ policy, followed by a controlled failback (cluster2 to cluster1) of the EyeglassRunbookRobot Access Zone or Test DFS SyncIQ policy.

See sample screenshot below showing that mirror policies are in place for all the Jobs.

  1. Access Zone Mode: Must have DNS Dual Delegation in place and AD permissions for cluster to read and write SPNs.

Verify Environment Setup

  1. Verify production DFS policies are set up correctly with dual folder targets.

  1. Verify test DFS policy: From a Windows Client, write data to the share that was added to the DFS namespace and is protected by the EyeglassRunbookRobot DFS SyncIQ Policy before attempting the uncontrolled failover. Use the DFS tab in the properties of the DFS folder in Explorer to confirm that cluster 1 holds the active folder target.

  1. Verify test non-DFS policy: Repeat the write test for the non-DFS policy in the access zone. This verifies DNS resolution to the correct cluster.

  2. Access Zone Mode: Verify AD permissions by following the document here.
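
The write tests in this section can be scripted from the Windows client. The sketch below is one possible approach in Python, not part of Eyeglass: the function name and the file naming scheme are hypothetical, and the share path you pass in (for example the mounted DFS path) is an assumption about your environment. It writes a uniquely named file and reads it back.

```python
import os
import time
import uuid


def write_test_file(share_path):
    """Write a uniquely named file under share_path and verify it reads back."""
    name = "dr-test-{}-{}.txt".format(
        time.strftime("%Y%m%dT%H%M%S"), uuid.uuid4().hex[:8])
    full_path = os.path.join(share_path, name)
    payload = "write test " + name
    with open(full_path, "w", encoding="utf-8") as handle:
        handle.write(payload)
    with open(full_path, "r", encoding="utf-8") as handle:
        if handle.read() != payload:
            raise IOError("read-back mismatch for " + full_path)
    return full_path
```

A successful write only shows that the share is reachable and writable. It does not show which cluster served the request, so the DFS tab check (or a DNS lookup for non-DFS shares) is still required.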

Simulated Disaster Scenario - DFS Test SyncIQ Policy Failover

Overview

Use this procedure to simulate a DFS policy failover using uncontrolled mode to simulate a real DR event. This assumes a DFS mode policy has been enabled inside the Runbook robot access zone.

Note: This test can be done with or without production data failover in the same maintenance window.

Note: Before implementing the following simulated disaster scenario DFS Test SyncIQ policy failover steps, please make sure you have followed the instructions in the “Important Note”, “Initial Environment Setup”, “Verify Environment Setup” and “Support Statement” sections.

Pre Simulated Disaster - Cluster1 (prod cluster) is available - Controlled Failover

  1. Review all steps in the planning guide before beginning.  This is required to maintain support for this procedure. See support statement above on planning guide requirement.

  2. Perform Microsoft DFS controlled Failover for Production SyncIQ policies (the uncontrolled test policy will NOT be failed over at this step) from cluster1 to cluster2 using Eyeglass.

  3. On Eyeglass, enable the Production SyncIQ mirror-policy job in the jobs window if it is in USERDISABLED state post failover.

  4. Write data to production shares protected by Production SyncIQ Policy from DFS mount (confirm that cluster2 share path is the active target) on Windows Client after controlled failover.

  5. Production data controlled failover completed.  See below sample DFS Readiness at this stage.

Simulated Disaster - Cluster1 (prod cluster) becomes unavailable - Uncontrolled Failover

Overview

This procedure simulates a source cluster that has been destroyed or is unreachable on the network for a long period of time and requires a failover to the secondary site.

Note: To maintain support for this procedure, perform this step only against the test policy and access zone created in the Initial Environment Setup section.


  1. Simulate Cluster 1 failure:

  1. On cluster1 OneFS UI, remove the node interfaces from the dedicated IP pool used for client access to the EyeglassRunbookRobot-DFSzone access zone (NOTE: perform this only on this one IP pool). Consult EMC documentation. The DFS folder target path on cluster1 will fail once the node interfaces are removed from the pool.

  2. The previous step simulates DNS response failure for the cluster1 EyeglassRunbookRobot-DFSzone access zone without actually impacting the SSIP or normal DNS operations. At this point in the process, name resolution is down, and NetBIOS sessions are disconnected from the cluster1 EyeglassRunbookRobot-DFSzone access zone.

Notice from the above screenshot that the SmartConnect zone name fails to resolve, as expected (SERVFAIL is returned). At this point we have simulated a disaster: SmartConnect zone name resolution for cluster1’s EyeglassRunbookRobot-DFSzone access zone is failing, and no shares can be accessed on cluster1 in this access zone.
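
The failed name resolution can also be confirmed from a script. The following Python sketch uses the standard library resolver; the function name is hypothetical and the FQDN you pass in is whatever SmartConnect zone name your test pool uses. Note that the operating system resolver does not report the DNS response code, so this check cannot tell SERVFAIL apart from other lookup failures; use nslookup to see the response code itself.

```python
import socket


def name_resolves(fqdn, resolver=socket.gethostbyname):
    """Return True if fqdn resolves to an address, False if the lookup fails."""
    try:
        resolver(fqdn)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True
```

After the node interfaces have been removed from the pool, the expected result for the cluster1 SmartConnect zone name is False. The resolver parameter exists only so that the function can be exercised without a live DNS server.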

    1. Set the schedule for the EyeglassRunbookRobot DFS policy on the source cluster to manual, since the policy would not be able to run anyway if the source cluster had been destroyed. Do not proceed until this step is done.

  1. Perform DFS uncontrolled failover for EyeglassRunbookRobot Test DFS mode SyncIQ policy from cluster1 to cluster2 using Superna Eyeglass.  (see documentation for more details)

  2. Wait until the uncontrolled failover completes.

  3. Write data to share protected by EyeglassRunbookRobot Test DFS SyncIQ Policy from the DFS mount (confirm that cluster2 share path is the active target) on Windows Client after uncontrolled failover.

  1. Uncontrolled DFS Failover is complete.


Post Simulated Disaster - Cluster1 (prod cluster) Recovery Steps for DFS

Overview

These steps are executed to restore the uncontrolled policies to a working state. The production data is currently failed over to cluster 2 using controlled failover. Some customers may choose to stay on cluster 2 as production for some period of time before planning a failback. The test policies can be recovered by following the steps below:

  1. Simulate Cluster 1 returning to service:

  1. On cluster1 OneFS UI, rename shares within Test SyncIQ policy path to have igls-dfs-<sharename> format (this step should happen after "uncontrolled failover" step)

  2. On cluster1 OneFS UI, reconnect previously removed node interfaces back to IP pool used for DFS client access to test data on EyeglassRunbookRobot-DFSzone access zone.

  1. On cluster1 OneFS UI, run resync-prep on EyeglassRunbookRobot-DFS Test SyncIQ policy (consult EMC Documentation).

  2. Verify that resync-prep process was completed without error before proceeding to next steps.

    1. Check the OneFS SyncIQ reports tab to make sure all steps passed successfully.

  3. Check the job state in Eyeglass

    1. From Eyeglass, verify both policies on cluster 1 and cluster 2, then use the Jobs icon to re-enable the Eyeglass jobs for the EyeglassRunbookRobot-DFS Test Policy on cluster 1 and the mirror policy on cluster 2.

    2. Allow Eyeglass Configuration Data Replication to run at least once.  

    3. From Eyeglass Jobs-->“Running Jobs” window, verify that the Eyeglass Configuration Data Replication run from the previous step has completed without errors.

Note: The Eyeglass Configuration Data Replication task must complete before continuing with the steps below.

    1. Verify Eyeglass jobs show policy state correctly with Cluster 1 policy showing policy disabled and Cluster 2 showing enabled and OK (green).

    2. Wait for Config sync to correctly show the above state.

    3. Do not continue until the above validations are done.
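
The validations above act as a gate before failback. A minimal Python sketch of that gate is shown below. It is purely illustrative: the function name, its parameters, and the status string "OK" are hypothetical, and the values must be read manually from the Eyeglass Jobs window since no Eyeglass API is assumed here.

```python
def failback_blockers(cluster1_policy_enabled, mirror_policy_enabled,
                      mirror_job_status, config_sync_clean):
    """Return a list of reasons not to proceed; an empty list means proceed."""
    blockers = []
    if cluster1_policy_enabled:
        blockers.append("cluster 1 policy should show policy disabled")
    if not mirror_policy_enabled:
        blockers.append("cluster 2 mirror policy should show enabled")
    if mirror_job_status != "OK":
        blockers.append("cluster 2 mirror policy job should show OK (green)")
    if not config_sync_clean:
        blockers.append("Configuration Data Replication has not completed without errors")
    return blockers
```

Listing every unmet condition, rather than returning a single yes or no, makes it easier to record in the test log why the procedure was paused.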

  1. Perform Microsoft DFS-type controlled failback from cluster 2 to cluster 1 for EyeglassRunbookRobot DFS Test SyncIQ mirror-policy using Superna Eyeglass DR Assistant.


  1. Wait for the failback to complete.

  2. Write data to the share protected by EyeglassRunbookRobot DFS Test SyncIQ Policy from a DFS mount (confirm that cluster1 share path is the active target) on Windows Client after controlled failback.


  1. Perform Microsoft-DFS-type controlled failback of all Production SyncIQ mirror-policies from cluster 2 to cluster 1 using Superna Eyeglass.

  2. Write data to share protected by Production SyncIQ Policy from DFS mount (confirm that cluster1 share path is the active target) on Windows Client after controlled failback.






Simulated Disaster Scenario - EyeglassRunbookRobot Access Zone Failover

Overview

Use this procedure to simulate an Access zone failover using uncontrolled mode, to simulate a DR event.  This assumes dual delegation has been implemented and the Runbook robot access zone is fully functional.


Note: This test can be done with or without production data failover in the same maintenance window.


Note: Before implementing the following simulated disaster scenario Access Zone failover steps, please make sure you have followed the instructions in the “Important Note”, “Initial Environment Setup”, “Verify Environment Setup” and “Support Statement” sections.

Pre Simulated Disaster - Cluster1 (prod) is available - Controlled Failover

  1. Review all steps in the planning guide before beginning.  This is required to maintain support for this procedure. See support statement above on planning guide requirement.

  2. Using Eyeglass, perform controlled Failover of your Production access zone(s) from cluster1 to cluster2.  

  3. On Eyeglass, enable each Production SyncIQ mirror-policy job for your Production access zone if it is in USERDISABLED state.  Consult planning guide to maintain support for this procedure.

  4. Write data to share(s) protected by Production SyncIQ Policies from DFS mount (confirm that cluster2 share path is the active target) on Windows Client after controlled failover of the Production access zone.

  5. Do not proceed until above failover is validated as successful.

  6. Do not fail over the EyeglassRunbookRobot-SMBzone Access zone on cluster1.

  7. Procedure complete. Do not continue to the next steps until the Controlled Failover of your Production access zone(s) has completed successfully.

Simulated Disaster - Cluster 1 (prod) becomes unavailable: EyeglassRunbookRobot-SMBzone Access Zone

  1. Simulate Cluster 1 failure: See below the cluster 1 to cluster 2 IP pool mapping just before the disaster:

  1. On cluster1 OneFS UI, disconnect the node interfaces from the dedicated IP pool used for client access to test data on the EyeglassRunbookRobot-SMBzone access zone (the IP pool assigned to the robot access zone). If the SMB folder was properly set up, the SMB folder target path on cluster1 will fail when the node interfaces are removed from the pool. This is required to disconnect SMB sessions from clients to this pool and cause SMB mount failure.

  2. The previous step also simulates DNS response failure for cluster 1, because the pool no longer has any IPs, without actually impacting the SSIP or normal DNS operations. At this point in the process, name resolution is down, and NetBIOS sessions are disconnected from the cluster 1 EyeglassRunbookRobot-SMBzone access zone.

Notice from the above screenshot that name resolution for the SmartConnect name is down, as expected (SERVFAIL is returned). At this point we have simulated a disaster: SmartConnect zone name resolution for the cluster1 EyeglassRunbookRobot-SMBzone access zone is failing, and no shares can be accessed on the cluster1 EyeglassRunbookRobot-SMBzone access zone.

  1. Also, set the schedule for the EyeglassRunbookRobot-SMB Test policy on cluster 1 to manual, since a policy would not be able to run anyway if the source cluster had been destroyed. Do not proceed until this step is done.

NOTE: in a real DR event, it is assumed the source cluster is unreachable on the network.  

NOTE: Make note of the schedule; it will need to be reapplied at the end of this procedure.

  1. Perform Failover: Using Eyeglass, perform uncontrolled failover for EyeglassRunbookRobot-SMBzone access zone from cluster1 to cluster2.

  1. Wait until uncontrolled failover completes.

  1. Check that the SPNs are failed over in AD correctly using ADSI Edit.

  2. Validation: Test using nslookup to make sure DNS now resolves to Cluster 2.

  3. Correct or debug resolution of SmartConnect name before continuing.
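
The nslookup validation can be complemented by a scripted check that the SmartConnect name now resolves to an address inside the cluster 2 pool. The Python sketch below is one way to do this; the function name is hypothetical, and the pool ranges are assumptions that you must copy from the IP pool configuration on your own cluster.

```python
import ipaddress
import socket


def resolves_into_pool(fqdn, pool_ranges, resolver=socket.gethostbyname):
    """Return True if fqdn resolves to an IP inside one of the pool ranges.

    pool_ranges is a list of (first_ip, last_ip) string pairs.
    """
    address = ipaddress.ip_address(resolver(fqdn))
    for first, last in pool_ranges:
        if ipaddress.ip_address(first) <= address <= ipaddress.ip_address(last):
            return True
    return False
```

Run the check from the same client that will mount the share, since a cached DNS answer on that client can still point at cluster 1. If the lookup itself fails, the resolver raises an exception, which also indicates that resolution must be corrected before continuing.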

  1. Test Client Access: This step requires unmount and remount of the share to get new IP address.

    1. Reboot the client machine that was used to validate the share pre-disaster to guarantee that the NetBIOS session to cluster1 has not been preserved.

    2. Mount the share.

    3. Write data to share protected by EyeglassRunbookRobot-SMB Test SyncIQ Policy from SMB mount on Windows Client after the uncontrolled failover.

  2. Uncontrolled Access Failover complete.

Post Simulated Disaster - Cluster1 (prod) becomes available: EyeglassRunbookRobot-SMBzone Access Zone

  1. Simulate Cluster 1 availability: See below the cluster 2 to cluster 1 EyeglassRunbookRobot-SMBzone access zone IP pool mapping just after cluster1 is available. Note that the previously removed node interfaces have not been re-connected at this point:

  1. On cluster1 OneFS UI, edit the source cluster EyeglassRunbookRobot-SMBzone access zone IP pool SmartConnect name and apply the igls-original prefix to the existing SmartConnect name.  This is a required step before re-connecting the previously removed cluster1 node interfaces.

  2. On cluster1 OneFS UI, reconnect previously removed node interfaces back to IP pool used for client access to test data on EyeglassRunbookRobot-SMBzone access zone.

  1. On cluster1 OneFS UI, run resync-prep on EyeglassRunbookRobot-SMB Test SyncIQ policy.  Consult EMC Documentation.

  2. Verify that resync-prep process was completed without error before proceeding to next steps.

    1. Resolve any errors before continuing.  Resync prep must have run successfully before you attempt to complete remaining steps. Check the cluster reports show no errors before continuing.

  3. Check Eyeglass job state

    1. From Eyeglass, verify that the EyeglassRunbookRobot-SMB Test Policy and the mirror policy are in the correct state. The mirror policy should be Enabled and the Cluster 1 policy should be in the disabled state.

    2. Allow Eyeglass Configuration Data Replication to run at least once.

    3. Note: Configuration Data Replication task must complete before continuing with steps below. Verify a config sync task has been run from running jobs window without errors.

    4. Verify Eyeglass jobs show policy state correctly with Cluster 1 policy showing policy disabled and Cluster 2 mirror policy showing enabled and OK (green).

    5. Wait for Config sync to correctly show the above state.

    6. Do not continue until the above validations are done.

  4. From Eyeglass, select and run the Zone Failover Readiness jobs. This allows a new Access Zone Failover Readiness Audit to be computed.

  1. On Eyeglass DR Dashboard, confirm that zone readiness looks good for the EyeglassRunbookRobot-SMBzone access zone.

  2. Perform Controlled Failback: Using Eyeglass, perform controlled failback of EyeglassRunbookRobot-SMBzone access zone from cluster 2 to cluster 1.

  1. Wait until the controlled failback completes.

  1. Check SPNs are failed over in AD correctly using ADSI Edit.

  2. Validation: Test using nslookup to make sure DNS now resolves to Cluster 1.

  3. Correct or debug resolution of SmartConnect name before continuing.

  1. Test Client Access: This step requires unmount and remount of the share to get new IP address.

    1. Reboot the client machine that was used to validate the share pre-disaster to guarantee that the NetBIOS session to cluster2 has not been preserved.

    2. Mount the share.

    3. Write data to the share protected by the EyeglassRunbookRobot-SMB Test SyncIQ Policy from an SMB mount on the Windows Client after the controlled failback.

  2. Controlled procedure complete.

  3. If performing failback of Production data, follow the planning guide process to maintain support.