Eyeglass Failover Design Guide




Product Name - Superna Eyeglass

Revision Changes to this Document - Release 2.0




Contents

  1. 1 Product Name - Superna Eyeglass
    1. 1.1 Revision Changes to this Document - Release 2.0
  2. 2 Introduction to this Guide
    1. 2.1 Overview
    2. 2.2 What’s New with Eyeglass Failover  
    3. 2.3 Help
  3. 3 Failover Planning  
    1. 3.1 Are you Planning a failover?  
    2. 3.2 How to determine best approach for Quota for failover?
      1. 3.2.1 Quota failover options:
    3. 3.3 Access Zone Failover
    4. 3.4 IP Pool Failover
    5. 3.5 SyncIQ DFS Mode with Eyeglass
    6. 3.6 SyncIQ Mode with Eyeglass
    7. 3.7 Failover Readiness
  4. 4 Storage Failover with Eyeglass Failover Modes
  5. 5 Supported DR Site and Failover Topologies
    1. 5.1 Data Center to Data Center
    2. 5.2 Multi Site Failover
    3. 5.3 Data Center DR Fan-IN Topology
    4. 5.4 2 Site DR - Stretch 3rd site Configuration Sync
  6. 6 How to use the DR Dashboard to Assess Failover Readiness
    1. 6.1 Policy Readiness / DFS Readiness
    2. 6.2 Zone Readiness
  7. 7 How to enable Automated DR Testing the Eyeglass Runbook Robot Feature
  8. 8 Planning and Procedures for Eyeglass SyncIQ DFS Mode Failover
    1. 8.1 DFS Mode Preparation Checklist
    2. 8.2 DFS Mode Compatibility
    3. 8.3 Considerations for Eyeglass SyncIQ DFS mode vs Default Configuration Sync job mode in Eyeglass
    4. 8.4 Procedure to Enable Eyeglass SyncIQ DFS mode
    5. 8.5 Detailed DFS Mode Configuration, Operating procedures and Design guidelines
  9. 9 Planning and Procedures for Eyeglass SyncIQ Mode Failover
  10. 10 Planning and Procedures for Eyeglass Access Zone Failover
  11. 11 How to Execute A Failover with DR Assistant
    1. 11.1 How to know when Uncontrolled failover should be used?
    2. 11.2 Eyeglass Pre-Failover Check Important - Read me
    3. 11.3 How to failover Data With DR Assistant
  12. 12 Post Failover Procedures
    1. 12.1 Post Access Zone Failover Steps
    2. 12.2 Post Access Zone Failover Health Check Steps
    3. 12.3 POST DFS Mode Failover Steps
  13. 13 How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster
    1. 13.1 Debug Plan of attack for clients post failover:
    2. 13.2 Steps to Validate SmartConnect and DNS failed for over Successfully:
    3. 13.3 Steps to test Mounting a Share with Access Zone failover on Machine with no previous Mount:
    4. 13.4 Steps to correctly test a machine with existing Mount to Source cluster post Access Zone Failover and remount to test write access:
    5. 13.5 Steps to test Mounting a DFS protected Share with DFS failover mode:
  14. 14 How to Monitor the Eyeglass Assisted Failover
    1. 14.1 In-Progress Failover
    2. 14.2 Completed Failover
  15. 15 Troubleshooting Failover
    1. 15.1 Failover Recovery Procedures
    2. 15.2 Collecting Logs for Failover Troubleshooting
    3. 15.3 Authentication with Service Principal Name Considerations with Active Directory and SMB Shares in Access Zones
  16. 16 Appendix A  - Advanced Failover Modes
  17. 17 Cached config advanced mode
    1. 17.1 How to enable cached config advanced mode
  18. 18 Failover Advanced mode Configuration - Parallel thread and failover jobs
    1. 18.1 Parallel threads
      1. 18.1.1 How to enable Fast Failover parallel threads advanced mode
    2. 18.2 Parallel Failover Jobs
      1. 18.2.1 How to enable Fast Failover parallel Access zones/IP pools advanced mode



Introduction to this Guide

Overview

Eyeglass offers single-button assisted failover by Access Zone (requires release 1.4 update 1 or later), by Microsoft DFS enabled SyncIQ policies, or by individual SyncIQ policies.  This document provides:

  • An overview of each failover mode

  • High level steps for each failover mode

  • How to assess readiness for failover

  • Planning and operational steps for each failover mode

For guidance on which failover mode is appropriate for your environment, consult the Eyeglass Start Here First document.  For each failover option, it provides the information you need to decide which option suits your environment:

  • When to use it?

  • Why use it?

  • What you need to know?

  • Estimated knowledge to configure


What’s New with Eyeglass Failover  

Release

Description

Failover Mode

1.6

New error handling for OneFS PAPI errors that occur during failover.  If PAPI returns an error such as 503 Service Unavailable on any of the allow writes, run policy/mirror policy, or resync prep steps, Superna Eyeglass now retries the action 3 times, because an error such as 503 Service Unavailable may be transient.

All Failover Modes
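
The retry behavior described for release 1.6 can be illustrated with a minimal sketch. It is not Eyeglass code: `TransientApiError`, `run_with_retries` and the `delay_seconds` parameter are hypothetical names, and the sketch assumes "retry 3 times" means up to three further attempts after the first failure.

```python
import time


class TransientApiError(Exception):
    """Hypothetical error for a retryable response such as 503 Service Unavailable."""


def run_with_retries(step, retries=3, delay_seconds=0.0):
    """Call step(); on a transient error, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return step()
        except TransientApiError:
            if attempt >= retries:
                raise  # retries exhausted: surface the error to the caller
            attempt += 1
            time.sleep(delay_seconds)
```

A step that fails twice and then succeeds returns normally; a step that keeps failing raises after the fourth attempt.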

1.6

A timeout countdown is now shown for each step being processed, so the timeout is visible during a failover.  The log includes a URL to the recovery guide for long-running steps, along with the HTTPS login URL for cluster management, allowing one-click access from a failover to the Isilon UI console to check on cluster operations.

All Failover Modes

1.6

Key steps are now grouped:

  1. The make writeable step is processed for all policies together (in series), making the filesystem writeable faster for all policies involved in the failover.

  2. The resync prep step now runs as a batch for all policies, after the make writeable step has run for all policies.

All Failover Modes

1.6.1

The release notes on failover must now be read and acknowledged in DR Assistant before a failover is allowed to continue.

All Failover Modes

1.7

As of release 1.7, all failover modes restrict the number of parallel job requests sent to the Isilon cluster for the Run SyncIQ Policy data sync step, based on cluster version:

  OneFS 7.2 - 5 parallel job requests (OneFS 7.x clusters have a limit of 5 concurrent policies).  Eyeglass monitors the progress of each job and submits a new request as previously submitted requests complete.

  OneFS 8    - the parallel job request limit is based on the Eyeglass appliance configuration (default 10).

Based on extensive testing for safe failovers, make writeable and resync prep are serialized steps.

All Failover Modes
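
As a rough illustration of the release 1.7 limit, the sketch below picks a concurrency limit from the OneFS major version and uses a bounded worker pool, so a new job starts only as an earlier one completes. The function names and the `run_policy` callable are hypothetical; this is not how Eyeglass is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_job_limit(onefs_major, appliance_limit=10):
    """Return the parallel job request limit described for release 1.7."""
    # OneFS 7.x clusters are limited to 5 concurrent policies.
    return 5 if onefs_major < 8 else appliance_limit


def run_policies(policies, run_policy, onefs_major, appliance_limit=10):
    """Run every policy without exceeding the limit for the cluster version."""
    limit = parallel_job_limit(onefs_major, appliance_limit)
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # pool.map preserves input order, so results line up with policies.
        return dict(zip(policies, pool.map(run_policy, policies)))
```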

1.8

This release introduces parallel failover mode, which is disabled by default.

High Speed Failover - Parallel Failover Flag :

    1. Allows the make writeable and resync prep steps to run in parallel with up to 10 threads, ensuring that 10 policies are submitted for processing at all times.

    2. NOTE: The risk of a policy failure increases.  With the new flag set, a policy failure will NOT stop the failover in progress; Eyeglass continues to issue API calls until all SyncIQ policies in the failover job have been submitted.  This runs the risk of a more complex recovery if more than one policy fails to complete its step (Allow Writes OR Resync Prep).

    3. Testing has shown that for failovers with a large number of policies, this can improve failover times 3x to 4x.

Access Zone Failover Enhancement:

    1. New validation detects time skew between cluster nodes, and between Eyeglass and the clusters.

    2. A validation warning is raised if time skew is detected.

    3. Time skew can cause failed steps if the time on different nodes is not within the range needed to detect the steps or running status of a policy during failover.

SyncIQ Job Reports appended to Eyeglass failover log :

    1. Policy run and resync prep reports are now appended to the end of the Eyeglass failover log, simplifying triage of failed steps and escalation to EMC support when cluster policies fail.

    2. All information and time stamps are now in a single file.

All Failover Modes
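
The time skew validation introduced in release 1.8 can be pictured with the following sketch. The guide does not state the acceptable range, so `allowed_seconds` is an assumed parameter, and the clock readings are assumed to be epoch seconds collected from Eyeglass and each cluster node by some other means.

```python
def max_time_skew(clock_readings):
    """Return the largest difference, in seconds, between any two clocks."""
    values = list(clock_readings.values())
    return max(values) - min(values)


def skew_warning(clock_readings, allowed_seconds):
    """Return a warning string when skew exceeds the allowed range, else None."""
    skew = max_time_skew(clock_readings)
    if skew > allowed_seconds:
        return f"Time skew of {skew:.1f}s exceeds {allowed_seconds:.1f}s"
    return None
```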

1.9

Failover Enhancements

  1. Open files validation has been removed from DR Assistant until the Isilon API supports open files per Access Zone.

  2. New Access Zone readiness validation verifies all IP pools have a SmartConnect zone defined.

  3. DR Assistant SyncIQ reports from a failover are now separated from the Eyeglass logs in the failover history, making debugging simpler.

  4. Restrict at source validation updated to show info only in the DR Dashboard.

    1. This simplifies validation of Access Zone readiness for failover.  Restrict at source is a best practice; each policy shows green if it is implemented or info if it is not.

  5. SPN Management Enhancements

    1. SPN failover enhancement for Access Zone failover now restricts the delete and add SPN API calls to a single cluster node in the target cluster.    

      1. This change ensures a single domain controller is used for the failover operations.

    2. Short SPNs are now synced to AD computer objects (not used for Kerberos) during config sync; any that are missing are inserted.  NOTE: This is not related to failover of SPNs.  It only maintains newly detected SmartConnect names and ensures they are synced to the AD computer object.

  6. The real-time failover log view in DR Assistant allows a live failover log to be monitored, with auto refresh or a stop and pause option.

  1. Quota Failover  Enhancement

    1. A linked quota that is unlinked from its parent quota becomes a quota that can be managed with a different limit from the parent quota.

    2. Eyeglass now correctly fails over unlinked quotas.  Unlinked quotas fail over as normal quotas, and the parent all users quota is failed over next to ensure no conflict occurs on the target cluster.

    3. Shares with variable expansion in the path name now sync correctly between clusters.

  2. Ransomware Defender Failover

    1. See Ransomware Defender failover in the Ransomware Defender Admin Guide.


2.0

New Failover Mode

  1. IP Pool failover, allowing hot-hot data within an Access Zone and more granular failover options.  See the Access Zone guide for configuration requirements.

Failover Logic Major Enhancements

  1. Parallel Failover Jobs:

    1. This feature allows multiple failovers to execute in parallel.  All failover types are supported.

    2. NOTE: The parallel thread count is set to 10 and is shared across all failover jobs.

  1. LOGGING: The failover log is now split in two.  The first half covers failed over data and client redirect: the failover of data and clients, and post failover scripts.  The second half covers post failover steps, including failback steps and quota failover.

  2. Continue on failed step: Based on analysis of many failovers, the new logic continues to execute steps as outlined below.  This ensures all SyncIQ policies are attempted even if one SyncIQ policy encounters an error.

    1. Make Writeable step on each SyncIQ policy - If any policy fails to run, all other policies are run and failover continues.  The steps not yet run for the failed policy are skipped.

    2. Run Resync Prep SyncIQ - If any policy fails to run, all other policies are run and failover continues.

    3. NOTE: Any policy that fails a step will have its following steps skipped.

  3. Cancel a running failover: This option appears in the running failover tab of DR Assistant and allows a running failover to be canceled.  NOTE: No rollback will occur, and the failover stops at whatever step was being executed.  All steps to recover from this are manual.  Use with caution.

    1. Cancel Failover option on running failovers UI.  NOTE: Only use if directed by support.
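
The continue on failed step logic above can be sketched as follows. This is an illustrative model only: `run_failover_steps` and the step functions passed to it are hypothetical, and a step is assumed to signal failure by raising an exception.

```python
def run_failover_steps(policies, steps):
    """Run each step across all policies; a policy that fails skips later steps.

    `steps` is an ordered list of (name, function) pairs. Each function takes
    a policy name and raises an exception on failure.
    """
    failed = {}  # policy name -> name of the step that failed
    for step_name, step in steps:
        for policy in policies:
            if policy in failed:
                continue  # skip remaining steps for a policy that already failed
            try:
                step(policy)
            except Exception:
                failed[policy] = step_name  # record and carry on with the rest
    return failed
```

The returned mapping shows which policies need manual recovery and at which step they stopped.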


New Failover Options in DR Assistant

  1. Data Integrity Failover

    1. Access Zone, DFS, and per SyncIQ policy failover now insert deny everyone permissions on the shares that will be failed over, as a pre-failover step.  This disconnects open files and disconnects users from all shares involved in the failover, ensuring data integrity of the failed over data set when SyncIQ is run by Superna Eyeglass® after users are disconnected.

    2. A post failover step restores share permissions to their original security settings.

    3. This feature can be disabled per failover in DR Assistant.

    4. Supports SMB shares in this release.

    5. See the new DR Assistant option below.  Mouse over the failover options for help text.

  1. Failover option added to skip Quota Failover:  This new DR Assistant check box allows skipping the quota failover step when a failback is planned within a short period of time.  It can also help avoid failed failovers caused by quota scans failing SyncIQ steps.

Skip quota failover step option DR Assistant

    1. In some customer environments the quota scan job interferes with failover and failback performance.  Waiting for the quota scan to complete adds hours to a failover, or interrupts a failover with a failed SyncIQ step.

    2. This feature allows quota failover to be skipped, leaving the quotas on the source cluster.

    3. Eyeglass has a special quota sync command line tool that allows quotas to be synced AFTER a failover has been completed.

    4. Customers can now choose to skip quota failover in DR Assistant.  A separate feature detects whether quotas already exist that will fail SyncIQ steps.

3. DR Assistant Block Failover on Warnings

Overview:  This validates failover jobs and prevents a failover from starting under certain conditions that would result in a failure.  It applies to newly created quotas that have not yet been scanned by the quota scan job.

  1. Quota scans are triggered on OneFS 8 when quotas are created, or when quota scan jobs are scheduled to run to calculate quotas.

    1. This can interfere with the make writeable and resync prep steps during failover.

    2. It is best practice to ensure no quotas are created before failover to avoid this conflict.

    3. Quota scan locks the file system, blocking SyncIQ from completing steps.

  2. DR Assistant has a new option (enabled by default) that detects whether any quotas exist on the target cluster at the time of failover that match the SyncIQ policies selected for failover, and aborts the failover:

    1. If any quotas have the ready for quota scan attribute set (this flag indicates that quota scan needs to run).

    2. Note: Disabling or canceling a running quota scan job on the cluster does not avoid the conflict with SyncIQ.  The attribute on the quota determines whether the SyncIQ step will fail.

  3. DR Assistant offers the ability to uncheck this detection function, at the user's risk of SyncIQ steps failing.
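
A minimal sketch of the detection described above, under stated assumptions: quotas are modeled as dictionaries with hypothetical `path` and `needs_scan` keys, and a quota is taken to "match" a SyncIQ policy when its path lies at or below the policy path. The actual matching rule used by DR Assistant is not given in this guide.

```python
def quotas_blocking_failover(target_quotas, policy_paths):
    """Return target-cluster quotas under a selected policy path that still need a scan."""
    def under(path, root):
        root = root.rstrip("/")
        return path == root or path.startswith(root + "/")

    return [
        quota for quota in target_quotas
        if quota["needs_scan"] and any(under(quota["path"], p) for p in policy_paths)
    ]


def may_start_failover(target_quotas, policy_paths, detection_enabled=True):
    """Return False (abort) when detection is on and a blocking quota exists."""
    if not detection_enabled:
        return True  # the user has accepted the risk of SyncIQ steps failing
    return not quotas_blocking_failover(target_quotas, policy_paths)
```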

Failover log Enhancements

  1. Color coded Success and Failure per step, to quickly identify any step that failed.

  2. Failover Summary:  All key steps are summarized at the end of the failover.  Example below:

    1. Overall Failover Job status: Completed, total elapsed time: 0 hours, 11 minutes, 40.50 seconds.

    2. Final SyncIQ Jobs status: Completed, elapsed time: 0 hours, 1 minutes, 34.02 seconds.

    3. Client Redirect status: Completed, elapsed time: 0 hours, 0 minutes, 26.17 seconds.

    4. Make Target Writeable status: Completed, elapsed time: 0 hours, 0 minutes, 40.75 seconds.

    5. Quota Jobs status: Completed, elapsed time: 0 hours, 0 minutes, 2.21 seconds.

    6. Preparation for Failback status: Completed, elapsed time: 0 hours, 0 minutes, 56.89 seconds.
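
For readers who post-process failover logs, the elapsed time format in the summary lines above can be reproduced with a short sketch. The helper names are hypothetical and this is not the code Eyeglass uses to write the log.

```python
def format_elapsed(total_seconds):
    """Format a duration in seconds in the style used by the failover summary."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours)} hours, {int(minutes)} minutes, {seconds:.2f} seconds"


def summary_line(label, status, total_seconds):
    """Build one per-step summary line, e.g. for Client Redirect."""
    return f"{label} status: {status}, elapsed time: {format_elapsed(total_seconds)}."
```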




Help

This guide is designed to help with failover design with Eyeglass.  Should you have any issues that are not addressed in this guide, Superna offers support in several forms: online, voicemail, email, or live online chat.

  1. The support site provides online ticket submission and case tracking.  Support Site link - support.superna.net 

  2. Leave a voicemail at 1 (855) 336-1580

To receive service you must have an account in our system.  When calling, leave your customer name, email, a description of the question or issue, and the primary contact for your company.  We will assign the case to the primary contact for email follow-up.

  1. Email eyeglasssupport@superna.net

  2. To download license keys please go to the following  license keys.

  3. You can also raise a case directly from the Eyeglass desktop using the help button.  Search for your issue, and if you want to raise a case or get a question answered, click “leave us a message” and enter your name, email and appliance ID.  A case is opened directly from Eyeglass.

 http://site.superna.net/_/rsrc/1472870726155/support/LeaveUsAMessage.png?height=200&width=167

  1. Or get support using chat, M-F 9-5 EDT (an empty chat box means we are not online yet)

  2. Eyeglass Live Chat 

  3. You should also review our support agreement here.

Failover Planning  

Are you Planning a failover?  

We recommend you review our planning checklist for a proven process to fail over successfully:

Failover Planning Guide and checklist

For a summary of Best Practices for Eyeglass and Isilon Refer to Eyeglass and Isilon DR Best Practices.


How to determine best approach for Quota for failover?


Quotas pose some challenges for failover with OneFS 8.x.  The quota scan job runs as soon as new quotas are created, and sets a flag on newly created quotas to indicate when the quota domain has been created.  SyncIQ operations conflict with quotas whose flag indicates that the quota domain has not been created yet.  This can cause the make writeable or resync prep step to fail during a failover.

Quota failover options:

  1. Failover quotas before a planned failover and leave quotas on both clusters after failover

    1. Pros - When many quotas exist, the quota step can take time during failover.  Pre-sync quotas only on OneFS 8.x or later releases.  Recommendation: Use this option if you plan to fail over on Friday and fail back on Saturday.  The quotas will not be required for the failover testing, and it is safer to leave them on the source cluster.

      1. Steps: open the Jobs icon in Eyeglass and run the quota job manually for the policies that will be failed over.

      2. Run this job for any new quotas that are created manually before the failover.  This job type does NOT run on a schedule and must be run manually.

      3. During failover, use the new skip failover option by unchecking the quota check box.  The quota step of the failover is skipped, leaving quotas on both the source cluster and the target cluster after failover.

      4. NOTE:  On failback make sure to uncheck the quota failover option.

  2. Failover quotas before a planned failover and remove the quotas on the source cluster after failover

    1. Pros - This option allows quotas to be removed from the source cluster during the failover, and does not trigger creation of new quotas during failover since they were pre-synced.  Recommendation:  Use this option when you plan to fail over and stay on the remote cluster for days or weeks before failback.

      1. Steps are the same as above, except that all defaults are left in place in DR Assistant so that quotas fail over.




Access Zone Failover

Eyeglass uses the Access Zone as the basis for grouping data for failover when customers choose not to use DFS mode or per SyncIQ policy failover.  The Access Zone is selected as the unit of failover so that DR readiness planning and failover operations are simplified to the Access Zone level.  Shares, exports and quotas can be failed over with this mode of failover.

Access Zone failover includes networking failover of SmartConnect Zones and any SmartConnect Zone aliases that exist as well.  Eyeglass must failover ALL IP pools that are members of the Access Zone and all aliases which means all SyncIQ policies and ALL shares, exports and quotas must failover at the same time.   The SmartConnect failover process requires the source cluster zone names to be renamed (not deleted) during failover to avoid SPN collisions in Active Directory and to prevent clients from mounting the source cluster after failover.

This requires planning and mapping of IP pools from source to target clusters before readiness for the Access Zone is marked as ready for failover.

In addition, SMB authentication depends on the AD machine account having the correct SPN values for SmartConnect Zones; failover and authentication depend on SPNs being registered with the cluster that is writable.  Eyeglass Access Zone failover automates SPN management.  It also creates the SmartConnect Zone aliases required to access data with a simple DNS update that delegates the SmartConnect Zone to the Isilon cluster. (NOTE: DFS mode does not require DNS, SPN and SmartConnect Zone changes during failover)

The following figure shows Cluster Configuration Before Access Zone Failover.  This is the normal state, with primary and secondary clusters available.  Preparation for failover consists of creating mapping hints before failover.

The second figure shows the Cluster Configuration Access Zone Failover Steps with the Primary Cluster not accessible (e.g. Real DR example)




Eyeglass DR Assistant - Access Zone Failover - Summary

  1. Ensure that there is no live access to data OR enable the Data Integrity failover option to disable access to SMB Shares before failover

  2. Begin Failover (Eyeglass automated)

  3. Validation (Eyeglass automated)

  4. Set configuration replication for policies to USERDISABLED (Eyeglass automated)

  5. Provide write access to data on target (Eyeglass automated)

  6. Move SmartConnect Zone to Target (Eyeglass automated)

  7. Update SPN to allow for authentication against target (Eyeglass automated)

  8. Repoint DNS to the Target cluster IP address (use post failover script) (Eyeglass automated with scripting)

  9. Refresh session to pick up DNS change (use post failover script) (Eyeglass automated with scripting)

For details on this failover mode consult the Access Zone Failover Guide link. 




IP Pool Failover

Eyeglass now offers the IP pool as a new unit of failover within an Access Zone.  Each IP pool now has its own DR Readiness calculation and failover operations.  Shares, exports and quotas can be failed over with this mode of failover.

IP pool  failover includes networking failover of SmartConnect Zones and any SmartConnect Zone aliases that exist as well.  Eyeglass must failover ALL policies mapped to the Pool using IP pool policy mapping UI in the DR Dashboard.  All  SmartConnect names and aliases configured on the pool and all mapped SyncIQ policies plus ALL shares, exports and quotas associated to the SyncIQ policies will failover at the same time.   The SmartConnect failover process requires the source cluster zone names to be renamed (not deleted) during failover to avoid SPN collisions in Active Directory and to prevent clients from mounting the source cluster after failover.

This requires planning and mapping of IP pools from source to target clusters before readiness for the pools  is marked as ready for failover.

It also requires converting an Access Zone to IP pool failover, which means all pools within an Access Zone must have a policy mapped to a pool before ANY pool in the zone can be failed over.  
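
The rule above, that every pool in the Access Zone needs a mapped policy before any pool can fail over, can be expressed as a small check. This is a sketch with hypothetical names; pool and policy names are plain strings, and the mapping is assumed to be a dictionary from pool name to a list of policy names.

```python
def unmapped_pools(zone_pools, pool_policy_map):
    """Return the pools in the Access Zone that have no SyncIQ policy mapped."""
    return [pool for pool in zone_pools if not pool_policy_map.get(pool)]


def pool_failover_allowed(zone_pools, pool_policy_map):
    """A pool may fail over only when every pool in its zone has a mapped policy."""
    return not unmapped_pools(zone_pools, pool_policy_map)
```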

In addition, SMB authentication depends on the AD machine account having the correct SPN values for SmartConnect Zones; failover and authentication depend on SPNs being registered with the cluster that is writable.  Eyeglass IP pool failover automates SPN management, along with creation of the SmartConnect Zone aliases needed to access data with a simple DNS update that delegates the SmartConnect Zone to the Isilon cluster. (NOTE: DFS mode does not require DNS, SPN and SmartConnect zone changes during failover).  DFS IP pools can be failed over with the Pool failover feature.


The following figure shows IP Pool Failover with the Primary Cluster not accessible (e.g. Real DR example)







Eyeglass DR Assistant - IP pool Failover - Summary

  1. Ensure that there is no live access to data OR enable Data Integrity failover option to disable access to SMB Shares before failover

  2. Begin Failover (Eyeglass automated)

  3. Validation (Eyeglass automated)

  4. Set configuration replication for policies to USERDISABLED (Eyeglass automated)

  5. Provide write access to data on target (Eyeglass automated)

  6. Move SmartConnect zone to Target (Eyeglass automated)

  7. Update SPN to allow for authentication against target (Eyeglass automated)

  8. Repoint DNS to the Target cluster IP address (use post failover script) (Eyeglass automated with scripting)

  9. Refresh session to pick up DNS change (use post failover script) (Eyeglass automated with scripting)

For details on this failover mode consult the Access Zone Failover Guide link.   Look for the IP pool failover section.


SyncIQ DFS Mode with Eyeglass

This mode enables the most seamless failover and failback operations with full Quota failover/failback integration (excluding exports).  The solution enables zero touch client failover to always mount the writable copy of the SyncIQ data with quotas active and requires no DNS updates, no remount, no reauthentication.

This is achieved using DFS folder UNC targets (with the same share name) and a SmartConnect Zone for each cluster.  DFS is set up to use both clusters, and Eyeglass ensures shares exist on only one cluster at a time, moving them during failover events.  The DFS target folder path to the secondary cluster is activated automatically once the shares are created by Eyeglass.

NOTE: It’s possible to use 2 different SmartConnect Zones on source and destination cluster so that nothing needs to change during failover on either cluster.  See below

The following figure shows a typical DFS folder setup:





Eyeglass DR Assistant - DFS Mode Failover - Summary

  1. Ensure that there is no live access to data OR enable Data Integrity failover option to disable access to SMB Shares before failover

  2. Begin Failover (Eyeglass automated)

  3. Validation (Eyeglass automated)

  4. Set configuration replication for policies to USERDISABLED (Eyeglass automated)

  5. Provide write access to data on target (Eyeglass automated)

  6. (Not performed and not required) Move SmartConnect zone to Target (Eyeglass automated)

  7. (Not performed and not required) Update SPN to allow for authentication against target (Eyeglass automated)

  8. (Not performed and not required) Repoint DNS to the Target cluster IP address (use post failover script) (Eyeglass automated with scripting)

  9. Fail over Shares and Quotas - shares and quotas are created on target and deleted from the source cluster (Eyeglass automated)

  10. DFS Clients automatically switch to DR cluster with DFS 2nd Folder UNC target path.

For Details on this failover mode consult the Microsoft DFS Mode Failover Guide link. 

SyncIQ Mode with Eyeglass

This mode of failover allows targeted failover, with some manual steps, of selected policies without failing over an entire Access Zone of policies.  Since no SPN management is performed with this failover type, it is better suited to NFS export failover plus quotas.  Shares and exports are pre-synced with Eyeglass, so both protocols are supported with this mode.

This failover mode does not automate SmartConnect Zone failover as is done with Access Zone failover.  This means selected SmartConnect Zones can be failed over, but SmartConnect Zone aliases and the DNS update must be done manually to complete the failover.

This mode of failover is also useful with post failover script engine that can execute host side unmount and remount commands using scripts and leveraging the samples provided with Eyeglass.  Superna Professional Services can also be engaged to build host side scripts for customer requirements.

Review the  Script Engine Overview  section in the Eyeglass Administration Guide

These scripts allow simple SSH based unmount and remount automation on remote hosts.  This can also be done without updating DNS, since the target cluster SmartConnect Zone can be mounted directly once the SyncIQ policy is marked writeable on the target cluster.
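
As an illustration of host side remount automation, the sketch below builds the `ssh` command lines that would unmount a host and remount it against the target cluster.  It is not one of the sample scripts shipped with Eyeglass: the function name, the default `root` user and all example host, path and zone names are assumptions, and the commands are only constructed, not executed.

```python
def remount_commands(host, mount_point, target_export, user="root"):
    """Build ssh commands that unmount a host and remount it on the target.

    `target_export` has the form "<target SmartConnect name>:<export path>".
    """
    ssh = ["ssh", f"{user}@{host}"]
    return [
        ssh + ["umount", mount_point],
        ssh + ["mount", "-t", "nfs", target_export, mount_point],
    ]
```

Each returned list can be passed to `subprocess.run(command, check=True)` from a post failover script once the policy is writeable on the target cluster.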

We recommend this option for automation when the host count is <30.  If the host count is higher we recommend Access Zone failover and DNS updates.

The following diagrams show the flow of failover and steps with sample commands that would be run during the Eyeglass policy failover.  The SPN commands are shown if SMB manual failover is being executed.

For Details on this failover mode consult the SyncIQ Policy Failover Guide. 

Failover Readiness

The Eyeglass assisted failover has diagnostics to detect when failover is not possible or recommended and updates a simple DR Dashboard to indicate your current state.  



For Access Zones or IP pools, the DR Dashboard indicates when any of the following need attention: Data sync issues, configuration sync issues, SPN out of sync conditions and invalid IP pool mapping for IP pool or Access Zone failover.  

The DR Dashboard also provides a per SyncIQ policy and DFS mode policy readiness dashboard, covering SyncIQ and configuration sync readiness.  This allows failover readiness to be assessed below the Access Zone level rather than for the entire Access Zone.  Eyeglass validates your DR readiness at regular intervals and will notify you via Eyeglass external alarming (if configured) if a problem is detected.


The Eyeglass Runbook Robot feature is another way to validate your readiness by automating a failover on a specific, non-production “EyeglassRunbootRobot” Access Zone or SyncIQ Policy every night at midnight.  This exercises the actual failover steps in your environment daily and will also notify you via Eyeglass external alarming (if configured) when a problem is detected.

This feature operates as cluster witness and mounts the cluster over NFS and writes and reads back test data to verify failover from the client view of the cluster.   It can be configured in basic or advanced modes.  See Runbook Robot admin guide.
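
The write and read back check can be pictured with the sketch below, which writes a unique marker file under a mounted path, reads it back and cleans up.  It only illustrates the idea; the function name and file naming are hypothetical, and the Runbook Robot's own implementation is not described in this guide.

```python
import os
import uuid


def verify_write_read(mount_path):
    """Write a unique marker file under mount_path, read it back, then remove it."""
    marker = uuid.uuid4().hex
    test_file = os.path.join(mount_path, f"failover-check-{marker}.txt")
    with open(test_file, "w") as handle:
        handle.write(marker)
    try:
        with open(test_file) as handle:
            return handle.read() == marker  # True when the data round-trips
    finally:
        os.remove(test_file)  # leave no test data behind on the export
```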

The basic mode uses only a SyncIQ policy for failover, with no other logic running.  It is easy to set up and provides a quick test of failover and failback.

The advanced mode tests all logic and operates with the Access Zone failover mode and provides the same NFS write and re-read logic in addition to SPN management and SmartConnect Zone mapping and failover logic.

Storage Failover with Eyeglass Failover Modes

The following section outlines the storage layer failover steps.  The full end to end DR plan should also include application shutdown and bring up procedures to complete a true end to end failover.  The storage layer is the foundation upon which all higher layer failover depends, and Eyeglass ensures this step is simple to execute and that errors are detected during failover.

Superna Professional Services can be engaged for an end to end POC, or for recommendations and assessments of complex or application layer orchestrated failover scenarios.  Examples include:

  1. VMware SRM + externally mounted storage by VM’s

  2. Oracle RAC Data Guard + File System dependencies for applications

  3. Please see Eyeglass Solutions page


Once you have determined which Failover Mode is appropriate for your environment, the table below provides the high level steps for each mode:

Column 1  - Ordered  Steps: Ordered steps and purpose of step

Column 2 - Description: Description of action taken by step

Columns 3 & 4 - How Action is Initiated: How each step is executed with Eyeglass, depending on whether a SyncIQ Policy failover or a Microsoft DFS Mode failover is being done

Column 5 - Ordered Steps (Access Zone or IP Pool Failover): Ordered steps and purpose of step

Column 6 - Description: Description of action taken by step

Column 7 - How Action is Initiated for Access Zone or IP Pool Failover

Target of operation is shown in brackets as source, target or Eyeglass in the table below.

Ordered Steps  for  - Non DFS and DFS Mode

Description

How Action is Initiated

DFS Mode

Ordered Steps for - Access Zone OR IP Pool Failover

Description

How Action is Initiated

Access Zone (Release 1.4 and later)

SyncIQ Mode

DFS Mode

1 - Ensure that there is no live access to data (source) (see new step 1a in Release 2.0 or later)

Manual check for open files.

If open files are found, decide whether to fail over or wait for them to be closed.


NOTE: DR Assistant Data Integrity failover option for 2.0 or later releases blocks IO to SMB shares before failover.


It is recommended to always disable the SMB and NFS protocols on the SOURCE cluster prior to failover to eliminate data loss.  NOTE: THIS IS A CLUSTER WIDE OPERATION.

Manual

Manual

1 - Ensure that there is no live access to data (source)

Manual check for open files.

If open files are found, decide whether to fail over or wait for them to be closed.

It is recommended to always disable the SMB and NFS protocols on the SOURCE cluster prior to failover to eliminate data loss.  NOTE: THIS IS A CLUSTER WIDE OPERATION.

Manual

1a - Enable Data Integrity Failover (SMB only)

Applies Deny Everyone to SMB shares before failover starts

Automated by Eyeglass

Automated by Eyeglass

1a - Enable Data Integrity Failover (SMB only)

Applies Deny Everyone to SMB shares before failover starts

Automated by Eyeglass

1b - Cache schedule for SyncIQ policies being failed over and prevent SyncIQ policies being failed over from running (source)


Get schedule associated with the SyncIQ policies being failed over on OneFS, set policies to manual so they don’t run again during failover

Automated by Eyeglass

Automated by Eyeglass

1b - Cache schedule for SyncIQ policies being failed over and prevent SyncIQ policies being failed over from running (source)

Get schedule associated with the SyncIQ policies being failed over on OneFS, set policies to manual so they don’t run again during failover

Automated by Eyeglass

2 - Begin Failover with DR Assistant (Eyeglass)

Initiate Failover from Eyeglass

Manual or Eyeglass REST API

Manual or Eyeglass REST API

2 - Begin Failover with DR Assistant (Eyeglass)

Initiate Failover from Eyeglass

Manual or Eyeglass REST API

3 - Validation of failover job (Eyeglass)

Verify all warnings before submitting the failover job

Automated by Eyeglass

Automated by Eyeglass

3 - Validation of failover job (Eyeglass)

Verify all warnings before submitting the failover job

Automated by Eyeglass

3a - Validation - Block on Warning enabled

Will prevent continuing a failover on warnings (quota scan required detection)

Automated by Eyeglass

Automated by Eyeglass

3a - Validation - Block on Warning enabled

Automated by Eyeglass

Automated by Eyeglass

3b - Set Eyeglass config Jobs to userdisabled

This sets config jobs to user disabled state to prevent failed steps from allowing these jobs to run unless a user enables them post failover

Automated by Eyeglass

Automated by Eyeglass

3b - Set Eyeglass config Jobs to userdisabled

Automated by Eyeglass

Automated by Eyeglass

4 - Synchronize data (Run SyncIQ policies) (source) (parallelized step; see note 3)

Run all OneFS SyncIQ policy jobs related to the Access Zone being failed over

Automated by Eyeglass

Automated by Eyeglass

4 - Synchronize data (run SyncIQ policies) (source) (parallelized step; see note 3)

Run all OneFS SyncIQ policy jobs related to the Access Zone being failed over

Automated by Eyeglass (all policies in the Access Zone)

5 - Synchronize configuration (shares/export/alias, snapshot schedules, dedupe paths) (Eyeglass) (parallelized step; see note 3)

Run Eyeglass configuration replication

Automated by Eyeglass (configuration exists on source and target)

Automated by Eyeglass

5 - Synchronize configuration (shares/export/alias, snapshot schedules, dedupe paths) (Eyeglass) (parallelized step; see note 3)

Run Eyeglass configuration replication

Automated by Eyeglass (based on matching Access Zone base path)

6 - Renaming shares DFS mode to redirect DFS clients (multi threaded) (parallelized step; see note 3)

For DFS Failover, shares renamed on source and target cluster so clients are redirected with dual DFS target paths to target cluster

Not Applicable

Automated by Eyeglass

(special handling renames Shares on source and target so that only one DFS target UNC is reachable and active for DFS clients to switch over)


NOTE: It is possible to integrate DFS protected data inside an Access Zone failover to protect shares, exports and DFS data with Access Zone failover.

If DFS is configured, the redirect rename steps are executed at this point

Automated by Eyeglass (based on matching Access Zone base path)





6 - Change SmartConnect Zone on Source so that it is not resolved by clients (source) (dual delegation eliminates DNS updates)

Rename SmartConnect Zones and Aliases (Source)

Automated by Eyeglass (based on matching Access Zone base path)





7 - Avoid SPN Collision (source)

Sync SPNs in all AD providers to current SmartConnect Zone names and aliases (proxied through target cluster) (Source)

Automated by Eyeglass

(AD delegation must be completed as per install docs)

9 - Provide write access to data on target (target) (single threaded for safe failover) (parallelized step; see note 3)

Allow writes to SyncIQ policy(s) related to failover (see note 2)

Automated by Eyeglass

Automated by Eyeglass

8 - Move SmartConnect Zone to Target (target)

Add source SmartConnect Zone(s) and Alias(es) (Target)

Automated by Eyeglass

10 - Resync prep Step SyncIQ - Disable SyncIQ on source and make active on target (source) (parallelized step; see note 3)

Resync prep SyncIQ policy step to failover (creates Mirror Policy on target, disables source cluster policy and enables target cluster policy in OneFS)

Automated by Eyeglass

Automated by Eyeglass

9 - Update SPN to allow for authentication against target  (target)

Sync SPNs in all AD providers to current SmartConnect Zone names and aliases (proxied through target cluster) (Target)

Automated by Eyeglass

11 - Reset SyncIQ schedule on target mirror policy (target)

Set schedule on Mirror Policy(Target) using schedule from step 1 from OneFS for policy(s) related to the Failover job

Automated by Eyeglass

Automated by Eyeglass

10 - Repoint DNS to the Target cluster IP address

DNS Dual  delegation for all SmartConnect Zones that are members of the Access Zone

Automated by Eyeglass (See "Geographic Highly Available Storage solution with Eyeglass Access Zone Failover and Dual Delegation")

12 - Failover quota(s) (Eyeglass) (optional, can be skipped) (parallelized step; see note 3)

Eyeglass DR Assistant automatically fails over quotas by running the  Quota Jobs related to the SyncIQ Policy(s) being failed over

Automated by Eyeglass (deleted on source cluster and created on target cluster)

Automated by Eyeglass (deleted on source cluster and created on the target cluster so that post failover quotas are applied)

12 - Failover quota(s) (Eyeglass) (optional, can be skipped) (parallelized step; see note 3)

Eyeglass DR Assistant automatically fails over quotas by running the  Quota Jobs related to the SyncIQ Policy(s) being failed over

Automated by Eyeglass (deleted on source cluster and created on target cluster)

13 - Remove quotas on directories that are target of SyncIQ (Isilon best practice) (source) (parallelized step; see note 3)

Eyeglass deletes all quotas on the source for all the policies

Automated by Eyeglass

Automated by Eyeglass




13a - Run Mirror policy (parallelized step; see note 3)

Run policy to resync data in the reverse direction

Automated by Eyeglass

Automated by Eyeglass

13a - Run Mirror policy (parallelized step; see note 3)

Run policy to resync data in the reverse direction

Automated by Eyeglass

13b - Set Eyeglass config Jobs to enabled

Enables configuration sync ONLY if Resync prep completes successfully for the policy

Automated by Eyeglass

Automated by Eyeglass

13b - Set Eyeglass config Jobs to enabled

Enables configuration sync ONLY if Resync prep completes successfully for the policy

Automated by Eyeglass

14 - Change SmartConnect Zone on Source so that names are not  resolved by Clients (source)

Rename SmartConnect Zones and Aliases (Source)

Manual

Not Required (source and destination clusters can use existing SmartConnect Zones)

14 - Disable SyncIQ on source and make active on target (source)

Resync prep SyncIQ policy step to failover (creates Mirror Policy on target, disables source cluster policy and enables target cluster policy in OneFS) (parallelized step; see note 3)

Automated by Eyeglass

15 - Avoid SPN Collision (source)

Sync SPNs in all AD providers to current SmartConnect Zone names and aliases (Source)

Manual (deletes SmartConnect SPN from source cluster machine account)

Not Applicable (DFS SPN’s are not changed during failover)

15 - Set proper SyncIQ schedule on target (target)

Set schedule on Mirror Policy(Target) using schedule from step 6 from OneFS for policy(s) related to the Failover

Automated by Eyeglass

16 - Move SmartConnect Zone to Target (target)

Add source SmartConnect Zone(s) as Alias(es) (Target)

Manual

Not Required (source and destination clusters can use existing SmartConnect Zones)

16 - Synchronize quota(s) (Eyeglass) (parallelized step; see note 3)

Run Eyeglass Quota Jobs related to the SyncIQ Policy or Access Zone being failed over

Automated by Eyeglass

17 - Create SPNs to allow for Kerberos authentication against target for SMB shares (target)

Sync SPNs in all AD providers to current SmartConnect Zone names and aliases (Target)

Manual (adds new SmartConnect alias SPN’s to target cluster machine account)

Not Applicable (DFS SPN’s are not changed or registered to Cluster machine accounts)

17 - Remove quotas on directories that are target of SyncIQ (Isilon best practice) (source) (parallelized step; see note 3)

Delete all quotas on the source for all the policies

Automated by Eyeglass (requires that IP pool hints are configured; see docs)

18 - Repoint DNS to the Target cluster IP address

Update DNS delegations for all SmartConnect Zones that are members of the Access Zone

Manual

Not Applicable (no updates are needed as DFS resolution has not changed in DNS, only the target UNC with an active share)

18 - Repoint DNS to the Target cluster IP address

Dual Delegation feature with Eyeglass avoids any DNS steps during failover for all SmartConnect Zones that are failed over

Automated by Eyeglass (see dual delegation one time configuration here)

19 - Refresh session to pick up DNS change

Remount the SMB/NFS share(s)

Manual on clients

Automatic (Windows 7 or later  with DFS support)

19 - Refresh session to pick up DNS change

Remount the SMB/NFS share(s) or remount exports

Automated by Eyeglass using Dual SmartConnect Zone Delegation (How to Configure Here)


Notes for the table above:

  1. Initiates Eyeglass Configuration Replication task for all Eyeglass jobs.

  2. SyncIQ does NOT modify the ACLs (access control settings on the file system); it locks the file system.  The output of ls -l will be identical on both source and target.

  3. A System.xml change is required to enable parallel step mode.  NOTE: All policy steps are attempted.  On failure, the failover job continues to attempt all steps, and skips the downstream steps for a policy if the previous SyncIQ step for that policy failed.
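The skip rule described in the last note can be illustrated with a short sketch.  This is not Eyeglass code: the step names, the `PolicyResult` type and the `execute_step` callback are hypothetical, and the policies are processed one after another for readability rather than in parallel.  It models only the stated behavior that every policy is attempted and that, after a failed SyncIQ step, the remaining steps for that policy are skipped.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical per-policy step names, loosely based on the table above.
SYNCIQ_STEPS = ["run_policy", "allow_writes", "resync_prep", "run_mirror_policy"]


@dataclass
class PolicyResult:
    policy: str
    completed: List[str] = field(default_factory=list)
    failed: Optional[str] = None
    skipped: List[str] = field(default_factory=list)


def run_failover(policies: List[str],
                 execute_step: Callable[[str, str], bool]) -> List[PolicyResult]:
    """Attempt every policy; after a failed step, skip that policy's later steps."""
    results = []
    for policy in policies:
        result = PolicyResult(policy)
        for index, step in enumerate(SYNCIQ_STEPS):
            if execute_step(policy, step):
                result.completed.append(step)
            else:
                # Record the failure and skip only this policy's downstream steps.
                result.failed = step
                result.skipped = SYNCIQ_STEPS[index + 1:]
                break
        results.append(result)  # the job moves on to the next policy either way
    return results


if __name__ == "__main__":
    # Simulated executor: "policy-b" fails at the allow_writes step.
    def simulated(policy: str, step: str) -> bool:
        return not (policy == "policy-b" and step == "allow_writes")

    for outcome in run_failover(["policy-a", "policy-b"], simulated):
        print(outcome)
```

In this model a failure on one policy never stops the other policies, which is why the failover log should be reviewed per policy after the job ends.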

Supported DR Site and Failover Topologies

These replication topologies cover the scenarios commonly used to replicate to remote sites.  They allow 1 or 2 DR copies of data to be available at different geographic distances.  End-to-end automated failover is possible with Access Zone and DFS mode failover.

Data Center to Data Center

Supported Failover Modes

  1. Access Zone - Fully automated any site failover

  2. DFS mode - Fully automated any site failover

  3. Per SyncIQ - partially automated any site failover

Multi Site Failover


multi site failover.png


Supported Failover Modes (See Multi Site Failover Guide)

  1. Access Zone - Fully automated any site failover

  2. DFS mode - Fully automated any site failover

  3. Per SyncIQ - partially automated any site failover

Data Center DR Fan-IN Topology

Supported Failover Modes

  1. Per SyncIQ

  2. Access Zone

  3. DFS mode

2 Site DR - Stretch 3rd site Configuration Sync

Supported Failover Modes

  1. Access Zone (A to B) Config synced to C manual failover

  2. Per SyncIQ (A to B) Config synced to C manual failover

  3. DFS mode (A to B) Config synced to C manual failover




How to use the DR Dashboard to Assess Failover Readiness


The DR Dashboard is the main status screen for overall cluster readiness for a DR event.  The status column is sent as a critical alarm when a validation function fails (SyncIQ, Config replication, SPN checks,  Network IP Pool mapping readiness audit).  This way you can address any issues that would affect your ability to failover when they are detected instead of discovering these issues at failover time.

Policy Readiness / DFS Readiness

SyncIQ Policy Failover Readiness and SyncIQ DFS Mode Failover Readiness are based upon the status of the SyncIQ Policy Job (data replication) in OneFS and the Eyeglass Configuration Replication Job (configuration replication) for that SyncIQ Policy's related configuration data (shares, exports, and aliases).  The statuses of these two jobs are combined to provide an overall DR Status.  The Policy Readiness and DFS Readiness are updated each time Eyeglass Configuration Replication is run.
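One way to picture how two job statuses could roll up into a single DR Status is a "worst status wins" rule.  The sketch below is an illustration only: the `Status` levels and the worst-of rule are assumptions, and the actual status values and combination logic used by Eyeglass are described in the Eyeglass Admin Guide.

```python
from enum import IntEnum


class Status(IntEnum):
    """Assumed severity levels; higher means worse."""
    OK = 0
    WARNING = 1
    ERROR = 2


def overall_dr_status(synciq_job: Status, config_replication_job: Status) -> Status:
    """Combine the two job statuses by taking the more severe one (assumed rule)."""
    return max(synciq_job, config_replication_job)


if __name__ == "__main__":
    # A healthy SyncIQ job does not hide a failed configuration replication job.
    print(overall_dr_status(Status.OK, Status.ERROR).name)
```

Under this assumption a policy is only shown as ready when both data replication and configuration replication are healthy.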

Screen Shot 2017-09-13 at 6.21.59 PM.png


For more detailed information on these statuses, please refer to the Eyeglass Admin Guide here.


Zone Readiness

The Zone Readiness tab provides a per Access Zone summary of all the key networking, Kerberos SPN, and SmartConnect subnet\pool information, along with the SyncIQ status and Configuration Replication validations used to assess readiness for failover by Access Zone.  The statuses of these components are combined to provide an overall DR Status.  The Zone Failover Readiness is updated every 15 minutes by default.

This information provides the best indicator of DR readiness for failover and allows administrators to check status on each component of failover, identify status, errors and correct them to get each Access Zone configured and ready for failover.

By default the Failover Readiness job which populates this information is disabled.  Instructions to enable this Job can be found in the Eyeglass Administration Guide.

drdash.png


Screen Shot 2017-05-16 at 7.29.16 AM.png



For more detailed information on these statuses, please refer to the Eyeglass Admin Guide here.


How to enable Automated DR Testing the Eyeglass Runbook Robot Feature

Many organizations schedule DR tests during maintenance windows and weekends, only to find out that the DR procedures did not work or documentation needed to be updated.  Eyeglass Run Book Robot feature automates DR run book procedures that would normally be scheduled in off peak hours, and avoids down time to validate DR procedures, providing Failover and Failback automation tests with reporting.

This level of automation provides high confidence that your Isilon storage is ready for failover with all of the key functions executed on a daily basis.   In addition to automating failover and failback, Eyeglass operates as a cluster witness and mounts storage on both source and destination clusters the same way the cluster users and machines mount storage externally using Access Zone mount paths.

The feature exercises the maximum automation used in Access Zone Failover (Advanced mode), or a basic Quick Start mode that only uses SyncIQ policy failover mode.

For more detailed information on planning and operation for Eyeglass Runbook Robot, please refer to the RunBookRobot Admin Guide

Planning and Procedures for Eyeglass SyncIQ DFS Mode Failover


DFS Mode Preparation Checklist

DFS mode requires the following prerequisites:

  1. Windows 2008 or 2012 Domain Controller

    1. DNS role installed

    2. DFS files services role

  2. 2 x Isilon clusters with SyncIQ

  3. Eyeglass appliance

  4. DFS enabled clients Windows 7, 8, 10, Server 2008, 2012, 2016

DFS Mode Compatibility

  1. Not compatible with RunBook Robot feature, since NFS is used for data access

  2. Hot\Hot and Hot\Cold compatible

  3. Compatible with Access Zones and Access Zone and IP pool Failover mode but requires dedicated subnet:pool with Eyeglass igls-ignore hint applied to retain SmartConnect zones on source and target clusters.  DFS mode does not require SmartConnect Zone names to failover.


Considerations for Eyeglass SyncIQ DFS mode vs Default Configuration Sync job mode in Eyeglass

  1. The default Eyeglass Job mode is configuration sync mode, which places configuration data on both source and target clusters and treats the configuration data the same as SyncIQ data, meaning it is maintained in full sync on both clusters.

  2. Eyeglass SyncIQ DFS mode can be enabled and will delete share objects from the target cluster referenced in the policy, and fails over shares.  Quotas are also failed over during share failover.

Procedure to Enable Eyeglass SyncIQ DFS mode

  1. Select policy with shares to be protected and then Select a bulk action option Enable/Disable Microsoft DFS.

Screen Shot 2015-08-17 at 9.07.45 PM.png

  2. Run the DFS Enabled job

  3. Verify it is green before configuring DFS in Active Directory

Screen Shot 2015-08-17 at 9.09.55 PM.png


Detailed DFS Mode Configuration, Operating procedures and Design guidelines

See Microsoft DFS Mode Failover Guide

Planning and Procedures for Eyeglass SyncIQ Mode Failover

Recommended for NFS or application failover that requires post failover scripting for DNS, unmount/mount host side automation.

Not recommended for SMB failover.  DFS mode or Access Zone failover handles SPN management and SmartConnect Zone operations during failover.

SyncIQ mode Failover mode Guide



Planning and Procedures for Eyeglass Access Zone Failover

For Access Zone failover requirements and setup, see the Access Zone planning guide here.

How to Execute A Failover with DR Assistant

Follow these steps to execute a failover.  Note: The planning guide is expected to be the referenced document for all planned failovers.  Support expects this document has been used for planning.

How to know when Uncontrolled failover should be used?

This option in Eyeglass DR assistant should be used while understanding the data protection implications.

READ THIS FIRST: Using this option means you are failing away from the data and losing  ALL  changes at the moment the failover is started in Eyeglass.  

Recovering from this failover mode WILL require manual steps to restore DR sync status and to fail back from the DR cluster to the Production cluster.

Given the above statement: reasons you may choose to execute an uncontrolled failover include the following:

  1. WAN link is cut to the data center with a very long repair time to restore service

  2. Loss of power for extended periods of time to the production data center

  3. Damaged cluster or serious cluster issue (upgrade)

  4. Equipment failure blocking access to the cluster, or application server failures with long recovery times

Eyeglass Pre-Failover Check Important - Read me

IMPORTANT:

Making any changes to the SyncIQ Policies or related Eyeglass Configuration Replication Jobs during failover may result in unexpected results.

IMPORTANT:

Eyeglass Assisted Failover has a 180 minute timeout on each failover step.  Any step which is not completed within this timeout period will cause the failover to fail.  This can occur if SyncIQ policies are already running when failover job is started or SyncIQ steps take longer than expected to complete.  This timeout can be changed but does not accelerate failover if lowered.  
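When planning a failover window, it can help to compare expected step durations against this per-step timeout ahead of time.  The sketch below is a planning aid under stated assumptions, not an Eyeglass feature: only the 180 minute default comes from this guide, while the step names and the duration estimates are hypothetical values you would replace with your own measurements (for example, recent SyncIQ job run times).

```python
from typing import Dict, List

STEP_TIMEOUT_MINUTES = 180  # default per-step timeout stated in this guide


def steps_exceeding_timeout(estimated_minutes: Dict[str, float],
                            timeout_minutes: float = STEP_TIMEOUT_MINUTES) -> List[str]:
    """Return the steps whose estimated duration is longer than the timeout."""
    return [step for step, minutes in estimated_minutes.items()
            if minutes > timeout_minutes]


if __name__ == "__main__":
    # Hypothetical estimates, in minutes, for one planned failover.
    estimates = {
        "final SyncIQ data sync": 200.0,
        "configuration replication": 15.0,
        "resync prep": 40.0,
    }
    for step in steps_exceeding_timeout(estimates):
        print(f"At risk of timing out: {step}")
```

A step flagged here suggests running the SyncIQ policy ahead of the failover so that the final sync has less data to move.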

IMPORTANT:

Deleting configuration data (shares, exports, quotas) or modifying Share name or NFS Alias name or NFS Export path on the target cluster before failing over without running Eyeglass Configuration Replication will incorrectly result in the object being deleted on the source cluster after failover.  You must run Eyeglass configuration replication before the failover OR select the Config Sync checkbox on failover to prevent this from happening.

How to failover Data With DR Assistant

This covers Access Zone/ IP Pool mode, DFS policy mode or SyncIQ mode.

To fail over data with DR Assistant:

  1. Consult the Failover Design Guide for monitoring failover progress.

  2. Then review steps below for post failover

  3. Verify manually that there are no open files on the Source Cluster.  There should be no client access to the Failover Source cluster during failover as this data will not be replicated.

  4. Open DR Assistant Icon

  1. Select Failover Type design for your environment.

  2. Select Source Cluster that has the writeable data to failover

  3. Leave all default check boxes for a planned controlled failover.

    1. Controlled failover  

      1. Check if the source cluster is healthy and reachable. LiveOPS Dashboard Icon

      2. Uncheck if the source cluster is not healthy or not reachable.  This option is for a REAL DR event.  NOTE: Do not use this option unless you are lab testing OR you are prepared for manual steps to recover from the resulting end state.  In this case, source cluster API calls are skipped and cached knowledge of shares and quotas is used to fail over (Real DR Event).

        1. IMPORTANT:

          1. Eyeglass Configuration Replication Jobs will be in USERDISABLED state on source and target cluster after an uncontrolled failover.

          2. Eyeglass requires that directories being failed over exist on the target cluster which means SyncIQ policies have run at least once prior to failover.

    2. Data Sync

      1. Check to run a final SyncIQ data sync Job as part of the failover

      2. Uncheck to skip the SyncIQ data sync step

    3. Config Sync

      1. Check to run a final Eyeglass Configuration Replication Job as part of the failover

      2. Uncheck to skip the Eyeglass Configuration Replication step

    4. SMB Data Integrity Failover

      1. Check to execute SMB Data Integrity Failover step. It disconnects any active SMB sessions prior to failover and ensures that no new sessions can be established on the failover source.

      2. NOTE: If a share uses run as root with full control, access is no longer as an Active Directory user; root is a local Linux user on Isilon only.  Run as root bypasses all security and auditing, and these users cannot be locked out of the share.  Any share with this permission will not lock out any users.

      3. Uncheck to skip SMB Data Integrity Failover step.

    5. Quota Sync

      1. This option allows skipping of quota failover and leaves the quotas on the source cluster.  Skip quota failover if thousands of quotas exist, since this affects the failover performance of SyncIQ operations.  Skipping also removes the risk of a quota scan job impacting SyncIQ operations on quotas that are flagged as needing a scan on the destination cluster.

      2. Checked means quotas will failover (create on target, delete on the source).

      3. Unchecked means quotas will not be failed over but will remain on the source cluster.

    6. Block Failover on Warning

        1. Blocks failover from starting if validation warnings are detected.  All Warnings in DR Dashboard will block a failover and must be reviewed before unchecking this option to continue.

        2. Quota Domain conflict with SyncIQ Validation:

          1. Allows override of the default validation that detects target cluster quotas with the quota scan pending flag set.  This flag blocks the run policy, resync prep and make writeable steps from completing on policies that have newly created quotas and no quota domain created.

          2. See image below on policy quota domain validation check.

          3. Screen Shot 2017-10-31 at 5.58.45 PM.png

          4. This validation will block a failover attempt when checked and a warning validation is detected on the zone, policy or ip pool (enabled).

          5. To continue anyway, uncheck this option and restart the failover.  Once it is unchecked, you are taking the risk of SyncIQ policies failing either the Make Writeable step or the Resync prep step.

          6. Solution:  Run quota scan job from cluster jobs menu and allow quota scan to complete the quota domain creation on all quotas with the flag set.   Then start the failover again.

          7. NOTE: If multiple policies are defined some policies may fail make writeable step OR resync prep step.  In this release Eyeglass will continue to the next policy if a step fails.

    7. SyncIQ Resync Prep

      1. Check to execute the SyncIQ Resync Prep failover step (leave this default advanced setting)

      2. Uncheck to skip the SyncIQ Resync Prep failover step.  This is not recommended as it will leave the system in a state where you will not be able to use Eyeglass to fail back.  This is used ONLY when customers want to fail over in one direction and then create a new policy, or when they know how to manually recover and create the mirror policy.

    8. Disable SyncIQ Jobs on Failover Target (advanced setting leave defaults )

Disable on failover is optional if you don’t want to configure failback and execute sync job in the return direction.  This is used when you want to verify systems before replicating data back to the source. Warning: Using this option WILL require manual steps to failback

  1. MUST READ: Uncheck Controlled Failover ONLY if this is a REAL DR event.  NOTE: If this is unchecked, Eyeglass assumes the source cluster is destroyed and NO steps that provide failback are executed.  The customer is responsible for recovery from an uncontrolled failover; NO automated recovery is possible after using this option.  Customers are expected to make decisions that protect data at all times and to use this option only if the data is deemed not usable for business reasons.  All recovery is manual if this option is used.

  2. (Screenshot below) Review the best practices document; it is expected that all release best practices are read and understood before proceeding.  This document also covers prep steps for failover, for example domain mark.  This document is a must read for any failover.

Screen Shot 2017-05-16 at 7.41.31 AM.png

  1. Select the policy or policies or Access Zone for the failover type selected.   

    1. Check readiness again before continuing to ensure you understand the warnings and if they will affect your failover.  In general warnings do not block failover.  Errors block failover.

Screen Shot 2017-05-16 at 7.43.03 AM.png

  1. Review the Failover release notes, which cover special scenarios that must be assessed to determine whether they affect your planned failover.  This document requires acknowledgment before continuing.  Failing to read this document can result in data loss.

    1. Per SyncIQ Policy or DFS failover validation Screen.

      1. This screen lists all policies selected for the failover.  NOTE: The previous policy selection screen only allows policies that are valid choices. The Selected policies will be summarized. (Screenshot below)

Screen Shot 2017-05-16 at 7.45.23 AM.png

    1. Access Zone validation Screen (below).  NOTE: This screen will show all policies within the Access Zone that are eligible for failover.  If any policy is USER DISABLED or policy disabled it will be shown as “will NOT be failed over”. NOTE: Do not failover with a disabled policy unless you know the data protected by this policy does not need to be failed over.

Screen Shot 2017-05-16 at 1.16.46 PM.png

    1. Reference the screenshot below to see a successful validation screen.

val.png

  1. Review final confirmation screen.  

    1. Review link to recovery guide.  

    2. This document is used by support to assist with recovery steps.  It is assumed you are familiar with this document.

    3. Final acceptance and point of no return.

    4. NOTE: The failover job cannot be canceled once started.

Screen Shot 2017-05-16 at 7.47.58 AM.png

  1. Start the failover with Run button.

  2. Monitor with logs icon.  

    1. Click Watch to follow the failover real-time or click fetch to update log window with current progress.

    2. Screen Shot 2017-09-28 at 6.32.30 PM.png

    3. NOTE: Failover jobs can be canceled with the cancel job link.

      1. ONLY USE IF DIRECTED BY SUPPORT

      2. WARNING: IF YOU CANCEL A FAILOVER, MANUAL RECOVERY OF NETWORKING POLICY STATE, SHARES, SPN, SMARTCONNECT IS REQUIRED.    SUPPORT IS UNABLE TO ASSIST WITH RECOVERY FROM INTENTIONALLY CANCELING A FAILOVER.

Screen Shot 2017-05-16 at 7.50.20 AM.png

  1. Failover status completes with success or failure.

  2. If status is failure download support log from DR Assistant History tab  and open a support case.

  3. IMPORTANT:  Always test data access after any failover, whether it succeeded or failed.  Detailed steps are posted in How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster

Post Failover Procedures

Use the procedures that apply to your failover mode.

Post Access Zone Failover Steps

  1. Test Dual Delegation

  2. Check SPN for SPN Errors

  3. Refreshing SMB connection after Failover completed

  4. Refreshing NFS connection after Failover completed

  5. Test Data Access and debug

Post Access Zone Failover Health Check Steps

  1. Post Access Zone Failover Checklist

POST DFS Mode Failover Steps

  1. Post Eyeglass Microsoft DFS Mode Failover Manual Steps for NFS Exports

  2. Post Eyeglass Microsoft DFS Mode Failover Checklist

  3. Procedure for Checking your SMB Clients Post DFS Failover

How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster

Debug Plan of attack for clients post failover:

NOTE: follow the order below to find root cause

  1. Check DNS

  2. Mount share from client (DFS or non DFS)

  3. If there is an authentication error, fix it

  4. If there is no authentication error, test write access

  5. If no write access, remount correctly

  6. Retest mount of share, test write access again.

  7. Done.

Steps to Validate that SmartConnect and DNS Failed Over Successfully:


Test DNS response on the clusters:  This test verifies that SmartConnect names were failed over successfully and can also verify that dual delegation in your DNS environment is set up correctly.  This test also rules out issues with your internal DNS and verifies that Isilon SmartConnect zones failed over successfully.

  1. Quick test:  From a Windows client machine command prompt, type “ping <SmartConnect name FQDN>”.  This should return an IP address from the Target cluster IP pool.  If ping does respond with a correct IP from the TARGET cluster:

    1. Then cancel ping command CTRL-C and ping again to the same SmartConnect name  to make sure a second IP from the same Target cluster IP pool is returned to verify SmartConnect and Isilon DNS is functioning as expected.

    2. If ping test is successful on BOTH ping tests.  CONTINUE TO MOUNT STEPS section.

    3. If the ping fails, or the name does not resolve to a correct IP address of the TARGET cluster, continue with the steps below to debug DNS.

      1. From any Windows client machine type “nslookup” <press enter key>

      2. Source Cluster DNS Test:

        1. then type "server x.x.x.x" enter key.  where x.x.x.x is the Subnet service ip of the source cluster

        2. type "FQDN of SmartConnect Zone used in failover"  <press enter key> .  Hint: Refer to the failover log from DR Assistant for the full list of SmartConnect names that were failed over  example data.example.com

        3. The expected response is a failed resolution since failover disables the SOURCE cluster DNS response.

          1. Example of a failed nslookup on the cluster you failed away from “** server can't find userdata.ad1.test: REFUSED”

          2. NOTE: If the lookup does NOT return a REFUSED response, then the SmartConnect name did not fail over correctly.  Consult the recovery guide Networking section to fix SmartConnect names.

  2. Target Cluster DNS Test:

    1. Test the TARGET cluster SSIP (subnet service IP) with DNS:

      1. Type "server y.y.y.y" <press enter key>, where y.y.y.y is the subnet service IP of the target cluster.

      2. Type the FQDN of a SmartConnect Zone used in the failover <press enter key>.  Refer to the failover log for the list of SmartConnect names that were failed over.

      3. Expected response: SUCCESSFUL NAME RESOLUTION RETURNING AN IP OF THE TARGET CLUSTER.  This means SmartConnect was failed over correctly to the target cluster.

      4. If the DNS test fails at this step, OR the name fails to resolve, OR the wrong IP address is returned, consult the recovery guide Networking section to fix SmartConnect names.

  3. If all DNS tests in this section pass:

    1. Root Cause: Your internal DNS is not set up correctly for dual delegation, since the SSIP on the cluster correctly answers DNS queries.  Stop here and correct this using the guide and video referenced above.  Double check that 2 name server entries exist for the SmartConnect name you are failing over.

    2. End debugging.
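The decision flow of the DNS tests above can be sketched in Python. This is a minimal illustration, not an Eyeglass tool: it assumes you have captured the text output of the two nslookup queries yourself, and the function names and the sample IP addresses are hypothetical.

```python
import re


def classify_nslookup(output, target_ips):
    """Classify captured nslookup text as 'refused', 'target', or 'other'."""
    if "refused" in output.lower():
        return "refused"
    # Only look at text after "Name:"; the first Address line in nslookup
    # output belongs to the DNS server that was queried, not to the answer.
    parts = output.split("Name:", 1)
    if len(parts) == 2:
        ips = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", parts[1])
        if ips and all(ip in target_ips for ip in ips):
            return "target"
    return "other"


def diagnose(source_result, target_result):
    """Map the two SSIP test results to the next action in this section."""
    if source_result != "refused":
        return "Source SSIP still answers: fix SmartConnect names (recovery guide, Networking section)"
    if target_result != "target":
        return "Target SSIP did not return a target IP: fix SmartConnect names (recovery guide, Networking section)"
    return "Cluster DNS is correct: check dual delegation in your internal DNS"


# Hypothetical captured outputs; 10.0.1.21 stands in for a target pool IP.
source_out = "** server can't find userdata.ad1.test: REFUSED"
target_out = "Server:  UnKnown\nAddress:  10.0.1.5\n\nName:    userdata.ad1.test\nAddress:  10.0.1.21\n"

print(diagnose(classify_nslookup(source_out, {"10.0.1.21"}),
               classify_nslookup(target_out, {"10.0.1.21"})))
```

The text match on "refused" is deliberately case-insensitive, because the exact wording of the refusal message differs between nslookup implementations.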



Steps to test Mounting a Share with Access Zone failover on Machine with no previous Mount:

  1. NOTE: Use a Windows client that DOES NOT have a connection to any cluster to perform this test correctly.

  2. Mount test from a Windows client in File Explorer  \\FQDN of SmartConnect Zone in Access Zone\<share name>

    1. If this step is successful, test write access to the share, then SKIP to the section below “Steps to correctly test a machine with existing Mount to Source cluster post Access Zone Failover and remount to test write access”.

    2. If you receive a Windows login popup asking for user ID and password, this indicates an AD SPN Kerberos failover issue.  Check the Eyeglass failover log in DR Assistant for failed SPN Delete or Create steps, and match the SmartConnect name to the failed SPN step in the log.

      1. Typically an SPN issue results in a Windows login dialog box requesting user ID and password, since authentication to the target cluster of the failover failed.

      2. On the source cluster, use the isi command to verify that the FQDN(s) are NOT listed.

        1. Example “isi auth ads spn list <your AD domain provider here>”

      3. On the target cluster, use the isi command to verify that the FQDN(s) ARE listed.

        1. Example “isi auth ads spn list <your AD domain provider here>”

      4. If the source still lists an SPN for your FQDN(s), OR the target does NOT list an SPN matching your FQDN(s), then MANUAL SPN failover is required to allow Kerberos authentication to succeed.  See the recovery guide for the procedure.

      5. Correct the SPN issue and retest mount access.

      6. If successful, SKIP to the section below “Steps to correctly test a machine with existing Mount to Source cluster post Access Zone Failover and remount to test write access”.
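The SPN comparison in the steps above can be expressed as a small Python check. This is a sketch under assumptions: it takes SPN strings that you have already copied from the “isi auth ads spn list” output on each cluster (the output format is not parsed here), it only checks the HOST/<FQDN> form, and the function name is hypothetical.

```python
def spn_failover_problems(fqdns, source_spns, target_spns):
    """Return (fqdn, problem) pairs that call for manual SPN failover."""
    source = {spn.lower() for spn in source_spns}
    target = {spn.lower() for spn in target_spns}
    problems = []
    for fqdn in fqdns:
        spn = f"HOST/{fqdn}".lower()
        if spn in source:
            # After failover the source cluster must no longer own the SPN.
            problems.append((fqdn, "still registered on source"))
        if spn not in target:
            # The target cluster must own the SPN for Kerberos to succeed.
            problems.append((fqdn, "missing on target"))
    return problems


# Hypothetical example: the SPN was deleted on the source but never created
# on the target, so clients would see a login popup.
print(spn_failover_problems(
    ["data.example.com"],
    source_spns=["HOST/source-cluster"],
    target_spns=["HOST/target-cluster"],
))
```

An empty list means both conditions from steps 2 and 3 hold for every FQDN checked.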

Steps to correctly test a machine with existing Mount to Source cluster post Access Zone Failover and remount to test write access:

  1. Unmount share/export.  Note: Access Zone failover requires ALL Windows and Linux clients to unmount before attempting to access data on the target cluster.

    1. Windows: net use x: /delete (replace x with the drive letter), OR

    2. Use Windows Explorer: right click the drive letter and select the Disconnect menu option.

      1. Note: This is not the best way to test.  If other SMB sessions to the cluster exist, this command will not release the session.  The #1 way to ensure this step is done correctly is to REBOOT THE CLIENT MACHINE.  Proceed below if you do not want to reboot the client.

      2. Use File Explorer to mount the Access Zone SmartConnect name by FQDN: \\FQDN of SmartConnect name\sharename

      3. Test write access to the share.

          1. If this step fails with a read-only error, continue to the next step.

          2. From a command prompt, type “netstat -an | more” to list TCP sessions and look for an entry that lists an IP address of the source cluster on port 445.  This means that an SMB session to the source cluster still exists and the unmount did not release the TCP session.

          3. Next Step:  Reboot to guarantee that no sessions to the source cluster remain, and repeat the mount of \\FQDN\share name

      4. After successful remount of the SmartConnect name:

        1. Verify the TCP session to the target cluster.

          1. From a command prompt, type “netstat -an | more” to list TCP sessions and look for an entry that lists an IP address of the target cluster on port 445.  This means that an SMB session to the target cluster is connected.

        2. Test write access to the share.

      5. Debugging complete.
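When many clients need checking, the netstat inspection above can be scripted. The following Python sketch assumes the Windows “netstat -an” column layout (Proto, Local Address, Foreign Address, State) and works on output you have captured as text; the function name and all IP addresses are hypothetical.

```python
def smb_sessions(netstat_output, cluster_ips):
    """Return cluster IPs that have an established TCP session on port 445."""
    found = []
    for line in netstat_output.splitlines():
        parts = line.split()
        # Skip headers and anything that is not a TCP row with a State column.
        if len(parts) < 4 or parts[0] != "TCP":
            continue
        remote_ip, _, remote_port = parts[2].rpartition(":")
        if remote_port == "445" and remote_ip in cluster_ips and parts[3] == "ESTABLISHED":
            found.append(remote_ip)
    return found


# Hypothetical capture: 10.0.0.5 stands in for a source cluster IP and
# 10.0.1.21 for a target cluster IP.
sample = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:445            0.0.0.0:0              LISTENING
  TCP    192.168.1.50:49712     10.0.0.5:445           ESTABLISHED
  TCP    192.168.1.50:49713     10.0.1.21:445          ESTABLISHED
"""

print(smb_sessions(sample, {"10.0.0.5"}))
```

A non-empty result for the source cluster IPs means the unmount did not release the session and a reboot is the safest next step; a non-empty result for the target cluster IPs confirms the remount is connected.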

Steps to test Mounting a DFS protected Share with DFS failover mode:

  1. From a Windows client machine connected to Active Directory, mount a DFS folder, for example \\<domain name>\<dfs root name>\<DFS folder name>

  2. Verify file write access by creating a file.

    1. If successful - done.

    2. Repeat the above for a sample of the DFS folders that were failed over.

  3. If the write test fails OR the mount fails or returns an error:

    1. Check the Eyeglass failover log (DR Assistant, Failover History tab): open the failover log, find the policy name, and confirm that the share rename step completed successfully for the DFS mount you are testing.

      1. If the rename step failed in the failover log, log in to the target cluster, find the igls-dfs-<share name> share and manually rename it to <share name>.  If all rename operations were successful, continue to the next step (3b).

      2. Now log in to the source cluster, find the share name and rename it to igls-dfs-<share name>.

      3. Repeat these steps if more than one share failed to rename, using the failover log to repair share names on both the source and destination clusters.

      4. Repeat the mount and write test from step #2 to verify that renaming resolved the issue.

    2. If the mount test still fails:

      1. Verify that DFS referrals are correctly configured in the Microsoft DFS Management snap-in.

      2. Check each item below to verify configuration:

        1. Open the DFS Management snap-in and right click the DFS folder you are validating.

          1. See example Screen Shot 2017-08-15 at 11.37.11 AM.png

          2. If both DFS referrals exist, pointing at the source and target cluster SmartConnect names, and the share name is the same, as in the screenshot example, continue to the next step.

        2. Test each referral mount path.  In the example above, the path tested from a Windows client is \\dr.ad1.test\smb2 (the failover target cluster SmartConnect name is used in this test).  If the share mounts and data is visible, verify you can write data.  This test verifies that DNS and SmartConnect are configured correctly and that AD authentication to the SMB2 share is correctly configured.

          1. If this step fails:

            1. Follow steps above in this section “Steps to Validate SmartConnect and DNS failed for over Successfully:”

            2. If the above step is successful: follow steps in this section  “Steps to test Mounting a Share with Access Zone failover on Machine with no previous Mount:”

            3. If the above steps found an issue, resolve it and retest the direct share mount to the DR target cluster.
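The share rename repair described in step 3 can be summarized as a small Python check. It is a sketch only: it assumes you have listed the share names on each cluster yourself, it uses the igls-dfs- prefix exactly as written in this section, and the function name is hypothetical.

```python
PREFIX = "igls-dfs-"  # prefix named in this section


def dfs_rename_problems(share, source_shares, target_shares):
    """List manual renames still needed for one share after a DFS failover."""
    problems = []
    if PREFIX + share in target_shares:
        # The target should expose the plain share name after failover.
        problems.append(f"target: rename {PREFIX}{share} to {share}")
    if share in source_shares:
        # The source should carry the prefixed name after failover.
        problems.append(f"source: rename {share} to {PREFIX}{share}")
    return problems


# Hypothetical example: the rename step failed on both clusters for "smb2".
print(dfs_rename_problems("smb2", source_shares={"smb2"}, target_shares={"igls-dfs-smb2"}))
```

An empty result for a share means its names already match the expected post-failover state, so the mount problem lies elsewhere (continue with step 3b).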







How to Monitor the Eyeglass Assisted Failover

In-Progress Failover

Once a failover has been started, you can monitor its progress from the Eyeglass DR Assistant / Running Failovers tab.


Screen Shot 2016-11-23 at 8.07.27 PM.png



From this window you can expand the Job Details tree to see the progress and status for each failover step.

Screen Shot 2017-09-13 at 6.50.01 PM.png

You can also open the failover log from this window to see the details for each step by selecting the Logs link.  

Screen Shot 2017-05-16 at 7.50.20 AM.png



Each entry in the log is timestamped. The log is updated as the failover proceeds and you can see log updates by closing and opening the log file again.

Should an error occur during failover, an Eyeglass system alarm will be issued.  If you have configured external notification by email or Twitter you will receive these alarms this way.  The alarms are also visible from the Eyeglass Alarms window.

Screen Shot 2017-09-13 at 6.51.09 PM.png



Completed Failover

Once the failover is completed, it will appear in the DR Assistant / Failover History tab.

Screen Shot 2017-05-16 at 7.58.27 AM.png


The Result column displays SUCCESS if the failover completed successfully and FAIL if errors were encountered in the failover steps.  The SyncIQ reports are available separately to review cluster logs for each step of the failover.

Note: An Access Zone Failover with Result of SUCCESS may have had SPN errors.  Please refer to the Access Zone Failover Guide for details on checking for SPN errors and resolution.

From the Failover History window, click on the row corresponding to the Failover that you would like to review.  The Job Details tree will appear below and the Failover Log can be retrieved for viewing or download by selecting the Open link.

Screen Shot 2017-09-13 at 6.52.56 PM.png

Screen Shot 2017-09-13 at 6.52.47 PM.png


Troubleshooting Failover

Failover Recovery Procedures

In the event that a Failover does not complete all steps successfully, please refer to the Eyeglass Failover Recovery Procedures to assess the state of your environment and for recovery steps.

Collecting Logs for Failover Troubleshooting

To collect the logs for failover troubleshooting, follow the instructions for collecting support information found in the Eyeglass FAQ document.  The failover logs are included with the other Eyeglass logs contained in the Logs Backup file.

Authentication with Service Principal Name Considerations with Active Directory and SMB Shares in Access Zones

Active Directory allows a Service Principal Name (SPN) to be registered against only a single computer account.  This property can be seen with the ADSI Edit tool.  The SPN is in the form HOST/<service name>, and there are typically 2 entries, one in NetBIOS name format (15 characters) and one in DNS name format, for each SmartConnect zone or zone alias created on a cluster.

The service principal name must exist on the machine account that handles authentication requests from clients and sends them to a domain controller for authentication using Kerberos session tickets.


Active Directory does not prevent a duplicate SPN from being registered, and if this occurs Kerberos authentication fails for clients and they will be unable to mount data if NTLM fallback authentication does not succeed.  During failover, Eyeglass deletes the SPNs of the subnet pool and its aliases on the selected source cluster Access Zone from the AD computer account, or from ALL AD providers assigned to the Access Zone.

Eyeglass also scans cluster machine accounts during configuration replication jobs and fixes missing SPNs if detected.

Error-DuplicateSPN-Detected.png

Example Error seen after duplicate SPN’s were created.  This is seen on the domain controller attempting to authenticate a mount request. This error only appears once and not for each failed authentication.


For information on this event, see KB article https://support.microsoft.com/en-us/kb/321044

Appendix A  - Advanced Failover Modes

Cached config advanced mode

In some cases customers cannot pre-sync configuration data from one cluster to the DR site, as this exposes data at the DR location.  This requirement means all configuration data must be cached on the Eyeglass appliance rather than pre-synced to the DR cluster.

Eyeglass always maintains a database of changes, but it is not used for failover operations because this information can be stale in planned failovers.  As of release 1.6, a failover mode is available that switches Eyeglass to sync configuration data to files, which are then used in all failover modes, controlled and uncontrolled.

The process now looks like this:

  1. Sync every 5 minutes: get the configuration changes and update the local files on the Eyeglass appliance.

  2. Controlled AND uncontrolled failover read from the cache files only and never communicate with the source cluster to create shares, exports and quotas on the target DR cluster during the failover process.

    1. For controlled failover, this means that potentially stale data is used to fail over in this scenario.

How to enable cached config advanced mode

  1. SSH to Eyeglass as admin

  2. igls adv failovermode set --readfromfile=true

  3. Done.

Failover Advanced mode Configuration - Parallel thread and failover jobs


These configurations are aimed at customers that have more than 50 policies for business reasons and require a faster failover option to maintain their SLA on data recovery.

Two features exist:

  1. Parallel threads - allows make writeable and resync prep to operate in parallel, up to a thread limit of 10.  This means up to 10 policies are executed at a time for all SyncIQ steps, and Eyeglass aims to keep 10 policies executing at a time throughout the failover process.

  2. Parallel Access Zone or Pool Failover - The default mode of failover allows a single access zone or IP pool to fail over at one time.  If more than one failover is submitted, the 2nd, 3rd, etc. failover waits in a queue until the 1st failover completes.  This feature allows customers to start multiple failover jobs in parallel to accelerate the failover process when multiple access zones are configured.


Parallel threads


This mode switches to parallel policy execution, with up to 10 threads for the make writeable step and the resync prep step.  It is disabled by default, in which case the make writeable and resync prep steps run sequentially when a group of policies is involved in a failover job.  The sequential default is the recommended best practice, least risk mode.



Key differences between default sequential and parallel mode:

  1. For 8.x clusters, 50 policies can run at a time, and Eyeglass uses a maximum of 10 threads, allowing 10 policy make writeable or resync prep commands to be sent at once.  For 7.2 clusters only 5 will execute and 5 are queued.  When one policy completes another policy is started, with the goal of keeping the maximum number running or queued at all times.

  2. Testing has shown a 3x to 4x improvement in the overall time to complete make writeable.  Results in production may vary.

  3. NOTE: As of release 2.0, a failed policy step blocks the subsequent SyncIQ steps that can no longer be run, but Eyeglass continues with the SyncIQ steps of other policies without failing the failover job.
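The behaviour described above (a fixed thread limit, a new policy started whenever one completes, and a failed policy not stopping the others) is a standard bounded worker pool. The Python sketch below illustrates that pattern only; it is not Eyeglass code, and `run_policies` and `fake_make_writeable` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

THREAD_LIMIT = 10  # thread limit described in this guide


def run_policies(policies, step, limit=THREAD_LIMIT):
    """Run step(policy) for every policy with at most `limit` in flight.

    Returns {policy: result}; a failed policy maps to its exception so
    that the remaining policies still run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # Submitting everything up front lets the pool start the next
        # policy as soon as a worker thread frees up.
        futures = {policy: pool.submit(step, policy) for policy in policies}
        for policy, future in futures.items():
            try:
                results[policy] = future.result()
            except Exception as exc:  # record the failure, keep going
                results[policy] = exc
    return results


def fake_make_writeable(policy):
    """Hypothetical stand-in for a SyncIQ step; not an Eyeglass API."""
    time.sleep(0.01)
    if policy == "policy-3":
        raise RuntimeError("make writeable failed")
    return "ok"


results = run_policies([f"policy-{i}" for i in range(25)], fake_make_writeable)
failed = [p for p, r in results.items() if isinstance(r, Exception)]
print(failed)
```

With 25 policies and a limit of 10, the first 10 start immediately and the rest wait their turn, which is the queueing behaviour the list above describes.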


How to enable Fast Failover parallel threads advanced mode


  1. igls adv failovermode set --parallel=true

  2. Done. The change affects all failover jobs.

  3. To disable: igls adv failovermode set --parallel=false


Parallel Failover Jobs


This feature allows multiple failover jobs of any type to run in parallel to reduce failover time.  It still has a 10 thread limit across all failover jobs.  It can be combined with the parallel threads feature to increase each failover job's parallelization.  Testing this in advance of a failover is a mandatory step.


How to enable Fast Failover parallel Access zones/IP pools advanced mode

  1. Login via ssh as admin to eyeglass

  2. sudo -s

  3. Enter admin password to become root user

  4. Type: nano /opt/superna/sca/conf/system.xml

    1. Add tag to this file as per below

    2. <run_concurrent_fofb>true</run_concurrent_fofb>

    3. Ctrl-x

    4. Yes to save the file

  5. systemctl restart sca

  6. The feature is now enabled

  7. To submit parallel Access Zone or IP Pool failovers, use DR Assistant to start a failover job.

  8. Close DR Assistant, re-open it and start another failover.

  9. Repeat the above step to submit more parallel failover jobs.

  10. NOTE:  Cluster resources may be exhausted, and testing is mandatory prior to attempting a very large number of failovers.
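Before restarting the service, it can be useful to confirm that the edited file is still well-formed XML and that the tag was saved with the value true. The Python sketch below does this on the file contents passed in as text. It makes no assumption about where in system.xml the tag belongs; the function name and the `<config>` root element in the example are hypothetical.

```python
import xml.etree.ElementTree as ET


def concurrent_failover_enabled(xml_text):
    """Return True if a run_concurrent_fofb tag set to true exists in the XML."""
    root = ET.fromstring(xml_text)  # raises ParseError if the edit broke the XML
    tag = root.find(".//run_concurrent_fofb")  # search at any depth below the root
    return tag is not None and (tag.text or "").strip().lower() == "true"


# Hypothetical file contents for illustration only.
example = "<config><run_concurrent_fofb>true</run_concurrent_fofb></config>"
print(concurrent_failover_enabled(example))
```

To check the real file, read /opt/superna/sca/conf/system.xml as text and pass its contents to the function.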