Eyeglass Failover Design Guide



Contents

  1 Overview
  2 What’s New with Eyeglass Failover
  3 Are you Planning A failover?
    3.1 Access Zone Failover
    3.2 SyncIQ DFS Mode with Eyeglass
    3.3 SyncIQ Mode with Eyeglass
    3.4 Failover Readiness
  4 Storage Failover with Eyeglass Failover Modes (SyncIQ Policy Failover, DFS Integrated Failover, Access Zone Failover)
  5 Supported DR Site and Failover Topologies
    5.1 Data Center to Data Center
    5.2 Multi Site Failover
    5.3 Data Center DR Fan-IN Topology
    5.4 2 Site DR - Stretch 3rd site Configuration Sync
  6 How to use the DR Dashboard to Assess Failover Readiness
    6.1 Policy Readiness / DFS Readiness
    6.2 Zone Readiness
  7 How to enable Automated DR Testing the Eyeglass Runbook Robot Feature
  8 Planning and Procedures for Eyeglass SyncIQ DFS Mode Failover
    8.1 DFS Mode Preparation Checklist
    8.2 DFS Mode Compatibility
    8.3 Considerations for Eyeglass SyncIQ DFS mode vs Default Configuration Sync job mode in Eyeglass
    8.4 Procedure to Enable Eyeglass SyncIQ DFS mode
    8.5 Detailed DFS Mode Configuration, Operating procedures and Design guidelines
  9 Planning and Procedures for Eyeglass SyncIQ Mode Failover
  10 Planning and Procedures for Eyeglass Access Zone Failover
  11 How to Execute A Failover with DR Assistant
  12 How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster
    12.1 Debug Plan of attack for clients post failover
    12.2 Steps to Validate Smartconnect and DNS failed over Successfully
    12.3 Steps to test Mounting a Share on Machine with no previous Mount
    12.4 Steps to correctly test a machine with existing Mount to Source cluster and remount to test write access
  13 How to Monitor the Eyeglass Assisted Failover
    13.1 In-Progress Failover
    13.2 Completed Failover
  14 Troubleshooting Failover
    14.1 Failover Recovery Procedures
    14.2 Collecting Logs for Failover Troubleshooting
    14.3 Authentication with Service Principal Name Considerations with Active Directory and SMB Shares in Access Zones
  15 Appendix A - Advanced Failover Modes
  16 Cached config advanced mode
    16.1 How to enable cached config advanced mode
  17 Fast Failover advanced mode
    17.1 How to enable Fast Failover advanced mode


Overview


Eyeglass offers single-button assisted failover by Access Zone (requires release 1.4 update 1 or later), by SyncIQ policy or policies, or by Microsoft DFS enabled SyncIQ policies.  This document provides:

  • An overview of each failover mode

  • High level steps for each failover mode

  • How to assess readiness for failover

  • Planning and operational steps for each failover mode


For guidance on which failover mode is appropriate for your environment, please consult the Eyeglass Start Here First document.  It provides the information you need to decide which failover option is appropriate for your environment:

  • When to use it?

  • Why use it?

  • What you need to know?

  • Estimated knowledge to configure


What’s New with Eyeglass Failover  



Release

Description

Failover Mode

1.6

New error handling for OneFS PAPI errors that occur during failover.  If PAPI returns an error such as 503 Service Unavailable on any of the allow writes, run policy/mirror policy, or resync prep steps, Superna Eyeglass now retries the action 3 times, because an error such as 503 Service Unavailable may be transient.


All Failover Modes

1.6

A timeout countdown is now shown for each step that is being processed, so the remaining time is visible during a failover.  For long-running steps, the log includes the URL of the recovery guide along with the HTTPS login URL for cluster management, allowing one-click access from a failover to the Isilon UI console to check on cluster operations.

All Failover Modes

1.6

Key steps are now grouped:

  1. The make writeable step is processed for all policies together (in series), making the filesystem writeable faster for all policies involved in the failover.

  2. The resync prep step now runs as a batch for all policies, after the make writeable step has completed for all policies.


All Failover Modes

1.6.1

The failover release notes must now be read and acknowledged in DR Assistant before a failover is allowed to continue.

All Failover Modes

1.7

As of release 1.7, all failover modes restrict the number of parallel job requests sent to the Isilon cluster for the Run SyncIQ Policy data sync step, based on cluster version:

  OneFS 7.2 - 5 parallel job requests (OneFS 7.x clusters have a limit of 5 concurrent policies).  Eyeglass monitors the progress of each job and submits a new request as previously submitted requests complete.

  OneFS 8    - the parallel job request limit is based on the Eyeglass appliance configuration (default 10).


Based on extensive testing for safe failovers, make writeable and resync prep are serialized steps.


All Failover Modes

1.8

This release introduces parallel failover mode, which is disabled by default.

High Speed Failover - Parallel Failover Flag:

    1. Allows the make writeable and resync prep steps to run in parallel with up to 10 threads, and ensures that 10 policies are submitted for processing at all times.

    2. NOTE: The risk of a policy failure increases.  With this flag set, a policy failure will NOT stop the failover in progress; Eyeglass continues to issue API calls until all SyncIQ policies in the failover job have been submitted.  Recovery is more complex if more than one policy fails to complete its step (allow writes or resync prep).

    3. Testing has shown that running these steps in parallel can improve failover times 3x to 4x when failing over a large number of policies.


Access Zone Failover Enhancement:

    1. New validation detects time skew between cluster nodes, and between Eyeglass and the clusters.

    2. A validation warning is raised if time skew is detected.

    3. Time skew can cause failed steps: if the time on different nodes is not within an acceptable range, the steps or running status of a policy cannot be detected reliably during failover.


SyncIQ Job Reports appended to Eyeglass failover log :

    1. Policy run and resync prep reports are now appended to the end of the Eyeglass failover log, simplifying triage of failed steps and escalation to EMC support when cluster policies fail.

    2. All information and time stamps are now in a single file.



All Failover Modes

1.9

Failover Enhancements

  1. Open files validation removed from DR Assistant until the Isilon API supports open files per access zone.

  2. New access zone readiness validation verifies all IP pools have a smartconnect zone defined

  3. DR Assistant synciq reports from a failover are now separated  from Eyeglass logs in the failover history, making debugging simpler.

  4. Restrict at source validation updated to show info only in the DR dashboard  

    1. This simplifies validation of access zone readiness for failover.  Restrict at source is a best practice; each policy shows green if it is implemented or info if it is not.

  5. SPN Management Enhancements

    1. SPN failover for access zone failover now restricts the delete and add SPN API calls to a single node in the target cluster.

      1. This change ensures a single domain controller is used for the failover operations.

    2. Short SPNs are now synced to AD computer objects (not used for Kerberos) during config sync; any that are missing are inserted.  NOTE: This is not related to failover of SPNs; it only ensures that newly detected smartconnect names are synced to the AD computer object.

  6. Failover log real-time view in DR assistant allows a live failover log to be monitored with auto refresh or stop and pause option.

  7. Quota Failover  Enhancement

    1. Unlinking a linked quota from its parent quota creates a quota that can be managed with a different limit from the parent quota.

    2. Eyeglass now correctly fails over unlinked quotas.  Unlinked quotas fail over as normal quotas, and the parent all users quota is failed over next, to ensure no conflict occurs on the target cluster.

    3. Shares with variable expansion in the path name now sync correctly between clusters.

  8. Ransomware Defender Failover

    1. See Ransomware Defender failover in the admin guide.
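
The retry behaviour from release 1.6 and the parallel job limits from release 1.7 listed above can be sketched as follows.  This is a minimal illustration and not Eyeglass code: TransientApiError, call_with_retry, parallel_job_limit and run_policies are hypothetical names, and the sketch assumes that "retry 3 times" means three further attempts after the first failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TransientApiError(Exception):
    """Hypothetical stand-in for a transient error such as 503 Service Unavailable."""


def call_with_retry(action, retries=3, delay_seconds=0.0):
    """Run action(); on a transient error, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return action()
        except TransientApiError:
            if attempt >= retries:
                raise  # retries exhausted, surface the error to the caller
            attempt += 1
            time.sleep(delay_seconds)


def parallel_job_limit(onefs_major, appliance_limit=10):
    """Parallel Run SyncIQ Policy requests allowed for a OneFS major version."""
    if onefs_major == 7:
        return 5  # OneFS 7.x allows 5 concurrent policies
    if onefs_major >= 8:
        return appliance_limit  # appliance configuration, default 10
    # Assumption: older versions are out of scope for this sketch.
    raise ValueError(f"no limit documented for OneFS {onefs_major}")


def run_policies(policy_names, run_policy, onefs_major):
    """Run every policy, never exceeding the parallel limit, with retries."""
    limit = parallel_job_limit(onefs_major)
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(
            pool.map(
                lambda name: call_with_retry(lambda: run_policy(name)),
                policy_names,
            )
        )
```

The thread pool size plays the role of the job request limit: a new policy run starts only when an earlier one has completed, which mirrors the submit-as-completed behaviour described for OneFS 7.2.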





Are you Planning A failover?  


We recommend you review our planning checklist, a proven process for a successful failover:


Failover Planning Guide and checklist


For a summary of best practices for Eyeglass and Isilon, refer to this link.

Access Zone Failover


Eyeglass uses the Access Zone as the basis for grouping data for failover when customers choose not to use the DFS mode or per SyncIQ policy mode described in the sections below.  Using the Access Zone as the unit of failover simplifies DR readiness planning and failover operations to the Access Zone level.  Shares, exports and quotas can be failed over with this mode.


Access Zone failover includes networking failover of Smartconnect Zones and any Smartconnect zone aliases that exist as well.  Eyeglass must failover ALL IP pools that are members of the access Zone and all aliases which means all SyncIQ policies and ALL shares, exports and quotas must failover at the same time.   The smartconnect failover process requires the source cluster zone names to be renamed (not deleted) during failover to avoid SPN collisions in Active Directory and to prevent clients from mounting the source cluster after failover.


This requires planning and mapping of IP pools from source to target clusters before readiness for the Access Zone is marked as ready for failover.
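
As an illustration of the kind of check this planning implies, the sketch below flags IP pools that would block readiness.  It is a hypothetical model, not the Eyeglass implementation: the IpPool fields, the pool names and the two rules (every pool has a SmartConnect zone and a mapped target pool) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IpPool:
    name: str                          # illustrative, e.g. "subnet0:pool1"
    smartconnect_zone: Optional[str]   # None if no zone is defined
    mapped_target_pool: Optional[str]  # None if no mapping exists yet


def unready_pools(pools: List[IpPool]) -> List[Tuple[str, str]]:
    """Return (pool name, reason) pairs for pools that block readiness."""
    problems = []
    for pool in pools:
        if not pool.smartconnect_zone:
            problems.append((pool.name, "no SmartConnect zone defined"))
        if not pool.mapped_target_pool:
            problems.append((pool.name, "no target pool mapping"))
    return problems


pools = [
    IpPool("subnet0:pool1", "data.example.com", "subnet0:pool1"),
    IpPool("subnet0:pool2", None, "subnet0:pool2"),
    IpPool("subnet0:pool3", "apps.example.com", None),
]
for name, reason in unready_pools(pools):
    print(f"{name}: {reason}")
```

An empty result would correspond to an Access Zone whose pools are all mapped and named; any entry is something to fix before failover day.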


In addition, SMB authentication depends on the AD machine account having the correct SPN values for Smartconnect zones: the SPNs must be registered with the cluster that is writable.  Eyeglass Access Zone failover automates SPN management, along with creation of the Smartconnect Zone aliases needed to access data, with a simple DNS update that delegates the smartconnect zone to the Isilon cluster.  (NOTE: DFS mode does not require DNS, SPN and smartconnect zone changes during failover.)




Figure: Cluster Configuration Before Access Zone Failover (panels: Normal; Preparation for Failover - create mapping hints before failover)

Figure: Cluster Configuration Access Zone Failover Steps



Failover when the Primary Cluster is not accessible (e.g. a real DR event)


Eyeglass DR Assistant - Access Zone Failover - Summary


  1. Ensure that there is no live access to data

  2. Begin Failover (Eyeglass automated)

  3. Validation (Eyeglass automated)

  4. Set configuration replication for policies to USERDISABLED (Eyeglass automated)

  5. Provide write access to data on target (Eyeglass automated)

  6. Move Smartconnect zone to Target (Eyeglass automated)

  7. Update SPN to allow for authentication against target (Eyeglass automated)

  8. Repoint DNS to the Target cluster IP address (use post failover script) (Eyeglass automated with scripting)

  9. Refresh session to pick up DNS change (use post failover script) (Eyeglass automated with scripting)




For details on this failover mode, consult the Access Zone Failover Guide.



SyncIQ DFS Mode with Eyeglass


This mode enables the most seamless failover and failback operations, with full quota failover/failback integration (excluding exports).  It enables zero-touch client failover: clients always mount the writable copy of the SyncIQ data with quotas active, and no DNS updates, remounts or reauthentication are required.


This is achieved using DFS folder UNC targets (with the same share name) and a SmartConnect Zone for each cluster.  DFS is set up to use both clusters; Eyeglass ensures shares exist on only one cluster at a time and moves them during failover events.  The DFS target folder path to the secondary cluster is activated automatically once the shares are created by Eyeglass.


NOTE: It’s possible to use 2 different Smartconnect zones on source and destination cluster so that nothing needs to change during failover on either cluster.  See below


Figure: Typical DFS folder setup (Eyeglass Isilon Edition - SyncIQ DR Orchestration Appliance Overview)



Eyeglass DR Assistant - DFS Mode Failover - Summary


  1. Ensure that there is no live access to data

  2. Begin Failover (Eyeglass automated)

  3. Validation (Eyeglass automated)

  4. Set configuration replication for policies to USERDISABLED (Eyeglass automated)

  5. Provide write access to data on target (Eyeglass automated)

  6. (Not performed and not required) Move Smartconnect zone to Target (Eyeglass automated)

  7. (Not performed and not required) Update SPN to allow for authentication against target (Eyeglass automated)

  8. (Not performed and not required) Repoint DNS to the Target cluster IP address (use post failover script) (Eyeglass automated with scripting)

  9. Fail over Shares and Quotas - shares and quotas are created on target and deleted from the source cluster (Eyeglass automated)

  10. DFS Clients automatically switch to DR cluster with DFS 2nd Folder UNC target path.


For details on this failover mode, consult the DFS Mode Failover Guide.
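
The two summaries above share their first five steps and differ only in what happens afterwards.  The sketch below captures that as plain data so the difference can be listed programmatically; the variable and function names are illustrative only, and the step text is taken from the summaries in this guide.

```python
# Steps common to Access Zone failover and DFS mode failover.
COMMON_STEPS = [
    "Ensure that there is no live access to data",
    "Begin Failover",
    "Validation",
    "Set configuration replication for policies to USERDISABLED",
    "Provide write access to data on target",
]

# Steps performed only for Access Zone failover.
ACCESS_ZONE_ONLY_STEPS = [
    "Move Smartconnect zone to Target",
    "Update SPN to allow for authentication against target",
    "Repoint DNS to the Target cluster IP address",
    "Refresh session to pick up DNS change",
]

# Steps performed only for DFS mode failover.
DFS_ONLY_STEPS = [
    "Fail over Shares and Quotas",
    "DFS Clients automatically switch to DR cluster",
]


def steps_for(mode):
    """Return the ordered step list for 'access_zone' or 'dfs'."""
    if mode == "access_zone":
        return COMMON_STEPS + ACCESS_ZONE_ONLY_STEPS
    if mode == "dfs":
        return COMMON_STEPS + DFS_ONLY_STEPS
    raise ValueError(f"unknown failover mode: {mode}")


for number, step in enumerate(steps_for("dfs"), start=1):
    print(f"{number}. {step}")
```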


SyncIQ Mode with Eyeglass


This mode of failover allows targeted failover, with some manual steps, of selected policies without failing over the entire Access Zone of policies.  Since no SPN management is performed with this failover type, it is better suited to NFS export failover plus quotas.  Shares and exports are pre-synced by Eyeglass, so both protocols are supported with this mode.


This failover mode does not automate Smartconnect Zone failover as is done with Access Zone failover.  This means selective Smartconnect Zones can be failed over requiring manual Smartconnect Zone aliases and DNS update to complete the failover.


This mode of failover is also useful with the post failover script engine, which can execute host side unmount and remount commands using scripts based on the samples provided with Eyeglass.  Professional Services can also be engaged to build host side scripts for customer requirements.


Review the Admin Guide Script Engine Documentation


These scripts allow simple SSH based automation of unmount and remount on remote hosts.  This can be done without updating DNS, since the target cluster Smartconnect Zone can be mounted directly once the SyncIQ policy is marked writeable on the target cluster.


We recommend this option for automation when the host count is <30.  If the host count is higher we recommend Access Zone failover and DNS updates.
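
As a rough idea of what such host side automation involves, the sketch below builds (but does not run) the SSH commands to unmount an NFS export and remount it from the target cluster Smartconnect Zone.  It is not one of the samples shipped with Eyeglass; the function name, host names, zone name and paths are all hypothetical.

```python
import shlex


def remount_commands(host, mount_point, target_zone, export_path, user="root"):
    """Build the ssh argument lists to unmount and remount one NFS host.

    Nothing is executed here; the caller decides how to run the commands
    (for example with subprocess.run) and how to handle failures.
    """
    nfs_source = f"{target_zone}:{export_path}"
    remote_unmount = f"umount {shlex.quote(mount_point)}"
    remote_mount = (
        f"mount -t nfs {shlex.quote(nfs_source)} {shlex.quote(mount_point)}"
    )
    login = f"{user}@{host}"
    return [
        ["ssh", login, remote_unmount],
        ["ssh", login, remote_mount],
    ]


# Hypothetical hosts that currently mount the source cluster.
for host in ["app01.example.com", "app02.example.com"]:
    for command in remount_commands(
        host, "/mnt/appdata", "dr-zone.example.com", "/ifs/data/appdata"
    ):
        print(" ".join(command))
```

Mounting the target zone name directly is what removes the DNS dependency: the hosts stop using the source cluster name instead of waiting for it to resolve to the target.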


The following diagrams show the flow of failover and steps with sample commands that would be run during the Eyeglass policy failover.  The SPN commands are shown if SMB manual failover is being executed.


For details on this failover mode, consult the SyncIQ Failover Guide.



Failover Readiness


The Eyeglass assisted failover has diagnostics to detect when failover is not possible or recommended and updates a simple DR Dashboard to indicate your current state.  


For Access Zones, the DR Dashboard indicates when any of the following need attention: Data sync issues, configuration sync issues, SPN out of sync conditions and invalid IP pool mapping for Access zone failover.  


The DR Dashboard also provides per SyncIQ policy and DFS mode policy readiness, covering SyncIQ and configuration sync readiness.  This allows failover readiness to be assessed below the Access Zone level rather than for the entire Access Zone.  Eyeglass validates your DR readiness at regular intervals and will notify you via Eyeglass external alarming (if configured) if a problem is detected.


The Eyeglass Runbook Robot feature is another way to validate your readiness by automating a failover on a specific, non-production “EyeglassRunbookRobot” Access Zone or SyncIQ Policy every night at midnight.  This exercises the actual failover steps in your environment daily and will also notify you via Eyeglass external alarming (if configured) when a problem is detected.


This feature operates as cluster witness and mounts the cluster over NFS and writes and reads back test data to verify failover from the client view of the cluster.   It can be configured in basic or advanced modes.  


The basic mode uses only a SyncIQ policy for failover, with no other logic running.  It is easy to set up and provides a quick test of failover and failback.


The advanced mode tests all logic and operates with the Access Zone failover mode and provides the same NFS write and re-read logic in addition to SPN management and smartconnect zone mapping and failover logic.
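
The write and read-back check performed from the client view can be pictured with the short sketch below.  It is only an illustration of the idea, not the Runbook Robot implementation: the function name is hypothetical, and a temporary local directory stands in for the NFS mount path of the cluster.

```python
import os
import tempfile
import uuid


def write_and_read_back(mount_path):
    """Write a unique marker file under mount_path and confirm it reads back."""
    marker = uuid.uuid4().hex
    test_file = os.path.join(mount_path, f"witness-{marker}.txt")
    with open(test_file, "w", encoding="utf-8") as handle:
        handle.write(marker)
    try:
        with open(test_file, "r", encoding="utf-8") as handle:
            return handle.read() == marker
    finally:
        os.remove(test_file)  # leave no test data behind


# A temporary directory stands in for the mounted export in this sketch.
with tempfile.TemporaryDirectory() as stand_in_mount:
    print("data access verified:", write_and_read_back(stand_in_mount))
```

Using a fresh marker on every run means a stale file left over from an earlier test cannot make the check pass by accident.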


Storage Failover with Eyeglass Failover Modes (SyncIQ Policy Failover, DFS Integrated Failover, Access Zone Failover)


The following section outlines the storage layer failover steps.  The full end to end DR plan should also include application shutdown and bring up procedures to complete a true end to end failover.  The storage layer is the foundation upon which all higher layer failover depends, and Eyeglass ensures this step is simple to execute and that errors during failover are detected.


Superna professional services can be engaged for an end to end POC, or for recommendations and assessments of complex or application layer orchestrated failover scenarios.  Examples include:


  1. VMware SRM + externally mounted storage by VM’s

  2. Oracle RAC Data Guard + File System dependencies for applications

  3. Please see solutions page here


Once you have determined which Failover Mode is appropriate for your environment, the table below provides the high level steps for each mode:

  1. Steps: Ordered steps and purpose of step

  2. Description: Description of action taken by step

  3. Executed On: The device that the step is taken on

  4. Non DFS Mode/DFS Mode/Access Zone: How each step is executed with Eyeglass depending on whether a SyncIQ Policy Failover, Microsoft DFS Mode failover or Access Zone failover is being done

  5. Target of operation is shown in brackets as source, target or Eyeglass in the table below.








Ordered Steps  for  - Non DFS and DFS Mode

Description

Non DFS Mode

DFS Mode

Ordered Steps for - Access Zone Failover

Description

Access Zone (Release 1.4 and later)

1 - Ensure that there is no live access to data

(source)

Manual check for open files.

If open files are found, decide whether to fail over or wait for them to be closed.

It is recommended to always disable the SMB and NFS protocols on the SOURCE cluster prior to failover to eliminate data loss.  NOTE: THIS IS A CLUSTER WIDE OPERATION.

Manual

Manual

1 - Ensure that there is no live access to data

(source)

Manual check for  open files.

If open files are found, decide whether to fail over or wait for them to be closed.

It is recommended to always disable the SMB and NFS protocols on the SOURCE cluster prior to failover to eliminate data loss.  NOTE: THIS IS A CLUSTER WIDE OPERATION.

Manual

1a - Cache schedule for SyncIQ policies being failed over and prevent SyncIQ policies being failed over from running (source)


Get schedule associated with the SyncIQ policies being failed over on OneFS, set policies to manual so they don’t run again during failover

Automated by Eyeglass

Automated by Eyeglass

1a - Cache schedule for SyncIQ policies being failed over and prevent SyncIQ policies being failed over from running (source)

Get schedule associated with the SyncIQ policies being failed over on OneFS, set policies to manual so they don’t run again during failover

Automated by Eyeglass

2 - Begin Failover with DR Assistant (Eyeglass)

Initiate Failover from Eyeglass

Manual or Eyeglass REST API

Manual or Eyeglass REST API

2 - Begin Failover with DR Assistant (Eyeglass)

Initiate Failover from Eyeglass

Manual or Eyeglass REST API

3 - Validation of failover job (Eyeglass)

Verify all warnings before submitting the failover job

Automated by Eyeglass

Automated by Eyeglass

3 - Validation of failover job (Eyeglass)

Verify all warnings before submitting the failover job

Automated by Eyeglass

4 - Synchronize data (Run syncIQ policies) (source)

Run all OneFS SyncIQ policy jobs related to the Access Zone being failed over

Automated by Eyeglass

Automated by Eyeglass

4 - Synchronize data (run SyncIQ policies) (source)

Run all OneFS SyncIQ policy jobs related to the Access Zone being failed over

Automated by Eyeglass (all policies in the Access Zone)

5 - Synchronize configuration (shares/export/alias, snapshot schedules, dedupe paths) (eyeglass)

Run Eyeglass configuration replication

Automated by Eyeglass (configuration exists on source and target)

Automated by Eyeglass

5 - Synchronize configuration (shares/export/alias,snapshot schedules, dedupe paths) (eyeglass)

Run Eyeglass configuration replication

Automated by Eyeglass (based on matching Access Zone base path)

6 - Renaming shares DFS mode to redirect DFS clients (multi threaded)

For DFS Failover, shares renamed on source and target cluster so clients are redirected with dual DFS target paths to target cluster

Not Applicable

Automated by Eyeglass

(special handling renames Shares on source and target so that only one DFS target UNC is reachable and active for DFS clients to switch over)


NOTE: It is possible to integrate DFS protected data inside an access zone failover to protect Shares , exports and DFS data with access zone failover.

If DFS is configured, the redirect rename steps are executed at this point

Automated by Eyeglass (based on matching Access Zone base path)





6 - Change Smartconnect Zone on Source so that names are not resolved by Clients (source) (dual delegation eliminates DNS updates)

Rename Smartconnect Zones and Aliases (Source)

Automated by Eyeglass (based on matching Access Zone base path)





7 - Avoid SPN Collision (source)

Sync SPNs in all AD providers to current SmartConnect zone names and aliases (proxied through target cluster) (Source)

Automated by Eyeglass

(AD delegation must be completed as per install docs)

9 - Provide write access to data on target (target) (single threaded for safe failover)

Allow writes to SyncIQ policy(s) related to failover (see note 2 below the table)

Automated by Eyeglass

Automated by Eyeglass

8 - Move Smartconnect zone to Target (target)

Add source Smartconnect zone(s) and alias(es) (Target)

Automated by Eyeglass

10 - Resync prep Step SyncIQ - Disable SyncIQ on source and make active on target (source)

Resync prep SyncIQ policy step to failover (creates mirror policy on target, disables the source cluster policy and enables the target cluster policy in OneFS)

Automated by Eyeglass

Automated by Eyeglass

9 - Update SPN to allow for authentication against target  (target)

Sync SPNs in all AD providers to current SmartConnect zone names and aliases (proxied through target cluster) (Target)

Automated by Eyeglass

11- Re-Set SyncIQ schedule on target mirror policy (target)

Set schedule on Mirror Policy(Target) using schedule from step 1 from OneFS for policy(s) related to the Failover job

Automated by Eyeglass

Automated by Eyeglass

10 - Repoint DNS to the Target cluster IP address

DNS Dual  delegation for all Smartconnect Zones that are members of the Access Zone

Automated by Eyeglass (See dual delegation setup details)

12 - Failover quota(s) (eyeglass)

Eyeglass DR Assistant automatically fails over quotas by running the  Quota Jobs related to the SyncIQ Policy(s) being failed over

Automated by Eyeglass (deleted on source cluster and created on target cluster)

Automated by Eyeglass (deleted on source cluster and created on the target cluster so that post failover quotas are applied)




13 - Remove quotas on directories that are target of SyncIQ (Isilon best practice) (source)

Eyeglass deletes all quotas on the source for all the policies

Automated by Eyeglass

Automated by Eyeglass




14 - Change Smartconnect Zone on Source so that names are not  resolved by Clients (source)

Rename Smartconnect Zones and Aliases (Source)

Manual

Not Required (source and destination clusters can use existing smart connect zones)

14 - Disable SyncIQ on source and make active on target (source)

Resync prep SyncIQ policy step to failover (creates mirror policy on target, disables the source cluster policy and enables the target cluster policy in OneFS)

Automated by Eyeglass

15 - Avoid SPN Collision (source)

Sync SPNs in all AD providers to current SmartConnect zone names and aliases (Source)

Manual (deletes smartconnect SPN from source cluster machine account)

Not applicable (DFS SPN’s are not changed during failover)

15 - Set proper SyncIQ schedule on target (target)

Set schedule on Mirror Policy(Target) using schedule from step 6 from OneFS for policy(s) related to the Failover

Automated by Eyeglass

16 - Move Smartconnect zone to Target (target)

Add source Smartconnect zone(s) as alias(es) (Target)

Manual

Not Required (source and destination clusters can use existing smart connect zones)

16 - Synchronize quota(s) (eyeglass)

Run Eyeglass Quota Jobs related to the SyncIQ Policy or Access Zone being failed over

Automated by Eyeglass

17 - Create SPN’s to allow for kerberos  authentication against target for SMB shares  (target)

Sync SPNs in all AD providers to current SmartConnect zone names and aliases (Target)

Manual (adds new smartconnect alias SPNs to target cluster machine account)

Not applicable (DFS SPN’s are not changed or registered to Cluster machine accounts)

17 - Remove quotas on directories that are target of SyncIQ (Isilon best practice) (source)

Delete all quotas on the source for all the policies

Automated by Eyeglass (Requires IP pool hints are configured See docs)

18 - Repoint DNS to the Target cluster IP address

Update DNS delegations for all Smartconnect Zones that are members of the Access Zone

Manual

Not applicable (no updates are needed as DFS resolution has not changed in DNS, only the target UNC with an active share)

DNS updates

Dual Delegation feature with Eyeglass avoids any DNS steps during failover for all smartconnect zones that are failed over

Automated by Eyeglass (see dual delegation one time configuration here)

19 - Refresh session to pick up DNS change

Remount the SMB share(s)

Manual on clients

Automatic (Windows 7 or later  with DFS support)

18 - Refresh session to pick up DNS change

Remount the SMB share(s) or remount exports

Automated by Eyeglass using Dual smartconnect zone Delegation (How to Configure Here)

  1. Initiates Eyeglass Configuration Replication task for all Eyeglass jobs

  2. SyncIQ does NOT modify the ACLs (access control settings on the file system); it locks the file system.  ls -l output will be identical on both source and target.




Supported DR Site and Failover Topologies


These replication topologies cover the scenarios commonly used with remote sites.  They allow 1 or 2 DR copies of data to be available at different geographic distances.  End to end failover automation is possible with Access Zone and DFS mode failover.


Data Center to Data Center



Supported Failover Modes

  1. Access zone - Fully automated any site failover

  2. DFS mode - Fully automated any site failover

  3. Per SyncIQ - partially automated any site failover


Multi Site Failover



Supported Failover Modes (see multi site failover guide)

  1. Access zone - Fully automated any site failover

  2. DFS mode - Fully automated any site failover

  3. Per SyncIQ - partially automated any site failover

Data Center DR Fan-IN Topology

Supported Failover Modes

  1. Per SyncIQ

  2. Access Zone

  3. DFS mode

2 Site DR - Stretch 3rd site Configuration Sync


Supported Failover Modes

  1. Access zone (A to B) Config synced to C manual failover

  2. Per SyncIQ (A to B) Config synced to C manual failover

  3. DFS mode (A to B) Config synced to C manual failover






How to use the DR Dashboard to Assess Failover Readiness


The DR Dashboard is the main status screen for overall cluster readiness for a DR event.  A critical alarm is raised when a validation that feeds the status column fails (SyncIQ, configuration replication, SPN checks, network IP pool mapping readiness audit).  This way you can address any issue that would affect your ability to fail over when it is detected, instead of discovering it at failover time.


Policy Readiness / DFS Readiness


SyncIQ Policy Failover Readiness and SyncIQ DFS Mode Failover Readiness are based on the status of the SyncIQ Policy Job (data replication) in OneFS and of the Eyeglass Configuration Replication Job (configuration replication) for that SyncIQ Policy's related configuration data (shares, exports, and aliases).  The two statuses are combined to provide an overall DR Status.  Policy Readiness and DFS Readiness are updated each time Eyeglass Configuration Replication is run.
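
One simple way to picture how two statuses combine into an overall DR Status is a "worst status wins" rule, sketched below.  The status names and the combining rule are assumptions made for illustration; the Eyeglass Admin Guide is the authority on the actual statuses and how they are derived.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical status levels, ordered from best to worst."""
    OK = 0
    WARNING = 1
    ERROR = 2


def overall_dr_status(data_replication, config_replication):
    """Combine two statuses; the worse of the two becomes the overall status."""
    return max(data_replication, config_replication)


overall = overall_dr_status(Status.OK, Status.WARNING)
print(overall.name)
```

Under this rule a policy is only reported as ready when both its data replication and its configuration replication are healthy.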



For more detailed information on these statuses, please refer to the Eyeglass Admin Guide here.


Zone Readiness


The Zone Readiness tab provides a per Access Zone summary of the key networking, Kerberos SPN and smartconnect subnet\pool information, along with the SyncIQ status and configuration replication validations used to assess readiness for failover by Access Zone.  The statuses are combined to provide an overall DR Status.  The Zone Failover Readiness is updated every 15 minutes by default.


This information provides the best indicator of DR readiness for failover and allows administrators to check status on each component of failover, identify status, errors and correct them to get each access zone configured and ready for failover.


By default the Failover Readiness job which populates this information is disabled.  Instructions to enable this Job can be found here.




For more detailed information on these statuses, please refer to the Eyeglass Admin Guide here.


How to enable Automated DR Testing the Eyeglass Runbook Robot Feature


Many organizations schedule DR tests during maintenance windows and weekends, only to find out that the DR procedures did not work or documentation needed to be updated.  Eyeglass Run Book Robot feature automates DR run book procedures that would normally be scheduled in off peak hours, and avoids down time to validate DR procedures, providing Failover and Failback automation tests with reporting.


This level of automation provides high confidence that your Isilon storage is ready for failover with all of the key functions executed on a daily basis.   In addition to automating failover and failback, Eyeglass operates as a cluster witness and mounts storage on both source and destination clusters the same way the cluster users and machines mount storage externally using access zone mount paths.


The feature exercises the maximum automation used in Access Zone Failover (Advanced mode), or runs in a basic Quick Start mode that uses only SyncIQ policy failover mode.

For more detailed information on planning and operation for Eyeglass Runbook Robot, please refer to the Eyeglass Runbook Robot Guide.


Planning and Procedures for Eyeglass SyncIQ DFS Mode Failover


DFS Mode Preparation Checklist


DFS mode requires the following prerequisites:

  1. Windows 2008 or 2012 Domain Controller

    1. DNS role installed

    2. DFS files services role

  2. 2 x Isilon clusters with SyncIQ

  3. Eyeglass appliance

  4. DFS enabled clients Windows 7, 8, 10, Server 2008, 2012, 2016

DFS Mode Compatibility


  1. Not compatible with RunBook Robot feature, since NFS is used for data access

  2. Hot\Hot and Hot\Cold compatible

  3. Compatible with Access Zones and Access Zone Failover mode but requires dedicated subnet:pool with Eyeglass igls-ignore hint applied to retain smartconnect zones on source and target clusters.  DFS mode does not require smartconnect zone names to failover.

Considerations for Eyeglass SyncIQ DFS mode vs Default Configuration Sync job mode in Eyeglass

  1. The default Eyeglass job mode is configuration sync mode, which places configuration data on both the source and target clusters. It treats the configuration data the same way SyncIQ treats data, meaning it is maintained in full sync on both clusters.

  2. Eyeglass SyncIQ DFS mode can be enabled instead. It deletes the share objects referenced in the policy from the target cluster and fails over shares. Quotas are also failed over during share failover.

Procedure to Enable Eyeglass SyncIQ DFS mode

  1. Select the policy with shares to be protected, then select the bulk action option Enable/Disable Microsoft DFS.

  2. Screen Shot 2015-08-17 at 9.07.45 PM.png

  3. Run the DFS Enabled job

  4. Verify it is green before configuring DFS in Active Directory

  5. Screen Shot 2015-08-17 at 9.09.55 PM.png



Detailed DFS Mode Configuration, Operating procedures and Design guidelines




Planning and Procedures for Eyeglass SyncIQ Mode Failover


Recommended for NFS or application failover that requires post-failover scripting for DNS updates and host-side unmount/mount automation.


Not recommended for SMB failover. Use DFS mode or Access Zone failover instead, since these modes handle SPN management and smartconnect zone operations during failover.


For details, see the SyncIQ Mode Failover Guide.

Planning and Procedures for Eyeglass Access Zone Failover



For Access Zone Failover requirements and setup, see the Access Zone planning guide here.



How to Execute A Failover with DR Assistant


Follow these steps to execute a failover. Note: The planning guide is expected to be the reference document for all planned failovers. Support expects that this document has been used for planning.


  1. Verify manually that there are no open files on the source cluster. There should be no client access to the Failover Source cluster during failover, as this data will not be replicated.

  2. Open DR Assistant

  3. Screen Shot 2017-05-16 at 7.36.30 AM.png

  4. Select failover mode (consult design documents)

  5. Select source cluster that has the writeable data to failover

  6. Leave all default check boxes for a planned controlled failover.

    1. Controlled failover  

      1. Check if the source cluster is healthy and reachable.

      2. Uncheck if the source cluster is not healthy or not reachable. This option is for a REAL DR event. NOTE: Do not use this option unless you are lab testing OR you are prepared for manual steps to recover from the resulting end state. In this case, source cluster API calls are skipped and cached knowledge of shares and quotas is used to fail over (Real DR Event).

        1. IMPORTANT:

          1. Eyeglass Configuration Replication Jobs will be in USERDISABLED state on source and target cluster after an uncontrolled failover.

          2. Eyeglass requires that directories being failed over exist on the target cluster which means SyncIQ policies have run at least once prior to failover.

    2. Data Sync

      1. Check to run a final SyncIQ data sync Job as part of the failover

      2. Uncheck to skip the SyncIQ data sync step

    3. Config Sync

      1. Check to run a final Eyeglass Configuration Replication Job as part of the failover

      2. Uncheck to skip the Eyeglass Configuration Replication step

    4. SyncIQ Resync Prep

      1. Check to execute the SyncIQ Resync Prep failover step (leave this default advanced setting)

      2. Uncheck to skip the SyncIQ Resync Prep failover step. This is not recommended, as it will leave the system in a state where you will not be able to use Eyeglass to fail back. Use this ONLY when you want to fail over in one direction and then recreate a new policy, or when you know how to manually recover and create the mirror policy.

    5. Disable SyncIQ Jobs on Failover Target (advanced setting; leave defaults)

Disabling on failover is optional. Use it if you don’t want to configure failback and execute a sync job in the return direction, for example when you want to verify systems before replicating data back to the source. Warning: Using this option WILL require manual steps to fail back.

  1. MUST READ: Uncheck Controlled Failover ONLY if this is a REAL DR event. NOTE: If this is unchecked, Eyeglass assumes the source cluster is destroyed and NO steps that provide failback are executed. The customer is responsible for recovery from an uncontrolled failover; NO automated recovery is possible after using this option. It is expected that customers make decisions to protect data at all times and only use this option if data is deemed not usable for business reasons. All recovery is manual if this option is used.

  2. (Screenshot below) Review the best practices document. It is expected that all release best practices are read and understood before proceeding. This document also covers prep steps for failover, for example domain mark. This document is a must-read for any failover.

  3. Screen Shot 2017-05-16 at 7.41.31 AM.png

  4. Select the policy or policies or access zone for the failover type selected.   

    1. Check readiness again before continuing to ensure you understand the warnings and if they will affect your failover.  In general warnings do not block failover.  Errors block failover.

  5. Screen Shot 2017-05-16 at 7.43.03 AM.png

  6. Review the Failover release notes, which cover special scenarios that must be assessed to determine whether they affect your planned failover. This document requires acknowledgment before continuing. Failing to read this document can result in data loss.

    1. Per SyncIQ Policy or DFS failover validation Screen.

      1. This screen lists all policies selected for the failover.  NOTE: The previous policy selection screen only allows policies that are valid choices. The Selected policies will be summarized. (Screenshot below)

  7. Screen Shot 2017-05-16 at 7.45.23 AM.png

    1. Access Zone validation Screen (below).  NOTE: This screen will show all policies within the access zone that are eligible for failover.  If any policy is USER DISABLED or policy disabled it will be shown as “will NOT be failed over”. NOTE: Do not failover with a disabled policy unless you know the data protected by this policy does not need to be failed over.

    2. Screen Shot 2017-05-16 at 1.16.46 PM.png

    3. Reference the screenshot below to see a successful validation screen.

    4. val.png

  8. Review final confirmation screen.  

    1. Review link to recovery guide.  

    2. This document is used by support to assist with recovery steps. It is assumed you are familiar with this document.

    3. Final acceptance and point of no return.

    4. NOTE: The failover job cannot be canceled once started.

  9. Screen Shot 2017-05-16 at 7.47.58 AM.png

  10. Start the failover with Run button.

  11. Monitor with logs icon.  

    1. Click Watch to follow the failover real-time or click fetch to update log window with current progress.

  12. Screen Shot 2017-05-16 at 7.50.20 AM.png

  13. Failover status completes with success or failure.

  14. On failure, download the support log and open a support case.


How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster


Debug Plan of attack for clients post failover:


NOTE: follow the order below to find root cause


  1. Check DNS

  2. Mount share from client

  3. If there is an authentication error, fix it

  4. If there is no authentication error, test write access

  5. If there is no write access, remount correctly

  6. Retest mount of share, test write access again.

  7. Done.


Steps to Validate Smartconnect and DNS Failed Over Successfully:



  1. Test DNS response on the clusters: This step verifies that smartconnect names were failed over successfully and can also verify whether dual delegation in your DNS environment is set up correctly. It rules out issues with your internal DNS and verifies that the Isilon Smartconnect zones failed over successfully.

    1. Quick test: From a Windows client command prompt, type “ping <Smartconnect name FQDN>”. This should return an IP address from the TARGET cluster IP pool. If ping responds with a correct IP from the TARGET cluster:

        1. Cancel the ping command with CTRL-C and ping the same smartconnect name again to make sure a second IP from the same TARGET cluster IP pool is returned. This verifies that Smartconnect and Isilon DNS are functioning as expected.

        2. If BOTH ping tests are successful, CONTINUE TO the MOUNT STEPS section.

        3. If the ping fails, or the name does not resolve to a correct IP address of the TARGET cluster, continue with the steps below to debug DNS.

    2. From any Windows client machine, type “nslookup” <press enter key>.

    3. Source Cluster DNS Test:

      1. Type "server x.x.x.x" <press enter key>, where x.x.x.x is the subnet service IP of the source cluster.

      2. Type the FQDN of the smartconnect zone used in the failover (for example, data.example.com) <press enter key>. Hint: Refer to the failover log from DR Assistant for the full list of smartconnect names that were failed over.

      3. The expected response is a failed resolution, since failover disables the SOURCE cluster DNS response.

        1. Example of a failed nslookup on the cluster you failed away from: “** server can't find userdata.ad1.test: REFUSED”

        2. NOTE: If the lookup does NOT return a REFUSED response, the smartconnect name did not fail over correctly. Consult the recovery guide Networking section to fix smartconnect names.

    4. Target Cluster DNS Test:

      1. Test the TARGET cluster SSIP (subnet service IP) with DNS.

        1. Type "server y.y.y.y" <press enter key>, where y.y.y.y is the subnet service IP of the target cluster.

        2. Type the FQDN of the smartconnect zone used in the failover <press enter key>. Refer to the failover log for the list of smartconnect names that were failed over.

        3. The expected response is a SUCCESSFUL NAME RESOLUTION RETURNING AN IP OF THE TARGET CLUSTER. This means smartconnect was failed over correctly to the target cluster.

        4. If the DNS test fails at this step, OR the name fails to resolve, OR the wrong IP address is returned, consult the recovery guide Networking section to fix smartconnect names.

    5. If all DNS tests pass in this section (but the quick ping test failed):

      1. Root cause: Your internal DNS is not set up correctly for dual delegation, since the SSIP on the cluster correctly answers DNS queries. Stop here and correct the delegation using the guide and video referenced above. Double-check that 2 name server entries exist for the smartconnect name you are failing over.

      2. End debugging.
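
The quick test and the decision logic above can be sketched in a few lines of Python. This is a minimal illustration, not part of Eyeglass: the function names, the example pool range, and the idea of passing in a resolver are all assumptions made for the sketch. The client's DNS cache may also return the same answer twice, so treat the result as a hint rather than proof.

```python
import ipaddress
import socket


def in_pool(ip, pool_ranges):
    """Return True if ip falls inside any (first, last) range of the IP pool."""
    addr = ipaddress.ip_address(ip)
    return any(
        ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last)
        for first, last in pool_ranges
    )


def quick_test(fqdn, target_pool, resolver=socket.gethostbyname, attempts=2):
    """Resolve fqdn several times; pass only if every answer is in the target pool.

    target_pool is a hypothetical list of (first_ip, last_ip) tuples that
    describes the TARGET cluster IP pool.
    """
    answers = []
    for _ in range(attempts):
        try:
            answers.append(resolver(fqdn))
        except OSError:  # socket.gaierror is a subclass of OSError
            return False, answers
    return all(in_pool(ip, target_pool) for ip in answers), answers


def diagnose(quick_ok, source_refused, target_ok):
    """Map the test results from this section to the next action."""
    if quick_ok:
        return "continue to mount steps"
    if not source_refused or not target_ok:
        return "smartconnect failover problem: see recovery guide Networking section"
    return "check dual delegation in internal DNS"
```

For example, `quick_test("data.example.com", [("10.2.0.10", "10.2.0.20")])` resolves the name twice through the client's normal DNS path, which is the same path the ping test exercises. The source and target SSIP lookups still need nslookup (or a DNS library) because they query a specific server.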



Steps to test Mounting a Share on Machine with no previous Mount:

  1. NOTE: Use a Windows client that DOES NOT have a connection to any cluster to perform this test correctly.

  2. Mount test: From a Windows client, in File Explorer enter \\FQDN of smartconnect zone in access zone\<share name>

    1. If this step is successful, test write access to the share and SKIP to the section below, “Steps to correctly test a machine with existing Mount to Source cluster and remount to test write access”.

    2. If you receive a Windows login popup asking for user id and password, this indicates an AD SPN kerberos failover issue. Check the Eyeglass failover log in DR Assistant for failed SPN Delete or Create steps, and match the smartconnect name to the failed SPN step in the log.

      1. Typically an SPN issue shows up as a popup login dialog box in Windows requesting user id and password, since authentication to the target cluster of the failover failed.

      2. On the source cluster, use the isi command to verify the FQDN(s) are NOT listed.

        1. Example: “isi auth ads spn list <your AD domain provider here>”

      3. On the target cluster, use the isi command to verify the FQDN(s) ARE listed.

        1. Example: “isi auth ads spn list <your AD domain provider here>”

      4. If the source still lists an SPN for the FQDN OR the target does NOT list an SPN matching your FQDN(s), then a MANUAL SPN failover is required to allow kerberos authentication to succeed. See the recovery guide section here.

      5. Correct the SPN issue and retest mount access.

      6. If successful, SKIP to the section below, “Steps to correctly test a machine with existing Mount to Source cluster and remount to test write access”.
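
The source/target SPN comparison above can be expressed as a small check. The sketch below is illustrative only: it assumes you have already copied the SPN strings (in the HOST/name form) out of the `isi auth ads spn list` output into Python lists, because the exact output layout of that command is not described here. The function names are hypothetical.

```python
def spn_hosts(spns):
    """Return the lower-cased host part of each 'service/host' SPN string."""
    return {s.split("/", 1)[1].strip().lower() for s in spns if "/" in s}


def spn_failover_problems(fqdns, source_spns, target_spns):
    """List the FQDNs whose SPNs are not where they should be after failover.

    After a successful failover each FQDN should be absent from the source
    cluster's SPN list and present in the target cluster's SPN list.
    """
    source = spn_hosts(source_spns)
    target = spn_hosts(target_spns)
    problems = {}
    for fqdn in fqdns:
        name = fqdn.lower()
        issues = []
        if name in source:
            issues.append("still listed on source")
        if name not in target:
            issues.append("missing on target")
        if issues:
            problems[fqdn] = issues
    return problems
```

An empty result means the SPN lists look consistent with a completed failover. Any entry in the result points at an FQDN that needs the manual SPN failover described in the recovery guide.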



Steps to correctly test a machine with existing Mount to Source cluster and remount to test write access:


  1. Unmount the share/export. Note: Access Zone failover requires all Windows and Linux clients to unmount before attempting to access data on the target cluster.

    1. On Windows, run net use x: /delete (replace x with the drive letter), OR

    2. In Windows Explorer, right-click the drive letter and select the Disconnect menu option.

      1. Note: This is not the most reliable test. If other sessions to the cluster exist, this command will not release the session. The #1 way to ensure this step is done correctly is to REBOOT THE CLIENT MACHINE. Proceed below if you do not want to reboot the client.

      2. Use File Explorer to mount the FQDN of the access zone smartconnect name: \\FQDN of smartconnect name\sharename

      3. Test write access to the share.

          1. If this step fails with a read-only error, continue to the next step.

          2. From a command prompt, type “netstat -an | more” to list TCP sessions, and look for an entry that lists an IP address of the source cluster on port 445. Such an entry means that an SMB session to the source cluster still exists and the unmount did not release the TCP session.

          3. Next step: Reboot to guarantee there are no sessions to the source cluster, then repeat the mount of \\FQDN\share name

      4. After a successful remount of the smartconnect name:

        1. Verify the TCP session to the target cluster.

          1. From a command prompt, type “netstat -an | more” to list TCP sessions, and look for an entry that lists an IP address of the target cluster on port 445. This means that the SMB session to the target cluster is connected.

        2. Test write access to the share.

      5. Completed all debugging.
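
Scanning netstat output by eye is error-prone on busy clients, so the port 445 check above can be scripted. The following is a minimal sketch that assumes the usual Windows `netstat -an` column layout (protocol, local address, foreign address, state). The function name and the cluster IP sets are hypothetical inputs you would supply yourself.

```python
def smb_sessions(netstat_output, cluster_ips):
    """Return (ip, state) for each TCP entry to port 445 on one of cluster_ips."""
    hits = []
    for line in netstat_output.splitlines():
        parts = line.split()
        # Expect: TCP  <local addr:port>  <foreign addr:port>  <state>
        if len(parts) < 4 or parts[0] != "TCP":
            continue
        ip, _, port = parts[2].rpartition(":")
        if port == "445" and ip in cluster_ips:
            hits.append((ip, parts[3]))
    return hits
```

Run it once with the source cluster pool IPs (any hit means the old session was not released) and once with the target cluster pool IPs (a hit confirms the new session). The text to scan can be captured with `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`.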







How to Monitor the Eyeglass Assisted Failover


In-Progress Failover


Once a failover has been started, you can monitor its progress from the Eyeglass DR Assistant / Running Failovers tab.


Screen Shot 2016-11-23 at 8.07.27 PM.png



From this window you can expand the Job Details tree to see the progress and status for each failover step.

You can also open the failover log from this window to see the details for each step by selecting the Logs link.  


Screen Shot 2017-05-16 at 7.50.20 AM.png



Each entry in the log is timestamped. The log is updated as the failover proceeds and you can see log updates by closing and opening the log file again.


Should an error occur during failover, an Eyeglass system alarm will be issued.  If you have configured external notification by email or Twitter you will receive these alarms this way.  The alarms are also visible from the Eyeglass Alarms window.




Completed Failover


Once the failover is completed, it will appear in the DR Assistant / Failover History tab.


Screen Shot 2017-05-16 at 7.58.27 AM.png


The Result column displays SUCCESS if the Failover completed successfully and FAIL if errors were encountered in the Failover steps.  The SyncIQ reports are available separately to review cluster logs for each step of the failover.


Note: An Access Zone Failover with Result of SUCCESS may have had SPN errors.  Please refer to the Access Zone Failover Guide for details on checking for SPN errors and resolution.


From the Failover History window, click on the row corresponding to the Failover that you would like to review.  The Job Details tree will appear below and the Failover Log can be retrieved for viewing or download by selecting the Open link.



Troubleshooting Failover


Failover Recovery Procedures

In the event that a Failover does not complete all steps successfully, please refer to the Eyeglass Failover Recovery Procedures to assess the state of your environment and for recovery steps.



Collecting Logs for Failover Troubleshooting

To collect the logs for Failover Troubleshooting, follow the instructions for collecting support information found in the Eyeglass FAQ document here.  The Failover logs will be included with other Eyeglass logs contained in the Logs Backup file.


Authentication with Service Principal Name Considerations with Active Directory and SMB Shares in Access Zones


Kerberos requires that a Service Principal Name (SPN) be registered against only a single computer account in Active Directory. This property can be seen with the ADSI Edit tool. The SPN is in the form HOST/service name, and typically there are 2 entries for each smartconnect zone or zone alias created on a cluster: one for NetBIOS naming (15 characters) and one in DNS name format.


The service principal name is required to exist on the machine account that handles authentication requests from clients and sends them to a domain controller for authentication using kerberos session tickets.


Active Directory does not prevent a duplicate SPN from being registered. If this occurs, Kerberos authentication fails for clients, and they will be unable to mount data if NTLM fallback authentication does not succeed. During failover, Eyeglass deletes the SPNs of the subnet pool and its aliases on the selected source cluster access zone from the AD computer account or ALL AD providers assigned to the Access Zone.


Eyeglass also scans cluster machine accounts during configuration replication jobs and fixes missing SPNs if detected.

Error-DuplicateSPN-Detected.png

Example error seen after duplicate SPNs were created.  This is seen on the domain controller attempting to authenticate a mount request. This error only appears once and not for each failed authentication.


For information on this event, see KB article https://support.microsoft.com/en-us/kb/321044








Appendix A  - Advanced Failover Modes


Cached config advanced mode


In some cases customers cannot pre-sync configuration data from one cluster to the DR site, as this exposes data at the DR location. This requirement means that all configuration data must be cached on the Eyeglass appliance rather than pre-synced to the DR cluster.


Eyeglass always has a database of changes, but it is not used for failover operations because this information can be stale in planned failovers. In release 1.6 and later, a failover mode switches Eyeglass to sync configuration data to files that are used in all failover modes, controlled and uncontrolled.


The process now looks like this:


  1. Every 5 minutes, sync the configuration changes (differences only) and update the local cache files on the Eyeglass appliance.

  2. Controlled AND uncontrolled failover read from the cache files only and never communicate with the source cluster to create shares, exports, and quotas on the target DR cluster during the failover process.

    1. For controlled failover, this means that potentially stale data is used to fail over.


How to enable cached config advanced mode

  1. SSH to the Eyeglass appliance as admin

  2. igls adv failovermode set --readfromfile=true

  3. Done.


Fast Failover advanced mode


This mode switches to parallel policy processing with up to 10 threads for the make writeable step and the resync prep step. It is disabled by default; the default uses sequential make writeable and resync prep steps when a group of policies is involved in a failover job. The sequential default is the recommended best-practice, least-risk mode. NOTE: Errors on policies are ignored in parallel mode, which may result in multiple policies failing if the cluster cannot process all the failover requests. Use with caution and with an understanding of the recovery procedures.


This mode is intended for customers that have hundreds of policies or more and, for business reasons, require a faster failover option.

Key differences between default sequential and parallel mode:

  1. For 8.x clusters, 50 policies can run at a time, and Eyeglass will use a maximum of 10 threads, allowing 10 policy make writeable or resync prep commands to be sent at once. For 7.2 clusters, only 5 will execute and 5 are queued. When one policy completes, another is started, with the goal of keeping the maximum number queued at all times.

  2. Testing has shown a 3x to 4x improvement in the overall time to complete the make writeable step. Results in production may vary.

  3. Error handling with sequential failover stops the failover, fails the job, and allows simplified recovery from failed steps.

  4. Error handling with parallel mode does NOT stop when a make writeable or resync prep step fails for a policy. Failover will continue. This complicates recovery if multiple policies fail: all policies will be tried, with the potential for many failed policies.
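
The difference in error handling between the two modes can be illustrated with a generic sketch. This is not Eyeglass code: the function names and the `step` callable (which stands in for a make writeable or resync prep call on one policy) are assumptions for the example. It only shows the general pattern of a stop-on-first-error loop versus a bounded thread pool that tries every policy.

```python
from concurrent.futures import ThreadPoolExecutor


def failover_sequential(policies, step):
    """Run step on each policy in order and stop at the first failure."""
    completed = []
    for policy in policies:
        try:
            step(policy)
        except Exception as err:
            # The job fails here; later policies are never attempted.
            return completed, [(policy, str(err))]
        completed.append(policy)
    return completed, []


def failover_parallel(policies, step, max_threads=10):
    """Run step on all policies with a bounded pool; failures do not stop the job."""
    completed, failed = [], []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [(policy, pool.submit(step, policy)) for policy in policies]
        for policy, future in futures:
            err = future.exception()  # blocks until this policy has finished
            if err is None:
                completed.append(policy)
            else:
                failed.append((policy, str(err)))
    return completed, failed
```

With the sequential function, a failure on the second of five policies leaves one completed policy and three untouched ones, which is a simple state to recover from. With the parallel function the same failure leaves every other policy attempted, so the recovery list can contain several unrelated failures.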


How to enable Fast Failover advanced mode


  1. igls adv failovermode set --parallel=true

  2. Done. The change affects all failover jobs.

  3. To disable, run: igls adv failovermode set --parallel=false