Failover Planning Guide and checklist


Superna Eyeglass Isilon Edition


Chapter 1 - Introduction to this Guide


Overview

The Eyeglass Isilon edition greatly simplifies DR with DFS.  The solution allows DFS to maintain targets (UNC paths) that point at both source and destination clusters.

The failover and failback operations are initiated from Eyeglass and move configuration data to the writeable copy of the UNC target. Grouping of shares by SyncIQ policy allows Eyeglass to automatically protect shares added to the Isilon.  Quotas are also detected and protected automatically.

The following checklist will assist you in planning and testing your configuration for failover in the event that DR (Disaster Recovery) is needed.

Where to go for Support

Superna offers support in several forms: online, voicemail, email, or live online chat.

  1. The support site provides online ticket submission and case tracking.  Support Site link - support.superna.net

  2. Leave a voicemail at 1 (855) 336-1580. You must leave the customer name, email, a description of the question or issue, and the primary contact for your company with an account in our system; we will assign the case to the primary contact for email follow-up.

  3. Email eyeglasssupport@superna.net. This is also how to download license keys.

  4. You can also raise a case directly from the Eyeglass desktop using the help button: search for your issue and, if you want to raise a case or get a question answered, click Leave us a message, enter your name, email and appliance ID, and a case is opened directly from Eyeglass.


  5. Or get support using chat, M-F 9-5 EDT, via Eyeglass Live Chat (an empty chat box means we are not online).

  6. You should also review our support agreement here.


Chapter 2 - Checklist to plan for Failover



Steps Before Failover Day

Each entry below lists the task number, the task, and its description; mark each task Completed as you finish it.

Task 0 - Document DR Runbook plan

  • Organize the steps, the order of the steps, and the contacts required for each step on failover day


Task 0A - Purchase Superna DR Readiness Audit Service


Task 1A - Review DR Design Best Practices and the Failover Release Notes

Warning: This is a mandatory step for all customers; DR assistance requires acceptance before continuing.


Task 1B - Upgrade Eyeglass to the latest version

Eyeglass releases include failover rules engine updates that add rules learned from other customer failovers, continuously improving failover and avoiding known issues. Review the Failover Release Notes disclaimer on the latest-release requirement.


Task 1C - Test DR procedures



Task 1D - Benchmark Failover (Access Zone)

  • Copy data into a test policy or the Runbook Robot Access Zone (note: the Robot can only use 1 policy for testing; to complete multi-policy testing, a test Access Zone needs to be created and configured for Access Zone failover)

  • Execute a test failover and use the failover log to find the time delta between the make writable step and the start of the log. This is the point at which failover is complete; failback steps then begin to execute, but clients are able to write data to the target at this point.

  • Repeat the above with 2 policies and a known quantity of data so that both policies sync data and fail over. Record the time difference between the make writable log step and the time stamp at the beginning of the failover log.

  • Repeat one more time with 3 policies and the same amount of data in each directory.

  • Now average the 3 test run times to the make writable step and use this value, which is unique to your environment (clusters, WAN, nodes in replication, etc.), to calculate estimated failover times if you have more than 3 policies. A worked example follows this list.

  • Note: the test Access Zone should have all configuration completed (hints, SPNs, shares, exports and quotas) so that the time estimates are as close as possible to the production configuration when estimating failover times.

  • Note: If the change rate is expected to be zero before a planned failover, skip the step that creates changed data before failover.

  • Note: The reason to create as many shares under each policy as in production is to capture the time for the rename step to complete for each share; this step is a parallel operation but should be benchmarked on your clusters.

  • Note: Failover logs include post-failover steps that prepare for failback and complete an audit of the clusters. The failover job time DOES NOT REPRESENT THE TIME IT TAKES TO FAILOVER. YOU MUST CALCULATE THE MAKE WRITABLE STEP IN THE LOGS.
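Worked example (hypothetical numbers, not measured values): if the make writable step was reached 6 minutes into the 1-policy run, 10 minutes into the 2-policy run and 12 minutes into the 3-policy run, the per-policy times are 6, 5 and 4 minutes, for an average of about 5 minutes per policy. A production failover of 10 similar policies would then be estimated at roughly 10 x 5 = 50 minutes to the make writable step. Always substitute the times benchmarked on your own clusters; WAN links, node counts and share counts change these values significantly.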



Task 1E - Benchmark Failover (DFS Mode)

  • Use the Access Zone with a DFS mode policy or create a test DFS mode policy

  • Copy test data into the path

  • Create one or more shares in the path of the test policy (if you have more than one share under a policy in production, create as many shares as you have in the production policy configuration)

  • Create more than one policy as in the step above (for example 3) to get a good time average

  • Create changed data if you plan to fail over with un-synced data (optional step)

  • Run a DFS mode failover on 1 policy, then 2, then 3. Record the time difference between the make writable step and the start of the failover log.

  • Calculate the average time per policy (based on your production configuration)

  • Use this number to estimate the time to complete your production failover (see the worked example under Task 1D)

  • Note: The reason to create as many shares under each policy as in production is to capture the time for the rename step to complete for each share; this step is a parallel operation but should be benchmarked on your clusters.

  • Note: Failover logs include post-failover steps that prepare for failback and complete an audit of the clusters. The failover job time DOES NOT REPRESENT THE TIME IT TAKES TO FAILOVER. YOU MUST CALCULATE THE MAKE WRITABLE STEP IN THE LOGS.


Task 2 - Contact list for failover day

  • AD administrator

  • DNS administrator

  • Cluster storage Administrator

  • Workstation and server administrators

  • Application team for dependent applications

  • Change Management case entered for the outage window


Task 3 - Reduce failover and failback time

Run manual domain mark jobs on all SyncIQ policy paths. This speeds up failover because a domain mark can take a long time to complete and lengthens the failover time. A hedged CLI example follows the documentation link below.

All policies http://doc.isilon.com/onefs/7.0.1/help/en-us/GUID-8550C23D-9550-457E-A368-477E65CFD683.html
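A rough sketch of starting a domain mark job from the OneFS CLI is below; the path is hypothetical and the exact syntax varies by OneFS release (older releases use isi job start rather than isi job jobs start), so confirm it against the documentation linked above.

  isi job jobs start DomainMark --root /ifs/data/policy1 --dm-type SyncIQ

Repeat for each SyncIQ policy source path and allow the jobs to finish well before failover day.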


Task 4 - Count shares, exports, NFS aliases and quotas on source and target with the OneFS UI

Validates that the approximate configuration count is synced correctly (also verify with the Superna Eyeglass DR Dashboard). A CLI alternative is sketched after this task.

(There should be no quotas synced on the target - only shares, exports and NFS aliases.)
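If you prefer the CLI for the counts, a rough sketch is below; the syntax reflects OneFS 8.x and the zone name is hypothetical, so adjust both for your release and Access Zones, and note the counts are approximate because the output includes header lines.

  isi smb shares list --zone=ZoneProd | wc -l
  isi nfs exports list --zone=ZoneProd | wc -l
  isi nfs aliases list --zone=ZoneProd | wc -l
  isi quota quotas list | wc -l

Run the same commands on the source and target clusters and compare; quotas should only return results on the source.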


Task 5 - Verify dual delegation in DNS before failover

This verifies that DNS is pre-configured for failover for all SmartConnect zones that will be failed over (Access Zone failover fails over all SmartConnect zones on all IP pools in the Access Zone). A hedged spot-check example follows.
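One way to spot-check the delegation from a client, using hypothetical names (prod.example.com as the SmartConnect zone name and corp-dns01 as the corporate DNS server):

  nslookup -type=NS prod.example.com corp-dns01
  nslookup prod.example.com

The NS query should show delegation records for the SmartConnect zone pointing at the SmartConnect service IPs of both the source and the target cluster; the A query confirms the name still resolves from the currently active cluster. Behaviour of NS queries varies by DNS server, so reviewing the delegation records in the DNS management console remains the authoritative check.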


Task 6 - DFS failover preparation

Using dfsutil, verify that clients that will be failed over show two paths to storage for the DFS folder and that the correct path is active.

Download the dfsutil tool matching your OS type.

Check path resolution; a hedged example is shown below.

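A minimal client-side sketch, assuming a Windows client with dfsutil installed and a hypothetical DFS folder \\corp.example.com\dfs\data:

  dfsutil /pktinfo
  dfsutil cache referral
  dfsutil diag viewdfspath \\corp.example.com\dfs\data

The referral cache output should list both cluster UNC targets for the DFS folder, with the source cluster target marked ACTIVE before failover; viewdfspath shows which underlying share the DFS path currently resolves to. Exact subcommands vary by Windows version, so check dfsutil /? on the client you are testing from.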


Task 7 - Communicate the failover outage impact to application teams and business units that use the cluster

  1. Schedule a maintenance window with the application and business units

  2. Make sure to explain that data loss will occur if data is written past the maintenance window start time


Steps on Failover Day

As above, mark each task Completed as you finish it.

Task 0 - SMB and NFS protocol disable

Determine whether this will be used to enforce no writes to the source cluster (consult the EMC documentation on the isi command; a hedged sketch follows).
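If protocol disable will be used, one possible approach from the OneFS CLI is sketched below. The service names and cluster-wide impact are assumptions on our part; confirm the exact syntax and behaviour for your OneFS version with EMC documentation before relying on it, since disabling the services stops SMB and NFS client access across the whole cluster.

  isi services smb disable
  isi services nfs disable

Re-enable them after failover with the same commands and enable in place of disable, if the source cluster is to remain in service.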


Task 1 - Disable Eyeglass configuration sync

This avoids configuration sync jobs running during failover.

  1. igls admin schedules list

  2. igls admin schedules set --id Replication --enabled false (day of failover)

  3. igls admin schedules set --id Replication --enabled true (after failover)



Task 2 - Ensure an Active Directory administrator is available

ADSIedit recovery steps are required and need Active Directory administrator access to the cluster machine accounts.


Task 3 - Force run SyncIQ policies 1 hour before the planned failover

Run each SyncIQ policy beforehand so that the failover policy run will have less data to sync. A hedged command example follows.
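A minimal sketch of manually starting a policy run from the OneFS CLI, with a hypothetical policy name; confirm the syntax for your OneFS release:

  isi sync jobs start Policy-Finance

Repeat for each policy in the failover scope and confirm with isi sync jobs list (or the OneFS UI) that each job completes before the maintenance window opens.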