Failover Recovery Procedures

Failover Log Analysis

This section explains how to review a failover log to determine which step failed, and how to use the tables in this document to decide on next steps.

Each failover log contains ALL sections in this document.  See the expanded failed failover log below.

[Screenshot: expanded failed failover log]

Steps for reading a failover log:

  1. Identify the section in the log with an error message; expand the folders to find the red X.

  2. Determine which step failed and which table applies to the next step.

  3. Match the scenario in the failover log to the applicable scenario in each table.

  4. NOTE: SyncIQ policy issues are frequently the source of errors.  Each policy step (run, make writeable, resync prep (4 steps), and run mirror policy) that has errors appears on the cluster as a SyncIQ job report with details on the error condition.

  5. As of release 1.8, the following SyncIQ job reports are collected and appended to the end of the failover log, making it easier to see failures on the cluster related to SyncIQ:

    1. Sending this error reporting to Support allows faster resolution and case handling with EMC if the root cause is a SyncIQ policy failure that cannot be recovered or retried.

  • Run Report;

  • Resync Prep Report;

  • Resync Prep Domain Mark Report;

  • Resync Prep Restore Report;

  • Resync Prep Finalize Report;

Report added for mirror policy:

  • Run Report.

  NOTE: If using advanced parallel mode, expect more policy reports; errors may exist in any of them and need review to see which policy failed.  Also note that sequential failover logs are in step order, but in parallel mode logging will not be sequential, based on the failover logic.
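The triage steps above can be sketched against a saved text copy of a failover log; the log lines below are invented placeholders, not the real Eyeglass log format:

```shell
# Scan a saved failover log export for failed steps.
# The log content here is a made-up sample for illustration only.
log='Step: SOURCE get POLICY info ... OK
Step: TARGET allow writes POLICY PATH ... ERROR: connection refused
Step: SOURCE resync prep POLICY ... OK'

# grep -n prefixes each match with its line number, so a failed step
# can be located quickly in a long log
failures="$(printf '%s\n' "$log" | grep -n 'ERROR')"
echo "$failures"
```

In a real review, the failover log browser view should still be used to map the failed step to the correct table in this document.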

Replication Policy Failover Preparation (Data Sync and Config Sync)

Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings

Wait for other failover jobs to complete

Eyeglass only runs one failover job at a time.


Failover has not started.

  1. Wait until no Failover jobs are present in the running jobs window.  

  2. Restart failover

This step will wait for up to two hours in the running state before timing out.



SOURCE get POLICY info

Eyeglass cannot communicate with source cluster.


Failover fails.

  1. Validate connectivity between Eyeglass and the source cluster.

  2. Restart failover

This step is not run during uncontrolled failover.


Data Loss Impact - Failover cannot proceed; potential data loss scenario until resolved.


Wait for existing policy jobs to complete.



Failover fails if this step times out. The default failover timeout in the current release is 180 minutes; this can be increased with the igls CLI command (see the admin guide for increasing the timeout for each step in failover). Also see the best practices.

  1. Validate no other failover jobs are running on eyeglass.

  2. Restart failover.

  3. NOTE: SyncIQ policies that return an error from the cluster will require an EMC support case to be opened to resolve policies that will not start, run resync prep, or execute make-writeable operations.

The timeout on this step is read from the failover timeout value, in minutes.


Data Loss Impact - Failover cannot proceed; potential data loss scenario until resolved.

SOURCE remove schedule POLICY

Eyeglass cannot communicate with source cluster.


Failover fails.

  1. Validate connectivity between Eyeglass and the source cluster.

  2. Restart failover.

This step is not run during uncontrolled failover.


Data Loss Impact - Failover cannot proceed; potential data loss scenario until resolved.


Replication Policy Failover run all policies

(SyncIQ Data Sync option enabled on failover job and multiple policies failed over)

Final incremental sync of data on the filesystem has failed.


Failover Aborted.


Source and target are still in their initial state, meaning the target cluster is still read-only.

  1. Use the Eyeglass Job Details tree to determine which policies ran successfully, and which failed.

  2. Open OneFS on the source to determine the reason for policy failure(s).

    1. If policy job is still running, wait for it to complete, or cancel the job in OneFS.

    2. Manually run the policy, and see if it can be completed.

  3. Open Case with EMC for SyncIQ jobs failing to run.

  4. Restart Failover.

Eyeglass waits up to the failover timeout value, in minutes, for each policy to run, and will abort with a timeout if the incremental sync takes longer.  Wait for this job to finish, then restart the failover job to try again.


If the unsynced data on the source filesystem is unimportant and it's more important to get the target cluster file system writable, the Failover can be restarted with the “Data Sync” box unchecked.  This will skip the final incremental sync, and will result in data loss for any data not previously replicated with SyncIQ.


NOTE: This step is not run during uncontrolled failover.

Run Configuration Replication now.

(Config Sync enabled in failover job)


Final sync of configuration shares/exports/aliases has failed.


Failover Aborted.


Source and target are still in their initial state, meaning the target cluster is still read-only.

  1. Open the Eyeglass Jobs window, and switch to the running jobs tab.

  2. Select the most recent configuration replication job.

  3. Use the Job Details to determine the reason for failure.

  4. Address the config replication failure OR make note of the unsynced configuration data

  5. Restart failover.

If the configuration on source and target are known to be identical, the Config Sync option can be unchecked when restarting failover to skip this step.


If the unsynced configuration on the source is unimportant and it's more important to get the target cluster file system writable, the Failover can be restarted with the “Config Sync” box unchecked.  This will skip this step.


NOTE: This step is not run during uncontrolled failover.


DFS Mode


Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings

DFS share(s) rename failure on target or source

DFS clients will not switch clusters

  1. Manually remove the igls-dfs prefix from share(s) on the target cluster that did not get renamed (consult the failover log); this will complete the failover and clients will switch automatically.

  2. Manually add the igls-dfs prefix to share(s) on the source cluster that did not get renamed (consult the failover log); this will block client access to the source and switch clients to the target.

  3. Manually run allow writes from OneFS for the policies that were selected for failover.

  4. Manually run the related Quota jobs from Eyeglass.

  5. Manually run resync prep from OneFS for the policies that were selected for failover.

  6. Apply the SyncIQ policy schedule to the target cluster policies that were failed over.
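The rename steps above add or remove a literal prefix on the share name (assumed here to be the string igls-dfs-, per the steps; the share name is an example). Shell parameter expansion shows the transformation in both directions:

```shell
# DFS-mode failover toggles client visibility by renaming shares with a prefix.
# Prefix and share name below are assumptions for illustration.
share='igls-dfs-projects'

active_name="${share#igls-dfs-}"       # strip prefix: share visible to DFS clients
hidden_name="igls-dfs-${active_name}"  # add prefix: share hidden from DFS clients

echo "$active_name"
echo "$hidden_name"
```

The actual rename must be performed in the OneFS UI or CLI on the cluster identified in the failover log.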




Networking Operations


Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings

Rename source sc (Smartconnect) zone names & aliases

Failover failed during networking step.  


Auto Rollback of networking will be applied to place Smartconnect Zone names and aliases back on the source cluster.


Source and target clusters return to the same initial state as before failover; the file system will be read/write on the source, with SmartConnect zones returned to their original configuration.

  1. Use the info link in the job details tree to determine the reason for failure. (At this point, retrying Access Zone failover is likely to fail again. You are now switching to manual failover; use this table as a guide to the order of the steps that must be run: http://documentation.superna.net/eyeglass-isilon-edition/design/eyeglass-assisted-failover/access-zone-failover-guide)

  2. You will need to know the steps for these procedures (consult EMC documentation on these steps).

  3. Restart failover, but select only the SyncIQ failover job type, since Access Zone failover will attempt the networking failover again and is likely to fail again.  Select the policies that are required to fail over.

  4. Start failover with the policies selected and review the table for the manual step order.

This step is not run during uncontrolled failover.



Modify Source SPNs

SPN operation failures do not abort failover, but are logged in the failover log.


Failover continues.

  1. Post failover, open the failover log, and look for instructions listing which SPNs need to be fixed.

  2. Manually create/delete SPNs on the source cluster.

  3. Use the ADSIedit tool, with administrator access on the domain, to edit the source cluster machine account and remove the SPN entries (short and long) for each SmartConnect zone that failed over.


SPN operations on the source are proxied through the target cluster ISI commands during failover operations, so source cluster availability does not affect Eyeglass's ability to fix SPNs during failover to the target.
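As an alternative to ADSIedit, the SPN fix can also be expressed with the Windows setspn tool. The sketch below is a dry-run generator that only prints the commands a domain administrator would run; the zone and machine-account names are invented examples, and setspn itself must be run on a domain-joined Windows host:

```shell
# Dry-run generator for the manual SPN fix.
# setspn -D deletes an SPN; setspn -S adds one only if it is not already
# registered elsewhere. Repeat for both the short and long SPN forms.
gen_spn_fix() {
  zone="$1"; src_acct="$2"; tgt_acct="$3"
  echo "setspn -D HOST/$zone $src_acct"
  echo "setspn -S HOST/$zone $tgt_acct"
}

# example zone and machine accounts (assumptions, not real names)
gen_spn_fix prod.example.com 'SRCCLUSTER$' 'TGTCLUSTER$'
```

Remember the constraint noted later in this document: the SPN must be removed from the failed-over cluster's account before it can be added to the active cluster's account.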

Rename target sc (Smartconnect) zone names & aliases

Failover failed during networking step.


Rollback will be applied.


Source and target clusters return to the same initial state as before failover, with SmartConnect zones reverting to their original configuration.

  1. Use the info link in the job details tree to determine the reason for failure. (At this point, retrying Access Zone failover is likely to fail again. You are now switching to manual failover; use this table as a guide to the order of the steps that must be run: http://documentation.superna.net/eyeglass-isilon-edition/eyeglass-assisted-failover#TOC-Storage-Failover-with-Eyeglass-Failover-Modes-SyncIQ-Policy-Failover-DFS-Integrated-Failover-Access-Zone-Failover-Q4-2015-)

  2. You will need to know the steps for these procedures (consult EMC documentation on these steps).

  3. Restart failover, but select only the SyncIQ failover job type, since Access Zone failover will attempt the networking failover again and is likely to fail again.  Select the policies that are required to fail over.

  4. Start failover with the policies selected and review the table for the manual step order.


Modify Target SPNs

SPN operation failure does not abort failover.


Failover continues.

  1. Post failover, open the failover log, and look for instructions listing which SPNs need to be fixed.

  2. Manually create/delete SPNs on the source cluster.

  3. Use the ADSIedit tool, with administrator access on the domain, to edit the source cluster machine account and remove the SPN entries (short and long) for each SmartConnect zone that failed over.




Replication Policy Failover - All policies


To assist, use the failover log browser view to map each table to the correct section of the guide.



Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings Or Data Loss Impact

Replication Policy Failover all policies

One or more of the SyncIQ policies in the failover job did not successfully complete its failover operation.


This is the parent task containing all sub policies.


IMPORTANT

If at least one policy has successfully executed the make-writeable command, the rollback will not be applied, and the cluster is assumed to be failed over.  Any policy that is not writeable will need to be failed over manually, and will likely require EMC technical support to fix the root cause of the make-writeable failure.

  1. Use the table in the next section to determine why the SyncIQ policy did not successfully failover and address the issue.

  2. Use the Job Details for the policy failover jobs to determine if at least one policy has successfully executed the step named CLUSTERNAME allow writes POLICY PATH.

    1. If yes: the source cluster is failed over. To recover, create a new failover job of type SyncIQ, select all of the policies that failed or did not run, and start the SyncIQ failover job.

    2. If no: the cluster is not failed over yet.  After fixing the error for the SyncIQ policy, start a new Access Zone failover job.

    3. NOTE: SyncIQ policies that return an error from the cluster will require an EMC support case to be opened to resolve policies that will not start, run resync prep, or execute make-writeable operations.


This is a parent step containing sub-steps for each replication policy. Recovery from a failure in this step depends on how far the failover proceeded through the child steps.


Some steps should be reviewed for data loss impact.



Replication Policy Failover <Policy Name>

Note: On Eyeglass, step names in the following table contain the actual name(s) of the cluster(s) and the policy name.  These have been replaced with SOURCE, TARGET, and POLICY in the steps below.

To assist, use the failover log browser view to map each table to the correct section of the guide.



Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings OR Data Loss Impact

TARGET allow writes POLICY PATH

The target cannot be put into the writes-allowed state.


NOTE: As of release 1.6, all policies involved in a failover job are issued the make-writeable command before resync prep is executed.


Failover fails.

  1. Validate connectivity between Eyeglass and the target cluster.

  2. From the OneFS UI, manually allow writes again for the policy that failed on the target. Address any errors that may arise on OneFS from this step.

  3. NOTE: SyncIQ policies that return an error from the cluster will require an EMC support case to be opened to resolve policies that will not start, run resync prep, or execute make-writeable operations.

  4. For any policies where allow writes was not attempted due to an error, manually allow writes for those policies.

  5. Manually run the related Quota jobs from Eyeglass.

  6. If this is a controlled failover, manually run resync prep on the policy on the source cluster.

  7. If this is a controlled failover, manually set the schedule on the mirror policy.

  8. Consider this policy job a success, and relaunch failover according to the logic in the previous table.

Data Access Impact - Users will only have access to read-only data for any policy where allow writes failed or was not run due to an error.
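The manual recovery sequence above can be summarized as a command list. The policy name is an example; the isi subcommands shown (recovery allow-write, recovery resync-prep, jobs start) should be verified against the OneFS CLI guide for your release, and mirror policies are assumed to follow the <policy>_mirror naming convention used in this document:

```shell
# Print the manual OneFS CLI sequence for one failed policy (dry run).
# 'MyPolicy' is a placeholder; substitute the real policy name from the log.
policy='MyPolicy'

cmds="isi sync recovery allow-write $policy
isi sync recovery resync-prep $policy
isi sync jobs start ${policy}_mirror"

printf '%s\n' "$cmds"
```

Run the allow-write step on the target cluster; resync prep runs on the source and only applies to controlled failover.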



Replication Policy Failover - Recovery


Always review the DR Assistant failover log when using the table below.


See the screenshot below as a reference to the column headings when reviewing the failover log.

[Screenshot: Replication Policy Failover - Recovery]



Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings or Data loss Impact

Replication Policy Failover - Recovery

One or more of the SyncIQ policies in the failover job did not successfully complete its failover operation (multiple steps).


This is the parent task containing all sub policy steps (see screenshot above to see the order of parent and child steps)


IMPORTANT

The rollback will not be applied in the event of a failure in this section of the failover.  The networking and allow-writes steps have already succeeded, and the cluster is assumed to be failed over.  Any policy with an error in this section on resync prep, apply schedule, or run policy will need to be completed manually, and will likely require EMC technical support to fix the root cause of the failure.

  1. Use the table in the next section to determine why the SyncIQ policy did not successfully failover and address the issue.

NOTE: SyncIQ policies that return an error from the cluster will require an EMC support case to be opened to resolve policies that will not start or run resync prep; these steps will need to be completed manually.


This is a parent step containing sub-steps for each replication policy. Recovery from a failure in this step depends on how far the failover proceeded through the child steps.


Data Loss Impact - none.  Failover is complete for any policy even without this step running, BUT reprotecting the file system is blocked until the mirror policy is created and runs successfully.




Replication Policy Failover <Policy Name>

Note: On Eyeglass, step names in the following table contain the actual name(s) of the cluster(s) and the policy name.  These have been replaced with SOURCE, TARGET, and POLICY in the steps below.

See the screenshot below as a reference to the column headings when reviewing the failover log.


[Screenshot: Replication Policy Failover <Policy Name> recovery]


Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings

SOURCE resync prep POLICY

The mirror policy cannot be created.


Policy is failed over. Target cluster is active.


Overall failover status is failure.



NOTE: As of release 1.6, all policies involved in a failover job are issued the make-writeable command before resync prep is executed.

  1. Login to OneFS on the source cluster.

  2. Run the related Quota jobs manually. Once successful, delete the quotas from the source (it is Isilon best practice NOT to have a quota on a directory that is the target of a SyncIQ policy).

  3. Manually execute resync prep on the policy.  Address any errors that may arise.

  4. Manually set the schedule on the mirror policy.

  5. Consider this policy job a success.

  6. NOTE: SyncIQ policies that return an error from the cluster will require an EMC support case to be opened to resolve policies that will not start or run resync prep.

This step is not run during uncontrolled failover.

Data Loss Impact - none.  Failover is complete for any policy even without this step running, BUT reprotecting the file system is blocked until the mirror policy is created and runs successfully.


TARGET run POLICY_mirror

The mirror policy cannot be run.


Policy is failed over. Target cluster is active.


Overall failover status is failure.

  1. Login to OneFS on the target cluster.

  2. Manually run the mirror policy.  Address any errors.

  3. Manually set the schedule on the mirror policy.

  4. Consider this policy job a success.

  5. Contact Superna support for quota run procedures.

  6. NOTE: SyncIQ policies that return an error from the cluster will require an EMC support case to be opened to resolve policies that will not start or run resync prep.

This step is not run during uncontrolled failover.


Data Loss Impact - none.  Failover is complete for any policy even without this step running, BUT reprotecting the file system is blocked until the mirror policy is created and runs successfully.

TARGET set schedule

The schedule on the mirror policy cannot be set.


Policy is failed over.

Target cluster is active.


Overall failover status is failure.

  1. Login to OneFS on the target cluster.

  2. Manually set the schedule on the mirror policy.

  3. Consider this policy job a success.

  4. Quotas will not failover contact Superna support.

Data Loss Impact - none.  Failover is complete for any policy even without this step running, BUT reprotecting the file system is blocked until the mirror policy is created and runs successfully.




Run Quota Jobs Now Failover Recovery Steps


NOTE: Quota jobs are applied automatically during failover.  Contact Support for assistance.

See the screenshot below as a reference to the column headings when reviewing the failover log.


[Screenshot: Run Quota Jobs Now Failover Recovery Steps]



Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings

Quota partially failed over

  1. Disk limits are not applied on the target cluster.
  2. No data loss impact; the quota step runs last in the failover logic.
  3. Compare quotas manually on source and target using the OneFS UI; consult errors in active alarms to help identify quotas that did not fail over.


  1. Recreate the quotas manually and delete them on the source cluster.
  2. Open a case to get the list of failed quota creates or deletes; logs must be uploaded to the support case.
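The manual quota comparison can be sketched with standard text tools, assuming the quota paths have been exported from each cluster into plain lists (the paths below are invented samples; on the clusters, the OneFS quota listing would be the source of these lists):

```shell
# Compare quota paths exported from source and target clusters to find
# quotas that did not fail over. Sample data only.
src_quotas='/ifs/data/proj1
/ifs/data/proj2
/ifs/data/proj3'
tgt_quotas='/ifs/data/proj1
/ifs/data/proj3'

# comm requires sorted input; -23 prints lines only in the first file,
# i.e. quotas present on the source but missing on the target
printf '%s\n' "$src_quotas" | sort > /tmp/src_quotas.txt
printf '%s\n' "$tgt_quotas" | sort > /tmp/tgt_quotas.txt
missing="$(comm -23 /tmp/src_quotas.txt /tmp/tgt_quotas.txt)"
echo "$missing"
```

Any paths reported as missing should be recreated manually on the target, per the recovery steps above.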




Replication Policy Failover Finalize

Note: This step runs one child step per policy, listed as “Finalize quota for path <policy source path>”.  The table below describes failures of those steps; the following should be done for any failed steps, or steps that did not run.

See the screenshot below as a reference to the column headings when reviewing the failover log.


[Screenshot: Replication Policy Failover Finalize]


Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings or Data Loss Impact

Finalize quota for path: PATH

Delete Quotas on Source

Could not delete quotas from source.


Policy is failed over. Target cluster is active.


Failover overall status is failure.  


  1. On SOURCE OneFS, find all quotas on data protected by the SyncIQ policy.

  2. Validate that those quotas are present on the TARGET cluster.

  3. Delete these quotas from the SOURCE.

This step is not run during uncontrolled failover.


Data Loss Impact - none.  Failover is complete for any policy even without this step running, BUT reprotecting can be impacted by the presence of quotas on the source cluster.




Set configuration replication for policies to ENABLED

Could not enable Eyeglass Configuration Replication Jobs


Policy is failed over. Target cluster is active.


Failover overall status is failure.  

  1. Open the Eyeglass Jobs window.

  2. Select the configuration replication job and enable it.

  3. Use the log to determine the reason for failure.

This step enables the configuration replication jobs for the newly created mirror policies post failover (if they did not previously exist). Eyeglass will detect the new policy, and it will be enabled post failover to replicate configuration data back to the source cluster.


Data Loss Impact - none.  If this step fails, it blocks configuration from syncing back to the source cluster; replication can be enabled manually from the Jobs window.




Post Failover Script Execution

See the screenshot below as a reference to the column headings when reviewing the failover log.

[Screenshot: Post Failover Script Execution]


Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings or Data Loss Impact

Eyeglass Script Engine

A user supplied post-failover script failed.


Failover overall status is failure.  

  1. Use the script engine to fix errors in the failing scripts, and re-run those that failed.

  2. Use test script function to validate output and error codes returned to failover jobs

This step relies on user-supplied implementations.


Review the script output to verify that it executed correctly.  Error codes set by the script should fail the failover job if set correctly.  See the admin guide for the proper script exit-code values to indicate failure versus successful execution.


Data Loss Impact - This step should only result in failure to remount or start applications post failover.  Logs should be reviewed to ensure all steps completed, and any script failures should be corrected manually.
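A minimal skeleton of a post-failover script, assuming the convention described above that a zero exit code marks success and any nonzero exit code should fail the step (confirm the exact codes Eyeglass expects in the admin guide):

```shell
# Skeleton post-failover script. The exit code is what the failover job
# checks, so any failure must propagate to a nonzero exit.
post_failover() {
  # placeholder for remount / application-start logic (assumption)
  echo "remounting application hosts"
}

if post_failover; then
  status="success"; rc=0
else
  status="FAILED"; rc=1
fi
echo "post-failover script: $status (exit $rc)"
```

In the real script, `exit $rc` should be the final statement so the failover job sees the result; the test-script function in Eyeglass can be used to validate the output and return code before relying on it in a failover.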



Post Failover


This step completes after all critical steps have executed; if all steps passed to this point, then no data loss condition can result from failures here.  This section covers SmartConnect rollback to the source cluster (there are no DNS steps if dual delegation is configured).


See the screenshot below as a reference to the column headings when reviewing the failover log.


[Screenshot: Post Failover]

Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings or Data Loss Scenario

Check Network Health

The Networking Rollback job could not be initiated.


There is only an impact if the failover did not reach the make-writeable step of a policy.

  1. Use the failover log to determine the networking operations that were performed during failover.

  2. Use OneFS to manually revert the networking operations on SOURCE and TARGET as required.

The rollback logic is only executed if the failover job failed before the make writeable step on a policy.


Data Loss Impact - In this scenario, failover never completed and the source cluster data is still the production copy.  A potential data loss scenario exists if the source data was deemed not usable.  Contact EMC for assistance correcting the make-writeable step; also consult the best practices, which cover scenarios that block the make-writeable step.


Post Failover Inventory

The post failover Inventory job failed.


Failover is successful, Eyeglass UI may be out of date.

  1. Validate source and target cluster connectivity with Eyeglass.

  2. Open the alarms window, and look for any alarms related to configuration replication.

  3. Manually run configuration replication, or wait until the next automatic cycle.

This step is not run during uncontrolled failover.


Data Loss Impact - None. This step updates the Eyeglass UI.

Post Failover Readiness Task

The post failover Readiness job failed.


Failover is successful, Eyeglass UI may be out of date.

  1. Validate source and target cluster connectivity with Eyeglass.

  2. Manually run the Access Zone Readiness job, or wait until the next automatic cycle.

This step is not run during uncontrolled failover.


Data Loss Impact - None. This step updates the Eyeglass UI.




Check Client Access (Manual Step)

Failure to Complete Step

Description of Impact to Failover

Recovery Steps

Special Instructions or Warnings

DNS smartconnect zone validation

Note: Dual delegation switches DNS automatically with networking failover available in Access Zone Failover

Run nslookup against the SmartConnect zone name:

  1. Confirm the IP address returned is from the target cluster IP pool.

  2. Repeat for all SmartConnect zones that were failed over.

  3. If incorrect, update the target cluster by using the isi command to create the missing alias on the IP pool.

  4. If the IP returned is still the source cluster, rename the source cluster SmartConnect zone name to ensure dual delegation will not use this cluster's DNS service to answer queries.
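The nslookup check above reduces to a membership test: does the resolved address belong to the target cluster's IP pool? A minimal sketch, with example addresses (in practice the resolved value would come from nslookup or dig output, and the pool list from the target cluster's network configuration):

```shell
# Check that the address a client resolves for a SmartConnect zone
# belongs to the target cluster's pool. All addresses are examples.
check_failover_ip() {
  resolved="$1"; shift
  for ip in "$@"; do
    if [ "$resolved" = "$ip" ]; then
      echo "OK: $resolved is in the target pool"
      return 0
    fi
  done
  echo "FAIL: $resolved is not in the target pool"
  return 1
}

# resolved address followed by the target pool's addresses
check_failover_ip 10.2.1.15 10.2.1.15 10.2.1.16
```

A FAIL result means DNS is still answering with the source cluster, and the alias/rename steps above apply.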


Refresh session to pick up DNS change

SMB client unable to access data on the failover target cluster despite successful failover, DNS updates, and session refresh

Check SPNs using the ADSI Edit tool and confirm that:

  1. Failover source cluster - the SPN for the SmartConnect zone that the client is using does NOT exist.

  2. Failover target cluster - the SPN for the SmartConnect zone that the client is using DOES exist.

  3. If the above condition is not met, use ADSI Edit to move the SPN to the correct cluster.


You cannot create a missing SPN on the active cluster if it still exists for the failed-over cluster.  You must remove it from the failed-over cluster first, and then add it to the active cluster.


SMB Direct Mount Shares

Dual delegation updates DNS but requires clients to remount and query dns to get a new cluster ip address

net use \\server\share /delete


net use \\server\share


Or use map network drive in Explorer


NFS mounts

Dual delegation updates DNS but requires clients to remount and query dns to get a new cluster ip address

umount -fl /path/of/mount  (force and lazy unmount to handle open files)

mount -a  (reads the fstab file and remounts)


DFS clients

Clients switch automatically.

No action needed.