Eyeglass and Isilon DR Best Practices

Best Practices for Failover with Eyeglass



This section is a collection of best practices. Details on configuration are in the admin guide. This section aims to give quick, short descriptions of best practices for Eyeglass and SyncIQ in one easy-to-read place.



IMPORTANT READ this --- Before any Planned Failover you MUST read this support statement

    1. Support statement on Eyeglass Release.

    IMPORTANT READ this --- Do not attempt failover without completing this step. Best Practice for Fast Failback and Pre Failover Steps

    1. Run domain mark manually on all SyncIQ paths following instructions in online ISILON documentation.

      Create a SyncIQ domain

      1. You can create a SyncIQ domain to increase the speed at which failback is performed for a replication policy.
      2. Failing back a replication policy requires that a SyncIQ domain be created for the source directory. OneFS automatically creates a SyncIQ domain during the failback process. However, if you intend to fail back a replication policy, it is recommended that you create a SyncIQ domain for the source directory of the replication policy while the directory is empty.


    Procedure: Create a protection domain
    You can create replication or snapshot revert domains to facilitate snapshot revert and failover operations. You cannot create a SmartLock domain. OneFS automatically creates a SmartLock domain when you create a SmartLock directory.


    1. Click Cluster Management > Job Operations > Job Types

    2. In the Job Types area, in the DomainMark row, from the Actions column, select Start Job.

    3. In the Domain Root Path field, type the path of the directory you want to create a protection domain for.

    4. From the Type of domain list, specify the type of domain you want to create.

    5. Ensure that the Delete this domain check box is cleared.

    6. Click Start Job.

    7. Confirm that the step completed

        1. Run this on the source cluster: isi_classic domain list

        2. If domain mark ran successfully on all policies, the output shows a SyncIQ domain for the path of each SyncIQ policy that has been created

        3. If any paths are missing, repeat the procedure above for those paths
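The confirmation step above can be partly automated. The following is a minimal sketch, assuming you have captured the text output of the domain list command and that each domain appears on one line containing both the path and the word SyncIQ. The function name and the sample text are hypothetical; the sample is a made-up layout for illustration, not real command output.

```python
def missing_synciq_domains(domain_list_output, policy_paths):
    """Return the policy paths with no line that names both the path and SyncIQ."""
    missing = []
    for path in policy_paths:
        found = False
        for line in domain_list_output.splitlines():
            fields = line.split()
            # Require an exact path match so /ifs/data does not match /ifs/data2.
            if path in fields and "synciq" in line.lower():
                found = True
                break
        if not found:
            missing.append(path)
    return missing


# Hypothetical captured text (made-up layout, NOT real command output).
sample_output = """\
/ifs/data/userdata/share1 SyncIQ
/ifs/data/userdata/share2 Snapshot
"""

if __name__ == "__main__":
    paths = ["/ifs/data/userdata/share1", "/ifs/data/userdata/share2"]
    print(missing_synciq_domains(sample_output, paths))
```

Any path returned by the function needs domain mark run on it before failover. Adjust the line matching to the real output of your OneFS release.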


    IMPORTANT READ this --- Failover timeouts with Eyeglass - Cluster Operations that can take longer than planned


    The following section is very important to review. If you have never failed over a policy, then you have never run domain mark. Eyeglass and Isilon require the domain mark job to run on the source cluster before failover. The following conditions WILL increase the time to run cluster operations; if you have policies that match these criteria, increase the timeout for Eyeglass failover jobs.


    Policy criteria for increased timeout:

    1. Many TB of data protected by a single SyncIQ policy (many is not precise, but if you think it's a lot of data for your environment then this applies to you)

    2. Many small files (same as above: if you know it has a lot, then it likely does and this applies to you)

    3. You have daily schedules for SyncIQ AND you have a high change rate in GB per day AND policies normally take over 1 hour to run each day


    If you have policies that match the criteria above, running domain mark in advance of a failover, as recommended above, is a MUST DO. Domain mark can take hours, so please do this before failover.


    When Eyeglass starts a cluster task (for example start resync prep, run policy, or even make writeable) for policies that match the criteria above, the per-task timeout should be increased from the default of 180 minutes to a larger value, chosen by looking at the RPO graph or report of the policy you are planning to fail over. Do this before attempting a failover or failback of a policy that matches the above criteria.


    How to change the timeout


    igls adv failovertimeout set --minutes 360
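One way to pick the value is to pad the longest normal run time seen in the policy's RPO report. This is a minimal sketch under that assumption; the function name and the safety factor of 2 are illustrative choices, not Eyeglass guidance. Only the 180 minute default and the command syntax come from this document.

```python
import math

DEFAULT_TIMEOUT_MINUTES = 180  # default per-task timeout stated above


def suggested_timeout_minutes(longest_run_minutes, safety_factor=2.0):
    """Pad the longest observed policy run and never go below the default.

    safety_factor is an assumption for illustration; choose your own margin.
    """
    padded = math.ceil(longest_run_minutes * safety_factor)
    return max(DEFAULT_TIMEOUT_MINUTES, padded)


if __name__ == "__main__":
    minutes = suggested_timeout_minutes(longest_run_minutes=200, safety_factor=1.5)
    # Print the command to run on the Eyeglass appliance with the chosen value.
    print(f"igls adv failovertimeout set --minutes {minutes}")
```

If the result equals the default, the policy's history alone does not call for an increase; the other criteria above may still apply.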




    IMPORTANT READ this --- Failover make writeable step will fail in case of unexpected snapshots


    1. There must be one failover snapshot on the target cluster per SyncIQ Policy.  If there is no failover snapshot, the Allow Writes step will fail.  Confirm that you have a snapshot per SyncIQ Policy by running the command below for each policy in the failover.


    On Source cluster

    isi snapshot snapshots list | grep <SyncIQ Policy path on source cluster>

    Replace <SyncIQ Policy path on source cluster> with your SyncIQ Policy source cluster path

    Example for expected configuration:

    isi snapshot snapshots list | grep /ifs/data/userdata/share1

    12345 SIQ-Failover-policy1-2016-05-25_21-33-37 /ifs/data/userdata/share1


    On Target cluster

    isi snapshot snapshots list | grep <SyncIQ Policy path on target cluster>

    Replace <SyncIQ Policy path on target cluster> with your SyncIQ Policy target cluster path. NOTE: this is the same path as above if the target cluster path is the same. If the path on the target cluster was changed, use the target cluster path for the command.

    Example for expected configuration:

    isi snapshot snapshots list | grep /ifs/data/userdata/share1

    12345 SIQ-Failover-policy1-2016-05-25_21-33-37 /ifs/data/userdata/share1
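When many policies are involved, the check above can be scripted. The sketch below assumes the snapshot list output has been saved as text and that each line has the three columns shown in the example (ID, snapshot name, path). The function name is hypothetical.

```python
def failover_snapshots_for_path(snapshot_list_output, policy_path):
    """Return names of SIQ-Failover snapshots whose path equals policy_path."""
    names = []
    for line in snapshot_list_output.splitlines():
        fields = line.split()
        # Expected columns, as in the example above: ID, name, path.
        if len(fields) >= 3 and fields[1].startswith("SIQ-Failover-") and fields[2] == policy_path:
            names.append(fields[1])
    return names


# The example line from this document.
sample_output = "12345 SIQ-Failover-policy1-2016-05-25_21-33-37 /ifs/data/userdata/share1\n"

if __name__ == "__main__":
    found = failover_snapshots_for_path(sample_output, "/ifs/data/userdata/share1")
    # Exactly one failover snapshot per policy is the expected configuration.
    print("OK" if len(found) == 1 else f"CHECK: found {len(found)} failover snapshots")
```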


    2. Additional snapshot scenarios that will cause the Allow Writes step to fail

    Eyeglass Release 1.6.3 and lower:  must manually check to determine whether this condition exists

    Eyeglass Release 1.7 and higher:  OneFS SyncIQ Readiness Validation by Eyeglass - Corrupt  Failover Snapshots


    Scenario 1:

    If the target cluster has a leftover snapshot with the format -

    SIQ-<policy-id>-restore-latest

    from previous failovers or SyncIQ jobs, the Allow Writes step will fail (it attempts to create a snapshot with the same name and fails).

    Scenario 2:

    If the target cluster has a leftover snapshot with the format -

    SIQ-<policy-id>-restore-new

    from previous failovers or SyncIQ jobs, the Allow Writes step will keep running with the status "enabling writes" (it attempts to create an intermediate snapshot with the same name before creating the snapshot with the *restore-latest suffix, and fails). This leads to the Allow Writes step timing out in Eyeglass.
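For releases where this must be checked manually, the two name formats above can be matched in a list of snapshot names taken from the target cluster. This is a minimal sketch; the function name and the sample policy ID are hypothetical.

```python
import re

# Matches the two leftover formats described above:
# SIQ-<policy-id>-restore-latest and SIQ-<policy-id>-restore-new
LEFTOVER_PATTERN = re.compile(r"^SIQ-.+-restore-(latest|new)$")


def leftover_restore_snapshots(snapshot_names):
    """Return the snapshot names that match either leftover format."""
    return [name for name in snapshot_names if LEFTOVER_PATTERN.match(name)]


if __name__ == "__main__":
    names = [
        "SIQ-Failover-policy1-2016-05-25_21-33-37",  # expected failover snapshot
        "SIQ-abc123-restore-latest",                 # hypothetical leftover, Scenario 1
        "SIQ-abc123-restore-new",                    # hypothetical leftover, Scenario 2
    ]
    for name in leftover_restore_snapshots(names):
        print("leftover snapshot:", name)
```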


    IMPORTANT READ this --- Failover resync prep step may fail if all nodes do not have same date/time

    If the nodes on a cluster do NOT all have the same date/time, failover can run into problems: resync prep consists of multiple steps, and the last step can carry a timestamp "before" the first step. Then, even though resync prep has completed on the cluster, Eyeglass does not consider it completed.


    ACTION: ssh to each node on the target cluster and determine the date/time on each node.  Compare and ensure that they are showing the same day and time.  If they are NOT the same, resolve this before attempting a failover.
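Once the date/time of each node has been recorded, the comparison can be done as below. This is a minimal sketch; the function name, the node names, and the 5 second tolerance are assumptions for illustration. Times gathered by hand over ssh are read a few seconds apart, so allow for that when choosing a tolerance.

```python
from datetime import datetime, timedelta


def clock_skew(node_times, tolerance=timedelta(seconds=5)):
    """Return (skew, within_tolerance) for a dict of node name -> datetime.

    tolerance is an illustrative assumption, not a documented limit.
    """
    earliest = min(node_times.values())
    latest = max(node_times.values())
    skew = latest - earliest
    return skew, skew <= tolerance


if __name__ == "__main__":
    # Hypothetical readings from three nodes.
    readings = {
        "node-1": datetime(2016, 5, 25, 21, 33, 37),
        "node-2": datetime(2016, 5, 25, 21, 33, 38),
        "node-3": datetime(2016, 5, 25, 21, 35, 7),
    }
    skew, ok = clock_skew(readings)
    print(f"skew={skew} ok={ok}")
```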


    IMPORTANT READ this --- Set SyncIQ Schedule to Manual prior to failover - Eyeglass Release < 1.6.3

    Set your SyncIQ schedule to Manual prior to running failover, and keep a record of the schedule.  This prevents the SyncIQ Policy from running on its own while failover operations are being done against it.  Failover by default will run a final data sync (do NOT uncheck the Data Sync option, to ensure this).

    After failover, update Mirror policy manually with schedule from production policy.

    Eyeglass 1.6.3 and higher does not require this as schedules are removed and cached at the beginning of the failover and reapplied at the end of the failover.

    Best Practice General:


    1. Eyeglass - We recommend DFS mode for SMB share protection and DR

    2. Eyeglass - We recommend Access Zone Failover when NFS and SMB data needs to failover together

    3. Eyeglass - We recommend syncIQ policy mode failover for customers with small numbers of NFS exports and hosts for automation

    4. Eyeglass - We recommend Access Zone failover when multi-protocol SMB/NFS is required within a single Access Zone OR when only NFS DR protection is required

    5. Eyeglass - For NFS-only failover, use the simpler per-policy failover with Eyeglass, then unmount and remount using the new DR SmartConnect zone name.  It’s faster and requires less planning and configuration than Access Zone Failover

    6. Eyeglass - Multi-protocol failover allows both protocols to fail over together using Access Zone failover

    7. Isilon - For SyncIQ best practices for system-level recovery, refer to the EMC document (Isilon - Backup and recovery guide)  https://www.emc.com/collateral/TechnicalDocument/docu56055_onefs-backup-recovery-guide-7-2.pdf

    8. Eyeglass - Create SmartConnect mapping alias hints on all IP subnet pools; hint the SyncIQ SmartConnect zone with ignore to ensure it's not failed over

    9. Eyeglass - Delegate machine account credentials to cluster machine accounts in Active Directory

    10. Eyeglass - Enable phone home support for faster support response times

    11. Eyeglass - Configure Run Book Robot Access Zone and policies to ensure failover and failback is functioning daily

    12. Isilon - Always use FQDN on Smartconnect zone names

    13. Isilon - Create a SyncIQ Failback Domain to ensure fail back operations take less time

      1. Create a SyncIQ domain. You can create a SyncIQ domain to increase the speed at which failback is performed for a replication policy. Because you can fail back only synchronization policies, it is not necessary to create SyncIQ domains for copy policies.

      2. Failing back a replication policy requires that a SyncIQ domain be created for the source directory. OneFS automatically creates a SyncIQ domain during the failback process. However, if you intend to fail back a replication policy, it is recommended that you create a SyncIQ domain for the source directory of the replication policy while the directory is empty. Creating a domain for a directory that contains less data takes less time.

      3. Procedure:

        1. Click Cluster Management > Job Operations > Job Types.

        2. In the Job Types area, in the DomainMark row, from the Actions column, select Start Job.

        3. In the Domain Root Path field, type the path of a source directory of a replication policy.

        4. From the Type of domain list, select SyncIQ.

        5. Ensure that the Delete domain check box is cleared.

        6. Click Start Job.

    14. Isilon - Create an IP pool and SmartConnect zone that is used only for SyncIQ, and create policies with the option to run the policy only on nodes in that subnet IP pool / SmartConnect zone.

      1. Select option to Connect to nodes in the target smartconnect zone when creating policies

    15. Isilon - Don't mount data using the SyncIQ smartconnect zone, use other IP pools and smartconnect zones for users to mount data



    This section covers key topics to review before planning DR with Eyeglass

    SyncIQ Performance Tuning Best Practices

    Consult the document below to tune SyncIQ job worker threads per node for high latency WAN and faster SyncIQ node operations (syncing, make writeable, resync prep steps).


    OneFS 7 and 8 are both covered in the document below.


    https://www.emc.com/collateral/hardware/white-papers/h8224-replication-isilon-synciq-wp.pdf

    Data Loss Considerations

    When SyncIQ is set to a schedule or to on-change mode, it’s important to understand the data loss impact on failover operations.


    1. When a SyncIQ job is running and an Eyeglass failover job is started, the default behaviour is to attempt a final data sync by running the SyncIQ policies in the job.

      1. If there is an existing SyncIQ Job running, Eyeglass failover will wait a maximum of 1 hour for the running SyncIQ Policy job to complete.

      2. For urgent failover requirements, skip config sync and data sync by unselecting those options in the DR Assistant UI.

      3. If the SyncIQ Job has not completed within an hour, an error is returned and the failover is aborted.

      4. Data Loss impact -  Since SyncIQ is snapshot based, changes that have occurred since the start of the existing running job will be lost. Depending on the start time of the currently running job, this could represent a large amount of data

    2. Mitigate Data Loss - Login to Isilon to verify whether a SyncIQ Job is running for the policies being failed over.   

      1. Steps: Wait for the running job to complete and then start the failover.  You may also consider disconnecting client access at this point to ensure that there is not a large amount of data that requires replication during SyncIQ Job run by the failover.

      2. Set the SyncIQ Job schedule to manual before starting a failover. Eyeglass will run the SyncIQ policy as part of the failover procedure.
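The timing rules above can be reasoned about with simple arithmetic. The sketch below is illustrative only and is not how Eyeglass is implemented: the function names are hypothetical, and the expected job end time is an estimate you would take from the policy's run history. Only the 1 hour maximum wait comes from this document.

```python
from datetime import datetime, timedelta

MAX_WAIT = timedelta(hours=1)  # maximum wait for a running SyncIQ job, stated above


def changes_at_risk_window(running_job_start, failover_start):
    """Time span of changes made since the running job started.

    SyncIQ is snapshot based, so these changes are not part of the running job.
    """
    return failover_start - running_job_start


def failover_would_abort(expected_job_end, failover_start):
    """True if the running job is expected to need more than the maximum wait."""
    return expected_job_end - failover_start > MAX_WAIT


if __name__ == "__main__":
    job_start = datetime(2016, 5, 25, 8, 0)
    failover_start = datetime(2016, 5, 25, 9, 30)
    print("changes at risk span:", changes_at_risk_window(job_start, failover_start))
    print("abort expected:", failover_would_abort(datetime(2016, 5, 25, 10, 45), failover_start))
```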

    Best practices for DFS mode Failover Design:

    1. DO Use Domain based DFS roots

    2. DO Use DFS referral ordered list to select production UNC path as default first in the list to speed up referral processing and mount times

    3. DO Use UNC path targets that point to SmartConnect zones

    4. DO Name SmartConnect zones differently on source and target clusters so that debugging with dfsutil.exe is easier and SmartConnect can load the cluster nodes both during normal operations and after failover

    5. DO Group one or more SyncIQ policies by name and enable DFS mode in Eyeglass to failover related SyncIQ policies with DFS.  (No hard rule requires this but it's easier to manage groups of related DFS failover if the names have similar prefix)

    6. DO Create dedicated IP pools on source and target clusters for DFS protected data

      1. Within an Access Zone, create igls-ignore hints to ensure smartconnect zones are not failed over with Access Zone failover


    Best practices for Access Zone and per SyncIQ mode Failover Design


    Sub Access Zone failover means that a SyncIQ policy within an Access Zone is used for failover of the data protected by the policy.  This is supported, but it limits the amount of automation possible.


    1. Don’t attempt Failover of a single SyncIQ policy within an Access zone unless you are prepared for manual steps below.

      1. There is no method to map a SyncIQ policy to a SmartConnect zone used by clients to mount the data.  Incorrect configuration, or failing over a SmartConnect zone using an alias could impact other clients using the SmartConnect zone.  Eyeglass can not failover SmartConnect zones without risk of causing inaccessible data on the production cluster unless ALL Smartconnect Zones are failed over to the target cluster.

      2. The storage admin is responsible to failover the SmartConnect zone manually in this scenario

      3. The SPN delete of the access zone and creation on the target cluster is also a manual step the storage admin must execute using ISI commands.

    2. Do configure Access zone failover and design DR to failover all policies and SmartConnect zones in the access zone

    3. Do configure all SyncIQ policies to be at the same level as the Access Zone base path or lower in the file system

    4. Do create shares or exports underneath the path of  SyncIQ policies  to ensure they are automatically protected as well.

    5. Do setup subnet:pool mappings for Access Zone failover using hints to map pools

    6. Do setup Runbook Robot Advanced with Access zone configuration and verify it succeeds before attempting an Access zone failover

    7. Do Use DFS mode for SMB within an Access Zone Failover Multi Protocol design

    8. Don’t Failover with Eyeglass per SyncIQ level failover unless you understand the limitations below.

      1. To allow partial failover of one or more single SyncIQ policies within an Access Zone, the following constraints apply:

      2. Any smartconnect zones used are assumed to be manually failed over with aliases and DNS updates to point DNS at target cluster smartconnect ip address

      3. AD SPN creation on target and  deletion on source cluster is manual, since Eyeglass does not know which smartconnect zones and SPN’s are required on the source cluster after a policy is failed over leaving data accessible on the source cluster
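Two of the layout rules above (policies at or below the Access Zone base path, shares and exports underneath a policy path) can be checked from a plain list of paths. This is a minimal sketch with hypothetical function names and example paths; it works on path strings only and does not query the cluster.

```python
from pathlib import PurePosixPath


def is_at_or_below(path, base):
    """True if path equals base or lies underneath it."""
    p, b = PurePosixPath(path), PurePosixPath(base)
    return p == b or b in p.parents


def check_layout(zone_base, policy_paths, share_paths):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for policy in policy_paths:
        if not is_at_or_below(policy, zone_base):
            problems.append(f"policy {policy} is not at or below zone base {zone_base}")
    for share in share_paths:
        if not any(is_at_or_below(share, policy) for policy in policy_paths):
            problems.append(f"share/export {share} is not under any SyncIQ policy path")
    return problems


if __name__ == "__main__":
    # Hypothetical paths for illustration.
    for problem in check_layout(
        zone_base="/ifs/data/zone1",
        policy_paths=["/ifs/data/zone1/smb", "/ifs/archive"],
        share_paths=["/ifs/data/zone1/smb/home", "/ifs/data/zone1/nfs/export1"],
    ):
        print(problem)
```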



    DNS Configuration for Access Zone Failover - Best Practice


    1. Updating the DNS NS delegation for the SmartConnect Zones is the last step in the failover process; it points the delegation at the SmartConnect Service IP (SSIP) on the target cluster (typically at the DR site).

    2. This NS record is setup to point at the SSIP of the production cluster for the Smartconnect Zones within the Access Zone that will be failed over.  

    3. SmartConnect Zone aliases will also have NS records to delegate the alias entries as well to the SmartConnect Zone SSIP that has the alias assigned.

    4. Delegation should use an A record for each SSIP but the Delegation for the NS should use a CNAME that points to the A record.  This is best practice and simplifies the update on failover of the CNAME to point at the DR cluster SSIP A record


    Best Practice for Protecting Data for HA and Failover with Eyeglass


    1. DO - Organize data into per-protocol failover policies (for example, policies for SMB and policies for NFS) to take advantage of DFS mode

    2. DO - Organize Data / SyncIQ Policies / Shares / Exports / Aliases / Quotas by Zone for failover

    3. DO - Group Shares/Exports/Aliases into Zones based on which data sets need to be failed over together.

    4. DO - Map each subnet/pool clients use to access data to a target cluster subnet/pool using Eyeglass hint aliases

    5. DON’T -  Put SyncIQ policies at a level above the Access Zone root directory

    6. DON’T -  Use excludes and includes in your SyncIQ Policy.  Excluded directory will be read-only after failover.  For DFS mode, share on source cluster related to excluded path is not preserved



    Best Practice for Isilon Networking

    1. It’s best to use fewer ip pools to simplify DNS, Alias creation on failover and reduce updates to DNS required for failover.

      1. Example:

        1. SmartConnect Zone for Data

        2. SmartConnect Zone for SyncIQ

        3. SmartConnect Zone for management (Eyeglass and other applications)

        4. SmartConnect Zone for Backup


    Best Practice for Kerberos Service Principal Names (SPN’s)


    1. Use Eyeglass DFS mode to limit kerberos authentication issues for cluster machine accounts

    2. NTLM fallback may already be disabled, or Microsoft patches or new OS’s may disable it, so you don’t want your DR strategy depending on authentication fallback to a legacy protocol.   It’s best to ensure SPN’s are accurate for Kerberos authentication and use Access Zone failover as the unit of failover.


    Best practice to verify the following on all DNS


    1. To prevent giving out stale DNS entries, the DNS time-to-live (TTL) on the NS delegations should be set to zero, or as close to zero as possible, so that the DNS information is as fresh as possible.

    2. Certain clients perform DNS caching and might not connect to the node with the lowest load if they make multiple connections within the lifetime of the cached address.

    3. Do not create reverse DNS entries, also known as pointer (PTR) records, for Isilon SmartConnect service IP addresses or SmartConnect zone names. SmartConnect does not provide reverse lookups.

    4. DO Make sure forward and reverse lookups are created in DNS for the Subnet Service IP’s used by Eyeglass when adding clusters, to ensure TLS connections are established correctly.


    Best practice DNS delegation of NS records

    This section describes best practices for DNS delegation for Isilon clusters.


    1. Delegate to address (A) records, not to IP addresses. The SmartConnect service IP on an Isilon cluster must be created in DNS as an address (A) record, also called a host entry. An A record maps a hostname such as www.superna.net to its corresponding IP address. Delegating to an A record means that if you ever need to failover the entire cluster, you can do so by changing just one DNS A record. All other nameserver delegations can be left alone. In many enterprises, it is easier to have an A record updated than to update a name server record, because of the perceived complexity of the process.

    2. Use one name server record for each SmartConnect zone name or alias. We recommend creating one delegation for each SmartConnect zone name or for each SmartConnect zone alias on a cluster. This method permits failover of only a portion of the cluster's workflow—one SmartConnect zone—without affecting any other zones. This method is useful for scenarios such as testing disaster recovery failover and moving workflows between data centers.

    3. Follow consistent mount paths

      1. Mount entries for any NFS connections must have a consistent mount point, in the format of sczonename.domain.com:/ifs/path. This way, when you fail over, you don't have to manually edit your fstab or automount entries.

    4. Use Access Zones to compartmentalize your data based on importance.  If your environment is OneFS 7.1.1 or later and you use access zones, you must define an access zone root path to help segment data into the appropriate access zone and enable the data to be compartmentalized. This is similar to what Celerra or VNX administrators might do if they have a VDM that has its own root file system. So, in addition to the default System access zone, you must add another layer. For example: /ifs/clustername/accesszonename/

    5. Recommend to your client system administrators that they turn off client DNS caching, where possible. To handle client requests properly, SmartConnect requires that clients use the latest DNS entries. If clients cache SmartConnect DNS information, they might connect to incorrect SmartConnect zone names. In this situation, SmartConnect might not appear to be functioning properly.

    6. Do NOT: We do not recommend creating a single delegation for each cluster and then creating the SmartConnect zones as sub records of that delegation
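The consistent mount path rule in this list can be checked against the source field of fstab or automount entries. This is a minimal sketch, assuming IPv4 or hostname sources in the host:/path form shown above; the function name and the example values are hypothetical.

```python
import ipaddress


def check_nfs_mount_source(source):
    """Return problems found in a mount source such as sczonename.domain.com:/ifs/path."""
    problems = []
    host, sep, path = source.partition(":")
    if not sep:
        return ["missing ':' between host and path"]
    try:
        ipaddress.ip_address(host)
        problems.append("host is an IP address, not a SmartConnect zone name")
    except ValueError:
        # Not an IP address; expect a fully qualified name with a domain part.
        if "." not in host:
            problems.append("host is not a fully qualified name")
    if not path.startswith("/ifs/"):
        problems.append("path does not start with /ifs/")
    return problems


if __name__ == "__main__":
    for entry in ["sczonename.domain.com:/ifs/path", "10.1.1.10:/ifs/path", "sczone:/data"]:
        print(entry, "->", check_nfs_mount_source(entry) or "ok")
```

Entries that mount by IP address or short name are the ones that would need manual edits after a failover.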


    SmartConnect service IPs: Each cluster needs only one SmartConnect service IP (SSIP), as long as there are no firewalls between the infrastructure DNS servers and the SSIP that block TCP and UDP port 53. It doesn’t matter how many domains or subnets the cluster is joined to or participates in. SmartConnect is essentially a very selective DNS server that answers only for the SmartConnect zone names and SmartConnect zone aliases that are configured on it. A DNS server doesn’t have to respond with an IP address from the subnet that the DNS server is in: it responds only with the correct IP address based on the name being looked up. Which subnet the DNS server resides in is irrelevant.


    This means that failover to the target cluster can update the A record to point to the SSIP of the target cluster using the hints mapping described below for Eyeglass to create aliases in the correct SmartConnect subnet on the target.


    Best practice - Do DR Testing with RunBook Robot for Access Zones

    Note:  Runbook Robot performs an Access Zone Failover and allows testing of Access Zone failover on non-production access zones

    1. It is best practice to setup an environment with non-production data and shares / exports / quotas representative of the production environment and run Failover and Failback testing to understand the failover operation in your environment with Eyeglass DR Assistant.

    2. It is best practice to set up SyncIQ Robot for regular automated Failover and Failback for non-production data and shares / exports / quotas in your environment.