AWS Cross-Region Replication (CRR) Failover

Introduction

AWS Cross-Region Replication (CRR) is used to replicate objects across Amazon S3 buckets in different AWS Regions. AWS Same-Region Replication (SRR) is used to replicate objects across Amazon S3 buckets in the same AWS Region.

This document describes how to use the CRR and SRR functionality of Amazon S3 with the Nasuni platform, and the process a customer would use to fail over to the configured destination bucket if the primary bucket or Region fails.

Note: Nasuni recommends using this process only in a complete disaster recovery scenario, and it requires Nasuni's direct involvement.

AWS S3 Cross Region Replication

Cross-region replication (CRR) enables automatic, asynchronous copying of objects across buckets in different AWS Regions. Buckets configured for cross-region replication can be owned by the same AWS account or by different accounts.

Cross-region replication is enabled with a bucket-level configuration. You add the replication configuration to your source bucket. In the minimum configuration, you provide the following:

The destination bucket, where you want Amazon S3 to replicate objects
An AWS IAM role that Amazon S3 can assume to replicate objects on your behalf
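As an illustration of the minimum configuration described above, the following sketch builds a replication configuration in the shape that the S3 PutBucketReplication API accepts. The role ARN, bucket names, and rule ID are hypothetical placeholders, and leaving delete-marker replication disabled is an assumption made only for this example, not a Nasuni recommendation.

```python
import json


def build_replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Return a minimal replication configuration for a source bucket.

    Only the two items the text calls out are caller-supplied: the IAM role
    that Amazon S3 assumes, and the destination bucket.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-all-objects",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # an empty filter applies the rule to every object
                # Assumption for this sketch only; decide this per deployment.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


if __name__ == "__main__":
    config = build_replication_config(
        "arn:aws:iam::111122223333:role/example-crr-role",  # hypothetical role
        "example-destination-bucket",  # hypothetical bucket
    )
    print(json.dumps(config, indent=2))
    # With boto3 installed and credentials configured, the configuration
    # could be applied to the source bucket roughly like this:
    #   boto3.client("s3").put_bucket_replication(
    #       Bucket="example-source-bucket",
    #       ReplicationConfiguration=config,
    #   )
```

The configuration is attached to the source bucket; the destination bucket appears only as an ARN inside the rule.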

When to Use CRR

Cross-region replication can help you do the following:

Meet compliance requirements: Although Amazon S3 stores your data across multiple geographically distant Availability Zones by default, compliance requirements might dictate that you store data at even greater distances. Cross-region replication allows you to replicate data between distant AWS Regions to satisfy these requirements.
Disaster recovery: In case of a major region-wide AWS outage, cross-region replication allows failing over the S3 buckets to a different region that is not affected by the outage.

Requirements for CRR

Cross-region replication requires the following:

Both source and destination buckets must have versioning enabled.
The source and destination buckets must be in different AWS Regions.
Amazon S3 must have permissions to replicate objects from the source bucket to the destination bucket on your behalf.
If the owner of the source bucket does not own the object in the bucket, the object owner must grant the bucket owner READ and READ_ACP permissions with the object ACL.
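The first two requirements above lend themselves to a quick pre-check. The sketch below is one way to express it, assuming the bucket facts have already been gathered; the BucketInfo type and the function name are hypothetical. With boto3, the versioning state could come from the "Status" field of get_bucket_versioning and the Region from get_bucket_location.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BucketInfo:
    """Facts about one bucket, gathered beforehand (hypothetical helper type)."""
    name: str
    region: str
    versioning: str  # "Enabled", "Suspended", or "" if never configured


def crr_requirement_problems(source: BucketInfo, destination: BucketInfo) -> List[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    # Both source and destination buckets must have versioning enabled.
    for bucket in (source, destination):
        if bucket.versioning != "Enabled":
            problems.append(f"{bucket.name}: versioning is not enabled")
    # For CRR, the buckets must be in different AWS Regions.
    if source.region == destination.region:
        problems.append("source and destination buckets are in the same Region")
    return problems


if __name__ == "__main__":
    src = BucketInfo("example-source-bucket", "us-east-1", "Enabled")
    dst = BucketInfo("example-destination-bucket", "us-east-1", "Suspended")
    for problem in crr_requirement_problems(src, dst):
        print(problem)
```

The permission-related requirements are not covered by this check, because they depend on IAM and ACL state that is not visible from bucket metadata alone.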

For more information, see Using CRR with Nasuni.

If you are setting the replication configuration in a cross-account scenario, where source and destination buckets are owned by different AWS accounts, the following additional requirements apply:

The owner of the destination bucket must grant the owner of the source bucket permissions to replicate objects with a bucket policy. For more information, see Granting Permissions When Source and Destination Buckets Are Owned by Different AWS Accounts.
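For the cross-account case, a destination bucket policy along the following lines is a common starting point. This is a sketch modeled on the kind of policy the AWS documentation shows, not a policy validated for any particular Nasuni deployment; the role ARN, bucket name, and statement IDs are hypothetical.

```python
import json


def build_destination_bucket_policy(source_role_arn: str, destination_bucket: str) -> dict:
    """Return a bucket policy letting the source account's role write replicas."""
    bucket_arn = f"arn:aws:s3:::{destination_bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReplicaWrites",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": source_role_arn},
                # Object-level actions apply to the objects in the bucket.
                "Action": ["s3:ReplicateObject", "s3:ReplicateDelete"],
                "Resource": f"{bucket_arn}/*",
            },
            {
                "Sid": "AllowBucketLevelChecks",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": source_role_arn},
                # Bucket-level actions apply to the bucket itself.
                "Action": ["s3:List*", "s3:GetBucketVersioning", "s3:PutBucketVersioning"],
                "Resource": bucket_arn,
            },
        ],
    }


if __name__ == "__main__":
    policy = build_destination_bucket_policy(
        "arn:aws:iam::111122223333:role/example-crr-role",  # hypothetical role
        "example-destination-bucket",  # hypothetical bucket
    )
    # The owner of the destination bucket would attach this JSON document
    # as the bucket policy.
    print(json.dumps(policy, indent=2))
```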

Section referenced from https://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html.

Using CRR with Nasuni

Consideration 1: Eventual Consistency

CRR uses an eventual consistency model. This means that writes to the source bucket and to its replica are not synchronous. The time that it takes for Amazon S3 to replicate an object depends on the size of the object. For large objects, it can take several hours. Although it might take a while before a replica is available in the destination bucket, it takes the same amount of time to create the replica as it took to create the corresponding object in the source bucket.

If a failover is required before all objects have replicated from the primary bucket, data loss occurs.

The failover might not work at all if a missing object is a TOC object (the table of contents for the version).
Objects missing at a lower level (folder manifest or file data) might not be noticed for a significant time. A user gets an I/O error when accessing a file that depends on a missing object.
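Amazon S3 reports a per-object replication status on source objects (PENDING, COMPLETED, or FAILED), which gives a rough view of how much data would be at risk at a given moment. The sketch below tallies such statuses and lists the keys that have not finished replicating. The function name and the sample keys are hypothetical; with boto3, each status could be read from the "ReplicationStatus" field of a head_object response, which is absent for objects that no replication rule covers.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Statuses treated as "replica exists". Both spellings are accepted here
# so the sketch tolerates either form.
DONE = {"COMPLETED", "COMPLETE"}


def summarize_replication(
    statuses: Dict[str, Optional[str]]
) -> Tuple[Counter, List[str]]:
    """Tally replication statuses and list keys that are not yet replicated.

    `statuses` maps an object key to its replication status, or to None when
    the object has no status (for example, it predates the CRR configuration).
    """
    counts: Counter = Counter()
    at_risk: List[str] = []
    for key, status in sorted(statuses.items()):
        counts[status or "NONE"] += 1
        if status not in DONE:
            # PENDING, FAILED, missing, or unexpected: flag for review.
            at_risk.append(key)
    return counts, at_risk


if __name__ == "__main__":
    sample = {  # hypothetical keys and statuses
        "object-a": "COMPLETED",
        "object-b": "PENDING",
        "object-c": None,
        "object-d": "FAILED",
    }
    counts, at_risk = summarize_replication(sample)
    print(dict(counts))
    print("Not yet replicated:", at_risk)
```

Any key in the second list is an object that would be missing from the destination bucket if a failover happened at that moment.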

Consideration 2: CRR Setup

For new volumes, the user must perform a manual snapshot. This creates the new bucket. Then the user can set up the CRR policy. Objects already in the source bucket must then be copied to the destination bucket, because CRR does not replicate objects created before CRR was configured. This can be done using:

the "Replicate Existing Objects" option (preferred because less prone to human error);
or the AWS CLI;
or any third-party S3 browser tool.
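For the AWS CLI option, one possibility is the "aws s3 sync" command, which copies objects from one bucket to another. The sketch below only assembles the command line so that it can be reviewed before anyone runs it; the bucket names and Regions are hypothetical. Note that "aws s3 sync" copies the current version of each object rather than its version history, so whether it is sufficient for a given volume is something to confirm with Nasuni Support.

```python
import shlex
from typing import List


def build_sync_command(
    source_bucket: str,
    destination_bucket: str,
    source_region: str,
    destination_region: str,
) -> List[str]:
    """Return the argv for an `aws s3 sync` between two buckets."""
    return [
        "aws", "s3", "sync",
        f"s3://{source_bucket}",
        f"s3://{destination_bucket}",
        "--source-region", source_region,   # Region of the source bucket
        "--region", destination_region,     # Region of the destination bucket
    ]


if __name__ == "__main__":
    cmd = build_sync_command(
        "example-source-bucket",       # hypothetical
        "example-destination-bucket",  # hypothetical
        "us-east-1",
        "us-west-2",
    )
    # Print the command for review instead of executing it.
    print(" ".join(shlex.quote(part) for part in cmd))
    # To execute it, the list could be passed to subprocess.run(cmd, check=True).
```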

Controlled Failover Process with Nasuni

Important: This procedure is supported only if all Edge Appliances are running version 9.8 or later.

This process can be executed if the company’s governing policy requires failing over from Primary to recovery at a certain frequency.

Tip: Nasuni recommends trying this process on a test system before performing the procedure on a production system.

The customer is responsible for ensuring that AWS CRR has successfully completed the sync between Primary and recovery before failing over. This process assumes that replication has completed successfully for every object.
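One way to check that the sync is complete is to compare object-version listings from the two buckets. The sketch below assumes that replicas keep the same key and version ID as their source objects, and that both listings have already been collected; the function names are hypothetical. With boto3, the listings could be built from the "Versions" entries returned by a list_object_versions paginator.

```python
from typing import Iterable, List, Tuple

# (object key, version ID)
ObjectVersion = Tuple[str, str]


def missing_from_destination(
    source: Iterable[ObjectVersion],
    destination: Iterable[ObjectVersion],
) -> List[ObjectVersion]:
    """Return source object versions that have no replica in the destination."""
    return sorted(set(source) - set(destination))


def replication_is_complete(
    source: Iterable[ObjectVersion],
    destination: Iterable[ObjectVersion],
) -> bool:
    """True only when every source object version exists in the destination."""
    return not missing_from_destination(source, destination)


if __name__ == "__main__":
    # Hypothetical listings for illustration.
    source_listing = [("object-a", "v1"), ("object-a", "v2"), ("object-b", "v1")]
    destination_listing = [("object-a", "v1"), ("object-b", "v1")]
    print("Complete:", replication_is_complete(source_listing, destination_listing))
    print("Missing:", missing_from_destination(source_listing, destination_listing))
```

Because new writes keep arriving until users are disconnected, a comparison like this is only meaningful once writes to the source bucket have stopped.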

1. Call Nasuni Support referencing this procedure.
   1. Make sure to have the CRR destination bucket name available for Nasuni Support.
   2. If the cloud credentials are different for the destination bucket, add these credentials to Nasuni via the NMC for all appliances connecting to the volume.
   3. Enable remote support on the Primary appliance.
2. On all remote appliances connected to a volume, perform these steps:
   1. Disconnect all users from the affected volume on all Edge Appliances sharing the volume. You can use the Edge Appliance UI or the NMC for this.
   2. Execute a final snapshot on all Edge Appliances sharing the impacted volumes.
   3. Convert all shares to read only.
3. On the Primary appliance, perform these steps:
   1. Disconnect all users from the affected volume on all Edge Appliances sharing the volume. You can use the Edge Appliance UI or the NMC for this.
   2. Convert all shares to read only.
4. At this point, Nasuni Support must switch the bucket and region via remote support.
5. After step #4 is completed, on the Primary appliance, make all shares read/write where appropriate.

At this point, the data is available. On remote appliances, you might want to “warm” the cache with the “Bring into cache” option for any data for which initial access performance is important.

Uncontrolled Failover Process with Nasuni

This process is executed when failover is unplanned.

Disconnect all users from the affected volume on all Edge Appliances sharing the volume. You can use the Edge Appliance UI or the NMC for this.
Ensure that all Edge Appliances that share the impacted volume do not attempt to snapshot while this process is performed.
Convert the volume shares to read only.
Contact Nasuni Support immediately.