Yai365 as AWS Optimized Architecture

As an APN partner, YaiGlobal hosts its current solution, Yai365, as SaaS. The Yai365 architecture is designed to meet the business requirements for performance and availability.

Architecture (2D)

Image
  • Users reach the Yai365 ELB from a web browser through Route 53; the connection is secured with an SSL certificate.
  • The web application servers run on AWS EC2 in an Auto Scaling group that spans multiple Availability Zones within an AWS VPC.
  • The web application servers store data in an Amazon Aurora cluster (us-east-1) with a read replica in a different Availability Zone and a full replica (writer/reader) in a different Region (us-west-1).
  • Files are stored in Amazon S3 (Region us-east-1) with a full replica in a second Region (us-west).
  • A complete standby solution is set up in a different Region (us-west-1).
  • Access to the Aurora cluster and to S3 is allowed only from within the VPC; neither is publicly accessible.
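
One way to express the "only from within the VPC" rule for S3 is a bucket policy that denies every request not arriving through a VPC endpoint. The sketch below only builds such a policy document; it is an illustration, not the policy actually deployed for Yai365. The endpoint ID is hypothetical, and the bucket name is the one mentioned later in this document.

```python
import json


def vpc_only_bucket_policy(bucket: str, vpce_id: str) -> dict:
    """Build a bucket policy that denies requests not made through the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SourceVpce is the global condition key for the VPC endpoint ID.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# "vpce-0123456789abcdef0" is a made-up endpoint ID used for illustration only.
policy = vpc_only_bucket_policy("yai24-main-storage", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

A blanket deny of this kind also blocks administrators working from outside the VPC, so a real policy would likely need carefully scoped exceptions.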

Architecture (3D)

Image

Backup

Backup and restore are not used for disaster recovery, but for recovering from user errors. DR is covered in chapter 7. If a user deletes files or data by mistake, they can be recovered. Files matter more to users than data entries, so S3 storage with versioning is well suited to recovering files that users have deleted.

Application

The EC2 AMI allows the entire EC2 instance of the Yai365 web application to be restored whenever needed.

Database

Aurora backs up the Yai365 cluster volume automatically and retains restore data for the length of the backup retention period. Aurora backups are continuous and incremental, so we can quickly restore to any point within the backup retention period. No performance impact or interruption of database service occurs while backup data is being written. We specified a backup retention period of one day.

The deployment team can also create manual snapshots, which are kept beyond the backup retention period, for example around a large data import.
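
A manual snapshot of this kind could be requested through the RDS API. The sketch below only prepares the arguments for `create_db_cluster_snapshot`; the cluster identifier `yai365-cluster` and the naming scheme are assumptions made for illustration.

```python
import re
from datetime import datetime, timezone


def snapshot_request(cluster_id: str, reason: str, now: datetime) -> dict:
    """Build keyword arguments for rds.create_db_cluster_snapshot()."""
    # Snapshot identifiers allow letters, digits and single hyphens (max 63 characters).
    slug = re.sub(r"[^a-z0-9]+", "-", reason.lower()).strip("-")
    stamp = now.strftime("%Y%m%d-%H%M")
    identifier = "-".join(part for part in (cluster_id, slug, stamp) if part)
    return {
        "DBClusterIdentifier": cluster_id,
        "DBClusterSnapshotIdentifier": identifier[:63].rstrip("-"),
    }


# "yai365-cluster" is a hypothetical cluster identifier.
params = snapshot_request("yai365-cluster", "Before big import", datetime.now(timezone.utc))
print(params)
# With boto3 installed and credentials configured (not executed here):
#   boto3.client("rds", region_name="us-east-1").create_db_cluster_snapshot(**params)
```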

The deployment team can recover data by creating a new Aurora DB cluster from the backup data that Aurora retains or from a saved DB cluster snapshot. A new copy of the DB cluster can be restored quickly from backup data to any point in time within the backup retention period. Because Aurora backups are continuous and incremental during the retention period, we do not need to take frequent snapshots of Yai365 data to improve restore times.

To determine the latest or earliest restorable time for a DB instance, the deployment team looks for the Latest Restorable Time or Earliest Restorable Time values on the RDS console.
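
The same two values are also returned by the RDS `describe_db_clusters` call as `EarliestRestorableTime` and `LatestRestorableTime`. As a minimal sketch, the function below checks a requested restore time against that window and prepares the arguments for `restore_db_cluster_to_point_in_time`. The cluster identifiers and timestamps are made up for illustration.

```python
from datetime import datetime, timezone


def pitr_request(cluster: dict, new_cluster_id: str, restore_to: datetime) -> dict:
    """Build kwargs for rds.restore_db_cluster_to_point_in_time().

    `cluster` is one entry of describe_db_clusters()["DBClusters"].
    """
    earliest = cluster["EarliestRestorableTime"]
    latest = cluster["LatestRestorableTime"]
    if not earliest <= restore_to <= latest:
        raise ValueError(f"{restore_to} is outside the restorable window {earliest} .. {latest}")
    return {
        "SourceDBClusterIdentifier": cluster["DBClusterIdentifier"],
        "DBClusterIdentifier": new_cluster_id,
        "RestoreToTime": restore_to,
    }


# Stand-in for a describe_db_clusters() entry; all values are hypothetical.
cluster = {
    "DBClusterIdentifier": "yai365-cluster",
    "EarliestRestorableTime": datetime(2024, 5, 1, 8, 0, tzinfo=timezone.utc),
    "LatestRestorableTime": datetime(2024, 5, 2, 7, 55, tzinfo=timezone.utc),
}
print(pitr_request(cluster, "yai365-restored", datetime(2024, 5, 1, 18, 30, tzinfo=timezone.utc)))
```

When the restore is done through the API rather than the console, the new cluster is created without DB instances, so an instance still has to be added before the application can connect.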

Storage

Versioning is enabled on the Yai365 S3 bucket, so Amazon S3 automatically generates a unique version ID for each object stored. A versioning-enabled bucket lets the deployment team recover objects after accidental deletion or overwrite. The business requirements do not call for an additional layer of protection against file deletion by the deployment team, so MFA Delete is disabled.
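
In a versioned bucket, a normal delete only adds a delete marker on top of the object, and removing that marker brings the previous version back. The following sketch shows the idea with the two S3 calls involved, `list_object_versions` and `delete_object`. The object key is hypothetical, and `FakeS3` is a stand-in so the example runs without AWS access; a boto3 S3 client would be passed in its place.

```python
def undelete(s3, bucket: str, key: str) -> bool:
    """Remove the current delete marker of `key` so the previous version is visible again.

    Returns True if a delete marker was removed.
    """
    listing = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in listing.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            # Deleting a specific VersionId removes the marker itself.
            s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
            return True
    return False


class FakeS3:
    """In-memory stand-in for a boto3 S3 client (illustration only)."""

    def __init__(self):
        self.removed = []

    def list_object_versions(self, Bucket, Prefix):
        return {"DeleteMarkers": [{"Key": Prefix, "VersionId": "m1", "IsLatest": True}]}

    def delete_object(self, Bucket, Key, VersionId):
        self.removed.append((Key, VersionId))


fake = FakeS3()
print(undelete(fake, "yai24-main-storage", "audio/session-01.wav"), fake.removed)
```

For prefixes that match many objects, the real listing is paginated, which this sketch does not handle.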

Load Balancing over Availability Zones

Image

Disaster Recovery (DR)

The Yai365 disaster recovery plan is based on two levels:

  • Component disaster: if an incident affects the application, the database, or the file storage, a second instance takes over immediately. This is possible thanks to the scaled AWS architecture listed in the table below.
  • Region disaster: if the whole us-east-1 Region becomes unavailable, for example because of a natural disaster such as an earthquake or flood, a complete standby solution can be started by changing the domain target to the standby Region. This is possible because AWS Regions are physically separated from each other.

 

 

| Service | Component | Standby | Region |
|---|---|---|---|
| Application Availability | ELB with Auto Scaling across different Availability Zones, and a standby in a different Region | Cold | A complete replication of the Region us-east-1 to the Region us-west-1 allows recovery from a Region disaster |
| Data Recovery | Aurora cluster with a read replica in a second Availability Zone and a full cluster replica in a different Region | Hot | |
| Files Recovery | S3 with a full replica in a different Region | Hot | |

 

Note: To create the cross-Region replica, we enabled binary logging for the Aurora MySQL DB cluster using a new parameter group, yai24parametergroup, with the parameter binlog_format set to MIXED.
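
As a rough sketch, the parameter group described in the note corresponds to two RDS API calls. The code below only assembles their arguments. The parameter group family `aurora-mysql5.7` is an assumption, since the engine version is not stated here, and the description text is made up.

```python
def binlog_parameter_group_calls(name: str, family: str) -> list:
    """Return (method name, kwargs) pairs for the RDS calls that set binlog_format to MIXED."""
    return [
        ("create_db_cluster_parameter_group", {
            "DBClusterParameterGroupName": name,
            "DBParameterGroupFamily": family,
            "Description": "Binary logging for cross-Region replication",
        }),
        ("modify_db_cluster_parameter_group", {
            "DBClusterParameterGroupName": name,
            "Parameters": [{
                "ParameterName": "binlog_format",
                "ParameterValue": "MIXED",
                # Applied at the next reboot; this method is accepted for static parameters.
                "ApplyMethod": "pending-reboot",
            }],
        }),
    ]


def apply_calls(rds, calls) -> None:
    """Execute the prepared calls against a boto3 RDS client."""
    for method, kwargs in calls:
        getattr(rds, method)(**kwargs)


# The family name below is an assumption, not taken from this document.
calls = binlog_parameter_group_calls("yai24parametergroup", "aurora-mysql5.7")
for method, kwargs in calls:
    print(method, kwargs)
```

The cluster then has to be associated with the parameter group and rebooted before the new binlog_format takes effect.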

Disaster of a component within a region

Yai365 is a stateless web application, so every HTTPS request is handled independently. All data and files are stored in an external database and external storage. This allows effective load balancing with AWS Elastic Load Balancing, because an automatic switch from an EC2 instance in one Availability Zone to an instance in another does not cause any loss of data or files.

Furthermore, an EC2 instance can be created at any time from an AMI containing the latest release of Yai365.

While Amazon RDS provides a highly available Multi-AZ configuration, it cannot protect against every possibility, such as a natural disaster, a malicious actor, or logical corruption of a database. To maintain business continuity, the DR plan includes a full standby solution that can be switched from cold to hot; the switch takes as long as starting the EC2 instances and changing Route 53 to point to the standby Elastic Load Balancer. In addition, a second domain, proxy.yai365.com, points to the standby solution, both to allow routine tests of the standby and to serve as a fallback if the main domain has DNS problems.

Disaster of the whole region

The recovery plan is based on a standby solution whose EC2 instances have two modes, cold and hot. Database replication through Amazon Aurora and storage replication through S3 cross-Region replication run as hot standby. If an alarm (performance, health check, …) is raised for any component in the primary Region, the administrator immediately switches the standby Region from cold to hot by starting the EC2 instances in the standby Region. After the problem in the primary Region is solved, the standby Region is switched back to cold.
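
The cold-to-hot switch comes down to two steps: start the stopped EC2 instances in the standby Region, then point the domain at the standby load balancer. The sketch below shows how these steps could be scripted with the EC2 and Route 53 APIs. It is an illustration rather than the team's actual runbook, and every identifier in the example call (domain, load balancer DNS name, zone ID) is hypothetical.

```python
def alias_upsert(record_name: str, elb_dns_name: str, elb_zone_id: str) -> dict:
    """Build the Route 53 ChangeBatch that points `record_name` at the standby load balancer."""
    return {
        "Comment": "Fail over to standby region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": elb_zone_id,  # zone ID of the load balancer, not of the domain
                    "DNSName": elb_dns_name,
                    "EvaluateTargetHealth": True,
                },
            },
        }],
    }


def activate_standby(ec2, route53, instance_ids, hosted_zone_id, change_batch) -> None:
    """Start the standby instances, wait until they run, then repoint DNS."""
    ec2.start_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
    route53.change_resource_record_sets(HostedZoneId=hosted_zone_id, ChangeBatch=change_batch)


# All values below are placeholders for illustration.
batch = alias_upsert("app.example.com.", "standby-elb.us-west-1.elb.amazonaws.com", "ZEXAMPLE123")
print(batch["Changes"][0]["ResourceRecordSet"]["Name"])
```

`ec2` and `route53` would be boto3 clients, with the EC2 client created for the standby Region. Switching back to cold would reverse the DNS change and stop the instances.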

 

RPO and RTO for all in-scope services

Understanding RTO and RPO

Recovery time objective (RTO) and recovery point objective (RPO) are two key metrics to consider when developing a DR plan. RTO represents how many hours it takes you to return to a working state after a disaster. RPO, which is also expressed in hours, represents how much data you could lose when a disaster happens. For example, an RPO of 1 hour means that you could lose up to 1 hour’s worth of data when a disaster occurs.
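
To make the two metrics concrete, the short example below measures a fictional incident against the targets used in this document (RTO 4 hours, RPO 2 hours). The timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=4)  # maximum acceptable downtime
RPO = timedelta(hours=2)  # maximum acceptable window of lost data


def assess(last_replicated: datetime, outage_start: datetime, service_restored: datetime) -> dict:
    """Compare one incident against the RTO and RPO targets."""
    data_loss = outage_start - last_replicated  # work done since the last good copy
    downtime = service_restored - outage_start  # time without a working service
    return {
        "data_loss": data_loss,
        "rpo_met": data_loss <= RPO,
        "downtime": downtime,
        "rto_met": downtime <= RTO,
    }


# Fictional incident: last replication 09:45, outage 10:00, service back at 12:30.
result = assess(
    last_replicated=datetime(2024, 5, 1, 9, 45),
    outage_start=datetime(2024, 5, 1, 10, 0),
    service_restored=datetime(2024, 5, 1, 12, 30),
)
print(result)
```

In this example 15 minutes of data are lost and the service is down for 2.5 hours, so both targets are met.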

Disaster Recovery Plan

For the Yai365 production environment, the YaiGlobal deployment team has taken precautions so that the Yai365 solution can recover from an unexpected event. The following services are covered by the YaiGlobal disaster recovery plan to ensure business continuity:

 

| Service | Design | RTO | RPO |
|---|---|---|---|
| Application Availability | ELB with Auto Scaling across different Availability Zones, and a standby in a different Region | 4 h | 2 h |
| Data Recovery | Aurora cluster with a read replica in a second Availability Zone and a full cluster replica in a different Region | 4 h | 2 h |
| Files Recovery | S3 with a full replica in a different Region | 4 h | 2 h |

 

The RTO of the standby solution is the time needed to start the EC2 instances plus the time needed to update the DNS records to point to the standby Elastic Load Balancer.

10 Tips For Developing an AWS Disaster Recovery Plan

Identify critical resources and assets

Every six months, a Business Impact Analysis (BIA) is carried out to give the YaiGlobal deployment team a picture of which areas would be most affected by a threat. It also helps the team anticipate the potential impact of a disaster on operations.

Define RTO and RPO

The inputs to Yai365 are audio data and the outputs are annotation files. Trouble uploading audio files is less critical than losing annotation data, which represents the transcriber's work. Losing 2 hours of data is acceptable, as it corresponds to the time the transcriber spent working on the annotation. An RPO of 2 hours can therefore be absorbed; thanks to the AWS architecture, however, it is even possible to avoid losing any data.

Thanks to ELB with Auto Scaling, the Aurora database cluster, and the S3 managed service, the RTO can be limited to a maximum of 4 hours. Using the Amazon S3 service, the deployment team can achieve an even lower RTO, down to half an hour, but this is not required for business continuity.

Define RTO and RPO

An Aurora DB cluster is fault tolerant by design. The cluster volume spans multiple Availability Zones in a single AWS Region (us-east-1), and each Availability Zone (us-east-1a or us-east-1b) contains a copy of the cluster volume data. This means that the DB cluster can tolerate the failure of an Availability Zone without any loss of data and with only a brief interruption of service.

Because the DB cluster has an Aurora Replica in a second Availability Zone, that replica is promoted to the primary instance during a failure event. A failure event results in a brief interruption, during which read and write operations fail with an exception. However, service is typically restored in less than 120 seconds, and often in less than 60 seconds.
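
Because operations fail with an exception during the brief interruption, the application side can bridge a failover by retrying for up to the expected window. The sketch below is a generic retry helper under that assumption, not code taken from Yai365. `ConnectionError` stands in for whatever exception the actual database driver raises, and `flaky_query` simulates a query that fails twice.

```python
import time


def call_with_retry(operation, max_wait_s=120.0, first_delay_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` with exponential backoff until it succeeds or `max_wait_s` has elapsed."""
    deadline = clock() + max_wait_s
    delay = first_delay_s
    while True:
        try:
            return operation()
        except ConnectionError:
            if clock() + delay > deadline:
                raise  # failover took longer than expected; surface the error
            sleep(delay)
            delay = min(delay * 2, 15.0)  # cap the pause between attempts


attempts = {"n": 0}


def flaky_query():
    """Simulated query: fails on the first two attempts, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("writer not available")
    return "ok"


# A no-op sleep keeps the demonstration instant.
print(call_with_retry(flaky_query, sleep=lambda seconds: None))
```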

 

Amazon S3 Bucket Access Policies

The Yai365 deployment team uses Amazon CloudWatch Events to detect PutObject and PutObjectAcl API calls in near real time and helps ensure that objects remain private by making automatic PutObjectAcl calls when necessary. This is a reactive approach that complements the proactive approach, in which AWS Identity and Access Management (IAM) policy conditions force users to put objects with private access. The reactive approach is for “just in case” situations in which a change to the ACL is accidental and must be fixed.

Whenever a user in the account makes a PutObjectAcl call against the bucket yai24-main-storage, S3 delivers the corresponding event to CloudWatch Events via CloudTrail. The event must match the CloudWatch Events rule to be delivered to the Lambda function. The function then checks whether the object ACL is as expected; if not, it makes the object private.
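
A minimal sketch of such a Lambda function is shown below. It is not the deployed function: the event field names (`detail.requestParameters.bucketName` and `key`) follow the usual layout of CloudTrail S3 records and should be verified against a real event. `FakeS3` and the object key are stand-ins so the example runs without AWS access.

```python
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def is_public(acl: dict) -> bool:
    """True if any grant in a get_object_acl() response targets a public group."""
    return any(g["Grantee"].get("URI") in PUBLIC_GROUPS for g in acl.get("Grants", []))


def remediate(s3, event: dict) -> bool:
    """Make the object named in the event private if its ACL is public."""
    params = event["detail"]["requestParameters"]
    bucket, key = params["bucketName"], params["key"]
    if is_public(s3.get_object_acl(Bucket=bucket, Key=key)):
        s3.put_object_acl(Bucket=bucket, Key=key, ACL="private")
        return True
    return False


def lambda_handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime
    return remediate(boto3.client("s3"), event)


class FakeS3:
    """In-memory stand-in for a boto3 S3 client (illustration only)."""

    def __init__(self, grants):
        self.grants, self.fixed = grants, []

    def get_object_acl(self, Bucket, Key):
        return {"Grants": self.grants}

    def put_object_acl(self, Bucket, Key, ACL):
        self.fixed.append((Bucket, Key, ACL))


event = {"detail": {"requestParameters": {"bucketName": "yai24-main-storage", "key": "audio/a.wav"}}}
public_grant = {
    "Grantee": {"Type": "Group", "URI": "http://acs.amazonaws.com/groups/global/AllUsers"},
    "Permission": "READ",
}
fake = FakeS3([public_grant])
print(remediate(fake, event), fake.fixed)
```

The corrective PutObjectAcl call triggers the rule once more, but on that second pass the ACL is already private, so the function makes no further change.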

 

Securing S3 Files

Image

AWS Config

AWS Config enables continuous monitoring of the AWS resources, making it simple to assess, audit, and record resource configurations and changes. AWS Config does this using rules that define the desired configuration state of your AWS resources. AWS Config provides a number of AWS managed rules that address a wide range of security concerns such as checking if you encrypted your Amazon Elastic Block Store (Amazon EBS) volumes, tagged your resources appropriately, and enabled multi-factor authentication (MFA) for root accounts. You can also create custom rules to codify your compliance requirements using AWS Lambda functions.
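
The three managed rules mentioned above can be enabled through the AWS Config `put_config_rule` call. The sketch below prepares the rule definitions. The rule names and the tag key `Project` are made up for illustration, and which rules are actually enabled for Yai365 is not stated here.

```python
import json
from typing import Optional


def managed_rule(name: str, identifier: str, parameters: Optional[dict] = None) -> dict:
    """Build the ConfigRule argument of config.put_config_rule() for an AWS managed rule."""
    rule = {
        "ConfigRuleName": name,
        "Source": {"Owner": "AWS", "SourceIdentifier": identifier},
    }
    if parameters:
        rule["InputParameters"] = json.dumps(parameters)  # the API expects a JSON string
    return rule


# Rule names and the tag key are hypothetical; the identifiers are AWS managed rules.
RULES = [
    managed_rule("ebs-volumes-encrypted", "ENCRYPTED_VOLUMES"),
    managed_rule("root-mfa-enabled", "ROOT_ACCOUNT_MFA_ENABLED"),
    managed_rule("required-tags", "REQUIRED_TAGS", {"tag1Key": "Project"}),
]


def deploy(config_client, rules=RULES) -> None:
    """Create or update each rule using a boto3 AWS Config client."""
    for rule in rules:
        config_client.put_config_rule(ConfigRule=rule)


for rule in RULES:
    print(rule)
```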

Securing S3 Files

Image
