Are You Ready When the Cloud Turns into a Downpour?

I think it’s safe to say that everyone in the technology industry was shocked by last week’s Amazon Web Services outage. As one of the major backbones of cloud-based Internet services, Amazon’s issues affected everything from Disney+, Netflix and Roku streaming to services such as Venmo and CashApp to the company’s own delivery drivers. Although this was unprecedented for Amazon, it was definitely a wakeup call to anyone who relies on cloud hosting services. And it underscored the need for true SaaS-based, multi-region “high availability” solutions to support resiliency in modern data architectures.

Between a blip and a disaster

According to Amazon, the issue originated in its US-East-1 region in Virginia. Of course, Amazon’s services include high availability within its East, West and other global regions, so if there’s an issue in one datacenter or zone within a single region, workloads can be routed to different locations within the same region to ensure uptime. This incident was so shocking because the number of core services affected in a single region increased the blast radius so significantly.

Yet, it might not qualify as a true “disaster” as defined in most disaster recovery planning. Those are natural events (hurricanes, tornadoes, tsunamis) or man-made (accidental/intentional, terrorism, hacking) that cause a long-term disruption to the business. In this case, there was no natural event or obvious intentional sabotage that would clearly impact availability longer term. While companies could have implemented their disaster recovery plans, doing so would have come with costs: potential loss of a limited amount of data, absorption of employee resources to implement the plan, the time necessary to reverse the changes once the incident had passed, if it wasn’t permanent. There was no way to know how temporary or long lasting this incident might be so executing a disaster recovery plan could have placed an additional and unnecessary burden on the business.

That still leaves a gap between normal operations and full-on disaster recovery – a gap that can cause major issues. Many software companies today pride themselves on the “five nines” – 99.999% uptime in a given year. Some even include that SLA in their contracts. Even though this incident lasted less than a day, it was potentially enough to drop affected companies’ uptime for 2021 from five nines to three: 99.9%. They lost two orders of magnitude in one incident.

In the critical path of data

For modern enterprises who rely on data to run their internal operations, uptime and availability are just as important, and in the modern data architecture, the entire system is only as reliable as its weakest link. This is especially critical right now for industries like banking where the movement of money, and in fact the entire system, relies on the flow of data. But this will become increasingly important across all industries as data becomes more and more essential to core business functions.

That means when you’re picking your database, your ETL provider, and your BI tools, availability needs to be a key “non-functional” factor you evaluate as part of your buying process along with any features you need. When it comes to data control and security solutions, ALTR’s availability is unmatched. Because we interact with and follow data from on-prem through the ETL process to the cloud, we’re in the critical path of data, making it crucial that our service keep up with the rest of the data ecosystem. Our answer is multi-region, high availability built into normal operations via a true multi-tenant SaaS solution. Because SaaS allows economies of scale, we don’t have to build and maintain dedicated single-customer infrastructure in multiple geographic regions. We just have to construct ALTR infrastructure in multiple geographies for all our customers to leverage, significantly reducing cost and complexity.

Disruption doesn’t have to slow you down

This incident highlighted a weakness in resiliency planning: what happens when something less than a disaster causes a significant disruption to the business? We don’t think the uptime guarantees many have trusted to date are strong enough for today’s business environment, especially for business-critical data infrastructure. The future is high availability based on multi-region redundancy – we expect this to be a broad and consistent theme across industries. And for essential data control and security, that means ALTR.