AWS confirms it is working to 'fully restore' services after major outage

Source is ComputerWeekly.com

Amazon Web Services (AWS) said it is working to “fully restore” its customers’ cloud environments, after an “operational issue” within its North Virginia datacentre region knocked out multiple internet sites and services across the globe.

Users of the public cloud giant’s services started reporting problems at around 8am UK time, according to outage tracking website Downdetector.

This is around the same time the AWS Health Dashboard service, which provides users with a rundown of how the company’s cloud environments are performing, started tracking issues with multiple services hosted within its US-East-1 region in North Virginia.

This was followed by several admissions of “serious error rates” affecting AWS services within the US-East-1 region, alongside assurances that the company had engineers on hand who are “immediately engaged and are actively working on both mitigating the issue, and fully understanding the root cause.”

The Dashboard later confirmed, at around 10am UK time, that: “Global services or features that rely on US-East-1 endpoints… may also be experiencing issues.”

AWS subsequently said the outage related to a DNS issue affecting its DynamoDB NoSQL database service: “We have identified a potential root cause for error rates for the DynamoDB APIs in the US-East-1 region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-East-1.”

The technical difficulties are known to have had a knock-on effect for many AWS customers across the globe, who have also reported problems as a result of the cloud giant’s services going down.

Among those affected are financial services provider Lloyds Bank, along with its Halifax and Bank of Scotland subsidiaries, as well as social media and communications services such as Snapchat and Signal, and online gaming portals Fortnite and Roblox.

Amazon-owned internet services, such as its retail site and Ring doorbell service, have also suffered disruption as a result of the outage.

Computer Weekly contacted AWS to ask when it expected to have the matter resolved. In response, a spokesperson directed Computer Weekly to the AWS Health Dashboard, where the most recent updates state that the company is seeking to fully restore affected services and has begun to successfully relaunch those hit by the problems.

Even so, public cloud market watchers have been quick to point out how the wide range of users and services that have been taken offline as a result of the outage could be indicative of how over-reliant the world has become on AWS’s services.

Experts claimed the incidents highlight why it is so important for enterprises to diversify the mix of cloud providers they work with in the interests of uptime and service availability.

Nicky Stewart, senior advisor to the Open Cloud Coalition, an organisation that advocates for competition in the public cloud market, said the outage is a “visceral reminder of the risks of over-reliance on two dominant cloud providers,” given how widespread its after-effects have been.

“It’s too soon to gauge the economic fallout, but for context, last year’s global CrowdStrike outage was estimated to have cost the UK economy between £1.7bn and £2.3bn,” said Stewart.

“Incidents like this make clear the need for a more open, competitive and interoperable cloud market – one where no single provider can bring so much of our digital world to a standstill.”

Dai Vaughan, chief technology officer at digital transformation consultancy Public Digital, said the AWS outage demonstrates that accidental technology failure can pose as big a risk to company operations as a cyber attack.

For this reason, he said companies should seize on today’s news to develop a “defensive mindset” towards downtime threats, one that “embraces preparedness and resilience” in the long term.

“One thing all organisations should do to prepare is to create a designated crisis response team. This should be fewer than 12 people and include those with expertise in IT, data management, communications and stakeholder management, as well as senior leadership,” said Vaughan.

“Ultimately, resilience isn’t about eliminating risk entirely, but about understanding it, planning for it, and cultivating a culture that can absorb shocks and recover quickly.”  

He continued: “Those who take this holistic, anticipatory, and internet-era approach will not only protect their operations but also preserve trust with customers and partners in an uncertain digital landscape.” 
