Amazon unavailable or very slow on Prime Day? Here's why

On Prime Day, Amazon, the No. 1 online retailer, was dealing with a major website outage that slowed sales and with a technical problem at one of its largest warehouses, in Ohio, that delayed the shipment of thousands of packages.

Internal documents show that the warehouse problem was largely blamed on Amazon's migration from Oracle's database to its own database technology. The outage highlights the challenges Amazon may face in its effort to move completely away from Oracle's database by next year, and how difficult it may be to re-create that level of reliability. It also suggests that Oracle's database is in some ways more efficient than Amazon's comparable software, a point Oracle is likely to highlight at its annual OpenWorld conference in San Francisco this week.

After the Prime Day outage, Amazon engineers wrote a 25-page report, which Amazon calls a correction of error. Amazon uses this standard process to understand why major incidents occur and how to prevent them in the future.

The report suggests that Amazon struggled to identify the root cause of the Prime Day problem, which stemmed from functionality lost in the database migration. Amazon also had no contingency plan in case errors occurred with the newly installed database, Aurora PostgreSQL, the documents show.

In one question, an engineer was asked why Amazon's warehouse database had not run into the same problem during the previous traffic peak, when it was still running on Oracle. The answer: "Oracle and Aurora PostgreSQL are two different [database] technologies" that handle "savepoints" differently.

Savepoints are an important database tool for tracking and recovering individual transactions: a savepoint marks a point within a transaction that the database can later roll back to. The report said that on Prime Day too many savepoints were created, which overwhelmed Amazon's Aurora software and dragged down the overall performance of the database.
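For readers unfamiliar with savepoints, the sketch below shows the general pattern in Python. It is an illustration only: it uses SQLite from Python's standard library, not Aurora PostgreSQL or Oracle, and the `shipments` table, the `record_shipments` function and the simulated failure are hypothetical names invented for the example rather than anything taken from Amazon's report. It wraps each row in its own savepoint, the kind of per-operation usage that can produce a very large number of savepoints under heavy load.

```python
import sqlite3


def record_shipments(package_ids, failing_ids):
    """Insert one row per package inside a single transaction.

    Each row gets its own savepoint, so a failure on one package
    undoes only that package's work, not the whole transaction.
    Returns (saved_ids, savepoints_created).
    """
    # isolation_level=None turns off the module's implicit transactions,
    # so the BEGIN and COMMIT below are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE shipments (package_id INTEGER PRIMARY KEY)")
    savepoints_created = 0

    conn.execute("BEGIN")
    for package_id in package_ids:
        conn.execute("SAVEPOINT per_package")  # mark a point to return to
        savepoints_created += 1
        try:
            conn.execute(
                "INSERT INTO shipments (package_id) VALUES (?)", (package_id,)
            )
            if package_id in failing_ids:
                # Hypothetical failure that happens after the insert.
                raise RuntimeError("simulated downstream failure")
            conn.execute("RELEASE SAVEPOINT per_package")  # keep the work
        except RuntimeError:
            # Undo only the work done since the savepoint, then discard it.
            conn.execute("ROLLBACK TO SAVEPOINT per_package")
            conn.execute("RELEASE SAVEPOINT per_package")
    conn.execute("COMMIT")

    saved = [
        row[0]
        for row in conn.execute(
            "SELECT package_id FROM shipments ORDER BY package_id"
        )
    ]
    conn.close()
    return saved, savepoints_created


if __name__ == "__main__":
    print(record_shipments([1, 2, 3, 4], failing_ids={3}))
```

In this sketch the number of savepoints grows with the number of rows processed, one per package. How costly that is depends on the database engine, which is the difference the engineers pointed to.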

It could have been avoided

“Had Amazon stuck with Oracle, this outage likely would not have occurred,” said Matt Caesar, a computer science professor at the University of Illinois at Urbana-Champaign, after CNBC revealed details of the document. “Also, it appears that Amazon would have been able to diagnose the problem earlier if it had used Oracle, which would have hopefully shortened the duration of the outage.”

An Amazon spokesperson downplayed the problem in an emailed statement, saying there were no outages, despite internal documents claiming that database "performance degradation caused lags and severe outages."

"It is important to note that there was no outage at the warehouse and the issue only resulted in a short delay in the shipment of approximately 1% of the parcels. This issue was quickly identified and resolved," the spokesperson said.

The Ohio warehouse was the largest of 13 that moved away from Oracle databases ahead of Prime Day. The warehouse handled more than 1.1 million packages a day during the Prime Day promotion, the documents said. All services and software that handle inventory and shipping data have been migrated to Aurora databases in those warehouses.

The outage lasted for several hours on Prime Day, delayed more than 15,000 packages and cost an estimated $90,000 in labor, according to the report. That estimate does not include the time engineers spent troubleshooting and fixing the problem, or any sales that may have been lost.

In a section of the report titled "Lessons Learned," Amazon engineers wrote that "savepoints behave differently in Aurora PostgreSQL than they do in Oracle," implying that Oracle's software handles the problem more effectively. The report also said there was no SQL statement data available for analysis in PostgreSQL, which "would have helped pinpoint the root cause" of the problem.

Had Amazon been better prepared, the outage might have been less severe. In one section of the document, the company said that due to a "lack of workarounds when the underlying PostgreSQL database encountered performance issues," the problem "resulted in a lengthy period of time to resolve." The document also claims that a "complete response plan or operations manual" would have helped "eliminate the impact more quickly."

“My guess is that Amazon just changed its database and didn’t test the specific load patterns that occurred on Amazon Prime Day and got a surprise,” said Henning Schulzrinne, a computer science professor at Columbia University who reviewed the document.

Amazon and Oracle have been engaged in a war of words in recent years as Amazon has expanded its software offerings to compete more directly with Oracle. CNBC reported in August that Amazon was working to migrate completely away from Oracle's database by the beginning of the year.

'It's really hard'

Oracle chairman and co-founder Larry Ellison doesn't believe Amazon's claims. During the company's December earnings call, Ellison said Amazon was "not moving away from Oracle." He reiterated his view at an event in August, saying: "I don't think Amazon can do that."

“Amazon has had 10 years to migrate off Oracle, and they’re still using Oracle,” he said. “It’s not easy for them to use their own technology. It’s not cost-effective to do that. I mean, it’s really hard.”

Patrick Moorhead, principal analyst at market research firm Moor Insights & Strategy, said the incident shows how difficult it is to migrate older applications, like those used in Amazon warehouses, away from Oracle, which has worked with the world's largest companies for decades.

"AWS Aurora is designed for forward-looking applications, while Oracle is designed for more traditional applications," he said.

Paper: Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases