Machine learning operations don’t belong with cloudops

On Aug 24, 2019

It’s Monday morning, and after a long weekend of system trouble, the cloud operations team is discussing what happened. Several systems tied to a new, very advanced inventory management system that uses machine learning had issues over the weekend. The postmortem concluded the following:

The batch process that moved raw data from the operational database to the training database failed, and so did the automatic recovery process. An ops team member who was working over the weekend attempted to resubmit the job but caused not one but four partial updates, which left the training database in an unstable state.

As a result, the knowledge models in the machine learning systems trained on bad data. The new information had to be removed from the knowledge base and the models rebuilt.
Several outside data feeds, such as pricing and tax data, were loaded into the training database at the same time. Although those loads worked fine, they too had to be backed out of the knowledge base because the operational data was not in a good state.
The system was unavailable for two days, and the company lost $4 million, counting lost productivity, customer reactions, and PR issues.
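
The partial updates in the first finding are the kind of failure that an all-or-nothing, idempotent load is meant to prevent. What follows is a minimal sketch of that idea, not the system described in the postmortem: it assumes a SQL training database (SQLite stands in here), and the table names `applied_batches` and `training_rows`, the columns, and the function names are all hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create the hypothetical ledger and training tables used in this sketch."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS applied_batches (batch_id TEXT PRIMARY KEY)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS training_rows ("
            "batch_id TEXT NOT NULL, sku TEXT NOT NULL, qty INTEGER NOT NULL)"
        )


def load_batch(conn, batch_id, rows):
    """Load one batch all-or-nothing.

    Returns True if the batch was applied, False if it had already been
    applied (so a resubmission is a harmless no-op). Any error while
    inserting rolls back the whole batch, including its ledger entry.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        already = conn.execute(
            "SELECT 1 FROM applied_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if already:
            return False
        conn.execute(
            "INSERT INTO applied_batches (batch_id) VALUES (?)", (batch_id,)
        )
        conn.executemany(
            "INSERT INTO training_rows (batch_id, sku, qty) VALUES (?, ?, ?)",
            [(batch_id, sku, qty) for sku, qty in rows],
        )
    return True
```

With a ledger like this, an operator who resubmits a batch that already succeeded changes nothing, and a batch that fails partway leaves no rows behind, so the training database never holds a partial update. A production pipeline would need more than this, such as validation of the rows themselves before models train on them.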

This is not 2025; this is today. As enterprises find more uses for “cheap and good” cloud-based machine learning systems, we’re finding that the systems that leverage machine learning are complex to operate. Ops teams do not expect this degree of difficulty and complexity, and they are finding that they are undertrained, understaffed, and underfunded.

The assumption was that cloud operations teams could transition fairly easily to handling cloud-based databases, cloud-based storage, and cloud-based compute. For the most part that’s been the case, because cloud-based systems are similar to traditional systems.

However, most operations teams have not yet seen systems based on machine learning. These systems have specialized purposes, as well as specialized components such as databases and knowledge engines, that have to be monitored and managed in particular ways. This is where current operations teams are failing.

The fix is pretty easy to understand, but most enterprises are not going to like it: it means either spending more on ML cloudops or abandoning ML cloudops. Machine learning systems are technological chainsaws. If used carefully, they are highly effective. If mishandled, they can be dangerous. Failures can go undetected, and if the system automatically uses the resulting bad knowledge, you could end up with huge issues that are not discovered until much damage is done. More risk than reward, it seems.