Big Data 2019: Cloud redefines the database and Machine Learning runs it

On Jan 3, 2019

In the predictions game, it’s time for us to bat clean-up once more. Following Big on Data bro Andrew Brust’s roundup of AI-related predictions from a cross section of industry executives, now it’s our turn. We’ll focus mostly on what this all means to the database, a technology that after Y2K was thought to be entering its finished state.

In 2019, we view AI and the cloud as being the great disruptors.

Let’s paint the big picture first. At Ovum, we’ve long forecast that by 2019, half of all new Big Data workloads would run in the cloud. According to our latest data, that scenario is already being borne out, with roughly 45% of our survey respondents reporting that they run at least some Big Data workloads in the cloud.

The cloud is redefining the basic architectural assumptions about how to design databases and manage data. On-premises, it was all about threading the needle: sizing just enough capacity to be fully utilized, but not so much as to trigger software audits or incur excess license charges. And for Big Data, it was all about bringing compute to the data because the network overhead of moving all those terabytes was not considered particularly rational.

Enter the cloud: commodity infrastructure, cheapening storage, faster network interconnects, and most of all, virtually limitless scale. For database vendors, it was back to the drawing board, with changes such as separating storage from compute. Add some fuel to the fire: our belief that the best way to realize value from cloud database deployment is through managed Database-as-a-Service (DBaaS), where patches, upgrades, backups, failovers, and provisioning are handled by the cloud provider, not the DBA. And that sets us up for our first prediction, which, by the way, happens to be buzzword-compliant.

Self-driving databases using ML will proliferate

Cloud database providers will apply machine learning (ML) to make their DBaaS offerings self-running. Just over a year ago, Oracle kicked the door open, first with Autonomous Data Warehouse 18c, followed about six months later by Autonomous Transaction Database 18c. Don’t try this at home: Oracle only offers the autonomous database in its public cloud, where it, not the DBA, controls the environment.

Applying ML to database operation is a no-brainer for several reasons. First, database operations generate huge quantities of log data to feed the models. Secondly, database operation (especially in a managed cloud service) is a well-bounded problem that resists drift or scope creep. Finally, the legwork that ML automates, such as how to configure a database for different load patterns, or how to optimize queries, is work that, for the DBA, doesn’t add value.

Not surprisingly, the advent of autonomous databases created significant fears among DBAs about the security of their jobs. As we noted in our Oracle OpenWorld postmortem, the longest line that we saw for any breakout was the one for the DBA vs. Autonomous Database session. As we also noted in that piece, unless their employers are stupid, DBAs will still have jobs: you still need them to make strategic decisions on what the database will cover, design the schema, and set (and be accountable for) policies related to running and securing the database.

We expect that in 2019, more cloud database providers will follow Oracle’s lead. Employing ML to run the database will become a standard checkbox item for any DBaaS offering; we also expect a few database providers to differentiate from Oracle and apply some of these concepts to on-premises deployments.

Serverless becomes checkbox option

We also expect that serverless computing, which was first introduced with AWS Lambda to simplify application development by replacing manual server provisioning with autoscaling, will become increasingly widespread among cloud DBaaS services. In this scenario, DBAs specify upper and lower capacity thresholds, and the database then autoscales between them. Examples include Amazon DynamoDB, where serverless is core to the design, and Amazon Aurora, where serverless was recently introduced as an option for applications where spikes are infrequent or hard to predict. Google Cloud Firestore is also serverless; over the past year, MongoDB introduced its Stitch serverless offering for its Atlas cloud service.
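To make the threshold idea concrete, here is a minimal sketch of how an autoscaler might pick capacity between a DBA's lower and upper bounds. It is not how any particular service works: the function name `target_capacity`, the notion of a "capacity unit", and the `per_unit` throughput figure are all assumptions made for illustration.

```python
import math


def target_capacity(load: float, per_unit: float, lower: int, upper: int) -> int:
    """Return the capacity units needed for `load`, clamped to [lower, upper].

    `load` and `per_unit` are in the same hypothetical unit (say, requests
    per second); `lower` and `upper` are the thresholds the DBA specifies.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if per_unit <= 0:
        raise ValueError("per_unit must be positive")
    needed = math.ceil(load / per_unit)  # units required to absorb the load
    return max(lower, min(upper, needed))  # never leave the DBA's bounds


if __name__ == "__main__":
    # Idle, moderate, and spiking load against bounds of 2 and 16 units.
    for load in (0, 950, 5000):
        print(load, target_capacity(load, per_unit=100, lower=2, upper=16))
```

With these made-up numbers, an idle database sits at the lower bound, moderate load lands in between, and a spike is capped at the upper bound, which is also where the cost ceiling comes from.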

Serverless is not for all use cases; for instance, if your loads are predictable or steady, it will be more economical to reserve capacity. Nonetheless, demand from developers will make serverless an option for all cloud operational databases in 2019.

Distributed databases: Writes get respect

Another innovation made feasible with the cloud is the distributed database. This year, we will see the distributed database make writes first-class citizens on par with reads.

Let’s explain. Distributed databases didn’t start with the cloud – early examples included Clustrix (recently acquired by MariaDB), Aerospike, and NuoDB on the relational side, and NoSQL stalwarts like MongoDB, Couchbase, and Apache Cassandra. Of these players, MongoDB has been the big break-out, largely on account of its developer-friendliness that made its spread viral, although Cassandra has scored some big Internet names like Netflix.

But the cloud provided some unfair advantages for distributed databases. First, it eliminated the need for IT organizations to set up their own data centers and wide area backbones. Secondly, much of this data, such as logs, product catalogs, IoT data, and so on, already lived in the cloud. Last, but not least, the cloud added some unfair architectural advantages: cloud providers could natively engineer automated replication, smart storage, and automated scaling into their platforms.

So, what does this all have to do with write and read performance? Most distributed databases have operated with master/slave architectures with centralized master nodes for committing writes or updates, surrounded by read-only replicas that could be geographically distributed. That made reads, which could be performed on any local replica, much faster than writes.
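The asymmetry can be shown with a toy model. The sketch below assumes a two-region cluster with one write master and one read-only replica; the `Node` class, the region names, and the round-trip times in `RTT_MS` are invented for illustration and do not describe any real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    region: str
    is_master: bool


# Hypothetical round-trip times in milliseconds between client and node regions.
RTT_MS = {
    ("eu", "eu"): 2, ("eu", "us"): 90,
    ("us", "eu"): 90, ("us", "us"): 2,
}


def read_latency(client_region: str, nodes: list) -> int:
    """Reads can be served by any replica, so the nearest node wins."""
    return min(RTT_MS[(client_region, n.region)] for n in nodes)


def write_latency(client_region: str, nodes: list) -> int:
    """Writes must travel to the single master, wherever it lives."""
    master = next(n for n in nodes if n.is_master)
    return RTT_MS[(client_region, master.region)]


if __name__ == "__main__":
    cluster = [Node("us", is_master=True), Node("eu", is_master=False)]
    print("EU read :", read_latency("eu", cluster), "ms")
    print("EU write:", write_latency("eu", cluster), "ms")
```

In this model a European client reads from its local replica but pays the transatlantic round trip on every write, which is the bottleneck the newer approaches try to remove.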

We are already seeing new approaches to overcoming the write bottlenecks on globally distributed databases. These include multi-master, which allows local nodes to be declared write masters for specific transactions, and consensus algorithms, which poll nodes to designate the write master. Amazon Aurora and DynamoDB; Google Cloud Spanner; Microsoft Azure Cosmos DB; and CockroachDB already support these capabilities (or offer them in beta), but with the exception of Cloud Spanner and Cosmos DB, these capabilities are only supported within a region, not across regions. In 2019, we expect that multi-region support will grow more common.

A related development, brought on by data privacy regulations like GDPR and by local mandates in many nations requiring data to stay within the country of origin, will be the growing role of sharding the database to have local or regional masters. This practice will become more widespread.
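As a rough sketch of what residency-driven sharding implies for an application, the snippet below routes each write to a regional master based on the record's country of origin. The shard names, the `SHARD_FOR_COUNTRY` mapping, and the `route_write` function are hypothetical; a real system would derive the mapping from its own compliance rules.

```python
# Hypothetical mapping from country of origin to the regional write master.
SHARD_FOR_COUNTRY = {
    "DE": "eu-shard",
    "FR": "eu-shard",
    "US": "us-shard",
}


def route_write(record: dict) -> str:
    """Return the shard that must accept this record's write.

    Fails loudly for an unmapped country instead of silently falling back
    to a default shard, since a fallback could move data across borders.
    """
    country = record["country"]
    try:
        return SHARD_FOR_COUNTRY[country]
    except KeyError:
        raise ValueError(f"no shard configured for country {country!r}") from None


if __name__ == "__main__":
    print(route_write({"id": 1, "country": "DE"}))
    print(route_write({"id": 2, "country": "US"}))
```

The design choice worth noting is the refusal to guess: under a residency mandate, rejecting a write is safer than storing it in the wrong jurisdiction.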

George Anadiotis gets vindicated: The stars finally align for graph databases

OK, you’ve probably heard more than your fill from my Big on Data bro George Anadiotis, who has performed yeoman duty educating the market on graph databases. He has done the deep dive on knowledge graphs, introduced us to new graph database players, enlightened us on graph query languages, and ventured the insane notion that graphs could represent the web as a database.

As Anadiotis put it about 18 months ago, “Graph technology is well on its way from a fringe domain to going mainstream.” Well, back in early 2017, that statement was a bit premature.

The business problems that graph databases address are quite straightforward. Deciphering the patterns of influence on social networks so leading brands can identify and cultivate opinion leaders; mapping and optimizing the intricacies of supply chain operations; understanding the propagation of cyberthreats: those are just a few examples of real-world problems that all have one thing in common. They are characterized by many-to-many relationships that are not easily represented by relational databases. The challenge is that, as databases, graphs are unfamiliar. They have lacked the advantage of decades of knowledge in building relational schema, the simplicity of key-value structures, or the existing knowledge base of JSON documents that came from the JavaScript community. And until recently, graph lacked consensus standards around which a critical mass of skills could develop.
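For readers new to the model, the social influence case can be sketched in a few lines of plain Python with no graph database involved. The `INFLUENCES` data and the `reachable` helper are made up for illustration; the point is only that the question "whom does this person ultimately influence?" is a traversal over relationships rather than a lookup in a table.

```python
from collections import deque

# Hypothetical "influences" edges: each person maps to the people they influence.
INFLUENCES = {
    "di": {"ana"},
    "ana": {"bo", "cy"},
    "bo": {"cy"},
    "cy": set(),
}


def reachable(graph: dict, start: str) -> set:
    """Breadth-first traversal: everyone `start` influences, directly or not."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)  # a person does not count as influencing themselves
    return seen


if __name__ == "__main__":
    for person in INFLUENCES:
        print(person, "->", sorted(reachable(INFLUENCES, person)))
```

In a relational schema the same question typically needs a join table plus a recursive query, which is part of why these many-to-many problems feel awkward there.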

What’s changed over the past year is growing acceptance of de facto standards, such as the Apache TinkerPop framework and the associated Gremlin query language, which provide a common target for developers. We’re also seeing competition from Neo4j and TigerGraph, which are introducing their own query language variants that are more SQL-like. And we’re seeing the cloud giants enter the field, with Amazon introducing Neptune, while Microsoft’s Azure Cosmos DB includes graph as one of its family of supported data models. But as necessity is the mother of invention, in 2019, we expect Customer 360, IoT applications, and cybersecurity to be the drivers of demand for graph databases, which are now more accessible than ever.