Microsoft ‘Builds’ its data story, in the cloud and at the edge
Microsoft Build (stylized as “//build/”), Redmond’s annual developer-focused conference, kicks off today in Seattle. There’s tons of news coming out of this event; my ZDNet colleagues Mary Jo Foley and Ed Bott will have plenty to say, and Larry Dignan, our Editor in Chief, will have Build news in abundance as well.
Meanwhile, as your trusty data guy who is also a Microsoft Data Platform MVP, I’ve got coverage of the data side of things here. I’ll have another post covering Microsoft’s .NET-based machine learning technology too, but this one is all about database technology on the Azure cloud (and edge) platform.
Perhaps the biggest news comes from the good folks on the Azure Cosmos DB team, who are nominally announcing support for a couple of new APIs. But peel away the first layer of the Cosmos onion and you’ll find that headline understates things by quite a lot. That’s because one of these APIs is the Spark API and, in fact, what this comes down to is the full-on addition of Apache Spark to the Cosmos DB platform.
You read that right: Cosmos DB will now provide a full implementation of Apache Spark on-board the Cosmos DB infrastructure, with the ability to query data in Cosmos DB containers and bring the results back as Spark DataFrames. This allows developers to perform both Spark-based data engineering and machine learning workloads on Cosmos DB data, without the need to fire up a dedicated Spark cluster. Instead, all that’s necessary is to select Spark as one of the supported APIs when provisioning the Cosmos DB database.
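To make the "query a container, get a DataFrame back" idea concrete, here's a minimal sketch. Big caveat: I haven't seen the API surface of the built-in Spark preview, so this borrows the option names of the existing open-source azure-cosmosdb-spark connector as a stand-in. The helper functions, account, database and container names are all hypothetical, and `spark` is assumed to be a SparkSession that your notebook environment already provides.

```python
# Sketch only: option names come from the open-source azure-cosmosdb-spark
# connector, used here as a stand-in. The built-in Spark API may differ.
COSMOS_FORMAT = "com.microsoft.azure.cosmosdb.spark"


def build_read_config(endpoint, master_key, database, container, query=None):
    """Assemble connector options for reading one Cosmos DB container."""
    config = {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": container,  # the connector uses the older "collection" term
    }
    if query is not None:
        config["query_custom"] = query  # SQL query evaluated by Cosmos DB
    return config


def read_container(spark, config):
    """Return the container (or query result) as a Spark DataFrame.

    `spark` is an existing SparkSession, e.g. the one a notebook provides.
    """
    return spark.read.format(COSMOS_FORMAT).options(**config).load()


# Hypothetical account, database and container names.
config = build_read_config(
    endpoint="https://example-account.documents.azure.com:443/",
    master_key="<your-key>",
    database="telemetry",
    container="readings",
    query="SELECT c.deviceId, c.temperature FROM c WHERE c.temperature > 30",
)
```

With a live session, `read_container(spark, config)` would hand back a DataFrame you could feed into ordinary Spark data engineering or machine learning code.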
The Cosmos DB implementation of Spark will inherit its host’s global distribution properties, as well as its multi-master personality. So the Spark code can write to any local replica and be assured those writes will propagate across the Cosmos DB partitions, worldwide. Spark jobs will also inherit Cosmos DB’s “five 9’s” (99.999%) service-level agreement (SLA) support.
Of course, lots of Spark code is written in notebook environments, and you might be concerned that Cosmos DB’s special implementation of Spark might not support this. But, as it turns out, the Cosmos DB team is also announcing support for Jupyter notebooks from within the Cosmos Explorer interface, be it within the Azure portal (as shown in the figure above) or in that tool’s full-screen experience at https://cosmos.azure.com.
The especially interesting part of this is that although Cosmos DB’s Jupyter notebook support nicely accommodates Spark developers, it works in non-Spark contexts, too. So, for example, you could fire up a Cosmos DB notebook and use it just to run SQL queries against the data in your Cosmos DB containers. And all of this – including the Jupyter and Spark feature sets – is essentially serverless. From what I can tell, you don’t have to worry about special servers, clusters or even cloud storage.
And speaking of not having to worry about cloud storage, Cosmos DB is also adding support for the etcd API, which allows Cosmos DB to act as a stateful storage resource for various distributed systems, likely foremost among them Kubernetes clusters. Now your k8s clusters can persist their state across container/pod/node lifecycles and do it without discrete cloud storage provisioning. Azure Kubernetes Service (AKS) users can take advantage of this capability by signing up for the Cosmos DB etcd API preview and then setting up the AKS Engine with the Azure Cosmos DB etcd API.
The Spark, Jupyter and etcd support on Cosmos DB are all being launched as public previews.
There’s more too – including the GA (general availability) release of Cosmos DB’s .NET v3 API, which is compatible with .NET Standard 2 and, thereby, the cross-platform .NET Core. Also cross-platform, the Table .NET Standard SDK has gone GA as well. On the Java side, there’s also a release of a v3 API, and of a Cosmos DB change feed processor. To cap it all off, there’s now Azure Active Directory authentication support for the full-screen Cosmos Explorer tooling.
Azure SQL Data Warehouse (SQL DW), Microsoft’s SQL Server engine-based, columnar, massively parallel processing (MPP) cloud DW service, is adding several new features, all designed to make queries go faster.
This includes Workload Management Importance, which enables prioritization for important users, applications and queries – so, for example, the CEO can get her dashboards with consistently quick performance. There’s also Result Set Caching, which ensures that important queries, once run, will return their data almost instantaneously on subsequent requests (that’s great for dashboards too). Materialized Views are another query optimization tool that allows the results of views against tables to be physically persisted and re-used, even when users query the physical tables those views are based on.
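Result set caching is easy to picture with a toy model. The sketch below is not how SQL DW implements it; it uses SQLite from Python's standard library, and the `CachingConnection` class and its method names are hypothetical. It only illustrates the general idea: a repeated, identical query is answered from stored results, and any change to the data throws the cached results away.

```python
import sqlite3


class CachingConnection:
    """Toy result set cache: reuses results for repeated identical queries."""

    def __init__(self, conn):
        self.conn = conn
        self._cache = {}
        self.hits = 0

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self._cache:
            self.hits += 1  # served from cache, the query is not re-run
            return self._cache[key]
        rows = self.conn.execute(sql, params).fetchall()
        self._cache[key] = rows
        return rows

    def execute_write(self, sql, params=()):
        self.conn.execute(sql, params)
        self.conn.commit()
        self._cache.clear()  # any data change invalidates cached results


db = CachingConnection(sqlite3.connect(":memory:"))
db.execute_write("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.execute_write(
    "INSERT INTO sales VALUES ('east', 10), ('west', 20), ('east', 5)"
)

dashboard_sql = (
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
)
first = db.query(dashboard_sql)   # runs against the table
second = db.query(dashboard_sql)  # answered from the cache
```

The second dashboard request never touches the table, which is the behavior that makes this kind of feature attractive for repeatedly refreshed dashboards.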
Dynamic Data Masking and support for querying data in JSON format (data security/privacy and semi-structured data query features, respectively), which were already supported on SQL Server and Azure SQL Database (SQL DB), are now supported on SQL DW too. Auto-Update Statistics, another feature imported from the other Microsoft SQL platforms, comes along for the ride, providing SQL DW’s query optimizer with the up-to-date info needed to create the best query plan.
Last, but hardly least, among the new features is something called Ordered Columnstore indexes. I already mentioned that SQL DW is a columnar data warehouse platform, of course. But now, the clustered columnstore indexes that implement that functionality won’t just segregate individual columns’ data – they’ll sequence the data within those columnar segments by an optional ordering expression, provided upon index creation.
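Why would ordering help? One plausible reason is segment elimination: if each columnar segment records the minimum and maximum value it holds, a range query can skip any segment whose range can't match. Here's a small sketch of that effect. The function names, segment size and data are my own illustrative assumptions, not SQL DW internals.

```python
import random


def build_segments(values, segment_size):
    """Split values into fixed-size segments and record each one's min/max."""
    segments = []
    for start in range(0, len(values), segment_size):
        chunk = values[start:start + segment_size]
        segments.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return segments


def segments_to_scan(segments, low, high):
    """Count segments whose [min, max] range overlaps the predicate [low, high]."""
    return sum(1 for s in segments if s["max"] >= low and s["min"] <= high)


random.seed(0)
values = list(range(10_000))
random.shuffle(values)  # rows arrive in no particular order

unordered = build_segments(values, segment_size=1_000)
ordered = build_segments(sorted(values), segment_size=1_000)

# Range predicate: WHERE value BETWEEN 2000 AND 2999
scans_unordered = segments_to_scan(unordered, 2_000, 2_999)
scans_ordered = segments_to_scan(ordered, 2_000, 2_999)
```

With the data sorted, only one of the ten segments overlaps the range, while shuffled data typically spreads matching values across many segments, forcing far more of them to be read.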
Microsoft says these new SQL DW features are being launched in public preview.
Hyperscale, serverless and edge, oh my
Want more cloud database goodness? How about Hyperscale functionality, allowing essentially boundless, and independent, scale-out of compute, storage and memory for individual databases on the Azure SQL DB platform, as well as on the Azure Database for PostgreSQL side? SQL DB’s Hyperscale implementation is now GA. The Postgres implementation, which is apparently based on Citus Data’s technology, launches in public preview.
Hyperscale allows for the ultimate in elasticity – as data volumes or user load increase, additional infrastructure becomes available. Now, what if you just want to define an Azure SQL DB database, query it at whatever frequency you may need and be charged just for the query activity that takes place (via automatic pause and resume of provisioned compute resources)? In that case, you’d want a serverless implementation of SQL DB, and Microsoft is announcing that, as a public preview for single databases, at Build, too.
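To get a feel for what automatic pause and resume means for the bill, here's a deliberately simplified model. It assumes compute resumes when a query arrives and stays up (and billable) until a fixed idle period passes with no activity. The function name and the `auto_pause_delay` parameter are hypothetical, and the sketch ignores however the real service meters compute while the database is running.

```python
def billable_seconds(query_times, auto_pause_delay):
    """Return seconds of compute billed for queries arriving at `query_times`.

    Simplified model: compute resumes on a query and stays up (and billed)
    until `auto_pause_delay` seconds pass with no further activity.
    """
    total = 0
    window_start = window_end = None
    for t in sorted(query_times):
        if window_end is None or t > window_end:
            # The database was paused: close the previous window and resume.
            if window_end is not None:
                total += window_end - window_start
            window_start = t
        # Every query pushes the pause deadline further out.
        window_end = t + auto_pause_delay
    if window_end is not None:
        total += window_end - window_start
    return total


# Queries at t=0s, t=600s and t=10000s with a one-hour idle delay.
billed = billable_seconds([0, 600, 10_000], auto_pause_delay=3_600)
```

In this example the first two queries share one active window of 4,200 seconds, the database pauses, and the third query opens a second window of 3,600 seconds, for 7,800 billable seconds instead of a clock that never stops running.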
There’s one more important development, somewhat stealthily announced last week, regarding the “edge” – that is, the processing of data in remote, sometimes-disconnected environments. Microsoft’s bringing Azure SQL DB to bear here too with a new product called, you guessed it, Azure SQL Database Edge, launching in private preview. Essentially, this product provides Azure SQL DB engine-based technology that will run on both x64 (Intel, AMD) and ARM processors, and provides data streaming capabilities as well, using an edge implementation of Azure Stream Analytics technology. Time series storage and querying functionality will be added in subsequent releases.
Had enough data news for one conference? Don’t worry; I’m done. But be warned: even though I hit most of the big items, there’s even more Microsoft data news than I’ve covered here. We shouldn’t be surprised by this: as Microsoft orients itself as a cloud and AI company, data and analytics will be right in the company’s sweet spot. And as Microsoft keeps upping its data game, you can certainly expect that Amazon, Google, Oracle and the rest of the vendors in the analytics ecosystem will do likewise.