Databricks unveils innovations for Data Lakehouse Platform

Advanced data warehousing and data governance capabilities highlight the future of the modern data stack.

  • Tuesday, 28th June 2022, by Phil Alsop

Databricks has unveiled the evolution of the Databricks Lakehouse Platform to a sold-out crowd at the annual Data + AI Summit in San Francisco. New capabilities revealed include best-in-class data warehousing performance and functionality, expanded data governance, new data sharing innovations including an analytics marketplace and data cleanrooms for secure data collaboration, automatic cost optimisation for ETL operations, and machine learning (ML) lifecycle improvements.

 

“Our customers want to be able to do business intelligence, AI, and machine learning on one platform, where their data already resides. This requires best-in-class data warehousing capabilities that can run directly on their data lake. Benchmarking ourselves against the highest standards, we have proven time and again that the Databricks Lakehouse Platform gives data teams the best of both worlds on a simple, open, and multi-cloud platform,” said Ali Ghodsi, Co-founder and CEO of Databricks. “Today’s announcements are a significant step forward in advancing our Lakehouse vision, as we are making it faster and easier than ever to maximise the value of data, both within and across companies.”

 

The Best Data Warehouse is the Lakehouse 

Organisations like Amgen, AT&T, Northwestern Mutual and Walgreens are making the move to the lakehouse because of its ability to deliver analytics on both structured and unstructured data. Today, Databricks unveiled new data warehousing capabilities in its platform to further enhance analytics workloads:

  • Databricks SQL Serverless, available in preview on AWS, provides instant, secure, and fully managed elastic compute for improved performance at a lower cost.

  • Photon, the record-setting query engine for lakehouse systems, will be generally available on Databricks Workspaces in the coming weeks, further expanding Photon’s reach across the platform. In the two years since Photon was announced, it has processed exabytes of data, run billions of queries, and delivered benchmark-setting price/performance up to 12x better than traditional cloud data warehouses.

  • Open source connectors for Go, Node.js, and Python now make it even simpler to access the lakehouse from operational applications.

  • Databricks SQL CLI now enables developers and analysts to run queries directly from their local computers.

  • Databricks SQL now provides query federation, offering the ability to query remote data sources including PostgreSQL, MySQL, AWS Redshift, and others without the need to first extract and load the data from the source systems.

 

Data Governance Highlighted as a Top Priority with Advanced Capability for Unity Catalog

Unity Catalog, generally available on AWS and Azure in the coming weeks, offers a centralised governance solution for all data and AI assets, with built-in search and discovery, automated lineage for all workloads, and the performance and scalability needed for a lakehouse on any cloud. Databricks also introduced data lineage for Unity Catalog earlier this month, significantly expanding data governance capabilities on the lakehouse and giving businesses a complete view of the entire data lifecycle. With data lineage, customers gain visibility into where data in their lakehouse came from, who created it and when, how it has been modified over time, how it’s being used across data warehousing and data science workloads, and much more.

 

Enhanced Data Sharing Enabled By Databricks Marketplace and Cleanrooms 

As the first marketplace for all data and AI, available in the coming months, Databricks Marketplace provides an open marketplace to package and distribute data and analytics assets. Going beyond marketplaces that simply offer datasets, Databricks Marketplace enables data providers to securely package and monetise a host of assets such as data tables, files, machine learning models, notebooks and analytics dashboards. Data consumers can easily discover new data and AI assets, jumpstart their analysis and gain insights and value from data faster. For example, instead of acquiring access to a dataset and investing their own time to develop and maintain dashboards to report on it, they can choose to simply subscribe to pre-existing dashboards that already provide the necessary analytics. Databricks Marketplace is powered by Delta Sharing, allowing data providers to share their data without having to move or replicate the data from their cloud storage. This allows providers to deliver data to other clouds, tools, and platforms from a single source.

 

Databricks is also helping customers share and collaborate with data across organisational boundaries. Cleanrooms, available in the coming months, will provide a way to share and join data across organisations with a secure, hosted environment and no data replication required. In the context of media and advertising, for example, two companies may want to understand audience overlap and campaign reach. Existing cleanroom solutions have limitations, as they are commonly restricted to SQL tools and run the risk of data duplication across multiple platforms. With Cleanrooms, organisations can easily collaborate with customers and partners on any cloud and provide them the flexibility to run complex computations and workloads using both SQL and data science-based tools - including Python, R, and Scala - with consistent data privacy controls.
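The audience-overlap example above can be made concrete with a small sketch. This is an illustration only, not how Databricks Cleanrooms is implemented: the function names, the shared salt, and the minimum group size are all hypothetical, and a salted hash on its own is not a strong privacy guarantee. The idea shown is simply that each party contributes pseudonymised identifiers and only aggregate figures leave the computation.

```python
import hashlib


def pseudonymise(identifiers, shared_salt):
    """Hash normalised identifiers with a salt both parties agreed on (hypothetical scheme)."""
    return {
        hashlib.sha256((shared_salt + ident.strip().lower()).encode("utf-8")).hexdigest()
        for ident in identifiers
    }


def audience_overlap(hashed_a, hashed_b, min_group_size=2):
    """Return aggregate overlap figures only; suppress results for very small groups."""
    overlap = len(hashed_a & hashed_b)
    if overlap < min_group_size:
        return None  # too few matches to report without risking re-identification
    return {
        "overlap": overlap,
        "share_of_a": overlap / len(hashed_a),
        "share_of_b": overlap / len(hashed_b),
    }


# Two companies each pseudonymise their own audience before comparison.
advertiser = pseudonymise(["a@x.com", "B@x.com ", "c@x.com"], "agreed-salt")
publisher = pseudonymise(["b@x.com", "c@x.com", "d@x.com", "e@x.com"], "agreed-salt")
report = audience_overlap(advertiser, publisher)
```

Neither side sees the other's raw identifiers in this sketch; the report carries only the match count and the share of each audience it represents.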

 

MLflow 2.0 Streamlines and Accelerates Production Machine Learning at Scale

Databricks continues to lead the way in MLOps innovation with the introduction of MLflow 2.0. Getting a machine learning pipeline into production requires setting up infrastructure, not just writing code. This can be difficult for new users and tedious for everyone at scale. MLflow Pipelines, made possible by MLflow 2.0, now handles the operational details for users. Instead of setting up orchestration of notebooks, users can simply define the elements of the pipeline in a configuration file and MLflow Pipelines manages execution automatically. Looking beyond MLflow, Databricks also added Serverless Model Endpoints to directly support production model hosting, as well as built-in Model Monitoring dashboards to help teams analyse real-world model performance.

 

Delta Live Tables Includes Industry-First Performance Optimiser for Data Engineering Pipelines

Delta Live Tables (DLT) is the first ETL framework to use a simple, declarative approach to building reliable data pipelines. Since DLT's launch earlier this year, Databricks has continued to expand it with new capabilities, including a new performance optimisation layer designed to speed up execution and reduce the cost of ETL. Additionally, new Enhanced Autoscaling is purpose-built to intelligently scale resources with the fluctuations of streaming workloads, and Change Data Capture (CDC) for Slowly Changing Dimensions - Type 2 easily tracks every change in source data for both compliance and machine learning experimentation purposes.
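For readers unfamiliar with Slowly Changing Dimensions Type 2, the following is a minimal sketch of the general technique in plain Python. It does not use or represent the Delta Live Tables API; the class, function names, and integer sequence numbers are assumptions chosen for illustration. The point it shows is that a change never overwrites a row: the current version is closed and a new version is appended, so every past state remains queryable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VersionedRow:
    key: str
    value: str
    valid_from: int           # sequence number at which this version became current
    valid_to: Optional[int]   # None means this version is still current


def apply_change(history: List[VersionedRow], key: str, value: str, seq: int) -> None:
    """Record a change SCD Type 2 style: close the open version, then append a new one."""
    for row in history:
        if row.key == key and row.valid_to is None:
            if row.value == value:
                return  # nothing actually changed, keep the open version
            row.valid_to = seq
    history.append(VersionedRow(key, value, seq, None))


def current_value(history: List[VersionedRow], key: str) -> Optional[str]:
    """Return the value of the still-open version for a key, if any."""
    for row in history:
        if row.key == key and row.valid_to is None:
            return row.value
    return None


def value_as_of(history: List[VersionedRow], key: str, seq: int) -> Optional[str]:
    """Return the value that was current at a given sequence number."""
    for row in history:
        still_open = row.valid_to is None or seq < row.valid_to
        if row.key == key and row.valid_from <= seq and still_open:
            return row.value
    return None


history: List[VersionedRow] = []
apply_change(history, "cust-1", "London", 1)
apply_change(history, "cust-1", "Paris", 5)
```

After these two changes the history holds both versions: `current_value(history, "cust-1")` gives the Paris record, while `value_as_of(history, "cust-1", 3)` still recovers London, which is the property that makes the pattern useful for audits and for reproducing past training data.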