
Databricks Architecture Overview: From Data to AI

For companies, data is not just something to store, but an important asset for analysis, decision-making, and the use of AI. However, many workplaces face challenges such as "data is scattered both inside and outside the company," "data is inconsistent in format and quality," and "data is simply stored but not utilized."

To solve these challenges, Databricks' data intelligence platform covers the entire process of "collection and integration → processing → utilization → governance and operation" on a single platform. A unique feature is that data is not stored internally in Databricks, but can be handled directly on external storage such as S3 or Azure Data Lake Storage (ADLS) Gen2. Furthermore, by using open formats such as Delta Lake and Apache Iceberg, data can be used over the long term while avoiding vendor lock-in.

In this article, we will organize Databricks' architecture into four phases and clearly explain the role and main functions of each.

1. Data collection and integration

Role:

Databricks collects data scattered across internal core systems, cloud services, IoT devices, and other sources, and integrates it into a form that can be used for analytics and AI. Because Databricks can reference data on external storage directly, there is no need to move large volumes of data first; you can start using it immediately.
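
As a minimal sketch of this direct reference, the snippet below assumes a Databricks notebook (where a SparkSession named `spark` is predefined) and a hypothetical S3 bucket:

```python
# Minimal sketch: referencing data on external storage directly from a
# Databricks notebook, where the SparkSession `spark` is predefined.
# The bucket and paths are hypothetical placeholders.

# Read a Delta table stored on S3 without copying it into Databricks.
orders = spark.read.format("delta").load("s3://example-bucket/raw/orders")

# Raw files (JSON, CSV, Parquet, ...) on external storage work the same way.
events = spark.read.format("json").load("s3://example-bucket/raw/events/")

orders.show(5)
```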

Representative functions:

  • Lakeflow Connect / Lakehouse Federation
    • Lakeflow Connect (managed connector): Connects to a variety of sources, including Oracle, SQL Server, Salesforce, ServiceNow, S3/ADLS/GCS, and GA4, and supports scheduled execution and incremental ingestion.
    • Lakehouse Federation (query federation): Pushes query execution down to external databases and warehouses so their tables can be queried directly, without copying the data (see the sketch below).
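
As a hedged illustration, assume a foreign catalog named `postgres_prod` (a hypothetical name) has already been registered in Unity Catalog via Lakehouse Federation; external tables can then be queried in place:

```python
# Minimal sketch of Lakehouse Federation: `postgres_prod` is a hypothetical
# foreign catalog registered in Unity Catalog. The query runs against the
# external database (with pushdown) and no data is copied into Databricks.
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM postgres_prod.sales.orders
    GROUP BY customer_id
    ORDER BY total_amount DESC
    LIMIT 10
""")
top_customers.show()
```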

2. Data processing

Role:

Collected data is rarely suitable for analytics or AI as-is: it arrives in formats that are hard for humans to interpret and is full of noise and inconsistencies. In this phase, the data is cleansed (unnecessary information removed), transformed, and integrated to raise its quality. Databricks uses Delta Lake to provide ACID transactions and version control, making it easy to prevent erroneous updates and to restore past states of the data, as sketched below.
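
For instance, Delta Lake's version history ("time travel") lets you read or restore an earlier state of a table. A minimal sketch, with a hypothetical table name and version number:

```python
# Minimal sketch of Delta Lake time travel; the table name and version
# number are hypothetical placeholders.

# Read the current state of the table.
current = spark.read.table("main.silver.customers")

# Read the table as it existed at an earlier version,
# e.g. to inspect data from before an erroneous update.
previous = spark.read.option("versionAsOf", 3).table("main.silver.customers")

# Delta also supports restoring a table to a past version via SQL.
spark.sql("RESTORE TABLE main.silver.customers TO VERSION AS OF 3")
```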

Representative functions:

  • Spark/Photon: While keeping the familiar Apache Spark APIs, you can speed up processing simply by switching the execution engine to Databricks' own engine, Photon. SQL- and Delta-centric workloads such as ETL pre-processing and dashboard or data mart generation then finish in less time and can run on smaller clusters, completing equivalent work at lower cost.
  • Medallion architecture: A best practice for data processing that improves quality in stages, from bronze to silver to gold (a minimal example follows this list).
  • Delta Lake: Its open format ensures high compatibility with other tools, and its ACID transaction support maintains consistency even when multiple users or jobs access the same data simultaneously.
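
A minimal sketch of one medallion step (bronze → silver), written out as a Delta table; all table and column names are hypothetical placeholders:

```python
from pyspark.sql import functions as F

# Minimal sketch of a bronze -> silver medallion step; table and column
# names are hypothetical placeholders.

# Bronze: raw data as ingested, kept as-is so it can be reprocessed.
bronze = spark.read.table("main.bronze.orders_raw")

# Silver: cleansed and conformed data.
silver = (
    bronze
    .dropDuplicates(["order_id"])                     # remove duplicate rows
    .filter(F.col("amount").isNotNull())              # drop incomplete records
    .withColumn("order_date", F.to_date("order_ts"))  # normalize timestamps
)

# Writing in Delta format provides ACID guarantees under concurrent access.
silver.write.format("delta").mode("overwrite").saveAsTable("main.silver.orders")
```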

3. Data visualization and AI utilization

Role:

High-quality data that has been processed and formatted creates real business value through analysis, visualization, and deployment in applications.

In this phase, various uses are possible, such as decision support using BI tools, prediction using machine learning models, and automation using generative AI.

Databricks offers features that are approachable not only for engineers but also for business users in the field, and it keeps expanding functionality that accelerates the democratization of data. Recently it has also added AI assistant features such as Genie and Agent Bricks.

Representative functions:

  • Agent Bricks: AI agents that support business processes
  • Databricks Apps: A platform for building and deploying internal data and AI applications directly on Databricks
  • Genie: A generative AI assistant that supports analysis and reporting in a conversational manner
  • MLflow: Centralized experiment tracking, model management, and operation of AI models (see the sketch below)
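
As a small hedged sketch of MLflow's tracking API, the run below records parameters and metrics; the experiment path, parameter, and metric values are hypothetical:

```python
import mlflow

# Minimal sketch of MLflow experiment tracking; the experiment path,
# parameter, and metric values are hypothetical placeholders.
mlflow.set_experiment("/Shared/churn-model")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("auc", 0.87)
    # A trained model can be logged alongside the run, e.g. with
    # mlflow.sklearn.log_model(model, "model"), then registered and
    # served from the model registry.
```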

4. Governance and Operational Management

Role:

As data utilization expands, access permissions, sharing methods, and operational processes become increasingly important. This phase centralizes the management, protection, and operation of all data assets, building a foundation that delivers both safety and efficiency. Eliminating inadequate permissions and opaque sharing enables secure, scalable data utilization.

Representative functions:

  • Unity Catalog: Centrally manages all data assets and enables fine-grained permission control and auditing (see the sketch after this list).
  • Delta Sharing: Secure data sharing with other clouds and external organizations.
  • Workflows: Automate and schedule data processing and AI jobs.
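
For example, Unity Catalog privileges can be granted with SQL from a notebook; a minimal sketch with hypothetical catalog, schema, table, and group names:

```python
# Minimal sketch of Unity Catalog permission control from a notebook;
# the catalog, schema, table, and group names are hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.silver TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.silver.orders TO `analysts`")

# Access events are logged, so grants like these can be audited centrally.
```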

Summary

Databricks is a platform that combines the flexibility of handling data directly on external storage with the avoidance of vendor lock-in through support for open formats. It provides a seamless process from collection to processing, AI utilization, and governance, and combines natural language and no-code functionality that even beginners can use with advanced functionality for engineers. This accelerates the use of data and AI across the entire company, forming the foundation for long-term competitiveness.
