Architecting an Apache Iceberg Lakehouse

A scalable, open-source data platform

Alex Merced

MEAP began July 2025
Last updated December 2025
Publication in April 2026 (estimated)

ISBN 9781633435100
358 pages (estimated)

Included with a Manning Online subscription

printed in black & white


table of contents

PART 1: THE VALUE OF THE APACHE ICEBERG LAKEHOUSE

1 The world of the Apache Iceberg Lakehouse

1.1 What is a data lakehouse

1.1.1 The rise of data warehouses

1.1.2 The move to cloud data warehouses

1.1.3 The data lake and the Hadoop era

1.1.4 Apache Iceberg: The key to the data lakehouse

1.1.5 The data lakehouse: the best of both worlds

1.2 What is Apache Iceberg?

1.2.1 The need for a table format

1.2.2 How Apache Iceberg manages metadata

1.2.3 Key features of Apache Iceberg

1.2.4 Apache Iceberg as an open-source standard

1.3 The benefits of Apache Iceberg

1.3.1 ACID transactions

1.3.2 Table evolution

1.3.3 Time travel & snapshot-based queries

1.3.4 Hidden partitioning for reduced accidental full-table scans

1.3.5 Cost efficiency & optimized query performance

1.4 The components of an Apache Iceberg lakehouse

1.4.1 The storage layer: The foundation of your lakehouse

1.4.2 The ingestion layer: Feeding data into Iceberg tables

1.4.3 The catalog layer: The entry point to your lakehouse

1.4.4 The federation layer: Modeling & accelerating data

1.4.5 The consumption layer: Delivering value to the business

1.5 Summary

2 Hands-on with Apache Iceberg

2.1 Setting up an Apache Iceberg environment

2.1.1 Prerequisites: Install Docker

2.1.2 Creating the Docker compose file

2.1.3 Running the environment

2.1.4 Accessing the services

2.2 Creating Iceberg tables in Spark

2.2.1 Populating the PostgreSQL database

2.2.2 Starting the Apache Spark environment

2.2.3 Configuring Apache Spark for Iceberg

2.2.4 Loading data from PostgreSQL into Iceberg

2.2.5 Verifying data storage in MinIO

2.3 Reading Iceberg tables in Dremio

2.3.1 Starting Dremio

2.3.2 Connecting Dremio to the Nessie Catalog

2.3.3 Querying Iceberg tables in Dremio

2.4 Creating a BI dashboard from your Iceberg tables

2.4.1 Starting Apache Superset

2.4.2 Connecting Superset to Dremio

2.4.3 Creating a dataset from Iceberg tables

2.4.4 Building charts and dashboards

2.5 Summary

PART 2: DESIGNING YOUR ICEBERG ARCHITECTURE

3 Preparing for your move to Apache Iceberg

3.1 Conducting your data platform audit

3.1.1 Who are the stakeholders?

3.1.2 What should you ask stakeholders?

3.1.3 Conducting a technological audit

3.2 Hamerliwa Bank’s audit in action

3.2.1 Hamerliwa Bank interviews their stakeholders

3.2.2 Hamerliwa Bank audits its technology

3.2.3 Hamerliwa Bank summarizes its audit findings

3.3 From audit to requirements: Laying the foundation for design

3.3.1 Defining storage requirements

3.3.2 Defining ingestion requirements

3.3.3 Defining catalog requirements

3.3.4 Defining federation requirements

3.3.5 Defining consumption requirements

3.3.6 Hamerliwa Bank establishes its requirements

3.4 Architectural plan and road show

3.4.1 Hamerliwa Bank creates its architectural plan

3.4.2 Hamerliwa Bank conducts a road show

3.5 Summary

4 Selecting the storage layer

4.1 Storage requirements

4.1.1 File retrieval performance requirements

4.1.2 Security requirements

4.1.3 Integrity requirements

4.1.4 Cost and operational overhead requirements

4.2 Block vs object

4.2.1 Block storage

4.2.2 Object storage

4.3 The standards in the storage layer

4.3.1 Apache Parquet

4.3.2 The S3 API

4.4 Storage solutions

4.4.1 Vendor Comparison Summary

4.4.2 Hadoop (HDFS)

4.4.3 Amazon S3

4.4.4 Google Cloud Storage

4.4.5 Azure Blob Storage and ADLS

4.4.6 MinIO

4.4.7 Ceph

4.4.8 NetApp StorageGRID

4.4.9 Pure Storage

4.4.10 Dell ECS

4.4.11 Wasabi

4.5 Selecting based on requirements

4.5.1 Performance requirements

4.5.2 Security requirements

4.5.3 Integrity requirements

4.5.4 Cost and operational requirements

4.6 Summary

5 Architecting the ingestion layer

5.1 Ingestion requirements

5.1.1 Ingestion throughput and latency

5.1.2 Reliability and fault tolerance

5.1.3 Schema management and evolution

5.1.4 Operational complexity and maintainability

5.2 Ingestion models and architectures

5.2.1 Batch ingestion

5.2.2 Micro-batch and incremental ingestion

5.2.3 Streaming ingestion

5.3 How Iceberg manages writes

5.3.1 Write semantics in Iceberg

5.3.2 Commit protocols and conflict handling

5.4 Tools and frameworks for ingestion

5.4.1 Apache Spark

5.4.2 Apache Flink

5.4.3 Apache NiFi

5.4.4 Fivetran

5.4.5 Qlik

5.4.6 Airbyte

5.4.7 Confluent

5.4.8 Redpanda

5.4.9 Cloud-native ingestion services

5.4.10 Tool selection considerations

5.5 Applying ingestion requirements in context

5.5.1 Prioritizing low latency

5.5.2 Managing high throughput

5.5.3 Supporting complex transformations

5.5.4 Handling schema evolution

5.5.5 Balancing operational overhead

5.5.6 Considering existing cloud environments

5.6 Summary

6 Implementing the catalog layer

6.1 The role of the catalog in Apache Iceberg lakehouses

6.1.1 Responsibilities of the catalog

6.1.2 Catalog interactions with query and processing engines

6.2 Evaluating catalog requirements

6.2.1 Performance, availability, and scale

6.2.2 Metadata governance and lineage

6.2.3 Security and compliance

6.2.4 Deployment flexibility and ecosystem compatibility

6.2.5 Cost and operational overhead

6.2.6 Catalog federation and mesh architectures

6.3 Apache Iceberg REST Catalog Spec

6.3.1 Before the Apache Iceberg REST spec

6.3.2 The solution

6.4 Catalog options: Exploring the ecosystem

6.4.1 Hadoop Catalog

6.4.2 Hive Catalog

6.4.3 JDBC Catalog

6.4.4 Apache Polaris

6.4.5 Project Nessie

6.4.6 Apache Gravitino

6.4.7 Lakekeeper

6.4.8 AWS Glue Data Catalog

6.4.9 Dremio Catalog

6.4.10 Snowflake Open Catalog

6.4.11 Databricks Unity Catalog

6.5 Choosing the right catalog: Evaluating options through scenarios

6.5.1 Scenario: A mid-sized data team migrating from Hive

6.5.2 Scenario: A rapidly scaling cloud-native startup

6.5.3 Scenario: A multinational enterprise with strict data governance

6.5.4 Scenario: SaaS startup prioritizing operational simplicity

6.5.5 Scenario: A large enterprise with multi-cloud and federated governance needs

6.5.6 Scenario: Financial firm requiring daily environment cloning for stress testing

6.5.7 Scenario: Phased Iceberg migration with query federation across legacy systems

6.5.8 Scenario: Lightweight lakehouse adoption with Hadoop catalog and Python

6.6 Summary

7 Designing the federation layer

7.1 What data federation is and why it matters

7.1.1 Common use cases and challenges driving federation needs

7.1.2 How federation aligns with agility and accessibility

7.2 Key requirements for federation

7.2.1 Supporting diverse data sources without duplication

7.2.2 Ensuring consistent semantics and business logic

7.2.3 Providing seamless connectivity for analytics tools

7.2.4 Introducing Dremio and Trino

7.3 Dremio

7.3.1 Dremio architecture

7.3.2 Dremio’s connector ecosystem and Iceberg-centric focus

7.3.3 Dremio’s performance enhancements

7.4 Trino

7.4.1 Modular architecture for wide-source support

7.4.2 Flexibility and configurability for complex environments

7.4.3 Community-led evolution and vendor extensions

7.4.4 Semantic layer considerations in Trino

7.5 Deployment models

7.5.1 Deployment with Dremio

7.5.2 Deployment with Trino

7.6 Federation platform decision scenarios

7.6.1 Fragmented multi-source environment: Trino for connector breadth

7.6.2 Building a native Iceberg lakehouse: Dremio for Iceberg-native features

7.6.3 Empowering business users with UI and governed datasets: Dremio

7.6.4 Lightweight querying of Hudi datasets: Trino via AWS Athena

7.6.5 On-prem Cloudera modernization: Trino replacing Impala for performance

7.6.6 Hybrid cloud Iceberg strategy: Dremio bridging on-prem and ADLS

7.7 Federation Alternatives

7.7.1 Virtualization via shortcuts in OneLake

7.7.2 AI-native data virtualization with Spice.ai

7.7.3 Choosing the right fit

7.8 Summary

8 Understanding the consumption layer

8.1 Revisiting the benefits of the lakehouse for consumption

8.2 Connecting the lakehouse to the people

8.3 Revisiting requirements from our audit

8.3.1 Interpreting requirements for consumption

8.3.2 Requirements for BI tools

8.3.3 Requirements for interactive notebook environments

8.3.4 Requirements for AI and specialized data consumption tools

8.4 Open interfaces for seamless consumption

8.4.1 JDBC and ODBC

8.4.2 Arrow Flight

8.4.3 Model Context Protocol (MCP)

8.5 Business intelligence tools in the lakehouse

8.5.1 Open source BI tools

8.5.2 Commercial BI tools

8.6 Tools for AI and machine learning workloads

8.7 Choosing the right consumption tools: Ten illustrated scenarios

8.7.1 Startup with a data science focus

8.7.2 Large financial institution with strict governance

8.7.3 Mid-sized e-commerce platform building embedded analytics

8.7.4 Decentralized media organization enabling self-service analytics

8.7.5 Government agency balancing public transparency and internal control

8.7.6 Healthcare provider with compliance and data locality constraints

8.7.7 Logistics company unifying real-time operations and historical analysis

8.7.8 SaaS company offering customizable data access to clients

8.7.9 Nonprofit organization supporting collaborative research

8.7.10 Manufacturing company enabling predictive maintenance

8.8 Summary

PART 3: OPERATING YOUR APACHE ICEBERG LAKEHOUSE

9 Maintaining an Iceberg lakehouse

9.1 Problem: Suboptimal data files

9.1.1 Small files

9.1.2 Poorly colocated data

9.1.3 Metadata sprawl

9.1.4 Merge-on-read (MOR) performance hits

9.2 Solution: Compaction

9.2.1 What is compaction?

9.2.2 Target file size

9.2.3 Files to be included

9.2.4 Using filters to scope compaction

9.3 Storage footprint management and data retention

9.3.1 Running snapshot expiration

9.3.2 COW vs MOR: Implications for data retention

9.3.3 Regulatory considerations for data deletion

9.4 Exploring Apache Iceberg’s metadata tables

9.5 Access controls in an Iceberg lakehouse

9.5.1 Storage layer controls

9.5.2 Catalog-level controls

9.5.3 Engine-level access controls

9.6 Summary

10 Operationalizing Apache Iceberg

10.1 Orchestrating the lakehouse

10.1.1 Choosing orchestration tools and patterns

10.1.2 Metadata-driven triggers for proactive maintenance

10.1.3 Per-table maintenance policies

10.1.4 Monitoring and alerting integration

10.1.5 Putting orchestration into practice

10.2 Auditing the lakehouse

10.2.1 Leveraging snapshot history for change tracking

10.2.2 Using branching and tagging for governance

10.2.3 Implementing file and snapshot retention policies

10.2.4 Practical retention policy orchestration

10.2.5 Secure data deletion

10.2.6 Access auditing and governance

10.2.7 Practical auditing with Iceberg: Example workflows

10.3 Disaster recovery in the lakehouse

10.3.1 The role of the metadata catalog in disaster recovery

10.3.2 Protecting against data loss and corruption

10.3.3 Cross-region and multi-environment recovery

10.3.4 Rollback and time travel in incident response

10.3.5 Automating disaster recovery procedures

10.3.6 Validating recovery readiness

10.3.7 Disaster recovery through automation

10.3.8 Practical examples: Automating recovery workflows

10.4 Summary

Appendixes

Appendix A: The metadata tables

A.1 Querying Iceberg metadata tables

A.2 The history metadata table

A.3 The snapshots metadata table

A.4 The metadata_log_entries metadata table

A.5 The manifests metadata table

A.6 The partitions metadata table

A.7 The files metadata table

A.8 The manifests metadata table

A.9 The partitions metadata table

A.10 The position_deletes metadata table

A.11 The all_data_files metadata table

A.12 The all_delete_files metadata table

A.13 The all_entries metadata table

A.14 The all_manifests metadata table

A.15 The refs metadata table

A.16 Monitoring table health with metadata tables

Appendix B: Python for Apache Iceberg

B.1 PyIceberg

B.2 Polars

B.3 DuckDB

B.4 Daft

B.5 Dremio

B.6 Bauplan

B.7 SpiceAI

B.8 Summary and best practices

Appendix C: The Apache Iceberg specification

C.1 Understanding the Iceberg specification

C.1.1 What is a table format specification?

C.1.2 Why Iceberg formalizes table behavior

C.1.3 Evolution of the spec: versioning principles and compatibility

C.2 Iceberg table format versions

C.2.1 Version 1: Foundation for analytical tables

C.2.2 Version 2: Row-level deletes and stricter writes

C.2.3 Version 3: Extended types and advanced capabilities

C.2.4 Version 4: Performance, portability, and real-time readiness

C.3 Snapshot management and table metadata

C.3.1 Table metadata files

C.3.2 Snapshots and the manifest list

C.3.3 Sequence numbers and optimistic concurrency

C.4 The REST Catalog specification

C.4.1 Overview and purpose

C.4.2 Catalog configuration and default endpoints

C.4.3 Namespaces, tables, and views

C.4.4 Table registration, metrics, and transactions

C.4.5 OAuth2 support and security considerations

C.4.6 The scan planning endpoint

C.5 Puffin file format specification

C.5.1 What is a Puffin file?

C.5.2 Storing column-level metrics and custom indexes

C.5.3 Integration with Iceberg table metadata

C.6 Compatibility and migration

C.6.1 Reading and writing across format versions

C.6.2 Upgrading tables to newer spec versions

C.6.3 Handling backward compatibility in practice

Overview

1 The world of the Apache Iceberg Lakehouse

Modern data architecture has evolved through waves of OLTP systems, enterprise and cloud data warehouses, and Hadoop-era data lakes, each trying to balance performance, cost, governance, and flexibility. Warehouses delivered fast analytics but were costly and rigid, while lakes offered cheap, scalable storage but struggled with consistency, schema management, and query performance—often devolving into “data swamps.” The lakehouse emerged to merge these strengths: warehouse-like reliability and performance with the openness, interoperability, and cost-efficiency of data lakes. This chapter frames that evolution and explains why the lakehouse paradigm has become the preferred approach for analytics and AI-ready platforms.

Apache Iceberg is presented as the key enabler of the lakehouse: an open, vendor-agnostic table format that turns files in object storage into reliable, high-performance analytical tables. It introduces a layered metadata model (table metadata, manifest lists, and manifests) that enables pruning and fast planning, supports ACID transactions for safe multi-writer operations, and delivers seamless schema and partition evolution, time travel, and hidden partitioning to prevent accidental full scans. By standardizing how datasets are stored and discovered, Iceberg lets multiple engines work on a single canonical copy of data, reducing ETL and duplication while improving governance and consistency. Compared with alternatives, Iceberg stands out for its flexible partitioning features, broad and growing ecosystem integrations, and community-led governance.
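To make the hidden-partitioning idea above concrete, here is a minimal sketch in plain Python. It is a toy model, not the Iceberg API: the file names, the `day_transform` helper, and the `plan_scan` function are all hypothetical. It assumes the table is partitioned by a day transform on an event timestamp, so a query that filters on the raw timestamp can still be narrowed to the matching daily partitions without the user ever naming a partition column.

```python
from datetime import date, datetime

# Toy model of hidden partitioning; this is NOT the Iceberg API.
# Each (hypothetical) data file is keyed by the partition value that a
# day(event_ts) transform produced when the file was written.
FILES = {
    date(2025, 1, 1): "data/f1.parquet",
    date(2025, 1, 2): "data/f2.parquet",
    date(2025, 1, 3): "data/f3.parquet",
}


def day_transform(ts: datetime) -> date:
    """Partition transform: map a timestamp to its calendar day."""
    return ts.date()


def plan_scan(ts_from: datetime, ts_to: datetime) -> list[str]:
    """Return the files whose day partition overlaps an inclusive
    filter on the raw timestamp column."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [path for day, path in sorted(FILES.items()) if lo <= day <= hi]


# The query filters on the timestamp itself; the transform is applied
# behind the scenes, so only one of the three files is selected.
selected = plan_scan(datetime(2025, 1, 2, 9, 0), datetime(2025, 1, 2, 17, 0))
print(selected)
```

The point of the sketch is that the partition value is derived from the column the user already filters on, which is why a missing partition predicate does not silently turn into a full-table scan.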

The chapter also outlines the modular components of an Iceberg lakehouse—storage, ingestion, catalog, federation, and consumption—and how this design allows independent scaling, better cost control, and reduced vendor lock-in. Storage keeps data and metadata in durable, low-cost object stores; ingestion supports batch and streaming into Iceberg tables; catalogs provide the entry point and governance; federation unifies and accelerates access across sources; and the consumption layer powers BI, AI, and applications. Organizations adopt Iceberg lakehouses to consolidate data into an open, interoperable platform that delivers performance and governance without sacrificing flexibility. While successful implementations require thoughtful integration with engines and catalogs, the result is a scalable, cost-efficient, and AI-ready architecture built on a single source of truth.

The evolution of data platforms from on-prem warehouses to data lakehouses.

The role of the table format in data lakehouses.

The anatomy of a lakehouse table, metadata files, and data files.

The structure and flow of an Apache Iceberg table read and write operation.

Engines use metadata statistics to eliminate data files from being scanned for faster queries.

Engines can scan older snapshots, which will provide a different list of files to scan, enabling scanning older versions of the data.

The components of a complete data lakehouse implementation

Summary

Data lakehouse architecture combines the scalability and cost-efficiency of data lakes with the performance, ease of use, and structure of data warehouses, solving key challenges in governance, query performance, and cost management.
Apache Iceberg is a modern table format that enables high-performance analytics, schema evolution, ACID transactions, and metadata scalability. It transforms data lakes into structured, mutable, governed storage platforms.
Iceberg eliminates significant pain points of OLTP databases, enterprise data warehouses, and Hadoop-based data lakes, including high costs, rigid schemas, slow queries, and inconsistent data governance.
With features like time travel, partition evolution, and hidden partitioning, Iceberg reduces storage costs, simplifies ETL, and optimizes compute resources, making data analytics more efficient.
Iceberg integrates with query engines (Trino, Dremio, Snowflake), processing frameworks (Spark, Flink), and open lakehouse catalogs (Nessie, Polaris, Gravitino), enabling modular, vendor-agnostic architectures.
The Apache Iceberg Lakehouse has five key components: storage, ingestion, catalog, federation, and consumption.

FAQ

What is a data lakehouse and how does it differ from data lakes and data warehouses?

A data lakehouse combines the cost-efficiency and openness of data lakes with the performance, governance, and reliability of data warehouses. Using open table formats like Apache Iceberg, it lets multiple engines query a single, governed copy of data with warehouse-like capabilities (ACID, schema evolution, indexing-like metadata) without vendor lock-in.

How does Apache Iceberg enable the lakehouse paradigm?

Apache Iceberg is an open table format that makes datasets stored as files behave like managed database tables. It adds a robust metadata layer, ACID transactions, and advanced partitioning so multiple tools can read and write the same datasets reliably and efficiently, turning a data lake into a high-performance analytical platform.

How is Iceberg different from using raw Parquet files in a data lake?

Raw Parquet lacks table-level guarantees and governance: no built-in ACID, versioning, or consistent schema/partition tracking. Iceberg wraps files with multi-layered metadata, enabling fast pruning, schema and partition evolution, time travel, and safe concurrent writes—so engines see a consistent, queryable table instead of a loose collection of files.

What architectural problems does the lakehouse address?

It reduces data duplication and costly ETL by enabling analytics directly on lake-stored data, avoids proprietary lock-in by using open formats, improves performance over traditional Hadoop-era lakes via metadata-driven pruning, and lowers storage/compute costs compared to monolithic warehouses while preserving governance.

When should I implement an Apache Iceberg lakehouse?

Consider Iceberg when you need multi-engine access to the same data, strong ACID guarantees on the lake, evolving schemas/partitions without rewrites, time travel for audit/compliance, and to cut costs from duplicative warehouses, marts, and replication-heavy ETL pipelines.

How does Iceberg manage metadata, and why is it fast?

Iceberg organizes metadata in three layers: table metadata (metadata.json), manifest lists (per-snapshot summaries), and manifests (file-level stats). This structure lets engines prune entire partitions and files before scanning, speeds up planning, and supports snapshot isolation for reliable reads and writes at scale.
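The three layers described above can be sketched as a small in-memory model. This is an illustrative toy in plain Python, not Iceberg's actual file layout or API: the class names, the single `id` column, and the `plan_files` function are assumptions made for the example. It shows how min/max statistics let a planner skip whole manifests and individual files, and how choosing an older snapshot yields a different file list.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    min_id: int  # lower bound of the (hypothetical) id column in this file
    max_id: int  # upper bound of the id column in this file


@dataclass
class Manifest:
    files: list[DataFile]

    def bounds(self) -> tuple[int, int]:
        """Summary range covering every file tracked by this manifest."""
        return (min(f.min_id for f in self.files),
                max(f.max_id for f in self.files))


@dataclass
class Snapshot:
    snapshot_id: int
    manifests: list[Manifest]  # stands in for the manifest list


@dataclass
class TableMetadata:
    snapshots: dict[int, Snapshot]
    current_snapshot_id: int


def plan_files(table, lo, hi, snapshot_id=None):
    """Return paths of files that may hold rows with lo <= id <= hi."""
    sid = table.current_snapshot_id if snapshot_id is None else snapshot_id
    selected = []
    for manifest in table.snapshots[sid].manifests:
        m_lo, m_hi = manifest.bounds()
        if m_hi < lo or m_lo > hi:
            continue  # prune the whole manifest without opening it
        for f in manifest.files:
            if f.max_id >= lo and f.min_id <= hi:
                selected.append(f.path)  # file-level stats overlap
    return selected


m1 = Manifest([DataFile("a.parquet", 1, 100), DataFile("b.parquet", 101, 200)])
m2 = Manifest([DataFile("c.parquet", 201, 300)])
table = TableMetadata(
    snapshots={1: Snapshot(1, [m1]), 2: Snapshot(2, [m1, m2])},
    current_snapshot_id=2,
)

print(plan_files(table, 150, 250))                 # current snapshot
print(plan_files(table, 150, 250, snapshot_id=1))  # older snapshot
```

Because every snapshot carries its own list of manifests, reading an older snapshot is just a matter of starting the same planning walk from a different entry in the table metadata.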

What are the key features of Apache Iceberg?

- ACID transactions for reliable concurrent reads/writes
- Schema evolution without full rewrites
- Partition evolution and hidden partitioning for simpler, faster queries
- Time travel and snapshot-based queries for audit and recovery
- Open, vendor-agnostic standard with broad engine/catalog support

How do ACID transactions in Iceberg improve reliability?

ACID ensures changes are atomic, consistent, isolated, and durable. Readers see stable snapshots while writers commit safely, preventing partial writes, conflicts, and corruption when multiple jobs or tools modify the same tables.
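One way to picture how readers keep a stable snapshot while writers commit safely is an optimistic, compare-and-swap style commit. The sketch below is a toy in plain Python under stated assumptions: `ToyCatalog`, `CommitConflict`, and `append_with_retry` are hypothetical names, and a lock-protected in-memory tuple stands in for whatever atomic pointer swap a real catalog provides.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer read its base version."""


class ToyCatalog:
    """Tracks an append-only history of immutable snapshots for one table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshots = ((),)  # version 0 is an empty table

    def current(self):
        """Return (version, snapshot); the snapshot tuple never mutates."""
        with self._lock:
            return len(self._snapshots) - 1, self._snapshots[-1]

    def commit(self, base_version, new_files):
        """Atomically publish a new snapshot only if nobody else committed
        since base_version; otherwise reject the whole change."""
        with self._lock:
            if base_version != len(self._snapshots) - 1:
                raise CommitConflict(f"base {base_version} is stale")
            self._snapshots += (self._snapshots[-1] + tuple(new_files),)
            return len(self._snapshots) - 1


def append_with_retry(catalog, new_files, max_attempts=3):
    """Re-read the latest version and retry when a commit conflicts."""
    for _ in range(max_attempts):
        version, _ = catalog.current()
        try:
            return catalog.commit(version, new_files)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after repeated conflicts")


catalog = ToyCatalog()
base, reader_view = catalog.current()    # a reader and writer A both see v0
catalog.commit(base, ["b.parquet"])      # writer B commits first
try:
    catalog.commit(base, ["a.parquet"])  # writer A's base is now stale
except CommitConflict as exc:
    print("conflict:", exc)
print(append_with_retry(catalog, ["a.parquet"]))  # retry on the latest version
print(reader_view, catalog.current()[1])          # the reader's view is unchanged
```

In this model a commit either publishes a complete new snapshot or changes nothing, and a reader that picked up an earlier snapshot keeps seeing exactly that snapshot.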

What are the core components of an Apache Iceberg lakehouse?

- Storage layer: Object storage for data (e.g., Parquet) and Iceberg metadata
- Ingestion layer: Batch/stream pipelines (Spark, Flink, Kafka Connect, etc.)
- Catalog layer: Table discovery, governance, and atomic updates (e.g., AWS Glue, Nessie, Polaris)
- Federation layer: Semantic modeling, data unification, and acceleration (e.g., Dremio, Trino, dbt)
- Consumption layer: BI, AI/ML, apps and APIs consuming the same governed data

How does Iceberg compare to Delta Lake, Apache Hudi, and Apache Paimon?

While features overlap, Iceberg stands out for partition evolution and hidden partitioning, broad multi-vendor ecosystem support, and Apache governance that minimizes vendor lock-in. Hudi shines for streaming-centric use cases, and Delta Lake (now a Linux Foundation project) has strong ties to Databricks. Iceberg’s open, widely adopted standard offers a particularly flexible path to a lakehouse.

