Data Strategy

Data Lakehouse Architecture for Indian SMEs: A Practical Guide

April 10, 2026
13 min read

Indian SMEs are stuck between expensive data warehouses and unmanageable data lakes. The lakehouse pattern offers a middle path — if you build it right. Here is our field-tested approach.

The Indian SME Data Problem

Most Indian SMEs we work with are in one of two painful states.

State 1: Spreadsheet Hell. The business runs on Excel files shared over email and WhatsApp. Reporting means someone spending two days every month manually consolidating numbers from five different systems. The CEO gets a dashboard — it is a PDF attached to an email, and the numbers are three weeks old.

State 2: Expensive Warehouse, Limited Value. Someone sold them a Snowflake or BigQuery implementation. The monthly bill is ₹2-5 lakh, and the only thing running on it is three dashboards that nobody trusts because the data is stale or inconsistent. The data engineering team (one person, usually) spends all their time fixing pipelines instead of building new capabilities.

The lakehouse architecture solves both problems — but only if you design it for the constraints that Indian SMEs actually face: tight budgets, small teams, and a mix of structured and unstructured data.

What Is a Data Lakehouse?

A data lakehouse combines the best properties of data lakes and data warehouses into a single architecture:

  • From data lakes — store all data (structured, semi-structured, unstructured) in open file formats (Parquet, Delta, Iceberg) on cheap object storage (S3, GCS, Azure Blob).
  • From data warehouses — ACID transactions, schema enforcement, SQL query performance, and time travel (versioned data).

The key enabling technologies are table formats: Delta Lake, Apache Iceberg, and Apache Hudi. These formats add a metadata layer on top of Parquet files that enables warehouse-like capabilities without a warehouse's price tag.

Why this matters for Indian SMEs:

  • Cost — object storage (S3/GCS) costs ₹1.5-2 per GB/month. A comparable amount of data in Snowflake costs ₹15-25 per GB/month. For a 1TB dataset, that is the difference between ₹2,000/month and ₹20,000/month in storage alone.
  • Flexibility — you are not locked into a vendor's SQL engine. Query the same data with Spark, Trino, DuckDB, or your warehouse of choice.
  • Future-proof — open formats mean your data is portable. If a better query engine emerges next year, you adopt it without migrating data.

Reference Architecture for Indian SMEs

Key Design Decisions

1. Object storage as the foundation

Everything lands in S3/GCS/Azure Blob in Delta Lake or Iceberg format. This is your single source of truth. Compute is separate and ephemeral — you pay for it only when queries run.

For Indian SMEs on AWS, we typically use S3 in ap-south-1 (Mumbai region). Data residency stays in India, latency is low, and costs are minimal.

2. Airbyte for ingestion

Airbyte is open-source and has connectors for the systems Indian SMEs actually use: Tally (via API), Zoho Books, Zoho CRM, MySQL, PostgreSQL, Google Sheets, and REST APIs. Self-hosted Airbyte on a ₹3,000/month VM handles most SME ingestion needs.

For clients who prefer managed services, Fivetran is the premium option — but its usage-based pricing per monthly active row (MAR) adds up quickly for price-sensitive SMEs.

3. dbt for transformations

dbt is the transformation layer. Write SQL, dbt compiles and runs it against your lakehouse. This is where business logic lives: revenue calculations, customer segmentation, inventory metrics, GST reconciliation.

We covered dbt in depth in our dbt consulting guide. For the lakehouse context, the key point is: dbt works with DuckDB, Trino, Spark, and most warehouse engines — so it fits naturally on top of a lakehouse.

4. DuckDB for ad-hoc queries

DuckDB is a revelation for SMEs. It is an in-process analytical database that reads Parquet and Delta files directly. No server, no cluster, no cost. Your analyst runs a SQL query on their laptop against terabytes of data in S3.

For datasets under 100GB (which covers 90% of Indian SMEs), DuckDB eliminates the need for a separate query engine entirely.

5. Metabase for dashboards

Metabase is open-source and self-hostable, and non-technical users can build their own dashboards. Connect it to DuckDB or Trino and you have a complete BI layer for ₹0 in software licensing.

The Three-Layer Data Model

We structure every lakehouse with three layers, following the medallion architecture:

Bronze Layer (Raw)

  • Exact copy of source data, no transformations
  • Append-only with ingestion timestamps
  • Schema-on-read — store whatever the source sends
  • Retention: indefinite (storage is cheap)

Silver Layer (Cleaned)

  • Deduplicated, typed, and validated
  • Business keys resolved (customer IDs matched across systems)
  • Soft deletes applied, audit fields added
  • This is where data quality checks run

Gold Layer (Business)

  • Business-ready aggregations and metrics
  • Dimensional models (star/snowflake schema) for BI tools
  • Pre-computed KPIs: monthly revenue, customer LTV, inventory turnover
  • Optimised for query performance (partitioned, sorted, compacted)

Why three layers? Debugging. When a dashboard number looks wrong, you trace from Gold → Silver → Bronze to find exactly where the data diverged. Without this structure, debugging data issues is guesswork.
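The three layers can be sketched in a few lines of pandas (column names and data are illustrative, not a fixed schema — in production these steps would live in dbt models):

```python
import pandas as pd

# Bronze: exact, append-only copy of the source, with an ingestion timestamp.
bronze = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", None],
    "amount": ["1200", "1200", "450.50", "99"],  # strings, as the source sent them
    "_ingested_at": pd.to_datetime(
        ["2026-04-01", "2026-04-02", "2026-04-02", "2026-04-03"]),
})

# Silver: validated, typed, and deduplicated — keep the latest copy of each record.
silver = (
    bronze.dropna(subset=["customer_id"])                       # basic validation
          .assign(amount=lambda d: d["amount"].astype(float))   # enforce types
          .sort_values("_ingested_at")
          .drop_duplicates(subset=["customer_id", "amount"], keep="last")
)

# Gold: business-ready aggregate — revenue per customer, ready for a dashboard.
gold = silver.groupby("customer_id", as_index=False)["amount"].sum()
print(gold)
```

Because each layer is materialised separately, a wrong number in `gold` can be traced back through `silver` to the raw row in `bronze` that caused it.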

Cost Breakdown: Real Numbers

Here is what a typical Indian SME lakehouse costs per month:

| Component | Self-hosted | Managed |
| --- | --- | --- |
| Object storage (500GB) | ₹1,000 | ₹1,000 |
| Ingestion | ₹3,000 (Airbyte on a VM) | ₹15,000 (Fivetran) |
| Transformations | ₹0 (dbt Core, free) | ₹10,000 (dbt Cloud) |
| DuckDB | ₹0 | ₹0 |
| Dashboards | ₹2,000 (Metabase on a VM) | ₹8,000 (Metabase Cloud) |
| Orchestrator (Dagster/Prefect) | ₹3,000 (VM) | ₹5,000 (cloud tier) |
| Total | ₹9,000 | ₹39,000 |

Compare this to a managed Snowflake setup which typically runs ₹50,000-2,00,000/month for equivalent workloads. The lakehouse approach is 5-20x cheaper.

The catch: self-hosted requires a data engineer who can manage infrastructure. If you do not have that person, the managed stack at ₹39,000/month is still dramatically cheaper than a warehouse — and you are not locked in.

Common Mistakes We See

1. Starting with the technology instead of the questions

"We need a data lakehouse" is not a business requirement. "We need to know our actual gross margin by product line, updated daily" is. Start with the five business questions your CEO wants answered, then design the lakehouse to answer them.

2. Boiling the ocean on ingestion

Do not connect every system on day one. Start with the two or three sources that answer your priority questions. Get those pipelines stable, dashboards trusted, and then expand. We have seen too many projects collapse under the weight of 15 simultaneous connector integrations.

3. Skipping data quality

Bronze → Silver transformation must include data quality checks. Use dbt tests or Great Expectations to validate row counts, null rates, referential integrity, and business rules. A dashboard built on unchecked data is worse than no dashboard — it gives false confidence.
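In dbt these would be `not_null`, `relationships`, and custom tests; the same four categories of check can be sketched in plain Python (table and column names are invented for illustration):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer_id": ["C1", "C2", "C1"],
                       "amount_inr": [1200.0, 450.0, 980.0]})
customers = pd.DataFrame({"customer_id": ["C1", "C2"]})

checks = {
    # Row count: the load actually produced data.
    "not_empty": len(orders) > 0,
    # Null rate: a key column must be fully populated.
    "customer_id_not_null": orders["customer_id"].notna().all(),
    # Referential integrity: every order points at a known customer.
    "orders_fk_customers": orders["customer_id"].isin(customers["customer_id"]).all(),
    # Business rule: no negative invoice amounts.
    "amount_non_negative": (orders["amount_inr"] >= 0).all(),
}

failed = [name for name, ok in checks.items() if not ok]
print(failed)  # [] — all checks pass on this sample
```

Running checks like these at the Bronze → Silver boundary means bad data is quarantined before it ever reaches a dashboard.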

4. Over-engineering for scale you do not have

If your total data is under 50GB, you do not need Spark, Kubernetes, or a distributed query engine. DuckDB on a single machine handles this effortlessly. Design for 10x your current scale, not 1000x. You can always add distributed compute later — that is the beauty of open table formats.

5. Ignoring data governance from day one

Indian businesses face increasing regulatory requirements — GST data retention, RBI guidelines for financial data, upcoming DPDPA (Digital Personal Data Protection Act) compliance. Build access controls, data cataloguing, and retention policies into the lakehouse from the start, not as an afterthought. We cover this in detail in our data governance guide.

Implementation Timeline

For a typical Indian SME, we deliver a production lakehouse in 8-10 weeks:

| Week | Milestone |
| --- | --- |
| 1-2 | Discovery: identify priority questions, audit source systems, design architecture |
| 3-4 | Foundation: set up object storage, ingestion (2-3 priority sources), Bronze layer |
| 5-6 | Transformations: Silver and Gold layers in dbt, data quality checks |
| 7-8 | Serving: dashboards in Metabase, ad-hoc query access via DuckDB |
| 9-10 | Hardening: monitoring, alerting, documentation, team training |

Our Data Strategy & Architecture and Data Engineering & Pipeline Development services cover this end-to-end. We design the architecture, build the pipelines, and train your team to own it.

When a Lakehouse Is NOT the Answer

  • Under 10GB of total data — a well-designed PostgreSQL database with dbt and Metabase is simpler and sufficient.
  • Single-source analytics — if all your data lives in one system (e.g., Zoho Analytics for a Zoho-only shop), the built-in analytics may be enough.
  • Real-time requirements — if you need sub-second query latency on streaming data, a lakehouse alone is not enough. You need a streaming layer (Kafka + Flink) feeding into the lakehouse. We cover this in our real-time pipelines guide.
  • No technical ownership — a lakehouse needs at least one person who can write SQL and manage pipelines. If that person does not exist and you cannot hire one, consider a fully managed solution like Hevo Data or Fivetran + Snowflake.

Ready to evaluate whether a lakehouse fits your business? Book a free architecture review — we will assess your data sources, volumes, and business questions, and give you an honest recommendation.

Frequently Asked Questions

Delta Lake or Iceberg — which should we choose?

For Indian SMEs, we default to Delta Lake. It has broader tooling support (Spark, Databricks, DuckDB, Trino all read Delta natively), better documentation, and a larger community. Iceberg is technically excellent and gaining momentum in the enterprise space, but for SMEs who need the simplest path to production, Delta wins on ecosystem maturity.

Can we start with a lakehouse and add a warehouse later?

Yes — this is actually the recommended path. Start with a lakehouse (object storage + Delta/Iceberg + DuckDB). If you outgrow DuckDB's performance (typically beyond 500GB-1TB of frequently queried data), add Snowflake or BigQuery as a query engine on top of the same storage. Your data stays in open formats; you are just adding a faster SQL engine.

How does this compare to Databricks?

Databricks is a lakehouse platform — it runs on the same Delta Lake format but adds managed Spark, ML capabilities, and a collaborative notebook environment. For Indian SMEs, Databricks is typically overkill and expensive (minimum ₹50,000-1,00,000/month). We recommend it for teams with 5+ data engineers and ML workloads. For everyone else, the open-source stack we describe here delivers 80% of the value at 10% of the cost.

What about data security and compliance?

Object storage providers (AWS, GCS, Azure) offer encryption at rest and in transit, IAM-based access control, and audit logging out of the box. For DPDPA compliance, we implement column-level encryption for PII fields, access logging, and data retention policies. The lakehouse architecture actually makes compliance easier than scattered databases because all data flows through a single, auditable pipeline.
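As a minimal sketch of column-level encryption, here is the symmetric Fernet scheme from the `cryptography` package applied to a single PII value. Key management is deliberately elided — in production the key would come from a KMS or secrets manager, never from code — and the phone number is a placeholder:

```python
from cryptography.fernet import Fernet

# In production this key lives in a KMS / secrets manager, not in code.
key = Fernet.generate_key()
f = Fernet(key)

# Encrypt a PII column value before it is written to the Silver/Gold layers...
phone = "+91-9800000001"  # placeholder value
token = f.encrypt(phone.encode())

# ...and decrypt it only in access-controlled contexts.
assert f.decrypt(token).decode() == phone
print(token[:16])  # opaque ciphertext, safe to store next to non-PII columns
```

Analysts querying the lakehouse see only the ciphertext unless they are granted access to the decryption path, which keeps PII exposure auditable.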

Not Sure Where to Start?

Book a free 30-minute strategy session with a senior data architect — no pitch, no obligation.

Schedule Your Free Strategy Session

Typically responds within 1 business day · Available for India, US, UK & Canada