← All work/Analysis · Tools/Olist Warehouse

● Analysis· Tools·2026Demo-ready

Olist Warehouse

Modeling 100,000 Brazilian e-commerce orders in Snowflake. Star-schema design, dbt transformations, and three analyst questions answered straight from the marts — with the charts to back them.

View on GitHub ↗Jump to writeup ↓

Featured

Role

Analytics Engineer

Timeline

2026

Status

Demo-ready

The brief

Olist is the largest department-store marketplace in Brazil. In 2018 they released a public anonymized dataset of ~100,000 orders across two years — 8 normalized tables covering customers, orders, line items, sellers, products, payments, reviews, and geolocation.

The dataset is a near-perfect proxy for a real e-commerce analytics workload. Multi-table, partially denormalized, with timestamps that need standardising and free-text columns that need parsing. Exactly the shape that needs a warehouse and a star schema before the business team can ask questions of it.

So I built that warehouse, and then I asked it three questions.

The pipeline

Pipeline

From Kaggle CSVs to a portfolio chart, in five hops

The dotted edge between marts and the portfolio is the seam — live charts on a static-export portfolio can't query Snowflake, so the build step snapshots the aggregates as JSON. Live warehouse, static delivery, no embed dependencies.

The whole thing runs on a Snowflake trial account, costs roughly $4 in credits to build from cold, and refreshes incrementally if you want it to. The transformation logic lives in dbt — staging views that rename and cast, intermediate views that join, marts tables that the analyst layer queries directly. Tests run on every build; sources have freshness checks.

The schema

Schema

fact_orders at the order-line grain, with six conformed dimensions

dim_date is role-played three ways (purchase, delivered, estimated). dim_geography is conformed across customer and seller. dim_customer is SCD Type 2 so historic orders join to the customer state that was current at purchase time — see docs/star_schema.md for why.

SOURCE · marts/* dbt models · snowflake/ddl/*.sql

The grain of fact_orders is one row per order × line item. This was the most decision-heavy part of the project. Two grains that seemed reasonable but were wrong:

Order grain would have averaged multi-seller orders into a single row, destroying the per-seller margin story.
Payment grain would have inflated revenue when joined to product/seller, because a single order line can have multiple payment methods stretched across installments.

Line-item grain keeps revenue additive across every join, and it preserves the seller-attribution thread that runs through every interesting question in the dataset. The full grain-and-SCD justification (including why customer is SCD2 but seller is SCD1) is in docs/star_schema.md in the repo.

Three questions, three queries, three charts

The point of building a warehouse is to be able to answer questions that the raw tables can't. Three I picked from this dataset:

Q1 · Where is the revenue actually coming from?

Q1 · Where is the revenue actually coming from?

Gross revenue by Brazilian region · 2016–2018

Sudeste (São Paulo, Rio, Minas Gerais, Espírito Santo) drives 65% of the R$15.8M gross revenue across 67,662 orders. The Norte region — geographically the largest — contributes 2.6%. Any growth strategy that allocates effort evenly by region is spending four-fifths of it against one-fifth of the opportunity.

SOURCE · marts.fact_orders × marts.dim_geography

This is the question that almost everything else hinges on. Any growth investment has to start by accepting that 65% of the R$15.8M in gross revenue comes from one of Brazil's five regions — the Sudeste, the São Paulo–Rio–Minas axis. The Norte (north) — geographically the largest region — contributes 2.6%. A national growth strategy that allocates equal effort by region is allocating four-fifths of effort against one-fifth of the opportunity.

Q2 · How sticky are sellers once they start selling?

Q2 · How sticky are sellers once they start selling?

Seller retention by quarterly cohort · % still selling N months later

Each row is a cohort of sellers who made their first sale in that month; read across for the share still selling N months later. The story is healthier than marketplace churn lore suggests — roughly half of every cohort is still active at month 6, and the Jan and Jul 2017 cohorts hold above 50% for the full window. The risk isn't a cliff; it's the steady ~40% that drift away in the first quarter, which acquisition has to keep replacing.

SOURCE · sql_highlights/01_seller_retention_cohort.sql

This is the chart that corrected an assumption. I went in expecting a churn cliff and the data pushed back — retention is moderate, not catastrophic. Roughly half of every cohort is still active six months after their first sale, and the strongest cohorts (Jan and Jul 2017) hold above 50% across the whole window. The real risk isn't a collapse; it's the steady ~40% who drift off in the first quarter, a leak acquisition has to keep refilling. The honest version of this finding is less dramatic than "sellers are fleeing" — and being able to say that with a cohort table behind it is the point.

The query behind this one (sql_highlights/01_seller_retention_cohort.sql) uses a LAG() window function over per-seller first-order timestamps to define cohorts, then cross-joins cohort × period to materialize the retention matrix in one GROUP BY. No triangular subquery, no per-period self-join. The shape Snowflake makes easy.

Q3 · Is late delivery a logistics problem or a category problem?

Q3 · Is late delivery a logistics problem or a category problem?

Late-delivery rate by category group · Sudeste vs rest of Brazil

Both bars per category show the % of orders delivered after the promised date. Within Sudeste the rate sits near 7% almost regardless of category — but outside it, the rate climbs, and the gap widens most for fashion (5.3% → 9.1%) and health & beauty (7.4% → 11.1%). The lever isn't product-level supplier SLAs; the late-delivery signal tracks geography far more than category. The same dim_geography that drives the revenue split surfaces this in a single join.

SOURCE · marts.fact_orders × marts.dim_product × marts.dim_geography

A naive read says "fashion has a 9% late rate, fix the fashion suppliers." The chart says otherwise. Inside the Sudeste, the late rate sits near 7% almost regardless of category. Outside it, every category climbs — and the gap widens most for fashion (5.3% → 9.1%) and health & beauty (7.4% → 11.1%). The disparity tracks geography more than product type. A category-by-category fix would chase a weaker signal; the freight-network-by-region one is where the variance actually lives.

This is the value of conformed dimensions. The same dim_geography table that powers the regional revenue split also powers the regional delivery split — the cross-cut "wait, the late rate is mostly geographic, not categorical" is a single join away, not a separate analysis project.

What this project actually shows

It's small data — 100k orders fits in a laptop's memory — so the engineering on display isn't performance at scale. What it shows is the shape of warehouse work:

Source modeling. Each of the 9 raw tables becomes a staging view that renames, casts, and applies dedup logic where the source has it (reviews are 1-many per order in the raw; staging collapses to the latest).
Conformance. dim_geography is shared by customer and seller; dim_date is role-played 3× on fact_orders. These choices keep the schema readable and the join story honest.
SCD2 done correctly. dim_customer keys on a surrogate built from (customer_unique_id, valid_from). Fact rows join through the version that was current at order_purchase_timestamp. Customers who moved between orders produce multiple dim rows; orders correctly attribute to the city they were in at the time.
dbt discipline. schema.yml with column tests (unique, not_null, accepted_values, relationships), source freshness checks, a packages.yml with versioned dbt_utils. The tests run on every build; a regression in the upstream CSV would fail the build before it landed in the marts.
Snowflake patterns that matter. Cluster key on fact_orders.purchase_date_key (the dimension every dashboard filters by first), QUALIFY for top-N-per-group (clean), mode() for a categorical aggregation in dim_geography, role-based access with a WAREHOUSE_DEV role that has CREATE on the schemas but no ACCOUNTADMIN.

Stack

| Layer | Tool | |---|---| | Source | Kaggle Olist CSVs (real, public) | | Warehouse | Snowflake Standard (trial — $400 credit lasts months at this scale) | | Ingest | Snowflake CLI + Python connector (PUT + COPY INTO) | | Transform | dbt Core with dbt_utils | | Tests | dbt schema tests + source freshness | | Presentation | Hand-coded SVG charts on this page (no third-party iframe) |

Where this goes next

The honest follow-ups, in priority order:

Incremental materialisation on fact_orders. At 112k rows the full rebuild is sub-second; at 10M it wouldn't be. Strategy: unique_key = (order_id, order_item_id), partition on purchased_at, rebuild only the trailing 90 days.
An accumulating_snapshot for the order lifecycle. Right now the lifecycle (purchase → approve → ship → deliver → review) is burned into columns on fact_orders. An accumulating snapshot would surface the per-stage latency story directly.
A fact_review at the comment grain. Currently reviews are denormalized onto fact_orders. If the analysis turned toward review-text mining (sentiment, topic clustering), that grain change becomes load-bearing.

The repo lives at github.com/nathanaelhub/olist-warehouse. All the SQL, DDL, dbt models, and the three highlight queries are there; the README walks through standing up the trial account and running the loads.