← All work/Analysis · Healthcare/Medicare Cost & Quality

● Analysis· Healthcare·2026Demo-ready

Medicare Cost & Quality

What the same procedure costs across hospitals, how far charges sit above what Medicare actually pays, and whether spending more buys better outcomes — a two-fact star schema on public CMS data.

View on GitHub ↗Jump to writeup ↓

Featured

Role

Analytics Engineer

Timeline

2026

Status

Demo-ready

The brief

Healthcare prices are famously incoherent. The same routine procedure can carry wildly different "charges" at two hospitals across town, and almost nobody pays the charge anyway — Medicare pays its own administered rate. So a fair question for anyone who pays for care (an insurer, an employer, a policy team) is: does paying a hospital more actually buy better outcomes?

That question needs two datasets joined together, which is exactly the kind of work I did as a BI analyst in healthcare — so I built the warehouse to answer it from public CMS data.

The data

Two extracts from the Centers for Medicare & Medicaid Services, both public, no authentication:

Medicare Inpatient Hospitals by Provider and Service — for every hospital and every MS-DRG (procedure group): total discharges, average covered charge (what the hospital bills), average total payment, average Medicare payment. ~150,000 hospital × DRG rows.
Hospital Readmissions Reduction Program (HRRP) — for every hospital: the risk-adjusted 30-day excess-readmission ratio across six conditions.

The two files share one column under two different names: HRRP's Facility ID is the inpatient file's Rndrng_Prvdr_CCN, the CMS Certification Number. That single shared key is the whole project — it's what lets cost and quality be compared at the same hospital.

The pipeline

Pipeline

Two public CMS extracts → Snowflake → a two-fact star

The inpatient charges file resolves from the CMS DCAT catalog (so it never points at a stale release URL); HRRP pages through the Provider Data Catalog API. No credentials — the CMS endpoints are open. Both land in RAW, get cast in STG, and conform into MARTS where the two facts meet through dim_provider.

The fetch step resolves the inpatient file from the CMS DCAT catalog rather than a hardcoded URL (CMS rotates the download link each release year; the catalog lookup survives that). HRRP pages through the Provider Data Catalog API. Both land in Snowflake, get cast and cleaned in dbt staging, and conform into the marts.

The schema

Schema

Two facts at different grains, conformed through dim_provider

fact_inpatient_charges (hospital × DRG) holds the cost side; fact_readmissions (hospital × condition) holds the quality side. They share dim_provider — keyed on the CCN that appears in both CMS files — which is the only reason cost and quality can be compared at the same hospital. A dbt relationships test enforces that join so a future CMS release can't silently break it.

SOURCE · docs/star_schema.md

This is a two-fact model, and that's the deliberate part. The cost data and the quality data live at genuinely different grains — one row per hospital × DRG for charges, one row per hospital × condition for readmissions. Forcing them into a single table would either explode into a meaningless DRG × condition cross-join or throw away the per-DRG charge detail.

So they stay as separate facts at their natural grains, sharing a conformed dim_provider. A dbt relationships test enforces the CCN join as a contract — if a future CMS release ships an HRRP hospital that isn't in the charges file, the build fails loudly instead of silently dropping it from every cost-vs-quality comparison. The full grain/conformance reasoning is in docs/star_schema.md.

Three questions

Q1 · Same procedure, different price

Q1 · Same procedure, different price

Average covered charge for DRG 470 (joint replacement) across 1,212 hospitals

The same routine knee/hip replacement: the cheapest hospital's average charge is about $19k, the median is $80k, and the most expensive tops $384k — a 20× spread from floor to ceiling. Meanwhile the actual Medicare payment varies only ~1.7× across the very same hospitals. The "charge" is a list price almost decoupled from what the care costs or what anyone pays.

SOURCE · sql_highlights/01_price_variation_one_drg.sql

DRG 470 — major joint replacement without complications — is about as standardized as inpatient care gets. And yet across 1,212 hospitals the average charge runs from $19k to $384k — a 20× spread, median $80k — while the actual Medicare payment over the same hospitals varies only ~1.7×. The charge isn't tracking the cost of care; it's a negotiating anchor. The query behind this uses PERCENTILE_CONT window functions to get the distribution shape in a single pass — see sql_highlights/01.

Q2 · The markup gap

Q2 · The markup gap

Hospital charge ÷ Medicare payment, by service line (discharge-weighted)

Every bar is how many times higher the average billed charge is than what Medicare actually pays for that service line. The striking part is the consistency: from nervous-system procedures down to transplants, the markup sits in a tight ~5.6–6.4× band. The list price isn't tracking the complexity or cost of the care — it's a roughly fixed multiple applied across the board. Discharge-weighted, so a high-volume DRG counts more than a rare one — the rollup a real analyst builds, not a naïve average of averages.

SOURCE · sql_highlights/02_markup_by_service_line.sql

Zoom out from one DRG to whole service lines, and the markup — billed charge over Medicare payment — is remarkably uniform: nervous-system and digestive procedures top out at 6.4×, transplants sit at 5.6×, and everything else falls in between. The list price isn't tracking the complexity of the care; it's a roughly fixed multiple applied across the board. The number that matters here is discharge-weighted: a high-volume DRG counts more than a rare one. That weighting (sum of weighted measures over sum of discharges) is the difference between a real rollup and the naïve average-of-averages that quietly lies. sql_highlights/02 shows it.

Q3 · Does spending buy quality?

Q3 · Does spending buy quality?  (the two-fact join)

30-day excess-readmission ratio by hospital Medicare-payment quintile

2,789 hospitals bucketed into five groups by what Medicare pays them per discharge (cost fact), plotted against their average 30-day excess-readmission ratio (quality fact), joined on the shared provider dimension. There's a faint downward tilt — the best-paid quintile (avg $21.9k/discharge) readmits slightly below expectation while the lowest ($8.4k) sits slightly above — but it's barely there: the correlation is r = -0.081, meaning hospital payment explains well under 1% of the variance in readmissions. A ratio of 1.0 is "exactly as expected after risk adjustment." Paying a hospital 2.6× more buys, at most, a rounding error of better outcomes. This is the answer payers keep arriving at.

SOURCE · sql_highlights/03_cost_vs_quality.sql

This is the join the whole schema was built for. Bucket 2,789 hospitals into five groups by what Medicare pays them per discharge (the cost fact), then plot each group's average 30-day excess-readmission ratio (the quality fact), joined through the shared provider dimension. There's a faint downward tilt — the best-paid quintile (avg $21.9k/discharge) lands just below the risk-adjusted expectation, the lowest-paid ($8.4k) just above — but the effect is barely there: the correlation is r = −0.081, meaning hospital payment explains well under 1% of the variance in readmissions.

That near-flatness is the finding. Paying a hospital 2.6× more buys, at most, a rounding error of better outcomes. It's the result payers and policy researchers keep arriving at, and being able to reproduce it from public data with a clean two-fact join — rather than asserting it — is the point of the project.

What this shows

Two-fact conformed-dimension modeling — the part of dimensional modeling that's easy to describe and hard to do right.
Honest aggregation — discharge-weighting non-additive averages, dropping suppressed cells, risk-adjusted ratios used correctly.
SQL that earns its keep — percentile windows, NTILE bucketing, a cross-fact correlation, QUALIFY in the repo's highlights.
dbt discipline — staging/marts layering, a cross-fact relationships test that encodes the CCN contract, accepted_range and accepted_values tests, a schema-name macro for clean MARTS/STG.
Healthcare-domain fluency — DRGs, service lines, the charge/payment/Medicare-payment distinction, HRRP risk adjustment. The domain my résumé already lives in, now on the portfolio.

Stack

| Layer | Tool | |---|---| | Source | CMS data.cms.gov (public API, no auth) | | Warehouse | Snowflake Standard (trial) | | Ingest | Snowflake CLI + Python (DCAT-resolved fetch) | | Transform | dbt Core with dbt_utils | | Tests | dbt schema tests incl. cross-fact relationships | | Presentation | Hand-coded SVG charts on this page |

Where this goes next

Multiple data years — adding 2019–2023 introduces a dim_date and turns both facts into periodic snapshots, at which point SCD2 on dim_provider earns its keep (hospitals open, close, and merge across years).
Geographic cost-of-living adjustment — Medicare already wage-indexes payments; normalizing charges the same way would separate "expensive market" from "expensive hospital."
A fact_review-style patient-experience fact — CMS publishes HCAHPS survey scores keyed on the same CCN, a third fact that drops straight into the conformed model.

Repo: github.com/nathanaelhub/medicare-cost-quality.