Here’s a fundamental contradiction at the heart of modern ETL: you’re being charged for row counts when analytical processing doesn’t work that way.
Modern query engines process data in columnar fashion for performance, using vectorisation, predicate pushdown, and parallel execution. Analytics workloads aggregate across many rows but typically access only specific columns. Performance optimisation targets compute efficiency and resource utilisation, not data volume metrics.
Yet ETL vendors still charge based on row counts, a metric that ignores how analytical processing actually works and scales costs with data volume rather than computational complexity or infrastructure consumption.
This isn’t just a pricing quirk. It’s a fundamental misalignment that forces data engineering teams to optimise for billing metrics rather than analytical performance. When your query engine is designed for efficient columnar processing but your ETL vendor charges per row processed, you’re optimising for vendor convenience rather than technical reality.
The disconnect becomes starker when you consider that data engineers understand infrastructure costs. They optimise for compute efficiency, memory utilisation, and query performance. But row-based pricing forces optimisation around arbitrary metrics that don’t correlate with actual system performance or resource consumption.
Let’s examine why this processing misalignment has created both technical debt and engineering overhead for data teams.
Every data engineering team knows that analytical performance comes from how query engines process data, not how much data exists.
Modern analytics engines use columnar processing techniques: vectorisation for batch operations, predicate pushdown to filter early, and parallel execution across distributed nodes. When you run analytical queries, systems optimise for processing efficiency rather than data volume.
Why processing optimisation matters:
Analytics workloads typically aggregate across many rows but access only specific columns. A revenue analysis might scan millions of transactions but only process date, amount, and category fields. Performance depends on compute efficiency and memory utilisation, not the number of rows involved.
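As a toy Python sketch of why column access, not row count, drives the work (the data and query here are invented for illustration, and real columnar engines like those behind your warehouse do this natively):

```python
# Toy illustration of columnar access: a transactions table stored
# column-wise, so an aggregation touches only the columns it needs.
transactions = {
    "date":     ["2025-06-01", "2025-06-01", "2025-06-02"],
    "amount":   [999, 25, 450],
    "category": ["laptop", "accessory", "monitor"],
    "shipping_address": ["..."] * 3,   # never read by the query below
    "notes":            ["..."] * 3,   # never read either
}

# The equivalent of "SELECT category, SUM(amount) ... GROUP BY category"
# reads exactly two columns, no matter how wide the table is.
revenue: dict[str, int] = {}
for cat, amt in zip(transactions["category"], transactions["amount"]):
    revenue[cat] = revenue.get(cat, 0) + amt

print(revenue)  # {'laptop': 999, 'accessory': 25, 'monitor': 450}
```

The row count only determines how long the two relevant columns are; the untouched columns cost nothing at query time, which is precisely the efficiency that per-row billing ignores.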
Your entire analytical stack assumes this processing foundation. dbt models optimise for query performance. BI tools leverage analytical processing patterns. Data warehouse billing aligns with compute and storage consumption.
Except your ETL vendor charges for rows.
Monthly Active Rows sound deceptively simple: count the rows, charge accordingly. But the technical reality creates a labyrinth of unpredictable costs that even experienced data engineers struggle to navigate.
The business transaction disconnect:
The core problem isn’t technical complexity; it’s the disconnect between business logic and billing metrics. One logical business transaction can generate multiple database rows, leaving no obvious link between what your customers pay you for and what you pay your ETL vendor for.
Real user experience: “syncing 1 million rows of nested JSON could result in 10+ million MARs after normalisation.”
Step 1: Your source data (the unit of work you’re interested in)
```json
{
  "customer_id": 12345,
  "items": [
    {"product": "laptop", "amount": 999},
    {"product": "mouse", "amount": 25}
  ]
}
```
You expect to be charged for: 1 unit
Step 2: After Order Processing (what actually gets created)
Table: `customers`

| customer_id |
| ----------- |
| 12345       |

Table: `orders`

| customer_id | order_id | total |
| ----------- | -------- | ----- |
| 12345       | 50001    | 1024  |

Table: `orders_items`

| customer_id | product | amount |
| ----------- | ------- | ------ |
| 12345       | laptop  | 999    |
| 12345       | mouse   | 25     |

Table: `order_metadata` (often auto-generated)

| customer_id | sync_timestamp | record_hash |
| ----------- | -------------- | ----------- |
| 12345       | 2025-06-30     | abc123def   |
You’re actually charged for: 5+ units (rows)
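The fan-out above can be sketched in a few lines of Python. This is a minimal illustration of how a normalising loader might split the payload; the generated key and metadata values are hypothetical, not any vendor's actual implementation:

```python
import json

# The single logical unit of work: one order payload.
order_json = """
{
  "customer_id": 12345,
  "items": [
    {"product": "laptop", "amount": 999},
    {"product": "mouse", "amount": 25}
  ]
}
"""
order = json.loads(order_json)

# Normalisation fans one payload out into parent and child rows,
# and every one of them counts as a billable "active row".
customer_rows = [{"customer_id": order["customer_id"]}]
order_rows = [{"customer_id": order["customer_id"],
               "order_id": 50001,  # hypothetical generated key
               "total": sum(i["amount"] for i in order["items"])}]
item_rows = [{"customer_id": order["customer_id"], **item}
             for item in order["items"]]
metadata_rows = [{"customer_id": order["customer_id"],
                  "sync_timestamp": "2025-06-30",
                  "record_hash": "abc123def"}]  # illustrative values

billable_rows = (len(customer_rows) + len(order_rows)
                 + len(item_rows) + len(metadata_rows))
print(billable_rows)  # 5 rows billed for 1 business transaction
```

Add a second item to the order and the bill grows again, even though the business still sees one transaction.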
The scale impact: at this multiplication rate, the reported experience of 1 million synced rows becoming 10+ million MARs after normalisation isn’t an edge case, it’s the expected outcome.
The predictability challenge:
Users consistently report: “the active rows are internal and opaque, there’s no clear way to estimate usage.” This creates a translation gap between business operations and technical billing where cost spikes can’t be traced to specific business decisions.
The fundamental issue isn’t vendor opacity—it’s that row-based billing creates no predictable relationship between business value and infrastructure costs. Teams also report paying for “rows replicated as a result of resetting Replication Keys” when upstream systems change schemas, triggering full table reloads rather than incremental updates.
The contrast between how analytical engines process data and how ETL vendors bill creates technical debt that compounds over time.
Analytical processing enables vectorised operations, efficient memory utilisation, and optimised query execution. Row-based pricing works against these optimisations: well-normalised models multiply billable rows, and processing-efficiency gains earn no discount.
The engineering contradiction:
Data engineers optimise for analytical performance and compute efficiency whilst managing row-based billing constraints. These optimisations often conflict: efficient data modelling increases row counts, real-time processing improvements spike costs, and query performance optimisations can trigger expensive reconciliation cycles.
The most insidious technical cost is how row-based pricing redirects engineering focus from system optimisation to vendor cost management.
The automation paradox:
ETL platforms promise to eliminate data pipeline maintenance. But row-based pricing forces the opposite outcome: “more engineering effort on optimising syncs and data flows just to control costs, reducing the automation benefits that ETL tools are supposed to provide.”
Technical debt patterns:
Teams merge data sources into fewer connectors to regain bulk pricing, creating monolithic pipelines that are harder to maintain. Engineers adjust replication schedules based on cost rather than data freshness requirements. Custom extraction logic gets built to pre-aggregate data before ETL processing, duplicating vendor platform functionality.
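The pre-aggregation workaround looks something like the following sketch. The event data and daily roll-up here are invented to show the trade-off: fewer billable rows, at the cost of analytical granularity and duplicated platform logic:

```python
from collections import Counter
from datetime import date

# Hypothetical raw event stream: one row per click event.
events = [
    {"day": date(2025, 6, 30), "page": "/pricing"},
    {"day": date(2025, 6, 30), "page": "/pricing"},
    {"day": date(2025, 6, 30), "page": "/docs"},
    {"day": date(2025, 7, 1),  "page": "/pricing"},
]

# Pre-aggregate to daily counts before the ETL tool ever sees the data,
# trading event-level analysis for a smaller billable row count.
daily = Counter((e["day"], e["page"]) for e in events)
synced_rows = [{"day": d, "page": p, "clicks": n}
               for (d, p), n in daily.items()]

print(len(events), "->", len(synced_rows))  # 4 raw rows -> 3 synced rows
```

At real volumes the reduction is far larger, which is exactly why teams build it; but the aggregation logic now lives outside the platform they are paying to maintain it.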
Resource allocation impact:
One user captured this perfectly: “I developed an in-house solution for all our pipelines and achieved real-time replication for much less.”
When senior data engineers choose to build custom ETL solutions rather than use vendor platforms, the pricing model has fundamentally broken the value proposition.
Row-based pricing makes technical experimentation financially risky. Real-time features get delayed because teams can’t predict MAR impact. Schema improvements get postponed because they could reset replication keys.
The pricing complexity and architectural misalignment have triggered widespread alternative evaluation across data engineering teams.
Teams are accepting increased maintenance overhead by moving to combinations like “dlt + Airflow” rather than continuing with unpredictable commercial pricing. Organisations are abandoning vendor solutions entirely, with users reporting they “transitioned all ETL connectors to Python scripts” despite the increased development burden.
The technical trade-off analysis:
When data engineering teams evaluate build vs buy decisions, row-based pricing tips the scales toward building.
Platform evaluation criteria:
Teams seeking alternatives consistently prioritise performance-based pricing models that align costs with how modern data infrastructure actually operates, particularly columnar storage architectures.
At Matatika, our performance-based approach charges for actual infrastructure consumption: compute, storage, and execution time rather than row counts.
The processing benefits:
Resource correlation: Costs scale with actual infrastructure consumption patterns that data engineers understand and can optimise for.
Efficiency rewards: Optimising pipelines, queries, and processing workflows immediately reduces costs rather than fighting against arbitrary row count metrics.
Technical alignment: Pricing reflects how analytical engines actually charge for compute and storage, creating consistency across your data stack.
Operational predictability: Infrastructure-based metrics allow capacity planning using the same frameworks you use for data warehouse and cloud infrastructure costs.
Engineering focus restoration:
Performance-based pricing eliminates the contradiction between technical optimisation and cost management. When billing aligns with infrastructure consumption, engineering teams can focus on system performance rather than vendor cost management.
Row-based pricing creates hidden technical costs that compound beyond the monthly invoice.
Development overhead:
Senior engineer time spent on cost optimisation rather than feature development. Architecture compromises that increase maintenance burden. Technical debt accumulation from workarounds designed to manage billing rather than improve performance.
System performance impact:
Suboptimal data freshness due to sync frequency adjustments for cost control. Monolithic pipeline design forced by connector consolidation. Schema evolution constraints due to replication reset penalties.
Row-based pricing incentivises technical decisions that reduce system resilience: fewer data sources due to connector penalties, consolidated pipelines that create single points of failure, and delayed schema updates that accumulate technical debt.
How do I audit the technical impact of row-based pricing?
Analyse engineering time allocation: track hours spent on ETL cost optimisation vs feature development. Identify architectural compromises made for billing rather than technical reasons. Calculate technical debt accumulation from billing-driven workarounds.
What’s the performance impact of MAR optimisation?
Common patterns include delayed schema updates (increasing query complexity), denormalised data structures (reducing analytical efficiency), and suboptimal sync frequencies (degrading data freshness). These optimisations often contradict columnar storage best practices.
How does performance-based pricing change technical decision-making?
It eliminates the conflict between cost optimisation and technical optimisation. Engineering teams can focus on system performance, data freshness, and architectural merit rather than managing arbitrary billing metrics.
What infrastructure metrics should ETL pricing track?
Compute consumption (CPU hours), storage utilisation (data volume and retention), network bandwidth (transfer costs), and execution time (processing duration). These align with how columnar databases and cloud platforms charge for resources.
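As a toy model of how these metrics compose into a bill (the rates and figures below are invented purely for illustration, not Matatika's or anyone's actual pricing):

```python
# Hypothetical unit rates for illustration only.
COMPUTE_RATE_PER_CPU_HOUR = 0.05
STORAGE_RATE_PER_GB_MONTH = 0.02

def infra_cost(cpu_hours: float, storage_gb: float) -> float:
    """Cost under an infrastructure-based model: scales with resources used."""
    return (cpu_hours * COMPUTE_RATE_PER_CPU_HOUR
            + storage_gb * STORAGE_RATE_PER_GB_MONTH)

# An efficiency improvement (e.g. switching to incremental loads) halves
# compute and directly cuts the bill -- row counts never enter the formula.
before = infra_cost(cpu_hours=200, storage_gb=500)
after = infra_cost(cpu_hours=100, storage_gb=500)
print(before, "->", after)  # 20.0 -> 15.0
```

The point of the model is the feedback loop: the same optimisations engineers already make for performance reasons show up directly as cost reductions.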
How do I evaluate technical alternatives to row-based pricing?
Assess cost predictability (can you forecast based on infrastructure metrics?), architectural alignment (does pricing reflect how your systems actually work?), and engineering focus (are technical decisions driven by performance or billing considerations?).
The evidence is overwhelming: row-based pricing creates architectural misalignment that generates technical debt, redirects engineering resources, and constrains system design decisions.
Modern data infrastructure uses columnar storage for performance. Cloud platforms charge for actual resource consumption. ETL pricing should follow the same technical logic.
Matatika’s performance-based pricing eliminates this architectural contradiction. By aligning costs with infrastructure consumption rather than arbitrary metrics, we enable engineering teams to optimise for system performance rather than vendor billing convenience.
Tired of optimising for billing metrics instead of technical merit?
The ETL Escape Plan provides the technical framework to evaluate alternatives and escape row-based pricing constraints.
Whether you’re facing architecture constraints from billing considerations or engineering resource drain from cost management, the Escape Plan provides the technical roadmap to align your ETL costs with modern data infrastructure realities.
Download the ETL Escape Plan →
Stop optimising for billing metrics. Get the technical framework to align ETL costs with modern analytical processing and restore engineering focus to performance rather than pricing.