Building a Workflow Engine That Handles 10M Runs Per Day

Sofia Andersen

Co-founder & CTO

When we started building Flux, our workflow engine handled about 1,000 runs per day. Today, it processes over 10 million. The architecture that got us from zero to one thousand was completely different from what we needed at ten million.

This post is a deep dive into how we built, broke, and rebuilt our execution engine — and the lessons we learned along the way.

The early days: simple and synchronous

Our first workflow engine was embarrassingly simple. A webhook came in, triggered a function that executed each step sequentially, and wrote the result to a database. It worked perfectly for our first 50 customers.

Then customer 51 signed up with a workflow that triggered 200 downstream events per execution. Our synchronous engine processed them one by one, each step waiting for the previous one to complete. A workflow that should have taken 3 seconds took 45. The queue backed up. Other customers' workflows started timing out.

We needed to go asynchronous.

The first rewrite: event-driven architecture

We moved to an event-driven model using a distributed message queue. Each workflow step became an independent event that could be processed by any available worker. Steps no longer waited for the previous step to complete before starting — they executed as soon as their dependencies were met.
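The dependency-driven dispatch described above can be sketched in a few lines. This is a minimal single-process illustration, not Flux's actual implementation: the function name `run_workflow`, the `steps` and `deps` dictionaries, and the in-memory `deque` standing in for the distributed message queue are all assumptions made for the example.

```python
from collections import deque


def run_workflow(steps, deps):
    """Dispatch each step as soon as all of its dependencies have finished.

    steps: dict mapping step name -> zero-argument callable
    deps:  dict mapping step name -> set of step names it depends on
    Returns the order in which steps were dispatched.
    """
    remaining = {name: set(deps.get(name, ())) for name in steps}
    dependents = {name: [] for name in steps}
    for name, needs in remaining.items():
        for need in needs:
            if need not in steps:
                raise ValueError(f"{name} depends on unknown step {need}")
            dependents[need].append(name)

    # Stand-in for the message queue: holds steps whose dependencies are met.
    ready = deque(name for name, needs in remaining.items() if not needs)
    order = []
    while ready:
        name = ready.popleft()
        steps[name]()  # in a real system, any available worker would pick this up
        order.append(name)
        for child in dependents[name]:
            remaining[child].discard(name)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(steps):
        raise ValueError("dependency cycle detected")
    return order
```

With a workflow where `a` and `b` both depend on `fetch`, and `merge` depends on `a` and `b`, both `a` and `b` become ready the moment `fetch` finishes, so with multiple workers they could run in parallel instead of one after the other.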

This solved our throughput problem immediately. We went from 1,000 runs per day to 100,000 within a month. But it introduced a new challenge: observability. When a workflow failed at step 7 of a 15-step sequence, tracing the failure across distributed workers was a nightmare.

We built a tracing system that assigned a unique execution ID to every workflow run and propagated it through every event. Each step logged its input, output, duration, and any errors against this execution ID. Suddenly, debugging a failed workflow meant looking up one ID and seeing the complete timeline.
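A rough sketch of the idea, assuming a hypothetical `ExecutionTrace` class that keeps records in memory. In a distributed setup the execution ID would travel inside each event and the records would go to a log store; the field names here are illustrative only.

```python
import time
import uuid


class ExecutionTrace:
    """Collects one record per step, all keyed by a single execution ID."""

    def __init__(self):
        self.execution_id = str(uuid.uuid4())
        self.records = []

    def run_step(self, name, func, payload):
        """Run func(payload) and log input, output, duration, and any error."""
        started = time.perf_counter()
        record = {
            "execution_id": self.execution_id,
            "step": name,
            "input": payload,
            "output": None,
            "error": None,
        }
        try:
            record["output"] = func(payload)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and failure, so every step leaves a record.
            record["duration_ms"] = (time.perf_counter() - started) * 1000
            self.records.append(record)

    def timeline(self):
        """Return (step, status) pairs in the order the steps ran."""
        return [(r["step"], "failed" if r["error"] else "ok") for r in self.records]
```

Looking up a failed run then amounts to filtering the records by `execution_id` and reading the timeline from top to bottom.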

The scale challenge: 1 million to 10 million

Hitting one million daily runs exposed our next bottleneck — the database. Every workflow step wrote its state to a centralized PostgreSQL instance. At one million runs with an average of 5 steps per workflow, that's 5 million database writes per day. Our database was struggling.
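As a back-of-the-envelope check of that write volume, the arithmetic can be written out as below. The helper names are made up for this example, and the per-second figure assumes runs are spread evenly across the day, which real traffic rarely is, so peak rates would be higher than the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_writes(runs_per_day, steps_per_run):
    """One state write per step, as in the design described above."""
    return runs_per_day * steps_per_run


def average_writes_per_second(runs_per_day, steps_per_run):
    """Daily writes averaged over a full day (ignores traffic peaks)."""
    return daily_writes(runs_per_day, steps_per_run) / SECONDS_PER_DAY


writes = daily_writes(1_000_000, 5)             # 5,000,000 writes per day
rate = average_writes_per_second(1_000_000, 5)  # roughly 58 writes per second
```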

We implemented two changes. First, we moved to an event-sourcing model where we stored the raw events (what happened) rather than the computed state (what the current status is). This turned expensive read-modify-write operations into cheap append-only writes. Second, we introduced a caching layer that held the computed state of active workflows in memory, persisting that computed state to the database only when a workflow completed or failed.
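The two changes can be illustrated together in a small sketch. Everything here is an assumption for the sake of the example: the event type names, the `EventStore` and `Engine` classes, and the plain Python list and dictionaries standing in for the database and cache.

```python
from functools import reduce

# Event types that end a run, mapped to the final status they produce.
TERMINAL = {"run_completed": "completed", "run_failed": "failed"}


def apply_event(state, event):
    """Fold one event into the computed state without mutating the input."""
    state = dict(state)
    kind = event["type"]
    if kind == "run_started":
        state["status"] = "running"
        state["completed_steps"] = 0
    elif kind == "step_completed":
        state["completed_steps"] = state.get("completed_steps", 0) + 1
    elif kind in TERMINAL:
        state["status"] = TERMINAL[kind]
    return state


class EventStore:
    """Append-only log; a list stands in for the events table."""

    def __init__(self):
        self.events = []

    def append(self, execution_id, event_type):
        event = {"execution_id": execution_id, "type": event_type}
        self.events.append(event)  # append only, no read-modify-write
        return event

    def replay(self, execution_id):
        """Rebuild the computed state from the raw events."""
        mine = (e for e in self.events if e["execution_id"] == execution_id)
        return reduce(apply_event, mine, {})


class Engine:
    def __init__(self, store):
        self.store = store
        self.active = {}     # in-memory computed state of running workflows
        self.persisted = {}  # stand-in for final state written to the database

    def record(self, execution_id, event_type):
        event = self.store.append(execution_id, event_type)
        state = apply_event(self.active.get(execution_id, {}), event)
        if event_type in TERMINAL:
            # Computed state is persisted only once the run has ended.
            self.persisted[execution_id] = state
            self.active.pop(execution_id, None)
        else:
            self.active[execution_id] = state
        return state
```

Because the log holds every event, the cached state is disposable: if a worker loses its memory, `replay` can rebuild the same state from the stored events.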

These changes reduced our database load by 80% and brought average execution latency down from 340 milliseconds to 95 milliseconds.

Error handling at scale

At 10 million runs per day, even a 0.1% failure rate means 10,000 failed workflow runs. We couldn't afford to let failures cascade.

We built a graduated retry system. Transient failures (network timeouts, rate limits) are retried automatically with exponential backoff, up to 5 attempts with increasing delays. Permanent failures (invalid credentials, deleted resources) are caught immediately, never retried, and routed to the user's error dashboard with specific guidance on how to fix the issue.
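A minimal sketch of that retry policy follows. The two exception classes, the function name, the base delay, the delay cap, and the jitter are assumptions chosen for illustration; only the limit of 5 attempts comes from the description above, and it is interpreted here as 5 attempts in total.

```python
import random
import time


class TransientError(Exception):
    """Failure that may succeed on retry, e.g. a timeout or rate limit."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, e.g. invalid credentials."""


def run_with_retries(step, max_attempts=5, base_delay=0.5, max_delay=30.0,
                     sleep=time.sleep):
    """Call step(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except PermanentError:
            # Not retried: the caller routes this to the error dashboard.
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
            # Delay doubles on each attempt: 0.5s, 1s, 2s, 4s by default.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter keeps retries from arriving in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the delays can be replaced in tests; in production the default `time.sleep` (or a delayed re-enqueue of the event) would apply.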

We also implemented circuit breakers for third-party integrations. If a connected service starts returning errors at an elevated rate, Flux automatically pauses workflows that depend on that service and notifies affected users — rather than flooding the failing service with retries and making the problem worse.
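One way to sketch such a breaker is shown below. The class name, the sliding window of recent results, the 50% threshold, and the cooldown are all hypothetical values picked for the example. It is also simplified: after the cooldown it closes fully, where a production breaker would typically let a few trial requests through first.

```python
import time
from collections import deque


class CircuitBreaker:
    """Pauses calls to a service once its recent error rate gets too high."""

    def __init__(self, failure_threshold=0.5, window=20, cooldown=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # fraction of failures that trips it
        self.window = window                        # number of recent calls considered
        self.cooldown = cooldown                    # seconds to stay open
        self.clock = clock
        self.results = deque(maxlen=window)
        self.opened_at = None

    def allow(self):
        """Return True if workflows may call the service right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False

    def record(self, success):
        """Record the outcome of one call and open the breaker if needed."""
        self.results.append(success)
        if len(self.results) == self.window:
            failures = self.results.count(False)
            if failures / self.window >= self.failure_threshold:
                self.opened_at = self.clock()
```

A worker would check `allow()` before calling the integration and pause the workflow (and notify the user) when it returns False, instead of sending another request to a service that is already failing.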

What's next

We're currently working on predictive execution — using historical performance data to pre-warm workers and pre-fetch data for workflows that run on predictable schedules. Early testing shows this can reduce execution latency by another 40%.

We're also exploring edge execution for workflows with strict latency requirements. Instead of routing everything through our central infrastructure, latency-sensitive workflows could execute on edge nodes closer to the user's connected services.

Building infrastructure that handles 10 million daily executions reliably is a never-finished project. Every scale milestone reveals the next bottleneck. But that's what makes this work interesting — and it's why we wake up every morning excited to build.
