Goldilocks Designs

Learn how to create a minimum viable architecture.

As Staff+, you’re making architectural decisions that shape how fast your team ships, how painful on-call is, and how much the company spends on infrastructure.

That means balancing two opposing forces:

  • Under-engineering leads to fragile systems that break under pressure.

  • Over-engineering buries your team in complexity and slows product delivery.

Staff+ engineers are expected to steer toward the middle: solutions that are robust enough for today’s needs, and extensible enough for tomorrow’s growth.

In this lesson, you’ll learn how to apply the Goldilocks principle to System Design—through a real-world walk-through of a notifications platform that evolves from V1 to global scale.

John loves overbuilding because it makes him seem irreplaceable. You’ll do the opposite—build just enough and document why. That’s the difference between ego-driven architecture and business-driven design.

Goldilocks: Notifications platform

Let’s say you’re building a notifications platform with email, push, and SMS. How do you design it without going too far either way?

We’ll walk through four versions of the same system—from scrappy V1 to global scale—to show what the Goldilocks principle looks like in real life.

Version 1: Keep it simple

One worker process polls a database table for jobs (e.g., send email, SMS, or push). Whenever a new notification is created, it’s inserted into a notifications table. A background process (cron or worker) polls every few seconds, picks unsent jobs, sends them, and marks them complete.
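The polling loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes SQLite as the database, and the table columns, the function names (`create_schema`, `enqueue`, `poll_once`, `run_worker`), and the `send` callback are all hypothetical. There is deliberately no failure handling, in keeping with Version 1: an exception from `send` simply aborts the current batch.

```python
import sqlite3
import time


def create_schema(conn):
    # Hypothetical schema: one row per notification job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notifications ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " channel TEXT NOT NULL,"
        " recipient TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )


def enqueue(conn, channel, recipient, body):
    # Creating a notification is just an INSERT into the table.
    conn.execute(
        "INSERT INTO notifications (channel, recipient, body) VALUES (?, ?, ?)",
        (channel, recipient, body),
    )
    conn.commit()


def poll_once(conn, send, batch_size=10):
    """Pick up to batch_size unsent jobs, send each, and mark it complete.

    Returns the number of jobs processed in this pass.
    """
    rows = conn.execute(
        "SELECT id, channel, recipient, body FROM notifications"
        " WHERE sent = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for job_id, channel, recipient, body in rows:
        send(channel, recipient, body)
        conn.execute("UPDATE notifications SET sent = 1 WHERE id = ?", (job_id,))
    conn.commit()
    return len(rows)


def run_worker(conn, send, interval_seconds=5.0):
    # The "polls every few seconds" loop; runs until the process is stopped.
    while True:
        poll_once(conn, send)
        time.sleep(interval_seconds)
```

Every pass issues a `SELECT` whether or not there is work to do, which is where the polling latency and database load listed under the cons come from.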

  • Pros:

    • Easy to build: One database table, one worker.

    • Cheap: Runs on a single server or cloud function.

    • Fast to ship: Ideal for MVPs or low-volume apps.

  • Cons:

    • No retry logic: If an email API call fails, the notification is lost.

    • Performance ceiling: Polling adds latency and can overwhelm the DB under load.

    • Limited visibility: Hard to track progress or debug failures.

  • When to use: For prototypes, internal tools, or early-stage products with low traffic, when speed to market matters more than robustness.

Figure (Version 1): Fast to build, fragile under stress

Version 2: Add reliability

Introduce a message queue (RabbitMQ, SQS, or a lightweight Kafka setup). Producers push notification jobs into a queue. Workers consume and process them asynchronously.

Instead of polling the DB, each new notification is pushed to a queue. Workers listen for jobs, process them, and on failure, retry (with exponential backoff). The DB remains the source of truth but isn’t constantly queried.
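To make the retry behavior concrete, here is a small sketch of a worker that drains a queue and retries failed sends with exponential backoff. It is an assumption-laden illustration: Python's in-process `queue.Queue` stands in for a real broker such as RabbitMQ or SQS, and the names `backoff_delay`, `process_queue`, `dead_letters`, and the default limits are hypothetical. Real brokers usually handle redelivery and dead-lettering themselves, so treat this only as a picture of the logic.

```python
import queue
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Delay in seconds after failed attempt number `attempt` (1-based).

    Doubles each time (base, 2*base, 4*base, ...) up to `cap`.
    """
    return min(cap, base * (2 ** (attempt - 1)))


def process_queue(jobs, send, dead_letters, max_attempts=3, sleep=time.sleep):
    """Drain `jobs`, retrying each failed send with exponential backoff.

    Jobs that still fail after max_attempts are appended to `dead_letters`
    instead of being dropped. Returns the number of delivered jobs.
    """
    delivered = 0
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return delivered
        for attempt in range(1, max_attempts + 1):
            try:
                send(job)
                delivered += 1
                break
            except Exception:
                if attempt == max_attempts:
                    # Out of retries: park the job for inspection.
                    dead_letters.append(job)
                else:
                    sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting. The dead-letter list is the in-memory analogue of the dead-letter queue mentioned under ops overhead.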

Pros:

  • Retry logic: Transient failures are retried instead of silently dropped.

  • Scalable: Add more workers for higher throughput.

  • Clear separation: Producers create jobs, consumers process them.

Cons:

  • More infra: Need to deploy and monitor a queue service.

  • Ops overhead: Dead-letter queues, visibility, and alerting required.

When to use: When you’ve hit reliability or latency issues with Version 1, and your system now needs to handle spikes or guarantee delivery.

Figure (Version 2): Reliable and horizontally scalable

Version 3: Optimize performance

Introduce a Redis cache for metadata and a CDN for static assets. User preferences, rate limits, and template data are cached. The CDN serves static assets, such as logos or templates used in notifications, which reduces latency. Workers fetch everything they need with minimal DB load.
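The caching pattern here is often called cache-aside: check the cache first, fall back to the database on a miss, then store the result with an expiry. The sketch below shows the idea under clear assumptions: a plain dictionary stands in for Redis, and the `TTLCache` class, its method names, and the `loader` callback are hypothetical. It also counts hits and misses, since monitoring those rates is listed as a requirement of this version.

```python
import time


class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        """Return the cached value, or call loader(key) and cache the result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: go to the source of truth (e.g., the DB).
        self.misses += 1
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call this when the underlying record changes to limit staleness.
        self._entries.pop(key, None)
```

The TTL bounds how stale a cached user preference can get, and `invalidate` is the hook for the "cache invalidation is hard" problem: every write path that changes the underlying data has to remember to call it.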

Pros:

  • Lower latency: Cached metadata means faster lookups.

  • Reduced DB stress: Heavy reads move to Redis.

  • Optimized delivery: CDNs accelerate static content.

Cons:

  • More complexity: Cache invalidation is hard.

  • Potential inconsistency: Data may be stale.

  • Requires discipline: You must monitor cache hit/miss rates.

When to use: When notification volume reaches the tens of thousands per hour and you’re optimizing for speed and efficiency, not just reliability.

Figure (Version 3): Tuned for performance at scale

Version 4: Global scale

Run a multi-region setup with active-active replication for queues and databases. Each region has mirrored infrastructure, including its own workers close to users for low latency. Traffic is routed to the nearest region (via DNS or a load balancer). Data replicates across regions, using conflict resolution or an active-passive fallback.
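The routing decision can be pictured as "lowest latency among the regions that are currently healthy." The sketch below is only an illustration of that rule: in practice it is made by DNS or a load balancer rather than application code, and the `pick_region` function and the region names in the usage example are hypothetical.

```python
def pick_region(latencies_ms, healthy):
    """Return the lowest-latency region that is healthy, or None if none are.

    latencies_ms: mapping of region name -> measured latency in milliseconds.
    healthy: set of region names currently passing health checks.
    """
    candidates = [
        (ms, name) for name, ms in latencies_ms.items() if name in healthy
    ]
    if not candidates:
        return None
    # min() compares latency first, then name as a deterministic tie-breaker.
    return min(candidates)[1]


if __name__ == "__main__":
    latencies = {"us-east": 20, "eu-west": 90, "ap-south": 180}
    # The nearest region is down, so traffic fails over to the next closest.
    print(pick_region(latencies, healthy={"eu-west", "ap-south"}))
```

Even this toy version hints at the operational cost: something has to measure latency, run health checks, and keep that state consistent across regions.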

Pros:

  • High availability: Survives regional outages.

  • Low latency: Notifications are delivered from the nearest region.

  • Fault tolerance: Systems keep running even during failures.

Cons:

  • Extreme complexity: You must manage replication, failover, and consistency.

  • Expensive: Infra and maintenance costs rise sharply with every region you add.

  • Ops maturity: Strong operational practices are needed.

When to use: When you’re a global product with strict SLAs and millions of events per day. Now you’re solving distributed systems problems, not just feature delivery.

Figure (Version 4): Build for global reliability

Each stage makes sense only when usage justifies it. Building Version 4 on day one is ego-driven. Moving step by step is business-driven.

The Goldilocks zone

Here’s a quick summary of the four versions covered above.

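| Version | Focus     | Key Additions                | Trade-Offs |
|---------|-----------|------------------------------|------------|
| 1       | Ship fast | Simple worker and DB         | Fragile    |
| 2       | Reliable  | Queue and retries            | More ops   |
| 3       | Fast      | Cache and CDN                | Complexity |
| 4       | Global    | Multi-region and replication | Expensive  |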

Each layer of complexity—queues, caches, global replication—adds power and cost.

So which one is in the Goldilocks zone? Any of them can be, depending on your situation.

But for most common scenarios, you might be looking at:

  • Version 2 for fast-growing products: it introduces queues and retries for reliability, without the overhead of full-on scale engineering.

  • Version 3 for products that have proven reliability and are starting to feel performance pain: it optimizes for speed and efficiency without crossing into distributed-systems territory.

As a Staff+ engineer, your job is to resist jumping to Version 4 until the pain is real. Your system should earn its complexity.

TL;DR: The Goldilocks principle in architecture means adding complexity only when the pain is real. To stay out of infra hell, start small and scale only when forced.
