Goldilocks Designs
Learn how to create a minimum viable architecture.
As Staff+, you’re making architectural decisions that shape how fast your team ships, how painful on-call is, and how much the company spends on infrastructure.
That means balancing two opposing forces:
Under-engineering leads to fragile systems that break under pressure.
Over-engineering buries your team in complexity and slows product delivery.
Staff+ engineers are expected to steer toward the middle: solutions that are robust enough for today’s needs, and extensible enough for tomorrow’s growth.
In this lesson, you’ll learn how to apply the Goldilocks principle to System Design—through a real-world walk-through of a notifications platform that evolves from V1 to global scale.
John loves overbuilding because it makes him seem irreplaceable. You’ll do the opposite—build just enough and document why. That’s the difference between ego-driven architecture and business-driven design.
Goldilocks: Notifications platform
Let’s say you’re building a notifications platform with email, push, and SMS. How do you design it without going too far either way?
We’ll walk through four versions of the same system—from scrappy V1 to global scale—to show what the Goldilocks principle looks like in real life.
Version 1: Keep it simple
Whenever a new notification is created, it’s inserted into a notifications table. A single background process (cron or worker) polls that table every few seconds, picks up unsent jobs, sends them (email, SMS, or push), and marks them complete. A minimal sketch of this loop follows the list below.
Pros:
Easy to build: One database table, one worker.
Cheap: Runs on a single server or cloud function.
Fast to ship: Ideal for MVPs or low-volume apps.
Cons:
No retry logic: If an email API call fails, the notification is gone.
Performance ceiling: Polling adds latency and can overwhelm the DB under load.
Limited visibility: Hard to track progress or debug failures.
When to use: For prototypes, internal tools, or early-stage products with low traffic, when speed to market matters more than robustness.
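To make this concrete, here’s a minimal sketch of the V1 loop in Python. It assumes a notifications table with id, channel, payload, and status columns, and a hypothetical deliver() function standing in for your email/SMS/push provider. Notice how a failed send is simply marked failed and forgotten: that’s the “no retry logic” con in action.

```python
import sqlite3
import time

def deliver(channel: str, payload: str) -> None:
    """Hypothetical send call: in practice this hits your
    email/SMS/push provider's API and raises on failure."""
    print(f"sending via {channel}: {payload}")

def poll_and_send(db_path: str = "app.db", interval_s: float = 5.0) -> None:
    conn = sqlite3.connect(db_path)
    while True:
        # Pick up a batch of unsent jobs. One table, one loop: that's V1.
        rows = conn.execute(
            "SELECT id, channel, payload FROM notifications "
            "WHERE status = 'pending' LIMIT 100"
        ).fetchall()
        for job_id, channel, payload in rows:
            try:
                deliver(channel, payload)
                status = "sent"
            except Exception:
                # No retry logic: a failed send is marked and forgotten.
                status = "failed"
            conn.execute(
                "UPDATE notifications SET status = ? WHERE id = ?",
                (status, job_id),
            )
        conn.commit()
        time.sleep(interval_s)  # polling adds up to interval_s of latency
```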
Version 2: Add reliability
Introduce a message queue (RabbitMQ, SQS, or Kafka). Producers push notification jobs into a queue, and workers consume and process them asynchronously.
Instead of polling the DB, each new notification is pushed to the queue. Workers listen for jobs, process them, and retry on failure with exponential backoff. The DB remains the source of truth but isn’t constantly queried. A sketch of the consumer loop follows the list below.
Pros:
Retry logic: No more dropped notifications.
Scalable: Add more workers for higher throughput.
Clear separation: Producers create jobs, consumers process them.
Cons:
More infra: Need to deploy and monitor a queue service.
Ops overhead: Dead-letter queues, visibility, and alerting required.
When to use: When you’ve hit reliability or latency issues with V1. Your system now needs to handle spikes or guarantee delivery.
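Here’s what the V2 consumer pattern looks like, using Python’s standard-library queue.Queue as a stand-in for RabbitMQ or SQS; deliver() is again a hypothetical provider call. The shape is what matters: retries with exponential backoff, then a dead-letter bucket for jobs that exhaust their attempts.

```python
import queue
import time

def deliver(channel: str, payload: str) -> None:
    """Hypothetical provider call; raises on failure."""
    print(f"sending via {channel}: {payload}")

def consume(jobs: "queue.Queue[dict]", max_attempts: int = 5) -> None:
    dead_letter: list[dict] = []
    while True:
        job = jobs.get()  # blocks until a producer pushes a job
        for attempt in range(max_attempts):
            try:
                deliver(job["channel"], job["payload"])
                break
            except Exception:
                # Exponential backoff: wait 1s, 2s, 4s, ... between tries.
                time.sleep(2 ** attempt)
        else:
            # Retries exhausted: park the job for inspection and alerting,
            # the role a dead-letter queue plays in RabbitMQ/SQS.
            dead_letter.append(job)
        jobs.task_done()
```

Producers just put a job on the queue and move on; delivery happens asynchronously, and the dead-letter list is what you’d wire up to alerting in a real deployment.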
Version 3: Optimize performance
Introduce a Redis cache for metadata and a CDN for static assets. User preferences, rate limits, and template data are cached. The CDN serves static assets like logos or templates used in notifications, reducing latency. Workers fetch everything they need with minimal DB load. A cache-aside sketch follows the list below.
Pros:
Lower latency: Cached metadata means faster lookups.
Reduced DB stress: Heavy reads move to Redis.
Optimized delivery: CDNs accelerate static content.
Cons:
More complexity: Cache invalidation is hard.
Potential inconsistency: Data may be stale.
Requires discipline: You must monitor cache hit/miss rates.
When to use: When notification volume reaches the tens of thousands per hour. You’re optimizing for speed and efficiency, not just reliability.
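A common way to implement the metadata caching is the cache-aside pattern. Here’s a sketch using the redis-py client; the preference schema, key format, and TTL are illustrative assumptions.

```python
import json
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_preferences_from_db(user_id: int) -> dict:
    """Hypothetical slow path: the DB lookup we want to avoid repeating."""
    return {"email": True, "sms": False, "push": True}  # placeholder data

def get_preferences(user_id: int, ttl_s: int = 300) -> dict:
    key = f"prefs:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no DB round trip
    prefs = load_preferences_from_db(user_id)  # cache miss: slow path
    r.setex(key, ttl_s, json.dumps(prefs))  # TTL bounds how stale data gets
    return prefs
```

The TTL is the discipline knob: it caps how stale a cached preference can get, at the cost of periodic cache misses, and your hit/miss monitoring tells you whether it’s set sensibly.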
Version 4: Global scale
Multi-region setup with active-active replication for queues and databases. Traffic is routed to the nearest region (via DNS or a load balancer), where mirrored infrastructure and workers close to users keep latency low. Data replicates across regions with conflict resolution, or an active-passive fallback. A toy region-routing sketch follows the list below.
Pros:
High availability: Survives regional outages.
Low latency: Notifications delivered from nearest region.
Fault tolerance: Systems keep running even during failures.
Cons:
Extreme complexity: Managing replication, failover, and consistency.
Expensive: Infrastructure and maintenance costs multiply with every region.
Ops maturity: Running multi-region safely demands strong operational discipline.
When to use: When you’re a global product with strict SLAs and millions of events/day. Now you’re solving distributed systems problems, not just feature delivery.
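To give the routing idea some shape, here’s a toy sketch of nearest-region selection with failover. Real systems do this at the DNS or load-balancer layer (think latency-based routing), not in application code, and every name below (regions, endpoints, the health probe) is an assumption for illustration.

```python
REGIONS = {
    "us-east": "https://us-east.notify.example.com",
    "eu-west": "https://eu-west.notify.example.com",
    "ap-south": "https://ap-south.notify.example.com",
}

# Preference order per user geography: nearest region first, then fallbacks.
PREFERENCE = {
    "US": ["us-east", "eu-west", "ap-south"],
    "EU": ["eu-west", "us-east", "ap-south"],
    "APAC": ["ap-south", "us-east", "eu-west"],
}

def is_healthy(region: str) -> bool:
    """Hypothetical health probe (e.g., a per-region /healthz check)."""
    return True  # placeholder: every region reports healthy in this sketch

def pick_region(user_geo: str) -> str:
    """Route to the nearest healthy region, failing over down the list:
    the active-passive fallback described above, in miniature."""
    for region in PREFERENCE[user_geo]:
        if is_healthy(region):
            return region
    raise RuntimeError("no healthy region available")
```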
Each stage makes sense only when usage justifies it. Building V4 on day one is ego-driven. Moving step by step is business-driven.
The Goldilocks zone
Here’s a quick summary of the four versions we walked through.
| Version | Focus | Key Additions | Trade-Offs |
|---------|-------|---------------|------------|
| 1 | Ship fast | Simple worker and DB | Fragile |
| 2 | Reliable | Queue and retries | More ops |
| 3 | Fast | Cache and CDN | Complexity |
| 4 | Global | Multi-region and replication | Expensive |
Each layer of complexity—queues, caches, global replication—adds power and cost.
So which one is in the Goldilocks zone? Any of them can be, depending on your situation.
But for most common scenarios, you might be looking at:
Version 2 for fast-growing products: it introduces queues and retries for reliability, without the overhead of full-scale infrastructure engineering.
Version 3 when you’ve proven reliability and are starting to feel performance pain. It optimizes for speed and efficiency without crossing into distributed‑systems territory.
As a Staff+ engineer, your job is to resist jumping to Version 4 until the pain is real. Your system should earn its complexity.
TL;DR: The Goldilocks principle in architecture means scaling only when the pain is real. To stay out of infra hell, start small and only scale when forced.