Why I Built This

I didn’t want to just deploy an application.

I wanted to understand what actually happens between writing code and serving users in production — because that gap is where most real-world failures occur.

On a local machine, things are simple:

The application runs on localhost. There’s no traffic routing, no health validation, and no failure handling.

But in production, the same application becomes part of a distributed system (multiple machines working together as one system).

The Real Problem I Wanted to Solve

While learning DevOps, I noticed something:

Most YouTube tutorials teach how to deploy, but not why deployments fail.

They show the happy path: you push code, the pipeline runs, and the app is visible.

But they don’t explain:

What if your container shows as “running,” but you still can’t reach it?

What happens when health checks start failing behind the scenes?

What if there’s a networking misconfiguration quietly breaking communication?

Or worse—what if the system is just dropping traffic without throwing any obvious errors?

I didn’t want to stop at “it works on my machine.” That’s surface-level.

I wanted to understand what’s going on under the hood: how deployment pipelines make sure a new release isn’t secretly broken before calling it “good,” and how systems catch issues on the fly and react in real time instead of waiting for something to crash. I was also curious about how traffic gets routed based on what’s healthy versus what’s not, and how all those behind-the-scenes infrastructure decisions end up shaping how an application behaves in production.

One of the biggest takeaways for me was understanding the separation of concerns across the system. Structuring things into distinct layers—services/ for application logic, deployments/ for container and ECS configurations, cicd/ for pipeline orchestration, and infrastructure/ for IAM, networking, and load balancing—made the entire setup feel a lot more intentional and easier to reason about. It helped me see that code, deployment, and infrastructure aren’t all tangled together; they operate as independent layers with their own responsibilities. And more importantly, a lot of real-world issues don’t come from within a single layer, but from how these layers interact—those boundaries are where things tend to break in subtle ways.

System Architecture:

At a high level, this system is a cloud-native deployment pipeline that connects code changes to live production traffic through a series of controlled stages.

Developer → CodeCommit → CodePipeline → CodeBuild → ECR → ECS (Fargate) → ALB → Users

But this is just a surface-level view. The actual system works as a set of cooperating layers, each responsible for a specific part of the deployment lifecycle.

The system didn’t fail because of one bug. It failed because of mismatches between different layers of the architecture.

I didn’t go too deep into every architectural detail while building this—I focused more on getting the system working end-to-end. The setup is pretty straightforward: CodePipeline handles the flow by triggering build and deployment whenever there’s a change, acting more like a coordinator than doing any actual work. CodeBuild takes over to build the Docker image, run tests, and push it to ECR, where the image is stored as a versioned, immutable artifact. From there, ECS Fargate runs the container and manages scaling, making sure the service is up.

But what really matters isn’t just how it’s built; it’s where it can fail. ECS might show your container as running, but that doesn’t mean it’s actually usable. That decision is made by the ALB (Application Load Balancer), which routes traffic to a target only if its health checks pass. If /health fails, the target receives no traffic, and that’s where deployments actually succeed or fail in real-world terms. And when something breaks (which it will), CloudWatch is what helps you figure out why, with logs and metrics giving you visibility into what’s going on.

So yeah, the architecture is important—but as a DevOps engineer, understanding where the system fails and how to debug it matters way more than just knowing how it was set up.

This is why fundamentals such as networking, Linux, and databases matter so much.

Where Things Actually Broke

ECS said everything was fine because the container was running. From its point of view, the process was alive, so job done. But ALB disagreed—it couldn’t get a valid response from the app, so it marked the target unhealthy. Same system, two different definitions of “healthy,” and that gap is where things started breaking.

Solution:

Health Endpoint: Introduced a lightweight /health endpoint returning 200 OK—no DB calls, no heavy logic. This gave ALB a deterministic signal to decide if the app is ready, instead of relying on indirect checks.
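A minimal sketch of what such an endpoint can look like, using only Python's standard library. The project's actual language and framework aren't stated here, so the handler class, the helper names, and the default port are illustrative assumptions rather than the real implementation.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and nothing else: no DB calls, no heavy logic."""

    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep frequent health-check polling out of the logs


def start_server(host="0.0.0.0", port=3000):
    """Start the server on a background thread and return it (hypothetical helper)."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check_health(port, host="127.0.0.1"):
    """Return the HTTP status of GET /health, roughly what a load balancer probe sees."""
    conn = http.client.HTTPConnection(host, port, timeout=2)
    try:
        conn.request("GET", "/health")
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling start_server() and then check_health(3000) should return 200 as long as the process is up and listening, which is the deterministic signal the load balancer needs.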

Then came the networking confusion. The app was happily running on localhost:3000 inside the container, which worked perfectly in isolation. But to the outside world—especially the load balancer—it was basically invisible. The app assumed a local environment, while the system expected it to be reachable over the network.

Solution:

Switched from localhost to 0.0.0.0, allowing the container to accept external traffic. This bridged the gap between a “running container” and an actually reachable service.
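The difference between the two bind addresses can be sketched with plain sockets. This is an illustration of the general idea, not the project's code, and the helper names are made up for the example.

```python
import socket


def open_listener(host, port=0):
    """Return a listening TCP socket bound to host:port (port 0 means any free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


def is_loopback_only(sock):
    """True if the socket only accepts connections that originate on the same host."""
    return sock.getsockname()[0].startswith("127.")


# Bound to 127.0.0.1: reachable only from inside the container itself.
local_only = open_listener("127.0.0.1")
print(is_loopback_only(local_only))   # True

# Bound to 0.0.0.0: listens on every interface, so the load balancer can connect.
all_interfaces = open_listener("0.0.0.0")
print(is_loopback_only(all_interfaces))  # False

local_only.close()
all_interfaces.close()
```

The application code is identical in both cases. Only the bind address decides whether anything outside the container can open a connection.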

Startup behavior added another layer to this. The application needed a bit of time to initialize, but the health checks were impatient. ALB started checking too early, didn’t get a proper response, and immediately flagged it as unhealthy. The system expected a clear readiness signal, but the app wasn’t built to provide one.

Solution:

Decoupled server startup from initialization. The server now starts immediately, while heavy tasks run asynchronously—so the app becomes reachable first, fully ready shortly after.
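A rough sketch of that decoupling, assuming a background thread for the heavy work and a simple readiness flag. The names and the simulated delay are hypothetical, and the separate readiness status is an optional extra the original setup may not have had.

```python
import threading
import time


class AppState:
    """Tracks whether background initialization has finished."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        self._ready.set()

    def is_ready(self):
        return self._ready.is_set()

    def wait_until_ready(self, timeout=None):
        return self._ready.wait(timeout)


def slow_initialization(state, work_seconds):
    time.sleep(work_seconds)  # stand-in for cache warm-up, connections, etc.
    state.mark_ready()


def start_app(work_seconds=0.2):
    """Kick off initialization in the background and return immediately.

    In a real service, the HTTP server would start listening right here,
    without waiting for the heavy work to finish.
    """
    state = AppState()
    threading.Thread(
        target=slow_initialization, args=(state, work_seconds), daemon=True
    ).start()
    return state


def health_status(state):
    """What /health returns: 200 as soon as the server is up."""
    return 200


def readiness_status(state):
    """Hypothetical stricter check: 503 until initialization completes."""
    return 200 if state.is_ready() else 503
```

With this shape, the health check gets an answer from the first second, and the slow work no longer decides whether the target looks alive.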

What made it more interesting was that the CI/CD pipeline showed everything as successful—build passed, deployment completed, no errors anywhere. But in reality, the application still wasn’t accessible. That’s when it clicked: pipelines confirm that steps executed, not that the system actually works as intended.

Solution:

To fix this, I tuned the health check configuration, adjusting the grace period, interval, and timeout to better match real startup behavior. This prevented the system from marking the application as unhealthy before it even had a chance to respond.
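To make the tuning concrete, here is a rough arithmetic model of how those settings interact. The numbers are illustrative, not the values used in this project, and the unhealthy threshold is an extra setting assumed for the example. On AWS these settings roughly correspond to the ECS service's healthCheckGracePeriodSeconds and the target group's HealthCheckIntervalSeconds, HealthCheckTimeoutSeconds, and UnhealthyThresholdCount.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckTiming:
    grace_period_s: int       # how long failed checks are ignored after task start
    interval_s: int           # time between consecutive checks
    timeout_s: int            # how long a single check waits for a response
    unhealthy_threshold: int  # consecutive failures before "unhealthy"

    def validate(self):
        # Assumed rule for this sketch: a check must time out before the next one starts.
        if self.timeout_s >= self.interval_s:
            raise ValueError("timeout should be shorter than the interval")

    def seconds_until_unhealthy(self):
        """Approximate time for consecutive failures to flag the target."""
        return self.interval_s * self.unhealthy_threshold


def covers_startup(timing, startup_s):
    """True if the grace period is long enough for the app to finish starting."""
    return timing.grace_period_s >= startup_s


# Illustrative values only: an app that needs about 40 seconds to start.
impatient = HealthCheckTiming(grace_period_s=0, interval_s=10, timeout_s=5, unhealthy_threshold=2)
tuned = HealthCheckTiming(grace_period_s=60, interval_s=15, timeout_s=5, unhealthy_threshold=3)

print(impatient.seconds_until_unhealthy(), covers_startup(impatient, 40))  # 20 False
print(tuned.seconds_until_unhealthy(), covers_startup(tuned, 40))          # 45 True
```

In the first configuration the target can be flagged after about 20 seconds, well before a 40-second startup finishes. In the second, the grace period alone covers startup.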

At the core, the application itself was designed like a simple local app—start up, run, and assume it’s ready. But the infrastructure expected something more explicit: a /health endpoint, fast responses, and clear signals of readiness. They were essentially speaking different languages.

Finally, I cleaned up duplicate server starts and removed blocking operations, resulting in a more predictable startup and stable runtime behavior.

Key Learning:

Health Checks Aren’t Just Monitoring: They’re Control Signals

I used to think health checks were just for observing system status.

Now I see them differently:

They actively control traffic flow, not just report it

They decide whether your app should receive traffic or be isolated

They determine if a deployment stays alive or gets replaced

Networking Is a First-Class Design Concern.

Binding to localhost restricts container accessibility.

In distributed systems, reachability = availability.

An unreachable service is effectively a non-existent service.

CI/CD Validates Execution, Not Runtime Behavior.

Pipeline success guarantees: build completion, deployment execution.

It does not guarantee application correctness in production.

Runtime validation is enforced by: load balancers, health checks, real user traffic.
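One way to narrow that gap is a small post-deploy check that asks the running service whether it actually responds. This is a hypothetical sketch, not a step this pipeline is described as having, and the function name and defaults are assumptions.

```python
import http.client
import time


def wait_for_healthy(host, port, path="/health", attempts=5, delay_s=1.0):
    """Poll the endpoint until it returns 200, or give up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            conn = http.client.HTTPConnection(host, port, timeout=2)
            try:
                conn.request("GET", path)
                if conn.getresponse().status == 200:
                    return True
            finally:
                conn.close()
        except (OSError, http.client.HTTPException):
            pass  # refused, timed out, or garbled response: not healthy yet
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A pipeline stage could call wait_for_healthy with the service's public host and port, and fail the deployment when it returns False. That turns "the steps ran" into "the service answered."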

System Layers Must Share a Common Health Contract

Each layer evaluates health differently:

ECS: process-level (container running)

ALB: network-level (valid HTTP response)

Application: internal logic/state.

Misalignment leads to false positives/negatives in system health

Stability requires a consistent, cross-layer definition of “healthy”
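The mismatch can be sketched as three different answers to the same question. This is a simplified model for illustration. The field and function names are invented, and a real ALB can be configured to accept other success codes than 200.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceSnapshot:
    process_running: bool        # what ECS looks at
    http_status: Optional[int]   # what the ALB gets from /health (None = no response)
    initialized: bool            # what the application believes about itself


def layer_views(snapshot):
    """Return each layer's own verdict on whether the service is healthy."""
    return {
        "ecs": snapshot.process_running,
        "alb": snapshot.http_status == 200,
        "app": snapshot.initialized,
    }


def layers_disagree(snapshot):
    """True when at least two layers reach different conclusions."""
    return len(set(layer_views(snapshot).values())) > 1


# The localhost-binding failure: the process is alive and the app thinks it is
# ready, but the load balancer never gets a response.
bound_to_localhost = ServiceSnapshot(process_running=True, http_status=None, initialized=True)
print(layer_views(bound_to_localhost))      # {'ecs': True, 'alb': False, 'app': True}
print(layers_disagree(bound_to_localhost))  # True
```

Every failure described earlier is some version of this snapshot: at least one layer says "healthy" while another says "not."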

Startup design matters more than you think. If your app tries to do everything at the start, it becomes slow and the system thinks it’s not working. That’s why you see failures, restarts, and unstable deployments.

Simple rule: start fast and respond. You can finish the rest later.

Closing Words:

I started this project with a simple goal: to deploy an application using AWS services.

Along the way, I realized something important—when things break in production, it’s rarely just because of bad code. Most of the issues came from gaps between different layers of the system: the app thought it was ready, ECS thought it was running, but the load balancer disagreed. That mismatch is where real problems live.

It made me understand that writing application logic is only one part of the job. Knowing how infrastructure behaves—how it validates, routes, and sometimes rejects your app—is just as critical if you want things to actually work in production.

GITHUB LINK:

https://github.com/shreyaabaranwal/SafeDeploy
