

Cloud Infrastructure with AWS: What We've Learned

9 min read · EnviaIT Engineering

Three years ago, our deployment process looked like this: SSH into a server, pull the latest code, restart the process, and pray. Today, every project we ship runs on AWS with infrastructure defined in code, automated pipelines, and monitoring that alerts us before our clients notice anything is wrong.

This article is a practical overview of what we use, why we chose it, and what we'd do differently if we started over today.

Our AWS stack at a glance

Not every project needs the same infrastructure. Over time, we've developed a decision framework based on the type of application:

| Application Type | Compute | Database | Storage | CDN |
|---|---|---|---|---|
| SaaS web app | ECS Fargate | RDS PostgreSQL | S3 | CloudFront |
| Event platform | Lambda + API Gateway | DynamoDB | S3 | CloudFront |
| Static marketing site | - | - | S3 | CloudFront |
| Data pipeline | Lambda + Step Functions | RDS / Redshift | S3 | - |
| Internal tooling | ECS Fargate | RDS PostgreSQL | S3 | - |

The guiding principle is simple: use managed services wherever possible. We stopped managing our own databases, load balancers, and container orchestration years ago. The cost premium is worth the operational sanity.

ECS Fargate: our default for web applications

For most web applications, we run containers on ECS Fargate. No EC2 instances to patch, no cluster capacity to manage. You define your container, set the resource limits, and AWS handles the rest.

Here's a simplified task definition for a typical Next.js application:

{
  "family": "webapp-production",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "webapp",
      "image": "123456789.dkr.ecr.eu-west-1.amazonaws.com/webapp:latest",
      "portMappings": [
        {
          "containerPort": 3000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        { "name": "NODE_ENV", "value": "production" }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789:secret:webapp/db-url"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/webapp-production",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "webapp"
        }
      }
    }
  ]
}

Key decisions we've made with ECS:

  • Fargate over EC2 launch type. The price difference is minimal for our workloads, and we never have to think about instance patching or capacity.
  • Secrets Manager for environment variables. Never bake secrets into images or task definitions. Secrets Manager can rotate credentials automatically (once rotation is configured) and integrates directly with ECS.
  • eu-west-1 (Ireland) as our default region. Closest AWS region to Spain with full service availability.

Auto-scaling that actually works

We configure auto-scaling based on CPU and request count, not just memory:

resource "aws_appautoscaling_target" "webapp" {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.webapp.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "webapp-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.webapp.resource_id
  scalable_dimension = aws_appautoscaling_target.webapp.scalable_dimension
  service_namespace  = aws_appautoscaling_target.webapp.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 60.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

Notice the asymmetric cooldowns: we scale out fast (60 seconds) and in slow (300 seconds). This prevents flapping during variable traffic patterns.
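The snippet above only shows the CPU policy. For the request-count half mentioned earlier, a companion target-tracking policy can follow the same pattern, tracking ALB requests per target. This is a sketch: the resource names (`aws_lb.main`, `aws_lb_target_group.webapp`) and the 200-requests target are illustrative values, not our production numbers.

```hcl
resource "aws_appautoscaling_policy" "requests" {
  name               = "webapp-request-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.webapp.resource_id
  scalable_dimension = aws_appautoscaling_target.webapp.scalable_dimension
  service_namespace  = aws_appautoscaling_target.webapp.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # resource_label format: "<alb arn suffix>/<target group arn suffix>"
      resource_label = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.webapp.arn_suffix}"
    }
    target_value       = 200.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

Multiple target-tracking policies can coexist on the same scaling target; the service scales out when any of them demands it.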

Lambda: when functions make more sense

Lambda is not our default, but it's the right choice for specific workloads:

  • Webhook handlers that receive external events (Stripe, GitHub, etc.)
  • Scheduled tasks like nightly report generation or data cleanup
  • Image processing triggered by S3 uploads
  • Event-driven pipelines connected to EventBridge or SQS

One real example: for a client's e-commerce platform, we use Lambda to generate PDF invoices. An order completion event triggers a function that renders the PDF, stores it in S3, and sends a notification. The entire flow costs less than $2/month for thousands of invoices.

// Lambda handler for invoice generation
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import type { EventBridgeEvent } from "aws-lambda";
import { renderInvoicePDF } from "./pdf-renderer";

// Shape of the order payload carried in the event detail
interface Order {
  id: string;
  customer: string;
  items: { description: string; amount: number }[];
  total: number;
}

const s3 = new S3Client({ region: "eu-west-1" });

export const handler = async (event: EventBridgeEvent<"order.completed", Order>) => {
  const order = event.detail;

  const pdfBuffer = await renderInvoicePDF({
    orderId: order.id,
    customer: order.customer,
    items: order.items,
    total: order.total,
  });

  await s3.send(
    new PutObjectCommand({
      Bucket: process.env.INVOICES_BUCKET,
      Key: `invoices/${order.id}.pdf`,
      Body: pdfBuffer,
      ContentType: "application/pdf",
    })
  );

  return { statusCode: 200, body: `Invoice generated for order ${order.id}` };
};

Lambda pitfall we learned the hard way: cold starts matter for user-facing endpoints. If your Lambda sits behind an API Gateway and users wait for the response, a 2-3 second cold start is unacceptable. We use provisioned concurrency for critical paths or simply move those workloads to ECS.
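Provisioned concurrency is a one-resource change in Terraform. A minimal sketch, assuming a function named `aws_lambda_function.api` (the function, alias name, and instance count here are illustrative):

```hcl
# Publish an alias so provisioned concurrency can target a fixed version
resource "aws_lambda_alias" "live" {
  name             = "live"
  function_name    = aws_lambda_function.api.function_name
  function_version = aws_lambda_function.api.version
}

# Keep two execution environments warm for the latency-sensitive path
resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.api.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 2
}
```

The trade-off is cost: you pay for the warm environments whether they serve traffic or not, which is part of why we move the heaviest user-facing paths to ECS instead.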

Infrastructure as Code with Terraform

We standardized on Terraform for infrastructure management. Every environment — development, staging, production — is defined in code and applied through CI/CD.

Our typical project structure:

infrastructure/
  modules/
    networking/      # VPC, subnets, security groups
    ecs/             # Cluster, services, task definitions
    rds/             # Database instances
    cdn/             # CloudFront distributions
    monitoring/      # CloudWatch dashboards and alarms
  environments/
    dev/
      main.tf
      variables.tf
      terraform.tfvars
    staging/
      main.tf
      variables.tf
      terraform.tfvars
    production/
      main.tf
      variables.tf
      terraform.tfvars

Why Terraform over CDK? We've used both. CDK is excellent if your team is deeply invested in TypeScript and you want to express complex logic in your infrastructure definitions. Terraform is more portable, has a larger module ecosystem, and its plan/apply workflow gives us more confidence in production changes. For most projects, Terraform wins on simplicity.

The best infrastructure code is boring infrastructure code. No clever abstractions, no dynamic resource generation, no meta-programming. Just straightforward resource definitions that anyone on the team can read and understand.

State management

We store Terraform state in S3 with DynamoDB locking:

terraform {
  backend "s3" {
    bucket         = "enviait-terraform-state"
    key            = "webapp/production/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

This setup prevents concurrent modifications and keeps state encrypted at rest. We've been bitten by corrupted local state files before. Remote state is non-negotiable.

CI/CD pipelines

Every project follows the same deployment flow, implemented in GitHub Actions:

# .github/workflows/deploy.yml (simplified)
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run lint
      - run: npm run test
      - run: npm run build

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
          aws-region: eu-west-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push image
        run: |
          docker build -t $ECR_REPO:$GITHUB_SHA .
          docker push $ECR_REPO:$GITHUB_SHA

      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster production \
            --service webapp \
            --force-new-deployment

Important detail: we use OIDC federation for AWS authentication in GitHub Actions, not long-lived access keys. The role-to-assume parameter uses a temporary session, which is both more secure and easier to audit.
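The OIDC setup itself is a small amount of Terraform. A hedged sketch of the trust side: the repository path, role name, and thumbprint are placeholders, not values from a real account.

```hcl
# GitHub's OIDC identity provider, registered once per account
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # placeholder
}

resource "aws_iam_role" "deploy" {
  name = "github-actions-deploy"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          # Only pushes to main in this repo may assume the role
          "token.actions.githubusercontent.com:sub" = "repo:example-org/webapp:ref:refs/heads/main"
        }
      }
    }]
  })
}
```

The `sub` condition is the important line: without it, any GitHub repository presenting a valid OIDC token could assume the role.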

Deployment strategy

We use rolling deployments with health checks. ECS starts new containers, waits for them to pass health checks, and then drains the old ones. Zero downtime, automatic rollback if the new version fails health checks.
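One way to get the automatic-rollback behavior in Terraform is ECS's built-in deployment circuit breaker. A sketch (the service's other attributes are elided):

```hcl
resource "aws_ecs_service" "webapp" {
  # cluster, task_definition, desired_count, etc. elided for brevity

  # Stop a failing deployment and roll back to the last healthy version
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}
```

With `rollback = true`, a deployment whose tasks repeatedly fail health checks is abandoned and the previous task definition is restored without human intervention.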

For databases, we run migrations as a separate ECS task before the deployment:

  1. CI builds and pushes the new image
  2. A migration task runs against the database
  3. If migration succeeds, the service is updated with the new image
  4. If migration fails, the pipeline stops and alerts the team
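Steps 2-4 above can be wired into the workflow as a step between the image push and the service update. This is a sketch: the cluster name, the `webapp-migrate` task definition, and the subnet/security-group variables are placeholders.

```yaml
- name: Run database migrations
  run: |
    TASK_ARN=$(aws ecs run-task \
      --cluster production \
      --task-definition webapp-migrate \
      --launch-type FARGATE \
      --network-configuration "awsvpcConfiguration={subnets=[$SUBNET_ID],securityGroups=[$SG_ID]}" \
      --query 'tasks[0].taskArn' --output text)
    aws ecs wait tasks-stopped --cluster production --tasks "$TASK_ARN"
    EXIT_CODE=$(aws ecs describe-tasks --cluster production --tasks "$TASK_ARN" \
      --query 'tasks[0].containers[0].exitCode' --output text)
    if [ "$EXIT_CODE" != "0" ]; then
      echo "Migration failed with exit code $EXIT_CODE"
      exit 1
    fi
```

Because the step exits non-zero on failure, the `deploy` job stops before `update-service` runs, which is exactly the behavior described in step 4.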

Monitoring and alerting with CloudWatch

We set up monitoring before the first deployment, not after the first incident. Every project gets a baseline dashboard with these metrics:

  • ECS: CPU utilization, memory utilization, running task count
  • RDS: CPU, free storage, connection count, read/write latency
  • ALB: Request count, target response time (P50, P95, P99), 5xx error rate
  • Lambda: Invocation count, error count, duration (P95), throttles

Alarms that matter

We've learned to be selective with alarms. Too many alerts lead to alert fatigue and ignored notifications. Our standard alarm set:

resource "aws_cloudwatch_metric_alarm" "high_5xx_rate" {
  alarm_name          = "webapp-high-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "More than 10 5xx errors per 5-minute period, 2 periods in a row"

  alarm_actions = [aws_sns_topic.alerts.arn]

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
    TargetGroup  = aws_lb_target_group.webapp.arn_suffix
  }
}

We send critical alarms to Slack and PagerDuty, and informational ones only to Slack. The distinction is important: a 5xx spike at 3 AM wakes someone up; high CPU utilization at 3 AM is a Slack message reviewed in the morning.
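In Terraform, that split is simply two SNS topics wired to different endpoints; critical alarms reference one topic in `alarm_actions`, informational alarms the other. A sketch, assuming the PagerDuty events URL comes in as a variable:

```hcl
resource "aws_sns_topic" "alerts_critical" {
  name = "alerts-critical" # pages someone: PagerDuty + Slack
}

resource "aws_sns_topic" "alerts_info" {
  name = "alerts-info" # Slack only, reviewed in the morning
}

# PagerDuty ingests critical alerts via an HTTPS subscription
resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn = aws_sns_topic.alerts_critical.arn
  protocol  = "https"
  endpoint  = var.pagerduty_events_url # placeholder variable
}
```

The alarm shown earlier would point its `alarm_actions` at `aws_sns_topic.alerts_critical.arn` or `aws_sns_topic.alerts_info.arn` depending on severity.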

Cost optimization: lessons from real projects

AWS bills can spiral quickly. Here are the strategies that have saved our clients the most money:

1. Right-size everything

The most common waste we see: oversized RDS instances. A db.r6g.xlarge running at 5% CPU because someone provisioned "just in case." We start small and scale up based on actual metrics.

2. Use Savings Plans for predictable workloads

For clients with stable production workloads, Compute Savings Plans typically save 30-40% compared to on-demand pricing. We review and adjust these quarterly.

3. S3 lifecycle policies

Logs, backups, and old assets don't need to live in S3 Standard forever:

resource "aws_s3_bucket_lifecycle_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id

  rule {
    id     = "archive-old-assets"
    status = "Enabled"

    # Empty filter: apply the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
}

4. CloudFront for everything static

Every static asset goes through CloudFront. The bandwidth savings alone justify it, but the real benefit is latency. Users in Latin America accessing a site hosted in Ireland get sub-100ms response times because CloudFront edge locations serve the cached content.

5. Review and kill unused resources

We run a monthly audit of all AWS accounts we manage. Unattached EBS volumes, unused Elastic IPs, idle load balancers, forgotten Lambda functions — they add up. A simple script that checks for resources with zero traffic over 30 days has saved clients hundreds of euros per month.
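The unattached-EBS-volume part of that audit boils down to a simple filter over the `aws ec2 describe-volumes` output (a volume in state "available" is attached to nothing and still billed). This sketch keeps the logic pure and feeds it mock data; the `EbsVolume` shape is a hypothetical subset of the real API response.

```typescript
// Minimal shape of a volume as returned by describe-volumes (illustrative)
interface EbsVolume {
  VolumeId: string;
  State: string; // "available" | "in-use" | ...
  Size: number;  // GiB
}

// Volumes in state "available" have no attachment: candidates for deletion
function unattachedVolumes(volumes: EbsVolume[]): EbsVolume[] {
  return volumes.filter((v) => v.State === "available");
}

// Example against a mocked describe-volumes response
const report = unattachedVolumes([
  { VolumeId: "vol-0aaa", State: "in-use", Size: 100 },
  { VolumeId: "vol-0bbb", State: "available", Size: 500 },
]);
console.log(report.map((v) => `${v.VolumeId} (${v.Size} GiB)`));
// only vol-0bbb is flagged
```

In practice the input comes from the AWS SDK or CLI in each region; keeping the filtering logic separate from the API call makes the audit script trivially testable.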

What we'd do differently

If we were starting fresh today:

  1. Adopt AWS CDK for greenfield projects. While we standardized on Terraform, CDK's type safety and ability to express patterns as constructs is compelling for teams already deep in TypeScript.

  2. Use SST (Serverless Stack) for serverless projects. SST provides a significantly better development experience for Lambda-based architectures, with local debugging that actually works.

  3. Invest in platform engineering earlier. We should have built our internal deployment platform — reusable Terraform modules, standardized pipeline templates, shared monitoring dashboards — before we had ten projects to maintain, not after.

  4. Multi-account strategy from day one. Separating production, staging, and development into different AWS accounts provides better security isolation and cleaner cost tracking.

The bottom line

Cloud infrastructure isn't about using the newest services or the most complex architecture. It's about building reliable, observable, and cost-effective systems that your team can maintain without heroics.

Every decision should be driven by the question: "Will this be easy to debug at 2 AM when something goes wrong?" If the answer is no, simplify.


Ready to move your infrastructure to the cloud — or optimize what you already have? Let's talk about building an AWS setup that fits your needs and budget.