This guide provides a comprehensive, step-by-step walkthrough for self-hosting Flexprice on AWS in a production-ready setup. It covers VPC networking, ECS compute (EC2 with ARM64), Aurora PostgreSQL, Amazon MSK (Kafka), EKS with ClickHouse, ElastiCache Redis, DynamoDB, IAM, secrets management, and observability.

Prerequisites

Before you begin, ensure you have the following:
An AWS account with administrator or equivalent permissions to create VPCs, ECS, RDS, MSK, EKS, S3, IAM roles, and CloudWatch resources
AWS CLI v2 installed and configured with credentials (aws configure)
Docker installed (for building and pushing images to ECR)
kubectl installed (for EKS/ClickHouse management)
eksctl installed (optional but recommended for EKS cluster creation)
Helm installed (for ClickHouse deployment)

Region selection

Choose an AWS region that:
  • Has all required services (see Cost estimation for the list)
  • Is geographically close to your users for lower latency
  • Meets your compliance requirements (e.g., GDPR for EU data)
This guide uses us-east-1 as the example region. Replace with your preferred region.

Cost estimation

We provide two configurations: a development setup for testing and a production setup for high-throughput workloads (100M+ events/month).
| Component | Configuration | Monthly Cost |
| --- | --- | --- |
| EC2 for ECS | 10x m6g.xlarge (ARM64/Graviton) | ~$1,030 |
| Aurora PostgreSQL | 2x db.r8g.xlarge (Writer + Reader) | ~$650 |
| Amazon MSK | 2 brokers, kafka.m5.large (4 vCPU, 8 GB), 1 TB storage per broker | ~$350 |
| EKS + ClickHouse | Control plane + m5.8xlarge nodes | ~$1,900 |
| ElastiCache Redis | Multi-node cluster (cache.r6g.large, cluster mode) | ~$650 |
| DynamoDB | On-demand, ~100M events | ~$50 |
| Storage (EBS) | 3,000 GB across components (gp3) | ~$290 |
| ALB + NAT Gateway | 2x NAT for HA | ~$130 |
| S3, CloudWatch, Secrets | Storage + logs | ~$50 |
| AWS Subtotal | | ~$5,100 |
| Third-party services | Temporal Cloud, Supabase, Svix, Grafana | ~$400 |
| Total | | ~$5,500/month |
Costs vary by region and usage. Use the AWS Pricing Calculator for accurate estimates. ARM64/Graviton instances provide ~20% cost savings over x86.

Sizing for 100M events/month

| Component | Development | Production (100M events/month) |
| --- | --- | --- |
| ECS API | 1 task, 0.5 vCPU, 1 GB | 6 tasks, 0.75 vCPU, 1.5 GB each |
| ECS Consumer | 1 task, 0.5 vCPU, 1 GB | 30 tasks, 1 vCPU, 1.75 GB each |
| ECS Temporal Worker | 1 task, 1 vCPU, 2 GB | 3 tasks, 2 vCPU, 4 GB each |
| Database | RDS db.t3.small | Aurora 2x db.r8g.xlarge |
| Kafka | 2x kafka.t3.small, 100 GB | 2 brokers, kafka.m5.large, 1 TB per broker |
| ClickHouse | 2x m5.large (8 GB) | m5.8xlarge node(s) |
| Redis | cache.t3.micro | cache.r6g.large, multi-node cluster mode |
Traffic and storage estimates:
  • 100M events/month = ~38.5 events/second average
  • Peak traffic: 150-200 events/second (4-5x burst)
  • ClickHouse storage: ~50 GB/month growth
  • DynamoDB: ~20 GB/month growth
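The averages above can be checked with quick back-of-envelope arithmetic (assuming a ~30-day month):

```shell
# Back-of-envelope check of the traffic estimates above
awk 'BEGIN {
  events = 100000000          # 100M events/month
  secs   = 30 * 24 * 3600     # ~2,592,000 seconds in a 30-day month
  avg    = events / secs
  printf "average: %.1f events/s\n", avg              # ~38.6
  printf "peak (4-5x burst): %.0f-%.0f events/s\n", avg * 4, avg * 5
}'
```

This matches the ~38.5 events/second average and 150-200 events/second peak used for sizing.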

Architecture overview

Flexprice on AWS runs with the following production architecture. Data flow:
  • Clients → Cloudflare (DNS, WAF, rate limiting) → ALB → ECS (API, Consumer, Temporal Worker)
  • API writes to Aurora PostgreSQL, publishes events to MSK (Kafka) and DynamoDB
  • Consumer reads from Kafka and writes to ClickHouse (on EKS) for analytics
  • Temporal Worker connects to Temporal Cloud for workflow orchestration
  • ElastiCache Redis provides caching in cluster mode
  • S3 stores invoice PDFs; CloudWatch and Grafana Cloud collect logs and metrics
This guide uses Temporal Cloud (recommended for production). You can also self-host Temporal, but it requires additional infrastructure. Cloudflare is optional but recommended for DNS and WAF.

Component summary

| Component | AWS Service | Purpose |
| --- | --- | --- |
| Compute | ECS on EC2 (ARM64) | API, Consumer, Temporal Worker services |
| Primary Database | Aurora PostgreSQL | Transactional data, subscriptions, customers |
| Analytics Database | ClickHouse on EKS | Event analytics, usage aggregation |
| Message Queue | Amazon MSK | Event streaming between services |
| Cache | ElastiCache Redis | Session cache, rate limiting |
| Event Store | DynamoDB | Durable event storage |
| Object Storage | S3 | Invoice PDFs, exports |
| Workflow Engine | Temporal Cloud | Billing workflows, scheduled jobs |
| Authentication | Supabase | User authentication (optional) |
| Webhooks | Svix | Webhook delivery (optional) |

Step 1: VPC and networking

Create a VPC with public and private subnets across two Availability Zones for high availability. Unless otherwise specified, create each resource in this guide via AWS Console, CLI, or IaC using the configuration described in the tables.

VPC configuration

| Setting | Value | Purpose |
| --- | --- | --- |
| VPC CIDR | 10.0.0.0/16 | 65,536 IP addresses |
| Availability Zones | 2 (e.g., us-east-1a, us-east-1b) | High availability |
| Public subnets | 2 (10.0.1.0/24, 10.0.2.0/24) | ALB, NAT Gateway |
| Private subnets (compute) | 2 (10.0.10.0/24, 10.0.20.0/24) | ECS tasks |
| Private subnets (data) | 2 (10.0.100.0/24, 10.0.200.0/24) | RDS, MSK, EKS |
| NAT Gateway | 1 (or 2 for HA) | Private subnet internet access |
| Internet Gateway | 1 | Public subnet internet access |

Create VPC with AWS CLI

Create the VPC, enable DNS hostnames, and attach an Internet Gateway.
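A minimal CLI sketch of this step (resource names and tags are illustrative):

```shell
# Create the VPC and capture its ID
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=flexprice-vpc}]' \
  --query 'Vpc.VpcId' --output text)

# Enable DNS hostnames (needed for RDS/MSK endpoint resolution)
aws ec2 modify-vpc-attribute --vpc-id "$VPC_ID" --enable-dns-hostnames

# Create and attach an Internet Gateway for the public subnets
IGW_ID=$(aws ec2 create-internet-gateway \
  --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --vpc-id "$VPC_ID" --internet-gateway-id "$IGW_ID"
```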

Create subnets

Create public and private subnets in two Availability Zones using the CIDRs in the VPC configuration table.

Create NAT Gateway

Create an Elastic IP and NAT Gateway in a public subnet.

Create route tables

Create public and private route tables and associate subnets (public: default route to Internet Gateway; private: default route to NAT Gateway).

Create security groups

Create security groups for ALB, ECS, RDS, MSK, and EKS. Use the rules in the summary table below.
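As a sketch, here is how two of those groups and their rules could be created with the CLI (group names match the summary table; the VPC ID is a placeholder):

```shell
VPC_ID=vpc-xxxxxxxx   # your VPC ID from Step 1

# ALB security group: HTTPS and HTTP from the internet
ALB_SG=$(aws ec2 create-security-group --vpc-id "$VPC_ID" \
  --group-name flexprice-alb-sg --description "Flexprice ALB" \
  --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id "$ALB_SG" \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$ALB_SG" \
  --protocol tcp --port 80 --cidr 0.0.0.0/0

# ECS security group: port 8080 only from the ALB group
ECS_SG=$(aws ec2 create-security-group --vpc-id "$VPC_ID" \
  --group-name flexprice-ecs-sg --description "Flexprice ECS tasks" \
  --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id "$ECS_SG" \
  --protocol tcp --port 8080 --source-group "$ALB_SG"
```

The RDS, MSK, and EKS groups follow the same `--source-group` pattern against the ECS group.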

Security group rules summary

| Security Group | Inbound | Source | Port(s) | Purpose |
| --- | --- | --- | --- | --- |
| flexprice-alb-sg | HTTPS | 0.0.0.0/0 | 443 | Public API access |
| flexprice-alb-sg | HTTP | 0.0.0.0/0 | 80 | Redirect to HTTPS |
| flexprice-ecs-sg | TCP | alb-sg | 8080 | ALB to API |
| flexprice-ecs-sg | TCP | ecs-sg | All | Inter-task communication |
| flexprice-rds-sg | TCP | ecs-sg | 5432 | PostgreSQL access |
| flexprice-msk-sg | TCP | ecs-sg | 9092, 9094, 9096 | Kafka access |
| flexprice-eks-sg | TCP | ecs-sg | 9000, 8123 | ClickHouse access |
For production, consider restricting the ALB security group to only Cloudflare IP ranges if you’re using Cloudflare for DNS and WAF.

Step 2: IAM roles and policies

Create IAM roles for ECS task execution and task runtime permissions.

ECS Task Execution Role

This role allows ECS to pull container images and write logs. Create the role and attach the managed policy AmazonECSTaskExecutionRolePolicy plus an inline policy for Secrets Manager access.
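A sketch of the role creation (the role name is illustrative; the managed policy ARN is the standard one):

```shell
# Trust policy letting ECS tasks assume the role
cat > ecs-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name flexpriceEcsTaskExecutionRole \
  --assume-role-policy-document file://ecs-trust-policy.json

# Managed policy for pulling images from ECR and writing to CloudWatch Logs
aws iam attach-role-policy --role-name flexpriceEcsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```

Add an inline policy granting `secretsmanager:GetSecretValue` on your `flexprice/*` secrets so task definitions can inject secrets at startup.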

ECS Task Role

This role grants permissions for the Flexprice application at runtime (S3, CloudWatch Logs, Secrets Manager). Create the task role and attach the inline policy.

Step 3: Secrets Manager

Store sensitive configuration in AWS Secrets Manager.

Create secrets

Create secrets for PostgreSQL, ClickHouse, Kafka (SASL), auth, and Temporal Cloud. Store postgres (host, username, password, database), clickhouse (username, password), kafka (username, password), auth (64-char hex secret), and temporal (API key, key name, namespace) as needed.
Replace placeholder values with strong, unique credentials. Use a password generator for production secrets.

Step 4: Aurora PostgreSQL

Create an Aurora PostgreSQL cluster for Flexprice’s primary database. Aurora provides higher availability and performance compared to standard RDS.
Create a DB subnet group, Aurora cluster (with Secrets Manager managed credentials), writer instance (db.r8g.xlarge), and reader instance in the other AZ. Retrieve the cluster writer and reader endpoints for application configuration.
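The same sequence as a CLI sketch (identifiers, subnet IDs, and the security group are placeholders):

```shell
# DB subnet group in the data subnets
aws rds create-db-subnet-group \
  --db-subnet-group-name flexprice-db-subnets \
  --db-subnet-group-description "Flexprice data subnets" \
  --subnet-ids subnet-aaaa subnet-bbbb

# Aurora cluster with Secrets Manager managed master credentials
aws rds create-db-cluster \
  --db-cluster-identifier flexprice-prod \
  --engine aurora-postgresql --engine-version 17.4 \
  --master-username flexprice --manage-master-user-password \
  --db-subnet-group-name flexprice-db-subnets \
  --vpc-security-group-ids sg-xxxxxxxx \
  --storage-encrypted --backup-retention-period 7

# Writer instance, then a reader in the other AZ
aws rds create-db-instance --db-instance-identifier flexprice-prod-writer \
  --db-cluster-identifier flexprice-prod \
  --db-instance-class db.r8g.xlarge --engine aurora-postgresql
aws rds create-db-instance --db-instance-identifier flexprice-prod-reader \
  --db-cluster-identifier flexprice-prod \
  --db-instance-class db.r8g.xlarge --engine aurora-postgresql \
  --availability-zone us-east-1b
```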

Aurora configuration summary

| Setting | Development | Production |
| --- | --- | --- |
| Engine | PostgreSQL 15.4 | Aurora PostgreSQL 17.4 |
| Instance class | db.t3.small | db.r8g.xlarge (4 vCPU, 32 GB) |
| Instances | 1 (Single-AZ) | 2 (Writer + Reader, Multi-AZ) |
| Storage | 100 GB gp3 | Aurora I/O-Optimized (auto-scaling) |
| Multi-AZ | No | Yes (2 zones) |
| Encryption | Enabled | Enabled |
| Backup retention | 7 days | 7 days |
| Monthly cost | ~$30 | ~$650 |

Update Secrets Manager with Aurora endpoints

Update the postgres secret in Secrets Manager with the Aurora writer and reader endpoints and the managed master password ARN.
Aurora with Secrets Manager managed credentials automatically rotates the master password. Use the MasterUserSecret ARN to retrieve the current password.

Run database migrations

You can run migrations using a one-off ECS task or from a bastion host. Create a migration task definition and run it via ECS (or run flexprice migrate up from a host with DB access) using the configuration described above.
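The one-off run could look like this (cluster, task definition, subnet, and security group values are illustrative; this assumes the migration task is registered for Fargate):

```shell
# Run the migration task once in the private compute subnets
aws ecs run-task \
  --cluster flexprice-prod \
  --task-definition flexprice-migration \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-aaaa,subnet-bbbb],securityGroups=[sg-xxxxxxxx],assignPublicIp=DISABLED}'
```

Watch the task's CloudWatch log group to confirm the migrations completed before starting the services.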

Step 5: Amazon MSK (Kafka)

Create an Amazon MSK cluster for event streaming.

Create MSK configuration

Create an MSK configuration (server properties) and register it.
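A sketch of registering the configuration (the property values are illustrative starting points, not tuned settings):

```shell
# Kafka server properties for the MSK configuration
cat > msk-server.properties <<'EOF'
auto.create.topics.enable=false
default.replication.factor=2
min.insync.replicas=1
num.partitions=6
log.retention.hours=168
EOF

aws kafka create-configuration \
  --name flexprice-msk-config \
  --kafka-versions 3.8.1 \
  --server-properties fileb://msk-server.properties
```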

Create MSK cluster

Create the MSK cluster with 2 brokers (1 per AZ), kafka.m5.large (4 vCPU, 8 GB) instance type, and 1024 GB (1 TB) storage per broker. Enable SASL/SCRAM, TLS, encryption at rest, and enhanced monitoring.

Create SASL/SCRAM secret for MSK

Create a secret in Secrets Manager with the prefix AmazonMSK_ and associate it with the MSK cluster.

Get MSK bootstrap brokers

Retrieve the SASL/SCRAM bootstrap broker string from the MSK cluster (AWS Console or CLI) for application configuration.

Create Kafka topics

Use a bastion host or an EC2 instance with Kafka CLI tools to create the events and events-dlq topics (e.g. 6 partitions, replication factor 2). Use SASL_SSL and SCRAM-SHA-512 in client configuration.
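From that host, topic creation might look like this (the username/password are placeholders for the SASL credentials stored in Secrets Manager):

```shell
# SASL/SCRAM client config for the Kafka CLI tools
cat > client.properties <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="flexprice" password="YOUR_KAFKA_PASSWORD";
EOF

# Create the topics (6 partitions, replication factor 2)
for topic in events events-dlq; do
  kafka-topics.sh --create \
    --bootstrap-server "$MSK_BOOTSTRAP" \
    --command-config client.properties \
    --topic "$topic" --partitions 6 --replication-factor 2
done
```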

MSK configuration summary

| Setting | Development | Production |
| --- | --- | --- |
| Kafka version | 3.5.1 | 3.8.1 |
| Broker type | kafka.t3.small | kafka.m5.large (4 vCPU, 8 GB) |
| Number of brokers | 2 | 2 (1 per AZ) |
| Storage per broker | 100 GB | 1024 GB (1 TB) |
| Authentication | SASL/SCRAM | SASL/SCRAM + IAM |
| Encryption | TLS in transit | TLS in transit + at rest |
| Monitoring | Basic | Enhanced partition-level + Prometheus |
| Monthly cost | ~$90 | ~$350 |
For development, use kafka.t3.small with 100 GB storage. For production (100M+ events/month), use 2 brokers, kafka.m5.large, and 1 TB storage per broker.

Step 6: EKS with ClickHouse

Create an EKS cluster and deploy ClickHouse for analytics storage. For production (100M+ events/month), use m5.8xlarge nodes for the ClickHouse node group.

Create EKS cluster with eksctl

Create an EKS cluster with a managed node group (m5.8xlarge for production) via eksctl or IaC. Use private subnets and attach the EKS security group.
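A minimal eksctl config sketch (cluster name, region, subnet IDs, and node count are illustrative):

```shell
cat > flexprice-eks.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: flexprice-prod
  region: us-east-1
vpc:
  subnets:
    private:
      us-east-1a: { id: subnet-aaaa }
      us-east-1b: { id: subnet-bbbb }
managedNodeGroups:
  - name: clickhouse
    instanceType: m5.8xlarge
    desiredCapacity: 1
    privateNetworking: true
EOF

eksctl create cluster -f flexprice-eks.yaml
```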

Create gp3 StorageClass

Create a gp3 StorageClass (EBS CSI driver, encrypted, Retain, WaitForFirstConsumer) via kubectl or IaC.
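The StorageClass from this step, applied directly with kubectl:

```shell
# gp3 StorageClass backed by the EBS CSI driver
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF
```

`Retain` keeps ClickHouse data volumes if a PVC is deleted; `WaitForFirstConsumer` ensures volumes are provisioned in the AZ where the pod is scheduled.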

Create ClickHouse namespace and secrets

Create the clickhouse namespace and a Kubernetes secret with credentials from Secrets Manager via kubectl or IaC.

Deploy ClickHouse with Helm

Add the Altinity ClickHouse Helm repo and install the ClickHouse Operator in the clickhouse namespace via Helm.

Create ClickHouse cluster

Deploy a ClickHouseInstallation (Altinity operator) with the credentials secret, gp3 storage, and appropriate resources via kubectl or Helm.

Create ClickHouse service for ECS access

Create a ClusterIP Service for ClickHouse (ports 9000, 8123) targeting the ClickHouse installation via kubectl or IaC.

Get ClickHouse endpoint

For ECS tasks to access ClickHouse, you have several options:
  1. Internal NLB (recommended): Create an internal Network Load Balancer pointing to the ClickHouse service
  2. VPC peering/Transit Gateway: If ECS and EKS are in separate VPCs
  3. AWS PrivateLink: For cross-account access
Create the internal NLB (type LoadBalancer with internal annotation) and use its DNS name as the ClickHouse endpoint (port 9000) for ECS configuration.
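A sketch of that Service manifest (the selector assumes the Altinity operator's `clickhouse.altinity.com/chi` label on a `flexprice` installation):

```shell
# Internal NLB fronting ClickHouse for ECS access
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: clickhouse-nlb
  namespace: clickhouse
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    clickhouse.altinity.com/chi: flexprice
  ports:
    - name: native
      port: 9000
      targetPort: 9000
    - name: http
      port: 8123
      targetPort: 8123
EOF
```

Once provisioned, `kubectl get svc clickhouse-nlb -n clickhouse` shows the NLB DNS name to use as `FLEXPRICE_CLICKHOUSE_ADDRESS`.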

Initialize ClickHouse database

Connect to ClickHouse (e.g. via port-forward or the NLB) and create the flexprice database using clickhouse-client.

Step 7: ElastiCache Redis

Create an ElastiCache Redis cluster for caching and session management.

Create Redis subnet group

Create a cache subnet group in the data subnets.

Create Redis security group

Create a security group for Redis allowing TCP 6379 from the ECS security group.

Create Redis replication group (cluster mode)

Create a Redis replication group with cache.r6g.large, cluster mode, multi-node (e.g. multiple node groups for ~$600/month), TLS and at-rest encryption, and multi-AZ.
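A CLI sketch of this configuration (the shard count of 3 and the identifiers are illustrative):

```shell
aws elasticache create-replication-group \
  --replication-group-id flexprice-prod \
  --replication-group-description "Flexprice Redis" \
  --engine redis \
  --cache-node-type cache.r6g.large \
  --num-node-groups 3 \
  --replicas-per-node-group 1 \
  --cache-subnet-group-name flexprice-redis-subnets \
  --security-group-ids sg-xxxxxxxx \
  --transit-encryption-enabled \
  --at-rest-encryption-enabled \
  --automatic-failover-enabled \
  --multi-az-enabled
```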

Redis configuration summary

| Setting | Development | Production (multi-node cluster) |
| --- | --- | --- |
| Node type | cache.t3.micro | cache.r6g.large (2 vCPU, 13 GB) |
| Cluster mode | Disabled | Enabled |
| Replicas | 0 | 1 per shard |
| Multi-AZ | No | Yes |
| Encryption | Optional | TLS in transit + at rest |
| Monthly cost | ~$15 | ~$600 |

Step 8: DynamoDB

Create a DynamoDB table for durable event storage alongside ClickHouse.

Create events table

Create a DynamoDB table named events with partition key pk (String) and sort key sk (String), on-demand billing.
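The table as a single CLI call:

```shell
aws dynamodb create-table \
  --table-name events \
  --attribute-definitions \
    AttributeName=pk,AttributeType=S \
    AttributeName=sk,AttributeType=S \
  --key-schema \
    AttributeName=pk,KeyType=HASH \
    AttributeName=sk,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST
```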

Enable Point-in-Time Recovery

Enable point-in-time recovery (continuous backups) on the events table.

DynamoDB configuration summary

| Setting | Value | Notes |
| --- | --- | --- |
| Billing mode | On-demand | Pay per request, auto-scales |
| Partition key | pk (String) | Tenant/customer ID |
| Sort key | sk (String) | Event timestamp |
| PITR | Enabled | Point-in-time recovery |
| Encryption | AWS managed | Default encryption |
| Monthly cost | ~$50 | For ~100M events/month |
DynamoDB is used alongside ClickHouse for durable event storage. Events are written to both DynamoDB (for durability) and ClickHouse (for analytics).

Step 9: S3 and CloudWatch

Create S3 bucket for invoices

Create an S3 bucket for invoice PDFs with versioning, AES256 encryption, block public access, and optional lifecycle rules (e.g. transition to STANDARD_IA after 90 days).

Create CloudWatch log groups

Create log groups for ECS services (api, worker, temporal-worker, migration) with a retention policy (e.g. 30 days).

Create CloudWatch alarms

Create alarms for ECS API CPU, RDS CPU, and RDS connections (e.g. threshold 80%, 2 evaluation periods) and associate with an SNS topic for alerts.
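A sketch of one such alarm (cluster, service, and topic names are illustrative):

```shell
# Alarm on sustained API service CPU, notifying an SNS topic
aws cloudwatch put-metric-alarm \
  --alarm-name flexprice-api-cpu-high \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=flexprice-prod Name=ServiceName,Value=flexprice-api-prod \
  --statistic Average --period 300 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:flexprice-alerts
```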

Step 10: ECR and container images

Create ECR repositories

Create ECR repositories for api, worker, and temporal-worker with scan-on-push and AES256 encryption.

Build and push images

Build Flexprice container images (api, worker, temporal-worker), authenticate to ECR, tag and push to your ECR repositories.
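A sketch of the build-and-push loop (the `Dockerfile.<service>` naming is an assumption; adjust to how your images are actually built):

```shell
# Authenticate Docker to ECR
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGISTRY="${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com"
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "$REGISTRY"

# Build, tag, and push each service image (ARM64 for Graviton hosts)
for svc in api worker temporal-worker; do
  docker build --platform linux/arm64 -t "flexprice-${svc}" -f "Dockerfile.${svc}" .
  docker tag "flexprice-${svc}:latest" "${REGISTRY}/${svc}:latest"
  docker push "${REGISTRY}/${svc}:latest"
done
```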

Step 11: ECS cluster and services

Create ECS cluster

For production (100M+ events/month), create an ECS cluster with EC2 capacity: launch template with m6g.xlarge (ARM64/Graviton), Auto Scaling Group with 10 nodes (min/max as needed), capacity provider with managed scaling, and associate with the cluster.

Create API task definition

Register an ECS task definition for the API service: production uses EC2/ARM64 (768 CPU, 1536 memory) with bridge network; development uses Fargate (1024 CPU, 2048 memory). Include environment variables and secrets from Secrets Manager (auth, postgres, clickhouse, kafka, temporal). Set FLEXPRICE_DEPLOYMENT_MODE=api, health check on :8080/health, and CloudWatch log group. See Step 12 for the full environment variable reference.
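An abbreviated sketch of that task definition (account IDs, ARNs, and the image URI are placeholders; only a representative subset of the Step 12 variables is shown):

```shell
cat > api-task-def.json <<'EOF'
{
  "family": "flexprice-api-prod",
  "requiresCompatibilities": ["EC2"],
  "networkMode": "bridge",
  "cpu": "768",
  "memory": "1536",
  "executionRoleArn": "arn:aws:iam::123456789012:role/flexpriceEcsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/flexpriceEcsTaskRole",
  "containerDefinitions": [{
    "name": "api",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest",
    "portMappings": [{ "containerPort": 8080 }],
    "environment": [
      { "name": "FLEXPRICE_DEPLOYMENT_MODE", "value": "api" },
      { "name": "FLEXPRICE_SERVER_ADDRESS", "value": ":8080" }
    ],
    "secrets": [{
      "name": "FLEXPRICE_POSTGRES_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:flexprice/prod/postgres:password::"
    }],
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
      "interval": 30,
      "timeout": 5,
      "retries": 3
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/flexprice-api-prod",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "api"
      }
    }
  }]
}
EOF
# Register with: aws ecs register-task-definition --cli-input-json file://api-task-def.json
```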

Create Worker task definition

Register an ECS task definition for the Consumer (worker) service: FLEXPRICE_DEPLOYMENT_MODE=consumer, postgres/clickhouse/kafka secrets from Secrets Manager. For production use 30 tasks (100M events/month). See Step 12 for environment variables.

Create Temporal Worker task definition

Register an ECS task definition for the Temporal Worker: FLEXPRICE_DEPLOYMENT_MODE=temporal_worker, postgres/clickhouse/kafka/temporal secrets. See Step 12 for environment variables.

Create Application Load Balancer

Create an internet-facing Application Load Balancer in the public subnets, a target group (HTTP 8080, health check /health), an HTTPS listener with an ACM certificate, and an HTTP listener that redirects to HTTPS.

Create ECS services

Create ECS services for API (desired count 6 for production), Worker/Consumer (desired count 30 for production), and Temporal Worker (e.g. 3 tasks). Attach the API service to the ALB target group. Use private subnets and the ECS security group.

Configure Auto Scaling

Register scalable targets and target-tracking scaling policies for the API (and optionally Worker) services (e.g. min/max desired count, CPU target 70%).
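A sketch for the API service (cluster/service names and capacity bounds are illustrative):

```shell
# Target-tracking on average CPU for the API service
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/flexprice-prod/flexprice-api-prod \
  --min-capacity 6 --max-capacity 12

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/flexprice-prod/flexprice-api-prod \
  --policy-name flexprice-api-cpu-tt \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }'
```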

Step 12: Environment variables reference

Below is a complete reference of environment variables for each service. Variables marked with (secret) should be stored in AWS Secrets Manager.

API service

| Variable | Value | Source |
| --- | --- | --- |
| FLEXPRICE_DEPLOYMENT_MODE | api | Environment |
| FLEXPRICE_SERVER_ADDRESS | :8080 | Environment |
| FLEXPRICE_AUTH_SECRET | 64-char hex | Secret |
| FLEXPRICE_POSTGRES_HOST | RDS endpoint | Secret |
| FLEXPRICE_POSTGRES_PORT | 5432 | Environment |
| FLEXPRICE_POSTGRES_USER | flexprice | Secret |
| FLEXPRICE_POSTGRES_PASSWORD | DB password | Secret |
| FLEXPRICE_POSTGRES_DBNAME | flexprice | Environment |
| FLEXPRICE_POSTGRES_SSLMODE | require | Environment |
| FLEXPRICE_CLICKHOUSE_ADDRESS | ClickHouse NLB endpoint | Environment |
| FLEXPRICE_CLICKHOUSE_USERNAME | flexprice | Secret |
| FLEXPRICE_CLICKHOUSE_PASSWORD | ClickHouse password | Secret |
| FLEXPRICE_CLICKHOUSE_DATABASE | flexprice | Environment |
| FLEXPRICE_CLICKHOUSE_TLS | false | Environment |
| FLEXPRICE_KAFKA_BROKERS | MSK bootstrap brokers | Environment |
| FLEXPRICE_KAFKA_USE_SASL | true | Environment |
| FLEXPRICE_KAFKA_SASL_MECHANISM | SCRAM-SHA-512 | Environment |
| FLEXPRICE_KAFKA_SASL_USER | flexprice | Secret |
| FLEXPRICE_KAFKA_SASL_PASSWORD | Kafka password | Secret |
| FLEXPRICE_KAFKA_TOPIC | events | Environment |
| FLEXPRICE_KAFKA_CONSUMER_GROUP | flexprice-consumer-prod | Environment |
| FLEXPRICE_TEMPORAL_ADDRESS | Temporal Cloud endpoint | Environment |
| FLEXPRICE_TEMPORAL_TLS | true | Environment |
| FLEXPRICE_TEMPORAL_NAMESPACE | Your namespace | Environment |
| FLEXPRICE_TEMPORAL_TASK_QUEUE | billing-task-queue | Environment |
| FLEXPRICE_TEMPORAL_API_KEY | Temporal API key | Secret |
| FLEXPRICE_LOGGING_LEVEL | info | Environment |

Worker and Temporal Worker services

Worker and Temporal Worker use the same variables as API, with FLEXPRICE_DEPLOYMENT_MODE set to consumer or temporal_worker respectively; omit FLEXPRICE_SERVER_ADDRESS for both.

Additional environment variables (Production)

These variables are used in production deployments:
| Variable | Description | Example |
| --- | --- | --- |
| FLEXPRICE_DYNAMODB_IN_USE | Enable DynamoDB for events | true |
| FLEXPRICE_DYNAMODB_REGION | AWS region for DynamoDB | us-west-2 |
| FLEXPRICE_DYNAMODB_EVENT_TABLE_NAME | DynamoDB table name | events |
| FLEXPRICE_REDIS_HOST | ElastiCache Redis endpoint | clustercfg.xxx.cache.amazonaws.com |
| FLEXPRICE_REDIS_PORT | Redis port | 6379 |
| FLEXPRICE_REDIS_CLUSTER_MODE | Enable cluster mode | true |
| FLEXPRICE_REDIS_USE_TLS | Enable TLS | true |
| FLEXPRICE_REDIS_KEY_PREFIX | Key prefix | flexprice:prod |
| FLEXPRICE_EVENT_PUBLISH_DESTINATION | Where to publish events | all (Kafka + DynamoDB) |
| FLEXPRICE_LOGGING_FORMAT | Log format | json |
| FLEXPRICE_POSTGRES_READER_HOST | Aurora reader endpoint | xxx.cluster-ro-xxx.rds.amazonaws.com |

Step 13: Temporal Cloud configuration

Temporal Cloud is the recommended workflow orchestration service for production deployments.

Sign up for Temporal Cloud

  1. Go to temporal.io/cloud
  2. Create an account and organization
  3. Create a namespace (e.g., flexprice-prod-usa)

Create service account and API key

  1. In Temporal Cloud console, go to Settings > API Keys
  2. Create a new API key with appropriate permissions
  3. Note the API key and key name

Store Temporal credentials

aws secretsmanager create-secret \
  --name flexprice/${ENV}/temporal \
  --description "Flexprice Temporal Cloud credentials" \
  --secret-string '{
    "address": "us-west-2.aws.api.temporal.io:7233",
    "namespace": "your-namespace.your-account-id",
    "api_key": "YOUR_TEMPORAL_API_KEY",
    "api_key_name": "your-service-account-name"
  }'

Temporal environment variables

| Variable | Value | Description |
| --- | --- | --- |
| FLEXPRICE_TEMPORAL_ADDRESS | us-west-2.aws.api.temporal.io:7233 | Temporal Cloud endpoint |
| FLEXPRICE_TEMPORAL_NAMESPACE | your-namespace.account-id | Your namespace |
| FLEXPRICE_TEMPORAL_TLS | true | TLS is required |
| FLEXPRICE_TEMPORAL_TASK_QUEUE | billing-task-queue | Task queue name |
| FLEXPRICE_TEMPORAL_API_KEY | (from Secrets Manager) | API key |
| FLEXPRICE_TEMPORAL_API_KEY_NAME | Service account name | Key identifier |
Temporal Cloud provides managed infrastructure, automatic upgrades, and 99.99% SLA. For self-hosted Temporal, refer to the Temporal documentation.

Step 14: Third-party integrations (Optional)

Configure optional third-party services for enhanced functionality.

Supabase (Authentication)

If using Supabase for authentication:
aws secretsmanager create-secret \
  --name flexprice/${ENV}/supabase \
  --secret-string '{
    "base_url": "https://your-project.supabase.co",
    "service_key": "YOUR_SUPABASE_SERVICE_KEY"
  }'
| Variable | Value |
| --- | --- |
| FLEXPRICE_AUTH_PROVIDER | supabase |
| FLEXPRICE_AUTH_SUPABASE_BASE_URL | Supabase project URL |
| FLEXPRICE_AUTH_SUPABASE_SERVICE_KEY | Service role key |

Svix (Webhooks)

For webhook delivery via Svix:
aws secretsmanager create-secret \
  --name flexprice/${ENV}/svix \
  --secret-string '{
    "auth_token": "YOUR_SVIX_AUTH_TOKEN",
    "base_url": "https://api.us.svix.com"
  }'
| Variable | Value |
| --- | --- |
| FLEXPRICE_WEBHOOK_SVIX_CONFIG_ENABLED | true |
| FLEXPRICE_WEBHOOK_SVIX_CONFIG_AUTH_TOKEN | Svix auth token |
| FLEXPRICE_WEBHOOK_SVIX_CONFIG_BASE_URL | https://api.us.svix.com |

Sentry (Error Tracking)

For error tracking with Sentry:
| Variable | Value |
| --- | --- |
| FLEXPRICE_SENTRY_ENABLED | true |
| FLEXPRICE_SENTRY_DSN | Your Sentry DSN |
| FLEXPRICE_SENTRY_ENVIRONMENT | production |
| FLEXPRICE_SENTRY_SAMPLE_RATE | 1 (100% sampling) |

Grafana Cloud (Observability)

For profiling with Pyroscope on Grafana Cloud:
| Variable | Value |
| --- | --- |
| FLEXPRICE_PYROSCOPE_ENABLED | true |
| FLEXPRICE_PYROSCOPE_SERVER_ADDRESS | https://profiles-prod-xxx.grafana.net |
| FLEXPRICE_PYROSCOPE_APPLICATION_NAME | flexprice-prod-api |
| FLEXPRICE_PYROSCOPE_BASIC_AUTH_USER | Grafana user ID |
| FLEXPRICE_PYROSCOPE_BASIC_AUTH_PASSWORD | Grafana API key |

FluentD (Log Aggregation)

For centralized logging with FluentD:
| Variable | Value |
| --- | --- |
| FLEXPRICE_LOGGING_FLUENTD_ENABLED | true |
| FLEXPRICE_LOGGING_FLUENTD_HOST | FluentD service IP |
| FLEXPRICE_LOGGING_FLUENTD_PORT | 30242 |
| FLEXPRICE_LOGGING_FORMAT | json |

Resend (Email)

For transactional emails via Resend:
| Variable | Value |
| --- | --- |
| FLEXPRICE_EMAIL_ENABLED | true |
| FLEXPRICE_EMAIL_RESEND_API_KEY | Your Resend API key |
| FLEXPRICE_EMAIL_FROM_ADDRESS | Sender email |
| FLEXPRICE_EMAIL_REPLY_TO | Reply-to email |

Third-party cost summary

Breakdown below; production total is in the Cost estimation table above.
| Service | Purpose | Monthly Cost |
| --- | --- | --- |
| Temporal Cloud | Workflow orchestration | ~$200 |
| Supabase | Authentication | ~$25 |
| Svix | Webhooks | ~$50 |
| Grafana Cloud | Observability | ~$50 |
| Resend | Email | ~$20 |
| Sentry | Error tracking | $0-29 |
| Total | | ~$345-375 |

Deployment checklist

Use this checklist to verify your deployment:
1. VPC and Networking

  • VPC created with correct CIDR
  • 2 public subnets created
  • 4 private subnets created (2 compute, 2 data)
  • Internet Gateway attached
  • NAT Gateway(s) created and running
  • Route tables configured correctly
  • Security groups created with correct rules
2. IAM

  • ECS Task Execution Role created
  • ECS Task Role created
  • Policies attached correctly (S3, Secrets Manager, CloudWatch, DynamoDB)
3. Secrets Manager

  • PostgreSQL/Aurora credentials stored
  • ClickHouse credentials stored
  • Kafka SASL credentials stored
  • Auth secret stored
  • Temporal Cloud credentials stored
  • Third-party credentials stored (Supabase, Svix, etc.)
4. Aurora PostgreSQL

  • DB subnet group created
  • Aurora cluster created and available
  • Writer and Reader instances running
  • Security group allows ECS access
  • Secrets Manager updated with endpoints
  • Database migrations completed
5. Amazon MSK

  • MSK cluster created and active
  • SASL/SCRAM secret associated
  • Topics created (events, events_lazy, events-dlq)
  • Security group allows ECS access
  • Prometheus exporters enabled
6. EKS and ClickHouse

  • EKS cluster created
  • Node group running
  • gp3 StorageClass created
  • ClickHouse operator installed
  • ClickHouse cluster deployed
  • NLB created for ClickHouse access
  • Database initialized
7. ElastiCache Redis

  • Redis subnet group created
  • Redis replication group created
  • Cluster mode enabled (production)
  • TLS encryption enabled
  • Security group allows ECS access
8. DynamoDB

  • Events table created
  • Point-in-time recovery enabled
  • IAM policy allows ECS access
9. S3 and CloudWatch

  • S3 bucket created with encryption
  • CloudWatch log groups created
  • CloudWatch alarms configured
10. ECR and Images

  • ECR repositories created
  • Container images built and pushed
11. ECS

  • ECS cluster created
  • Task definitions registered
  • ALB created with HTTPS listener
  • Target group configured
  • Services created and healthy
  • Auto Scaling configured
12. Verification

  • API health check passing
  • Worker consuming from Kafka
  • Temporal workflows executing
  • Logs appearing in CloudWatch

Troubleshooting

API unreachable

  1. Check ALB health checks:
    aws elbv2 describe-target-health --target-group-arn $TG_ARN
    
  2. Check ECS task status:
    aws ecs describe-services \
      --cluster flexprice-${ENV} \
      --services flexprice-api-${ENV}
    
  3. Check ECS task logs:
    aws logs tail /ecs/flexprice-api-${ENV} --follow
    
  4. Verify security groups:
    • ALB SG allows inbound 443 from internet
    • ECS SG allows inbound 8080 from ALB SG
    • ECS SG allows outbound to RDS, MSK, ClickHouse

Worker not consuming

  1. Check Kafka connectivity:
    # From a bastion or EC2 instance with Kafka tools
    kafka-consumer-groups.sh \
      --bootstrap-server $MSK_BOOTSTRAP \
      --command-config client.properties \
      --group flexprice-consumer-${ENV} \
      --describe
    
  2. Check consumer lag in MSK CloudWatch metrics
  3. Verify SASL credentials:
    • Ensure AmazonMSK_ prefixed secret is associated with cluster
    • Verify username/password match in Secrets Manager
  4. Check security group:
    • MSK SG allows inbound 9094/9096 from ECS SG

Temporal workflows failing

  1. Check Temporal Worker logs:
    aws logs tail /ecs/flexprice-temporal-worker-${ENV} --follow
    
  2. Verify Temporal Cloud connection:
    • Correct FLEXPRICE_TEMPORAL_ADDRESS
    • Valid API key and namespace
    • TLS enabled
  3. Check Temporal Cloud UI for workflow history and errors

ClickHouse connection errors

  1. Verify ClickHouse pods are running:
    kubectl get pods -n clickhouse
    
  2. Check ClickHouse logs:
    kubectl logs -n clickhouse -l clickhouse.altinity.com/chi=flexprice
    
  3. Verify NLB is healthy:
    kubectl get svc clickhouse-nlb -n clickhouse
    
  4. Test connectivity from ECS:
    • Ensure EKS SG allows inbound 9000 from ECS SG
    • Verify NLB DNS resolves correctly

RDS connection issues

  1. Verify RDS is available:
    aws rds describe-db-instances \
      --db-instance-identifier flexprice-${ENV} \
      --query 'DBInstances[0].DBInstanceStatus'
    
  2. Check security group:
    • RDS SG allows inbound 5432 from ECS SG
  3. Verify credentials:
    • Check Secrets Manager values match RDS configuration
  4. Test from bastion:
    psql -h $RDS_ENDPOINT -U flexprice -d flexprice
    

Scaling guidelines

Scale when metrics exceed the thresholds below.
| Component | Metric | Threshold | Action |
| --- | --- | --- | --- |
| ECS | CPU utilization | > 70% sustained | Scale out |
| ECS | Memory utilization | > 80% sustained | Scale out or increase task memory |
| ECS | API latency (p99) | > 500ms | Scale out API tasks |
| ECS | Kafka consumer lag | Growing | Scale out Worker tasks |
| RDS | CPU utilization | > 80% sustained | Upgrade instance class |
| RDS | Database connections | > 80% of max | Upgrade instance or add read replica |
| RDS | Read IOPS | Hitting limits | Upgrade to gp3 with higher IOPS |
| RDS | Storage | > 80% used | Increase allocated storage |
| MSK | Broker CPU | > 60% sustained | Add brokers |
| MSK | Consumer lag | Growing over time | Add partitions and consumers |
| MSK | Storage | > 80% used | Increase broker storage |
| ClickHouse | Query latency | Degrading | Add replicas or upgrade nodes |
| ClickHouse | Disk usage | > 80% | Expand PVCs or add shards |
| ClickHouse | Memory pressure | OOM events | Increase node memory |

Cost optimization

Reserved instances

  • RDS: Purchase Reserved Instances for 1-3 year commitment (up to 72% savings)
  • MSK: Not available; consider Kafka on EC2 with Reserved Instances for significant savings

Fargate Spot

Use Fargate Spot for non-critical workloads:
# Update service to use Fargate Spot
aws ecs update-service \
  --cluster flexprice-${ENV} \
  --service flexprice-worker-${ENV} \
  --capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=2 capacityProvider=FARGATE,weight=1

S3 lifecycle policies

Already configured to transition to IA after 90 days. Consider:
  • Glacier for archives > 1 year
  • Intelligent-Tiering for unpredictable access patterns

CloudWatch log retention

Set appropriate retention periods:
  • Production: 30-90 days
  • Development: 7-14 days
  • Archive to S3 for long-term storage

Additional resources

Need help?

If you encounter issues during deployment: