Prerequisites
Before you begin, ensure you have the following:
- An AWS account with administrator or equivalent permissions to create VPCs, ECS, RDS, MSK, EKS, S3, IAM roles, and CloudWatch resources
- AWS CLI v2 installed and configured with credentials (`aws configure`)
- Docker installed (for building and pushing images to ECR)
- kubectl installed (for EKS/ClickHouse management)
- eksctl installed (optional but recommended for EKS cluster creation)
- Helm installed (for ClickHouse deployment)
You can verify the tooling with the commands shown below.
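A quick sanity check of the prerequisites; each command only prints version information:

```bash
aws --version        # expect aws-cli/2.x
docker --version
kubectl version --client
eksctl version
helm version
```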
Region selection
Choose an AWS region that:
- Has all required services (see Cost estimation for the list)
- Is geographically close to your users for lower latency
- Meets your compliance requirements (e.g., GDPR for EU data)
This guide uses us-east-1 as the example region. Replace it with your preferred region.
Cost estimation
We provide two configurations: a development setup for testing and a production setup for high-throughput workloads (100M+ events/month).
- Production (~$5,500/month)
- Development ($550-700/month)
| Component | Configuration | Monthly Cost |
|---|---|---|
| EC2 for ECS | 10x m6g.xlarge (ARM64/Graviton) | ~$1,030 |
| Aurora PostgreSQL | 2x db.r8g.xlarge (Writer + Reader) | ~$650 |
| Amazon MSK | 2 brokers, kafka.m5.large (4 vCPU, 8 GB), 1 TB storage per broker | ~$350 |
| EKS + ClickHouse | Control plane + m5.8xlarge nodes | ~$1,900 |
| ElastiCache Redis | Multi-node cluster (cache.r6g.large, cluster mode) | ~$650 |
| DynamoDB | On-demand, ~100M events | ~$50 |
| Storage (EBS) | 3,000 GB across components (gp3) | ~$290 |
| ALB + NAT Gateway | 2x NAT for HA | ~$130 |
| S3, CloudWatch, Secrets | Storage + logs | ~$50 |
| AWS Subtotal | | ~$5,100 |
| Third-party services | Temporal Cloud, Supabase, Svix, Grafana | ~$400 |
| Total | | ~$5,500/month |
Costs vary by region and usage. Use the AWS Pricing
Calculator for accurate estimates. ARM64/Graviton
instances provide ~20% cost savings over x86.
Sizing for 100M events/month
| Component | Development | Production (100M events/month) |
|---|---|---|
| ECS API | 1 task, 0.5 vCPU, 1 GB | 6 tasks, 0.75 vCPU, 1.5 GB each |
| ECS Consumer | 1 task, 0.5 vCPU, 1 GB | 30 tasks, 1 vCPU, 1.75 GB each |
| ECS Temporal Worker | 1 task, 1 vCPU, 2 GB | 3 tasks, 2 vCPU, 4 GB each |
| Database | RDS db.t3.small | Aurora 2x db.r8g.xlarge |
| Kafka | 2x kafka.t3.small, 100 GB | 2 brokers, kafka.m5.large, 1 TB per broker |
| ClickHouse | 2x m5.large (8 GB) | m5.8xlarge node(s) |
| Redis | cache.t3.micro | cache.r6g.large, multi-node cluster mode |
- 100M events/month = ~38.5 events/second average
- Peak traffic: 150-200 events/second (4-5x burst)
- ClickHouse storage: ~50 GB/month growth
- DynamoDB: ~20 GB/month growth
Architecture overview
Flexprice on AWS runs with the following production architecture:
- Clients → Cloudflare (DNS, WAF, rate limiting) → ALB → ECS (API, Consumer, Temporal Worker)
- API writes to Aurora PostgreSQL, publishes events to MSK (Kafka) and DynamoDB
- Consumer reads from Kafka and writes to ClickHouse (on EKS) for analytics
- Temporal Worker connects to Temporal Cloud for workflow orchestration
- ElastiCache Redis provides caching in cluster mode
- S3 stores invoice PDFs; CloudWatch and Grafana Cloud collect logs and metrics
This guide uses Temporal Cloud (recommended for production). You can also
self-host Temporal, but it requires additional infrastructure. Cloudflare is
optional but recommended for DNS and WAF.
Component summary
| Component | AWS Service | Purpose |
|---|---|---|
| Compute | ECS on EC2 (ARM64) | API, Consumer, Temporal Worker services |
| Primary Database | Aurora PostgreSQL | Transactional data, subscriptions, customers |
| Analytics Database | ClickHouse on EKS | Event analytics, usage aggregation |
| Message Queue | Amazon MSK | Event streaming between services |
| Cache | ElastiCache Redis | Session cache, rate limiting |
| Event Store | DynamoDB | Durable event storage |
| Object Storage | S3 | Invoice PDFs, exports |
| Workflow Engine | Temporal Cloud | Billing workflows, scheduled jobs |
| Authentication | Supabase | User authentication (optional) |
| Webhooks | Svix | Webhook delivery (optional) |
Step 1: VPC and networking
Create a VPC with public and private subnets across two Availability Zones for high availability. Unless otherwise specified, create each resource in this guide via the AWS Console, CLI, or IaC using the configuration described in the tables.
VPC configuration
| Setting | Value | Purpose |
|---|---|---|
| VPC CIDR | 10.0.0.0/16 | 65,536 IP addresses |
| Availability Zones | 2 (e.g., us-east-1a, us-east-1b) | High availability |
| Public subnets | 2 (10.0.1.0/24, 10.0.2.0/24) | ALB, NAT Gateway |
| Private subnets (compute) | 2 (10.0.10.0/24, 10.0.20.0/24) | ECS tasks |
| Private subnets (data) | 2 (10.0.100.0/24, 10.0.200.0/24) | RDS, MSK, EKS |
| NAT Gateway | 1 (or 2 for HA) | Private subnet internet access |
| Internet Gateway | 1 | Public subnet internet access |
Create VPC with AWS CLI
Create the VPC, enable DNS hostnames, and attach an Internet Gateway.
Create subnets
Create public and private subnets in two Availability Zones using the CIDRs in the VPC configuration table.
Create NAT Gateway
Create an Elastic IP and NAT Gateway in a public subnet.
Create route tables
Create public and private route tables and associate the subnets (public: default route to the Internet Gateway; private: default route to the NAT Gateway). The sketch below condenses these networking steps into CLI commands.
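A minimal sketch of the VPC, subnet, gateway, and routing commands. Subnet IDs, AZs, and names are illustrative; repeat the subnet and route-table commands for each row of the VPC configuration table:

```bash
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --query 'Vpc.VpcId' --output text)
aws ec2 modify-vpc-attribute --vpc-id "$VPC_ID" --enable-dns-hostnames '{"Value":true}'

IGW_ID=$(aws ec2 create-internet-gateway \
  --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --vpc-id "$VPC_ID" --internet-gateway-id "$IGW_ID"

# One public subnet shown; repeat per CIDR/AZ from the table
PUB1=$(aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block 10.0.1.0/24 \
  --availability-zone us-east-1a --query 'Subnet.SubnetId' --output text)

# Elastic IP + NAT Gateway in a public subnet
EIP=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
NAT_ID=$(aws ec2 create-nat-gateway --subnet-id "$PUB1" --allocation-id "$EIP" \
  --query 'NatGateway.NatGatewayId' --output text)

# Public route table: default route to the IGW
# (private route tables use --nat-gateway-id instead)
PUB_RT=$(aws ec2 create-route-table --vpc-id "$VPC_ID" \
  --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id "$PUB_RT" \
  --destination-cidr-block 0.0.0.0/0 --gateway-id "$IGW_ID"
aws ec2 associate-route-table --route-table-id "$PUB_RT" --subnet-id "$PUB1"
```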
Create security groups
Create security groups for ALB, ECS, RDS, MSK, and EKS. Use the rules in the summary table below; an example set of commands follows the table.
Security group rules summary
| Security Group | Inbound | Source | Port(s) | Purpose |
|---|---|---|---|---|
| flexprice-alb-sg | HTTPS | 0.0.0.0/0 | 443 | Public API access |
| flexprice-alb-sg | HTTP | 0.0.0.0/0 | 80 | Redirect to HTTPS |
| flexprice-ecs-sg | TCP | alb-sg | 8080 | ALB to API |
| flexprice-ecs-sg | TCP | ecs-sg | All | Inter-task communication |
| flexprice-rds-sg | TCP | ecs-sg | 5432 | PostgreSQL access |
| flexprice-msk-sg | TCP | ecs-sg | 9092, 9094, 9096 | Kafka access |
| flexprice-eks-sg | TCP | ecs-sg | 9000, 8123 | ClickHouse access |
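A sketch of the pattern for two of the groups; the remaining rows follow the same shape, using `--source-group` whenever the source is another security group:

```bash
ALB_SG=$(aws ec2 create-security-group --group-name flexprice-alb-sg \
  --description "Flexprice ALB" --vpc-id "$VPC_ID" \
  --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id "$ALB_SG" \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$ALB_SG" \
  --protocol tcp --port 80 --cidr 0.0.0.0/0

ECS_SG=$(aws ec2 create-security-group --group-name flexprice-ecs-sg \
  --description "Flexprice ECS tasks" --vpc-id "$VPC_ID" \
  --query 'GroupId' --output text)
# Source is the ALB security group rather than a CIDR
aws ec2 authorize-security-group-ingress --group-id "$ECS_SG" \
  --protocol tcp --port 8080 --source-group "$ALB_SG"
```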
Step 2: IAM roles and policies
Create IAM roles for ECS task execution and task runtime permissions.
ECS Task Execution Role
This role allows ECS to pull container images and write logs. Create the role, attach the managed policy AmazonECSTaskExecutionRolePolicy, and add an inline policy for Secrets Manager access. For example:
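A sketch of the role creation; the role name is illustrative:

```bash
# Trust policy that lets ECS tasks assume the role
cat > ecs-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name flexpriceTaskExecutionRole \
  --assume-role-policy-document file://ecs-trust.json
aws iam attach-role-policy --role-name flexpriceTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```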
ECS Task Role
This role grants the Flexprice application its runtime permissions (S3, CloudWatch Logs, Secrets Manager). Create the task role with the same trust policy and attach the inline policy, for example:
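A sketch of the inline policy; the statements mirror the deployment checklist (S3, Secrets Manager, DynamoDB, CloudWatch), and the resource ARNs are placeholders you should scope to your actual bucket, secrets, and table:

```bash
cat > flexprice-task-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::flexprice-invoices-*/*" },
    { "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:*:*:secret:flexprice/*" },
    { "Effect": "Allow",
      "Action": ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:*:*:table/events" },
    { "Effect": "Allow",
      "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*" }
  ]
}
EOF

aws iam create-role --role-name flexpriceTaskRole \
  --assume-role-policy-document file://ecs-trust.json
aws iam put-role-policy --role-name flexpriceTaskRole \
  --policy-name flexprice-runtime \
  --policy-document file://flexprice-task-policy.json
```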
Step 3: Secrets Manager
Store sensitive configuration in AWS Secrets Manager.
Create secrets
Create secrets for PostgreSQL, ClickHouse, Kafka (SASL), auth, and Temporal Cloud. Store postgres (host, username, password, database), clickhouse (username, password), kafka (username, password), auth (64-character hex secret), and temporal (API key, key name, namespace) values as needed. For example:
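Two examples of the pattern; the secret names are illustrative and must match whatever your task definitions reference in Step 11:

```bash
aws secretsmanager create-secret --name flexprice/postgres \
  --secret-string '{"host":"REPLACE","username":"flexprice","password":"REPLACE","database":"flexprice"}'

# Generate a 64-character hex auth secret
aws secretsmanager create-secret --name flexprice/auth \
  --secret-string "$(openssl rand -hex 32)"
```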
Step 4: Aurora PostgreSQL
Create an Aurora PostgreSQL cluster for Flexprice’s primary database. Aurora provides higher availability and performance compared to standard RDS.
- Production (Aurora)
- Development (RDS)
Create a DB subnet group, Aurora cluster (with Secrets Manager managed
credentials), writer instance (db.r8g.xlarge), and reader instance in the
other AZ. Retrieve the cluster writer and reader endpoints for application
configuration.
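A CLI sketch of the production path; subnet and security group IDs are placeholders, and the engine version and instance class come from the summary table below:

```bash
aws rds create-db-subnet-group --db-subnet-group-name flexprice-db \
  --db-subnet-group-description "Flexprice data subnets" \
  --subnet-ids subnet-data1 subnet-data2

# --manage-master-user-password stores the master password in Secrets Manager
aws rds create-db-cluster --db-cluster-identifier flexprice \
  --engine aurora-postgresql --engine-version 17.4 \
  --master-username flexprice --manage-master-user-password \
  --db-subnet-group-name flexprice-db \
  --vpc-security-group-ids sg-rds \
  --storage-encrypted

aws rds create-db-instance --db-instance-identifier flexprice-writer \
  --db-cluster-identifier flexprice \
  --engine aurora-postgresql --db-instance-class db.r8g.xlarge

# Writer and reader endpoints for application configuration
aws rds describe-db-clusters --db-cluster-identifier flexprice \
  --query 'DBClusters[0].[Endpoint,ReaderEndpoint]'
```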
Aurora configuration summary
| Setting | Development | Production |
|---|---|---|
| Engine | PostgreSQL 15.4 | Aurora PostgreSQL 17.4 |
| Instance class | db.t3.small | db.r8g.xlarge (4 vCPU, 32 GB) |
| Instances | 1 (Single-AZ) | 2 (Writer + Reader, Multi-AZ) |
| Storage | 100 GB gp3 | Aurora I/O-Optimized (auto-scaling) |
| Multi-AZ | No | Yes (2 zones) |
| Encryption | Enabled | Enabled |
| Backup retention | 7 days | 7 days |
| Monthly cost | ~$30 | ~$650 |
Update Secrets Manager with Aurora endpoints
Update the postgres secret in Secrets Manager with the Aurora writer and reader endpoints and the managed master password ARN.
Run database migrations
You can run migrations using a one-off ECS task or from a bastion host. Create a migration task definition and run it via ECS (or run `flexprice migrate up` from a host with DB access) using the configuration described above. For example:
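A sketch of a one-off run; it assumes a registered task definition named flexprice-migration, and uses Fargate because that is convenient for one-off tasks even when the long-running services run on EC2 capacity:

```bash
aws ecs run-task --cluster flexprice \
  --task-definition flexprice-migration \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-data1,subnet-data2],securityGroups=[sg-ecs],assignPublicIp=DISABLED}'
```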
Step 5: Amazon MSK (Kafka)
Create an Amazon MSK cluster for event streaming.
Create MSK configuration
Create an MSK configuration (server properties) and register it.
Create MSK cluster
Create the MSK cluster with 2 brokers (1 per AZ), the kafka.m5.large (4 vCPU, 8 GB) instance type, and 1024 GB (1 TB) of storage per broker. Enable SASL/SCRAM, TLS, encryption at rest, and enhanced monitoring.
Create SASL/SCRAM secret for MSK
Create a secret in Secrets Manager with the prefix AmazonMSK_ and associate it with the MSK cluster, for example:
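A sketch; note that MSK requires SASL/SCRAM secrets to be encrypted with a customer-managed KMS key, and the KMS alias and cluster ARN are placeholders:

```bash
SCRAM_ARN=$(aws secretsmanager create-secret --name AmazonMSK_flexprice \
  --kms-key-id alias/flexprice-msk \
  --secret-string '{"username":"flexprice","password":"REPLACE"}' \
  --query 'ARN' --output text)

aws kafka batch-associate-scram-secret \
  --cluster-arn "$MSK_CLUSTER_ARN" \
  --secret-arn-list "$SCRAM_ARN"
```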
Get MSK bootstrap brokers
Retrieve the SASL/SCRAM bootstrap broker string from the MSK cluster (AWS Console or CLI) for application configuration:
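For example, with the CLI:

```bash
aws kafka get-bootstrap-brokers --cluster-arn "$MSK_CLUSTER_ARN" \
  --query 'BootstrapBrokerStringSaslScram' --output text
```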
Create Kafka topics
Use a bastion host or an EC2 instance with the Kafka CLI tools to create the events and events-dlq topics (e.g., 6 partitions, replication factor 2). Use SASL_SSL and SCRAM-SHA-512 in the client configuration, as in the sketch below.
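A sketch, run from a host inside the VPC with the Apache Kafka CLI tools unpacked:

```bash
cat > client.properties <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="flexprice" password="REPLACE";
EOF

for topic in events events-dlq; do
  bin/kafka-topics.sh --bootstrap-server "$BOOTSTRAP_BROKERS" \
    --command-config client.properties \
    --create --topic "$topic" --partitions 6 --replication-factor 2
done
```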
MSK configuration summary
| Setting | Development | Production |
|---|---|---|
| Kafka version | 3.5.1 | 3.8.1 |
| Broker type | kafka.t3.small | kafka.m5.large (4 vCPU, 8 GB) |
| Number of brokers | 2 | 2 (1 per AZ) |
| Storage per broker | 100 GB | 1024 GB (1 TB) |
| Authentication | SASL/SCRAM | SASL/SCRAM + IAM |
| Encryption | TLS in transit | TLS in transit + at rest |
| Monitoring | Basic | Enhanced partition-level + Prometheus |
| Monthly cost | ~$90 | ~$350 |
Step 6: EKS with ClickHouse
Create an EKS cluster and deploy ClickHouse for analytics storage. For production (100M+ events/month), use m5.8xlarge nodes for the ClickHouse node group.
Create EKS cluster with eksctl
Create an EKS cluster with a managed node group (m5.8xlarge for production) via eksctl or IaC. Use private subnets and attach the EKS security group. For example:
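A minimal eksctl sketch; subnet IDs are placeholders and the node count should match your workload:

```bash
eksctl create cluster --name flexprice-clickhouse --region us-east-1 \
  --vpc-private-subnets subnet-data1,subnet-data2 \
  --nodegroup-name clickhouse --node-type m5.8xlarge \
  --nodes 1 --node-private-networking
```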
Create gp3 StorageClass
Create a gp3 StorageClass (EBS CSI driver, encrypted, Retain, WaitForFirstConsumer) via kubectl or IaC, for example:
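A sketch (requires the EBS CSI driver to be installed in the cluster):

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF
```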
Create ClickHouse namespace and secrets
Create the clickhouse namespace and a Kubernetes secret with credentials from Secrets Manager via kubectl or IaC.
Deploy ClickHouse with Helm
Add the Altinity ClickHouse Helm repo and install the ClickHouse Operator in the clickhouse namespace via Helm, for example:
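A sketch; the chart coordinates below follow the Altinity clickhouse-operator README at the time of writing and should be verified against the current Altinity docs:

```bash
# Verify the repo URL and chart name against the Altinity clickhouse-operator README
helm repo add clickhouse-operator https://docs.altinity.com/clickhouse-operator/
helm repo update
helm install clickhouse-operator clickhouse-operator/altinity-clickhouse-operator \
  --namespace clickhouse --create-namespace
```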
Create ClickHouse cluster
Deploy a ClickHouseInstallation (Altinity operator) with the credentials secret, gp3 storage, and appropriate resources via kubectl or Helm.
Create ClickHouse service for ECS access
Create a ClusterIP Service for ClickHouse (ports 9000, 8123) targeting the ClickHouse installation via kubectl or IaC.
Get ClickHouse endpoint
For ECS tasks to access ClickHouse, you have several options:
- Internal NLB (recommended): Create an internal Network Load Balancer pointing to the ClickHouse service
- VPC peering/Transit Gateway: If ECS and EKS are in separate VPCs
- AWS PrivateLink: For cross-account access
Initialize ClickHouse database
Connect to ClickHouse (e.g., via port-forward or the NLB) and create the flexprice database using clickhouse-client, for example:
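A sketch using port-forwarding; the service name is illustrative and should match the ClusterIP Service created above:

```bash
kubectl port-forward -n clickhouse svc/clickhouse-flexprice 9000:9000 &

clickhouse-client --host 127.0.0.1 --port 9000 \
  --user flexprice --password 'REPLACE' \
  --query "CREATE DATABASE IF NOT EXISTS flexprice"
```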
Step 7: ElastiCache Redis
Create an ElastiCache Redis cluster for caching and session management.
Create Redis subnet group
Create a cache subnet group in the data subnets.
Create Redis security group
Create a security group for Redis allowing TCP 6379 from the ECS security group.
Create Redis replication group (cluster mode)
- Production (Cluster Mode)
- Development (Single Node)
Create a Redis replication group with cache.r6g.large nodes, cluster mode enabled, multiple node groups (~$600/month), TLS and at-rest encryption, and Multi-AZ. For example:
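A sketch; the shard and replica counts are illustrative, so size them against the summary table below:

```bash
aws elasticache create-replication-group \
  --replication-group-id flexprice-redis \
  --replication-group-description "Flexprice cache" \
  --engine redis --cache-node-type cache.r6g.large \
  --num-node-groups 2 --replicas-per-node-group 1 \
  --cache-subnet-group-name flexprice-cache \
  --security-group-ids sg-redis \
  --transit-encryption-enabled --at-rest-encryption-enabled \
  --automatic-failover-enabled --multi-az-enabled
```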
Redis configuration summary
| Setting | Development | Production (multi-node cluster) |
|---|---|---|
| Node type | cache.t3.micro | cache.r6g.large (2 vCPU, 13 GB) |
| Cluster mode | Disabled | Enabled |
| Replicas | 0 | 1 per shard |
| Multi-AZ | No | Yes |
| Encryption | Optional | TLS in transit + at rest |
| Monthly cost | ~$15 | ~$600 |
Step 8: DynamoDB
Create a DynamoDB table for durable event storage alongside ClickHouse.
Create events table
Create a DynamoDB table named events with partition key pk (String) and sort key sk (String), using on-demand billing:
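For example:

```bash
aws dynamodb create-table --table-name events \
  --attribute-definitions AttributeName=pk,AttributeType=S AttributeName=sk,AttributeType=S \
  --key-schema AttributeName=pk,KeyType=HASH AttributeName=sk,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST
```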
Enable Point-in-Time Recovery
Enable point-in-time recovery (continuous backups) on the events table:
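For example:

```bash
aws dynamodb update-continuous-backups --table-name events \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
```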
DynamoDB configuration summary
| Setting | Value | Notes |
|---|---|---|
| Billing mode | On-demand | Pay per request, auto-scales |
| Partition key | pk (String) | Tenant/customer ID |
| Sort key | sk (String) | Event timestamp |
| PITR | Enabled | Point-in-time recovery |
| Encryption | AWS managed | Default encryption |
| Monthly cost | ~$50 | For ~100M events/month |
DynamoDB is used alongside ClickHouse for durable event storage. Events are
written to both DynamoDB (for durability) and ClickHouse (for analytics).
Step 9: S3 and CloudWatch
Create S3 bucket for invoices
Create an S3 bucket for invoice PDFs with versioning, AES256 encryption, block public access, and optional lifecycle rules (e.g., transition to STANDARD_IA after 90 days). For example:
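A sketch; the bucket name is a placeholder and must be globally unique (the create-bucket call below assumes us-east-1, which needs no LocationConstraint):

```bash
BUCKET=flexprice-invoices-prod
aws s3api create-bucket --bucket "$BUCKET"
aws s3api put-bucket-versioning --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket "$BUCKET" \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
aws s3api put-public-access-block --bucket "$BUCKET" \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```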
Create CloudWatch log groups
Create log groups for the ECS services (api, worker, temporal-worker, migration) with a retention policy (e.g., 30 days):
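For example (log group names are illustrative and must match your task definitions):

```bash
for svc in api worker temporal-worker migration; do
  aws logs create-log-group --log-group-name "/ecs/flexprice-$svc"
  aws logs put-retention-policy --log-group-name "/ecs/flexprice-$svc" \
    --retention-in-days 30
done
```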
Create CloudWatch alarms
Create alarms for ECS API CPU, RDS CPU, and RDS connections (e.g., threshold 80%, 2 evaluation periods) and associate them with an SNS topic for alerts. For example:
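One example alarm; repeat the pattern for the RDS metrics, and replace the SNS topic ARN:

```bash
aws cloudwatch put-metric-alarm --alarm-name flexprice-api-cpu-high \
  --namespace AWS/ECS --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=flexprice Name=ServiceName,Value=flexprice-api \
  --statistic Average --period 300 --threshold 80 \
  --comparison-operator GreaterThanThreshold --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:flexprice-alerts
```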
Step 10: ECR and container images
Create ECR repositories
Create ECR repositories for api, worker, and temporal-worker with scan-on-push and AES256 encryption:
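For example:

```bash
for repo in api worker temporal-worker; do
  aws ecr create-repository --repository-name "flexprice/$repo" \
    --image-scanning-configuration scanOnPush=true \
    --encryption-configuration encryptionType=AES256
done
```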
Build and push images
Build the Flexprice container images (api, worker, temporal-worker), authenticate to ECR, then tag and push to your ECR repositories:
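A sketch for one image; the account ID and build context are placeholders:

```bash
ACCOUNT=123456789012
REGION=us-east-1
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

docker build -t flexprice/api .
docker tag flexprice/api "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/flexprice/api:latest"
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/flexprice/api:latest"
```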
Step 11: ECS cluster and services
Create ECS cluster
- Production (EC2/ARM64)
- Development (Fargate)
For production (100M+ events/month), create an ECS cluster with EC2 capacity: a launch template with m6g.xlarge (ARM64/Graviton), an Auto Scaling Group with 10 nodes (set min/max as needed), a capacity provider with managed scaling, and an association with the cluster. For example:
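A sketch of the capacity provider and cluster; it assumes the launch template and Auto Scaling Group already exist, with the ASG ARN in $ASG_ARN:

```bash
aws ecs create-capacity-provider --name flexprice-ec2 \
  --auto-scaling-group-provider \
  "autoScalingGroupArn=$ASG_ARN,managedScaling={status=ENABLED,targetCapacity=100}"

aws ecs create-cluster --cluster-name flexprice \
  --capacity-providers flexprice-ec2 \
  --default-capacity-provider-strategy capacityProvider=flexprice-ec2,weight=1
```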
Create API task definition
Register an ECS task definition for the API service: production uses EC2/ARM64 (768 CPU, 1536 MiB memory) with bridge networking; development uses Fargate (1024 CPU, 2048 MiB memory). Include environment variables and secrets from Secrets Manager (auth, postgres, clickhouse, kafka, temporal). Set FLEXPRICE_DEPLOYMENT_MODE=api, a health check on :8080/health, and a CloudWatch log group. See Step 12 for the full environment variable reference. A heavily trimmed sketch:
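The sketch below registers a minimal production (EC2/bridge) task definition; account IDs, ARNs, and the secret layout are placeholders, and the remaining Step 12 variables and secrets still need to be added:

```bash
cat > api-task.json <<'EOF'
{
  "family": "flexprice-api",
  "requiresCompatibilities": ["EC2"],
  "networkMode": "bridge",
  "executionRoleArn": "arn:aws:iam::123456789012:role/flexpriceTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/flexpriceTaskRole",
  "containerDefinitions": [{
    "name": "api",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/flexprice/api:latest",
    "cpu": 768,
    "memory": 1536,
    "portMappings": [{ "containerPort": 8080 }],
    "environment": [
      { "name": "FLEXPRICE_DEPLOYMENT_MODE", "value": "api" },
      { "name": "FLEXPRICE_SERVER_ADDRESS", "value": ":8080" }
    ],
    "secrets": [{
      "name": "FLEXPRICE_POSTGRES_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:flexprice/postgres:password::"
    }],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/flexprice-api",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "api"
      }
    }
  }]
}
EOF

aws ecs register-task-definition --cli-input-json file://api-task.json
```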
Create Worker task definition
Register an ECS task definition for the Consumer (worker) service: FLEXPRICE_DEPLOYMENT_MODE=consumer, with postgres/clickhouse/kafka secrets from Secrets Manager. For production, use 30 tasks (100M events/month). See Step 12 for environment variables.
Create Temporal Worker task definition
Register an ECS task definition for the Temporal Worker: FLEXPRICE_DEPLOYMENT_MODE=temporal_worker, with postgres/clickhouse/kafka/temporal secrets. See Step 12 for environment variables.
Create Application Load Balancer
Create an internet-facing Application Load Balancer in the public subnets, a target group (HTTP 8080, health check /health), an HTTPS listener with an ACM certificate, and an HTTP listener that redirects to HTTPS. For example:
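A sketch; subnet, security group, and certificate identifiers are placeholders:

```bash
ALB_ARN=$(aws elbv2 create-load-balancer --name flexprice-alb \
  --subnets subnet-pub1 subnet-pub2 --security-groups "$ALB_SG" \
  --scheme internet-facing \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)

TG_ARN=$(aws elbv2 create-target-group --name flexprice-api \
  --protocol HTTP --port 8080 --vpc-id "$VPC_ID" \
  --health-check-path /health \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
  --protocol HTTPS --port 443 --certificates CertificateArn="$ACM_CERT_ARN" \
  --default-actions Type=forward,TargetGroupArn="$TG_ARN"

aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
  --protocol HTTP --port 80 \
  --default-actions 'Type=redirect,RedirectConfig={Protocol=HTTPS,Port=443,StatusCode=HTTP_301}'
```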
Create ECS services
Create ECS services for the API (desired count 6 for production), Worker/Consumer (desired count 30 for production), and Temporal Worker (e.g., 3 tasks). Attach the API service to the ALB target group. Use the private subnets and the ECS security group. For example:
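A sketch for two of the services (with bridge networking on EC2, no network configuration block is needed):

```bash
aws ecs create-service --cluster flexprice --service-name flexprice-api \
  --task-definition flexprice-api --desired-count 6 \
  --load-balancers "targetGroupArn=$TG_ARN,containerName=api,containerPort=8080"

aws ecs create-service --cluster flexprice --service-name flexprice-consumer \
  --task-definition flexprice-consumer --desired-count 30
```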
Configure Auto Scaling
Register scalable targets and target-tracking scaling policies for the API (and optionally Worker) services (e.g., min/max desired count, CPU target 70%). For example:
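A sketch of target tracking on average CPU for the API service; the min/max bounds are illustrative:

```bash
aws application-autoscaling register-scalable-target \
  --service-namespace ecs --resource-id service/flexprice/flexprice-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 6 --max-capacity 20

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs --resource-id service/flexprice/flexprice-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name flexprice-api-cpu --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }'
```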
Step 12: Environment variables reference
Below is a complete reference of environment variables for each service. Variables whose Source is Secret should be stored in AWS Secrets Manager.
API service
| Variable | Value | Source |
|---|---|---|
| FLEXPRICE_DEPLOYMENT_MODE | api | Environment |
| FLEXPRICE_SERVER_ADDRESS | :8080 | Environment |
| FLEXPRICE_AUTH_SECRET | 64-char hex | Secret |
| FLEXPRICE_POSTGRES_HOST | RDS endpoint | Secret |
| FLEXPRICE_POSTGRES_PORT | 5432 | Environment |
| FLEXPRICE_POSTGRES_USER | flexprice | Secret |
| FLEXPRICE_POSTGRES_PASSWORD | DB password | Secret |
| FLEXPRICE_POSTGRES_DBNAME | flexprice | Environment |
| FLEXPRICE_POSTGRES_SSLMODE | require | Environment |
| FLEXPRICE_CLICKHOUSE_ADDRESS | ClickHouse NLB endpoint | Environment |
| FLEXPRICE_CLICKHOUSE_USERNAME | flexprice | Secret |
| FLEXPRICE_CLICKHOUSE_PASSWORD | ClickHouse password | Secret |
| FLEXPRICE_CLICKHOUSE_DATABASE | flexprice | Environment |
| FLEXPRICE_CLICKHOUSE_TLS | false | Environment |
| FLEXPRICE_KAFKA_BROKERS | MSK bootstrap brokers | Environment |
| FLEXPRICE_KAFKA_USE_SASL | true | Environment |
| FLEXPRICE_KAFKA_SASL_MECHANISM | SCRAM-SHA-512 | Environment |
| FLEXPRICE_KAFKA_SASL_USER | flexprice | Secret |
| FLEXPRICE_KAFKA_SASL_PASSWORD | Kafka password | Secret |
| FLEXPRICE_KAFKA_TOPIC | events | Environment |
| FLEXPRICE_KAFKA_CONSUMER_GROUP | flexprice-consumer-prod | Environment |
| FLEXPRICE_TEMPORAL_ADDRESS | Temporal Cloud endpoint | Environment |
| FLEXPRICE_TEMPORAL_TLS | true | Environment |
| FLEXPRICE_TEMPORAL_NAMESPACE | Your namespace | Environment |
| FLEXPRICE_TEMPORAL_TASK_QUEUE | billing-task-queue | Environment |
| FLEXPRICE_TEMPORAL_API_KEY | Temporal API key | Secret |
| FLEXPRICE_LOGGING_LEVEL | info | Environment |
Worker and Temporal Worker services
Worker and Temporal Worker use the same variables as the API, with FLEXPRICE_DEPLOYMENT_MODE set to consumer or temporal_worker respectively; omit FLEXPRICE_SERVER_ADDRESS for both.
Additional environment variables (Production)
These variables are used in production deployments:
| Variable | Description | Example |
|---|---|---|
| FLEXPRICE_DYNAMODB_IN_USE | Enable DynamoDB for events | true |
| FLEXPRICE_DYNAMODB_REGION | AWS region for DynamoDB | us-west-2 |
| FLEXPRICE_DYNAMODB_EVENT_TABLE_NAME | DynamoDB table name | events |
| FLEXPRICE_REDIS_HOST | ElastiCache Redis endpoint | clustercfg.xxx.cache.amazonaws.com |
| FLEXPRICE_REDIS_PORT | Redis port | 6379 |
| FLEXPRICE_REDIS_CLUSTER_MODE | Enable cluster mode | true |
| FLEXPRICE_REDIS_USE_TLS | Enable TLS | true |
| FLEXPRICE_REDIS_KEY_PREFIX | Key prefix | flexprice:prod |
| FLEXPRICE_EVENT_PUBLISH_DESTINATION | Where to publish events | all (Kafka + DynamoDB) |
| FLEXPRICE_LOGGING_FORMAT | Log format | json |
| FLEXPRICE_POSTGRES_READER_HOST | Aurora reader endpoint | xxx.cluster-ro-xxx.rds.amazonaws.com |
Step 13: Temporal Cloud configuration
Temporal Cloud is the recommended workflow orchestration service for production deployments.
Sign up for Temporal Cloud
- Go to temporal.io/cloud
- Create an account and organization
- Create a namespace (e.g., flexprice-prod-usa)
Create service account and API key
- In Temporal Cloud console, go to Settings > API Keys
- Create a new API key with appropriate permissions
- Note the API key and key name
Store Temporal credentials
Store the Temporal API key, key name, and namespace in Secrets Manager (see Step 3) so the task definitions can reference them.
Temporal environment variables
| Variable | Value | Description |
|---|---|---|
| FLEXPRICE_TEMPORAL_ADDRESS | us-west-2.aws.api.temporal.io:7233 | Temporal Cloud endpoint |
| FLEXPRICE_TEMPORAL_NAMESPACE | your-namespace.account-id | Your namespace |
| FLEXPRICE_TEMPORAL_TLS | true | TLS is required |
| FLEXPRICE_TEMPORAL_TASK_QUEUE | billing-task-queue | Task queue name |
| FLEXPRICE_TEMPORAL_API_KEY | (from Secrets Manager) | API key |
| FLEXPRICE_TEMPORAL_API_KEY_NAME | Service account name | Key identifier |
Temporal Cloud provides managed infrastructure, automatic upgrades, and 99.99%
SLA. For self-hosted Temporal, refer to the Temporal
documentation.
Step 14: Third-party integrations (Optional)
Configure optional third-party services for enhanced functionality.
Supabase (Authentication)
If using Supabase for authentication:
| Variable | Value |
|---|---|
| FLEXPRICE_AUTH_PROVIDER | supabase |
| FLEXPRICE_AUTH_SUPABASE_BASE_URL | Supabase project URL |
| FLEXPRICE_AUTH_SUPABASE_SERVICE_KEY | Service role key |
Svix (Webhooks)
For webhook delivery via Svix:
| Variable | Value |
|---|---|
| FLEXPRICE_WEBHOOK_SVIX_CONFIG_ENABLED | true |
| FLEXPRICE_WEBHOOK_SVIX_CONFIG_AUTH_TOKEN | Svix auth token |
| FLEXPRICE_WEBHOOK_SVIX_CONFIG_BASE_URL | https://api.us.svix.com |
Sentry (Error Tracking)
For error tracking with Sentry:
| Variable | Value |
|---|---|
| FLEXPRICE_SENTRY_ENABLED | true |
| FLEXPRICE_SENTRY_DSN | Your Sentry DSN |
| FLEXPRICE_SENTRY_ENVIRONMENT | production |
| FLEXPRICE_SENTRY_SAMPLE_RATE | 1 (100% sampling) |
Grafana Cloud (Observability)
For profiling with Pyroscope on Grafana Cloud:
| Variable | Value |
|---|---|
| FLEXPRICE_PYROSCOPE_ENABLED | true |
| FLEXPRICE_PYROSCOPE_SERVER_ADDRESS | https://profiles-prod-xxx.grafana.net |
| FLEXPRICE_PYROSCOPE_APPLICATION_NAME | flexprice-prod-api |
| FLEXPRICE_PYROSCOPE_BASIC_AUTH_USER | Grafana user ID |
| FLEXPRICE_PYROSCOPE_BASIC_AUTH_PASSWORD | Grafana API key |
FluentD (Log Aggregation)
For centralized logging with FluentD:
| Variable | Value |
|---|---|
| FLEXPRICE_LOGGING_FLUENTD_ENABLED | true |
| FLEXPRICE_LOGGING_FLUENTD_HOST | FluentD service IP |
| FLEXPRICE_LOGGING_FLUENTD_PORT | 30242 |
| FLEXPRICE_LOGGING_FORMAT | json |
Resend (Email)
For transactional emails via Resend:
| Variable | Value |
|---|---|
| FLEXPRICE_EMAIL_ENABLED | true |
| FLEXPRICE_EMAIL_RESEND_API_KEY | Your Resend API key |
| FLEXPRICE_EMAIL_FROM_ADDRESS | Sender email |
| FLEXPRICE_EMAIL_REPLY_TO | Reply-to email |
Third-party cost summary
The breakdown is below; the production total appears in the Cost estimation table above.
| Service | Purpose | Monthly Cost |
|---|---|---|
| Temporal Cloud | Workflow orchestration | ~$200 |
| Supabase | Authentication | ~$25 |
| Svix | Webhooks | ~$50 |
| Grafana Cloud | Observability | ~$50 |
| Resend | Transactional email | ~$20 |
| Sentry | Error tracking | $0-29 |
| Total | | ~$345-375 |
Deployment checklist
Use this checklist to verify your deployment:
VPC and Networking
- VPC created with correct CIDR
- 2 public subnets created
- 4 private subnets created (2 compute, 2 data)
- Internet Gateway attached
- NAT Gateway(s) created and running
- Route tables configured correctly
- Security groups created with correct rules
IAM
- ECS Task Execution Role created
- ECS Task Role created
- Policies attached correctly (S3, Secrets Manager, CloudWatch, DynamoDB)
Secrets Manager
- PostgreSQL/Aurora credentials stored
- ClickHouse credentials stored
- Kafka SASL credentials stored
- Auth secret stored
- Temporal Cloud credentials stored
- Third-party credentials stored (Supabase, Svix, etc.)
Aurora PostgreSQL
- DB subnet group created
- Aurora cluster created and available
- Writer and Reader instances running
- Security group allows ECS access
- Secrets Manager updated with endpoints
- Database migrations completed
Amazon MSK
- MSK cluster created and active
- SASL/SCRAM secret associated
- Topics created (events, events_lazy, events-dlq)
- Security group allows ECS access
- Prometheus exporters enabled
EKS and ClickHouse
- EKS cluster created
- Node group running
- gp3 StorageClass created
- ClickHouse operator installed
- ClickHouse cluster deployed
- NLB created for ClickHouse access
- Database initialized
ElastiCache Redis
- Redis subnet group created
- Redis replication group created
- Cluster mode enabled (production)
- TLS encryption enabled
- Security group allows ECS access
S3 and CloudWatch
- S3 bucket created with encryption
- CloudWatch log groups created
- CloudWatch alarms configured
ECS
- ECS cluster created
- Task definitions registered
- ALB created with HTTPS listener
- Target group configured
- Services created and healthy
- Auto Scaling configured
Troubleshooting
API unreachable
- Check ALB health checks
- Check ECS task status
- Check ECS task logs
- Verify security groups:
  - ALB SG allows inbound 443 from the internet
  - ECS SG allows inbound 8080 from the ALB SG
  - ECS SG allows outbound to RDS, MSK, and ClickHouse
The first three checks are sketched below.
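A sketch of the CLI checks; the names and ARNs are the placeholders used in earlier steps:

```bash
aws elbv2 describe-target-health --target-group-arn "$TG_ARN"

aws ecs describe-services --cluster flexprice --services flexprice-api \
  --query 'services[0].{running:runningCount,events:events[0:3]}'

aws logs tail /ecs/flexprice-api --since 15m
```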
Worker not consuming
- Check Kafka connectivity and consumer lag (MSK CloudWatch metrics, or the CLI sketch below)
- Verify SASL credentials:
  - Ensure the AmazonMSK_-prefixed secret is associated with the cluster
  - Verify the username/password match in Secrets Manager
- Check security group:
  - MSK SG allows inbound 9094/9096 from the ECS SG
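A consumer-lag check, run from a host inside the VPC and reusing client.properties from Step 5:

```bash
bin/kafka-consumer-groups.sh --bootstrap-server "$BOOTSTRAP_BROKERS" \
  --command-config client.properties \
  --describe --group flexprice-consumer-prod
```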
Temporal workflows failing
- Check Temporal Worker logs
- Verify the Temporal Cloud connection:
  - Correct FLEXPRICE_TEMPORAL_ADDRESS
  - Valid API key and namespace
  - TLS enabled
- Check the Temporal Cloud UI for workflow history and errors
ClickHouse connection errors
- Verify ClickHouse pods are running
- Check ClickHouse logs
- Verify the NLB is healthy
- Test connectivity from ECS:
  - Ensure the EKS SG allows inbound 9000 from the ECS SG
  - Verify the NLB DNS resolves correctly
These checks are sketched below.
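A sketch; the label selector and NLB DNS name are placeholders:

```bash
kubectl get pods -n clickhouse
kubectl logs -n clickhouse -l app=clickhouse --tail=100

# ClickHouse's HTTP interface answers "Ok." on /ping
curl "http://$CLICKHOUSE_NLB_DNS:8123/ping"
```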
RDS connection issues
- Verify RDS is available
- Check security group:
  - RDS SG allows inbound 5432 from the ECS SG
- Verify credentials:
  - Check that Secrets Manager values match the RDS configuration
- Test from a bastion host (sketched below)
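A sketch of the availability and connectivity checks:

```bash
aws rds describe-db-clusters --db-cluster-identifier flexprice \
  --query 'DBClusters[0].Status'

# From a bastion with network access to the data subnets
psql "host=<writer-endpoint> port=5432 dbname=flexprice user=flexprice sslmode=require"
```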
Scaling guidelines
Scale when metrics exceed the thresholds below.
| Component | Metric | Threshold | Action |
|---|---|---|---|
| ECS | CPU utilization | > 70% sustained | Scale out |
| ECS | Memory utilization | > 80% sustained | Scale out or increase task memory |
| ECS | API latency (p99) | > 500ms | Scale out API tasks |
| ECS | Kafka consumer lag | Growing | Scale out Worker tasks |
| RDS | CPU utilization | > 80% sustained | Upgrade instance class |
| RDS | Database connections | > 80% of max | Upgrade instance or add read replica |
| RDS | Read IOPS | Hitting limits | Upgrade to gp3 with higher IOPS |
| RDS | Storage | > 80% used | Increase allocated storage |
| MSK | Broker CPU | > 60% sustained | Add brokers |
| MSK | Consumer lag | Growing over time | Add partitions and consumers |
| MSK | Storage | > 80% used | Increase broker storage |
| ClickHouse | Query latency | Degrading | Add replicas or upgrade nodes |
| ClickHouse | Disk usage | > 80% | Expand PVCs or add shards |
| ClickHouse | Memory pressure | OOM events | Increase node memory |
Cost optimization
Reserved instances
- RDS: Purchase Reserved Instances for 1-3 year commitment (up to 72% savings)
- MSK: Not available; consider Kafka on EC2 with Reserved Instances for significant savings
Fargate Spot
Use Fargate Spot for non-critical workloads, for example via a capacity provider strategy on development services:
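A sketch; it assumes FARGATE_SPOT is enabled as a capacity provider on the development cluster:

```bash
aws ecs create-service --cluster flexprice-dev --service-name flexprice-api \
  --task-definition flexprice-api --desired-count 1 \
  --capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=1 \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-app1],securityGroups=[sg-ecs],assignPublicIp=DISABLED}'
```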
S3 lifecycle policies
Already configured to transition to IA after 90 days. Consider:
- Glacier for archives > 1 year
- Intelligent-Tiering for unpredictable access patterns
CloudWatch log retention
Set appropriate retention periods:
- Production: 30-90 days
- Development: 7-14 days
- Archive to S3 for long-term storage
Additional resources
- Configuration Reference: complete list of Flexprice environment variables
- Architecture Overview: understand Flexprice’s internal architecture
- Monitoring: set up monitoring and observability
- Troubleshooting: common issues and solutions
Need help?
If you encounter issues during deployment:
- Check our GitHub Issues for similar problems
- Join our Slack community for real-time support
- Contact us at [email protected]

