AWS Compute Best Practices: The Complete Well-Architected Guide
Master EC2, Lambda, and ECS/EKS compute architectures. Learn right-sizing, auto-scaling, serverless patterns, and cost optimization strategies aligned with the AWS Well-Architected Framework.
Technical TL;DR
Compute often accounts for 60-70% of an AWS bill. Key takeaways:
---
1. Choose the Right Compute Service
AWS offers multiple compute services. Selecting the wrong one costs money and creates operational overhead.
Decision Framework
| Use Case | Best Service | Why |
|----------|--------------|-----|
| **Web servers with steady traffic** | EC2 + Auto Scaling | Predictable performance, full OS control |
| **Event-driven tasks (API triggers, file processing)** | Lambda | Pay-per-use, zero provisioning, auto-scales |
| **Containerized microservices** | ECS/Fargate | Managed containers, no cluster management |
| **Kubernetes workloads with complex orchestration** | EKS | Kubernetes consistency, hybrid portability |
| **Batch jobs, data processing** | AWS Batch or Lambda | Batch can run fault-tolerant jobs on Spot capacity; Lambda suits short jobs (up to 15 minutes) billed per invocation |
Anti-Patterns to Avoid
---
2. EC2 Best Practices
2.1 Right-Size Your Instances
Never default to large instance types. Most workloads run efficiently on smaller instances than expected.
Sizing Methodology:
1. Benchmark in non-production with CloudWatch metrics
2. Target 70-80% CPU utilization at peak
3. Use memory-optimized instances only when verified needed
4. Consider burstable instances (T3/T4g) for dev/test
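The sizing methodology above can be turned into a quick calculation. The sketch below assumes CPU demand scales roughly linearly with vCPU count, which will not hold for every workload; the function name and the 75% default target are illustrative choices, not part of any AWS tool.

```python
import math


def recommend_vcpus(current_vcpus: int, peak_cpu_percent: float,
                    target_percent: float = 75.0) -> int:
    """Suggest a vCPU count that puts peak CPU near the target utilization.

    Assumption: the work done at peak stays constant and spreads evenly
    across however many vCPUs are available.
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    # current_vcpus * peak% is the work done at peak; dividing by the
    # target% gives the vCPU count at which that work hits the target.
    needed = current_vcpus * peak_cpu_percent / target_percent
    return max(1, math.ceil(needed))


# An 8 vCPU instance peaking at 30% CPU could likely run on 4 vCPUs.
print(recommend_vcpus(8, 30))
```

Treat the result as a starting point for a load test on the smaller instance type rather than a final answer.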
Tools:
2.2 Implement Auto Scaling Groups
Auto Scaling Groups (ASGs) are non-negotiable for production EC2.
Required Configuration:
```yaml
Minimum: 2 instances # HA across AZs
Maximum: 10-20x baseline # Handle traffic spikes
Desired: Match steady-state load
Health Check: ELB + EC2
Scaling Policies:
- Target tracking: 70% CPU
- Step scaling: +1 instance per 10% CPU above threshold
- Scheduled scaling: Predictable patterns (business hours)
```
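As a rough illustration of the target tracking policy listed above, the following sketch builds the arguments one would typically pass to the Auto Scaling `put_scaling_policy` call. The helper name and the policy naming scheme are assumptions made for this example; verify the parameter shapes against the current SDK documentation before use.

```python
def target_tracking_policy(asg_name: str, target_cpu: float = 70.0) -> dict:
    """Build keyword arguments for a CPU target tracking policy.

    Only constructs the request; nothing is sent to AWS here.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target-tracking",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                # Average CPU across all instances in the group
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,
        },
    }


policy = target_tracking_policy("web-asg")
print(policy["PolicyName"])
```

With boto3 installed and credentials configured, the dictionary could be applied with something like `boto3.client("autoscaling").put_scaling_policy(**policy)`.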
Advanced Patterns:
2.3 Leverage Spot Instances
Spot instances offer up to **90% savings** for fault-tolerant workloads.
Ideal Workloads:
Pattern: Spot Fleet + On-Demand Base
```yaml
Spot Allocation Strategy: capacity-optimized
On-Demand Base: 30% # Maintain minimum capacity
Spot Percentage: 70% # Maximize savings
Instance Pools: 10+ # Diversify for interruption resilience
```
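To make the 30/70 split concrete, here is a minimal sketch that computes how many instances land on each purchase option and shows one possible mapping to the `InstancesDistribution` block of an Auto Scaling mixed instances policy. Rounding On-Demand up is an assumption of this example, not a statement about how AWS rounds, and mapping the 30% to `OnDemandPercentageAboveBaseCapacity` is one interpretation of the pattern above.

```python
import math
from typing import Tuple


def split_capacity(desired: int, on_demand_percent: int = 30) -> Tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired capacity."""
    # Round On-Demand up so the guaranteed share never falls below the target.
    on_demand = math.ceil(desired * on_demand_percent / 100)
    return on_demand, desired - on_demand


def instances_distribution(on_demand_percent: int = 30) -> dict:
    """One possible InstancesDistribution for a mixed instances policy."""
    return {
        "OnDemandBaseCapacity": 0,  # absolute instance count, not a percentage
        "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
        "SpotAllocationStrategy": "capacity-optimized",
    }


print(split_capacity(10))
```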
2.4 Use AWS Graviton Processors
Graviton (ARM-based) instances deliver **20-40% better price-performance** for most Linux workloads.
Migration Path:
1. Test workloads on r6g, c6g, m6g instances
2. Verify ARM compatibility (most Linux apps work out-of-box)
3. Rebuild or recompile if needed (minimal effort for most)
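When testing a workload on Graviton, it helps to compare throughput per dollar rather than raw speed. The sketch below shows the arithmetic; the prices and throughput figures in the usage line are made-up placeholders, so substitute your own benchmark results and current pricing.

```python
def price_performance_gain(x86_price_per_hour: float, x86_throughput: float,
                           arm_price_per_hour: float, arm_throughput: float) -> float:
    """Percentage improvement in throughput per dollar of ARM over x86."""
    x86_value = x86_throughput / x86_price_per_hour
    arm_value = arm_throughput / arm_price_per_hour
    return (arm_value / x86_value - 1.0) * 100.0


# Hypothetical benchmark: same throughput, ARM instance 20% cheaper per hour.
gain = price_performance_gain(0.10, 1000.0, 0.08, 1000.0)
print(f"{gain:.1f}% better price-performance")
```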
When NOT to use Graviton:
---
3. Lambda Best Practices
3.1 Lambda Anti-Patterns
| Anti-Pattern | Why It's Problematic |
|--------------|---------------------|
| **Monolithic functions** (500+ lines) | Hard to test, cold starts, timeout risks |
| **Synchronous orchestration** | Chained functions accumulate latency |
| **Ignoring that memory sets CPU** | Lambda CPU scales with memory; a 256 MB function gets little CPU and runs slowly |
| **No dead-letter queue** | Failed events lost forever |
| **Provisioned concurrency for everything** | Defeats cost benefits of serverless |
3.2 Memory Configuration
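Rule: Memory is a performance dial. Lambda allocates CPU power in proportion to configured memory, so raising memory also raises compute; ephemeral storage (/tmp) is a separate setting.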
Optimal Sizing:
```python
# Use AWS Lambda Power Tuning to find optimal memory
# Most functions perform best at 1024-1792 MB
# Cost sweet spot: 1024-1536 MB
```
Power Tuning Pattern:
1. Deploy with 128 MB, test execution time
2. Increase to 1024 MB, test again
3. Find the inflection point where extra memory no longer shortens duration enough to offset its higher per-millisecond price
4. Set to optimal memory (usually 1024-1792 MB)
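The trade-off in the steps above comes down to simple arithmetic: Lambda compute cost is memory (in GB) multiplied by duration (in seconds) multiplied by a per-GB-second price. The sketch below assumes a price of $0.0000166667 per GB-second and uses invented duration measurements; check current pricing for your region and architecture, and note that the per-request fee is left out because it does not vary with memory.

```python
from typing import Dict

# Assumed price per GB-second; verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb: int, duration_ms: float,
                    price: float = PRICE_PER_GB_SECOND) -> float:
    """Compute cost of one invocation, excluding the per-request fee."""
    return (memory_mb / 1024.0) * (duration_ms / 1000.0) * price


def cheapest_memory(measurements: Dict[int, float]) -> int:
    """Pick the memory size (MB) with the lowest cost per invocation.

    `measurements` maps memory in MB to the measured average duration in ms.
    """
    return min(measurements, key=lambda mb: invocation_cost(mb, measurements[mb]))


# Hypothetical test results: duration drops sharply, then flattens out.
measured = {128: 4000, 512: 900, 1024: 400, 1792: 350}
print(cheapest_memory(measured))
```

In this made-up data set, 1792 MB is slightly faster than 1024 MB but costs more per invocation, which is the inflection point the pattern describes.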
3.3 Control Cold Starts
Cold starts add latency when Lambda scales out.
Mitigation Strategies:
```yaml
1. Keep deployment packages small (<50 MB zip)
2. Minimize layers and dependencies
3. Use Provisioned Concurrency for critical paths
4. Implement keep-alive warming (scheduled pings)
5. Prefer Python or Go over Java and other runtimes with heavy cold starts
```
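For the keep-alive warming strategy, the handler should recognize the scheduled ping and return early so that warm-up traffic does not run business logic. The sketch below assumes the ping comes from an EventBridge scheduled rule, whose events carry `source` set to `aws.events` and `detail-type` set to `Scheduled Event`; the response shapes are illustrative.

```python
def is_warmup_event(event: dict) -> bool:
    """Return True for events emitted by an EventBridge scheduled rule."""
    return (event.get("source") == "aws.events"
            and event.get("detail-type") == "Scheduled Event")


def handler(event: dict, context) -> dict:
    # Short-circuit warm-up pings: the goal is only to keep the
    # execution environment initialized, not to do real work.
    if is_warmup_event(event):
        return {"warmed": True}
    # Placeholder for the real request handling.
    return {"statusCode": 200, "body": "processed"}


print(handler({"source": "aws.events", "detail-type": "Scheduled Event"}, None))
```

A single scheduled ping keeps only one execution environment warm, so Provisioned Concurrency is usually the more predictable option for critical paths.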
Package Optimization:
3.4 Event-Driven Architecture
Lambda excels when triggered by events, not invoked synchronously.
Recommended Event Sources:
Pattern: SQS Buffer for Throttling
```yaml
API Gateway -> Lambda -> SQS -> Lambda Worker
```
The queue absorbs traffic bursts so the worker is not throttled, gives failed messages automatic retries, and decouples producers from consumers.
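A minimal sketch of the worker side of this pattern follows. It assumes the SQS event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled, so that only failed messages return to the queue; `process` and the `order_id` field are hypothetical stand-ins for real business logic.

```python
import json


def process(payload: dict) -> None:
    """Hypothetical business logic; raises when the message is unusable."""
    if "order_id" not in payload:
        raise ValueError("missing order_id")


def worker_handler(event: dict, context) -> dict:
    """Process an SQS batch and report which messages failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; successful ones are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


batch = {"Records": [
    {"messageId": "1", "body": json.dumps({"order_id": 42})},
    {"messageId": "2", "body": json.dumps({"note": "no id"})},
]}
print(worker_handler(batch, None))
```

Pairing the queue with a dead-letter queue keeps messages that fail repeatedly from being retried forever.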
---
4. Container Best Practices (ECS/EKS)
4.1 ECS vs. EKS Decision
| Factor | ECS/Fargate | EKS |
|--------|-------------|-----|
| **Complexity** | Low (AWS-managed) | High (AWS manages the control plane; you manage Kubernetes upgrades, add-ons, and nodes) |
| **Startup Time** | Seconds | Minutes |
| **Cost** | Higher per vCPU | Lower at scale |
| **Portability** | AWS-only | Kubernetes everywhere |
| **Use Case** | Simple containerized apps | Complex orchestration needs |
4.2 Fargate Best Practices
Fargate eliminates server management. Use it unless you need custom kernel modules.
Configuration:
```yaml
Task CPU: Match application requirements
Task Memory: Include application + container overhead
Task Role: Least-privilege IAM per task
Network Mode: awsvpc (enables security groups)
```
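Fargate accepts only certain CPU and memory pairings, so "match application requirements" in practice means rounding up to the nearest valid combination. The sketch below encodes a subset of the combinations as understood at the time of writing; treat the table as an assumption and confirm it against the current Fargate documentation, which also lists larger sizes.

```python
from typing import Tuple

# CPU units -> allowed memory values in MiB (assumed subset; verify before use).
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def smallest_task_size(cpu_units: int, memory_mib: int) -> Tuple[int, int]:
    """Return the smallest (cpu, memory) pair that satisfies both needs."""
    for cpu in sorted(FARGATE_SIZES):
        if cpu < cpu_units:
            continue
        for memory in FARGATE_SIZES[cpu]:
            if memory >= memory_mib:
                return cpu, memory
    raise ValueError("requirements exceed the sizes in FARGATE_SIZES")


# An app needing 0.25 vCPU and about 3 GB must step up to 512 CPU units.
print(smallest_task_size(256, 3000))
```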
Cost Optimization:
4.3 EKS Best Practices
EKS is for teams committed to Kubernetes with complex orchestration needs.
Cluster Design:
```yaml
Managed Node Groups: Prefer over self-managed
Cluster Autoscaler: Required for cost efficiency
Multiple AZs: Required for HA
Pod Disruption Budgets: Prevent disruption during updates
Horizontal Pod Autoscaler: Scale pods based on demand
```
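To show how the Horizontal Pod Autoscaler arrives at a replica count, the sketch below applies the scaling formula from the Kubernetes documentation: desired replicas equal the ceiling of current replicas multiplied by the ratio of the current metric to the target metric. It is a simplification that ignores the tolerance band, stabilization windows, and pod readiness handling that the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified HPA calculation, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 90% CPU against a 60% target scale out to 6 pods.
print(desired_replicas(4, 90, 60))
```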
Cost Control:
---
5. High Availability & Disaster Recovery
5.1 Multi-AZ Deployments
Requirement: All production workloads must span multiple Availability Zones.
Implementation:
```yaml
EC2: ASG with subnets in multiple AZs
Lambda: Regional service (automatic AZ redundancy)
ECS: Tasks distributed across AZs
EKS: Nodes in multiple AZs, pod anti-affinity
```
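A simple guardrail for the multi-AZ requirement is to check, before deployment, that the subnets handed to an ASG or service actually span more than one zone. The sketch below works on a plain mapping of subnet ID to Availability Zone; the IDs and zone names are placeholders, and in practice the mapping would come from your infrastructure tooling.

```python
from typing import Dict


def spans_multiple_azs(subnet_azs: Dict[str, str], minimum: int = 2) -> bool:
    """Return True when the subnets cover at least `minimum` distinct AZs."""
    return len(set(subnet_azs.values())) >= minimum


# Placeholder subnet IDs: two subnets in the same AZ do not provide HA.
print(spans_multiple_azs({"subnet-a": "us-east-1a", "subnet-b": "us-east-1a"}))
print(spans_multiple_azs({"subnet-a": "us-east-1a", "subnet-b": "us-east-1b"}))
```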
5.2 Disaster Recovery Patterns
| RTO/RPO | Strategy | Cost | Complexity |
|---------|----------|------|------------|
| **Minutes / Zero data loss** | Multi-Region Active-Active | High | High |
| **Hours / Minimal loss** | Pilot Light or Warm Standby | Medium | Medium |
| **Days / 24hr loss** | Backup & Restore | Low | Low |
Pilot Light Implementation:
```yaml
Primary Region: Full production stack
DR Region:
- Minimal resources (single AZ, small instances)
- Automated DNS failover (Route 53 health checks)
- Database read replica (promote to primary)
- S3 cross-region replication
```
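The automated DNS failover piece can be sketched as a pair of Route 53 failover records, one per region. The helper below builds entries in the shape expected by the `ChangeBatch` argument of `change_resource_record_sets`; the domain, IP addresses, and health check ID are placeholders, and the helper itself is a hypothetical convenience rather than part of any SDK.

```python
from typing import Optional


def failover_record(name: str, ip: str, role: str,
                    health_check_id: Optional[str] = None) -> dict:
    """Build one UPSERT change for a failover routing record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": 60,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        # Route 53 uses this health check to decide when to fail over.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {"Changes": [
    failover_record("app.example.com", "192.0.2.10", "PRIMARY", "hc-placeholder"),
    failover_record("app.example.com", "198.51.100.10", "SECONDARY"),
]}
print(len(change_batch["Changes"]))
```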
---
6. Cost Optimization Checklist
Immediate Actions (Week 1)
Short-Term (Month 1)
Long-Term (Quarter 1)
---
7. Monitoring & Observability
Required Metrics
```yaml
Compute Metrics:
- CPU/Memory Utilization (CloudWatch; EC2 memory requires the CloudWatch agent)
- Network In/Out (bottleneck detection)
- Lambda errors and durations
- ASG scaling events
Alerting:
- CPU > 80% for 5 minutes
- Memory > 85% for 5 minutes
- Lambda error rate > 1%
- ASG at max capacity
```
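Two of the alert thresholds above can be expressed in a few lines. The sketch below builds arguments in the shape used by the CloudWatch `put_metric_alarm` call for the CPU alert and computes a Lambda error rate from raw counts; the alarm naming scheme is an assumption, and nothing is sent to AWS.

```python
def cpu_alarm(asg_name: str, threshold: float = 80.0) -> dict:
    """Build arguments for a 'CPU > threshold for 5 minutes' alarm."""
    return {
        "AlarmName": f"{asg_name}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 300,           # one 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def lambda_error_rate(errors: int, invocations: int) -> float:
    """Error rate as a percentage; zero when there were no invocations."""
    if invocations == 0:
        return 0.0
    return errors / invocations * 100.0


# 3 errors in 200 invocations is above the 1% alert threshold.
print(lambda_error_rate(3, 200) > 1.0)
```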
Recommended Tools
---
8. Security Best Practices
Compute Security Checklist
```yaml
EC2:
☐ IMDSv2 required (mitigates SSRF-based credential theft)
☐ Security groups restrict ingress/egress
☐ IAM roles, never access keys
☐ AWS Systems Manager Session Manager (no SSH keys)
Lambda:
☐ Least-privilege execution roles
☐ VPC configuration for private resources
☐ Secrets in Secrets Manager or Parameter Store (never plaintext in environment variables)
☐ Code signing in production
Containers:
☐ Scan images for vulnerabilities (Amazon ECR)
☐ Run as non-root user
☐ Read-only root filesystems
☐ Secrets via AWS Secrets Manager (not env vars)
```
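Parts of the container checklist can be enforced automatically. The sketch below inspects a dictionary shaped like an ECS container definition (using its `user`, `readonlyRootFilesystem`, and `environment` fields) and reports violations; the list of suspicious name fragments is a rough heuristic chosen for this example, not an exhaustive secret detector.

```python
from typing import List

# Heuristic fragments that suggest a secret is stored in a plain env var.
SECRET_HINTS = ("SECRET", "PASSWORD", "TOKEN", "KEY")


def lint_container(container: dict) -> List[str]:
    """Return a list of checklist violations for one container definition."""
    findings = []
    # "user" may be "uid" or "uid:gid"; an empty value means the image default,
    # which is treated as root here to stay on the safe side.
    user = str(container.get("user", "")).split(":")[0]
    if user in ("", "0", "root"):
        findings.append("runs as root")
    if not container.get("readonlyRootFilesystem", False):
        findings.append("root filesystem is writable")
    for var in container.get("environment", []):
        name = var.get("name", "")
        if any(hint in name.upper() for hint in SECRET_HINTS):
            findings.append(f"possible secret in env var {name}")
    return findings


print(lint_container({"environment": [{"name": "DB_PASSWORD", "value": "x"}]}))
```

A check like this fits naturally into a CI step that runs before a task definition is registered.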
---
Summary: Compute Excellence Pillars
1. **Right-size everything** (use data, not guesses)
2. **Auto-scale or go serverless** (manual scaling is obsolete)
3. **Leverage Spot/Graviton** (20-90% savings)
4. **Multi-AZ by default** (HA is non-negotiable)
5. **Monitor continuously** (you can't optimize what you don't measure)
---
*Last updated: January 5, 2025*