AWS Lambda Functions: A Production Checklist

AWS Lambda has become a cornerstone of modern serverless architectures, but writing a function that "just works" is very different from writing one that is performant, cost-effective, secure, and ready for production at scale. For .NET developers, the distance between a naive implementation and a well-engineered one is surprisingly wide — it spans everything from cold start times and memory allocation, to secret management, deployment safety, and event loss prevention.
This guide covers ten areas of Lambda excellence for .NET, each derived from real-world production patterns and common failure modes. It assumes you are comfortable with the basics of Lambda and the AWS .NET SDK, and focuses on the why behind each practice as much as the how.
Runtime & Deployment Model
Use .NET 10 (or Latest LTS) on arm64
What it is: AWS Graviton2 processors power arm64 Lambda functions. They offer better performance-per-dollar for most workloads, and AWS charges approximately 20% less per GB-second compared to x86.
Why it matters: For a high-traffic function with 500ms average duration at 512 MB memory, switching from x86 to arm64 can save over $15/month per million daily invocations with zero code changes.
# template.yaml
Globals:
  Function:
    Runtime: dotnet10
    Architectures:
      - arm64
    MemorySize: 512

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: MyFunction::MyFunction.Function::FunctionHandler
      CodeUri: src/MyFunction/
Minimize Deployment Package Size
What it is: The deployment package is the ZIP artifact containing your compiled application and its dependencies, uploaded to Lambda via S3. Its size directly affects how long Lambda takes to extract and load your code during a cold start.
Why it matters: Lambda extracts and initializes your deployment package on every cold start. A 50 MB package takes significantly longer to initialize than a 5 MB one. Smaller packages also reduce S3 storage costs and deployment time.
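A minimal sketch of project settings that shrink the published output (these are standard .NET publish properties; whether trimming is safe depends on your dependencies, so verify trim warnings before shipping):

```xml
<PropertyGroup>
  <!-- Remove unused IL from dependencies at publish time -->
  <PublishTrimmed>true</PublishTrimmed>
  <!-- Drop ICU globalization data if invariant culture is sufficient -->
  <InvariantGlobalization>true</InvariantGlobalization>
  <!-- Keep debug symbols out of the deployment package -->
  <DebugType>none</DebugType>
</PropertyGroup>
```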
Use Native AOT Compilation
What it is: Ahead-of-Time (AOT) compilation converts your .NET code directly into a native binary at build time, eliminating the Just-In-Time (JIT) compiler that normally runs when a .NET application starts. For Lambda, this means the initialization phase — the most expensive part of a cold start — is dramatically faster.
Why it matters: Lambda bills in 1ms increments. Cold starts are fully billed duration. On a JIT-based .NET function, initialization can take 400–1200ms depending on the number of loaded assemblies and SDK clients. With Native AOT, this drops to 30–80ms.
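Enabling AOT is primarily a project-file change. A minimal sketch (AOT Lambda functions also require a custom-runtime bootstrap, typically via the Amazon.Lambda.RuntimeSupport package, and all serialization must be source-generated):

```xml
<PropertyGroup>
  <!-- Compile to a native binary at publish time; no JIT at startup -->
  <PublishAot>true</PublishAot>
  <!-- Strip native symbols to shrink the binary further -->
  <StripSymbols>true</StripSymbols>
</PropertyGroup>
```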
Use Lambda SnapStart (if not using AOT)
What it is: SnapStart takes a snapshot of the initialized execution environment after the Init phase completes and caches it. Subsequent cold starts restore from the snapshot instead of re-running initialization, reducing cold start latency by up to 90%.
Why it matters: If you have an existing JIT-based .NET function that you cannot easily migrate to AOT (e.g., it uses libraries incompatible with trimming), SnapStart delivers most of the cold start benefit without a code rewrite.
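SnapStart is a configuration change, not a code change. A sketch in SAM terms (SnapStart applies to published versions, so an alias is needed; the alias name is illustrative):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    SnapStart:
      ApplyOn: PublishedVersions
    # SnapStart does not apply to $LATEST — invoke through a published version
    AutoPublishAlias: live
```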
Memory & CPU Sizing
Profile with AWS Lambda Power Tuning
What it is: The AWS Lambda Power Tuning tool is an open-source Step Functions state machine that runs your function at multiple memory configurations (e.g., 128 MB, 256 MB, 512 MB, 1024 MB, 1769 MB, 3008 MB) and returns a cost and performance graph showing the optimal setting.
Why it matters: The relationship between memory and cost is not linear because CPU allocation scales with memory. A function that runs in 1000ms at 256 MB might run in 200ms at 1024 MB, making the higher-memory option cheaper despite a higher per-ms rate.
Don't Under-Allocate Memory
What it is: Lambda allocates CPU proportionally to memory. Setting memory too low starves your function of CPU, which can make it run slower and cost more despite the lower per-GB-second rate.
| Memory | CPU Allocation |
|---|---|
| 128 MB | ~7% of a vCPU |
| 512 MB | ~29% of a vCPU |
| 1769 MB | 1 full vCPU |
| 3584 MB | 2 full vCPUs |
Why it matters: A .NET function with heavy computation (JSON parsing, data transformation, LINQ operations) starved of CPU at 256 MB may run 4–8x slower than at 1769 MB, making it cost more despite the lower RAM pricing.
Set Memory Based on Actual Measurements
What it is: Every Lambda invocation emits a REPORT log line containing the actual peak memory used. This measured value is the correct baseline for sizing your memory allocation, rather than guessing.
Why it matters: Setting memory allocation to peak measured usage plus 15% headroom prevents out-of-memory errors while avoiding wasteful over-provisioning. Both extremes cost money: OOM errors cause failed invocations and retries; excessive headroom means you pay for memory you never use.
REPORT RequestId: abc123 Duration: 245.12 ms Billed Duration: 246 ms Memory Size: 512 MB Max Memory Used: 178 MB
Set your memory allocation to peakMB * 1.15 (15% headroom above peak measured usage).
Cold Start Optimization
Move Heavy Initialization Outside the Handler
What it is: In Lambda, code that runs at class construction or in static initializers executes only once per execution environment lifecycle — during the Init phase — not on every invocation. By moving expensive setup (SDK client construction, config loading, connection establishment) outside the handler method, you ensure it runs once and is reused.
Why it matters: Lambda reuses execution environments across invocations and freezes them between calls, preserving static state. Initialization that runs during the Init phase is billed as part of the cold start, which occurs infrequently. The same code inside the handler runs on every invocation, making every call slower and more expensive.
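A minimal sketch of the pattern, using a DynamoDB client as the example (the handler body is illustrative):

```csharp
using Amazon.DynamoDBv2;
using Amazon.Lambda.Core;

public class Function
{
    // Static field: constructed once per execution environment, during the
    // Init phase, and reused by every subsequent invocation.
    private static readonly AmazonDynamoDBClient DynamoDb = new();

    public async Task<string> FunctionHandler(string input, ILambdaContext context)
    {
        // No per-invocation client construction, connection setup, or DNS lookup
        var tables = await DynamoDb.ListTablesAsync();
        return $"Tables visible: {tables.TableNames.Count}";
    }
}
```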
Use Provisioned Concurrency for Latency-Critical Paths
What it is: Provisioned Concurrency pre-initializes a specified number of execution environments, keeping them warm and ready to handle requests with no cold start. The environments are fully initialized — your static constructors have already run.
Why it matters: For synchronous, user-facing functions (e.g., API endpoints powering a mobile app), cold start latency can directly degrade user experience. Provisioned Concurrency eliminates cold starts entirely for those pre-warmed environments. It should not be used for async workers, background jobs, or event-processing functions, where latency is not user-visible and the added cost is unjustified.
Lazy-Load Non-Critical Dependencies
What it is: For dependencies only needed in specific code paths, use Lazy<T> to defer their initialization until the first time that code path is actually executed, rather than initializing them unconditionally at startup.
Why it matters: Eagerly initializing every dependency at startup contributes to cold start duration and memory usage, even for code paths that may never be invoked in a given execution environment. Lazy loading ensures you only pay the initialization cost for dependencies that are actually used.
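A sketch of the pattern (the OrderEvent type, RequiresArchival flag, and bucket name are hypothetical):

```csharp
using Amazon.S3;
using Amazon.S3.Model;
using Amazon.Lambda.Core;

public class Function
{
    // Not constructed during Init; created on first use of the rare code path
    private static readonly Lazy<AmazonS3Client> S3 = new(() => new AmazonS3Client());

    public async Task FunctionHandler(OrderEvent input, ILambdaContext context)
    {
        if (input.RequiresArchival)   // most invocations never enter this branch
        {
            await S3.Value.PutObjectAsync(new PutObjectRequest
            {
                BucketName = "order-archive",   // assumed bucket name
                Key = input.OrderId,
                ContentBody = input.Payload
            });
        }
    }
}
```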
Avoid Heavy DI Containers in the Hot Path
What it is: Reflection-based dependency injection frameworks (Microsoft.Extensions.DependencyInjection with assembly scanning, Autofac, etc.) perform extensive reflection during container construction. This conflicts directly with both Native AOT (which requires all types to be known at compile time) and cold start minimization goals.
Why it matters: Reflection-based DI scanning can add hundreds of milliseconds to your Init phase and is incompatible with Native AOT trimming. Preferred alternatives are manual wiring (fastest, recommended for simple functions) and source-generated DI (for more complex compositions).
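A sketch of manual wiring (OrderRepository, OrderService, and OrderEvent are hypothetical types standing in for your own composition):

```csharp
using Amazon.DynamoDBv2;
using Amazon.Lambda.Core;

public class Function
{
    // Explicit construction: no reflection, no assembly scanning, AOT-friendly,
    // and the dependency graph is visible at a glance.
    private static readonly AmazonDynamoDBClient DynamoDb = new();
    private static readonly OrderRepository Repository = new(DynamoDb);
    private static readonly OrderService Service = new(Repository);

    public Task FunctionHandler(OrderEvent input, ILambdaContext context)
        => Service.ProcessAsync(input);
}
```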
Execution & Compute Efficiency
Use async/await Throughout — Avoid .Result and .Wait()
What it is: async/await enables non-blocking I/O in .NET — when a function awaits a network call, the thread is released to do other work. .Result and .Wait() are synchronous blocking calls that hold the thread until the operation completes.
Why it matters: Lambda bills for wall-clock time, not CPU time. A thread blocked on .Result or .Wait() holds the execution environment hostage while waiting for I/O, consuming billed duration even though it is doing nothing. It can also cause deadlocks in certain synchronization contexts.
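A small contrast sketch (the class and URIs are illustrative):

```csharp
using System.Net.Http;

public class OrdersClient
{
    private static readonly HttpClient Http = new();

    // Anti-pattern: blocks a thread for the full round trip; billed time
    // accrues and deadlock is possible under a SynchronizationContext.
    public string GetSync(string uri)
        => Http.GetStringAsync(uri).Result;

    // Correct: the thread is released while the request is in flight.
    public Task<string> GetAsync(string uri)
        => Http.GetStringAsync(uri);
}
```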
Reuse HttpClient and AWS SDK Clients Across Invocations
What it is: HttpClient and AWS SDK service clients (e.g., AmazonDynamoDBClient) are designed to be long-lived and thread-safe. They should be created once as static fields and reused across invocations, not constructed per-request.
Why it matters: Each new HttpClient() creates a new socket pool. Creating one per invocation rapidly exhausts available ports (the default ephemeral port range is ~30,000), causing socket exhaustion under load — a notoriously difficult production bug to diagnose. AWS SDK clients hold connection pools internally; re-creating them per invocation defeats connection reuse and adds DNS resolution overhead on every call.
Set Appropriate Timeout
What it is: Lambda's timeout setting defines the maximum duration an invocation is allowed to run before it is forcibly terminated. It is configurable from 1 second to 15 minutes per function.
Why it matters: Lambda's default timeout for new functions is 3 seconds — too short for most real workloads. But setting it to the maximum (15 minutes) "just in case" means a function that hangs due to a slow downstream service will bill the full 15 minutes before being terminated. Calculate your timeout based on measured performance:
Timeout = (p99 measured duration) × 2 + network overhead margin
For example, if your function normally completes in 800ms and has a p99 of 1.2 seconds, set timeout to 3–5 seconds, not 15 minutes.
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Timeout: 10   # seconds, not 900 (15 min)
Use System.Text.Json over Newtonsoft.Json
What it is: System.Text.Json (STJ) is the built-in .NET JSON library introduced in .NET Core 3.1. It is allocation-friendly, supports Span<T> for zero-copy parsing, and is fully compatible with Native AOT via source generation. Newtonsoft.Json is the older, widely used third-party library that predates STJ.
Why it matters: Newtonsoft.Json relies on reflection, is incompatible with Native AOT, adds ~500 KB to your deployment package, and is measurably slower for typical Lambda payloads. STJ delivers better throughput, lower memory pressure, and AOT compatibility with no external dependency.
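A sketch of STJ source generation (the OrderEvent type is hypothetical):

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

public record OrderEvent(string OrderId, decimal Amount);

// The source generator emits reflection-free, AOT-compatible serializer
// metadata for every type listed here.
[JsonSerializable(typeof(OrderEvent))]
public partial class LambdaJsonContext : JsonSerializerContext { }

// Usage inside the handler:
// var json  = JsonSerializer.Serialize(order, LambdaJsonContext.Default.OrderEvent);
// var order = JsonSerializer.Deserialize(json, LambdaJsonContext.Default.OrderEvent);
```

When STJ is also the Lambda event serializer, the context can be plugged in via SourceGeneratorLambdaJsonSerializer from the Amazon.Lambda.Serialization.SystemTextJson package.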
Avoid Boxing and Unnecessary Allocations in Hot Paths
What it is: Boxing occurs when a value type (e.g., int, struct) is implicitly converted to object, causing a heap allocation. Unnecessary allocations include creating objects, strings, arrays, or closures that are immediately discarded after a single use.
Why it matters: Lambda functions that process thousands of events per second generate enormous GC pressure if they produce many short-lived heap objects per invocation. GC pauses directly extend billed duration and increase tail latency.
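An illustrative fragment of where the allocations come from and how to avoid them:

```csharp
// Boxing: storing a value type as object forces a heap allocation
int count = 42;
object boxed = count;                        // allocates on the heap

// List<object> boxes every int it stores; List<int> does not
var boxes = new List<object> { 1, 2, 3 };    // three boxing allocations
var ints  = new List<int>    { 1, 2, 3 };    // zero

// Closures allocate when they capture; a 'static' lambda cannot capture,
// which turns accidental per-call captures into compile-time errors.
var evens = ints.FindAll(static n => n % 2 == 0);
```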
Concurrency & Scaling
Set Reserved Concurrency to Avoid Noisy-Neighbor Throttling
What it is: Reserved concurrency is a per-function setting that both guarantees a minimum concurrency allocation for a function and caps its maximum concurrency. By default, all Lambda functions in an AWS account share a regional concurrency pool (default: 1,000 concurrent executions).
Why it matters: Without reserved concurrency, a single function experiencing a traffic spike can consume the entire regional pool, throttling completely unrelated functions in the same account. Reserved concurrency prevents this by isolating a function's allocation from the shared pool.
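In SAM this is a single property (the value of 100 is illustrative; size it from measured peak concurrency):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    # Guarantees this function up to 100 concurrent executions and,
    # equally, caps it there so it cannot drain the regional pool.
    ReservedConcurrentExecutions: 100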
Configure SQS Batch Size and Batch Window
What it is: When Lambda polls an SQS queue, it can retrieve and process multiple messages in a single invocation (a batch). BatchSize controls the maximum number of messages per invocation. MaximumBatchingWindowInSeconds controls how long Lambda waits to fill the batch before invoking. ReportBatchItemFailures tells Lambda which specific messages in a batch failed, so only those are retried.
Why it matters: Larger batches mean fewer Lambda invocations, which directly reduces cost. Without ReportBatchItemFailures, if one message in a batch of 100 fails, all 100 are returned to the queue and reprocessed — a 100x amplification of failed work.
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Events:
        SQSTrigger:
          Type: SQS
          Properties:
            Queue: !GetAtt MyQueue.Arn
            BatchSize: 100
            MaximumBatchingWindowInSeconds: 5
            FunctionResponseTypes:
              - ReportBatchItemFailures
Also set the queue's message visibility timeout to at least six times the function timeout (AWS's recommendation for SQS event sources), so in-flight messages are not redelivered while a batch is still being processed.
Use Event Filtering on SQS/DynamoDB/Kinesis Triggers
What it is: Lambda event source filtering lets you define a filter pattern at the AWS service level. Messages that don't match the filter are discarded by the service before they ever invoke your function.
Why it matters: You are not billed for filtered-out events, and your function doesn't need to contain logic to discard irrelevant messages. This reduces invocation count, cost, and code complexity.
Events:
  SQSTrigger:
    Type: SQS
    Properties:
      Queue: !GetAtt MyQueue.Arn
      FilterCriteria:
        Filters:
          - Pattern: '{"body": {"eventType": ["ORDER_PLACED"]}}'
Tune Maximum Concurrency on Event Source Mappings
What it is: The ScalingConfig.MaximumConcurrency property on an event source mapping caps how many concurrent Lambda executions can be processing from that specific source simultaneously, independent of the function's overall reserved concurrency.
Why it matters: Without this limit, a large SQS queue backlog can scale Lambda to hundreds of concurrent executions simultaneously. This can overwhelm downstream databases or APIs not designed for that level of parallelism, causing cascading failures.
Events:
  SQSTrigger:
    Type: SQS
    Properties:
      Queue: !GetAtt MyQueue.Arn
      BatchSize: 10
      ScalingConfig:
        MaximumConcurrency: 10
Networking & Integrations
Avoid VPC Unless Strictly Necessary
What it is: Placing a Lambda function inside a VPC attaches an Elastic Network Interface (ENI) to the function's execution environment, giving it access to private VPC resources. This is required for accessing services that have no public endpoint, such as RDS or ElastiCache.
Why it matters: ENI attachment adds 100–500ms of cold start latency and introduces capacity constraints in subnets. Lambda functions outside a VPC still run in AWS-managed, isolated infrastructure with no public inbound access — the security benefit of VPC placement is often overstated. Services like DynamoDB, S3, SQS, and SNS are all accessible without a VPC (or via VPC endpoints if the function is already in one).
Use VPC Endpoints for AWS Services
What it is: VPC Endpoints (Gateway endpoints for S3/DynamoDB, Interface endpoints for most other services) route traffic between your Lambda function and AWS services privately within the AWS network, bypassing the public internet and NAT Gateway.
Why it matters: Without VPC endpoints, traffic from a VPC-based Lambda to AWS services routes through a NAT Gateway, which charges $0.045 per GB of data processed. For a high-throughput function reading large S3 objects, this adds up quickly. VPC endpoints eliminate the NAT Gateway data charge (Interface endpoints have a small hourly fee, but it is offset by avoided NAT costs above roughly 10 GB/month).
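A sketch of a Gateway endpoint for S3 in CloudFormation (the MyVpc and PrivateRouteTable resource names are assumptions standing in for your own networking resources):

```yaml
# Gateway endpoint: no hourly fee, no NAT data-processing charge
S3Endpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref MyVpc
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.s3"
    RouteTableIds:
      - !Ref PrivateRouteTable
```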
Enable Connection Pooling and Keep-Alive for HTTP
What it is: HTTP connection pooling reuses established TCP connections across multiple requests rather than opening a new connection for each one. In Lambda, SocketsHttpHandler manages this pool, and it persists across invocations as long as the execution environment stays warm.
Why it matters: Without connection reuse, each Lambda invocation (or each HTTP call within it) incurs TCP handshake and TLS negotiation overhead. Tuning pool settings prevents stale connections from causing errors while keeping connections alive long enough to benefit from reuse.
private static readonly HttpClient HttpClient = new HttpClient(new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(15),
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2),
    MaxConnectionsPerServer = 50,
    UseCookies = false
})
{
    Timeout = TimeSpan.FromSeconds(10)
};
Configure AWS SDK Retry and Timeout
What it is: The AWS SDK has a built-in retry policy that automatically retries failed requests with exponential backoff. The default configuration retries up to 3 times. Both the retry count and per-attempt timeout can be configured on each SDK client.
Why it matters: For a Lambda function with a 5-second timeout calling a struggling DynamoDB table, the SDK's default aggressive retry policy may consume the entire timeout window retrying, billing you for the full duration and still returning an error. Setting explicit retry limits and per-call timeouts gives you control over this behavior.
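A sketch of explicit SDK client configuration (the values are illustrative; derive yours so that retries times per-attempt timeout stays well inside the function timeout):

```csharp
using Amazon.DynamoDBv2;

// Cap retries and per-attempt time so SDK retries cannot consume
// the entire Lambda timeout window.
private static readonly AmazonDynamoDBClient DynamoDb = new(new AmazonDynamoDBConfig
{
    MaxErrorRetry = 2,
    Timeout = TimeSpan.FromSeconds(2)
});
```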
Observability & Cost Visibility
Enable Lambda Insights
What it is: Lambda Insights is an enhanced CloudWatch monitoring feature that captures per-invocation metrics including memory used, CPU time, init duration, and network I/O. It is enabled by attaching the managed LambdaInsightsExtension Lambda layer to your function.
Why it matters: The default Lambda CloudWatch metrics (invocations, duration, errors) don't surface resource-level detail like memory pressure or CPU utilization. Lambda Insights fills that gap, making it possible to identify memory leaks, CPU-bound invocations, and slow init phases without adding custom instrumentation.
Cost note: Lambda Insights charges for custom CloudWatch metrics (~$0.30 per metric per month) and additional log data. For high-traffic functions, filter what you emit.
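Enabling it is a layer attachment plus an IAM policy. A sketch (the layer version suffix is region-specific, shown here as a placeholder; look up the current ARN in the Lambda Insights documentation):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Layers:
      # AWS-managed Lambda Insights extension layer
      - !Sub "arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:<version>"
```

The execution role also needs the CloudWatchLambdaInsightsExecutionRolePolicy managed policy so the extension can publish metrics.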
Use Structured Logging with Log Level Filtering
What it is: Structured logging emits log entries as JSON objects with consistent fields (timestamp, level, message, correlation IDs, etc.) rather than plain text strings. Log level filtering suppresses verbose logs (e.g., Debug, Trace) in production while keeping Warning and Error output.
Why it matters: Plain string logs are hard to query at scale. Structured JSON logs can be filtered and aggregated efficiently with CloudWatch Logs Insights. Log level control reduces log volume and CloudWatch ingestion costs in production.
To accomplish this easily, use Powertools for AWS Lambda (.NET).
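A sketch using the Powertools Logging utility (the service name and fields are illustrative, and attribute parameter names may vary slightly between Powertools versions, so check the docs for the release you use):

```csharp
using Amazon.Lambda.Core;
using AWS.Lambda.Powertools.Logging;

public class Function
{
    // Emits structured JSON; entries below the configured minimum level are dropped
    [Logging(Service = "order-processing", LogLevel = LogLevel.Warning)]
    public void FunctionHandler(OrderEvent input, ILambdaContext context)
    {
        // Correlation field appended to every subsequent log entry
        Logger.AppendKey("orderId", input.OrderId);
        Logger.LogWarning("Order flagged for manual review");
    }
}
```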
Set CloudWatch Log Retention Policy
What it is: Each Lambda function automatically gets a CloudWatch Log Group. The default retention policy is "Never Expire," meaning logs accumulate indefinitely. You can configure a retention period (e.g., 14 days) via CloudFormation or the console.
Why it matters: CloudWatch Logs storage is billed at $0.03/GB. With "Never Expire," logs from high-volume functions accumulate indefinitely, and the bill grows silently. Explicitly setting a retention period caps storage costs and keeps log groups manageable.
Resources:
  MyFunctionLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/lambda/${MyFunction}"
      RetentionInDays: 14
Use X-Ray or OpenTelemetry for Distributed Tracing
What it is: AWS X-Ray is a distributed tracing service that instruments calls across Lambda invocations, AWS service calls (DynamoDB, S3, SQS), and outbound HTTP requests, producing service maps and per-segment latency breakdowns. OpenTelemetry is the vendor-neutral alternative that can export to X-Ray, Jaeger, or other backends.
Why it matters: Standard CloudWatch metrics tell you that a function is slow; distributed tracing tells you which downstream call is responsible. This is invaluable for identifying the specific DynamoDB query or external API call causing p99 latency spikes.
Globals:
  Function:
    Tracing: Active
To accomplish this easily, use Powertools for AWS Lambda (.NET).
Tag All Lambda Functions with Cost Allocation Tags
What it is: AWS resource tags are key-value pairs attached to resources. Cost Allocation Tags are tags you activate in the Billing console, after which AWS Cost Explorer can group and filter Lambda spending by those tag values.
Why it matters: Without consistent tagging, Lambda costs in a shared account appear as a single line item, making it impossible to attribute spending to specific teams, features, or environments. Tags enable per-team or per-service cost accountability.
Globals:
  Function:
    Tags:
      Team: payments
      Environment: production
      Service: order-processing
Architecture & Design
Keep Functions Single-Purpose
What it is: A single-purpose Lambda function handles one specific operation (e.g., process an order event, resize an image, send a notification) rather than branching across multiple unrelated operations based on input type.
Why it matters: A function that branches heavily based on input type ends up sized for its most demanding branch, wasting memory and compute for all other branches. Single-purpose functions are easier to size, tune independently, monitor, and debug.
Consider Lambda URLs vs API Gateway
What it is: Lambda Function URLs are built-in HTTPS endpoints directly on a Lambda function, requiring no additional AWS service. API Gateway is a fully managed service that sits in front of Lambda and adds features like request validation, authorizers, usage plans, WAF integration, and stage variables.
Why it matters: For simple HTTP endpoints that don't need API Gateway features, Lambda Function URLs eliminate the $1.00 per million request charge that HTTP API Gateway adds, reducing cost to Lambda invocation charges only. The right choice depends on whether you need the additional API Gateway capabilities.
| Feature | API Gateway HTTP | Lambda URL |
|---|---|---|
| Cost per million requests | $1.00 | $0 (Lambda invocation only) |
| CORS support | Yes | Yes |
| Auth | IAM, Cognito, Lambda authorizer | IAM, none |
| Custom domain | Yes | No (but works behind CloudFront) |
| WebSocket | Yes | No |
Use Step Functions for Orchestration over Chained Lambdas
What it is: AWS Step Functions is a serverless orchestration service that coordinates multi-step workflows. The alternative — chaining Lambda functions by having one function synchronously invoke another and wait for its response — keeps the calling function's execution environment alive and billing during the entire wait.
Why it matters: Synchronously chaining Lambda functions doubles or triples your Lambda bill because the caller is billed for the full wait duration. Step Functions handles orchestration at the service level; your Lambda functions execute independently and are only billed for their own compute time.
Security
Apply Least-Privilege IAM Execution Roles
What it is: Every Lambda function runs under an IAM execution role. This role defines exactly which AWS API calls the function is permitted to make. Many teams default to broad managed policies like AmazonDynamoDBFullAccess or even AdministratorAccess during development and never revisit them.
Why it matters: If your function is compromised through a vulnerability in your code or a dependency, the attacker inherits every permission in that execution role. A function that can only call dynamodb:GetItem on one specific table does far less damage than one with write access to all tables in the account.
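SAM's policy templates make scoped permissions concise. A sketch (OrdersTable is an assumed table resource in the same template):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Policies:
      # Read-only access to exactly one table — nothing else
      - DynamoDBReadPolicy:
          TableName: !Ref OrdersTable
```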
Never Store Secrets in Environment Variables or Source Code
What it is: Lambda environment variables are convenient but are stored in plaintext in the function configuration, visible to anyone with lambda:GetFunctionConfiguration or lambda:GetFunction IAM access. The correct alternative is AWS Secrets Manager or AWS Systems Manager Parameter Store (SecureString), where secrets are fetched at runtime and access is controlled via IAM.
Why it matters: Database passwords, API keys, and signing secrets in environment variables or source code have been the root cause of numerous high-profile cloud breaches. The AWS Shared Responsibility Model makes secret storage entirely your responsibility.
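A sketch of fetching a secret once per execution environment (the secret name is illustrative; combining Lazy<T> with the static-initialization pattern keeps the fetch off the per-invocation path):

```csharp
using Amazon.SecretsManager;
using Amazon.SecretsManager.Model;
using Amazon.Lambda.Core;

public class Function
{
    private static readonly AmazonSecretsManagerClient Secrets = new();

    // Fetched on first use, then cached for the environment's lifetime
    private static readonly Lazy<Task<string>> DbPassword = new(async () =>
        (await Secrets.GetSecretValueAsync(new GetSecretValueRequest
        {
            SecretId = "prod/orders/db-password"   // illustrative secret name
        })).SecretString);

    public async Task FunctionHandler(string input, ILambdaContext context)
    {
        var password = await DbPassword.Value;
        // ... open the downstream connection using the fetched secret
    }
}
```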
Enable AWS WAF on API Gateway or CloudFront
What it is: AWS WAF (Web Application Firewall) inspects HTTP requests before they reach your Lambda function, blocking common attack patterns (SQL injection, XSS, path traversal), known malicious IP ranges, and volumetric abuse. It is attached to API Gateway, CloudFront, or an Application Load Balancer.
Why it matters: Without WAF, a Lambda function exposed via API Gateway is reachable by anyone on the internet. Even a well-validated function can be overwhelmed by a flood of requests (costing money in Lambda invocations) or hit with sophisticated payloads designed to exploit dependencies.
Use Resource-Based Policies to Restrict Who Can Invoke
What it is: A Lambda resource policy (also called a function policy) defines which AWS principals — accounts, services, or IAM roles — are allowed to call lambda:InvokeFunction on your function. It is separate from the function's execution role, which controls what the function can do.
Why it matters: Without a restrictive resource policy, an API Gateway misconfiguration or an accidental public URL can expose your function to arbitrary invocation from the internet, leading to unexpected charges and potential data exposure.
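A sketch of a resource policy that allows only one specific API to invoke the function (MyApi is an assumed API Gateway resource in the same template):

```yaml
ApiInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref MyFunction
    Action: lambda:InvokeFunction
    Principal: apigateway.amazonaws.com
    # Only this API (any stage and route) may invoke the function
    SourceArn: !Sub "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${MyApi}/*"
```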
Restrict Lambda Function URL Auth
What it is: Lambda Function URLs support two authentication modes: AuthType: NONE (publicly invocable by anyone with the URL) and AuthType: AWS_IAM (requires a valid AWS Signature Version 4 signed request). The NONE mode is occasionally used for public webhooks but is a security risk for most production functions.
Why it matters: AuthType: NONE means the URL is publicly accessible with no authentication. Anyone who discovers or guesses the URL can invoke your function at your expense and potentially exfiltrate or corrupt data. Use AWS_IAM for any function that is not intentionally public.
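In SAM this is one property on the function:

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionUrlConfig:
      AuthType: AWS_IAM   # never NONE unless the endpoint is deliberately public
```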
Operations
Configure Dead Letter Queues on Async Invocations
What it is: A Dead Letter Queue (DLQ) is an SQS queue or SNS topic configured to receive events that Lambda failed to process after exhausting its retry attempts. For asynchronous invocations (from SNS, S3 events, EventBridge, or direct async InvokeFunction calls), Lambda retries failed invocations twice by default before discarding the event.
Why it matters: Without a DLQ, a transient downstream failure (a throttled DynamoDB table, a network timeout) can cause permanent, silent data loss with no alert and no recovery path. Silent event loss is one of the hardest production bugs to detect.
Resources:
  MyFunctionDLQ:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: my-function-dlq
      MessageRetentionPeriod: 1209600   # 14 days in seconds

  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: dotnet10
      DeadLetterQueue:
        Type: SQS
        TargetArn: !GetAtt MyFunctionDLQ.Arn
      EventInvokeConfig:
        MaximumRetryAttempts: 2
        MaximumEventAgeInSeconds: 3600
Always alarm on DLQ depth so a backlog is immediately visible.
Use Lambda Destinations for Async Success and Failure Routing
What it is: Lambda Destinations are a more flexible evolution of DLQs. They let you route both successful and failed async invocations to an SQS queue, SNS topic, EventBridge bus, or another Lambda function, with the full original event and function response or error details included in the routed payload.
Why it matters: DLQs receive only the original input event on failure. Destinations receive the original event plus the function response or error details for both successes and failures, making post-processing, alerting, and conditional routing significantly richer. They also enable success-path routing, which DLQs cannot do.
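A sketch of routing both outcomes (SuccessQueue and FailureTopic are assumed resources in the same template):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      DestinationConfig:
        OnSuccess:
          Type: SQS
          Destination: !GetAtt SuccessQueue.Arn
        OnFailure:
          Type: SNS
          Destination: !Ref FailureTopic
```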
Alarm on Errors, Throttles, and Duration Breaches
What it is: Lambda publishes several CloudWatch metrics out of the box: Errors, Throttles, Duration, ConcurrentExecutions, and IteratorAge (for stream-based triggers). CloudWatch Alarms can monitor these metrics and trigger notifications or automated actions when thresholds are breached.
Why it matters: None of these metrics have alarms by default — you must create them explicitly. Without alarms, a function silently failing 10% of invocations, hitting account-level concurrency limits, or processing events that are hours behind will go undetected until a user or downstream system reports a problem.
ErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${MyFunction}-ErrorRate"
    Metrics:
      - Id: errors
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
          Period: 60
          Stat: Sum
        ReturnData: false   # input to the expression, not alarmed on directly
      - Id: invocations
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Invocations
            Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
          Period: 60
          Stat: Sum
        ReturnData: false
      - Id: errorRate
        Expression: "errors / invocations * 100"
        Label: ErrorRate
        ReturnData: true    # exactly one metric in a math alarm may return data
    ComparisonOperator: GreaterThanThreshold
    Threshold: 1
    EvaluationPeriods: 2
    TreatMissingData: notBreaching
    AlarmActions: [!Ref OpsAlertTopic]

ThrottleAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${MyFunction}-Throttles"
    MetricName: Throttles
    Namespace: AWS/Lambda
    Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions: [!Ref OpsAlertTopic]

DurationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${MyFunction}-DurationP99"
    MetricName: Duration
    Namespace: AWS/Lambda
    Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
    ExtendedStatistic: p99
    Period: 300
    EvaluationPeriods: 3
    Threshold: 8000   # 8 seconds if timeout is 10 seconds
    ComparisonOperator: GreaterThanThreshold
    AlarmActions: [!Ref OpsAlertTopic]
Use Lambda Aliases and Traffic Shifting for Safe Deployments
What it is: Lambda aliases are named pointers to specific function versions (e.g., live → version 12). Traffic shifting, configured through AWS CodeDeploy, allows you to route a percentage of invocations to a new version while the majority still goes to the current stable version — a serverless canary or linear deployment.
Why it matters: Deploying a new function version directly to 100% of traffic has no rollback path faster than re-deploying the previous version, which takes 30–90 seconds. With canary or linear deployments, CodeDeploy monitors your CloudWatch alarms and automatically rolls back to the previous version the moment an alarm fires — often within 60 seconds of a bad deployment.
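A sketch of a canary deployment in SAM (the alias name, deployment type, and alarm reference are illustrative; the referenced alarm must exist in the template):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    AutoPublishAlias: live
    DeploymentPreference:
      # Route 10% of traffic to the new version for 5 minutes, then the rest;
      # roll back automatically if the listed alarms fire during the window.
      Type: Canary10Percent5Minutes
      Alarms:
        - !Ref ErrorRateAlarm
```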
Quick Reference
Must = required for any production function regardless of workload. Optional = strongly advisable but context-dependent — the "When to apply" column clarifies when it becomes effectively mandatory.
| # | Practice | Priority | When to apply |
|---|---|---|---|
| Runtime & Deployment | |||
| 1 | Use latest LTS .NET on arm64 | Optional | New functions or runtime upgrades; skip if library incompatibility blocks migration |
| 2 | Minimize deployment package size | Must | Always |
| 3 | Use Native AOT compilation | Optional | New functions; skip if dependencies are AOT-incompatible |
| 4 | Use Lambda SnapStart | Optional | JIT-based functions where cold start is a problem and AOT is not feasible |
| Memory & CPU Sizing | |||
| 5 | Profile with Lambda Power Tuning | Must | Before any function goes to production; re-run after significant code changes |
| 6 | Don't under-allocate memory | Must | Always |
| 7 | Set memory based on actual measurements | Must | Always |
| Cold Start Optimization | |||
| 8 | Move heavy initialization outside the handler | Must | Always |
| 9 | Use Provisioned Concurrency | Optional | Synchronous user-facing functions with strict latency SLAs only |
| 10 | Lazy-load non-critical dependencies | Optional | Functions with multiple code paths that aren't always exercised |
| 11 | Avoid heavy DI containers in the hot path | Must | Always; especially mandatory when using Native AOT |
| Execution & Compute Efficiency | |||
| 12 | Use async/await — avoid .Result and .Wait() | Must | Always |
| 13 | Reuse HttpClient and AWS SDK clients | Must | Always |
| 14 | Set appropriate timeout | Must | Always |
| 15 | Use System.Text.Json over Newtonsoft.Json | Must | New functions; for existing functions, mandatory when targeting Native AOT |
| 16 | Avoid boxing and unnecessary allocations | Optional | High-throughput functions processing thousands of events per second |
| Concurrency & Scaling | |||
| 17 | Set reserved concurrency | Must | Any function in a shared account with other critical functions |
| 18 | Configure SQS batch size and ReportBatchItemFailures | Must | All SQS-triggered functions |
| 19 | Use event source filtering | Optional | When the function receives a mixed event stream and only acts on a subset |
| 20 | Tune maximum concurrency on event source mappings | Must | When the downstream target (DB, API) has a known parallelism limit |
| Networking & Integrations | |||
| 21 | Avoid VPC unless strictly necessary | Must | Always — only place in VPC when there is no alternative |
| 22 | Use VPC endpoints for AWS services | Must | Any VPC-attached function that calls AWS services |
| 23 | Enable connection pooling and keep-alive | Must | Always |
| 24 | Configure AWS SDK retry and timeout | Must | Always |
| Observability & Cost Visibility | |||
| 25 | Enable Lambda Insights | Optional | Functions where resource-level diagnostics (CPU, memory trend) are needed |
| 26 | Use structured logging with log level filtering | Must | Always |
| 27 | Set CloudWatch log retention policy | Must | Always |
| 28 | Use X-Ray or OpenTelemetry | Optional | Functions that call downstream services and where latency attribution matters |
| 29 | Tag all functions with cost allocation tags | Must | Any shared or multi-team AWS account |
| Architecture & Design | |||
| 30 | Keep functions single-purpose | Must | Always |
| 31 | Consider Lambda URLs vs API Gateway | Optional | Simple HTTP endpoints with no need for API Gateway features |
| 32 | Use Step Functions for orchestration | Must | Any workflow that chains Lambda calls synchronously |
| Security | |||
| 33 | Apply least-privilege IAM execution roles | Must | Always |
| 34 | Never store secrets in environment variables | Must | Always |
| 35 | Enable AWS WAF on API Gateway or CloudFront | Must | Any publicly exposed HTTP endpoint |
| 36 | Use resource-based policies to restrict invocation | Must | Always |
| 37 | Restrict Lambda Function URL auth | Must | Any Function URL not intentionally public |
| Operations | |||
| 38 | Configure Dead Letter Queues on async invocations | Must | All async-invoked functions |
| 39 | Use Lambda Destinations | Optional | When you need success-path routing or richer failure payloads beyond a basic DLQ |
| 40 | Alarm on errors, throttles, and duration | Must | Always |
| 41 | Use aliases and traffic shifting for deployments | Must | Any function where a bad deploy would have immediate user or data impact |
This checklist is a starting point. Adapt it to your team's context, add items that reflect your specific compliance requirements or architectural patterns, and remove items that genuinely do not apply to your workload. The goal is not a perfect score; it is a shared, team-maintained definition of what a production-ready Lambda function looks like for your organization. Thanks, and happy coding!



