AWS Lambda Functions: A Production Checklist

AWS Lambda has become a cornerstone of modern serverless architectures, but writing a function that "just works" is very different from writing one that is performant, cost-effective, secure, and ready for production at scale. For .NET developers, the distance between a naive implementation and a well-engineered one is surprisingly wide — it spans everything from cold start times and memory allocation, to secret management, deployment safety, and event loss prevention.
This guide covers ten areas of Lambda excellence for .NET, each derived from real-world production patterns and common failure modes. It assumes you are comfortable with the basics of Lambda and the AWS .NET SDK, and focuses on the why behind each practice as much as the how.
Runtime & Deployment Model
Use .NET 10 (or Latest LTS) on arm64
What it is: AWS Graviton2 processors power arm64 Lambda functions. They offer better performance-per-dollar for most workloads, and AWS charges approximately 20% less per GB-second compared to x86.
Why it matters: For a high-traffic function with 500ms average duration at 512 MB memory, switching from x86 to arm64 can save over $15/month per million daily invocations with zero code changes.
# template.yaml
Globals:
  Function:
    Runtime: dotnet10
    Architectures:
      - arm64
    MemorySize: 512

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: MyFunction::MyFunction.Function::FunctionHandler
      CodeUri: src/MyFunction/
Minimize Deployment Package Size
What it is: The deployment package is the ZIP artifact containing your compiled application and its dependencies, uploaded to Lambda via S3. Its size directly affects how long Lambda takes to extract and load your code during a cold start.
Why it matters: Lambda extracts and initializes your deployment package on every cold start. A 50 MB package takes significantly longer to initialize than a 5 MB one. Smaller packages also reduce S3 storage costs and deployment time.
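A minimal sketch of project settings that shrink the published output (these are standard .NET publish properties; whether trimming is safe depends on your dependencies, so verify trim warnings before shipping):

```xml
<PropertyGroup>
  <!-- Remove unused IL from dependencies at publish time -->
  <PublishTrimmed>true</PublishTrimmed>
  <!-- Drop ICU globalization data if invariant culture is sufficient -->
  <InvariantGlobalization>true</InvariantGlobalization>
  <!-- Keep debug symbols out of the deployment package -->
  <DebugType>none</DebugType>
</PropertyGroup>
```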
Use Native AOT Compilation
What it is: Ahead-of-Time (AOT) compilation converts your .NET code directly into a native binary at build time, eliminating the Just-In-Time (JIT) compiler that normally runs when a .NET application starts. For Lambda, this means the initialization phase — the most expensive part of a cold start — is dramatically faster.
Why it matters: Lambda bills in 1ms increments. Cold starts are fully billed duration. On a JIT-based .NET function, initialization can take 400–1200ms depending on the number of loaded assemblies and SDK clients. With Native AOT, this drops to 30–80ms.
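Enabling AOT is primarily a project-file change. A minimal sketch (AOT Lambda functions also require a custom-runtime bootstrap, typically via the Amazon.Lambda.RuntimeSupport package, and all serialization must be source-generated):

```xml
<PropertyGroup>
  <!-- Compile to a native binary at publish time; no JIT at startup -->
  <PublishAot>true</PublishAot>
  <!-- Strip native symbols to shrink the binary further -->
  <StripSymbols>true</StripSymbols>
</PropertyGroup>
```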
Use Lambda SnapStart (if not using AOT)
What it is: SnapStart takes a snapshot of the initialized execution environment after the Init phase completes and caches it. Subsequent cold starts restore from the snapshot instead of re-running initialization, reducing cold start latency by up to 90%.
Why it matters: If you have an existing JIT-based .NET function that you cannot easily migrate to AOT (e.g., it uses libraries incompatible with trimming), SnapStart delivers most of the cold start benefit without a code rewrite.
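SnapStart is a configuration change, not a code change. A sketch in SAM terms (SnapStart applies to published versions, so an alias is needed; the alias name is illustrative):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    SnapStart:
      ApplyOn: PublishedVersions
    # SnapStart does not apply to $LATEST — invoke through a published version
    AutoPublishAlias: live
```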
Memory & CPU Sizing
Profile with AWS Lambda Power Tuning
What it is: The AWS Lambda Power Tuning tool is an open-source Step Functions state machine that runs your function at multiple memory configurations (e.g., 128 MB, 256 MB, 512 MB, 1024 MB, 1769 MB, 3008 MB) and returns a cost and performance graph showing the optimal setting.
Why it matters: The relationship between memory and cost is not linear because CPU allocation scales with memory. A function that runs in 1000ms at 256 MB might run in 200ms at 1024 MB, making the higher-memory option cheaper despite a higher per-ms rate.
Don't Under-Allocate Memory
What it is: Lambda allocates CPU proportionally to memory. Setting memory too low starves your function of CPU, which can make it run slower and cost more despite the lower per-GB-second rate.
| Memory | CPU Allocation |
|---|---|
| 128 MB | ~7% of a vCPU |
| 512 MB | ~29% of a vCPU |
| 1769 MB | 1 full vCPU |
| 3584 MB | 2 full vCPUs |
Why it matters: A .NET function with heavy computation (JSON parsing, data transformation, LINQ operations) starved of CPU at 256 MB may run 4–8x slower than at 1769 MB, making it cost more despite the lower RAM pricing.
Set Memory Based on Actual Measurements
What it is: Every Lambda invocation emits a REPORT log line containing the actual peak memory used. This measured value is the correct baseline for sizing your memory allocation, rather than guessing.
Why it matters: Setting memory allocation to peak measured usage plus 15% headroom prevents out-of-memory errors while avoiding wasteful over-provisioning. Both extremes cost money: OOM errors cause failed invocations and retries; excessive headroom means you pay for memory you never use.
REPORT RequestId: abc123 Duration: 245.12 ms Billed Duration: 246 ms Memory Size: 512 MB Max Memory Used: 178 MB
Set your memory allocation to peakMB * 1.15 (15% headroom above peak measured usage).
Cold Start Optimization
Move Heavy Initialization Outside the Handler
What it is: In Lambda, code that runs at class construction or in static initializers executes only once per execution environment lifecycle — during the Init phase — not on every invocation. By moving expensive setup (SDK client construction, config loading, connection establishment) outside the handler method, you ensure it runs once and is reused.
Why it matters: Lambda reuses execution environments across invocations and freezes them between calls, preserving static state. Initialization that runs during the Init phase is billed as part of the cold start, which occurs infrequently. The same code inside the handler runs on every invocation, making every call slower and more expensive.
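A minimal sketch of the pattern, using a DynamoDB client as the example (the handler body is illustrative):

```csharp
using Amazon.DynamoDBv2;
using Amazon.Lambda.Core;

public class Function
{
    // Static field: constructed once per execution environment, during the
    // Init phase, and reused by every subsequent invocation.
    private static readonly AmazonDynamoDBClient DynamoDb = new();

    public async Task<string> FunctionHandler(string input, ILambdaContext context)
    {
        // No per-invocation client construction, connection setup, or DNS lookup
        var tables = await DynamoDb.ListTablesAsync();
        return $"Tables visible: {tables.TableNames.Count}";
    }
}
```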
Use Provisioned Concurrency for Latency-Critical Paths
What it is: Provisioned Concurrency pre-initializes a specified number of execution environments, keeping them warm and ready to handle requests with no cold start. The environments are fully initialized — your static constructors have already run.
Why it matters: For synchronous, user-facing functions (e.g., API endpoints powering a mobile app), cold start latency can directly degrade user experience. Provisioned Concurrency eliminates cold starts entirely for those pre-warmed environments. It should not be used for async workers, background jobs, or event-processing functions, where latency is not user-visible and the added cost is unjustified.
Lazy-Load Non-Critical Dependencies
What it is: For dependencies only needed in specific code paths, use Lazy<T> to defer their initialization until the first time that code path is actually executed, rather than initializing them unconditionally at startup.
Why it matters: Eagerly initializing every dependency at startup contributes to cold start duration and memory usage, even for code paths that may never be invoked in a given execution environment. Lazy loading ensures you only pay the initialization cost for dependencies that are actually used.
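A sketch of the pattern (the OrderEvent type, RequiresArchival flag, and bucket name are hypothetical):

```csharp
using Amazon.S3;
using Amazon.S3.Model;
using Amazon.Lambda.Core;

public class Function
{
    // Not constructed during Init; created on first use of the rare code path
    private static readonly Lazy<AmazonS3Client> S3 = new(() => new AmazonS3Client());

    public async Task FunctionHandler(OrderEvent input, ILambdaContext context)
    {
        if (input.RequiresArchival)   // most invocations never enter this branch
        {
            await S3.Value.PutObjectAsync(new PutObjectRequest
            {
                BucketName = "order-archive",   // assumed bucket name
                Key = input.OrderId,
                ContentBody = input.Payload
            });
        }
    }
}
```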
Avoid Heavy DI Containers in the Hot Path
What it is: Reflection-based dependency injection frameworks (Microsoft.Extensions.DependencyInjection with assembly scanning, Autofac, etc.) perform extensive reflection during container construction. This conflicts directly with both Native AOT (which requires all types to be known at compile time) and cold start minimization goals.
Why it matters: Reflection-based DI scanning can add hundreds of milliseconds to your Init phase and is incompatible with Native AOT trimming. Preferred alternatives are manual wiring (fastest, recommended for simple functions) and source-generated DI (for more complex compositions).
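A sketch of manual wiring (OrderRepository, OrderService, and OrderEvent are hypothetical types standing in for your own composition):

```csharp
using Amazon.DynamoDBv2;
using Amazon.Lambda.Core;

public class Function
{
    // Explicit construction: no reflection, no assembly scanning, AOT-friendly,
    // and the dependency graph is visible at a glance.
    private static readonly AmazonDynamoDBClient DynamoDb = new();
    private static readonly OrderRepository Repository = new(DynamoDb);
    private static readonly OrderService Service = new(Repository);

    public Task FunctionHandler(OrderEvent input, ILambdaContext context)
        => Service.ProcessAsync(input);
}
```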
Execution & Compute Efficiency
Use async/await Throughout — Avoid .Result and .Wait()
What it is: async/await enables non-blocking I/O in .NET — when a function awaits a network call, the thread is released to do other work. .Result and .Wait() are synchronous blocking calls that hold the thread until the operation completes.
Why it matters: Lambda bills for wall-clock time, not CPU time. A thread blocked on .Result or .Wait() holds the execution environment hostage while waiting for I/O, consuming billed duration even though it is doing nothing. It can also cause deadlocks in certain synchronization contexts.
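A small contrast sketch (the class and URIs are illustrative):

```csharp
using System.Net.Http;

public class OrdersClient
{
    private static readonly HttpClient Http = new();

    // Anti-pattern: blocks a thread for the full round trip; billed time
    // accrues and deadlock is possible under a SynchronizationContext.
    public string GetSync(string uri)
        => Http.GetStringAsync(uri).Result;

    // Correct: the thread is released while the request is in flight.
    public Task<string> GetAsync(string uri)
        => Http.GetStringAsync(uri);
}
```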
Reuse HttpClient and AWS SDK Clients Across Invocations
What it is: HttpClient and AWS SDK service clients (e.g., AmazonDynamoDBClient) are designed to be long-lived and thread-safe. They should be created once as static fields and reused across invocations, not constructed per-request.
Why it matters: Each new HttpClient() creates a new socket pool. Creating one per invocation rapidly exhausts available ports (the default ephemeral port range is ~30,000), causing socket exhaustion under load — a notoriously difficult production bug to diagnose. AWS SDK clients hold connection pools internally; re-creating them per invocation defeats connection reuse and adds DNS resolution overhead on every call.
Set Appropriate Timeout
What it is: Lambda's timeout setting defines the maximum duration an invocation is allowed to run before it is forcibly terminated. It is configurable from 1 second to 15 minutes per function.
Why it matters: Lambda's default timeout for new functions is 3 seconds — too short for most real workloads. But setting it to the maximum (15 minutes) "just in case" means a function that hangs due to a slow downstream service will bill the full 15 minutes before being terminated. Calculate your timeout based on measured performance:
Timeout = (p99 measured duration) × 2 + network overhead margin
For example, if your function normally completes in 800ms and has a p99 of 1.2 seconds, set timeout to 3–5 seconds, not 15 minutes.
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Timeout: 10   # seconds, not 900 (15 min)
Use System.Text.Json over Newtonsoft.Json
What it is: System.Text.Json (STJ) is the built-in .NET JSON library introduced in .NET Core 3.1. It is allocation-friendly, supports Span<T> for zero-copy parsing, and is fully compatible with Native AOT via source generation. Newtonsoft.Json is the older, widely used third-party library that predates STJ.
Why it matters: Newtonsoft.Json relies on reflection, is incompatible with Native AOT, adds ~500 KB to your deployment package, and is measurably slower for typical Lambda payloads. STJ delivers better throughput, lower memory pressure, and AOT compatibility with no external dependency.
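A sketch of STJ source generation (the OrderEvent type is hypothetical):

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

public record OrderEvent(string OrderId, decimal Amount);

// The source generator emits reflection-free, AOT-compatible serializer
// metadata for every type listed here.
[JsonSerializable(typeof(OrderEvent))]
public partial class LambdaJsonContext : JsonSerializerContext { }

// Usage inside the handler:
// var json  = JsonSerializer.Serialize(order, LambdaJsonContext.Default.OrderEvent);
// var order = JsonSerializer.Deserialize(json, LambdaJsonContext.Default.OrderEvent);
```

When STJ is also the Lambda event serializer, the context can be plugged in via SourceGeneratorLambdaJsonSerializer from the Amazon.Lambda.Serialization.SystemTextJson package.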
Avoid Boxing and Unnecessary Allocations in Hot Paths
What it is: Boxing occurs when a value type (e.g., int, struct) is implicitly converted to object, causing a heap allocation. Unnecessary allocations include creating objects, strings, arrays, or closures that are immediately discarded after a single use.
Why it matters: Lambda functions that process thousands of events per second generate enormous GC pressure if they produce many short-lived heap objects per invocation. GC pauses directly extend billed duration and increase tail latency.
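An illustrative fragment of where the allocations come from and how to avoid them:

```csharp
// Boxing: storing a value type as object forces a heap allocation
int count = 42;
object boxed = count;                        // allocates on the heap

// List<object> boxes every int it stores; List<int> does not
var boxes = new List<object> { 1, 2, 3 };    // three boxing allocations
var ints  = new List<int>    { 1, 2, 3 };    // zero

// Closures allocate when they capture; a 'static' lambda cannot capture,
// which turns accidental per-call captures into compile-time errors.
var evens = ints.FindAll(static n => n % 2 == 0);
```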
Concurrency & Scaling
Set Reserved Concurrency to Avoid Noisy-Neighbor Throttling
What it is: Reserved concurrency is a per-function setting that both guarantees a minimum concurrency allocation for a function and caps its maximum concurrency. By default, all Lambda functions in an AWS account share a regional concurrency pool (default: 1,000 concurrent executions).
Why it matters: Without reserved concurrency, a single function experiencing a traffic spike can consume the entire regional pool, throttling completely unrelated functions in the same account. Reserved concurrency prevents this by isolating a function's allocation from the shared pool.
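In SAM this is a single property (the value of 100 is illustrative; size it from measured peak concurrency):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    # Guarantees this function up to 100 concurrent executions and,
    # equally, caps it there so it cannot drain the regional pool.
    ReservedConcurrentExecutions: 100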
Configure SQS Batch Size and Batch Window
What it is: When Lambda polls an SQS queue, it can retrieve and process multiple messages in a single invocation (a batch). BatchSize controls the maximum number of messages per invocation. MaximumBatchingWindowInSeconds controls how long Lambda waits to fill the batch before invoking. ReportBatchItemFailures tells Lambda which specific messages in a batch failed, so only those are retried.
Why it matters: Larger batches mean fewer Lambda invocations, which directly reduces cost. Without ReportBatchItemFailures, if one message in a batch of 100 fails, all 100 are returned to the queue and reprocessed — a 100x amplification of failed work.
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Events:
        SQSTrigger:
          Type: SQS
          Properties:
            Queue: !GetAtt MyQueue.Arn
            BatchSize: 100
            MaximumBatchingWindowInSeconds: 5
            FunctionResponseTypes:
              - ReportBatchItemFailures
Also set the queue's message visibility timeout to at least six times the function timeout (AWS's recommendation for SQS event sources), so in-flight messages are not redelivered while a batch is still being processed.
Use Event Filtering on SQS/DynamoDB/Kinesis Triggers
What it is: Lambda event source filtering lets you define a filter pattern at the AWS service level. Messages that don't match the filter are discarded by the service before they ever invoke your function.
Why it matters: You are not billed for filtered-out events, and your function doesn't need to contain logic to discard irrelevant messages. This reduces invocation count, cost, and code complexity.
Events:
  SQSTrigger:
    Type: SQS
    Properties:
      Queue: !GetAtt MyQueue.Arn
      FilterCriteria:
        Filters:
          - Pattern: '{"body": {"eventType": ["ORDER_PLACED"]}}'
Tune Maximum Concurrency on Event Source Mappings
What it is: The ScalingConfig.MaximumConcurrency property on an event source mapping caps how many concurrent Lambda executions can be processing from that specific source simultaneously, independent of the function's overall reserved concurrency.
Why it matters: Without this limit, a large SQS queue backlog can scale Lambda to hundreds of concurrent executions simultaneously. This can overwhelm downstream databases or APIs not designed for that level of parallelism, causing cascading failures.
Events:
  SQSTrigger:
    Type: SQS
    Properties:
      Queue: !GetAtt MyQueue.Arn
      BatchSize: 10
      ScalingConfig:
        MaximumConcurrency: 10
Networking & Integrations
Avoid VPC Unless Strictly Necessary
What it is: Placing a Lambda function inside a VPC attaches an Elastic Network Interface (ENI) to the function's execution environment, giving it access to private VPC resources. This is required for accessing services that have no public endpoint, such as RDS or ElastiCache.
Why it matters: ENI attachment adds 100–500ms of cold start latency and introduces capacity constraints in subnets. Lambda functions outside a VPC still run in AWS-managed, isolated infrastructure with no public inbound access — the security benefit of VPC placement is often overstated. Services like DynamoDB, S3, SQS, and SNS are all accessible without a VPC (or via VPC endpoints if the function is already in one).
Use VPC Endpoints for AWS Services
What it is: VPC Endpoints (Gateway endpoints for S3/DynamoDB, Interface endpoints for most other services) route traffic between your Lambda function and AWS services privately within the AWS network, bypassing the public internet and NAT Gateway.
Why it matters: Without VPC endpoints, traffic from a VPC-based Lambda to AWS services routes through a NAT Gateway, which charges $0.045 per GB of data processed. For a high-throughput function reading large S3 objects, this adds up quickly. VPC endpoints eliminate the NAT Gateway data charge (Interface endpoints have a small hourly fee, but it is offset by avoided NAT costs above roughly 10 GB/month).
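A sketch of a Gateway endpoint for S3 in CloudFormation (the MyVpc and PrivateRouteTable resource names are assumptions standing in for your own networking resources):

```yaml
# Gateway endpoint: no hourly fee, no NAT data-processing charge
S3Endpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref MyVpc
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.s3"
    RouteTableIds:
      - !Ref PrivateRouteTable
```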
Enable Connection Pooling and Keep-Alive for HTTP
What it is: HTTP connection pooling reuses established TCP connections across multiple requests rather than opening a new connection for each one. In Lambda, SocketsHttpHandler manages this pool, and it persists across invocations as long as the execution environment stays warm.
Why it matters: Without connection reuse, each Lambda invocation (or each HTTP call within it) incurs TCP handshake and TLS negotiation overhead. Tuning pool settings prevents stale connections from causing errors while keeping connections alive long enough to benefit from reuse.
private static readonly HttpClient HttpClient = new HttpClient(new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(15),
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2),
    MaxConnectionsPerServer = 50,
    UseCookies = false
})
{
    Timeout = TimeSpan.FromSeconds(10)
};
Configure AWS SDK Retry and Timeout
What it is: The AWS SDK has a built-in retry policy that automatically retries failed requests with exponential backoff. The default configuration retries up to 3 times. Both the retry count and per-attempt timeout can be configured on each SDK client.
Why it matters: For a Lambda function with a 5-second timeout calling a struggling DynamoDB table, the SDK's default aggressive retry policy may consume the entire timeout window retrying, billing you for the full duration and still returning an error. Setting explicit retry limits and per-call timeouts gives you control over this behavior.
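A sketch of explicit SDK client configuration (the values are illustrative; derive yours so that retries times per-attempt timeout stays well inside the function timeout):

```csharp
using Amazon.DynamoDBv2;

// Cap retries and per-attempt time so SDK retries cannot consume
// the entire Lambda timeout window.
private static readonly AmazonDynamoDBClient DynamoDb = new(new AmazonDynamoDBConfig
{
    MaxErrorRetry = 2,
    Timeout = TimeSpan.FromSeconds(2)
});
```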
Observability & Cost Visibility
Enable Lambda Insights
What it is: Lambda Insights is an enhanced CloudWatch monitoring feature that captures per-invocation metrics including memory used, CPU time, init duration, and network I/O. It is enabled by attaching the managed LambdaInsightsExtension Lambda layer to your function.
Why it matters: The default Lambda CloudWatch metrics (invocations, duration, errors) don't surface resource-level detail like memory pressure or CPU utilization. Lambda Insights fills that gap, making it possible to identify memory leaks, CPU-bound invocations, and slow init phases without adding custom instrumentation.
Cost note: Lambda Insights charges for custom CloudWatch metrics (~$0.30 per metric per month) and additional log data. For high-traffic functions, filter what you emit.
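Enabling it is a layer attachment plus an IAM policy. A sketch (the layer version suffix is region-specific, shown here as a placeholder; look up the current ARN in the Lambda Insights documentation):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Layers:
      # AWS-managed Lambda Insights extension layer
      - !Sub "arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:<version>"
```

The execution role also needs the CloudWatchLambdaInsightsExecutionRolePolicy managed policy so the extension can publish metrics.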
Use Structured Logging with Log Level Filtering
What it is: Structured logging emits log entries as JSON objects with consistent fields (timestamp, level, message, correlation IDs, etc.) rather than plain text strings. Log level filtering suppresses verbose logs (e.g., Debug, Trace) in production while keeping Warning and Error output.
Why it matters: Plain string logs are hard to query at scale. Structured JSON logs can be filtered and aggregated efficiently with CloudWatch Logs Insights. Log level control reduces log volume and CloudWatch ingestion costs in production.
To accomplish this easily, use Powertools for AWS Lambda (.NET).
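A sketch using the Powertools Logging utility (the service name and fields are illustrative, and attribute parameter names may vary slightly between Powertools versions, so check the docs for the release you use):

```csharp
using Amazon.Lambda.Core;
using AWS.Lambda.Powertools.Logging;

public class Function
{
    // Emits structured JSON; entries below the configured minimum level are dropped
    [Logging(Service = "order-processing", LogLevel = LogLevel.Warning)]
    public void FunctionHandler(OrderEvent input, ILambdaContext context)
    {
        // Correlation field appended to every subsequent log entry
        Logger.AppendKey("orderId", input.OrderId);
        Logger.LogWarning("Order flagged for manual review");
    }
}
```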
Set CloudWatch Log Retention Policy
What it is: Each Lambda function automatically gets a CloudWatch Log Group. The default retention policy is "Never Expire," meaning logs accumulate indefinitely. You can configure a retention period (e.g., 14 days) via CloudFormation or the console.
Why it matters: CloudWatch Logs storage is billed at $0.03/GB. With "Never Expire," logs from high-volume functions accumulate indefinitely, and the bill grows silently. Explicitly setting a retention period caps storage costs and keeps log groups manageable.
Resources:
  MyFunctionLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/lambda/${MyFunction}"
      RetentionInDays: 14
Use X-Ray or OpenTelemetry for Distributed Tracing
What it is: AWS X-Ray is a distributed tracing service that instruments calls across Lambda invocations, AWS service calls (DynamoDB, S3, SQS), and outbound HTTP requests, producing service maps and per-segment latency breakdowns. OpenTelemetry is the vendor-neutral alternative that can export to X-Ray, Jaeger, or other backends.
Why it matters: Standard CloudWatch metrics tell you that a function is slow; distributed tracing tells you which downstream call is responsible. This is invaluable for identifying the specific DynamoDB query or external API call causing p99 latency spikes.
Globals:
  Function:
    Tracing: Active
To accomplish this easily, use Powertools for AWS Lambda (.NET).
Tag All Lambda Functions with Cost Allocation Tags
What it is: AWS resource tags are key-value pairs attached to resources. Cost Allocation Tags are tags you activate in the Billing console, after which AWS Cost Explorer can group and filter Lambda spending by those tag values.
Why it matters: Without consistent tagging, Lambda costs in a shared account appear as a single line item, making it impossible to attribute spending to specific teams, features, or environments. Tags enable per-team or per-service cost accountability.
Globals:
  Function:
    Tags:
      Team: payments
      Environment: production
      Service: order-processing
Architecture & Design
Keep Functions Single-Purpose
What it is: A single-purpose Lambda function handles one specific operation (e.g., process an order event, resize an image, send a notification) rather than branching across multiple unrelated operations based on input type.
Why it matters: A function that branches heavily based on input type ends up sized for its most demanding branch, wasting memory and compute for all other branches. Single-purpose functions are easier to size, tune independently, monitor, and debug.
Consider Lambda URLs vs API Gateway
What it is: Lambda Function URLs are built-in HTTPS endpoints directly on a Lambda function, requiring no additional AWS service. API Gateway is a fully managed service that sits in front of Lambda and adds features like request validation, authorizers, usage plans, WAF integration, and stage variables.
Why it matters: For simple HTTP endpoints that don't need API Gateway features, Lambda Function URLs eliminate the $1.00 per million request charge that HTTP API Gateway adds, reducing cost to Lambda invocation charges only. The right choice depends on whether you need the additional API Gateway capabilities.
| Feature | API Gateway HTTP | Lambda URL |
|---|---|---|
| Cost per million requests | $1.00 | $0 (Lambda invocation only) |
| CORS support | Yes | Yes |
| Auth | IAM, Cognito, Lambda authorizer | IAM, none |
| Custom domain | Yes | No (but works behind CloudFront) |
| WebSocket | Yes | No |
Use Step Functions for Orchestration over Chained Lambdas
What it is: AWS Step Functions is a serverless orchestration service that coordinates multi-step workflows. The alternative — chaining Lambda functions by having one function synchronously invoke another and wait for its response — keeps the calling function's execution environment alive and billing during the entire wait.
Why it matters: Synchronously chaining Lambda functions doubles or triples your Lambda bill because the caller is billed for the full wait duration. Step Functions handles orchestration at the service level; your Lambda functions execute independently and are only billed for their own compute time.
Security
Apply Least-Privilege IAM Execution Roles
What it is: Every Lambda function runs under an IAM execution role. This role defines exactly which AWS API calls the function is permitted to make. Many teams default to broad managed policies like AmazonDynamoDBFullAccess or even AdministratorAccess during development and never revisit them.
Why it matters: If your function is compromised through a vulnerability in your code or a dependency, the attacker inherits every permission in that execution role. A function that can only call dynamodb:GetItem on one specific table does far less damage than one with write access to all tables in the account.
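SAM's policy templates make scoped permissions concise. A sketch (OrdersTable is an assumed table resource in the same template):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Policies:
      # Read-only access to exactly one table — nothing else
      - DynamoDBReadPolicy:
          TableName: !Ref OrdersTable
```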
Never Store Secrets in Environment Variables or Source Code
What it is: Lambda environment variables are convenient but are stored in plaintext in the function configuration, visible to anyone with lambda:GetFunctionConfiguration or lambda:GetFunction IAM access. The correct alternative is AWS Secrets Manager or AWS Systems Manager Parameter Store (SecureString), where secrets are fetched at runtime and access is controlled via IAM.
Why it matters: Database passwords, API keys, and signing secrets in environment variables or source code have been the root cause of numerous high-profile cloud breaches. The AWS Shared Responsibility Model makes secret storage entirely your responsibility.
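A sketch of fetching a secret once per execution environment (the secret name is illustrative; combining Lazy<T> with the static-initialization pattern keeps the fetch off the per-invocation path):

```csharp
using Amazon.SecretsManager;
using Amazon.SecretsManager.Model;
using Amazon.Lambda.Core;

public class Function
{
    private static readonly AmazonSecretsManagerClient Secrets = new();

    // Fetched on first use, then cached for the environment's lifetime
    private static readonly Lazy<Task<string>> DbPassword = new(async () =>
        (await Secrets.GetSecretValueAsync(new GetSecretValueRequest
        {
            SecretId = "prod/orders/db-password"   // illustrative secret name
        })).SecretString);

    public async Task FunctionHandler(string input, ILambdaContext context)
    {
        var password = await DbPassword.Value;
        // ... open the downstream connection using the fetched secret
    }
}
```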
Enable AWS WAF on API Gateway or CloudFront
What it is: AWS WAF (Web Application Firewall) inspects HTTP requests before they reach your Lambda function, blocking common attack patterns (SQL injection, XSS, path traversal), known malicious IP ranges, and volumetric abuse. It is attached to API Gateway, CloudFront, or an Application Load Balancer.
Why it matters: Without WAF, a Lambda function exposed via API Gateway is reachable by anyone on the internet. Even a well-validated function can be overwhelmed by a flood of requests (costing money in Lambda invocations) or hit with sophisticated payloads designed to exploit dependencies.
Use Resource-Based Policies to Restrict Who Can Invoke
What it is: A Lambda resource policy (also called a function policy) defines which AWS principals — accounts, services, or IAM roles — are allowed to call lambda:InvokeFunction on your function. It is separate from the function's execution role, which controls what the function can do.
Why it matters: Without a restrictive resource policy, an API Gateway misconfiguration or an accidental public URL can expose your function to arbitrary invocation from the internet, leading to unexpected charges and potential data exposure.
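A sketch of a resource policy that allows only one specific API to invoke the function (MyApi is an assumed API Gateway resource in the same template):

```yaml
ApiInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref MyFunction
    Action: lambda:InvokeFunction
    Principal: apigateway.amazonaws.com
    # Only this API (any stage and route) may invoke the function
    SourceArn: !Sub "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${MyApi}/*"
```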
Restrict Lambda Function URL Auth
What it is: Lambda Function URLs support two authentication modes: AuthType: NONE (publicly invocable by anyone with the URL) and AuthType: AWS_IAM (requires a valid AWS Signature Version 4 signed request). The NONE mode is occasionally used for public webhooks but is a security risk for most production functions.
Why it matters: AuthType: NONE means the URL is publicly accessible with no authentication. Anyone who discovers or guesses the URL can invoke your function at your expense and potentially exfiltrate or corrupt data. Use AWS_IAM for any function that is not intentionally public.
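In SAM this is one property on the function:

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionUrlConfig:
      AuthType: AWS_IAM   # never NONE unless the endpoint is deliberately public
```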
Operations
Configure Dead Letter Queues on Async Invocations
What it is: A Dead Letter Queue (DLQ) is an SQS queue or SNS topic configured to receive events that Lambda failed to process after exhausting its retry attempts. For asynchronous invocations (from SNS, S3 events, EventBridge, or direct async InvokeFunction calls), Lambda retries failed invocations twice by default before discarding the event.
Why it matters: Without a DLQ, a transient downstream failure (a throttled DynamoDB table, a network timeout) can cause permanent, silent data loss with no alert and no recovery path. Silent event loss is one of the hardest production bugs to detect.
Resources:
  MyFunctionDLQ:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: my-function-dlq
      MessageRetentionPeriod: 1209600   # 14 days in seconds

  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: dotnet10
      DeadLetterQueue:
        Type: SQS
        TargetArn: !GetAtt MyFunctionDLQ.Arn
      EventInvokeConfig:
        MaximumRetryAttempts: 2
        MaximumEventAgeInSeconds: 3600
Always alarm on DLQ depth so a backlog is immediately visible.
Use Lambda Destinations for Async Success and Failure Routing
What it is: Lambda Destinations are a more flexible evolution of DLQs. They let you route both successful and failed async invocations to an SQS queue, SNS topic, EventBridge bus, or another Lambda function, with the full original event and function response or error details included in the routed payload.
Why it matters: DLQs receive only the original input event on failure. Destinations receive the original event plus the function response or error details for both successes and failures, making post-processing, alerting, and conditional routing significantly richer. They also enable success-path routing, which DLQs cannot do.
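A sketch of routing both outcomes (SuccessQueue and FailureTopic are assumed resources in the same template):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      DestinationConfig:
        OnSuccess:
          Type: SQS
          Destination: !GetAtt SuccessQueue.Arn
        OnFailure:
          Type: SNS
          Destination: !Ref FailureTopic
```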
Alarm on Errors, Throttles, and Duration Breaches
What it is: Lambda publishes several CloudWatch metrics out of the box: Errors, Throttles, Duration, ConcurrentExecutions, and IteratorAge (for stream-based triggers). CloudWatch Alarms can monitor these metrics and trigger notifications or automated actions when thresholds are breached.
Why it matters: None of these metrics have alarms by default — you must create them explicitly. Without alarms, a function silently failing 10% of invocations, hitting account-level concurrency limits, or processing events that are hours behind will go undetected until a user or downstream system reports a problem.
ErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${MyFunction}-ErrorRate"
    Metrics:
      - Id: errors
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Errors
            Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
          Period: 60
          Stat: Sum
        ReturnData: false   # input to the expression, not alarmed on directly
      - Id: invocations
        MetricStat:
          Metric:
            Namespace: AWS/Lambda
            MetricName: Invocations
            Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
          Period: 60
          Stat: Sum
        ReturnData: false
      - Id: errorRate
        Expression: "errors / invocations * 100"
        Label: ErrorRate
        ReturnData: true    # exactly one metric in a math alarm may return data
    ComparisonOperator: GreaterThanThreshold
    Threshold: 1
    EvaluationPeriods: 2
    TreatMissingData: notBreaching
    AlarmActions: [!Ref OpsAlertTopic]

ThrottleAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${MyFunction}-Throttles"
    MetricName: Throttles
    Namespace: AWS/Lambda
    Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions: [!Ref OpsAlertTopic]

DurationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${MyFunction}-DurationP99"
    MetricName: Duration
    Namespace: AWS/Lambda
    Dimensions: [{ Name: FunctionName, Value: !Ref MyFunction }]
    ExtendedStatistic: p99
    Period: 300
    EvaluationPeriods: 3
    Threshold: 8000   # 8 seconds if timeout is 10 seconds
    ComparisonOperator: GreaterThanThreshold
    AlarmActions: [!Ref OpsAlertTopic]
Use Lambda Aliases and Traffic Shifting for Safe Deployments
What it is: Lambda aliases are named pointers to specific function versions (e.g., live → version 12). Traffic shifting, configured through AWS CodeDeploy, allows you to route a percentage of invocations to a new version while the majority still goes to the current stable version — a serverless canary or linear deployment.
Why it matters: Deploying a new function version directly to 100% of traffic has no rollback path faster than re-deploying the previous version, which takes 30–90 seconds. With canary or linear deployments, CodeDeploy monitors your CloudWatch alarms and automatically rolls back to the previous version the moment an alarm fires — often within 60 seconds of a bad deployment.
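A sketch of a canary deployment in SAM (the alias name, deployment type, and alarm reference are illustrative; the referenced alarm must exist in the template):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    AutoPublishAlias: live
    DeploymentPreference:
      # Route 10% of traffic to the new version for 5 minutes, then the rest;
      # roll back automatically if the listed alarms fire during the window.
      Type: Canary10Percent5Minutes
      Alarms:
        - !Ref ErrorRateAlarm
```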
Quick Reference
Must = required for any production function regardless of workload. Optional = strongly advisable but context-dependent — the "When to apply" column clarifies when it becomes effectively mandatory.
| # | Practice | Priority | When to apply |
|---|---|---|---|
| Runtime & Deployment | |||
| 1 | Use latest LTS .NET on arm64 | Optional | New functions or runtime upgrades; skip if library incompatibility blocks migration |
| 2 | Minimize deployment package size | Must | Always |
| 3 | Use Native AOT compilation | Optional | New functions; skip if dependencies are AOT-incompatible |
| 4 | Use Lambda SnapStart | Optional | JIT-based functions where cold start is a problem and AOT is not feasible |
| Memory & CPU Sizing | |||
| 5 | Profile with Lambda Power Tuning | Must | Before any function goes to production; re-run after significant code changes |
| 6 | Don't under-allocate memory | Must | Always |
| 7 | Set memory based on actual measurements | Must | Always |
| Cold Start Optimization | |||
| 8 | Move heavy initialization outside the handler | Must | Always |
| 9 | Use Provisioned Concurrency | Optional | Synchronous user-facing functions with strict latency SLAs only |
| 10 | Lazy-load non-critical dependencies | Optional | Functions with multiple code paths that aren't always exercised |
| 11 | Avoid heavy DI containers in the hot path | Must | Always; especially mandatory when using Native AOT |
| Execution & Compute Efficiency | |||
| 12 | Use async/await — avoid .Result and .Wait() | Must | Always |
| 13 | Reuse HttpClient and AWS SDK clients | Must | Always |
| 14 | Set appropriate timeout | Must | Always |
| 15 | Use System.Text.Json over Newtonsoft.Json | Must | New functions; for existing functions, mandatory when targeting Native AOT |
| 16 | Avoid boxing and unnecessary allocations | Optional | High-throughput functions processing thousands of events per second |
| Concurrency & Scaling | |||
| 17 | Set reserved concurrency | Must | Any function in a shared account with other critical functions |
| 18 | Configure SQS batch size and ReportBatchItemFailures | Must | All SQS-triggered functions |
| 19 | Use event source filtering | Optional | When the function receives a mixed event stream and only acts on a subset |
| 20 | Tune maximum concurrency on event source mappings | Must | When the downstream target (DB, API) has a known parallelism limit |
| Networking & Integrations | |||
| 21 | Avoid VPC unless strictly necessary | Must | Always — only place in VPC when there is no alternative |
| 22 | Use VPC endpoints for AWS services | Must | Any VPC-attached function that calls AWS services |
| 23 | Enable connection pooling and keep-alive | Must | Always |
| 24 | Configure AWS SDK retry and timeout | Must | Always |
| Observability & Cost Visibility | |||
| 25 | Enable Lambda Insights | Optional | Functions where resource-level diagnostics (CPU, memory trend) are needed |
| 26 | Use structured logging with log level filtering | Must | Always |
| 27 | Set CloudWatch log retention policy | Must | Always |
| 28 | Use X-Ray or OpenTelemetry | Optional | Functions that call downstream services and where latency attribution matters |
| 29 | Tag all functions with cost allocation tags | Must | Any shared or multi-team AWS account |
| Architecture & Design | |||
| 30 | Keep functions single-purpose | Must | Always |
| 31 | Consider Lambda URLs vs API Gateway | Optional | Simple HTTP endpoints with no need for API Gateway features |
| 32 | Use Step Functions for orchestration | Must | Any workflow that chains Lambda calls synchronously |
| Security | |||
| 33 | Apply least-privilege IAM execution roles | Must | Always |
| 34 | Never store secrets in environment variables | Must | Always |
| 35 | Enable AWS WAF on API Gateway or CloudFront | Must | Any publicly exposed HTTP endpoint |
| 36 | Use resource-based policies to restrict invocation | Must | Always |
| 37 | Restrict Lambda Function URL auth | Must | Any Function URL not intentionally public |
| Operations | |||
| 38 | Configure Dead Letter Queues on async invocations | Must | All async-invoked functions |
| 39 | Use Lambda Destinations | Optional | When you need success-path routing or richer failure payloads beyond a basic DLQ |
| 40 | Alarm on errors, throttles, and duration | Must | Always |
| 41 | Use aliases and traffic shifting for deployments | Must | Any function where a bad deploy would have immediate user or data impact |
This checklist is a starting point. Adapt it to your team's context, add items that reflect your specific compliance requirements or architectural patterns, and remove items that genuinely do not apply to your workload. The goal is not a perfect score; it is a shared, team-maintained definition of what a production-ready Lambda function looks like for your organization. Thanks, and happy coding!



