APM with Lumigo: Distributed Tracing Setup

In modern cloud-native architectures, a single user request can traverse dozens of services: API Gateway → Lambda → SQS → Lambda → DynamoDB → S3 → another Lambda → RDS. When that request fails or performs poorly, traditional monitoring tools show you that something broke, but not where or why across this distributed system. For teams managing cloud-native architectures, observability is critical.

This is where Application Performance Monitoring (APM) with distributed tracing becomes essential. Distributed tracing gives you end-to-end visibility into request flows, allowing you to identify bottlenecks, debug failures, and optimize performance across your entire microservices or serverless architecture. See our real metrics page for production monitoring examples.

In this comprehensive guide, we'll walk through setting up distributed tracing with Lumigo - a serverless-first APM platform that excels at automatically instrumenting AWS Lambda functions, capturing async event flows, and providing deep context without code changes. We'll cover everything from initial setup to production best practices, using real-world examples from production deployments.

What You'll Learn: How distributed tracing works, Lumigo's unique serverless-first approach, step-by-step integration for Node.js, Python, Java, and containers, visualizing traces, troubleshooting common issues, and production optimization strategies.

At a glance: 57% faster API response, 100% auto-instrumentation.


Lumigo APM dashboard providing end-to-end visibility into distributed request flows

1. Introduction: Why Distributed Tracing Matters

What is APM and Distributed Tracing?

Application Performance Monitoring (APM) is the practice of tracking application performance metrics, errors, and user experience in real time. Distributed tracing is a specific APM technique that follows a request as it flows through multiple services, creating a "trace" that shows the complete journey.

Think of distributed tracing like a package tracking system: when you ship a package, you get a tracking number that shows every stop along the way - picked up, sorted, in transit, delivered. Distributed tracing does the same for requests: it shows you every service the request touched, how long it spent in each, and where it encountered errors or slowdowns.

The Challenge with Modern Cloud-Native Systems

Traditional monitoring tools were built for monolithic applications running on servers you control. Modern architectures present unique challenges:

Serverless Functions: Lambda functions are ephemeral, stateless, and scale to zero. Traditional agents don't work here. For AWS Lambda optimization, see our cost optimization case study.
Async Event Flows: SQS → Lambda → SNS → Lambda → DynamoDB. How do you trace a request that spans multiple async invocations?
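Cold Starts: Lambda cold starts can add anywhere from a few hundred milliseconds to several seconds to response times, depending on runtime, package size, and configuration. You need visibility into initialization time.
No Servers to Instrument: You can't install host-level agents on Lambda execution environments. Instrumentation has to ship with the function itself, as a layer, extension, or library.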
High Cardinality: With thousands of functions and millions of invocations, you need intelligent sampling. AIOps platforms help manage this complexity.

Why Lumigo?

Lumigo was built specifically for serverless and cloud-native architectures. Here's what makes it unique:

Zero-Code Instrumentation: Automatically traces AWS Lambda, API Gateway, SQS, SNS, DynamoDB, S3, and more without code changes
Async Event Tracing: Automatically connects related invocations across async boundaries (SQS → Lambda → SNS)
Deep Context: Captures request/response payloads, environment variables, and full stack traces
Minimal Overhead
Serverless-First: Designed for Lambda, containers, and microservices from the ground up

💡 Real-World Impact: In a production deployment, we used Lumigo to identify that 8 API endpoints were spending 40% of their time in DynamoDB queries. By optimizing those queries and adding caching, we reduced average API response time by 57% - from 420ms to 180ms.

Problems Lumigo Solves

Here are specific scenarios where Lumigo's distributed tracing shines:

Cold Start Debugging

Lambda cold starts are a major performance concern. Lumigo automatically identifies cold starts in traces and shows you initialization time, helping you optimize your functions and identify which ones need provisioned concurrency.

Async Event Flow Tracking

When a user uploads a file that triggers S3 → Lambda → SQS → Lambda → DynamoDB, traditional tools show you disconnected invocations. Lumigo automatically connects these into a single trace, showing the complete flow.

Latency Bottleneck Identification

A request takes 2 seconds, but where is the time spent? Lumigo's waterfall charts show you exactly: 200ms in API Gateway, 1500ms in Lambda (including 800ms in a DynamoDB query), 300ms in another Lambda. You can immediately see the bottleneck.

Error Root Cause Analysis

An error occurs, but which service caused it? Lumigo shows you the complete error chain across services, with full stack traces and payloads, making debugging dramatically faster.

2. Core Concepts: Understanding Distributed Tracing

Spans, Transactions, and Trace IDs

Distributed tracing is built on three fundamental concepts:

Spans

A span represents a single operation within a trace. Each span has:

Name: What operation was performed (e.g., "DynamoDB.Query", "Lambda.Invoke")
Start/End Time: When the operation began and completed
Duration: How long it took
Tags/Metadata: Additional context (HTTP status, error messages, resource names)
Parent Span ID: Which span this operation is part of
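The fields above map naturally onto a small record type. The sketch below is an illustrative Python model, not Lumigo's internal span schema. Field names such as `start_ms` and the 16-hex-character span ID are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class Span:
    """Minimal span record mirroring the fields listed above (illustrative schema)."""
    name: str
    trace_id: str
    start_ms: float
    end_ms: float
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    tags: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        # Duration is derived from start/end instead of being stored separately
        return self.end_ms - self.start_ms


root = Span("Lambda.Invoke processOrder", trace_id="abc123xyz", start_ms=0, end_ms=420)
child = Span("DynamoDB.Query getUser", trace_id=root.trace_id, start_ms=0, end_ms=180,
             parent_span_id=root.span_id, tags={"table": "Users"})
print(child.duration_ms, child.parent_span_id == root.span_id)
```

The parent span ID is what lets a backend rebuild the tree. The shared trace ID is what groups spans from different services into one request.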

Traces

A trace is a collection of spans that represent a complete request flow. All spans in a trace share the same trace ID, which allows you to correlate operations across services.

Transactions

A transaction (in Lumigo's terminology) is a high-level operation that represents a user-facing action, like "Process Payment" or "Upload File". A transaction contains multiple traces and spans.

Example Trace Structure:

Trace ID: abc123xyz
├─ Span: API Gateway (50ms)
│  └─ Span: Lambda Function "processOrder" (420ms)
│     ├─ Span: DynamoDB Query "getUser" (180ms)
│     ├─ Span: Lambda Invoke "validatePayment" (150ms)
│     │  └─ Span: External API Call "stripe.com" (120ms)
│     └─ Span: DynamoDB Put "saveOrder" (90ms)
└─ Span: Response (5ms)

Total Duration: 475ms
Bottleneck: DynamoDB Query (180ms = 38% of total time)
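The bottleneck figure comes from comparing each span's own time against the total. Here is a small sketch of that arithmetic using the example values. It assumes API Gateway's 50ms is its own overhead, which is how the 475ms total adds up (50 + 420 + 5). A span's "self time" is its duration minus its children's durations.

```python
from typing import List, Tuple

# (name, duration_ms, children), using the example values from the trace above.
# API Gateway's 50ms is treated as its own overhead, so top-level durations sum to the total.
Node = Tuple[str, int, list]

TRACE: List[Node] = [
    ("API Gateway", 50, []),
    ("Lambda processOrder", 420, [
        ("DynamoDB Query getUser", 180, []),
        ("Lambda Invoke validatePayment", 150, [
            ("External API Call stripe.com", 120, []),
        ]),
        ("DynamoDB Put saveOrder", 90, []),
    ]),
    ("Response", 5, []),
]


def self_times(nodes: List[Node]) -> List[Tuple[str, int]]:
    """Time spent in each span excluding its children (its 'self time')."""
    out = []
    for name, duration, children in nodes:
        child_total = sum(d for _, d, _ in children)
        out.append((name, duration - child_total))
        out.extend(self_times(children))
    return out


total = sum(d for _, d, _ in TRACE)
name, worst = max(self_times(TRACE), key=lambda item: item[1])
print(f"Total {total}ms, bottleneck {name} ({worst}ms = {worst / total:.0%})")
```

Using self time rather than raw duration keeps a parent span such as processOrder from being flagged just because it contains slow children.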

How Microservices/Serverless Need Specialized Tracing

Traditional APM tools assume:

Long-lived processes where you can install agents
Synchronous request/response patterns
Stable network connections between services
Centralized logging and metrics

Serverless and microservices break these assumptions:

Ephemeral Execution: Lambda functions exist for milliseconds, then disappear. You can't install persistent agents.
Async Boundaries: SQS, SNS, EventBridge create async boundaries where trace context must be propagated through message attributes.
High Concurrency: Thousands of concurrent invocations require efficient, low-overhead instrumentation.
Multi-Cloud: Requests might span AWS, GCP, and Azure services.

Lumigo solves this by:

Using Lambda Layers for zero-code instrumentation (no code changes needed)
Automatically extracting trace context from AWS service metadata (X-Ray, CloudWatch Logs)
Propagating trace context through SQS message attributes, SNS metadata, and EventBridge detail
Using intelligent sampling to handle high-volume workloads cost-effectively

Lumigo Architecture Overview

Lumigo's architecture consists of three main components:

Lumigo Architecture:

Your Application
  Lambda 1    Lambda 2    Lambda 3
      └───────────┼───────────┘
                  ▼
     Lumigo Tracer (Lambda Layer)
                  │
                  ▼
   Lumigo Collector (AWS Service)
                  │
                  ▼
      Lumigo Dashboard (Web UI)

Data Flow:
1. Tracer captures spans from your functions
2. Collector aggregates and processes traces
3. Dashboard visualizes and allows querying

The Tracer

The Lumigo tracer is deployed as a Lambda Layer (or container sidecar) that automatically instruments your functions. It:

Captures function invocations, AWS SDK calls, and HTTP requests
Extracts trace context from incoming requests
Propagates trace context to downstream services
Sends trace data to the Lumigo collector
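To make the tracer's job concrete, here is a toy wrapper in the same spirit. It is a generic sketch, not Lumigo's tracer. It records one span per invocation, reuses an incoming trace ID when present, flags the first call in an execution environment as a cold start, and captures errors. The `trace_id` event key and the in-memory `RECORDED_SPANS` list are assumptions standing in for real context extraction and for sending data to a collector.

```python
import functools
import time
import uuid

_COLD_START = True   # module-level state survives between warm invocations
RECORDED_SPANS = []  # stand-in for "send to the collector"


def trace_handler(handler):
    """Toy tracer: wraps a Lambda-style handler and records one span per invocation."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        global _COLD_START
        span = {
            "name": handler.__name__,
            # Continue the caller's trace if context arrived with the event
            "trace_id": (event or {}).get("trace_id") or uuid.uuid4().hex,
            "cold_start": _COLD_START,
            "error": None,
        }
        _COLD_START = False
        start = time.perf_counter()
        try:
            return handler(event, context)
        except Exception as exc:
            span["error"] = repr(exc)
            raise  # never swallow the function's own errors
        finally:
            span["duration_ms"] = (time.perf_counter() - start) * 1000
            RECORDED_SPANS.append(span)
    return wrapper


@trace_handler
def handler(event, context=None):
    if event.get("fail"):
        raise ValueError("boom")
    return {"statusCode": 200}


handler({"trace_id": "abc123xyz"})
try:
    handler({"fail": True})
except ValueError:
    pass
print([(s["trace_id"][:9], s["cold_start"], s["error"]) for s in RECORDED_SPANS])
```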

The Collector

Lumigo's collector runs as an AWS service that:

Receives trace data from tracers
Correlates spans into complete traces
Stores traces for querying and analysis
Applies sampling rules and retention policies
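Correlating spans into traces is essentially a group-by on trace ID followed by linking each span to its parent. The sketch below shows that step in isolation, using plain dicts with `trace_id`, `span_id`, and `parent_span_id` keys (an assumed shape, not Lumigo's wire format). Spans whose parent never arrived are treated as roots, which is one reasonable fallback when data is incomplete.

```python
from collections import defaultdict


def assemble_traces(spans):
    """Group flat spans by trace_id, then attach each span to its parent.

    Returns {trace_id: [root spans]}; each span copy gains a "children" list.
    """
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(dict(span, children=[]))

    traces = {}
    for trace_id, members in by_trace.items():
        index = {s["span_id"]: s for s in members}
        roots = []
        for s in members:
            parent = index.get(s.get("parent_span_id"))
            if parent is None:
                roots.append(s)  # true root, or parent not received (yet)
            else:
                parent["children"].append(s)
        traces[trace_id] = roots
    return traces


incoming = [  # arrival order is arbitrary: two Lambdas and one DynamoDB call
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "DynamoDB.Query"},
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "Lambda processOrder"},
    {"trace_id": "t2", "span_id": "c", "parent_span_id": None, "name": "Lambda other"},
]
traces = assemble_traces(incoming)
```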

The Dashboard

The Lumigo web dashboard provides:

Transaction maps showing request flows
Waterfall charts for latency analysis
Error tracking and alerting
Service dependency graphs
Payload inspection and debugging tools

3. Setting Up Lumigo

Account Setup

Getting started with Lumigo is straightforward:

Sign up at lumigo.io (free trial available)
Connect your AWS account using CloudFormation or Terraform
Lumigo creates the necessary IAM role in your account. The tracer layers themselves are published from Lumigo's AWS account (the 114300393969 in the layer ARNs below).
Get your Lumigo token from the dashboard (you'll need this for configuration)

Installing the Lumigo CLI

The Lumigo CLI makes it easy to wrap and deploy functions. Install it:

# Install the Lumigo CLI (published on npm as lumigo-cli)
npm install -g lumigo-cli

# Verify installation
lumigo --version

Connecting AWS Lambda Functions

There are three ways to instrument Lambda functions with Lumigo:

Method 1: Lambda Layer (Recommended)

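Add the Lumigo layer to your Lambda function. This requires no changes to your function code, but the layer only takes effect once the function's handler points at Lumigo's wrapper and the original handler is passed in LUMIGO_ORIGINAL_HANDLER. Node.js values are shown; look up the layer version (XXX) for your region.

# Using AWS CLI
# Note: --layers and --environment each REPLACE the function's existing values
aws lambda update-function-configuration \
  --function-name my-function \
  --layers arn:aws:lambda:us-east-1:114300393969:layer:lumigo-node-tracer:XXX \
  --handler lumigo-auto-instrument.handler \
  --environment '{"Variables": {
    "LUMIGO_ORIGINAL_HANDLER": "index.handler",
    "LUMIGO_TRACER_TOKEN": "your-token-here",
    "LUMIGO_DEBUG": "false"
  }}'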

Method 2: Serverless Framework

If you're using the Serverless Framework, add the serverless-lumigo plugin and give it your token under custom.lumigo. The plugin takes care of instrumenting the functions, so you don't need to list the layer yourself:

# serverless.yml
plugins:
  - serverless-lumigo

custom:
  lumigo:
    token: ${env:LUMIGO_TRACER_TOKEN}

functions:
  myFunction:
    handler: src/handler.myFunction

Method 3: Terraform

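resource "aws_lambda_function" "my_function" {
  function_name = "my-function"
  runtime       = "nodejs18.x"
  # Point the handler at Lumigo's wrapper; the real handler moves to LUMIGO_ORIGINAL_HANDLER
  handler       = "lumigo-auto-instrument.handler"
  # role, filename, and other required arguments omitted for brevity

  layers = [
    "arn:aws:lambda:us-east-1:114300393969:layer:lumigo-node-tracer:XXX"
  ]

  environment {
    variables = {
      LUMIGO_ORIGINAL_HANDLER = "index.handler"
      LUMIGO_TRACER_TOKEN     = var.lumigo_token
      LUMIGO_DEBUG            = "false"
    }
  }
}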

4. Configuration Examples by Runtime

Node.js Setup

For Node.js Lambda functions, Lumigo automatically instruments:

AWS SDK v2 and v3 calls
HTTP/HTTPS requests
Database connections (MongoDB, PostgreSQL, MySQL)
Async/await and Promise chains

Basic Setup (Zero Code Changes)

// No code changes needed! Just add the layer and environment variables.

// Your existing handler works as-is. AWS SDK v2 is shown; nodejs18.x and later
// runtimes don't bundle v2, so package it with the function.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event, context) => {
  const result = await dynamodb.get({
    TableName: 'Users',
    Key: { userId: event.userId }
  }).promise();

  return result.Item;
};

Manual Instrumentation (Advanced)

For custom spans or additional context, you can instrument manually. The trace() wrapper follows the tracer's usual setup; the span calls (createSpan, setTag, finish) are illustrative, so check the @lumigo/tracer documentation for the manual-instrumentation API your version provides:

const lumigo = require('@lumigo/tracer')({
  token: process.env.LUMIGO_TRACER_TOKEN
});

exports.handler = lumigo.trace(async (event, context) => {
  // Create a custom span for a specific operation
  const span = lumigo.createSpan('custom-operation', {
    'custom.tag': 'value',
    'operation.type': 'data-processing'
  });

  try {
    // Your business logic (processData is a hypothetical helper)
    const result = await processData(event);

    span.setTag('result.size', result.length);
    return result;
  } catch (error) {
    span.setTag('error', true);
    span.setTag('error.message', error.message);
    throw error;
  } finally {
    span.finish();
  }
});

Python Setup

Python functions are automatically instrumented for:

boto3 (AWS SDK)
requests, urllib3 (HTTP libraries)
SQLAlchemy, psycopg2 (database libraries)

Basic Setup

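# No code changes needed with Lambda Layer

# Your existing handler:
import json

import boto3

# Created once per execution environment and reused by warm invocations
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Users')


def handler(event, context):
    response = table.get_item(Key={'userId': event['userId']})

    return {
        'statusCode': 200,
        # default=str handles the Decimal values boto3 returns for DynamoDB numbers
        'body': json.dumps(response.get('Item'), default=str),
    }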

Manual Instrumentation

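import os

from lumigo_tracer import lumigo_tracer

# The decorator is the tracer's usual entry point. The span API used below
# (lumigo_tracer.span, set_tag) is illustrative; check the lumigo_tracer docs
# for the manual-instrumentation calls your version supports.
@lumigo_tracer(token=os.environ.get('LUMIGO_TRACER_TOKEN'))
def handler(event, context):
    # Create custom span
    with lumigo_tracer.span('custom-operation') as span:
        span.set_tag('operation.type', 'data-processing')

        try:
            result = process_data(event)  # your business logic (hypothetical helper)
            span.set_tag('result.size', len(result))
            return result
        except Exception as e:
            span.set_tag('error', True)
            span.set_tag('error.message', str(e))
            raise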

Java Setup

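For Java Lambda functions, add the Lumigo dependency. Confirm the current Maven coordinates, version, and initialization API in Lumigo's documentation; the values below are the ones this guide was written with.

// pom.xml
<dependency>
  <groupId>io.lumigo</groupId>
  <artifactId>lumigo-java-tracer</artifactId>
  <version>1.0.0</version>
</dependency>

// Handler.java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import io.lumigo.Lumigo; // illustrative initialization API; check the tracer docs
import java.util.Map;

public class Handler implements RequestHandler<Map<String, Object>, String> {
  // Initialized once per execution environment and reused by warm invocations
  private static final Table USERS;

  static {
    Lumigo.init(System.getenv("LUMIGO_TRACER_TOKEN"));
    USERS = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient()).getTable("Users");
  }

  @Override
  public String handleRequest(Map<String, Object> event, Context context) {
    // Your code - automatically instrumented
    Item item = USERS.getItem("userId", event.get("userId"));
    return item == null ? null : item.toJSON();
  }
}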

Container Setup (Docker/Kubernetes)

For containerized applications, you can use Lumigo's auto-instrumentation or sidecar pattern:

Auto-Instrumentation (Node.js Container)

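# Dockerfile
FROM node:18-alpine
WORKDIR /app

# Install dependencies; package.json is assumed to list Lumigo's
# OpenTelemetry distro for Node.js (@lumigo/opentelemetry)
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Load the distro before application code. Supply LUMIGO_TRACER_TOKEN at
# runtime (e.g. from a Kubernetes secret) instead of baking it into the image.
ENV NODE_OPTIONS="-r @lumigo/opentelemetry"
ENV LUMIGO_DEBUG=false

CMD ["node", "app.js"]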

Kubernetes Deployment

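apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  # A Deployment needs a selector that matches the pod template's labels
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest
          env:
            - name: LUMIGO_TRACER_TOKEN
              valueFrom:
                secretKeyRef:
                  name: lumigo-secrets
                  key: token
            - name: LUMIGO_DEBUG
              value: "false"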

5. Distributed Tracing Setup

How Lumigo Injects Trace Context

Lumigo automatically propagates trace context across service boundaries. Here's how it works:

HTTP Requests

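When your Lambda makes an HTTP request, Lumigo adds trace headers so the downstream call can be linked to the same trace. The header names below are illustrative. The headers actually sent depend on the tracer version and configuration; tracers commonly use AWS's X-Amzn-Trace-Id or the W3C traceparent header.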

// Automatically added headers:
X-Trace-Id: abc123xyz
X-Span-Id: def456uvw
X-Parent-Span-Id: ghi789rst

Downstream services that are also instrumented with Lumigo will pick up these headers and continue the trace.
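For comparison with the illustrative headers above, the W3C Trace Context standard defines a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below builds and parses it with the standard library. It is a generic illustration of header-based propagation, not Lumigo's implementation, and it only handles version `00`.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id=None, sampled=True):
    """Build a traceparent header for an outgoing call (new span, same or new trace)."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars, shared by the whole trace
    span_id = secrets.token_hex(8)                # 16 hex chars, identifies this hop
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return trace_id, parent_id, (int(flags, 16) & 0x01) == 1


outgoing = make_traceparent()
trace_id, parent_id, sampled = parse_traceparent(outgoing)
downstream = make_traceparent(trace_id)  # the next hop keeps the trace ID, gets a new span ID
```

The receiving service treats the incoming span ID as its parent, which is exactly the parent/child link shown in the trace structures earlier.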

AWS Service Integration

Lumigo automatically extracts trace context from:

API Gateway: From request headers and X-Ray integration
SQS: From message attributes
SNS: From message metadata
EventBridge: From event detail
Step Functions: From execution context

Capturing Asynchronous Flows

One of Lumigo's key strengths is automatically connecting async invocations. Here's an example:

Async Flow Example:

User Request
     │
     ▼
API Gateway
     │
     ▼
Lambda: processUpload (sends to SQS)
     │
     └─► SQS Queue
           │
           ▼
         Lambda: processImage (sends to SNS)
           │
           ├─► SNS Topic
           │     ├─► Lambda: sendNotification
           │     └─► Lambda: updateDatabase
           │
           └─► S3: Save processed image

Lumigo automatically connects all these into a single trace!

SQS Integration

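When a Lambda sends a message to SQS, trace context has to travel with the message so the consumer can continue the trace. Lumigo handles this correlation automatically. The message attributes shown below illustrate the idea; the exact mechanism and attribute names depend on the tracer.

// Your code (no changes needed)
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async (event) => {
  const data = { orderId: event.orderId }; // example payload

  await sqs.sendMessage({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789/my-queue',
    MessageBody: JSON.stringify(data)
  }).promise();
};

// Conceptually, the message then carries trace context such as:
// MessageAttributes: {
//   'lumigo-trace-id': { StringValue: 'abc123', DataType: 'String' },
//   'lumigo-span-id': { StringValue: 'def456', DataType: 'String' }
// }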

When the consumer Lambda processes the message, Lumigo extracts the trace context and continues the trace.
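The producer/consumer hand-off can be sketched without any AWS calls. The attribute names reuse the illustration above and are hypothetical. Two facts the sketch relies on: the SendMessage API uses `StringValue`/`DataType`, while SQS-triggered Lambda events expose the same attributes with camelCase keys (`stringValue`, `dataType`). SQS also allows at most 10 message attributes per message, so the sketch leaves a full message untouched rather than failing the send.

```python
import json

TRACE_ATTR = "lumigo-trace-id"  # hypothetical names, following the illustration above
SPAN_ATTR = "lumigo-span-id"
MAX_SQS_ATTRIBUTES = 10  # SQS limit on message attributes per message


def inject_context(params, trace_id, span_id):
    """Return SendMessage params with trace context added as String message attributes."""
    attrs = dict(params.get("MessageAttributes", {}))
    attrs[TRACE_ATTR] = {"DataType": "String", "StringValue": trace_id}
    attrs[SPAN_ATTR] = {"DataType": "String", "StringValue": span_id}
    if len(attrs) > MAX_SQS_ATTRIBUTES:
        return params  # no room: send without context rather than fail
    return {**params, "MessageAttributes": attrs}


def extract_context(record):
    """Read trace context from one record of an SQS-triggered Lambda event."""
    attrs = record.get("messageAttributes", {})
    trace = attrs.get(TRACE_ATTR, {}).get("stringValue")
    parent = attrs.get(SPAN_ATTR, {}).get("stringValue")
    return (trace, parent) if trace else None


params = inject_context(
    {"QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789/my-queue",
     "MessageBody": json.dumps({"orderId": "o-1"})},
    trace_id="abc123",
    span_id="def456",
)

# Simulate what the consumer Lambda receives: same attributes, camelCase keys
record = {
    "body": params["MessageBody"],
    "messageAttributes": {
        name: {"stringValue": a["StringValue"], "dataType": a["DataType"]}
        for name, a in params["MessageAttributes"].items()
    },
}
print(extract_context(record))
```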

Automatic vs Manual Instrumentation

Lumigo provides both automatic and manual instrumentation:

Automatic Instrumentation (Default)

With just the Lambda layer and environment variable, Lumigo automatically traces:

Function invocations (entry/exit, duration, errors)
AWS SDK calls (DynamoDB, S3, SQS, SNS, etc.)
HTTP requests (fetch, axios, requests library)
Database queries (if using supported libraries)

Manual Instrumentation (When Needed)

Use manual instrumentation for:

Custom business logic spans
Third-party APIs not automatically instrumented
Adding custom tags/metadata
Marking specific operations for alerting

// Node.js example
const AWS = require('aws-sdk');
// The tracer module exports a factory; call it with your token
const lumigo = require('@lumigo/tracer')({ token: process.env.LUMIGO_TRACER_TOKEN });
const dynamodb = new AWS.DynamoDB.DocumentClient();

exports.handler = lumigo.trace(async (event) => {
  // Automatic: AWS SDK calls are traced
  await dynamodb.get({ TableName: 'Orders', Key: { orderId: event.orderId } }).promise();

  // Manual: custom span for business logic (createSpan/setTag/finish are illustrative)
  const span = lumigo.createSpan('process-payment', {
    'payment.amount': event.amount,
    'payment.currency': 'USD'
  });

  try {
    const result = await processPayment(event); // hypothetical business function
    span.setTag('payment.status', 'success');
    return result;
  } catch (error) {
    span.setTag('payment.status', 'failed');
    span.setTag('error', error.message);
    throw error;
  } finally {
    span.finish();
  }
});

Environment Variables and Configuration

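Key environment variables for Lumigo. LUMIGO_TRACER_TOKEN and LUMIGO_DEBUG are the core settings. The optional names below vary between tracer versions and runtimes, so confirm each one against Lumigo's documentation before relying on it: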

# Required
LUMIGO_TRACER_TOKEN=your-token-here

# Optional - Debugging
LUMIGO_DEBUG=true # Enable debug logging
LUMIGO_LOG_LEVEL=INFO # DEBUG, INFO, WARN, ERROR

# Optional - Sampling
LUMIGO_SAMPLE_RATE=1.0 # 1.0 = 100%, 0.1 = 10%

# Optional - Data Redaction
LUMIGO_REDACT_ALL=false # Redact all payloads
LUMIGO_REDACT_REGEX=.*password.* # Redact fields matching regex

# Optional - Performance
LUMIGO_MAX_ENTRY_SIZE=10000 # Max payload size (bytes)
LUMIGO_SKIP_HTTP_ENDPOINTS=health # Skip tracing for specific endpoints
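Misconfigured variables are a common reason traces silently go missing, so it can help to validate them in a deployment check. Here is a minimal sketch for three of the settings above. The variable names follow this guide rather than a verified tracer API, and the accepted spellings for "true" are an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TracerConfig:
    token: str
    debug: bool
    sample_rate: float


def load_config(env=None):
    """Validate tracer settings; raises ValueError with a readable message."""
    env = os.environ if env is None else env
    raw = env.get("LUMIGO_TRACER_TOKEN", "")
    token = raw.strip()
    if not token:
        raise ValueError("LUMIGO_TRACER_TOKEN is missing or empty")
    if token != raw or token[0] in "\"'" or token[-1] in "\"'":
        raise ValueError("LUMIGO_TRACER_TOKEN has stray whitespace or quotes")
    debug = env.get("LUMIGO_DEBUG", "false").strip().lower() in {"1", "true", "yes"}
    rate = float(env.get("LUMIGO_SAMPLE_RATE", "1.0"))  # non-numbers raise ValueError too
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"LUMIGO_SAMPLE_RATE must be between 0 and 1, got {rate}")
    return TracerConfig(token=token, debug=debug, sample_rate=rate)


cfg = load_config({"LUMIGO_TRACER_TOKEN": "t_123", "LUMIGO_SAMPLE_RATE": "0.1"})
```

Rejecting quoted or whitespace-padded tokens outright, instead of silently trimming them, surfaces the copy-paste mistakes described in the troubleshooting section.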

IAM Permissions

Lumigo requires minimal IAM permissions. The CloudFormation template creates a role with:

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Action": [
 "logs:CreateLogGroup",
 "logs:CreateLogStream",
 "logs:PutLogEvents"
 ],
 "Resource": "arn:aws:logs:*:*:*"
 },
 {
 "Effect": "Allow",
 "Action": [
 "xray:PutTraceSegments",
 "xray:PutTelemetryRecords"
 ],
 "Resource": "*"
 }
 ]
}

💡 Security Best Practice: Use AWS Secrets Manager or Parameter Store for the LUMIGO_TRACER_TOKEN instead of hardcoding it in environment variables. This allows rotation and better security.
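A minimal sketch of the Secrets Manager approach in Python, assuming boto3's `secretsmanager` client and a plain-string secret. The `LUMIGO_TOKEN_SECRET_ID` variable and the default secret name are hypothetical. The token is fetched once per execution environment and cached, so warm invocations don't call Secrets Manager again. You could then pass the result to the tracer's `token` parameter, as with the `lumigo_tracer` decorator shown earlier.

```python
import os

_cached_token = None


def get_tracer_token(client=None, secret_id=None):
    """Fetch the tracer token from Secrets Manager once per execution environment."""
    global _cached_token
    if _cached_token is None:
        if client is None:
            import boto3  # imported lazily so the function can be exercised with a stub
            client = boto3.client("secretsmanager")
        secret_id = secret_id or os.environ.get("LUMIGO_TOKEN_SECRET_ID", "lumigo/tracer-token")
        response = client.get_secret_value(SecretId=secret_id)
        _cached_token = response["SecretString"].strip()  # drop stray newlines
    return _cached_token


class StubSecretsClient:
    """Stand-in for boto3's Secrets Manager client, for local testing only."""

    def __init__(self):
        self.calls = 0

    def get_secret_value(self, SecretId):
        self.calls += 1
        return {"SecretString": "t_example_token\n"}


stub = StubSecretsClient()
first = get_tracer_token(client=stub)
second = get_tracer_token(client=stub)  # served from the module-level cache
```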

6. Visualizing Traces

Understanding Lumigo's Transaction Map

Lumigo's transaction map provides a visual representation of your request flow. Here's how to read it:


Lumigo transaction map showing complete request flow with service interactions and timing

Transaction Map Layout:

Transaction: Process Order (Total: 475ms)

API Gateway [50ms]
  │
  ▼
Lambda: processOrder [420ms]
  ├─ DynamoDB: getUser [180ms] ⚠️
  ├─ Lambda: validatePayment [150ms]
  │    └─ External: stripe [120ms]
  └─ DynamoDB: saveOrder [90ms]

Legend:
⚠️ = Slow (>150ms)
❌ = Error
🔵 = Cold Start

Key Features

Service Icons: Visual representation of each service (Lambda, DynamoDB, API Gateway, etc.)
Duration Bars: Horizontal bars showing time spent in each service
Color Coding: Green (fast), Yellow (slow), Red (error)
Cold Start Indicator: Special marker showing Lambda cold starts
Click to Expand: Click any service to see detailed span information

Latency Waterfall Charts

Waterfall charts show the sequential timing of operations, making it easy to identify bottlenecks:

Waterfall chart showing sequential operation timing and identifying performance bottlenecks

Waterfall Chart Example:

Time (ms)               0         100       200       300       400       500
API Gateway             █████
Lambda processOrder          ██████████████████████████████████████████
  DynamoDB getUser           ██████████████████  (180ms, bottleneck!)
  Lambda validatePayment                       ███████████████
    Stripe API                                   ████████████
  DynamoDB saveOrder                                          █████████
Response                                                               █

Total: 475ms
Bottleneck: DynamoDB getUser (180ms = 38%)
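A waterfall is just each span drawn at its start offset with a width proportional to its duration. The sketch below renders one from `(name, start_ms, duration_ms)` tuples. The durations come from the 475ms example; the start times are inferred for illustration (e.g. Stripe starting 20ms into validatePayment is an assumption). Each character is 10ms by default.

```python
def render_waterfall(spans, ms_per_char=10, label_width=24):
    """Render (name, start_ms, duration_ms) tuples as a text waterfall chart."""
    lines = []
    for name, start, duration in spans:
        offset = round(start / ms_per_char)
        width = max(1, round(duration / ms_per_char))  # always draw at least one cell
        lines.append(f"{name:<{label_width}}{' ' * offset}{'#' * width}  {duration}ms")
    return "\n".join(lines)


SPANS = [  # durations from the example above; start times inferred for illustration
    ("API Gateway", 0, 50),
    ("DynamoDB getUser", 50, 180),
    ("Lambda validatePayment", 230, 150),
    ("Stripe API", 250, 120),
    ("DynamoDB saveOrder", 380, 90),
    ("Response", 470, 5),
]
chart = render_waterfall(SPANS)
print(chart)
slowest = max(SPANS, key=lambda s: s[2])
```

Drawing by start time rather than in call order is what makes gaps (idle waiting) and overlaps (parallel work) visible at a glance.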

Payload Inspection

Lumigo captures request and response payloads automatically, which is invaluable for debugging:

Benefits

See Exact Inputs: What data was sent to each service
See Exact Outputs: What each service returned
Debug Errors: Full error messages and stack traces
Performance Analysis: Identify large payloads causing slowdowns

Data Redaction

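For security and compliance, redact sensitive data. The variable names and the programmatic call below are illustrative; confirm the masking options your tracer version supports in Lumigo's documentation before relying on them.

# Redact specific fields
LUMIGO_REDACT_REGEX=.*password.*|.*token.*|.*secret.*

# Redact all payloads (only keep metadata)
LUMIGO_REDACT_ALL=true

// Programmatic redaction (Node.js; illustrative call, check the tracer docs)
const lumigo = require('@lumigo/tracer')();
lumigo.redact(['password', 'ssn', 'creditCard']);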
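Key-based redaction like this walks the payload and masks any value whose key matches the pattern. The sketch below is a generic illustration of that behavior, not the tracer's code. It reuses the `password|token|secret` pattern from above, matches keys case-insensitively, and returns a masked copy without mutating the original event.

```python
import re

DEFAULT_PATTERN = re.compile(r".*(password|token|secret).*", re.IGNORECASE)


def redact(payload, pattern=DEFAULT_PATTERN, mask="****"):
    """Return a copy of payload with values masked wherever the *key* matches pattern."""
    if isinstance(payload, dict):
        return {
            key: mask if pattern.fullmatch(str(key)) else redact(value, pattern, mask)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item, pattern, mask) for item in payload]
    return payload  # scalars are returned unchanged


event = {
    "user": "ana",
    "password": "hunter2",
    "auth": {"apiToken": "abc"},
    "items": [{"clientSecret": "x", "sku": "A1"}],
}
redacted = redact(event)
```

Matching on keys rather than values is cheap and predictable, but it will not catch a secret embedded in a free-text field, which is one reason `LUMIGO_REDACT_ALL`-style settings exist.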

Error and Timeout Detection

Lumigo automatically detects and highlights:

Errors: Exceptions, HTTP error status codes, AWS service errors
Timeouts: Lambda timeouts, API Gateway timeouts, service timeouts
Cold Starts: Lambda initialization time
Throttling: DynamoDB throttling, API rate limiting

Each error includes:

Full stack trace
Error message and type
Request payload that caused the error
Service context (Lambda name, region, version)

Correlating Traces with Logs and Metrics

Lumigo integrates with CloudWatch Logs and X-Ray for comprehensive observability:

CloudWatch Logs: Click "View Logs" in a trace to see related log entries
X-Ray Integration: Traces appear in AWS X-Ray console
Metrics: Aggregate trace data into metrics (p50, p95, p99 latencies)
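Percentile latencies summarize the tail that averages hide. Here is a minimal sketch using the nearest-rank method; Lumigo's own aggregation method isn't specified here, and the sample durations are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


durations_ms = [120, 95, 180, 110, 105, 480, 130, 100, 98, 1250]  # illustrative samples
summary = {f"p{p}": percentile(durations_ms, p) for p in (50, 95, 99)}
print(summary)
```

With only ten samples, p95 and p99 both land on the single slowest request. That is why tail percentiles need enough traffic (or a long enough window) to be meaningful.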

7. Real-World Example: E-Commerce Order Processing

Let's walk through a complete example: an e-commerce order processing system. This will show how Lumigo captures a real-world distributed flow.

Architecture

Order Processing Flow: User → API Gateway → Lambda: createOrder │ ├─► DynamoDB: Orders (write) ├─► SQS: order-queue (send message) └─► Response: Order ID SQS → Lambda: processOrder │ ├─► Lambda: validatePayment (invoke) │ └─► External API: Stripe ├─► DynamoDB: Inventory (check/update) ├─► SNS: order-processed (publish) └─► S3: order-receipt (upload) SNS → Lambda: sendConfirmationEmail └─► SES: Send email

Implementation

Here's the code for the two main order-processing Lambdas:

// createOrder.js
// AWS SDK v2 (must be bundled with the function on nodejs18.x and later runtimes)
const AWS = require('aws-sdk');
const { randomUUID } = require('crypto');
const dynamodb = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

exports.handler = async (event) => {
  const orderId = randomUUID();
  const userId = event.requestContext.authorizer.userId;
  // API Gateway proxy integrations deliver the body as a JSON string
  const { items } = JSON.parse(event.body);

  // Create order in DynamoDB
  await dynamodb.put({
    TableName: 'Orders',
    Item: {
      orderId,
      userId,
      items,
      status: 'pending',
      createdAt: new Date().toISOString()
    }
  }).promise();

  // Send to processing queue (items included so the consumer has them)
  await sqs.sendMessage({
    QueueUrl: process.env.ORDER_QUEUE_URL,
    MessageBody: JSON.stringify({ orderId, userId, items })
  }).promise();

  return {
    statusCode: 200,
    body: JSON.stringify({ orderId, status: 'created' })
  };
};

// processOrder.js
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();
const dynamodb = new AWS.DynamoDB.DocumentClient();
const sns = new AWS.SNS();

exports.handler = async (event) => {
  // An SQS trigger can deliver several messages per invocation
  for (const record of event.Records) {
    const { orderId, userId, items } = JSON.parse(record.body);

    // Validate payment
    const paymentResult = await lambda.invoke({
      FunctionName: 'validatePayment',
      Payload: JSON.stringify({ orderId, userId })
    }).promise();

    if (JSON.parse(paymentResult.Payload).status !== 'success') {
      throw new Error('Payment validation failed');
    }

    // Update inventory
    for (const item of items) {
      await dynamodb.update({
        TableName: 'Inventory',
        Key: { productId: item.productId },
        UpdateExpression: 'SET quantity = quantity - :qty',
        ExpressionAttributeValues: { ':qty': item.quantity }
      }).promise();
    }

    // Publish to SNS
    await sns.publish({
      TopicArn: process.env.ORDER_PROCESSED_TOPIC,
      Message: JSON.stringify({ orderId, status: 'processed' })
    }).promise();
  }

  return { statusCode: 200 };
};

Sample Trace Output

Here's what Lumigo captures for this flow (simplified JSON):

{
 "traceId": "abc123xyz",
 "transactionName": "Process Order",
 "duration": 1250,
 "spans": [
 {
 "name": "API Gateway",
 "service": "apigateway",
 "duration": 45,
 "tags": {
 "http.method": "POST",
 "http.path": "/orders",
 "http.status_code": 200
 }
 },
 {
 "name": "Lambda: createOrder",
 "service": "lambda",
 "duration": 320,
 "coldStart": false,
 "spans": [
 {
 "name": "DynamoDB.Put",
 "service": "dynamodb",
 "duration": 85,
 "tags": {
 "table": "Orders",
 "operation": "PutItem"
 }
 },
 {
 "name": "SQS.SendMessage",
 "service": "sqs",
 "duration": 45,
 "tags": {
 "queue": "order-queue"
 }
 }
 ]
 },
 {
 "name": "Lambda: processOrder",
 "service": "lambda",
 "duration": 885,
 "coldStart": true,
 "coldStartDuration": 1200,
 "spans": [
 {
 "name": "Lambda.Invoke: validatePayment",
 "service": "lambda",
 "duration": 450,
 "spans": [
 {
 "name": "HTTP Request",
 "service": "external",
 "duration": 380,
 "tags": {
 "http.url": "https://api.stripe.com/v1/charges",
 "http.status_code": 200
 }
 }
 ]
 },
 {
 "name": "DynamoDB.Update",
 "service": "dynamodb",
 "duration": 280,
 "tags": {
 "table": "Inventory"
 }
 },
 {
 "name": "SNS.Publish",
 "service": "sns",
 "duration": 155,
 "tags": {
 "topic": "order-processed"
 }
 }
 ]
 }
 ],
 "errors": [],
 "metadata": {
 "region": "us-east-1",
 "accountId": "123456789012"
 }
}

What This Trace Reveals

From this trace, we can immediately see:

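Cold Start Impact: processOrder reported a 1200ms cold start (initialization), on top of its 885ms execution
External API Bottleneck: Stripe API call took 380ms (43% of processOrder time)
DynamoDB Performance: Inventory updates took 280ms. They could be issued in parallel, or grouped with TransactWriteItems if they must be atomic (BatchWriteItem only supports puts and deletes, not update expressions)
Total Latency: 1250ms across the spans shown, not counting the separately reported cold start. Because processOrder runs asynchronously after SQS, the caller only waits for API Gateway and createOrder (365ms)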
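These readings can be checked directly against the sample trace. The sketch below loads a trimmed copy of the JSON (only the fields needed) and recomputes the total, the synchronous path the caller waits on, and Stripe's share of processOrder. Treating API Gateway plus createOrder as the user-facing path follows from the architecture, where processOrder is triggered by SQS.

```python
import json

# Trimmed copy of the sample trace above: only the fields this check needs.
TRACE_JSON = """
{"duration": 1250, "spans": [
  {"name": "API Gateway", "duration": 45},
  {"name": "Lambda: createOrder", "duration": 320},
  {"name": "Lambda: processOrder", "duration": 885, "coldStartDuration": 1200,
   "spans": [
     {"name": "Lambda.Invoke: validatePayment", "duration": 450,
      "spans": [{"name": "HTTP Request", "duration": 380}]},
     {"name": "DynamoDB.Update", "duration": 280},
     {"name": "SNS.Publish", "duration": 155}]}
]}
"""

trace = json.loads(TRACE_JSON)
top = {s["name"]: s for s in trace["spans"]}

# Top-level spans account for the reported total; the cold start is reported separately
assert sum(s["duration"] for s in trace["spans"]) == trace["duration"]

# Synchronous path the caller waits on (processOrder runs later, after SQS)
user_facing_ms = top["API Gateway"]["duration"] + top["Lambda: createOrder"]["duration"]

process = top["Lambda: processOrder"]
stripe = process["spans"][0]["spans"][0]
stripe_share = stripe["duration"] / process["duration"]
print(f"user-facing {user_facing_ms}ms, Stripe {stripe_share:.0%} of processOrder")
```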

8. Best Practices

Minimize Noise and Control Sampling

High-traffic applications can generate millions of traces. Use sampling to control costs and focus on what matters:

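# Sample 10% of traces (cuts traced volume, and typically cost, by about 90%)
LUMIGO_SAMPLE_RATE=0.1

// Sample 100% of errors, 10% of successful requests
// (Requires custom logic in your code. setTraceSampled is an illustrative call,
// and event.isError is a hypothetical flag: a decision made before the handler
// runs cannot know whether the request will fail.)
const lumigo = require('@lumigo/tracer')();

exports.handler = lumigo.trace(async (event) => {
  const shouldTrace = event.isError || Math.random() < 0.1;

  if (shouldTrace) {
    lumigo.setTraceSampled(true);
  }

  // Your code...
});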
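One weakness of `Math.random()` sampling is that each service decides independently, so a trace can be kept in one function and dropped in the next. A common alternative is deterministic head sampling: hash the trace ID so every service makes the same keep/drop choice without coordinating. This is a generic sketch, not a Lumigo feature. Keeping every error reliably needs tail-based sampling instead, where the decision is made after the trace completes.

```python
import hashlib


def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: the same trace ID always gets the same decision."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
print(kept)  # close to 1,000 for a 10% rate
```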

Secure Sensitive Data

Always redact sensitive information:

# Environment variables
LUMIGO_REDACT_REGEX=.*password.*|.*token.*|.*secret.*|.*api[_-]?key.*
LUMIGO_REDACT_ALL=false # Only redact matching fields

// Programmatic (Node.js; illustrative call, check the tracer docs)
const lumigo = require('@lumigo/tracer')();
lumigo.redact([
 'password',
 'creditCard',
 'ssn',
 'apiKey',
 'authorization'
]);

Integrate with CI/CD

Add Lumigo to your deployment pipeline:

# GitHub Actions example
name: Deploy Lambda with Lumigo

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Deploy with Lumigo
        # --environment replaces all existing variables; index.handler is an example original handler
        run: |
          aws lambda update-function-configuration \
            --function-name my-function \
            --layers arn:aws:lambda:us-east-1:114300393969:layer:lumigo-node-tracer:XXX \
            --handler lumigo-auto-instrument.handler \
            --environment "Variables={LUMIGO_ORIGINAL_HANDLER=index.handler,LUMIGO_TRACER_TOKEN=${{ secrets.LUMIGO_TOKEN }},LUMIGO_DEBUG=false}"

Use Lumigo Alerts for SLOs

Set up alerts based on trace data:

Error Rate: Alert if error rate exceeds 1%
Latency: Alert if p95 latency exceeds 500ms
Cold Starts: Alert if cold start rate exceeds 10%
Timeout Rate: Alert if timeout rate exceeds 0.5%
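How these four thresholds combine can be sketched as a function over one time window of aggregated trace data. In practice you would configure the alerts in Lumigo rather than in code. The `WindowStats` fields and the example numbers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    invocations: int
    errors: int
    timeouts: int
    cold_starts: int
    p95_ms: float


# Thresholds from the list above; rates are fractions (0.01 = 1%)
THRESHOLDS = {
    "error_rate": 0.01,
    "p95_ms": 500,
    "cold_start_rate": 0.10,
    "timeout_rate": 0.005,
}


def breached(stats: WindowStats) -> list:
    """Return the names of SLO thresholds exceeded in this window."""
    n = max(stats.invocations, 1)  # avoid division by zero on idle functions
    observed = {
        "error_rate": stats.errors / n,
        "p95_ms": stats.p95_ms,
        "cold_start_rate": stats.cold_starts / n,
        "timeout_rate": stats.timeouts / n,
    }
    return [name for name, limit in THRESHOLDS.items() if observed[name] > limit]


window = WindowStats(invocations=2000, errors=30, timeouts=4, cold_starts=150, p95_ms=620)
print(breached(window))  # ['error_rate', 'p95_ms']
```

Here 30 errors in 2,000 invocations is 1.5% and p95 is 620ms, so those two fire. A 7.5% cold-start rate and a 0.2% timeout rate stay under their limits.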

Monitor Latency Hotspots

Use Lumigo's service map to identify slow services:

Sort services by average latency
Identify services with high p95/p99 latencies
Compare latency across different time periods
Set up alerts for latency degradation

9. Troubleshooting & Common Pitfalls

Missing Spans

Problem: Some operations aren't showing up in traces.

Common Causes:

Function not instrumented (missing layer or environment variable)
Operation not automatically instrumented (custom code, unsupported library)
Sampling rate too low (operation was sampled out)
Trace context not propagated (async boundary, external service)

Solutions:

Verify Lambda layer is attached: aws lambda get-function --function-name my-function
Check environment variables: aws lambda get-function-configuration --function-name my-function
Increase sample rate temporarily: LUMIGO_SAMPLE_RATE=1.0
Add manual instrumentation for unsupported operations

Cold Start Confusion

Problem: Traces show high latency, but it's unclear if it's cold start or actual execution time.

Solution: Lumigo automatically marks cold starts. Look for the cold start indicator in the trace. If cold starts are frequent, consider:

Provisioned concurrency for critical functions
Optimizing initialization code (trim dependencies, and lazy-load modules that only some code paths need)
Using Lambda SnapStart (Java functions)
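One common way to structure initialization is sketched below. Work every invocation needs runs once at module load, during the init phase that counts toward the cold start. Rarely used dependencies are loaded on first use and then cached. `expensive_init` and the XML branch are hypothetical stand-ins for real setup and a heavy optional dependency.

```python
import functools

CALLS = {"expensive_init": 0}


def expensive_init():
    """Stand-in for costly setup every request needs (SDK clients, config parsing)."""
    CALLS["expensive_init"] += 1
    return {"client": "ready"}


# Runs once per execution environment, during init (part of the cold start)
SHARED = expensive_init()


@functools.lru_cache(maxsize=None)
def report_renderer():
    """Lazy-loaded: only invocations that need it pay the import cost, and only once."""
    import xml.dom.minidom  # stand-in for a heavy, rarely used dependency
    return xml.dom.minidom


def handler(event, context=None):
    if event.get("format") == "xml":
        renderer = report_renderer()
        return renderer.parseString("<ok/>").documentElement.tagName
    return SHARED["client"]
```

Warm invocations reuse both `SHARED` and the cached renderer, so the only recurring cost is the handler's own work.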

Improper Environment Variables

Problem: Traces not appearing, or errors in logs.

Common Issues:

Missing LUMIGO_TRACER_TOKEN
Invalid token (expired, wrong account)
Token stored in wrong format (extra quotes, whitespace)

Verification:

# Check environment variables
aws lambda get-function-configuration \
 --function-name my-function \
 --query 'Environment.Variables'

# Check the tracer's own log output (set LUMIGO_DEBUG=true, then invoke the function)
aws logs tail /aws/lambda/my-function --since 10m

Network/Permission Issues

Problem: Tracer can't send data to Lumigo collector.

Check:

VPC configuration (functions in VPC need NAT Gateway for internet access)
Security groups (allow outbound HTTPS to Lumigo endpoints)
IAM permissions (CloudWatch Logs, X-Ray permissions)

10. Conclusion & Next Steps

Distributed tracing with Lumigo transforms how you understand and optimize your serverless and microservices architectures. By automatically instrumenting your functions and connecting async flows, Lumigo gives you the visibility you need to:

Identify performance bottlenecks across your entire stack
Debug errors quickly with full context
Optimize cold starts and reduce latency
Understand service dependencies and interactions
Meet SLOs with confidence

Recommended Next Steps

Instrument All Functions: Add Lumigo to all your Lambda functions, not just the critical ones
Set Up Dashboards: Create custom dashboards for your key metrics (error rate, latency, throughput)
Configure Alerts: Set up alerts for error rates, latency spikes, and cold start increases
Optimize Hotspots: Use trace data to identify and fix the top 5 performance bottlenecks
Integrate with Logging: Connect Lumigo traces with CloudWatch Logs for complete debugging context
Share with Team: Train your team on reading traces and using Lumigo for debugging

Real-World Impact: In a production deployment, implementing Lumigo distributed tracing helped us identify 8 slow API endpoints, optimize DynamoDB queries, reduce cold starts by 60%, and achieve a 57% reduction in average API response time - from 420ms to 180ms. The investment in observability paid for itself within the first month through reduced debugging time and improved user experience.

Start with instrumenting your most critical functions, then expand coverage as you see the value. The zero-code instrumentation makes it easy to get started, and the deep insights will quickly become indispensable for your operations.