Sunday, July 20, 2025

Reusing the same Next.js Docker image with runtime CDN assetPrefix

Recently, while investigating a production issue, I needed to run the production Docker image locally for debugging. However, I ran into an architectural constraint in how Next.js handles CDN configuration that makes reusing the same image across environments harder than it should be.

The Challenge: CDN Configuration

Next.js provides assetPrefix configuration to specify CDN or custom domains for static asset delivery, enabling faster content delivery through geographically distributed networks. However, this configuration presents a critical architectural constraint: it's a build-time setting that gets permanently embedded into static pages and CSS assets.

This limitation creates substantial operational challenges when reusing a Docker image:

  • Multiple Environments: Different environments like QA, stage, and production use different CDN domains.
  • Regional CDNs: Production environments could use different regional CDN domains.

The traditional approach would require building separate Docker images for each environment, violating the fundamental DevOps principle of "build once, deploy anywhere."

Architectural Solution: Runtime Asset Prefix Injection

By analyzing how Next.js handles assets and applying targeted text replacement, I arrived at a solution that keeps the build-time optimizations while allowing the CDN configuration to be set at runtime.

Configuration Architecture

The solution begins with environment-aware configuration that handles both basePath and assetPrefix dynamically:
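
A minimal sketch of such a configuration, assuming a standalone build and placeholder environment variable names (the exact names in the original setup may differ):

// next.config.js (sketch)
const basePath = process.env.BASE_PATH || '';
const assetPrefix = process.env.ASSET_PREFIX || basePath;

module.exports = {
  output: 'standalone',
  basePath,
  assetPrefix,
};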

Docker Build Strategy

During the Docker image creation process, we set placeholder values that will be replaced at runtime, along with a custom entrypoint script:
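
A sketch of the relevant Dockerfile pieces, using an easily greppable placeholder for the CDN prefix (image tags, paths, and the placeholder string itself are illustrative, not the exact values from the original setup):

# Dockerfile (sketch)
FROM node:20-alpine AS builder
WORKDIR /app
COPY . .
# bake placeholder values into the build; they are swapped out at container startup
ENV BASE_PATH="" \
    ASSET_PREFIX="https://__cdn_placeholder__"
RUN npm ci && npm run build

FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
COPY --from=builder /app/public ./public
COPY docker-entrypoint.sh /docker-entrypoint.sh
RUN chmod +x /docker-entrypoint.sh
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["node", "server.js"]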

Runtime Replacement Logic

The entrypoint script performs intelligent text replacement, substituting placeholder values with environment-specific configurations at container startup:
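
A sketch of such an entrypoint, assuming the placeholder and paths from the Dockerfile sketch above:

#!/bin/sh
# docker-entrypoint.sh (sketch): swap the baked-in placeholder for the real CDN prefix
set -e
: "${ASSET_PREFIX:?ASSET_PREFIX must be set}"

# rewrite every built artifact that may reference the placeholder
find /app/.next /app/public -type f \
  \( -name '*.html' -o -name '*.js' -o -name '*.css' -o -name '*.json' \) \
  -exec sed -i "s|https://__cdn_placeholder__|${ASSET_PREFIX}|g" {} +

exec "$@"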

Addressing Image Security Constraints

During implementation, I encountered an additional architectural challenge: Next.js image optimization returns cryptic 404 errors when serving images from CDN domains. Investigation into the Next.js source code revealed this is a security feature that only allows pre-approved domains for image serving.

The error message "url parameter is not allowed" provides insufficient context for troubleshooting, but the root cause is Next.js's domain whitelist mechanism. This requires configuring images.remotePatterns to explicitly allow CDN domains.
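
In next.config.js that looks something like this (the hostname is a placeholder for your CDN domain):

// next.config.js (excerpt, sketch)
module.exports = {
  images: {
    remotePatterns: [
      { protocol: 'https', hostname: 'cdn.example.com' },
    ],
  },
};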

Advanced Runtime Configuration

The most sophisticated aspect of this solution involves dynamically updating the remotePatterns configuration at runtime. Since we're performing text replacements on configuration files, I leveraged Node.js's command-line execution capabilities to add intelligent remotePatterns generation for CDN domain whitelisting.
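
A sketch of that step in the entrypoint; note that the file carrying the runtime image config is an assumption here and differs between Next.js versions (it may be .next/required-server-files.json or the config inlined into server.js):

# entrypoint excerpt (sketch): derive the CDN host and whitelist it for image optimization
CDN_HOST=$(echo "$ASSET_PREFIX" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')
export CDN_HOST

node -e '
  const fs = require("fs");
  // assumed location of the runtime config; adjust for your Next.js version
  const file = "/app/.next/required-server-files.json";
  const data = JSON.parse(fs.readFileSync(file, "utf8"));
  const images = (data.config.images = data.config.images || {});
  images.remotePatterns = [
    ...(images.remotePatterns || []),
    { protocol: "https", hostname: process.env.CDN_HOST },
  ];
  fs.writeFileSync(file, JSON.stringify(data));
'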

This approach ensures that:

  • Security policies remain enforced through domain whitelisting
  • Runtime flexibility is maintained for multi-environment deployments
  • Performance optimization continues through proper CDN utilization
  • Operational efficiency is achieved through single-image deployment

Key Architectural Benefits

This solution delivers several critical advantages:

  • Single Image Deployment: One Docker image serves all environments, reducing build complexity and storage requirements
  • Runtime Flexibility: CDN configurations adapt to deployment context without rebuild cycles
  • Performance Preservation: Static page optimization remains intact while enabling dynamic asset serving
  • Security Compliance: Domain whitelisting ensures controlled image processing
  • Operational Simplicity: Environment-specific configurations are managed through standard environment variables

Implementation Considerations

When implementing this approach, consider these architectural factors:

  • Text replacement scope: Ensure replacements target only intended configuration values
  • Environment variable validation: Implement proper fallbacks for missing or invalid CDN configurations
  • Security boundaries: Maintain strict domain whitelisting for image processing
  • Performance monitoring: Verify that runtime replacements don't impact application startup time

Conclusion

With some architectural creativity, we can resolve platform limitations while maintaining operational best practices. By combining Next.js's build-time optimizations with runtime configuration flexibility, we achieve the ideal balance of performance and deployment efficiency.

The approach lets teams keep the CDN benefits across multiple environments while sticking to the build-once, deploy-anywhere principle. For teams running Next.js across diverse environments, it is a production-ready pattern that keeps deployments consistent without maintaining per-environment images.

Resolving Next.js Standalone Build Docker Reachability Failure on AWS ECS Fargate

While deploying a Next.js standalone build in a Docker container on AWS ECS Fargate, I encountered a subtle but critical issue that highlights the importance of understanding platform-specific runtime behaviors.

The Problem: Silent Health Check Failures

During the deployment of our Next.js application to AWS ECS Fargate, the service consistently failed health checks despite the application appearing to function correctly. The container would start successfully, but the Target Group couldn't establish connectivity, resulting in deployment failures.

Initial Investigation

Examining the container logs revealed the root cause:

- Local:    http://ip-10-0-5-61.us-west-2.compute.internal:3000
- Network:  http://10.0.5.61:3000

The Next.js server was binding to the container's internal hostname rather than accepting connections from external networks. This prevented the ALB health checks from reaching the application endpoint.

Root Cause Analysis

Tracing through the Next.js standalone build revealed the hostname configuration logic in server.js:

const currentPort = parseInt(process.env.PORT, 10) || 3000
const hostname = process.env.HOSTNAME || '0.0.0.0'

The application defaults to 0.0.0.0 (accepting all connections) when no HOSTNAME environment variable is present. My initial approach was to explicitly set HOSTNAME=0.0.0.0 in the ECS task definition.
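
In the task definition that looks like the following excerpt (container name is illustrative):

"containerDefinitions": [
  {
    "name": "web",
    "environment": [
      { "name": "HOSTNAME", "value": "0.0.0.0" }
    ]
  }
]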

However, this approach failed due to a critical AWS Fargate behavior: Fargate automatically sets the HOSTNAME environment variable at runtime, overriding any pre-configured values.

Evaluating Solution Approaches

Approach 1: Runtime Environment Override

CMD "HOSTNAME=0.0.0.0 node server.js"

While functional, this approach embeds environment variable assignments within the CMD instruction, reducing container portability and maintainability.

Approach 2: Build-Time Source Modification

Using sed during the Docker build process to directly modify the hostname assignment:

RUN sed -i "s/const hostname = process.env.HOSTNAME || '0.0.0.0'/const hostname = '0.0.0.0'/g" server.js

Approach 3: Systematic Source Patching (Recommended as it provides consistent behavior locally and in the cloud)

The most architecturally sound solution leverages patch-package to modify the Next.js build process itself. The hostname assignment originates from node_modules/next/dist/build/utils.js in the writeFile function that generates server.js.

By creating a systematic patch that modifies the server generation logic, we achieve:

  • Consistency across environments (local development and production)
  • Maintainability through version-controlled patches
  • Architectural integrity by addressing the root cause rather than symptoms

Implementation Details

The patch modifies the server template generation in the Next.js build so that the listen hostname is read from an environment variable that Fargate does not override, instead of HOSTNAME. This ensures consistent behavior across all deployment targets while maintaining a clean separation of concerns.
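
A sketch of what such a patch can look like, saved as something like patches/next+<version>.patch. The real hunk sits inside a template string in utils.js (hunk header omitted here), and the replacement variable name NEXT_HOSTNAME is an assumption, not necessarily the name used in the actual patch:

--- a/node_modules/next/dist/build/utils.js
+++ b/node_modules/next/dist/build/utils.js
@@
-const hostname = process.env.HOSTNAME || '0.0.0.0'
+const hostname = process.env.NEXT_HOSTNAME || '0.0.0.0'

After editing the file under node_modules, running npx patch-package next writes the patch file, and running patch-package from a postinstall script re-applies it during the Docker image build.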

Key Architectural Insights

This experience reinforces several important principles for cloud-native application deployment:

  • Platform Behavior Awareness: Cloud platforms often inject runtime configurations that can override application-level settings
  • Health Check Design: Container applications must be designed with load balancer connectivity patterns in mind
  • Source-Level Solutions: Sometimes the most maintainable solution requires modifying the build process rather than working around runtime constraints

Conclusion

While AWS Fargate's automatic hostname assignment serves legitimate infrastructure purposes, it can create unexpected challenges for containerized applications. By understanding the platform's behavior and implementing systematic source modifications, we can create robust deployment solutions that maintain architectural integrity while meeting operational requirements.

Saturday, July 19, 2025

Access logs in Next.js production build

I frequently encounter architectural decisions that seem counterintuitive from an operational perspective. Recently, while containerizing a Next.js application, I discovered one such puzzling design choice that required a creative engineering solution.

The Problem: Silent Production Builds

During the Docker image creation process for our Next.js application, I encountered an unexpected operational blind spot: Next.js production builds generate zero access logs. This absence of fundamental observability data immediately raised concerns about our ability to monitor application behavior in production environments.

The conventional wisdom suggests deploying an nginx reverse proxy to capture access logs. However, introducing an additional process layer solely for logging felt architecturally unsound, particularly within containerized environments where process minimalism is a core principle.

Exploring Conventional Solutions

My initial investigation led me to application-level logging libraries such as winston and pino. While these tools excel at application logging, they operate within the application boundary and don't provide the standardized access log format that operations teams expect from web applications.

Root Cause Analysis

After extensive research into similar reported issues, I discovered the underlying cause: Vercel has intentionally omitted access logging from Next.js production builds. This architectural decision, while perhaps suitable for Vercel's managed platform, creates operational challenges for self-hosted deployments.

Deep Dive

Taking a source-code-first approach, I downloaded the Next.js repository and traced the request handling flow to its core: the async requestListener(req, res) function. By strategically placing console.log statements within the node_modules Next.js installation, I successfully exposed the access log data we needed.

However, this manual modification approach presented obvious maintainability challenges for automated deployment pipelines.

Production-Ready Implementation

While researching sustainable patching methodologies, I discovered an excellent resource by TomUps (https://www.tomups.com/posts/log-nextjs-request-response-as-json/) that introduced patch-package, a tool designed precisely for this type of systematic source modification.

Their approach provided the foundational technique, though it captured extensive request/response metadata including headers and body content. For our operational requirements, I needed a more focused solution that provided essential access log fields: timestamp, URL, and HTTP status code.

Architectural Solution

The final implementation leverages patch-package combined with pino-http-print to deliver clean, standardized access logs that integrate seamlessly with our existing observability stack. This approach:

  • Maintains container efficiency by avoiding additional processes
  • Provides operational visibility through standard access log formats
  • Ensures deployment consistency via automated patching during image builds
  • Preserves maintainability through version-controlled patch files

Key Takeaway

This experience reinforces a fundamental architectural principle: when platform decisions conflict with operational requirements, creative engineering solutions can bridge the gap while maintaining system integrity. The key is balancing pragmatic problem-solving with long-term maintainability, which is exactly what patch-package enables in this scenario.

Steps

You can follow the steps in the TomUps post and adapt the patch sketched below.
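
A rough sketch of such a patch, with the hook point following the TomUps approach and a plain console.log standing in for the pino-http-print wiring (file path and hunk are illustrative and depend on the Next.js version in use; hunk header omitted):

--- a/node_modules/next/dist/server/lib/start-server.js
+++ b/node_modules/next/dist/server/lib/start-server.js
@@
 async function requestListener(req, res) {
+    // minimal access log: timestamp, method, URL, and status when the response finishes
+    res.once("finish", () => {
+        console.log(`${new Date().toISOString()} ${req.method} ${req.url} ${res.statusCode}`);
+    });

Running npx patch-package next then captures the change as a version-controlled patch file that is applied during the image build.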

Thursday, May 22, 2025

Segmentation fault when running pip inside docker buildx on host=arm platform=amd64

Following Docker's transition to a paid licensing model for Docker Desktop, I migrated our development environment to Rancher Desktop, which offers integrated support for both Docker and Kubernetes functionalities. During a recent initiative to convert legacy Docker images to multi-platform images, I encountered an unusual issue.

During the buildx build, the `pip install` command was failing with a segmentation fault error, despite the same installation steps executing flawlessly in an interactive container environment.

Initial Troubleshooting Efforts

Further investigation revealed:
1. The ARM64 architecture builds completed successfully, while AMD64 builds consistently failed
2. The failures were isolated to two specific Python packages: pycrypto and uWSGI
3. Standard remediation approaches suggested on various posts were ineffective, including:
   - Modifying Python versions
   - Updating pip
   - Testing alternative package versions
   - Installing supplementary system libraries
   - Switching between base images (python:3.8-slim (Debian bookworm), Debian bullseye, Ubuntu)
   - Using root installation instead of virtual environments

Root Cause Analysis

After extensive troubleshooting, I identified that the failures stemmed from cross-architecture compilation requirements. Both uWSGI and pycrypto lack pre-built wheels, necessitating compilation during installation. This compilation process was failing specifically when building AMD64 binaries on ARM64 host architecture.

Shifting my focus to the underlying virtualization platform, I searched for issues related to QEMU emulation and segmentation faults on AMD64 platforms. A comment on a relevant post suggested trying the build process on Docker Desktop. 

Resolution

After reinstalling Docker Desktop, I was able to successfully build the image.

To verify the findings, I retested all previous versions of the Dockerfile, and they worked as expected.

Conclusion

The root cause of this issue lies in QEMU's emulation of the AMD64 architecture on ARM-based systems: the emulation stack used by Rancher Desktop hit the segmentation fault, while Docker Desktop's handled the same builds without issue.

While ARM support has matured considerably across the board, such cases represent areas requiring further development attention from the community. 

NOTE

To facilitate easy reuse and streamline the build process, I recommend creating a builder

docker buildx create \
--name multi-platform-docker \
--driver docker-container \
--platform linux/amd64,linux/arm64


To utilize this builder, run the following command:

docker buildx --builder multi-platform-docker build .

Monday, May 19, 2025

Multi-platform multi-stage build docker images with platform specific files

While working on multi-platform Docker images for a FastAPI application with MySQL client libraries, I ran into a scenario where the MySQL library files are stored in platform-specific paths.

/usr/lib/x86_64-linux-gnu

/usr/lib/aarch64-linux-gnu

As part of the multi-stage build, I wanted to do something like this:

FROM python:${PYTHON_VERSION}-slim

# if x64

COPY --from=builder /usr/lib/x86_64-linux-gnu/libmysql* /usr/lib/x86_64-linux-gnu

# if arm

COPY --from=builder /usr/lib/aarch64-linux-gnu/libmysql* /usr/lib/aarch64-linux-gnu

But there is no conditional platform support at the COPY command level. It is only at the FROM level.

The simple solution would be to have duplicate final image setup.

FROM --platform=amd64 python:${PYTHON_VERSION}-slim

COPY --from=builder /usr/lib/x86_64-linux-gnu/libmysql* /usr/lib/x86_64-linux-gnu
...

FROM --platform=arm64 python:${PYTHON_VERSION}-slim

COPY --from=builder /usr/lib/aarch64-linux-gnu/libmysql* /usr/lib/aarch64-linux-gnu
...

But this would mean duplicating the rest of the steps too.

The solution is to create platform-specific intermediate images and use them to build the final image.

FROM --platform=amd64 python:${PYTHON_VERSION}-slim as final-intermediate-amd64

COPY --from=builder /usr/lib/x86_64-linux-gnu/libmysql* /usr/lib/x86_64-linux-gnu
COPY --from=builder /usr/lib/x86_64-linux-gnu/libmaria* /usr/lib/x86_64-linux-gnu
COPY --from=builder /usr/lib/x86_64-linux-gnu/*xslt* /usr/lib/x86_64-linux-gnu
COPY --from=builder /usr/lib/x86_64-linux-gnu/*xml* /usr/lib/x86_64-linux-gnu

FROM --platform=arm64 python:${PYTHON_VERSION}-slim as final-intermediate-arm64

COPY --from=builder /usr/lib/aarch64-linux-gnu/libmysql* /usr/lib/aarch64-linux-gnu
COPY --from=builder /usr/lib/aarch64-linux-gnu/libmaria* /usr/lib/aarch64-linux-gnu
COPY --from=builder /usr/lib/aarch64-linux-gnu/*xslt* /usr/lib/aarch64-linux-gnu
COPY --from=builder /usr/lib/aarch64-linux-gnu/*xml* /usr/lib/aarch64-linux-gnu

FROM final-intermediate-${TARGETARCH}

COPY . /code/
COPY requirements.txt /code/requirements.txt
COPY --from=builder /code/.venv /code/.venv
COPY --from=builder /usr/bin/mysql_config /usr/bin

WORKDIR /code

EXPOSE 8000

CMD ["/code/.venv/bin/uvicorn", "--app-dir", "/code/myapp", "server:app", "--proxy-headers"]


Monday, May 13, 2024

FastAPI middleware performance

As per the FastAPI docs, the way to create and add a custom middleware is:


@app.middleware("http")
async def add_my_middlware(request: Request, call_next):
response = await call_next(request)
return response

Seems simple enough. Before the await, you can do something with the request. After the await, you can do something with the response.

But if you run a benchmark, you will find something very surprising.

Saturday, May 11, 2024

Python - calling async function from a sync code flow

Recently I ran into a scenario where I needed to call an init function for a third-party library in my FastAPI application. The problem was, the function was async.

One way of calling an async function from a sync flow is using asyncio event loop.


import asyncio
loop = asyncio.get_event_loop()
loop.run_until_complete(init())

But this gave an "event loop is already running" error.

The problem is, the FastAPI application is run with uvicorn, which already starts and runs its own event loop.

So I tried creating a new loop.

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(init())

But this still didn't work, as asyncio only supports one running loop per thread.

The most commonly suggested approach is to use nest-asyncio:

import nest_asyncio
loop = asyncio.new_event_loop()
nest_asyncio.apply(loop)
asyncio.set_event_loop(loop)
loop.run_until_complete(init())

This raised an exception: Can't patch uvloop.Loop.

uvicorn patches asyncio to use uvloop, which performs better than vanilla asyncio (earlier claims were 2x to 4x; in a simple test I ran, even with the performance improvements in Python 3.12, asyncio was still about 25% slower than uvloop).

I did some research but couldn't find any way around this other than forcing uvicorn to run with vanilla asyncio:


uvicorn main:app --loop asyncio


It didn't make sense to take a performance hit just to call one function.

So I decided to dig deeper into why asyncio.new_event_loop returns uvloop.Loop.

The way uvloop does this is by setting the asyncio event loop policy.

This gave me an idea: what if we temporarily switch back to the default event loop policy, get a loop, apply nest_asyncio, run the function, and then restore the original policy?


import asyncio

import nest_asyncio

# temporarily switch back to the default (vanilla asyncio) event loop policy
_cur_event_loop_policy = asyncio.get_event_loop_policy()
asyncio.set_event_loop_policy(asyncio.DefaultEventLoopPolicy())

# create a vanilla loop, make it nestable, and run the async init on it
loop = asyncio.new_event_loop()
nest_asyncio.apply(loop)  # type: ignore
asyncio.set_event_loop(loop)
result = loop.run_until_complete(init())
loop.close()

# restore uvloop's policy for the rest of the application
asyncio.set_event_loop_policy(_cur_event_loop_policy)


This did the trick, and I am able to call an async function in my FastAPI application's main.py before initializing the app.

Thursday, June 22, 2023

FastAPI and Swagger2

I recently started working on a web project with a Python backend. The hosting was on Google Cloud.

We built the POC version of the Python backend using the most popular Python 3 framework, FastAPI. We deployed it on Cloud Run (public) with a Docker image, and everything worked as expected.

We then moved on to making the Cloud Run service private and adding an API gateway in front of it. We saved the spec from http://localhost:8000/openapi.json and ran the gcloud CLI to create an API gateway, and it failed!

Looking at the documentation (https://cloud.google.com/api-gateway/docs/openapi-overview), we realized that Google, for some reason, still only supports OpenAPI 2.0 (Swagger 2.0). Having worked on AWS for quite some time, I had assumed OpenAPI 3 would be supported.

We tried figuring out a way to configure FastAPI to generate a Swagger 2 spec, but FastAPI only supports OpenAPI 3.x. Many have suggested downloading the openapi.json, passing it through a converter, and then applying some manual fixes to get to Swagger 2.

Since I wanted to use CI/CD, I didn't want to deal with manual conversion every time developers made a change. So, I referred to the FastAPI OpenAPI-related code and came up with the fastapi_swagger2 package.

Requirements

Python 3.8+
FastAPI 0.79.0+

Installation

$ pip install fastapi_swagger2

Example

from typing import Union
from fastapi import FastAPI
from fastapi_swagger2 import FastAPISwagger2

app = FastAPI()
FastAPISwagger2(app)


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.get("/items/{item_id}")
def read_item(item_id: int, q: Union[str, None] = None):
    return {"item_id": item_id, "q": q}


This adds the following endpoints:
http://localhost:8000/swagger2.json
http://localhost:8000/swagger2/docs
http://localhost:8000/swagger2/redoc


Generate spec for CI/CD

import os

import yaml

from app.main import app

URL = os.environ["CLOUD_RUN_URL"]

app.servers.append(URL)

spec = app.swagger2()
spec['x-google-backend'] = {'address': URL}

print(yaml.dump(spec))
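
From there, the printed spec can be fed to API Gateway in the pipeline. A sketch of the remaining steps (the script name, IDs, and region are placeholders):

python generate_spec.py > swagger2.yaml

gcloud api-gateway api-configs create my-config \
  --api=my-api --openapi-spec=swagger2.yaml

gcloud api-gateway gateways create my-gateway \
  --api=my-api --api-config=my-config --location=us-central1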

