
How Docker Actually Works — Container Internals Explained

A deep-dive into the Linux primitives that power every Docker container: namespaces for isolation, cgroups for resource limits, and union filesystems for layered images.

Divyanshu Singh Chouhan
12 min read · 2,650 words

What Is a Container, Really?

A Docker container is not a virtual machine. This is the single most important thing to understand before going further. A VM runs a complete operating system with its own kernel. A container shares the host kernel and uses Linux primitives to create an isolated environment.

Three Linux features make containers possible:

  • Namespaces — isolate what a process can see
  • cgroups — limit what a process can use
  • Union filesystems — layer files efficiently

Docker is essentially a user-friendly wrapper around these three kernel features. When you run docker run nginx, Docker asks the Linux kernel to create a new set of namespaces and cgroups, mounts a layered filesystem, and starts your process inside that isolated environment.

Linux Namespaces: The Isolation Layer

Namespaces control what a process can see. Each namespace type isolates a different system resource:

Namespace   What It Isolates                              Flag
PID         Process IDs — container sees its own PID 1    CLONE_NEWPID
NET         Network interfaces, IP addresses, ports       CLONE_NEWNET
MNT         Mount points — filesystem tree                CLONE_NEWNS
UTS         Hostname and domain name                      CLONE_NEWUTS
IPC         Inter-process communication                   CLONE_NEWIPC
USER        User and group IDs                            CLONE_NEWUSER
CGROUP      cgroup root directory                         CLONE_NEWCGROUP

When Docker creates a container, it calls the clone() system call with these flags. The result is a process that thinks it has its own hostname, its own network stack, its own PID numbering, and its own filesystem — even though it shares the same kernel as the host.

You can verify this yourself:

# On the host, find the host PID of a running container's main process
docker inspect --format '{{.State.Pid}}' my-container
# Then list that process's namespace handles
ls -la /proc/<PID>/ns/

Each file in /proc/<PID>/ns/ is a namespace handle. If two processes share the same namespace file (same inode), they see the same view of that resource.
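You can see this sharing directly. A child process started without any new-namespace flags points at exactly the same namespace inode as its parent — a minimal sketch (Linux-only, since it reads /proc):

```shell
# Compare the network-namespace handle of this shell with that of a child
# process spawned with no CLONE_NEW* flags — the inodes match.
parent_ns=$(readlink /proc/$$/ns/net)
child_ns=$(sh -c 'readlink /proc/$$/ns/net')   # $$ here is the child's PID

echo "parent: $parent_ns"
echo "child:  $child_ns"
[ "$parent_ns" = "$child_ns" ] && echo "same network namespace"
```

Run the same readlink against a containerized PID instead, and you would see a different net:[...] inode — that difference is the isolation.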

PID Namespace in Practice

Inside a container, the main process is always PID 1. But on the host, that same process has a completely different PID. This is PID namespace isolation:

# Inside the container
$ ps aux
PID  USER  COMMAND
1    root  nginx: master process
7    nginx nginx: worker process

# On the host
$ ps aux | grep nginx
PID    USER  COMMAND
28431  root  nginx: master process
28458  nginx nginx: worker process

Same processes, different PID numbering. The container cannot see or signal any process outside its namespace.

NET Namespace: Virtual Networking

Each container gets its own network namespace with its own eth0 interface, its own IP address, and its own port space. Docker creates a virtual ethernet pair (veth) — one end goes into the container namespace, the other connects to a bridge on the host (typically docker0).

This is why two containers can both listen on port 80 without conflicting — they each have their own network namespace.

cgroups: The Resource Limit Layer

Control groups (cgroups) limit how much of the host's resources a container can consume. Without cgroups, a runaway container could eat all available CPU and memory, starving other containers and the host itself.

Key cgroup controllers:

  • cpu — CPU time allocation and throttling
  • memory — RAM limits (hard and soft)
  • blkio — disk I/O bandwidth limits
  • pids — maximum number of processes

When you run docker run --memory=512m --cpus=1.5 nginx, Docker creates a cgroup with these limits and places the container process inside it.

# Check a container's memory limit (cgroup v1 layout)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes

# Check CPU allocation (cgroup v1 layout)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us

# On cgroup v2 (the default on modern distros with the systemd driver),
# the same limits live in a unified hierarchy:
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.max

What Happens When Limits Are Exceeded

If a container tries to allocate more memory than its limit, the kernel's OOM (Out of Memory) killer terminates the process. This is why you sometimes see containers restart unexpectedly — they hit their memory ceiling.

CPU limits work differently. The container isn't killed; it's throttled. The kernel simply doesn't give it more CPU time than allocated. The process runs slower but stays alive.
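The throttling is driven by two CFS values: a period and a quota. Docker derives the quota from the --cpus flag (quota = cpus x period, with a default period of 100,000 microseconds), which is easy to verify with a little arithmetic:

```shell
# How --cpus=1.5 becomes a CFS quota (default period is 100000 microseconds)
period_us=100000
cpus="1.5"
quota_us=$(awk -v c="$cpus" -v p="$period_us" 'BEGIN { printf "%d", c * p }')

echo "--cpus=$cpus -> cpu.cfs_quota_us=$quota_us (per ${period_us}us period)"
# prints: --cpus=1.5 -> cpu.cfs_quota_us=150000 (per 100000us period)
```

On cgroup v2 the same pair is exposed as a single cpu.max file containing both numbers ("150000 100000").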

Union Filesystems: The Image Layer

Docker images use a union filesystem (typically OverlayFS / overlay2 on modern Linux) to build images in layers. Each instruction in a Dockerfile creates a new layer. Layers are stacked on top of each other, and the union filesystem presents them as a single coherent filesystem.

Consider this Dockerfile:

FROM ubuntu:22.04             # Layer 1: Base OS (~77 MB)
RUN apt-get update            # Layer 2: Package lists (~45 MB)
RUN apt-get install -y nginx  # Layer 3: Nginx binary (~20 MB)
COPY index.html /var/www/     # Layer 4: Your file (~1 KB)

Each RUN and COPY creates a read-only layer. When you start a container, Docker adds one thin writable layer on top. Any file modifications in the running container go into this writable layer — the image layers below remain untouched.

Why Layers Matter

Layers are shared between images. If ten images all start FROM ubuntu:22.04, that base layer is stored only once on disk, and its files can even be shared in memory via the kernel's page cache. This is why pulling a new image is often fast — most layers already exist locally.

# See the layers of an image
docker history nginx:latest

# See layer storage on disk
ls /var/lib/docker/overlay2/

Copy-on-Write

When a container modifies a file that exists in a lower layer, the union filesystem copies that file to the writable layer first, then applies the modification. The original file in the read-only layer is unchanged. This is called copy-on-write (CoW).

This means:

  • Starting a container is fast — no filesystem copying needed
  • Multiple containers from the same image share all read-only layers
  • Only modifications create new data
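The copy-up behavior can be modeled without OverlayFS at all. This toy sketch (plain shell; the cow_read/cow_write helpers are hypothetical illustrations, not Docker's actual mechanism) mimics the semantics: reads fall through to the lower layer unless the upper layer has a copy, and a write copies the file up first:

```shell
# Toy model of copy-on-write between a read-only "lower" layer and a
# writable "upper" layer (illustration only — not how OverlayFS is driven).
workdir=$(mktemp -d) && cd "$workdir"
mkdir lower upper
echo "original" > lower/config.txt

cow_read() {   # prefer the writable layer, fall back to the read-only one
  if [ -f "upper/$1" ]; then cat "upper/$1"; else cat "lower/$1"; fi
}
cow_write() {  # copy-up before modifying; the lower layer is never touched
  [ -f "upper/$1" ] || cp "lower/$1" "upper/$1"
  printf '%s\n' "$2" > "upper/$1"
}

cow_read config.txt        # prints: original
cow_write config.txt "patched"
cow_read config.txt        # prints: patched
cat lower/config.txt       # prints: original — the image layer is unchanged
```

Swap "lower" for an image layer and "upper" for the container's writable layer and you have the picture: many containers can stack their own upper layer over one shared lower layer.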

How Docker Networking Works

Docker sets up networking by combining namespaces with virtual network devices:

  1. docker0 bridge — a virtual switch on the host (default: 172.17.0.0/16)
  2. veth pairs — virtual cables connecting each container to the bridge
  3. iptables rules — NAT rules for port mapping (-p 8080:80)

When you run docker run -p 8080:80 nginx:

  1. Docker creates a NET namespace for the container
  2. Creates a veth pair — one end in the container (becomes eth0), one end on the bridge
  3. Assigns an IP from the bridge subnet (e.g., 172.17.0.2)
  4. Adds an iptables DNAT rule: host:8080 -> container:80

# See the bridge (brctl is legacy; ip link show type bridge also works)
brctl show docker0

# See the iptables NAT rules Docker created
iptables -t nat -L -n | grep 8080

What Dockerfile Instructions Actually Do

Every Dockerfile instruction maps to a specific operation:

Instruction   What It Does Under the Hood
FROM          Sets the base image (starting layers)
RUN           Executes the command in a temporary container, saves the resulting layer
COPY          Adds files from the build context as a new layer
ENV           Sets an environment variable in image metadata (no new layer)
EXPOSE        Documents a port in image metadata (does NOT publish it)
CMD           Sets the default command in image metadata
ENTRYPOINT    Sets the executable that wraps CMD
WORKDIR       Sets the working directory for subsequent instructions

The key insight: RUN creates a new container, executes the command, takes a snapshot of the filesystem changes, and saves that as a layer. Then it destroys the temporary container. This is why environment variables set with export in one RUN instruction don't persist to the next — each RUN is a separate container lifecycle.
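You can reproduce that behavior with two plain shell invocations, a fair stand-in for two consecutive RUN instructions (each sh -c below plays the role of one temporary build container; BUILD_MODE is just an example variable):

```shell
# Two separate processes, like two separate RUN steps in a Dockerfile build:
sh -c 'export BUILD_MODE=release; echo "RUN 1 sees: $BUILD_MODE"'
sh -c 'echo "RUN 2 sees: ${BUILD_MODE:-<unset>}"'
# The export dies with the first process. To persist a variable across
# layers in a real Dockerfile, use the ENV instruction instead.
```

The first line prints "RUN 1 sees: release"; the second prints "RUN 2 sees: <unset>", because the export never escaped the first process.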

Containers vs VMs: The Real Difference

Aspect       Container                          Virtual Machine
Isolation    Process-level (namespaces)         Hardware-level (hypervisor)
Kernel       Shared with host                   Own kernel
Boot time    Milliseconds                       Seconds to minutes
Size         Megabytes                          Gigabytes
Overhead     Near-zero CPU overhead             5-15% hypervisor overhead
Security     Weaker isolation (shared kernel)   Stronger isolation (separate kernel)
Density      Hundreds per host                  Tens per host

Containers are not more secure than VMs. They trade isolation strength for speed and density. If a kernel exploit exists, a container attacker can potentially escape to the host. VMs provide a hardware-level boundary that is significantly harder to cross.

For most web applications, the trade-off is worth it. For multi-tenant environments where you run untrusted code, VMs or microVMs (like Firecracker, which powers AWS Lambda) remain the safer choice.

The Connection to ABCsteps Episode 06

In Episode 06: Netflix & Docker, we build a real containerized application from scratch. The episode covers Dockerfile creation, port mapping, and running your Snake game inside a container.

This blog post gives you the theoretical foundation. The episode gives you hands-on practice. Together, they give you the complete picture of how Docker works — both the "why" and the "how."

Key Takeaways

  1. Containers are not VMs — they share the host kernel and use namespaces + cgroups for isolation
  2. Namespaces provide visibility isolation (PID, network, filesystem, users)
  3. cgroups provide resource limits (CPU, memory, disk I/O)
  4. Union filesystems enable efficient layered images with copy-on-write
  5. Docker networking uses virtual bridges and veth pairs with iptables NAT
  6. Layers are shared — this is why Docker is so storage-efficient
  7. Security trade-off — containers are faster and lighter but less isolated than VMs

#docker #containers #linux #devops

Divyanshu Singh Chouhan

Founder, ABCsteps Technologies

On a mission to demystify the black box of technology for everyone. Building ABCsteps — a 20-chapter coding curriculum from absolute zero to AI Architect.