ruario/inode-truncation-cpio.md

GNU cpio and inode truncation considerations

The following explains how inode
truncation in classic cpio formats (odc and newc) interacts with
hardlinked files, and under what conditions you may encounter problems.

TL;DR

Unless you are still using cpio for full system backups, this is probably a
non‑issue. Nonetheless, to ensure correct hardlink preservation with GNU cpio:

Use --renumber-inodes (or --reproducible, which also enables renumbering)
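
As a rough illustration of the advice above, the following sketch drives `find` and GNU cpio from Python. The helper names `cpio_create_command` and `create_archive` are hypothetical, and the sketch assumes a GNU cpio release recent enough to support `--renumber-inodes` is on the PATH.

```python
import subprocess


def cpio_create_command(fmt: str = "newc") -> list[str]:
    """Build a GNU cpio copy-out command that renumbers inodes."""
    # -o: create an archive, --null: NUL-separated names on stdin,
    # -H: archive format (for example "newc" or "odc").
    return ["cpio", "--null", "-o", "-H", fmt, "--renumber-inodes"]


def create_archive(root: str, archive_path: str, fmt: str = "newc") -> None:
    """Pipe `find . -print0` into cpio, writing the archive to archive_path.

    Assumes archive_path lies outside root, so the archive does not
    end up listing itself.
    """
    with open(archive_path, "wb") as out:
        find = subprocess.Popen(
            ["find", ".", "-print0"], cwd=root, stdout=subprocess.PIPE
        )
        try:
            subprocess.run(
                cpio_create_command(fmt),
                cwd=root,
                stdin=find.stdout,
                stdout=out,
                check=True,
            )
        finally:
            find.stdout.close()
            find.wait()


# Hypothetical usage (paths are placeholders):
# create_archive("/srv/data", "/tmp/data.cpio")
```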

What inode truncation is

Classic cpio formats store inode numbers in fixed‑width fields:

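odc stores the inode as six octal digits (18 bits, maximum 262143)
newc stores the inode as eight hexadecimal digits (32 bits, maximum 4294967295)

Modern filesystems (XFS, btrfs, ZFS, etc.) commonly use 64‑bit inode
numbers, and even the 32‑bit inode numbers of ext4 can overflow the odc field.
When a filesystem inode does not fit into the field size of the old cpio
formats, GNU cpio truncates it to the available width, keeping only the low
bits. For example, in odc:

real inode: 123456789
truncated: 249109 (123456789 modulo 2^18)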

If two different real inode numbers share the same low bits, they collapse to
the same stored inode value in the archive.
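
Truncation to the low bits can be modelled as a bit mask. The sketch below is illustrative only: `stored_inode` is a hypothetical helper, not a GNU cpio function, and the field width is passed in as a parameter (32 bits for newc).

```python
def stored_inode(real_inode: int, bits: int) -> int:
    """Return the low `bits` bits of real_inode, as a fixed-width field would keep them."""
    return real_inode & ((1 << bits) - 1)


# Two distinct 64-bit inode numbers that differ only above bit 31.
a = 5_000_000_123
b = a + (1 << 32)

print(stored_inode(a, 32))  # 705032827
print(stored_inode(b, 32))  # 705032827, the same stored value as for a
```

An inode number that already fits in the field is stored unchanged.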

cpio uses inode numbers only when handling hardlinks. It checks each file’s
link count (nlink > 1) and uses the inode number to match entries that refer
to the same underlying file. Later entries in a hardlink group store no file
contents and rely on the first match.
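
To see which files in a tree would be treated as hardlinks, one can group them by device and inode in the same spirit. This is a minimal sketch using `os.lstat`; the function name `hardlink_groups` is hypothetical, and directories are skipped because their link count exceeds 1 for unrelated reasons.

```python
import os
import stat
from collections import defaultdict


def hardlink_groups(root: str) -> dict:
    """Map (st_dev, st_ino) to the paths under root whose link count exceeds 1."""
    groups = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # do not follow symlinks
            if st.st_nlink > 1 and not stat.S_ISDIR(st.st_mode):
                groups[(st.st_dev, st.st_ino)].append(path)
    return dict(groups)


# Hypothetical usage: an empty result means inode numbers play no role.
# for key, paths in hardlink_groups(".").items():
#     print(key, paths)
```

A group may contain a single path if the other links live outside `root`; cpio would still see `nlink > 1` for that file.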

When truncation is problematic


No hardlinks = no problem

If none of the files in the archive are hardlinks (i.e. no nlink > 1
entries), inode truncation cannot cause corruption because the inode number
is effectively ignored.


One hardlink group = no problem

If the archive contains only one hardlink group, a truncated inode is
harmless: no other hardlink group exists for it to collide with.


Multiple hardlink groups = corruption is possible (but unlikely)

Extraction becomes unsafe only if:

multiple real inode numbers truncate to the same stored inode value, and
at least two of those colliding inodes belong to different hardlink groups (nlink > 1), and
the archive order interleaves the groups in a way that causes overlap.


If all three conditions are met, GNU cpio might incorrectly merge distinct
hardlink groups into a single group during extraction, causing data loss.
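
The merge can be illustrated with a toy extractor that links entries purely by their stored inode. This is a deliberately simplified model, not GNU cpio's actual logic: it ignores archive ordering and link-count bookkeeping, so it merges on any collision between hardlink groups. The name `extract_links` and the entry tuples are hypothetical.

```python
def extract_links(entries, bits):
    """entries: list of (path, real_inode, nlink).

    Returns {path: target}, where target is the earlier path this entry
    would be hardlinked to, or None if it is extracted as its own file.
    """
    mask = (1 << bits) - 1
    first_seen = {}  # stored inode -> first path seen with that value
    result = {}
    for path, inode, nlink in entries:
        if nlink < 2:
            result[path] = None  # inode number is ignored
            continue
        stored = inode & mask
        if stored in first_seen:
            result[path] = first_seen[stored]
        else:
            first_seen[stored] = path
            result[path] = None
    return result


group_a = 7
group_b = 7 + (1 << 32)  # different file, same low 32 bits
entries = [
    ("a1", group_a, 2), ("a2", group_a, 2),
    ("b1", group_b, 2), ("b2", group_b, 2),
    ("solo", 7 + (2 << 32), 1),
]
links = extract_links(entries, bits=32)
print(links)
```

In this model `b1` and `b2` end up linked to `a1`, so the contents of group b are lost, while `solo` is unaffected because its link count is 1.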

Note: A harmful collision remains unlikely. Truncation can happen on
modern filesystems, but actual corruption requires two different 64‑bit inode
values to truncate to the same smaller number and for those files to appear
as hardlink groups in an unfortunate order within the archive. That
combination is rare. For newc archives with a couple thousand entries, inode
truncation is effectively a non‑issue. Even with 5% of your archive as
hardlink pairs on a host with billions of inodes, you are in the “one in ten
million” territory for any harmful hardlink collisions. (odc, however, would
start to get risky.)
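
A back-of-the-envelope estimate can be made with the birthday approximation. The sketch below assumes the stored inode values of hardlink groups are uniformly distributed over the field (real inode numbers are often clustered, so treat this as indicative only) and counts any collision between two groups as harmful, ignoring the ordering condition. The function name and the example figure of 50 groups (2000 entries, 5% of them in hardlink pairs) are assumptions for illustration.

```python
import math


def collision_probability(groups: int, bits: int) -> float:
    """Approximate chance that at least two of `groups` hardlink groups
    share the same stored inode in a field of `bits` bits."""
    pairs = groups * (groups - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed accurately for tiny values
    return -math.expm1(-pairs / 2**bits)


p = collision_probability(groups=50, bits=32)
print(f"{p:.2e}")
```

Under these assumptions the newc case comes out at roughly 3 in ten million, the same order of magnitude as the figure quoted above.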

Working around the issue (renumbering)

GNU cpio provides:
--renumber-inodes
--reproducible

Either option rewrites inode numbers during archive creation so that:

All non‑hardlinked files use inode 0
Each hardlink group is assigned a new sequential inode number (1, 2, 3, …)
All files in the same hardlink group share the same renumbered inode

This avoids truncation collisions entirely for a 'normal' archive.
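
The renumbering scheme described above can be sketched as follows. This models the description in this document, not GNU cpio's source code; `renumber` and the entry tuples are hypothetical.

```python
def renumber(entries):
    """entries: iterable of (path, dev, inode, nlink).

    Returns {path: new_inode}: 0 for files with a single link, and a
    sequential number shared by all members of each hardlink group.
    """
    next_inode = 1
    assigned = {}  # (dev, inode) -> renumbered inode
    result = {}
    for path, dev, inode, nlink in entries:
        if nlink < 2:
            result[path] = 0
            continue
        key = (dev, inode)
        if key not in assigned:
            assigned[key] = next_inode
            next_inode += 1
        result[path] = assigned[key]
    return result


example = [
    ("a", 1, 5_000_000_123, 1),
    ("b", 1, 9_000_000_000, 2),
    ("c", 1, 13_294_967_296, 3),
    ("d", 1, 9_000_000_000, 2),
]
print(renumber(example))  # {'a': 0, 'b': 1, 'c': 2, 'd': 1}
```

Because the new numbers start at 1 and grow by one per group, they stay far below the field limit for any ordinary archive, regardless of how large the real inode numbers are.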

Note: Renumbering does not make truncation collisions completely
impossible. If an archive contains enough hardlink groups to exhaust the
available inode space of the chosen format, renumbered inodes would still
truncate and collide. However, because only hardlink groups consume
sequential inode numbers (1, 2, 3, …), you would need:

over 262 thousand hardlink groups for odc (2^18, the range of its six-octal-digit inode field)
over 4.3 billion hardlink groups for newc (2^32)

before truncation of renumbered inodes becomes a concern.

In practice, --renumber-inodes makes the problem so unlikely that it is safe
to ignore for any realistic archive.

Correction note

A previous version of this gist stated that BSD cpio (libarchive) renumbers
inodes by default. Further testing has shown that this is not the case:
BSD cpio writes the real inode numbers from the filesystem, just like GNU cpio
in normal mode. It will therefore suffer from inode truncation on
systems with very large inode numbers and can run into the same
hardlink-related issues described above.