Discussion:
Unused blocks and fstrim
Steve Keller
2024-09-20 09:30:01 UTC
I'd like to understand some technical details about how fstrim, file
systems, and block devices work.

Do ext4 and btrfs keep a list of blocks that have already been reported as
unused, or do they have to report all unused blocks to the block device
layer every time the fstrim command is issued?

Does LVM keep information on every block about its usage or does it always
have to pass trim operations to the lower layer?

And does software RAID, i.e. /dev/md*, keep this information for every
block? Can RAID skip unused blocks when syncing a RAID-1 array after I
replace a disk?

Steve
Tim Woodall
2024-09-20 10:10:01 UTC
Post by Steve Keller
I'd like to understand some technical details about how fstrim, file
systems, and block devices work.
Do ext4 and btrfs keep a list of blocks that have already been reported as
unused or do they have to report all unused blocks to the block device
layer every time the fstrim command is issued?
Does LVM keep information on every block about its usage or does it always
have to pass trim operations to the lower layer?
And does software RAID, i.e. /dev/md* keep this information on every block?
Can RAID skip unused blocks from syncing in a RAID-1 array when I replace a
disk?
Steve
By default, iSCSI, md, LVM and ext2 do not keep this information. I
don't know if it's configurable somewhere, but I suspect not. I don't
know about btrfs.

Some of this data is cached, but not between reboots.
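
Not directly an answer to the tracking question, but whether each layer
passes discards down at all can be checked with lsblk; the
DISC-GRAN/DISC-MAX columns are non-zero wherever discard requests are
accepted:

# lsblk --discard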

The RAID rebuild is a particular pain point IMO. It's important to do a
discard after a failed-disk rebuild, otherwise every block is 'in use'
on the underlying storage.

After a rebuild I always create an LV with all the free space and then
discard it.
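
Something like this, assuming the VG is called vg0 and the LV name is
arbitrary (blkdiscard pushes the discards through dm/md down to the
disks):

# lvcreate -n scratch -l 100%FREE vg0
# blkdiscard /dev/vg0/scratch
# lvremove -y vg0/scratch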

I think an md rebuild that skips VG free space would suit me better
than discard tracking at all the different levels. I guess ZFS users
might have a different view of how useful LVM-aware mdraid is :-)
Michael Kjörling
2024-09-20 11:00:01 UTC
Post by Tim Woodall
I guess ZFS users might
have a different view of how useful lvm aware mdraid is :-)
ZFS nowadays has the pool `autotrim` property (default off) and the
`zpool trim` subcommand for manual or scripted usage. This is one of
those times when ZFS' awareness of actual storage usage and allocation
comes in handy at what's typically considered other layers in the
stack.
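
For example, with a pool name that's just a placeholder here:

# zpool set autotrim=on tank
# zpool trim tank
# zpool status -t tank

With autotrim=on, ZFS issues discards for space as it is freed during
normal operation; zpool trim walks the currently free space on demand,
and zpool status -t shows TRIM progress per device.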
--
Michael Kjörling 🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”
Steve Keller
2024-09-23 13:30:01 UTC
Post by Tim Woodall
By default, iSCSI, md, LVM and ext2 do not keep this information. I
don't know if it's configurable somewhere, but I suspect not. I don't
know about btrfs.
Some of this data is cached, but not between reboots.
I have played around a bit, and it seems that ext4 and btrfs do keep
track of already-trimmed blocks, but only for as long as the file
system is mounted:

# lvcreate -n foo -L1G vg0
Logical volume "foo" created.
# mkfs.ext4 /dev/vg0/foo
mke2fs 1.47.0 (5-Feb-2023)
[...]
# mount /dev/vg0/foo /mnt
# fstrim -v /mnt
/mnt: 973.4 MiB (1020678144 bytes) trimmed
# fstrim -v /mnt
/mnt: 0 B (0 bytes) trimmed
# umount /mnt
# mount /dev/vg0/foo /mnt
# fstrim -v /mnt
/mnt: 973.4 MiB (1020678144 bytes) trimmed
# fstrim -v /mnt
/mnt: 0 B (0 bytes) trimmed
# umount /mnt
# mkfs.btrfs -f /dev/vg0/foo
btrfs-progs v6.2
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM /dev/vg0/foo (1.00GiB) ...
[...]
# mount /dev/vg0/foo /mnt
# fstrim -v /mnt
/mnt: 1022.6 MiB (1072267264 bytes) trimmed
# fstrim -v /mnt
/mnt: 126 MiB (132087808 bytes) trimmed
# fstrim -v /mnt
/mnt: 126 MiB (132087808 bytes) trimmed
# fstrim -v /mnt
/mnt: 126 MiB (132087808 bytes) trimmed
# umount /mnt
# mount /dev/vg0/foo /mnt
# fstrim -v /mnt
/mnt: 1022.6 MiB (1072267264 bytes) trimmed
# fstrim -v /mnt
/mnt: 126 MiB (132087808 bytes) trimmed
# fstrim -v /mnt
/mnt: 126 MiB (132087808 bytes) trimmed
# umount /mnt

I'm also currently playing with ext4 and btrfs on QCOW2 images with
discard support. Looks nice.
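
In case it's useful to anyone, one way to wire that up with plain QEMU
(paths and IDs below are only illustrative, and this assumes a
virtio-scsi disk; other discard-capable configurations exist):

# qemu-img create -f qcow2 /var/tmp/test.qcow2 20G
# qemu-system-x86_64 -m 2G \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=/var/tmp/test.qcow2,format=qcow2,if=none,id=hd0,discard=unmap \
    -device scsi-hd,bus=scsi0.0,drive=hd0

With discard=unmap, an fstrim inside the guest deallocates the
corresponding qcow2 clusters, so du on the image file shrinks
accordingly.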
Post by Tim Woodall
The raid rebuild is a particular pain point IMO. It's important to do a
discard after a failed disk rebuild otherwise every block is 'in use' on
the underlying storage.
Hmm, does a RAID rebuild really always copy the whole new disk, even
the unused space? But what kind of info is then kept in the first
128 MiB of /dev/md0, if not a flag for every block telling whether it's
used or not?
Post by Tim Woodall
After a rebuild I always create a LV with all the free space and then
discard it.
:(

I currently have RAID only on a server with HDDs, which don't support
TRIM anyway. I have only needed to rebuild the two-disk RAID-1 twice,
and I seem to remember that not the whole disk was copied, but I might
be wrong about that.

Steve
Tim Woodall
2024-09-23 14:40:01 UTC
Post by Steve Keller
Post by Tim Woodall
The raid rebuild is a particular pain point IMO. It's important to do a
discard after a failed disk rebuild otherwise every block is 'in use' on
the underlying storage.
Hmm, does a RAID rebuild really always copy the whole new disk, even
the unused space? But what kind of info is then kept in the first
128 MiB of /dev/md0, if not a flag for every block telling whether it's
used or not?
Post by Tim Woodall
After a rebuild I always create a LV with all the free space and then
discard it.
:(
I currently have RAID only on a server with HDDs which don't support
TRIM anyway. I have only needed twice to rebuild the RAID-1 with 2
disks and I seem to remember that not the whole disk was copied, but I
might be wrong on that.
I think the bitmaps are for dirty blocks, so that a resync after a
power failure is quick; they're not used for a rebuild after replacing
a failed disk.
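
The write-intent bitmap can be inspected if you want to see what it
actually covers (the member device name is just an example):

# mdadm --examine-bitmap /dev/sda1
# cat /proc/mdstat

--examine-bitmap reports the bitmap chunk size and how many chunks are
currently dirty; /proc/mdstat shows whether a bitmap is active on each
array.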

But perhaps there's a config option somewhere so that the md device can
track discards in a bitmap.

My guess is that most people run at 90% capacity, so it's not that
useful...