Discussion:
wait until swapoff is *actually* finished (it returns too early)?
Thorsten Glaser
2024-08-21 00:00:01 UTC
Permalink
(Please d̲o̲ Cc me on replies, I don’t subscribe to this list. Thanks!)

Hi,

this is a somewhat curious problem:

I have a setup with swap devices on dmcrypt:

$ cat /etc/crypttab
# <target name> <source device> <key file> <options>
crtpv LABEL=fooclvm none discard,luks,initramfs
cswp1 /dev/vg-foo/lv-swp1 /dev/random discard,cipher=aes-xts-plain64,size=256,plain,swap
cswp2 /dev/vg-foo/lv-swp2 /dev/random discard,cipher=aes-xts-plain64,size=256,plain,swap

In a cronjob, I basically do swapoff && cryptdisks_stop && \
cryptdisks_start && swapon for both swaps individually to
throw away the old encryption key regularly (but not too
frequently).
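
The cycle described above can be sketched as a small shell function. This is only an illustration of the sequence of commands named in the post: the function name rotate_swap is hypothetical, and it assumes the swap device is reachable as /dev/mapper/NAME for a crypttab target NAME such as cswp1.

```shell
#!/bin/sh
# rotate_swap NAME: re-key the dm-crypt swap device /dev/mapper/NAME.
# NAME is a crypttab target such as cswp1 (hypothetical helper, not
# the poster's actual script). Each step runs only if the previous
# one succeeded, so a still-busy device aborts the cycle early and
# the function returns a non-zero status.
rotate_swap() {
    swapoff "/dev/mapper/$1" &&
    cryptdisks_stop "$1" &&
    cryptdisks_start "$1" &&
    swapon "/dev/mapper/$1"
}
```

A cron job could then call "rotate_swap cswp1" and "rotate_swap cswp2" one after the other, so that one swap area stays active while the other is re-keyed. Note that if cryptdisks_stop fails, the swap area is left disabled.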

I immediately ran into the problem, when trying this for the
first time, that a “swapoff /dev/mapper/cswp1” returns before
the device is released, so the subsequent cryptdisks_stop fails.

I found that inserting a “cat /proc/swaps”, funnily enough,
makes those failures less frequent but still present; adding
a “sleep 3” as well made it work for months.

Until tonight when it didn’t.

Just adding a “sleep” is no proper fix anyway, so the question
is, how to wait in a shell script until the swap device is
*really* swapoff’d when the syscall returns too early, and
(someone from the Linux kernel maintainers reading this?) should
I report the latter as a bug against the kernel?

This is on bullseye/amd64, on VMs and bare metal both. Using
direct partitions like /dev/sda3 (or via LABEL= to avoid trouble)
makes no difference from using LVs.

Thanks in advance,
//mirabilos
-- 
16:47⎜«mika:#grml» .oO(mira is simply good....) 23:22⎜«mikap:#grml»
mirabilos: and your bootloader is awesome :) 23:29⎜«mikap:#grml» and I
think it's totally awesome that I have a BSD to boot with grml, I'll have to
install that on a USB stick right away	-- Michael Prokop on MirOS bsd4grml
Roberto C. Sánchez
2024-08-21 00:10:01 UTC
Permalink
Just adding a “sleep” is no proper fix anyway, so the question
is, how to wait in a shell script until the swap device is
*really* swapoff’d when the syscall returns too early, and
(someone from the Linux kernel maintainers reading this?) should
I report the latter as a bug against the kernel?
I forget where and when (a long time ago?) but I recall having learned
that prior to swapoff it is necessary to call sync and in my history I
have it like this:

sync && sync && sync && swapoff

I couldn't tell why I have sync 3 times, but I know that it's how I've
called swapoff since as far back as I can remember.

Regards,

-Roberto
--
Roberto C. Sánchez
Greg Wooledge
2024-08-21 00:20:01 UTC
Permalink
Post by Roberto C. Sánchez
sync && sync && sync && swapoff
I couldn't tell why I have sync 3 times, but I know that it's how I've
called swapoff since as far back as I can remember.
Cargo cult. It was never useful to the best of my knowledge.
Erwan David
2024-08-21 06:50:01 UTC
Permalink
Post by Greg Wooledge
Post by Roberto C. Sánchez
sync && sync && sync && swapoff
I couldn't tell why I have sync 3 times, but I know that it's how I've
called swapoff since as far back as I can remember.
Cargo cult. It was never useful to the best of my knowledge.
Once upon a time, the sync command would return before the actual
syscall had completed. Running sync three times gave you a very high
probability that the first one had indeed completed all its writes.

But that was already no longer the case in SunOS 4 (~1990).
--
Erwan David
Greg Wooledge
2024-08-21 11:40:01 UTC
Permalink
Post by Erwan David
Post by Greg Wooledge
Post by Roberto C. Sánchez
sync && sync && sync && swapoff
I couldn't tell why I have sync 3 times, but I know that it's how I've
called swapoff since as far back as I can remember.
Cargo cult. It was never useful to the best of my knowledge.
Once upon a time, the sync command would return before the actual
syscall had completed. Running sync three times gave you a very high
probability that the first one had indeed completed all its writes.
But that was already no longer the case in SunOS 4 (~1990).
Even if that's true, running them all in the same command as Roberto
shows would not give you any benefit.

You'd need to physically *type* the command and press Enter three times
to get any "protection". And even then, it's really just the extra
time that it takes to type those commands out. You'd get the same
"protection" by simply waiting 10 seconds (or whatever's appropriate)
after running sync once.
Alain D D Williams
2024-08-21 12:20:01 UTC
Permalink
Post by Greg Wooledge
Even if that's true, running them all in the same command as Roberto
shows would not give you any benefit.
In early Unix sync *did* return immediately after scheduling a buffer flush.
Post by Greg Wooledge
You'd need to physically *type* the command and press Enter three times
to get any "protection". And even then, it's really just the extra
time that it takes to type those commands out.
This is exactly what I was taught to do in the 1980s and the reason was to
cause delay before typing ^p.
Post by Greg Wooledge
You'd get the same "protection" by simply waiting 10 seconds (or whatever's
appropriate) after running sync once.
Remembering to wait is much harder than remembering to type sync on 3 lines -
especially late at night.
--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 https://www.phcomp.co.uk/
Parliament Hill Computers. Registration Information: https://www.phcomp.co.uk/Contact.html
#include <std_disclaimer.h>
Alain D D Williams
2024-08-21 11:50:01 UTC
Permalink
Post by Erwan David
Post by Greg Wooledge
Post by Roberto C. Sánchez
sync && sync && sync && swapoff
I couldn't tell why I have sync 3 times, but I know that it's how I've
called swapoff since as far back as I can remember.
Cargo cult. It was never useful to the best of my knowledge.
Once upon a time, the sync command would return before the actual
syscall had completed. Running sync three times gave you a very high
probability that the first one had indeed completed all its writes.
I do remember one smart alec typing the following; he did not realise that
typing on separate lines was to slow him down:

sync;sync;sync
^P
H

^P on a PDP-11 console made it enter console state and 'H' then halted the
processor.
--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 https://www.phcomp.co.uk/
Parliament Hill Computers. Registration Information: https://www.phcomp.co.uk/Contact.html
#include <std_disclaimer.h>
Roberto C. Sánchez
2024-08-21 11:50:01 UTC
Permalink
Post by Greg Wooledge
Post by Roberto C. Sánchez
sync && sync && sync && swapoff
I couldn't tell why I have sync 3 times, but I know that it's how I've
called swapoff since as far back as I can remember.
Cargo cult. It was never useful to the best of my knowledge.
Yeah, that is not at all surprising to me.

It seemed somewhat odd and since I couldn't, even after wracking my
brain, come up with a source or even a vaguely plausible reason for that
particular incantation, I figured it either wasn't doing what I thought
it was doing or (more likely) it was doing something but essentially
as a side-effect.

Regards,

-Roberto
--
Roberto C. Sánchez
Stefan Monnier
2024-08-21 13:40:01 UTC
Permalink
Just adding a “sleep” is no proper fix anyway, so the question
is, how to wait in a shell script until the swap device is
*really* swapoff’d when the syscall returns too early, and
(someone from the Linux kernel maintainers reading this?) should
I report the latter as a bug against the kernel?
I'd file a bug report against the `mount` package (the one that
provides `swapoff`).


Stefan
Franco Martelli
2024-08-21 14:00:01 UTC
Permalink
(Please d̲o̲ Cc me on replies, I don’t subscribe to this list. Thanks!)
<snip>
Just adding a “sleep” is no proper fix anyway, so the question
is, how to wait in a shell script until the swap device is
*really* swapoff’d when the syscall returns too early, and
(someone from the Linux kernel maintainers reading this?) should
I report the latter as a bug against the kernel?
I don't think this is a kernel bug; many things have timeouts on all
OSes. However, you can pause the execution of a script with an
endless loop and break out of it once a condition is satisfied, e.g.:

# Poll /proc/swaps once per second until lv-swp1 no longer appears in it
while true
do
    /usr/bin/grep lv-swp1 /proc/swaps >/dev/null 2>&1 || break
    /usr/bin/sleep 1
done

HTH

P.S.
To other readers: the OP asked to be Cc'd on replies, see above.
--
Franco Martelli
Thorsten Glaser
2024-08-21 21:20:01 UTC
Permalink
(Please Cc me on replies.)
Post by Franco Martelli
while true
do
/usr/bin/grep lv-swp1 /proc/swaps >/dev/null 2>&1
Not the right condition though… it’s absent there but still in use.
I am looking for the right thing to check…

bye,
//mirabilos
-- 
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
Stefan Monnier
2024-08-22 12:50:01 UTC
Permalink
Not the right condition though… it’s absent there but still in use.
I am looking for the right thing to check…
How 'bout checking the success of `cryptdisks_stop`?


Stefan
Mike Castle
2024-08-22 15:50:02 UTC
Permalink
Post by Stefan Monnier
How 'bout checking the success of `cryptdisks_stop`?
Does cryptdisks have the ability to display what is in use at the
moment? Maybe polling that before executing the stop?
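
One candidate for such a status, offered as an assumption rather than a confirmed answer: the crypttab targets are device-mapper devices, and dmsetup can report an open count for them. The sketch below uses the real "dmsetup info -c --noheadings -o open" columns output; the helper names dm_open_count and dm_is_unused are hypothetical, and whether the count reaches 0 exactly when the stop would succeed has not been verified for this race.

```shell
# dm_open_count NAME: print how many openers device-mapper reports
# for the mapping NAME (e.g. cswp1). Usually needs root.
dm_open_count() {
    dmsetup info -c --noheadings -o open "$1" | tr -d '[:space:]'
}

# dm_is_unused NAME: succeed only if the reported open count is
# exactly 0. A failing or missing dmsetup yields empty output and
# is therefore treated as "still in use".
dm_is_unused() {
    [ "$(dm_open_count "$1")" = 0 ]
}
```

A script could call "dm_is_unused cswp1" after swapoff and only proceed to cryptdisks_stop once it succeeds.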

I suspect that the race is that, when the swapoff() syscall
returns, the kernel has indeed moved all of the content off, so that
part is fine... but it has not yet released whatever kind of resource
it has on the backing store (akin to an open file handle). Or the
cryptdisks stack itself hasn't fully processed the notification.

Ideally the stop command would have a flag that says 'wait for any
pending changes to happen', but short of that, some sort of status
that can be polled, with a sleep between polls, might be a bit more formal.

You could then control the timeout by looping no more than N times,
then failing a bit more gracefully than it is now.
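
A minimal sketch of such a bounded loop, assuming a generic helper (the name retry and its argument order are made up for illustration) that re-runs an arbitrary command up to N times with a pause in between:

```shell
# retry MAX DELAY CMD [ARGS...]: run CMD until it succeeds, at most
# MAX times, sleeping DELAY seconds between attempts. Returns 0 as
# soon as CMD succeeds and 1 once MAX attempts have failed.
retry() {
    max=$1 delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For instance, "retry 10 1 cryptdisks_stop cswp1" would attempt the stop up to ten times, one second apart, and report failure afterwards instead of looping forever. This combines the bounded loop with the earlier suggestion of checking the success of cryptdisks_stop itself.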

mrc
Thorsten Glaser
2024-08-22 17:30:01 UTC
Permalink
Post by Mike Castle
Does cryptdisks have the ability to display what is in use at the
moment? Maybe polling that before executing the stop?
That’s what I would like to ask and why I sent this eMail.
Post by Mike Castle
I suspect that the race is that, when the swapoff() syscall
returns, the kernel has indeed moved all of the content off, so that
part is fine... but it has not yet released whatever kind of resource
it has on the backing store (akin to an open file handle).
My guess as well.
Post by Mike Castle
You could then control the timeout by looping no more than N times,
then failing a bit more gracefully than it is now.
That’d be a method of last resort, same category as the sleep,
except even worse. I’d like to find a way to prevent that.

Thanks,
//mirabilos
-- 
<cnuke> filing down the AGP connector so it fits into the slot on the 440BX board…
or power supplies that the monitor was also plugged into and which then made for
an electrically charged case […] good for a laugh at every
LAN party │ <nvb> back when the pizza dough still "rose" on the monitor
t***@tuxteam.de
2024-08-22 19:40:01 UTC
Permalink
[...]
Post by Thorsten Glaser
Post by Mike Castle
I suspect that the race is that, when the swapoff() syscall
returns, the kernel has indeed moved all of the content off, so that
part is fine... but it has not yet released whatever kind of resource
it has on the backing store (akin to an open file handle).
My guess as well.
I'm not convinced. Finding out what needs copying back and locating
somewhere to put it is AIUI a slow process.
Actually, thinking about it: if the system hasn't enough discardable
RAM, the process might take arbitrarily long, no?

Cheers
--
t
Mike Castle
2024-08-23 03:40:01 UTC
Permalink
Irrespective of the time taken, that could trigger the OOM killer,
couldn't it. Very risky, unless you're using two swaps as mentioned.
I was actually surprised to see this happen in a test right now. I
*thought* that swapoff() would fail if it would reduce the available
memory to below current usage.

But indeed, the OOM Killer not only killed my test program, it took
out the swapoff command for good measure!
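
Since swapoff() apparently does not refuse on its own, a script could do a rough pre-check first. The following is a conservative sketch under stated assumptions: it compares the "Used" column of /proc/swaps with MemAvailable from /proc/meminfo (both in KiB), ignores any other active swap area that could absorb pages, and uses hypothetical helper names. The PROC_SWAPS and PROC_MEMINFO variables are made-up overrides that exist only to make the functions testable.

```shell
# swap_used_kib FILE DEVICE: print the KiB in use on DEVICE according
# to FILE, which must be in /proc/swaps format. DEVICE has to match
# the first column exactly (often /dev/dm-N for dm-crypt devices).
swap_used_kib() {
    awk -v dev="$2" '$1 == dev { print $4 }' "$1"
}

# mem_available_kib FILE: print MemAvailable (KiB) from FILE, which
# must be in /proc/meminfo format.
mem_available_kib() {
    awk '$1 == "MemAvailable:" { print $2 }' "$1"
}

# fits_in_ram DEVICE: succeed if the swap content of DEVICE is smaller
# than the memory currently reported as available. Fails if either
# value cannot be determined.
fits_in_ram() {
    used=$(swap_used_kib "${PROC_SWAPS:-/proc/swaps}" "$1")
    avail=$(mem_available_kib "${PROC_MEMINFO:-/proc/meminfo}")
    [ -n "$used" ] && [ -n "$avail" ] && [ "$used" -lt "$avail" ]
}
```

A call such as "fits_in_ram /dev/dm-3 && swapoff /dev/dm-3" would skip the swapoff when the check fails. MemAvailable is only an estimate, so this reduces the risk of waking the OOM killer but does not rule it out.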

mrc
Mike Castle
2024-08-23 02:50:01 UTC
Permalink
I'm not convinced. Finding out what needs copying back and locating
somewhere to put it is AIUI a slow process. What's much faster is
when processes themselves demand something be paged back in from
swap. I think there are "tricks" available to cause that to occur,
thus speeding up swapoff.
Exactly. In my experience, running swapoff(8) _will_ take a long time
if the swap area has a lot of content. It will block until everything
is moved out.

What I think we are seeing here is that phase has finished, but the
kernel has not yet notified the backing store that it is no longer
used. Though that is just a WAG.

mrc
Thorsten Glaser
2024-08-23 03:20:02 UTC
Permalink
Post by Mike Castle
Exactly. In my experience, running swapoff(8) _will_ take a long time
if the swap area has a lot of content.
Yes.
Post by Mike Castle
It will block until everything is moved out.
Unfortunately not. It will block until *almost* everything is
moved out. I think what we’re seeing is that the request to move
out the last bits was sent but is processed async, or something.

I think at this point, perhaps the kernel team has an idea.

bye,
//mirabilos
PS: please keep Cc’ing me on replies, thanks!
-- 
<hecker> cool, an Ada Lovelace Google Doodle. but for the 197th birthday? Couldn't
they have waited another 3 years? <mirabilos> google won't exist anymore
by then <hecker> yeah, one might think so. the announced end of the world
from the Maya calendar is probably the global shutdown of google ☺ and that's why
they have to crank out the doodles beforehand
Pierre-Elliott Bécue
2024-08-23 11:10:02 UTC
Permalink
Hey,
(Please d̲o̲ Cc me on replies, I don’t subscribe to this list. Thanks!)
Hi,
$ cat /etc/crypttab
# <target name> <source device> <key file> <options>
crtpv LABEL=fooclvm none discard,luks,initramfs
cswp1 /dev/vg-foo/lv-swp1 /dev/random discard,cipher=aes-xts-plain64,size=256,plain,swap
cswp2 /dev/vg-foo/lv-swp2 /dev/random discard,cipher=aes-xts-plain64,size=256,plain,swap
In a cronjob, I basically do swapoff && cryptdisks_stop && \
cryptdisks_start && swapon for both swaps individually to throw away
the old encryption key regularly (but not too frequently).
Ooc, what do you expect to actually gain from this setup?

Apart from that, I had read that discard does a bad job and people
should use an fstrim timer and drop discard options from mount points.

Bests,
--
PEB
Thorsten Glaser
2024-08-23 20:20:01 UTC
Permalink
Post by Pierre-Elliott Bécue
Post by Thorsten Glaser
In a cronjob, I basically do swapoff && cryptdisks_stop && \
cryptdisks_start && swapon for both swaps individually to throw away
the old encryption key regularly (but not too frequently).
Ooc, what do you expect to actually gain from this setup?
Encryption key rotation. Pages encrypted with the old key
are no longer readable afterwards. This is for long-running
VMs, on hoster infra, mostly (so the hoster could snapshot
the storage any time (ok, they could also snapshot the RAM,
but…)).

This is to get a bit closer to swapencrypt on BSD, which
uses separate keys for each page or set of pages, AIUI.

bye,
//mirabilos
-- 
As long as you don't do any dirty tricks, and I mean *really*
dirty tricks, like XORing both pointers of a doubly linked list
and storing them in just one word, Boehm works quite
excellently.		-- Andreas Bogk on boehm-gc in d.a.s.r
