Discussion:
Creating PDF/A from LaTeX source and from existing PDF
Ceppo
2024-07-03 13:10:01 UTC
Permalink
I wrote a report with LaTeX, and afterwards discovered it must be
PDF/A-compliant - which it wasn't. I found the pdfx LaTeX package and followed its
instructions, thus obtaining a file that should be PDF/A and that pdfinfo identifies
as such, but my employer's upload form thinks it isn't. Is pdfinfo reliable enough
that I can tell my employer his form is broken? If not, how can I make sure
that pdflatex's output is actually PDF/A-compliant?

I will also probably have to upload, under the same requirement, some third-party
PDFs, which are not PDF/A, without access to an editable version. Is there a way
to convert them to PDF/A? I know that converting from an editable version would
be the correct way to do this, but I have no real way to get one.

A requirement of any solution is that it doesn't rely on non-DFSG-compliant
software, including online conversion tools.

Thanks for any help.


--
Ceppo
t***@tuxteam.de
2024-07-03 13:40:01 UTC
Permalink
Post by Ceppo
I wrote a report with LaTeX, and afterwards discovered it must be
PDF/A-compliant - which wasn't. I found the pdfx LaTeX package and followed its
instructions, thus obtaining a file that should be PDF/A and pdfinfo identifies
as such, but my employer's upload form thinks isn't [...]
Uh-oh. We set the standards, but won't tell you what they are.
Post by Ceppo
Thanks for any help.
Not concrete help, but the Wikipedia article [1] makes for an interesting
read (including refs to bunches of test suites you can throw at your
publisher's site to find out where their validator is failing).

And there seems to be a kind of semi-official validator, according
to the above ref.

Cheers

[1] https://en.wikipedia.org/wiki/PDF/A
--
t
Henning Follmann
2024-07-03 15:10:01 UTC
Permalink
Post by t***@tuxteam.de
Post by Ceppo
I wrote a report with LaTeX, and afterwards discovered it must be
PDF/A-compliant - which wasn't. I found the pdfx LaTeX package and followed its
instructions, thus obtaining a file that should be PDF/A and pdfinfo identifies
as such, but my employer's upload form thinks isn't [...]
Uh-oh. We set the standards, but won't tell you what they are.
But they did! They say PDF/A. But you have a point that this may
not be enough. Which version of PDF/A are we talking about?

In general the policy is most likely a good one, because PDF/A gives you
certain guarantees (e.g. that the document renders consistently to the
same printed output, even years after archiving).
Post by t***@tuxteam.de
Post by Ceppo
Thanks for any help.
Not concrete help, but the Wikipedia [1] makes for an interesting
read (including refs to bunches of test suites you can throw at your
publisher's site to find out where their validator is failing).
And there seems to be a kind of semi-official validator, according
to the above ref.
I never tried to generate PDF/A from LaTeX but I am sure it is possible.
By default it would not include any JavaScript and IIRC it embeds the
fonts.
Post by t***@tuxteam.de
Cheers
[1] https://en.wikipedia.org/wiki/PDF/A
--
t
--
Henning Follmann | ***@itcfollmann.com
t***@tuxteam.de
2024-07-03 16:00:02 UTC
Permalink
[...]
Post by Henning Follmann
Post by t***@tuxteam.de
Uh-oh. We set the standards, but won't tell you what they are.
But they did! They say PDF/A. But you have a point that this maybe is
not enough. Which version of PDF/A are we talking about?
Don't get me wrong. The idea of PDF/A is great, the idea of using it
is too... but judging by the Wikipedia entry, the actual implementation
seems to be a mess, with several "levels", one semi-official validator
and a whole bunch of pairwise incompatible validators.

So just specifying PDF/A sounds like a sadistic torture coming out of
Catbert's Evil Human Resources Department :-)

Cheers
--
t
Ceppo
2024-07-08 17:30:02 UTC
Permalink
Post by t***@tuxteam.de
Post by Ceppo
I wrote a report with LaTeX, and afterwards discovered it must be
PDF/A-compliant - which wasn't. I found the pdfx LaTeX package and followed
its instructions, thus obtaining a file that should be PDF/A and pdfinfo
identifies as such, but my employer's upload form thinks isn't [...]
Uh-oh. We set the standards, but won't tell you what they are.
Well, in fact they did tell me - they just did so *after* I produced my report. But
yes, the workflow is very broken...
Post by t***@tuxteam.de
Not concrete help, but the Wikipedia [1] makes for an interesting
read (including refs to bunches of test suites you can throw at your
publisher's site to find out where their validator is failing).
I read about the Isartor Test Suite, but [1] says it checks whether the validator
accepts non-compliant files, not whether it rejects compliant ones.
Post by t***@tuxteam.de
And there seems to be a kind of semi-official validator, according
to the above ref.
I guess you mean veraPDF?


[1]: https://pdfa.org/resource/isartor-test-suite/


--
Ceppo
Sarunas Burdulis
2024-07-03 14:20:01 UTC
Permalink
Post by Ceppo
I wrote a report with LaTeX, and afterwards discovered it must be
PDF/A-compliant - which wasn't. I found the pdfx LaTeX package and followed its
instructions, thus obtaining a file that should be PDF/A and pdfinfo identifies
as such, but my employer's upload form thinks isn't. Is pdfinfo reliable enough
that I can tell my employer his form is broken? If not, how can I make sure
that pdflatex's output is actually PDF/A-compliant?
pdfinfo probably only reads metadata, but does not do any PDF/A
compliance validation.

VeraPDF seems to work for validation (https://verapdf.org/software/).
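
To illustrate the difference between reading a claim and validating it, here is a minimal Python sketch of a metadata-only check. It assumes the XMP packet is stored uncompressed in the file and only looks for the pdfaid:part and pdfaid:conformance entries; the function name claimed_pdfa_level is made up for this example, and this is not a description of how pdfinfo is implemented.

```python
import re

# XMP can store the PDF/A identification either as elements
# (<pdfaid:part>1</pdfaid:part>) or as attributes (pdfaid:part="1").
_PART = re.compile(rb"pdfaid:part(?:>|=['\"])\s*(\d+)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance(?:>|=['\"])\s*([A-Za-z]+)")


def claimed_pdfa_level(pdf_bytes: bytes):
    """Return the PDF/A level a file *claims* in its XMP metadata, or None.

    This only reads the claim. It does not check fonts, colour profiles,
    or anything else a real validator would verify.
    """
    part = _PART.search(pdf_bytes)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = _CONFORMANCE.search(pdf_bytes)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return "PDF/A-" + level


# Hypothetical fragment standing in for the contents of a real file.
sample = (
    b"%PDF-1.4\n<x:xmpmeta><pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance></x:xmpmeta>"
)
print(claimed_pdfa_level(sample))  # PDF/A-1b
```

A file can carry this claim and still be rejected by an actual validator such as veraPDF, which is why the two tools can disagree.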

--
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·
Ceppo
2024-07-08 16:50:02 UTC
Permalink
pdfinfo probably only reads metadata, but does not do any PDF/A compliance
validation.
VeraPDF seems to work for validation (https://verapdf.org/software/).
I don't know about pdfinfo, but it looks like veraPDF at least agrees with my
contractor's form. Thanks for pointing me to it, it looks like now I have a
tool to check if my document is compliant.


--
Ceppo
Henning Follmann
2024-07-03 15:20:01 UTC
Permalink
Post by Ceppo
I wrote a report with LaTeX, and afterwards discovered it must be
PDF/A-compliant - which wasn't. I found the pdfx LaTeX package and followed its
instructions, thus obtaining a file that should be PDF/A and pdfinfo identifies
as such, but my employer's upload form thinks isn't. Is pdfinfo reliable enough
that I can tell my employer his form is broken? If not, how can I make sure
that pdflatex's output is actually PDF/A-compliant?
I will also probably have to upload under the same requirement some third-party
PDF, which is not PDF/A, without access to an editable version. Is there a way
to convert them to PDF/A? I know that converting from an editable version would
be the correct way for this, but I have no real way to get it.
A requirement of any solution is that it doesn't rely on non-DFSG-compliant
software, including online conversion tools.
Thanks for any help.
I did a bit of research. It is possible to create a PDF/A-compliant
document from LaTeX. It looks like you have to do some work though.

Please look at this thread at StackExchange. I found that to be very
helpful.
https://tex.stackexchange.com/questions/130201/pdf-a-with-hyperref-on-tex-live-2013/136653#136653

Please let me know how it works out for you.

-H
--
Henning Follmann | ***@itcfollmann.com
Ceppo
2024-07-08 18:20:01 UTC
Permalink
Post by Henning Follmann
Post by Ceppo
A requirement of any solution is that it doesn't rely on non-DFSG-compliant
software, including online conversion tools.
Please look at this thread at StackExchange. I found that to be very
helpful.
https://tex.stackexchange.com/questions/130201/pdf-a-with-hyperref-on-tex-live-2013/136653#136653
Please let me know how it works out for you.
Hello.
Thanks for pointing to the thread, but the solution isn't suitable for me. I
need a solution that does not rely on non-DFSG-compliant software, but the
first step requires using a file from a zip archive [1] with a license that
explicitly forbids modifying and selling it.


[1]: http://www.eci.org/_media/downloads/icc_profiles_from_eci/ecirgbv20.zip


--
Ceppo
Richard
2024-07-03 16:40:01 UTC
Permalink
From LaTeX, this is quite simple, there's a package for that - as for
pretty much everything in the LaTeX world. Googling for just like 10 sec
could have given you this great guide:
https://webpages.tuni.fi/latex/pdfa-guide.pdf

Now, for just random PDFs, this is a bit more tricky, but you can do so
with ghostscript. Now, this sadly doesn't have such a great guide, but
something like this should do the trick, though that's only PDF/A-1 for all
I can tell. If your contractor needs a different version, you'll have to
adapt it:

gs -dQUIET -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite \
   -dPDFACompatibilityPolicy=1 -dCompressFonts=true -dSubsetFonts=true \
   -sFONTPATH=/usr/share/fonts/ -o <file name of output> <file name of input>

Now, one common thing that can happen is that you don't have the necessary
fonts installed (I'm using the system-wide fonts path here, but you can
also set any other path) so the result would look off. In that case, you
could just convert the fonts into outlines, which will make text
machine-unreadable and the file much bigger. For that,
replace "-dCompressFonts=true -dSubsetFonts=true
-sFONTPATH=/usr/share/fonts/" with "-dNoOutputFonts". Since I'm not
completely certain about Ghostscript's defaults, you can also add
"-dDownsampleMonoImages=false -dDownsampleGrayImages=false
-dDownsampleColorImages=false" to make sure the images stay otherwise
unchanged.
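
For reference, the PDF/A recipe in the Ghostscript documentation also passes -dPDFA and a PDFA_def.ps definition file that names an ICC profile, neither of which appears in the command above. A hedged Python sketch that assembles such a command line follows; the helper name build_gs_pdfa_command and the file names are placeholders, the location of PDFA_def.ps depends on the installation, and the output should still be checked with a validator.

```python
import shlex


def build_gs_pdfa_command(input_pdf, output_pdf, pdfa_part=1,
                          def_file="PDFA_def.ps"):
    """Return the argument list for a Ghostscript PDF/A conversion.

    def_file is assumed to be a copy of Ghostscript's PDFA_def.ps that has
    been edited to point at an ICC profile you are allowed to use.
    """
    return [
        "gs",
        f"-dPDFA={pdfa_part}",            # request PDF/A output (1, 2 or 3)
        "-dBATCH",
        "-dNOPAUSE",
        "-sColorConversionStrategy=RGB",  # must match the ICC profile used
        "-sDEVICE=pdfwrite",
        "-dPDFACompatibilityPolicy=1",    # drop non-conforming features
        f"-sOutputFile={output_pdf}",
        def_file,
        input_pdf,
    ]


# File names are illustrative only.
cmd = build_gs_pdfa_command("third-party.pdf", "third-party-pdfa.pdf")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the list separately from running it makes it easy to inspect the exact command before handing it to subprocess.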

For anything further, you'll have to research yourself as ghostscript is
very complex but used by many people.

Best
Richard
Van Snyder
2024-07-03 19:30:01 UTC
Permalink
Post by Richard
For anything further, you'll have to research yourself as ghostscript
is very complex but used by many people.
Please stop using such a dinky font. There are plenty of old farts
trying to read this list.


Can ghostscript convert a PDF generated by pdflatex to ePub or mobi?

Calibre made a mess, especially of tables. E-mailing it to my Kindle
account with "convert" in the subject line made a mess. Tools to
convert LaTeX to html in the hope of ultimately getting to ePub or mobi
utterly failed, so I don't know whether they in the end would have made
a mess.
e***@gmx.us
2024-07-03 19:40:01 UTC
Permalink
Post by Van Snyder
Post by Richard
For anything further, you'll have to research yourself as ghostscript
is very complex but used by many people.
Please stop using such a dinky font.
That's what ctrl-shift-+ is for.
Richard
2024-07-03 21:10:01 UTC
Permalink
Please stop using such a dinky font. There are plenty of old farts trying
to read this list.
Tell that to your mail program. If it chooses to show you the mail that
way, don't blame me. Everything needed to display it any way you want is
there, it just needs to be used. Thunderbird can define a minimum text size
and refuse to let messages use their own font. If your archaic software doesn't
do basics, blame the dev - or better yet yourself, as the choice is yours.

And the other part is its own thread. I've commented everything I know.
Just redo your book as native ePub, there's no way around it. If you want
to find shortcuts, you'll have to do your own research, even beyond Linux
there probably is no piece of software that can do what you are looking
for. But on the other hand, I'd never have expected ghostscript - or to be
more precise GhostPDL, if I'm not mistaken - to be able to handle
Microsoft's rubbish XPS format and convert that to a proper PDF. So who
knows? Instead of getting on other people's nerves with an unsolvable issue,
put those questions into the search engine of your choice. Maybe it will
be more competent than your mail program.

Richard

On Wed, 3 July 2024 at 21:20, Van Snyder wrote:
For anything further, you'll have to research yourself as ghostscript is
very complex but used by many people.
Please stop using such a dinky font. There are plenty of old farts trying
to read this list.
Can ghostscript convert a PDF generated by pdflatex to ePub or mobi?
Calibre made a mess, especially of tables. E-mailing it to my Kindle
account with "convert" in the subject line made a mess. Tools to convert
LaTeX to html in the hope of ultimately getting to ePub or mobi utterly
failed, so I don't know whether they in the end would have made a mess.
Greg Wooledge
2024-07-03 21:20:01 UTC
Permalink
Post by Richard
Please stop using such a dinky font. There are plenty of old farts trying
to read this list.
Tell that to your mail program. If it chooses to show you the mail that
way, don't blame me. Everything needed to display it any way you want is
there, it just needs to be used. Thunderbird can define a minimum text size
and refuse messages to use their own font. If your archaic software doesn't
do basics, blame the dev - or better yet yourself, as the choice is yours.
I never saw any problem, as my terminal-based MUA renders your text/plain
part just fine. I didn't even know you were posting multi-part messages
until someone complained about the font size.

That said, I wonder *why* you would go out of your way to make your
messages harder to read for people who don't know how to activate every
single feature of their MUA. It would be a good policy to make your
messages as easy to read as possible, for as many people as possible,
by default.

If you're simply punishing the ignorant for their ignorance, well, that
seems a bit spiteful.
Richard
2024-07-03 21:30:01 UTC
Permalink
Well, guess what, I haven't done anything to change the way messages look.
The only setting I ever change is how they are displayed to me. And never
has anyone ever had an issue with that, in many years. Probably because
other people are just not using unusable software. And quite frankly,
punishing the ignorant for their ignorance is the best policy there is. If
you go out of your way to make life as difficult as possible for yourself,
that's your issue. Don't make it everybody else's issue. Quit whining and
learn how to search the internet for solutions first. The chance that
you're the first to ask such basic questions is pretty much nonexistent.
And if you refuse to learn, that's up to you. But then you didn't learn
life's biggest lesson: you never stop learning.

And with that I'm ending this ridiculous discussion, this has gone far
enough off-topic.

Best
Post by Greg Wooledge
That said, I wonder *why* you would go out of your way to make your
messages harder to read for people who don't know how to activate every
single feature of their MUA.
Stefan Monnier
2024-07-04 03:20:01 UTC
Permalink
Hi Richard,

I don't see any problem because I'm reading this mailing-list from a MUA
that's mostly text-only and doesn't try to use variable-size fonts, but
the HTML part of your messages contains
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">
repeated several times. I have no idea why your MUA puts it there, but
I suspect that's the reason some of the readers here find your
messages hard to read: your mail specifically asks for
`font-size:small`.

This might qualify as a bug in your MUA (it can make sense to require
a small font for some parts of the message, but it seems this style
applies to the whole message, which makes no sense), tho maybe it's due
to some particularity of your configuration, or of the way you use your
MUA's editor.


Stefan
jeremy ardley
2024-07-04 03:20:01 UTC
Permalink
Post by Stefan Monnier
This might qualify as a bug in your MUA (it can make sense to require
a small font for some parts of the message, but it seems this style
applies to the whole message, which makes no sense), tho maybe it's due
to some particularity of your configuration, or of the way you use your
MUA's editor.
I use Thunderbird and I usually remember to select Sending Format as
text only when sending to lists. This should always render correctly on
any MUA.
Max Nikulin
2024-07-04 02:40:01 UTC
Permalink
Post by Van Snyder
Please stop using such a dinky font. There are plenty of old farts
trying to read this list.
- writing this before an attempt to hijack the thread using an already
discussed question,
Post by Richard
Tell that to your mail program. If it chooses to show you the mail that
way, don't blame me.
- insisting on an "industry standard" mail style
Post by Richard
<div class=3D"gmail_default" style=3D"font-famil=
y:arial,helvetica,sans-serif;font-size:small">Tell that to your mail progra=
-------------------------------^^^^^^^^^^^^^^^
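
The excerpt above is shown in its quoted-printable transfer encoding, where =3D stands for "=" and a "=" at the end of a line is a soft line break. A small Python sketch that decodes it with the standard quopri module (the variable names are illustrative):

```python
import quopri

# The raw excerpt as it appears in the message source, including the
# soft line break in the middle of "font-family".
raw_excerpt = (
    b'<div class=3D"gmail_default" style=3D"font-famil=\n'
    b'y:arial,helvetica,sans-serif;font-size:small">'
)

decoded = quopri.decodestring(raw_excerpt).decode("ascii")
print(decoded)
# <div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">
```

Once decoded, the font-size:small directive that the carets point at is plain to see.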
Michel Verdier
2024-07-04 09:00:02 UTC
Permalink
Post by Max Nikulin
Tell that to your mail program. If it chooses to show you the mail that way,
don't blame me.
- insisting on an "industry standard" mail style
<div class=3D"gmail_default" style=3D"font-famil=
y:arial,helvetica,sans-serif;font-size:small">Tell that to your mail progra=
-------------------------------^^^^^^^^^^^^^^^
I would add that it's up to the *sender's* mail program to send text-only
mail to this list (and others), as the HTML part is useless and multiplies
the mail size by almost 10.
Thomas Schmitt
2024-07-04 09:30:01 UTC
Permalink
Hi,
Post by Michel Verdier
I would add that it's up to the *sender* mail program to send text only
mail to this list (and others).
I found this link in the monthly list FAQ:

https://www.debian.org/MailingLists/#codeofconduct

where i read:

"Please don't send your messages in HTML; use plain text instead."

(I cannot judge how hard it is to fulfill this request on a contemporary
desktop.)


Regrettably the list archives seem to have a preference for publishing
the HTML version of list mails. At least i see two different fonts in
an archived mail of Richard:
https://lists.debian.org/debian-user/2024/07/msg00124.html


Have a nice day :)

Thomas
Greg Wooledge
2024-07-04 12:40:01 UTC
Permalink
Post by Thomas Schmitt
Regrettably the list archives seem to have a preference for publishing
the HTML version of list mails. At least i see two different fonts in
https://lists.debian.org/debian-user/2024/07/msg00124.html
In a way, this is good. It lets Richard see what his messages look like
to other people. At least, assuming his browser isn't configured to
ignore his own font size directives....

If you still have one of Richard's messages in your inbox, you can
look at the raw HTML. An excerpt of it has been posted a couple
times in this thread already. A bit of it is also included in the HTML
source of the list archive page (hence you seeing the font size
difference), but that's after the archives have wrapped another layer
of HTML encoding around the original, so take that with a grain of salt.
Roger Price
2024-07-04 09:40:02 UTC
Permalink
Post by Michel Verdier
Post by Max Nikulin
Post by Van Snyder
<div class=3D"gmail_default" style=3D"font-famil=
y:arial,helvetica,sans-serif;font-size:small">Tell that to your mail progra=
-------------------------------^^^^^^^^^^^^^^^
I would add that it's up to the *sender* mail program to send text only
mail to this list (and others). As the html part is useless and multiply
the mail size by almost 10.
The Debian mailing list Code of Conduct at https://www.debian.org/MailingLists/
is clear:

« Please don't send your messages in HTML; use plain text instead »

Roger
jeremy ardley
2024-07-04 10:30:01 UTC
Permalink
Post by Roger Price
The Debian mailing list Code of Conduct at
https://www.debian.org/MailingLists/
« Please don't send your messages in HTML; use plain text instead »
I presume there is some compelling reason that the mailing list doesn't
filter HTML emails and resend only the text-only version?
t***@tuxteam.de
2024-07-04 10:40:02 UTC
Permalink
Post by jeremy ardley
Post by Roger Price
The Debian mailing list Code of Conduct at
https://www.debian.org/MailingLists/
« Please don't send your messages in HTML; use plain text instead »
I presume there is some compelling reason that the mailing list doesn't
filter html emails and only resend the text only version?
Strange question, and even stranger way to pose it.

But let me try: perhaps because the people who set up the mailing
list don't believe in enforcing behavior by technological means,
but rather by convincing people?

That's at least how I'd think.

Cheers
--
t
jeremy ardley
2024-07-04 11:20:01 UTC
Permalink
Post by t***@tuxteam.de
But let me try: perhaps because the people who set up the mailing
list don't believe in enforcing behavior by technological means,
but rather by convincing people?
If I understand the history correctly:

- All early email lists were text only

- After some long time people started sending mixed format emails to lists

- Shortly afterwards list administrators asked people to not send mixed
format emails

- Since then people either in ignorance of list etiquette or ignorance
of their mailer properties kept on sending mixed format emails to lists

The problem is mostly because users have email software that
automatically uses mixed format. That's not their fault as they are
probably unaware of the problem.

List administrators have the ability to ban users who violate etiquette
and in this list actively do so. Banning a user for using mixed format
in violation of list etiquette is obviously not an option.

Unless there is a compelling reason to accept mixed format ( HTML ) I
can't see why the list can't filter submissions to text only - which is
the list policy anyway - and by doing so provide education to users
about what the list format is.
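
As an illustration of what such filtering could look like, here is a minimal Python sketch that keeps only the text/plain alternative of a message. It uses the standard email package; the function name plain_text_only is invented for this example, and this is not a description of how the Debian list software works.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def plain_text_only(raw_message: bytes) -> str:
    """Return the text/plain body of a message, ignoring any HTML part."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        # HTML-only mail: there is nothing to fall back to.
        raise ValueError("message has no text/plain part")
    return body.get_content()


# A made-up multipart/alternative message for demonstration.
demo = EmailMessage()
demo["Subject"] = "Re: Creating PDF/A"
demo.set_content("Tell that to your mail program.\n")
demo.add_alternative(
    '<div style="font-size:small">Tell that to your mail program.</div>',
    subtype="html",
)
print(plain_text_only(demo.as_bytes()))
```

The ValueError branch corresponds to the case of mails that are HTML only, for which a text-only filter has no plain part to keep.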
Richard Owlett
2024-07-04 11:50:01 UTC
Permalink
Post by jeremy ardley
Post by t***@tuxteam.de
But let me try: perhaps because the people who set up the mailing
list don't believe in enforcing behavior by technological means,
but rather by convincing people?
- All early email lists were text only
- After some long time people started sending mixed format emails to lists
- Shortly afterwards list administrators asked people to not send mixed
format emails
- Since then people either in ignorance of list etiquette or ignorance
of their mailer properties kept on sending mixed format emails to lists
The problem is mostly because users have email software that
automatically uses mixed format. That's not their fault as they are
probably unaware of the problem.
List administrators have the ability to ban users who violate etiquette
and in this list actively do so. Banning a user for using mixed format
in violation of list etiquette is obviously not an option.
Unless there is a compelling reason to accept mixed format ( HTML ) I
can't see why the list can't filter submissions to text only - which is
the list policy anyway - and by doing so provide education to users
about what the list format is.
Another problem is that nothing intrinsically forces the "plain text
format content" to match the "HTML format content". There's one
obnoxious individual who puts sarcastic comments in the "plain text
format content" about mail programs that enforce "plain text only".
Greg Wooledge
2024-07-04 12:30:01 UTC
Permalink
Unless there is a compelling reason to accept mixed format ( HTML ) I can't
see why the list can't filter submissions to text only - which is the list
policy anyway - and by doing so provide education to users about what the
list format is.
That would be a drastic change. The main purpose of debian-user, as I
see it, is to offer help to Debian users who need it. As such, the
posting etiquette expectations need to be lowered a bit for this list.
Users who need help may have lower technical proficiency than one would
expect on, say, a Debian developers' list.

As long as people are not intentionally being rude about it, I'd give
them the benefit of the doubt. "Richard", for example, seemed to be
unaware that the HTML parts of his multipart messages were being sent
with the font size set to "small". Now that he knows about it, he might
be able to get that taken care of.

Lots of other people may be facing similar technological challenges,
and yet they may still be capable of contributing to the discussion.
Richard
2024-07-04 13:10:01 UTC
Permalink
The bug is reported already, as that's by no means what's intended or what
the settings would suggest. Anything beyond that is out of my hands. And as
already explained, since there are enough tools out there to have messages
look the way you want, this simply doesn't have any priority.

Best
Post by Greg Wooledge
"Richard", for example, seemed to be
unaware that the HTML parts of his multipart messages were being sent
with the font size set to "small".
Michel Verdier
2024-07-04 12:30:01 UTC
Permalink
The problem is mostly because users have email software that automatically
uses mixed format. That's not their fault as they are probably unaware of the
problem.
And lots of MUAs only show the HTML version, hiding the text copy and the
problem.
Unless there is a compelling reason to accept mixed format ( HTML ) I can't
see why the list can't filter submissions to text only - which is the list
policy anyway - and by doing so provide education to users about what the list
format is.
Perhaps because users needing "education" usually don't read policy and
don't read MUA options... Or just perhaps some mails could be HTML
only.

PS : for your eyes (almost) only : don't reply to list AND to author as
it creates duplicates :)
Jeffrey Walton
2024-07-05 03:00:01 UTC
Permalink
Post by jeremy ardley
Post by Roger Price
The Debian mailing list Code of Conduct at
https://www.debian.org/MailingLists/
« Please don't send your messages in HTML; use plain text instead »
I presume there is some compelling reason that the mailing list doesn't
filter html emails and only resend the text only version?
Yeah, the policy should be enforced at the list server if it is an
important policy. As an example, Vger will reject HTML messages since
the kernel folks feel HTML emails are a characteristic of spam. See
<http://vger.kernel.org/majordomo-info.html>.

If it is not an important policy, then it should probably be removed
from the FAQ.

Jeff
Richard
2024-07-04 14:30:01 UTC
Permalink
If you ever want to be taken seriously, stop spreading such bogus nonsense.
Even base64 encoding wouldn't blow up the size that much. No idea what bs
mail you are talking about, but for me, both the plain text and html
version are said to be 4k in size (by du). Even though that's not that
exact, simple logic is enough to be able to tell your claim is pretty much
impossible.

Best
Post by Michel Verdier
As the html part is useless and multiply
the mail size by almost 10.
Greg Wooledge
2024-07-04 14:50:01 UTC
Permalink
Post by Richard
If you ever want to be taken seriously, stop spreading such bogus nonsense.
Even base64 encoding wouldn't blow up the size that much. No idea what bs
mail you are talking about, but for me, both the plain text and html
version are said to be 4k in size (by du). Even though that's not that
exact, simple logic is enough to be able to tell your claim is pretty much
impossible.
Best
Post by Michel Verdier
As the html part is useless and multiply
the mail size by almost 10.
Richard, your message to which I'm replying shows the following sizes:

I 1 <no description> [multipa/alternativ, 7bit, 2.0K]
I 2 ├─><no description> [text/plain, quoted, utf-8, 0.5K]
I 3 └─><no description> [text/html, quoted, utf-8, 1.2K]

The HTML part is more than double the size of the plain text part, and
when you include all of the MIME metadata needed to set up the multipart
message, the overall size of the body is about 4x what it would have been
if you'd only sent plain text (0.5k -> 2.0k).

Granted, this is not the 10x increase that Michel predicted, but it's
easy to see how a *different* HTML message, with a lot more markup,
could certainly reach that threshold.
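
The effect is easy to reproduce. The following Python sketch builds the same short reply once as plain text and once as multipart/alternative with a Gmail-style HTML part, then compares the serialized sizes. The sample text and the helper name serialized_size are assumptions for illustration; the exact numbers depend on the message.

```python
from email.message import EmailMessage

# Made-up sample content, loosely modelled on the excerpts in this thread.
TEXT = "The bug is reported already. Anything beyond that is out of my hands.\n"
HTML = (
    '<div dir="ltr"><div class="gmail_default" '
    'style="font-family:arial,helvetica,sans-serif;font-size:small">'
    "The bug is reported already. Anything beyond that is out of my hands."
    "</div></div>\n"
)


def serialized_size(text, html=None):
    """Size in bytes of the MIME entity: content headers plus body."""
    msg = EmailMessage()
    msg.set_content(text)
    if html is not None:
        # Converts the message to multipart/alternative with both parts.
        msg.add_alternative(html, subtype="html")
    return len(msg.as_bytes())


plain = serialized_size(TEXT)
multipart = serialized_size(TEXT, HTML)
print(f"plain: {plain} bytes, multipart: {multipart} bytes, "
      f"factor: {multipart / plain:.1f}x")
```

The factor grows with the amount of markup per line of text, which is why different senders produce very different ratios.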

Also, please stop top-posting your replies.
Richard
2024-07-04 15:00:01 UTC
Permalink
Right, because 4x = 10x. Jesus, stop being so ridiculous. Also, there's
some magic trick called compression. Human readable text is especially easy
to compress, basically negating all those effects. So just stick to
reality, everything else is just embarrassing.
Post by Greg Wooledge
The HTML part is more than double the size of the plain text part, and
when you include all of the MIME metadata needed to set up the multipart
message, the overall size of the body is about 4x what it would have been
if you'd only sent plain text (0.5k -> 2.0k).
Granted, this is not the 10x increase that Michel predicted, but it's
easy to see how a *different* HTML message, with a lot more markup,
could certainly reach that threshold.
Michel Verdier
2024-07-04 16:00:01 UTC
Permalink
Post by Richard
Right, because 4x = 10x. Jesus, stop being so ridiculous. Also, there's
some magic trick called compression. Human readable text is especially easy
to compress, basically negating all those effects. So just stick to
reality, everything else is just embarrassing.
Please don't be rude.

OK, I overestimated the gap in *your* mails, but not in other HTML mails
I've seen (as Greg correctly supposed).

Compression reduces the size, but it's proportional, so it doesn't negate the
extra HTML size. The global size will always be 4-10x.

The *reality* is that you need to change just one option in your MUA.
And I stop here this thread.
Stefan Monnier
2024-07-05 04:40:01 UTC
Permalink
Post by Michel Verdier
Compression reduces the size but it's proportionnal so don't negate the
extra html size. The global size will always be 4-10x.
No, the compression is not proportional. HTML is naturally very
redundant, and machine-generated HTML like the one seen in Richard's
email tends to be excruciatingly redundant, so it compresses even much
better than plain text. Plus the part of the text/plain that's in
common with the text/html (i.e. the actual useful part) would usually be
recognized as a redundancy, so all in all you'll typically get a much
smaller size difference after compression.
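
A quick way to see this is to compress a plain-text body and the same body followed by a machine-generated HTML copy, and compare the ratios before and after. The Python sketch below uses zlib and made-up sample lines; the names LINES, DIV and size_ratios are illustrative, and the numbers will vary with the input.

```python
import zlib

# Sample lines standing in for a short plain-text reply (illustrative only).
LINES = [
    "The bug is reported already, and anything beyond that is out of my hands.",
    "Please send plain text to the list instead of multipart messages.",
    "My terminal-based mail reader shows the text part without any trouble.",
    "Other readers only ever see the HTML part and its small font size.",
    "Whether the archive stores messages compressed is a separate question.",
    "In practice the size of a single list mail hardly matters at all.",
]

# Gmail-style wrapper repeated around every line of the HTML part.
DIV = ('<div class="gmail_default" style="font-family:arial,helvetica,'
       'sans-serif;font-size:small">{}</div>')


def size_ratios(lines):
    """Return (raw_ratio, compressed_ratio) of text+HTML versus text alone."""
    text = "\n".join(lines).encode("utf-8")
    html = "\n".join(DIV.format(line) for line in lines).encode("utf-8")
    both = text + b"\n" + html
    raw_ratio = len(both) / len(text)
    compressed_ratio = len(zlib.compress(both)) / len(zlib.compress(text))
    return raw_ratio, compressed_ratio


raw_ratio, compressed_ratio = size_ratios(LINES)
print(f"uncompressed: {raw_ratio:.1f}x, compressed: {compressed_ratio:.1f}x")
```

The repeated markup and the duplicated text are both found as back-references by the compressor, so the compressed ratio comes out well below the uncompressed one.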

Of course, that's if compression takes place, which is not necessarily
the case. In practice, for most emails like the ones exchanged on this
mailing-list, the precise size of the message is largely irrelevant:
even if multiplied by 10x, the cost of the actual content is lost in the
noise of the rest of the protocol.


Stefan
Richard
2024-07-05 08:50:01 UTC
Permalink
That's not how lossless compression works. The final size depends much more on the
content than on how much content there is. By no means is it "proportional".
Post by Michel Verdier
Compression reduces the size but it's proportionnal so don't negate the
extra html size. The global size will always be 4-10x.
Jeffrey Walton
2024-07-05 03:10:01 UTC
Permalink
Right, because 4x = 10x. Jesus, stop being so ridiculous. Also, there's some magic trick called compression.
Compression is a security hole. It leaks information. It should be
disabled. In fact, TLS v1.3 removed it from the protocol. Also see
<https://en.wikipedia.org/wiki/CRIME> and
<https://en.wikipedia.org/wiki/BREACH>.
Human readable text is especially easy to compress, basically negating all those effects. So just stick to reality, everything else is just embarrassing.
Jeff
Post by Greg Wooledge
The HTML part is more than double the size of the plain text part, and
when you include all of the MIME metadata needed to set up the multipart
message, the overall size of the body is about 4x what it would have been
if you'd only sent plain text (0.5k -> 2.0k).
Granted, this is not the 10x increase that Michel predicted, but it's
easy to see how a *different* HTML message, with a lot more markup,
could certainly reach that threshold.
Richard
2024-07-05 09:00:01 UTC
Permalink
And who was talking about transport? The whole discussion was about
storage, and storing mail compressed is hardly a security issue.
Post by Jeffrey Walton
Compression is a security hole. It leaks information. It should be
disabled. In fact, TLS v1.3 removed it from the protocol. Also see
<https://en.wikipedia.org/wiki/CRIME> and
<https://en.wikipedia.org/wiki/BREACH>.
Greg Wooledge
2024-07-05 11:40:01 UTC
Permalink
Post by Richard
And who was talking about transport? The whole discussion was about
storage, and storing mail compressed is hardly a security issue.
The discussion was originally about your messages containing directives
to render all of your text in a small font. Some readers found your
messages difficult to read for that reason, and we politely pointed it
out to you.

Everything after that has been you ranting against everyone who talks to
you, shifting the goalposts, and refusing to acknowledge that *any* of
your etiquette violations are your responsibility, or even a problem.

Your messages:

* Are multipart text/html and text/plain.

* Consistently use top-posting of your reply above the quoted text.

* Contain directives to use small fonts that make reading them difficult
for some people.

Two of these issues are breaches of the list's established rules of etiquette,
and the third one is an unfortunate rudeness. Yet, instead of taking
steps to correct any of them, you chose to shift the topic to message
sizes (which isn't the primary reason HTML email is frowned upon), and
then to compression efficiency (completely irrelevant, as email is
generally neither transmitted nor stored in a compressed form).

Anyway, this is pretty clearly a lost cause. You're using gmail, which
is notoriously horrible, and it seems you are not willing to change your
behavior. There's nothing more I can do to help you.
Richard
2024-07-05 12:10:02 UTC
Permalink
You really need to better read who writes what. I didn't start the
discussion on message sizes due to HTML, I simply ended it because of
irrelevance.
[...] you chose to shift the topic to message
sizes (which isn't the primary reason HTML email is frowned upon) [...]
Nicolas George
2024-07-05 12:10:02 UTC
Permalink
Post by Richard
You really need to better read who writes what. I didn't start the
discussion on message sizes due to HTML, I simply ended it because of
irrelevance.
And that ended the willingness of many people to help you.

Good luck with your problems.
--
Nicolas George
Richard
2024-07-05 12:20:01 UTC
Permalink
Thank god nobody needs help from people so hung up on absolutely irrelevant
stuff and rules that haven't made sense in decades - if ever. As you may
have read from the threads, those rules aren't undisputed at all. If they
were seen as being as relevant as some people want to make believe, the list
maintainers would have automated compliance on the server side long ago. They
didn't, so it can't be that relevant.

Also, you implying rudeness because of an unnoticed bug is just
hilarious and only shows that you just have no clue about what you are
writing and don't even bother properly reading threads before making up
accusations. So better hope nobody remembers your bad attitude if you ever
need help. Now stop spamming.
Post by Nicolas George
And that ended the willingness of many people to help you.
Good luck with your problems.
--
Nicolas George
Felix Miata
2024-07-05 18:10:01 UTC
Permalink
I'm not able to read this message.
Can you suggest to us why you think that might be?
Post by Richard
You really need to better read who writes what. I didn't start the
discussion on message sizes due to HTML, I simply ended it because of
irrelevance.
[...] you chose to shift the topic to message
sizes (which isn't the primary reason HTML email is frowned upon)
[...]
--
Evolution as taught in public schools is, like religion,
based on faith, not based on science.

Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata
Felix Miata
2024-07-05 19:10:01 UTC
Permalink
Post by Felix Miata
I'm not able to read this message.
Can you suggest to us why you think that might be?
Because the message was composed in html using a very small font, and
my mail reader (evolution) automatically prefers to read mail in html.
I've never before had to make an explicit request to the mail reader to
switch to plain text, so I haven't taken the trouble to work out how to
do it. It's easier to ignore messages that are either incompetently or
intentionally composed so as to be unreadable without special actions
taken by the recipients.
I don't use Evolution, but I suspect that, being a GNOME application, it works like
web browsers, where fonts can be enlarged using Ctrl-+ as many times as it takes
to grow the fonts adequately. Possibly it also has a minimum displayed text size
option, as web browsers offer.

Evolution may also be able to do as I have done since last century, starting with
Netscape 2 email, then Netscape 3 email, then Netscape 4 email, then Mozilla
email, and now SeaMonkey, the replacement name for Mozilla. That
is, to select plain-text-only mail reading via the menu option

view message body as plain text

Sometimes email is sent in multipart with no plain text content. Unless those come
from doctors or financial institutions, I consider them spam and delete without
attempting to read any included content. When necessary, I temporarily switch from
view message body as plain text to as simple HTML.
--
Evolution as taught in public schools is, like religion,
based on faith, not based on science.

Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata
d***@howorth.org.uk
2024-07-06 14:50:01 UTC
Permalink
It's not my responsibility to deal with messages the senders aren't
serious about being read.
It's up to you of course but if that's your opinion then you always
have the option of simply not reading messages that are sent (against
list guidelines) with HTML parts that suggest using fonts that are too
small for you.

Alternatively:

- you could search for how to adjust font sizes in evolution (hint
edit/ preferences/ mail preferences/ general tab)

- you could set evolution to display the plain text version of emails

- you could choose another mail reader

Sadly whilst it is your opinion that it's not your responsibility, I
doubt many other people share your opinion, so I think your options are
limited to those within your own control, such as the four above.
Felix Miata
2024-07-06 23:20:01 UTC
Permalink
I know what to do to read messages with tiny fonts -- if I can see
enough of it to decide they're interesting.
So far, only one correspondent, whom I have by-and-large concluded
doesn't have anything interesting to say.
What I'm offering to those who send messages that they seriously
consider to be worth reading: You ought to make them readable. If you
make it hard for recipients to read them, they'll ignore your wisdom.
FWIW to anyone not familiar with how email was 30+ years ago: M$ and Win95 seem to be
the root cause not only of using HTML for email by default, but also of
defaulting to a smaller-than-default font size in those
HTML emails, apparently to match what web designers were doing, making email
mousetype similar to the web page mousetype those eagle-eyed designers were fond
of imposing on everyone in the days before zoom was invented. Most GUI email
clients, as well as webmail apps, seem to have followed this stupid, rude lead.
The immediate blame lies with the sender who doesn't change the rude default,
typically not knowing that it exists or that it can be changed (though in some
cases the default cannot be changed).
--
Evolution as taught in public schools is, like religion,
based on faith, not based on science.

Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata
t***@tuxteam.de
2024-07-07 05:40:02 UTC
Permalink
On Sat, Jul 06, 2024 at 07:17:51PM -0400, Felix Miata wrote:

[...]
Post by Felix Miata
FWIW to anyone not familiar with how email was 30+ years ago: M$ and Win95 seem to be
the root cause not only of using HTML for email by default, but also of
defaulting to a smaller-than-default font size in those
HTML emails, apparently to match what web designers were doing, making email
mousetype similar to the web page mousetype those eagle-eyed designers were fond
of imposing on everyone in the days before zoom was invented.
[...]

Don't blame the designers. There has always been a struggle over control
of the end user's computer -- just as a means of reaching the end user's
perception. The companies are winning.

That's what we get when the companies financing the infrastructure are all,
basically, advertising companies (Microsoft? They don't make tech. They sell
tech).

Cheers
--
t
Max Nikulin
2024-07-06 02:40:01 UTC
Permalink
I'm not able to read this message.
I do not think you will manage to achieve anything this way. The
person has clearly expressed that they are not going to follow
recommendations concerning message format and do not care if their messages
cause trouble for some readers. Better options may be:

- Silently ignore.
- If you still expect something useful, then find a way to deal with this
kind of message: temporarily switch to the plain text part, configure fonts,
copy-paste text to an editor.
- Discuss with the mailer developers if they can implement some kind of
workaround.
- Switch to another mail user agent.

P.S. Please, read
- <https://www.debian.org/MailingLists/#codeofconduct>
- Monthly FAQ for Debian-user mailing list
<https://lists.debian.org/msgid-search/***@einval.com>
Ceppo
2024-07-08 17:00:02 UTC
Permalink
From LaTeX, this is quite simple, there's a package for that - as for pretty
much everything in the LaTeX world. Googling for just like 10 sec could have
given you this great guide: https://webpages.tuni.fi/latex/pdfa-guide.pdf
I did my research and found the document you linked. In fact it's what pointed
me to the pdfx LaTeX package, but I couldn't make it work. I acknowledge I
missed its reference to veraPDF, though.
gs -dQUIET -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1 -dCompressFonts=true -dSubsetFonts=true
-sFONTPATH=/usr/share/fonts/ -o <file name of output> <file name of input>
The output isn't accepted by veraPDF, either. I will try to understand
something more about ghostscript.


--
Ceppo
Jeffrey Walton
2024-07-08 22:10:01 UTC
Permalink
Post by Ceppo
From LaTeX, this is quite simple, there's a package for that - as for pretty
much everything in the LaTeX world. Googling for just like 10 sec could have
given you this great guide: https://webpages.tuni.fi/latex/pdfa-guide.pdf
I did my research and found the document you linked. In fact it's what pointed
me to the pdfx LaTeX package, but I couldn't make it work. I acknowledge I
missed its reference to veraPDF, though.
gs -dQUIET -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1 -dCompressFonts=true -dSubsetFonts=true
-sFONTPATH=/usr/share/fonts/ -o <file name of output> <file name of input>
The output isn't accepted by veraPDF, either. I will try to understand
something more about ghostscript.
Have a look at <https://stackoverflow.com/q/1659147>. It discusses
some of the finer points of PDF/A conversion in the comments, like
color spaces.

Jeff
Greg Marks
2024-07-03 21:50:01 UTC
Permalink
Post by Richard
Now, for just random PDFs, this is a bit more tricky, but you can do so
with ghostscript. Now, this sadly doesn't have such a great guide, but
something like this should do the trick, though that's only PDF/A-1 for all
I can tell. If your contractor needs a different version, you'll have to
gs -dQUIET -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1 -dCompressFonts=true -dSubsetFonts=true
-sFONTPATH=/usr/share/fonts/ -o <file name of output> <file name of input>
This does not seem to work. For example:

$ cd /tmp

$ wget -O old.pdf https://arxiv.org/pdf/2406.18499

$ gs -dQUIET -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dCompressFonts=true -dSubsetFonts=true -sFONTPATH=/usr/share/fonts/ -o new.pdf old.pdf

[Gives warnings:

GPL Ghostscript 10.00.0:

Use of -dUseCIEColor detected!
Since the release of version 9.11 of Ghostscript we recommend you do not set
-dUseCIEColor with the pdfwrite/ps2write device family.]

Uploading new.pdf to https://www.pdfforge.org/online/en/validate-pdfa
produces report "The file is not a valid PDF/A document" with these
details:

ISO 19005-1:2005
6.1.8
The object number and generation number shall be separated
by a single white-space character. The generation number
and obj keyword shall be separated by a single white-space
character. The object number and endobj keyword shall each be
preceded by an EOL marker. The obj and endobj keywords shall
each be followed by an EOL marker.

ISO 19005-1:2005
6.1.7
The stream keyword shall be followed either by a CARRIAGE RETURN
(0Dh) and LINE FEED (0Ah) character sequence or by a single
LINE FEED character. The endstream keyword shall be preceded
by an EOL marker

Repeating with the flag -dUseCIEColor removed prevents the Ghostscript
warnings but doesn't change the PDF/A validation result.
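
One thing that stands out in the command above is that it sets -dPDFACompatibilityPolicy but never passes -dPDFA, which the Ghostscript documentation describes as the switch that requests PDF/A output, together with a colour conversion strategy and a PDF/A definition file (PDFA_def.ps). A minimal sketch of building such a command, assuming a locally edited copy of PDFA_def.ps that points at an ICC profile on this machine (the function names, the file path and the choice of RGB are all assumptions, and passing validation is not guaranteed):

```python
import subprocess


def build_gs_pdfa_command(src, dst, pdfa_def="PDFA_def.ps", part=1):
    """Return a Ghostscript argument list for an attempted PDF/A conversion.

    pdfa_def is a hypothetical path to a copy of Ghostscript's PDFA_def.ps
    that has been edited to reference an ICC profile available locally.
    """
    return [
        "gs",
        f"-dPDFA={part}",                 # request PDF/A output (absent from the quoted command)
        "-dBATCH",
        "-dNOPAUSE",
        "-sColorConversionStrategy=RGB",  # assumption: an RGB output intent is acceptable
        "-sDEVICE=pdfwrite",
        "-dPDFACompatibilityPolicy=1",
        f"-sOutputFile={dst}",
        pdfa_def,                         # the definition file goes before the input document
        src,
    ]


def convert(src, dst, **kwargs):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_gs_pdfa_command(src, dst, **kwargs), check=True)
```

Something like convert("old.pdf", "new.pdf") would then run it. The result still has to go through a validator, since this only asks Ghostscript to try.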

Best regards,
Greg Marks
Max Nikulin
2024-07-04 02:30:01 UTC
Permalink
Post by Greg Marks
$gs -dQUIET -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dCompressFonts=true -dSubsetFonts=true -sFONTPATH=/usr/share/fonts/ -o new.pdf old.pdf
[...]
Post by Greg Marks
The object number and generation number shall be separated
by a single white-space character. The generation number
[...]
Post by Greg Marks
The stream keyword shall be followed either by a CARRIAGE RETURN
I expect that pdftk and qpdf have their own serializers. I have no idea
if they can transform a file to a PDF/A compliant document, but they
might use proper separators.

Perhaps LaTeX documents require some tuning (metadata blocks, etc.). If
you use pdflatex then I would try lualatex.
Richard
2024-07-04 08:20:02 UTC
Permalink
The first bit is just a warning, not an error. Of course, you could check
what has changed in v9.11 that makes this not recommended anymore. Maybe
they already handle it internally when you set -dPDFACompatibilityPolicy=1
and the old setting can interfere. But since the output of the validator
doesn't change, it probably just means that you don't need to specify this
anymore because they activate it themselves.

Speaking of the validator, those look more like warnings too and not like
deal breakers. In the end, only you know what your contractor expects of
you. And if they don't even bother inspecting the result, this will be
irrelevant. After all, the only reason PDF/A exists is for archiving
reasons. It pretty much just throws out all the proprietary clutter from
the PDF standard. The important thing is that fonts are embedded to always
be able to display them correctly, and that it's specified how images and
other media are embedded. If your contractor expects more of you, they
should pay for the appropriate software.

Richard

PS: this isn't really meant for this, but you could install Scribus and try
to import the PDF there. It also has a validator similar to Adobe's
Preflight. Maybe it can give you a more precise result. I'm not sure if it
even can output PDF/A, I only know that it does PDF/X, but maybe it can
even be used for better conversion to PDF/A. The last time I tried to
import a large PDF into Scribus it got kinda stuck, but it has evolved
since then and maybe it was a hardware limitation.
Post by Greg Marks
$gs -dQUIET -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1
-dCompressFonts=true -dSubsetFonts=true -sFONTPATH=/usr/share/fonts/ -o
new.pdf old.pdf
Use of -dUseCIEColor detected!
Since the release of version 9.11 of Ghostscript we recommend you do not set
-dUseCIEColor with the pdfwrite/ps2write device family.]
Uploading new.pdf to https://www.pdfforge.org/online/en/validate-pdfa
produces report "The file is not a valid PDF/A document" with these
ISO 19005-1:2005
6.1.8
The object number and generation number shall be separated
by a single white-space character. The generation number
and obj keyword shall be separated by a single white-space
character. The object number and endobj keyword shall each be
preceded by an EOL marker. The obj and endobj keywords shall
each be followed by an EOL marker.
ISO 19005-1:2005
6.1.7
The stream keyword shall be followed either by a CARRIAGE RETURN
(0Dh) and LINE FEED (0Ah) character sequence or by a single
LINE FEED character. The endstream keyword shall be preceded
by an EOL marker
Repeating with the flag -dUseCIEColor removed prevents the Ghostscript
warnings but doesn't change the PDF/A validation result.
Best regards,
Greg Marks
Ceppo
2024-07-08 16:50:01 UTC
Permalink
latex .../Nix.tex  .../Nix.dvi
dvips -o Nix.ps Nix.pdf
^^^^^^^
I guess here you meant Nix.dvi...
ps2pdf ... Nix.ps ... Nix.pdf
chmod 755 script
All of this has worked absolutely perfectly for many, many years; nothing else
was ever needed
However, the resulting PDF is not recognized as PDF/A by veraPDF. Have you
tested it with something else?


--
Ceppo
Jeffrey Walton
2024-07-08 21:30:02 UTC
Permalink
Post by Ceppo
I wrote a report with LaTeX, and afterwards discovered it must be
PDF/A-compliant - which wasn't. I found the pdfx LaTeX package and followed its
instructions, thus obtaining a file that should be PDF/A and pdfinfo identifies
as such, but my employer's upload form thinks isn't. Is pdfinfo reliable enough
that I can tell my employer his form is broken? If not, how can I make sure
that pdflatex's output is actually PDF/A-compliant?
The pdf-linter I use to verify a pdf document is qpdf,
<https://github.com/qpdf/qpdf>. It is available on most distributions,
including Debian, Fedora and Red Hat.

The command to check the document is `qpdf --check <doc>`.
Post by Ceppo
I will also probably have to upload under the same requirement some third-party
PDF, which is not PDF/A, without access to an editable version. Is there a way
to convert them to PDF/A? I know that converting from an editable version would
be the correct way for this, but I have no real way to get it.
qpdf may provide this functionality, but I have never used it. From
the project's description: "qpdf is a command-line tool and C++
library that performs content-preserving transformations on PDF files.
It supports linearization, encryption, and numerous other features. It
can also be used for splitting and merging files, creating PDF files
(but you have to supply all the content yourself), and inspecting
files for study or analysis."

Another tool I would look at is Ghostscript. It looks like it can
convert to PDF/A: <https://stackoverflow.com/a/9343820>.
Post by Ceppo
A requirement of any solution is that it doesn't rely on non-DFSG-compliant
software, including online conversion tools.
Jeff
Ceppo
2024-07-10 08:40:01 UTC
Permalink
Post by Jeffrey Walton
The pdf-linter I use to verify a pdf document is qpdf,
<https://github.com/qpdf/qpdf>. It is available on most distributions,
including Debian, Fedora and Red Hat.
The command to check the document is `qpdf --check <doc>`.
This command doesn't show me any info about PDF/A compliance. man says it
"merely checks that the PDF file is syntactically valid".
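
For what it's worth, a file declares itself PDF/A through the pdfaid:part and pdfaid:conformance properties in its XMP metadata, and that declaration is a separate thing from actually passing validation. A rough sketch of reading just the declaration, assuming the XMP packet is stored uncompressed (the function name is made up, and this is a plain byte scan, not a PDF parser):

```python
import re

# Matches both XMP spellings: <pdfaid:part>1</pdfaid:part> and pdfaid:part="1".
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def claimed_pdfa_level(data):
    """Return e.g. '1b' if the bytes carry a PDF/A identification, else None.

    This reports only what the file claims about itself; it says nothing
    about whether the file would pass a validator such as veraPDF.
    """
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conf = _CONF.search(data)
    if conf is not None:
        level += conf.group(1).decode("ascii").lower()
    return level
```

Calling claimed_pdfa_level(open("report.pdf", "rb").read()) on a hypothetical report.pdf shows the claim, which is useful mostly to see why a tool that only reads metadata and a full validator can disagree.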
Post by Jeffrey Walton
Post by Ceppo
I will also probably have to upload under the same requirement some
third-party PDF, which is not PDF/A, without access to an editable version.
Is there a way to convert them to PDF/A? I know that converting from an
editable version would be the correct way for this, but I have no real way
to get it.
qpdf may provide this functionality, but I have never used it.
[1] says PDF/A conversion is out of scope for the library. However, [2] pointed
me to ocrmypdf and this command produces a valid PDF/A-1b file:

ocrmypdf --output-type pdfa-1 --tesseract-timeout=0 --skip-text \
input.pdf output.pdf

Another comment pointed out this relies on ghostscript, but I couldn't build a
working gs command. I will try harder as soon as I have some free time. Anyway
I have my conversion tool now, and I'm happy with it.

As a short summary of this thread's outcome, I can:

- compile with `pdflatex` as usual
- convert to PDF/A with the `ocrmypdf` command above (probably not the
cleanest way, but it works)
- validate with veraPDF
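
The three steps above can be sketched as a small script. The ocrmypdf options are the ones from the command quoted earlier; the pdflatex option, the `verapdf` executable name, its `--flavour` option and the meaning of its exit status are assumptions that may differ between installations, and the function names are made up for illustration:

```python
import subprocess


def pdflatex_command(tex):
    """Compile as usual; nonstopmode only avoids interactive prompts."""
    return ["pdflatex", "-interaction=nonstopmode", tex]


def ocrmypdf_command(src, dst):
    """The ocrmypdf invocation from this thread, as an argument list."""
    return [
        "ocrmypdf",
        "--output-type", "pdfa-1",
        "--tesseract-timeout=0",
        "--skip-text",
        src,
        dst,
    ]


def verapdf_command(pdf, flavour="1b"):
    """Assumed veraPDF CLI call; adjust name and options to the local install."""
    return ["verapdf", "--flavour", flavour, pdf]


def build_convert_validate(tex, raw_pdf, pdfa_pdf):
    """Run all three steps; return True if veraPDF exits with status 0.

    Treating exit status 0 as "valid" is an assumption to check against
    the veraPDF documentation for the installed version.
    """
    subprocess.run(pdflatex_command(tex), check=True)
    subprocess.run(ocrmypdf_command(raw_pdf, pdfa_pdf), check=True)
    return subprocess.run(verapdf_command(pdfa_pdf)).returncode == 0
```

With hypothetical file names, build_convert_validate("report.tex", "report.pdf", "report-pdfa.pdf") would chain the steps and stop at the first one that fails.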

Thanks everyone for your help, it was highly appreciated even when it didn't
work as expected!


[1]: https://github.com/qpdf/qpdf/issues/85
[2]: https://github.com/qpdf/qpdf/issues/85#issuecomment-1278055568


--
Ceppo
Max Nikulin
2024-07-11 00:50:02 UTC
Permalink
but I couldn't build a working gs command.
[...]
[1]: https://github.com/qpdf/qpdf/issues/85
There is a link to gs arguments
<https://github.com/ocrmypdf/OCRmyPDF/blob/d9eb0ba7e/src/ocrmypdf/_exec/ghostscript.py#L199>