Discussion:
PDF Editor for Debian
Richard
2024-06-24 05:40:01 UTC
Hello,
this very much depends on what you are expecting it to do. In general, PDFs
are only meant to be viewed - and printed - they were never meant for
anything else. Even filling out forms is just a bad hack job through
JavaScript. That being said, there is software with PDF editing
capabilities on Linux, though it's much more basic than what you'll find on
Windows.

If you want to just make comments, Okular has some neat capabilities,
including signing PDFs. For handwritten notes on a PDF, Xournal++ is a
great tool. If you just want to reorder, rotate, delete or
add pages, there are some tools like PDFSam. There's also the quite powerful
Ghostscript, though that's CLI only. At least I don't know of any GUI. For
more "editing" features, LibreOffice can import PDFs, but in my experience
it struggles quite a lot with layout. OnlyOffice also has that capability,
but I never used it. Also, Inkscape can do that. It can also import
multiple pages at once, but I recommend only importing single pages,
otherwise Inkscape quickly reaches its limits. It has two import modes, an
internal one and poppler. Use the internal one and see if that works for
you. It's easier to edit text boxes in there, but it's quite likely it
won't be able to use the right font, which will break the whole look. The
poppler import can preserve that, but that's because letters aren't
imported as letters but as paths. So you can't just edit text, you'd have
to delete letters and try to insert text in a way that looks decent.

Other than that, there are a few commercial tools, but they are not that
well known. So your best bet is to try to never have to edit a PDF at
all. Always try to get hold of the original file the PDF was generated
from. Even if it's a docx - Microsoft's infamous wannabe-open-source format
that just nobody can handle properly, including their own software - it
will most likely be handled better by the software you use than a PDF made
editable.

Best
Richard
Hello.
Is there a PDF editor that would work with Debian 12?
Thanks.
--
*ArbolOne.ca* Using Fire Fox and Thunderbird. ArbolOne is composed of
students and volunteers dedicated to providing free services to charitable
organizations. ArbolOne on Java Development is in progress
Klaus Singvogel
2024-06-24 06:40:01 UTC
Is there a PDF editor that would work with Debian 12?
It depends on what you understand by "edit", and whether you expect to use Free and Open Source Software (FOSS) or not.

If you just want to fill out forms (JavaScript), then I'd recommend the FOSS programs Chromium (not Google Chrome) or Evince.

If you want to edit the PDF itself, like moving lines, editing text or rearranging elements (like pictures), you can either use LibreOffice (but for me it wasn't quite usable), or buy a license for a commercial program.

For commercial programs, I've had good experiences with Master PDF Editor. But I'd also give Qoppa a try, because a lot of people say that Qoppa is the better choice.
My experience with Master PDF Editor is this: I've been running it since 2019 with the same purchased license. In that time my Debian went from Jessie via Stretch and Buster to Bullseye (now), and it's still running. I have to admit that I needed to reinstall the DEB package every now and then (not after a Debian upgrade), due to issues with Qt5. So it was a good idea to download the DEB package after I bought it, and to keep it until today. So far I can say that all required dependencies are included in the downloaded DEB package.

Best regards,
Klaus.
--
Klaus Singvogel
GnuPG-Key-ID: 1024R/5068792D 1994-06-27
Timothy M Butterworth
2024-06-24 08:30:01 UTC
Hello.
Is there a PDF editor that would work with Debian 12?
I use Master PDF Editor. It works great.
https://code-industry.net/free-pdf-editor/

--
⢀⣎⠟⠻⢶⣊⠀
⣟⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org/
⠈⠳⣄⠀⠀
Detlef Vollmann
2024-06-25 18:20:01 UTC
On Mon, 24 Jun 2024 04:26:47 -0400
Post by Timothy M Butterworth
I use Master PDF Editor. It works great.
https://code-industry.net/free-pdf-editor/
It looks nice.
But being a closed source SW from Russia I'd be careful to run
it outside of an isolated VM (which is actually true for most
closed source SW).

Detlef
t***@tuxteam.de
2024-06-25 18:30:02 UTC
Post by Detlef Vollmann
On Mon, 24 Jun 2024 04:26:47 -0400
Post by Timothy M Butterworth
I use Master PDF Editor. It works great.
https://code-industry.net/free-pdf-editor/
It looks nice.
But being a closed source SW from Russia I'd be careful to run
it outside of an isolated VM (which is actually true for most
closed source SW).
Yes, not just from Russia. Here's some old, old story which might
amuse you :-)

https://en.wikipedia.org/wiki/Crypto_AG

Cheers
--
tomás
jeremy ardley
2024-06-24 08:40:01 UTC
So your best bet is just to try to never have to edit a PDF at all.
Always try to get a hand on the original file the PDF was delivered
from. Even if it's a docx
In my view, pdf and docx should be regarded as publication formats for
content managed in a professional content management system. HTML and
odt and postscript also fall into the category of publication formats.

Word documents suffer because back in the dim ages of the late 1980s
Microsoft decided to merge content managing with content editing with
content publishing and abysmally failed at all of them.

However, the easiest way to edit a pdf is to convert it to Word using, say,
https://pdf2docx.com/. There are also plenty of ways in Linux to do that,
but they all take time and effort to make work.
Richard
2024-06-24 14:30:01 UTC
Since it's quite OT, starting a new thread for this.

I would most certainly never call formats like ooxml or odf “publishing formats”; they are content creation or editing formats. From a publishing format I expect that it shows the content as intended, which admittedly neither of them can do 100 %; the probability of messing up just isn't that big. Either you want a fixed format, e.g. for printing, which is what you get with the likes of PDF, PS, SVG or the various raster graphics formats. Or you want your content to adapt to the viewer in a foreseeable way, i.e. HTML, usually with the help of CSS and, worst case, JS. Sure, ooxml and odf want to be the former, but due to technical caveats that's not necessarily possible. With ooxml, you have several incompatible versions that you can't easily tell apart, and identical display is often impossible because proprietary fonts are used but not embedded by default. On top of that, it is an abomination of a format spanning around 5500 pages plus another 1000 pages for its transitional mode, and it was only standardized by world-wide corruption. ODF usually does things way better, but support in software beyond LibreOffice is still often lacking, though that's not their fault, since their format is much simpler, being documented in just around 1000 pages. But still, it doesn't communicate fixed positions; as far as I can tell it doesn't imply them either by telling the software explicitly how to render fonts so that the result always looks identical; and it won't embed fonts (or the needed subset) by default. So it's also kind of not fulfilling the needs.

And no, editing a PDF as docx isn't the easiest, not to mention best, way to edit a PDF, especially not with some ominous web tool. Maybe someone can write an AI for that, but even then it's most likely much easier to just go the OCR route to derive content and extract layout from the document. I don't know how strictly PDF defines things; I only ever hear that PDF is at least as much of an unholy mess as ooxml (which was supposed to be fixed by PDF 2.0, which still pretty much no software creates by default, even though most software seems to support it) and that writing tools like Ghostscript or Poppler is a royal pain. LaTeX can probably only circumvent this because it just has to create a PDF from a predefined set of functions, and be able to embed other PDFs into these PDFs. But the most reliable way to edit PDFs that I know of, as I have little to no experience with most commercial solutions, is Inkscape. If the internal importer succeeds, you get great text editing features, which obviously can't rival office suites, but at least you aren't almost guaranteed to mess up the whole layout.

Richard
jeremy ardley
2024-06-24 21:30:01 UTC
Post by Richard
[...] I would most certainly never call formats like ooxml or odf “publishing formats”, they are content creation or editing formats. [...]
I triggered this by saying docx and pdf are publishing formats. In the
world of professional content management that is exactly so. You have
your content in a neutral format in a version controlled storage system,
and you have choice to publish in pdf or docx or html or epub or
whatever. What you don't do is use these output formats as your primary
content.

Examples relevant to Debian include package documentation such as man
pages, markdown, doxygen, docbook, LaTeX.

In fact I can't think of any project in Debian that has pdf or docx as
the primary source of documentation.

Tools that can do this transform include pandoc, Visual Studio Code,
ghostwriter, marktext and many many more.
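
As a rough sketch of that "one source, several outputs" workflow with pandoc: the function name publish_all and the file name handbook.md are made up for illustration, and the PDF step additionally assumes that a LaTeX engine is installed.

```shell
# Hypothetical helper: publish one Markdown source in several output formats.
# Assumes pandoc is installed; the PDF step also needs a LaTeX engine.
publish_all() {
    src="$1"
    if [ ! -f "$src" ]; then
        echo "publish_all: no such file: $src" >&2
        return 1
    fi
    base="${src%.*}"   # file name without its extension

    pandoc "$src" -s -o "$base.html" || return 1   # standalone HTML page
    pandoc "$src" -o "$base.docx"    || return 1   # Word document
    pandoc "$src" -o "$base.epub"    || return 1   # e-book
    pandoc "$src" -o "$base.pdf"                   # PDF, rendered via LaTeX
}

# Usage: publish_all handbook.md
```

The source file stays under version control; the generated files are throwaway outputs that can be rebuilt at any time.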
Post by Richard
[...] it's most likely much easier to just go the OCR route to derive content and extract layout from the document. [...]
In my most recent experience, OCR of pdf documents is quite difficult if
the layout is significant such as in bank statements. There are various
tools to assist in extracting the content but it's quite marginal.

On the other hand, give a screenshot of a bank statement to the 'ai'
GPT4 and ask it to extract all transactions in csv format and it is done
perfectly.

On a sidenote, PDF is basically Postscript on steroids. Its entire
purpose is to describe how content is to be placed on a printed page. On
a side-side note Postscript is actually a programming language with
specialty in text layout but quite capable of doing significant
computation activities - so long as your output eventually gets rendered
on a page.
Russell L. Harris
2024-06-24 23:00:01 UTC
Someone gave me an old SCEPTRE display with a screen 11.5 inch by 22
inch. I never before saw the usefulness of a wide screen.

A reader such as Atril can take advantage of the wide screen, allowing
me to zoom in until the type size is comfortable, without the need to
scroll left and right to read each line.

RLH
Richard
2024-06-25 08:50:01 UTC
[...]You have your content in a neutral format [...]
ooxml is far from "neutral"...
What you don't do is use these output formats as your primary content.
Obviously not. That's why they are publishing formats, as in you send that in to be published, not to be further edited. That's why formats like odf and ooxml are content creation and editing formats, not publishing formats.
Examples relevant to debian include package documentation such as man pages, markdown, doxygen, docbook, latex.
That's just the same category as HTML, so that remark isn't adding anything to the discussion.
In my most recent experience, OCR of pdf documents is quite difficult if the layout is significant such as in bank statements. There are various tools to assist in extracting the content but it's quite marginal.
Depends on the software you use. As far as I know, ABBYY has very capable OCR software (I think it's called FineReader) that can handle various layouts and fonts that are difficult to read, as in very old. That was already the case about a decade ago and I doubt the software has gotten any worse. But of course it's not available on Linux.
On the other hand, give a screenshot of a bank statement to the 'ai' GPT4 and ask it to extract all transactions in csv format and it is done perfectly.
As I said, AI can help with that; ABBYY is using it too. The only question is whether locally run AI can do that too, as everything else is a guarantee for breaching data protection laws.
On a sidenote, PDF is basically Postscript on steroids. Its entire purpose is to describe how content is to be placed on a printed page. On a side-side note Postscript is actually a programming language with specialty in text layout but quite capable of doing significant computation activities - so long as your output eventually gets rendered on a page.
Never said anything contrary to that. Just that it's a format that is very difficult to handle, because many things weren't defined prior to PDF 2.0. You had a predefined feature set, but nobody told you how to implement it, so chances were high that things wouldn't work as intended with every reader. But it seems the community of programmers for PDF readers found common ground long before PDF 2.0 was a thing, so at least the standardized things would work everywhere.
Richard Owlett
2024-06-24 17:20:01 UTC
Post by Richard
Hello,
this very much depends on what you are expecting it to do. In general, PDFs
are only meant to be viewed - and printed - they were never meant for
anything else. ...
... only meant to be viewed by those with *NORMAL* vision ...
I'm attempting to read a USDA document.[1]
The printed version of this document is marginally readable.

Tools such as "Atril Document Viewer" provide selected magnification.
For this particular document and monitor, 150% is comfortable. Requires
re-positioning the viewpoint 500 to 600 times to read document.

For _this_ document, Atril can select all the text on a page in a manner
that can be pasted in a "reasonable" manner to a Pluma document.

It will:
a. ignore actual graphics.
b. put title/headings/??? on a separate line.
c. all text between full page-width title/headings/??? will be
treated as a logical unit.
It will not:
1. put a blank line between paragraphs.
2. put a blank line above/below lines containing title/headings/???.
3. identify superscripts in some manner.

All this suggests that it should be able to extract text from a PDF and
create an HTML document, likely using only <p>, <br>, <sup>, and <li> in
its <body>.
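
One possible starting point is pdftohtml from the poppler-utils package. This is only a sketch: the function name pdf_to_simple_html is made up, and how well paragraphs, headings and superscripts survive depends on the particular PDF.

```shell
# Hypothetical helper: turn a text-based PDF into plain HTML.
# Assumes pdftohtml from the poppler-utils package is installed.
pdf_to_simple_html() {
    pdf="$1"
    if [ ! -f "$pdf" ]; then
        echo "pdf_to_simple_html: no such file: $pdf" >&2
        return 1
    fi
    out="${pdf%.pdf}"   # output name without the .pdf extension

    # -s         one HTML document covering all pages
    # -i         ignore images
    # -noframes  no frame set, just a single page
    pdftohtml -s -i -noframes "$pdf" "$out"
}

# Usage: pdf_to_simple_html TFP2021.pdf   # writes HTML output next to the PDF
```

The resulting HTML can then be opened in a browser, where the text reflows to the window width at any zoom level.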


[1]
https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf
_Thrifty Food Plan, 2021_
Food and Nutrition Service
August 2021
FNS-916
Karen Lewellen
2024-06-24 17:30:01 UTC
Good afternoon.
I am providing another option that might help here.
RoboBraille,

www.robobraille.org
provides services, free of charge, that will convert PDF files to a
number of different formats, including .html.
They provide audio and mobi, and convert epub files too... but I digress.
As a test, consider sending your file to
convert at robobraille.org
(written correctly, of course).
In the subject line put html,
leave the body blank, and attach the file.
See if the .html file returned meets your needs.
Best,
Karen
Nicolas George
2024-06-24 17:30:01 UTC
Post by Karen Lewellen
[...] www.robobraille.org Provides services, free of charge, that will convert pdf files to a number of different formats, including .html [...]
Interesting.

Do you know how they fare with math? I mean real, non-trivial formulas
produced by LaTeX like you would find in
https://arxiv.org/abs/1803.05929 ?

(I know, I could test. I will if you do not know the answer.)

Regards,
--
Nicolas George
Richard
2024-06-24 19:00:01 UTC
I wouldn't say PDFs are bad for visually impaired users. In fact, as bitmap
fonts are thankfully a thing of the past for almost everywhere, you can
zoom any document to your heart's desire. Though sometimes you need some
tricks, e.g. Evince is configured to only use 50 MB of memory by default
for caching, vastly limiting zoom capabilities. So you'll have to dig into
dconf to change that.
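
A minimal sketch of that change from a terminal, assuming the limit is the GSettings key org.gnome.Evince page-cache-size (value in MiB); the function name set_evince_cache is made up for illustration.

```shell
# Hypothetical helper: raise Evince's page cache limit (value in MiB).
# Assumes the GSettings key org.gnome.Evince page-cache-size.
set_evince_cache() {
    size="$1"
    case "$size" in
        ''|*[!0-9]*)
            echo "set_evince_cache: size must be a whole number of MiB" >&2
            return 1
            ;;
    esac
    gsettings set org.gnome.Evince page-cache-size "$size"
}

# Usage: set_evince_cache 512
#        gsettings get org.gnome.Evince page-cache-size   # check the new value
```

A larger cache lets Evince keep bigger rendered pages around, which is what high zoom levels need.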

What you are looking for is a way to reflow text, but as a fixed-layout
format, PDF is just not meant for that. Not even the PDF/UA standard [1]
requires this; it only lays the ground rules for screen readers.
Supposedly the Swiss-made "VIP PDF-Reader" was able to help, yet it seems
to have been abandoned, as there don't seem to be any download options
anymore. And other than that, PDF readers with that capability are very
rare on any platform. No idea if anybody besides Adobe is doing that,
because PDF is such a terribly complicated format.

In theory, this should all be doable with Tesseract, as it already does the
OCR part. It's just that nobody has bothered yet to support such use cases and
an output format that can handle more than just text.

Best
Richard

[1]: https://en.wikipedia.org/wiki/PDF/UA
Franco Martelli
2024-06-26 15:10:01 UTC
Hello.
Is there a PDF editor that would work with Debian 12?
Some time ago I used qpdf to delete some pages from a PDF. For a quick
description:

~$ apt show qpdf

In the manual there are some command examples. I used these commands to
edit a PDF:

- To delete the last two pages of a pdf:

~$ qpdf 1.pdf --pages . 1-r3 -- test.pdf

- To merge two .pdf files:

~$ qpdf --empty --pages 1.pdf 2.pdf -- test.pdf

If you are interested in qpdf, once installed, read the
/usr/share/doc/qpdf/README-doc.txt file for a list of URLs where to find
documentation.
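
Two more variations on the same page-selection syntax, as a hedged sketch: the function names are made up, and the page numbers in the usage lines are only examples.

```shell
# Hypothetical helpers around qpdf; assumes qpdf is installed.

# Keep only pages FIRST to LAST of a PDF.
pdf_extract_range() {
    in="$1"; first="$2"; last="$3"; out="$4"
    if [ ! -f "$in" ]; then
        echo "pdf_extract_range: no such file: $in" >&2
        return 1
    fi
    qpdf "$in" --pages . "$first-$last" -- "$out"
}

# Rotate the given page range by 90 degrees clockwise.
pdf_rotate_right() {
    in="$1"; range="$2"; out="$3"
    if [ ! -f "$in" ]; then
        echo "pdf_rotate_right: no such file: $in" >&2
        return 1
    fi
    qpdf "$in" --rotate=+90:"$range" "$out"
}

# Usage: pdf_extract_range 1.pdf 2 5 pages2-5.pdf
#        pdf_rotate_right 1.pdf 1-3 rotated.pdf
```

As in the commands above, the "." after --pages refers back to the input file, so it does not have to be named twice.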

Cheers,
--
Franco Martelli
Richard
2024-06-26 21:10:01 UTC
qpdf is good for e.g. removing password protection - given you know the
password. But I kind of doubt that's what's meant by "editor". And quite
frankly, you can do most of what qpdf does more comfortably with tools like
PDFSam or PDF Arranger. The latter even lets you crop pages or change the
document title (saved inside the PDF). If you want a reason to go CLI,
that's definitely Ghostscript. It can compress (losslessly), decompress,
resize images and pages, make files conform to various PDF standards - to my
knowledge it's pretty much the only free piece of software that will write PDF
2.0 compatible files - merge files, embed fonts/font subsets or convert
them to outlines, convert to images... and that's far from a complete list.
Of course it's quite complex, but there are many pages out there that will
tell you how to achieve what. I doubt there's a single program as capable
as Ghostscript - maybe with the exception of Acrobat Pro.
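
To give an idea of what such invocations look like, here is a small sketch of two of these tasks. The function names are made up, the options shown are just one common combination, and note that the /ebook preset downsamples images, so it is a lossy shrink rather than the lossless compression mentioned above.

```shell
# Hypothetical helpers; assume Ghostscript (gs) is installed.

# Rewrite a PDF with the /ebook preset (downsamples images, so it is lossy).
gs_shrink() {
    in="$1"; out="$2"
    if [ ! -f "$in" ]; then
        echo "gs_shrink: no such file: $in" >&2
        return 1
    fi
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook \
       -sOutputFile="$out" "$in"
}

# Render every page to a 150 dpi PNG (PREFIX-001.png, PREFIX-002.png, ...).
gs_to_png() {
    in="$1"; prefix="$2"
    if [ ! -f "$in" ]; then
        echo "gs_to_png: no such file: $in" >&2
        return 1
    fi
    gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r150 \
       -sOutputFile="$prefix-%03d.png" "$in"
}

# Usage: gs_shrink big.pdf small.pdf
#        gs_to_png big.pdf page
```

Merging works along the same lines: pass several input files to the pdfwrite device and they end up in one output file.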

Richard
