I experienced a bug yesterday after spending hours getting pages scanned in just right.
Two pages, when clicked in the page list on the left, would not display in the large view pane on the right. This was my first sign that something was wrong.
I thought something was 'up' with those two pages. The next step for me was to run OCR on everything and then save, so I ran OCR on everything but those two and began saving. While I was typing the date on the first document to save, something happened. Perhaps I touched the touchpad, or another window took focus. (I DID NOT CLICK SAVE.) Either way, focus left the date field before I had finished typing the date, and gscan2pdf segfaulted because the date contents were invalid!
(The date contents will always be invalid while typing, until the last character is typed.)
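To illustrate the behaviour I would expect when focus leaves a half-typed date, here is a minimal Python sketch. It is not gscan2pdf's code (gscan2pdf is Perl); the function names `parse_date_or_none` and `on_focus_out` and the `YYYY-MM-DD` format are assumptions made only for the illustration. The idea is that an incomplete date is treated as "not valid yet" and the previous value is kept, rather than letting the invalid text propagate.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_or_none(text: str) -> Optional[date]:
    """Return a date for a complete YYYY-MM-DD string, otherwise None."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d").date()
    except ValueError:
        # Partial input such as "2022-12" lands here instead of crashing.
        return None


def on_focus_out(text: str, last_valid: date) -> date:
    """Hypothetical focus-out handler: fall back to the last valid date."""
    parsed = parse_date_or_none(text)
    return parsed if parsed is not None else last_valid
```

With this approach the hard validation can wait until the user actually clicks Save, which is the only point where a complete date is required.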
When I reopened gscan2pdf, it checked the temp dir for failed sessions, and I selected the one offered. But it failed to recover at page 103. An error appeared: "Error importing page 103. Ignoring."
On the command line it shows:
INFO - New page written as /home/myusername/.local/tmp/gscan2pdf-xC6W/Ixy6FR7HDX.png (8E095C62-82BF-11ED-9C4D-C654078829EA)
INFO - New page filename /home/myusername/.local/tmp/gscan2pdf-xC6W/97bV9Da5Bn.png, format Portable Network Graphics
INFO - New page written as /home/myusername/.local/tmp/gscan2pdf-xC6W/Ybby3hfp71.png (903E20F8-82BF-11ED-9C4D-C654078829EA)
ERROR - Open file, Error importing page 103. Ignoring.
WARN - *** unhandled exception in callback:
*** Can't call method "hide" on an undefined value at /usr/bin/gscan2pdf line 2502.
*** ignoring at /usr/lib64/perl5/vendor_perl/Glib/Object/Introspection.pm line 67.
I examined the temp dir and found hundreds more image files than there were pages in that session. It looks like gscan2pdf leaves files in that dir forever, even if you delete the corresponding pages.
I thought perhaps one of the images was corrupt, but they all rendered thumbnails in Nautilus properly. None were zero bytes. All permissions were consistent. So why does it fail to import the page? More detail is needed in that error message. What specifically prevents gscan2pdf from importing that page?
HOW CAN I RECOVER CRASHED SESSIONS LIKE THIS?
This has happened several times now and every time it does it wastes over an hour of time to re-scan every page, do rotations, and correct ordering and rescans of scan errors, etc.
Please give me the contents of the temp dir so that I can investigate the error.
If you can reproduce the crash entering the date, I'll try and fix that, too.
Was gscan2pdf still usable after failing to import page 103?
Normally, gscan2pdf should only keep the files necessary for the current page, plus those in the undo and redo buffers. If you can show me a case where others are kept, too, I'll take a look.
I'm in the middle of a complete rewrite, which I hope will also eliminate problems like this.
If a crashed session cannot otherwise be recovered, then assuming the image files are still in the temp dir, they can be imported, but of course they will not be sorted.
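If it comes to importing the images by hand, one way to get a rough starting order is to list the files by modification time. The sketch below is only an illustration in Python: the function name `images_in_scan_order` is made up, and it assumes that the pages are PNG files and that file modification time roughly follows the order in which they were written. It will not reproduce any manual re-sorting, and pages that were rotated or otherwise rewritten later will appear later in the list.

```python
from pathlib import Path
from typing import List


def images_in_scan_order(temp_dir: str, pattern: str = "*.png") -> List[Path]:
    """Return image files in temp_dir, oldest modification time first."""
    files = [p for p in Path(temp_dir).glob(pattern) if p.is_file()]
    # Tie-break on the file name so the result is deterministic.
    return sorted(files, key=lambda p: (p.stat().st_mtime_ns, p.name))


if __name__ == "__main__":
    import sys

    # Usage: python list_pages.py /path/to/gscan2pdf-XXXX
    for path in images_in_scan_order(sys.argv[1]):
        print(path)
```

The printed list can then be used as the order in which to select files in the import dialog.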
I tend to work in batches of 20-30 pages at a time and don't see these problems.
The temp dir contains 8 GB of randomly ordered financial and personal legal documents. I have no reason to distrust you, but I feel uneasy about posting them publicly. For example, I typically shred the originals after verifying they were scanned and backed up properly.
Despite the error message saying that page 103 will be ignored (which I would be fine with), it doesn't ignore it. It halts the recovery. I tried saying a prayer and waiting overnight and it was still halted.
All images read without error by imagemagick 'identify'.
It would be nice if when it has an error importing a page, it prints the actual file name of the page, so I could delete it, or replace it with a copy of a good page to keep gscan2pdf working on the recovery. But as far as I can tell it only prints to stdout the filenames of the pages it was able to successfully recover. And I can't see where in the session file to correlate a page number with a file name.
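To make the request concrete, here is a minimal Python sketch of the recovery loop I have in mind. It is not how gscan2pdf is implemented; `import_pages`, `load_page`, and the `(page number, file name)` pairs are hypothetical names for the illustration. The point is that a failed page is logged with its file name and the underlying reason, and the loop then carries on with the next page.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("session-recovery-sketch")


def import_pages(
    entries: Iterable[Tuple[int, str]],
    load_page: Callable[[str], object],
) -> Tuple[List[object], List[Tuple[int, str, str]]]:
    """Load every page; log and skip failures instead of halting."""
    pages: List[object] = []
    failures: List[Tuple[int, str, str]] = []
    for number, filename in entries:
        try:
            pages.append(load_page(filename))
        except Exception as exc:  # deliberately broad: one bad page must not stop recovery
            logger.error(
                "Error importing page %d (%s): %s. Ignoring.", number, filename, exc
            )
            failures.append((number, filename, str(exc)))
    return pages, failures
```

Returning the list of failures alongside the recovered pages would also let the program show a summary at the end, so the user knows exactly which files to inspect or replace.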
Most of the time I spent getting these pages into gscan2pdf was spent on sorting. So to reimport them and re-sort all of them would take longer than re-scanning them.
My documents are often longer than 30 pages each, and some are single pages. Some are double-sided, some are single-sided. Keeping it to under 30 at a time might be doable if I could run multiple instances of gscan2pdf concurrently, so that while one was scanning I could be sorting or checking another, but I can't.
If the expectation is 30 pages or less, can that be conveyed to first-time users?
What should I look for in the temp dir to indicate the problem? Permissions are fine, and ImageMagick validated the files. What causes this error?
Last edit: nwdm 2022-12-23
I completely understand that you don't want to share the files, even with me privately. My guess is that the problem was not the image of page 103, but maybe the metadata/OCR output.
It is not that I have an expectation of 30 pages or less, only that is what my workflow is, and therefore what gets tested most.
I very much expect that the rewrite I am working on will improve the general stability and scalability.
Do you still have the crashed session available? i.e. if I give you some patches to firstly work out what broke, and then better deal with it, would you be able to test them?
Yes, I still have it. I wonder if OCR output is part of it too. Its general fuzziness would fit the random nature of this bug. I've run into it four times so far. I'd be happy to run some patches and report what I find.
Anything that increases the debug log verbosity would give us a better idea of what's going on. Maybe something to verify the start and successful completion of reading OCR data from the session file? Or of each "page object"?
I wish I was scanning in old books or something like that, but almost every page I scan has financial account numbers on it.
gscan2pdf is all Perl, right? So I imagine I should be able to apply patches in place without recompiling, and restore to stock with dnf reinstall.
I'm not against compilation either if that's needed. It would be far from the first time I've compiled something, but I usually prefer scripting languages.
And Merry Christmas, Jeffrey!
Last edit: nwdm 2022-12-25
Merry Christmas to you, too! Thanks.
Yes gscan2pdf is Perl.
Just for clarity, what is line 2502 of /usr/bin/gscan2pdf ?
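In case it helps with reporting that line, a small Python sketch for printing one line of a file with a little context follows. The helper name `show_lines` is made up for the illustration; only the standard `linecache` module is used.

```python
import linecache


def show_lines(path: str, line_number: int, context: int = 2) -> str:
    """Return line_number of path plus `context` lines either side, numbered."""
    first = max(1, line_number - context)
    out = []
    for n in range(first, line_number + context + 1):
        text = linecache.getline(path, n)  # returns "" past the end of the file
        if text:
            marker = ">" if n == line_number else " "
            out.append(f"{marker}{n:6d}  {text.rstrip()}")
    return "\n".join(out)


if __name__ == "__main__":
    print(show_lines("/usr/bin/gscan2pdf", 2502))
```

The line marked with `>` is the one named in the error message. The surrounding lines usually make it clear which object was undefined.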
Here's the first patch:
On my Debian system, the above file is installed to: /usr/share/perl5/Gscan2pdf/Document.pm but it may be different on yours.