gscan2pdf / Bugs / #410 Total loss of work after "Error importing page"

Jeffrey Ratcliffe - 2022-12-23

Please give me the contents of the temp dir so that I can investigate the error.

If you can reproduce the crash entering the date, I'll try and fix that, too.

Was gscan2pdf still usable after failing to import page 103?

Normally, gscan2pdf should only keep the files necessary for the current page, plus those in the undo and redo buffers. If you can show me a case where others are kept, too, I'll take a look.

I'm in the middle of a complete rewrite, which I hope will also eliminate problems like this.

If a crashed session cannot otherwise be recovered, then assuming the image files are still in the temp dir, they can be imported, but of course they will not be sorted.
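One way to cut down the re-sorting effort might be to list the leftover image files by modification time. This rests on an assumption that nothing in this thread confirms: that the files were written in roughly scan order and not modified since. A minimal Perl sketch follows; the helper name `files_by_mtime` and the list of extensions are hypothetical, and the directory is whatever your session temp dir is:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the image files in $dir, oldest first.
# Assumes modification time roughly tracks scan order, which any later
# edit to a page (rotate, crop, etc.) would break.
sub files_by_mtime {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @files = map { File::Spec->catfile( $dir, $_ ) }
      grep { /\.(?:pnm|png|tif|tiff|jpg|jpeg)\z/i } readdir $dh;
    closedir $dh;

    # stat each file once, then sort by mtime, falling back to the
    # name for files with identical timestamps.
    return map { $_->[1] }
      sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] }
      map { [ ( stat $_ )[9], $_ ] } @files;
}

# Usage: perl list_by_mtime.pl /path/to/temp/dir
if (@ARGV) {
    print "$_\n" for files_by_mtime( $ARGV[0] );
}
```

The resulting list is only a starting point for a manual re-import; it does not recover the page order stored in the session file.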

I tend to work in batches of 20-30 pages at a time and don't see these problems.

nwdm - 2022-12-23

The temp dir contains 8 GB of randomly ordered financial and personal legal documents. I have no reason to distrust you, but I feel uneasy about posting them publicly. For example, I typically shred the originals after verifying they were scanned and backed up properly.

Despite the error message saying that page 103 will be ignored (which I would be fine with), it doesn't ignore it. It halts the recovery. I tried saying a prayer and waiting overnight and it was still halted.

ImageMagick's 'identify' reads all of the images without error.

It would be nice if when it has an error importing a page, it prints the actual file name of the page, so I could delete it, or replace it with a copy of a good page to keep gscan2pdf working on the recovery. But as far as I can tell it only prints to stdout the filenames of the pages it was able to successfully recover. And I can't see where in the session file to correlate a page number with a file name.

Most of the time I spent getting these pages into gscan2pdf was spent on sorting. So to reimport them and re-sort all of them would take longer than re-scanning them.

My documents are often longer than 30 pages in one document, and some are single pages. Some are double-sided, some are single-sided. Keeping it to under 30 at a time might be doable if I could run multiple instances of gscan2pdf concurrently, so as one was scanning I could be sorting or checking another one, but I can't.

If the expectation is 30 pages or less, can that be conveyed to first time users?

What should I look for in the temp dir to indicate the problem? Permissions are fine, and ImageMagick validated the files. What causes this error?

Last edit: nwdm 2022-12-23

Jeffrey Ratcliffe - 2022-12-24

I completely understand that you don't want to share the files, even with me privately. My guess is that the problem was not the image of page 103, but maybe the metadata/OCR output.

It is not that I have an expectation of 30 pages or less, only that this is my workflow, and therefore what gets tested most.

I very much expect that the rewrite I am working on will improve the general stability and scalability.

Do you still have the crashed session available? i.e. if I give you some patches to firstly work out what broke, and then better deal with it, would you be able to test them?

nwdm - 2022-12-25

Yes, I still have it. I wonder if OCR output is part of it too. Its general fuzziness would support the random nature of this bug. I've run into it four times so far. I'd be happy to run some patches and report what I find.

Anything that increases the debug log verbosity would give us a better idea of what's going on. Maybe something to verify the start and successful completion of reading OCR data from the session file? Or of each "page object"?
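To make the request concrete, here is a rough Perl sketch of per-page start/finish logging around an import step. It is not gscan2pdf's actual code: `import_pages`, the `$import` callback and the `$log` callback are all hypothetical names, and it assumes the session data maps page numbers to entries with a `filename` key. It uses core `eval`/`$@` for error trapping, which may differ from what Document.pm does.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a per-page import loop.
# $session: hashref of page number => { filename => ... } (assumed layout)
# $import:  code ref that loads one page and dies on failure
# $log:     code ref that receives one log message
sub import_pages {
    my ( $session, $import, $log ) = @_;
    my @failed;
    for my $pagenum ( sort { $a <=> $b } keys %{$session} ) {
        my $filename = $session->{$pagenum}{filename};
        $log->("Page $pagenum: start importing $filename");
        if ( eval { $import->( $pagenum, $session->{$pagenum} ); 1 } ) {
            $log->("Page $pagenum: imported OK");
        }
        else {
            my $err = $@ || 'unknown error';
            chomp $err;
            $log->("Page $pagenum: FAILED ($filename): $err");
            push @failed, $pagenum;    # record it and carry on with the rest
        }
    }
    return @failed;
}

# Usage with made-up data: page 2 fails, pages 1 and 3 still get imported.
my %session = map { $_ => { filename => "page$_.pnm" } } 1 .. 3;
my @failed = import_pages(
    \%session,
    sub { die "bad OCR data\n" if $_[0] == 2 },
    sub { print {*STDERR} "$_[0]\n" },
);
```

With logging like this, a failure names both the page number and the file, and a "start" line with no matching "OK" or "FAILED" line would point at a hang rather than an exception.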

I wish I was scanning in old books or something like that, but almost every page I scan has financial account numbers on it.

gscan2pdf is all Perl, right? So I should be able to apply patches in place without recompiling, I imagine, and restore to stock with dnf reinstall.

I'm not against compilation either if that's needed. It would be far from the first time I've compiled something, but I usually prefer scripting languages.

And Merry Christmas, Jeffrey!

Last edit: nwdm 2022-12-25

Merry Christmas to you, too! Thanks.

Yes, gscan2pdf is Perl.

Just for clarity, what is line 2502 of /usr/bin/gscan2pdf?

Here's the first patch:

diff --git a/lib/Gscan2pdf/Document.pm b/lib/Gscan2pdf/Document.pm
index 76cd0bcd..0e9df73a 100644
--- a/lib/Gscan2pdf/Document.pm
+++ b/lib/Gscan2pdf/Document.pm
@@ -1961,9 +1961,12 @@ sub open_session {
             if ( $options{error_callback} ) {
                 $options{error_callback}->(
                     undef, 'Open file',
-                    sprintf __('Error importing page %d. Ignoring.'), $pagenum
+                    sprintf __('Error importing page %d. Ignoring: %s'),
+                    $pagenum, $_
                 );
             }
+            $logger->info(
+                "Page $pagenum, filename $session{$pagenum}{filename}, ");
         };
     }
     if ( defined $self->{row_changed_signal} ) {

On my Debian system, the above file is installed to: /usr/share/perl5/Gscan2pdf/Document.pm but it may be different on yours.
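If you are not sure where your distribution put the file, one option is to search Perl's module path for it. A small sketch, assuming gscan2pdf's modules are installed somewhere on the default `@INC` of the system Perl (the helper name `find_module` is made up for this example):

```perl
use strict;
use warnings;
use File::Spec;

# Return every copy of a module file found under the given directories.
sub find_module {
    my ( $relpath, @dirs ) = @_;
    return grep { -f $_ }
      map { File::Spec->catfile( $_, $relpath ) }
      grep { !ref $_ } @dirs;    # @INC may contain code refs; skip them
}

# Print each location of Document.pm that the system Perl can see.
print "$_\n" for find_module( 'Gscan2pdf/Document.pm', @INC );
```

If this prints nothing, the module is probably installed outside `@INC`, and your package manager's file listing for gscan2pdf is the next place to look.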
