Recent changes to support-requests

PDFBox jar file as Reference for MS Access VBA project?

Tony Macelli — Tue, 29 Nov 2022 18:40:10 -0000

Intending to read and work with my PDF files through VBA API calls to the PDFBox library, I downloaded a PDFBox jar file from https://pdfbox.apache.org/download.cgi, placed it in the Program Files folder on my PC (Win 11 Home), opened an MS Access VBA database (part of MS Office 365), and in the VBA interface used Tools | References... to browse to the jar file. An error resulted: "Can't add a reference to the specified file."

I tried this three times, once with each of the following files: pdfbox-2.0.1.jar, pdfbox-2.0.16.jar, and pdfbox-app-3.0.0-alpha3.jar. In each case the resulting error was the same.

What am I doing wrong, and how can I achieve my aim? Thanks.

can it be used with php

Anonymous — Wed, 07 Oct 2009 08:10:57 -0000

Hi,

I am desperately trying to extract plain text from PDF documents. I hacked together some code, but it can only process 80% of my PDF file collection.
Can PDFBox be of any help to me?

can it be used with php

Anonymous — Wed, 07 Oct 2009 07:24:35 -0000

Hi,

I am desperately trying to extract plain text from PDF documents. I hacked together some code, but it can only process 80% of my PDF file collection.
Can PDFBox be of any help to me?

Extracting subscript char - Issue in some pdf

eclipse79 — Tue, 05 May 2009 09:47:22 -0000

Hello,
I'm trying PDFBox 0.7.3 to extract text from PDF files, but I have noticed a problem with subscript characters. The issue occurs in most of my PDFs (not all). The word "CO2", where the 2 is a subscript character, appears very often. For some files, the extracted text has a CRLF before and after the "2".
These are some examples:

Inoltre, utilizzando unicamente combustibili fossili, il comparto non ha la possibilità di ridurre le
emissioni di CO
2
.

- la riduzione dell?impronta CO
2
complessiva,

Can anybody help me?
Thank you
Eclipse79
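
A possible post-processing workaround, as a rough sketch only: the helper below does not change how PDFBox extracts text; it just rejoins a lone one- or two-digit line onto the preceding word after extraction. The class name SubscriptRejoiner and the method rejoin are hypothetical, and the rule (a letter, a line break, a short digit-only line, a line break) is an assumption based on the two samples above. It can misfire on text where a lone number legitimately sits on its own line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: cleans up already-extracted text.
class SubscriptRejoiner {
    // A letter, a line break, a line holding only 1-2 digits, a line break,
    // then the first non-whitespace character of the following line.
    private static final Pattern LONE_DIGITS =
            Pattern.compile("(\\p{L})\\R(\\d{1,2})\\R(\\S)");

    static String rejoin(String text) {
        Matcher m = LONE_DIGITS.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String next = m.group(3);
            // Punctuation attaches directly; anything else gets a separating space.
            String sep = next.matches("\\p{Punct}") ? "" : " ";
            String joined = m.group(1) + m.group(2) + sep + next;
            m.appendReplacement(sb, Matcher.quoteReplacement(joined));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rejoin("emissioni di CO\r\n2\r\n.\r\n"));
        System.out.println(rejoin("impronta CO\n2\ncomplessiva,"));
    }
}
```

The text passed to rejoin would be the String returned by the text stripper; the original line breaks elsewhere in the text are left untouched.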

PDFBox has moved to Apache

Ben Litchfield — Wed, 23 Jul 2008 00:12:33 -0000

All new bugs should be posted on the Apache PDFBox page.

https://issues.apache.org/jira/browse/PDFBOX

or visit the current Apache PDFBox project page

http://incubator.apache.org/projects/pdfbox.html

ClassCastException issue when extracting graphics

Anonymous — Tue, 17 Jun 2008 08:28:03 -0000

Hello

I am evaluating PDFBox 7.0.13 to extract images out of a bunch of PDF files. These PDF files are all scanned documents. The graphics will then be passed to an OCR program to extract the text.
During the execution, about 15% of the documents fail with 2 types of errors:
-------------------------------------------------
java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:501)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:354)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:128)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
Failed to process - reason: Failed to parse file
-------------------------------------------------
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:154)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:166)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
-------------------------------------------------
My problem is that these documents are classified, so I cannot submit a test case.
Basically, I have 2 questions:
1. Since these problems always occur at the same location, can you identify the problem without a test case?
2. Does the CVS version (7.0.14) contain a fix for these problems?

Best regards

JP
dev@softpark.ws

embedded tif extraction from pdf

Anonymous — Tue, 11 Mar 2008 13:15:33 -0000

After extracting the TIFF image from the PDF file (see attachment ae90.pdf), the "Photometric Interpretation" of the TIFF image is changed (white pixels are black, black pixels are white). I tried PDFBox-0.7.2.jar, PDFBox-0.7.3.jar, and finally PDFBox-0.7.4-dev-20080306.jar.

Thank you for your help

christian

cnczech@web.de
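
Until the extraction itself is corrected, one possible stopgap is to flip the pixels back after extraction. The sketch below is an assumption-laden illustration, not PDFBox code: the class name BilevelInverter and the method invert are hypothetical, it uses only the standard java.awt.image classes, and it assumes the extracted image has already been loaded into a BufferedImage by some other means. Reading and writing the TIFF file is left out.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox: swaps black and white pixels.
class BilevelInverter {
    static BufferedImage invert(BufferedImage src) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                // Invert the colour channels and force full opacity.
                out.setRGB(x, y, 0xFF000000 | (~rgb & 0x00FFFFFF));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2x1 test image: left pixel white, right pixel black (the default).
        BufferedImage src = new BufferedImage(2, 1, BufferedImage.TYPE_BYTE_BINARY);
        src.setRGB(0, 0, 0xFFFFFFFF);
        BufferedImage flipped = invert(src);
        System.out.println(Integer.toHexString(flipped.getRGB(0, 0)));
        System.out.println(Integer.toHexString(flipped.getRGB(1, 0)));
    }
}
```

This only masks the symptom: it would wrongly invert images that were extracted correctly, so it would need to be applied selectively.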

PDFTextStripper not handling some Japanese

sflaumen — Thu, 29 Nov 2007 15:33:43 -0000

Using this code sequence:

PDDocument document = PDDocument.load(stream);
PDFTextStripper stripper = new PDFTextStripper();
String contents = stripper.getText(document);

some Japanese documents are handled properly. This is shown by viewing the chars in the String "contents".
However, other Japanese documents produce garbage non-Japanese characters as viewed in the String contents.

The documents that PDFTextStripper does not handle properly display a prompt when opened in Acrobat Reader, saying that a Japanese language support pack must be installed to view the document properly. The ones that are handled properly display Japanese characters fine in Acrobat Reader. Installing the language support pack is not a solution, since it would only fix the display in Acrobat Reader. This code needs to run on a Unix server, so even if the support pack helped on a PC (unlikely), it would have no effect on the task when run on Unix.

This appears to be an encoding issue. However, unlike similar issues that have been reported, the above code completes successfully; only the results are wrong, as described above.

Attached is an example of a PDF file that is not handled properly by PDFTextStripper and requires a Japanese language pack to view in Acrobat Reader.

File size shrinks after populating data into the fields

chapsi — Wed, 12 Sep 2007 14:44:58 -0000

Hi all,

I am trying to populate the fields in a PDF with data using the PDFBox API. I notice that the original document is around 124 KB and the newly populated PDF is only about 82 KB. I am trying to print this modified PDF and I can't print it. However, I can open the modified PDF in Adobe Reader and do a File --> Print.

Has anyone seen this kind of problem before? Why would the file size be smaller after the fields are populated? I would expect it only to grow.

Appreciate any responses...

Thanks,
Chapsi

Extracting text by ID

Umkhulubaas — Tue, 04 Sep 2007 19:51:01 -0000

Attachment upload for:

http://sourceforge.net/forum/forum.php?thread_id=1812274&forum_id=267205
