#7 pdf forms to xml forms

2013-12-16T14:02:50.292000Z

Lines roughly correspond to TEXT tags: concatenating the content of the TOKEN elements inside a TEXT element gives the line. TOKEN elements are generated because they carry typographical information for each token.
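As a rough illustration of that concatenation, here is a minimal Python sketch that post-processes the XML output instead of modifying pdf2xml itself. It assumes that each line is a TEXT element whose words are TOKEN descendants and that joining them with a single space is acceptable. The function name `lines_from_pdf2xml` and the `SAMPLE` document (including its DOCUMENT and PAGE wrappers) are made up for the example and are not taken from real pdf2xml output.

```python
import xml.etree.ElementTree as ET


def lines_from_pdf2xml(xml_text):
    """Return one string per TEXT element, built from its TOKEN contents.

    Assumption: words are TOKEN elements nested inside TEXT elements.
    Typographical attributes on TOKEN are ignored here.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for text_el in root.iter("TEXT"):
        # Collect the non-empty text of every TOKEN under this TEXT element.
        words = [(tok.text or "").strip() for tok in text_el.iter("TOKEN")]
        words = [w for w in words if w]
        if words:
            lines.append(" ".join(words))
    return lines


# Hypothetical, hand-written input used only to exercise the function.
SAMPLE = (
    "<DOCUMENT><PAGE>"
    "<TEXT><TOKEN>Form</TOKEN><TOKEN>1040</TOKEN></TEXT>"
    "<TEXT><TOKEN>Your</TOKEN><TOKEN>first</TOKEN><TOKEN>name</TOKEN></TEXT>"
    "</PAGE></DOCUMENT>"
)

if __name__ == "__main__":
    for line in lines_from_pdf2xml(SAMPLE):
        print(line)
```

Joining with a single space discards the original spacing and any font information carried by the tokens, so this is only suitable when readability matters more than layout.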

Regarding forms: pdf2xml extracts the information found in the PDF. Your PDF form is a set of text and graphical information; the form structure is not given explicitly and has to be generated.

2013-11-24T21:17:08.060000Z

Using:

http://hivelocity.dl.sourceforge.net/project/pdf2xml/binaries/Linux%2064%20v1.2.7/pdftoxml.linux64.exe.1.2_7.gz

downloaded on:

2013-01-12

and applied to:

http://www.irs.gov/pub/irs-pdf/f1040.pdf

which was downloaded on:

2013-03-11

produces what looks like a ... element for each word.
For example, the attachment shows a portion of the XML output after
running it through xmlindent.

Could pdf2xml be modified so that words on the same line are concatenated
in a single, say, ... element to make the XML easier to read?
The code here:

http://www.mobipocket.com/dev/pdf2xml/pdf2xml.zip

does that; hence, it must be possible.

Also, f1040.pdf has many PDF form fields which don't appear in the
resulting .xml file produced by pdf2xml. Could pdf2xml be modified to
produce some type of XForms fields, something like what is shown here:

http://xformsinstitute.com/essentials/browse/ch02s02.php

Thanks for all the work on this.

I'm a pretty good C++ programmer and I'm trying to understand PDF;
hence, maybe I could provide some help on these features.

-regards,
Larry
