Retrieve PDF Data

Question

dukleatish 0 Newbie Poster

9 Years Ago

Hi,
I'm currently working on a Windows application that loads an electoral roll PDF file. What I'm trying to do is extract the data for each person: Sr. No., EPIC No., Name, Father's / Husband's Name, Age, Sex, House No. and PIN code. The data is laid out in 3 columns and 10 rows, i.e. 30 persons' details per page (some pages may have fewer).
Developer: VB.Net 2010
.Net 4.5 Framework
Acrobat Reader DC
OS: Windows 7

Thanks in advance

pdf vb.net

jwenting 1,905 duckman

9 Years Ago

Whether you can decipher the information in a PDF at all depends on how the PDF was created. One can create PDF files as documents with paragraphs and tables of text, in which case it is possible (with the right libraries, or a lot of work to write them) to extract data from them.
However, many PDF generators are lazier and create the PDF as a single bitmap image per page, in which case you're pretty much buggered.
You might be able to extract the images and then use OCR software to try to find text in them, but it's much less reliable and much messier.
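
One rough way to tell which kind of PDF you have is to run any text extractor over it and see whether anything readable comes out. The Python sketch below is only a heuristic. It assumes the extractor separates pages with a form feed character (as Poppler's pdftotext does), and the function name and the threshold of 20 characters are made up for illustration.

```python
def pages_without_text(extracted: str, min_chars: int = 20) -> list:
    """Return 1-based numbers of pages that contain almost no letters or digits."""
    pages = extracted.split("\f")
    # Extractors usually emit a trailing form feed; drop the empty tail it creates.
    if pages and pages[-1].strip() == "":
        pages.pop()
    suspect = []
    for number, page in enumerate(pages, start=1):
        readable = sum(ch.isalnum() for ch in page)
        if readable < min_chars:
            suspect.append(number)  # probably a scanned image, so OCR is needed
    return suspect
```

If every page shows up in the result, the PDF is most likely image-based and OCR is the only route left.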

rproffitt 2,701 https://5calls.org Moderator · Answer 1 · 2015-12-10T17:55:03+00:00

So while there are PDF add-ons for VB, if I were to do this I might use my old methods instead. My old way was to push the file (it doesn't matter what format) through a command line app that outputs plain text, which my code then scans for the content I need.

So google PDF2TXT and give it a try.
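
As a rough illustration of that workflow (in Python rather than VB.Net, purely to keep it short): the sketch below assumes Poppler's pdftotext is installed and on the PATH, as a stand-in for whichever PDF-to-text converter you end up with. The function names are made up for the example.

```python
import subprocess

def pdf_to_text(pdf_path: str) -> str:
    """Convert a PDF to plain text with pdftotext, keeping the column layout."""
    # "-layout" preserves horizontal positions; "-" sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout

def lines_containing(text: str, keyword: str) -> list:
    """Return the stripped lines that mention keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line.strip() for line in text.splitlines() if needle in line.lower()]
```

Something like lines_containing(pdf_to_text("roll.pdf"), "Age") would then list every line that mentions an age, where roll.pdf is a placeholder file name. The same shell-out-and-scan idea carries over to VB.Net using Process.Start.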

dukleatish 0 Newbie Poster · Answer 2 · 2015-12-11T04:03:38+00:00

In this electoral roll PDF the data boxes all follow the same pattern: Sr. No. in the top left corner, the EPIC no. beside it, the name on the next line, and so on. Is there any way to use a 3rd-party OCR DLL and make it identify boxes, text, fields etc.? If anyone knows the code or how to do it, please help. Thank you.
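
Once the text is out of the PDF (by conversion or OCR), boxes with a fixed pattern like this can often be picked apart with a regular expression. The Python sketch below is only an illustration: the box layout, the field labels and the EPIC format of three letters plus seven digits are all assumptions, and the sample data in the usage note is invented.

```python
import re

# Hypothetical layout of one voter box after conversion to text;
# adjust the labels and patterns to match what your PDF really produces.
BOX_PATTERN = re.compile(
    r"(?P<sr_no>\d+)\s+(?P<epic_no>[A-Z]{3}\d{7})\s*\n"
    r"Name\s*:\s*(?P<name>.+?)\s*\n"
    r"(?:Father's|Husband's) Name\s*:\s*(?P<relative>.+?)\s*\n"
    r"House No\.?\s*:\s*(?P<house_no>.+?)\s*\n"
    r"Age\s*:\s*(?P<age>\d+)\s+Sex\s*:\s*(?P<sex>\w+)"
)

def parse_boxes(text: str) -> list:
    """Return one dict of field name to string per voter box found in text."""
    return [match.groupdict() for match in BOX_PATTERN.finditer(text)]
```

For a box that reads "12 ABC1234567" on the first line followed by the labelled lines, parse_boxes gives a dict with sr_no, epic_no, name and so on. Note that with three boxes side by side, the text of each line will contain pieces of all three, so the columns would need to be split apart first.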

rproffitt 2,701 https://5calls.org Moderator · Answer 3 · 2015-12-11T15:31:05+00:00

Here's another approach if the PDF is indeed image-based and not characters. It comes from a project I worked on last year; sorry that I can't share the code, but the ideas are open for re-use.

The scenario was to automate the storage and association of pictures taken on an assembly line.

This was quite a lot of fun to figure out so I'll share the highlights.
1. Pictures were taken with a point and shoot camera in the mid 100 dollar range.
2. Transfer was automated using an EyeFi SD card. (Hey! Secret Sauce ingredient!)
3. The rules were simple for the end user.
   a. The first 2 shots were of items that clearly showed the work order tag (bar code and text).
   b. Pictures of the product followed, as the management wanted.
4. Now the work began, as the application, fully automated, would process the images through:
   a. PBMPLUS and NETPBM tools to create a B/W image for the next step.
   b. Run the image through TESSERACT OCR to output text files.
   c. The application would sift through the text files looking for the work order number.
   d. The pictures would be moved to the folder for the work order and logged in the application's log.
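
Steps 4b to 4d can be sketched roughly as follows. This is not the code from that project, just a minimal Python illustration. It assumes the tesseract command line tool is on the PATH, it skips the NETPBM clean-up step, and the work order format ("WO" followed by digits) and all function names are invented for the example.

```python
import re
import shutil
import subprocess
from pathlib import Path

# Assumed work order format ("WO" plus at least four digits); real tags may differ.
WORK_ORDER = re.compile(r"\bWO-?(\d{4,})\b")

def ocr_to_text(image: Path) -> str:
    """Run the tesseract CLI on image and return the recognised text."""
    out_base = image.parent / (image.stem + "_ocr")
    # tesseract takes an output base name and appends ".txt" itself.
    subprocess.run(["tesseract", str(image), str(out_base)], check=True)
    return Path(str(out_base) + ".txt").read_text(encoding="utf-8")

def find_work_order(text: str):
    """Return the digits of the first work order number in text, or None."""
    match = WORK_ORDER.search(text)
    return match.group(1) if match else None

def file_pictures(pictures, work_order: str, archive: Path) -> Path:
    """Move pictures into archive/<work_order> and return that folder."""
    target = archive / work_order
    target.mkdir(parents=True, exist_ok=True)
    for picture in pictures:
        shutil.move(str(picture), str(target / Path(picture).name))
    return target
```

A driver would OCR the first two shots of each batch, call find_work_order on the text, and pass the whole batch to file_pictures once a number turns up.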

This was one heck of a fun project as it pulled together ideas from many areas and leveraged open source code.

What I learned about how to get good Tesseract OCR output was nothing you'd learn in school. But for you, there is a way.

After all that, you still have to write your code to sift out the data you want.