Search for a string in a PDF document using Java

Question

karthikprs 0 Newbie Poster

12 Years Ago

I am trying to extract the number in the string "(c) 2010 Elsevier Ltd" from a PDF document. I found that the TextSearch class of the PDFTron package can find the string, as shown in this example code:

import pdftron.Common.PDFNetException;
import pdftron.PDF.*;
import pdftron.SDF.SDFDoc;

// This sample illustrates the basic text search capabilities of PDFNet.
public class TextSearchTest 
{
	
	public static void main(String[] args)
	{
		PDFNet.initialize();
		String input_path =  "../../TestFiles/";

		try	
		{
			PDFDoc doc = new PDFDoc(input_path + "credit card numbers.pdf");
			doc.initSecurityHandler();
			
			TextSearch txt_search = new TextSearch();
			int mode = TextSearch.e_whole_word | TextSearch.e_page_stop;

			String pattern = "joHn sMiTh";

			//call Begin() method to initialize the text search.
			txt_search.begin( doc, pattern, mode, -1, -1 );

			int step = 0;
		
			//call Run() method iteratively to find all matching instances.
			while ( true )
			{
				TextSearchResult result = txt_search.run();
	
				if ( result.getCode() == TextSearchResult.e_found )
				{
					if ( step == 0 )
					{
						//step 0: found "John Smith"
						//note that, here, 'ambient_string' and 'hlts' are not written to, 
						//as 'e_ambient_string' and 'e_highlight' are not set.
						System.out.println(result.getResultStr() + "'s credit card number is:");
	
						//now switch to using regular expressions to find John's credit card number
						mode = txt_search.getMode();
						mode |= TextSearch.e_reg_expression | TextSearch.e_highlight;
						txt_search.setMode(mode);
						String new_pattern = "\\d{4}-\\d{4}-\\d{4}-\\d{4}"; //or "(\\d{4}-){3}\\d{4}"
						txt_search.setPattern(new_pattern);
	
						step = step + 1;
					}
					else if ( step == 1 )
					{
						//step 1: found John's credit card number
						System.out.println("  " + result.getResultStr());
	
						//note that, here, 'hlts' is written to, as 'e_highlight' has been set.
						//output the highlight info of the credit card number
						Highlights hlts = result.getHighlights();
						hlts.begin(doc);
						while ( hlts.hasNext() )
						{
							System.out.println("The current highlight is from page: " + hlts.getCurrentPageNumber());
							hlts.next();
						}
						
						//see if there is an AMEX card number
						String new_pattern = "\\d{4}-\\d{6}-\\d{5}";
						txt_search.setPattern(new_pattern);

						step = step + 1;
					}
					else if ( step == 2 )
					{
						//found an AMEX card number
						System.out.println("\nThere is an AMEX card number: ");
						System.out.println("  " + result.getResultStr());
	
						//change mode to find the owner of the credit card; supposedly, the owner's
						//name precedes the number
						mode = txt_search.getMode();
						mode |= TextSearch.e_search_up;
						txt_search.setMode(mode);
						String new_pattern = "[A-z]++ [A-z]++";
						txt_search.setPattern(new_pattern);
	
						step = step + 1;
					}
					else if ( step == 3 )
					{
						//found the owner's name of the AMEX card
						System.out.println("Is the owner's name:");
						System.out.println("  " + result.getResultStr() + "?");
						
						//add a link annotation based on the location of the found instance
						Highlights hlts = result.getHighlights();
						hlts.begin(doc);
						while ( hlts.hasNext() )
						{
							Page cur_page= doc.getPage(hlts.getCurrentPageNumber());
							double[] q = hlts.getCurrentQuads();
							int quad_count = q.length/8;
							for ( int i = 0; i < quad_count; ++i )
							{
								//assume each quad is an axis-aligned rectangle
								int offset = 8*i;
								double x1 = Math.min(Math.min(Math.min(q[offset+0], q[offset+2]), q[offset+4]), q[offset+6]);
								double x2 = Math.max(Math.max(Math.max(q[offset+0], q[offset+2]), q[offset+4]), q[offset+6]);
								double y1 = Math.min(Math.min(Math.min(q[offset+1], q[offset+3]), q[offset+5]), q[offset+7]);
								double y2 = Math.max(Math.max(Math.max(q[offset+1], q[offset+3]), q[offset+5]), q[offset+7]);
								pdftron.PDF.Annots.Link  hyper_link =  pdftron.PDF.Annots.Link.create(doc, new Rect(x1, y1, x2, y2), Action.createURI(doc, "http://www.pdftron.com"));
								cur_page.annotPushBack(hyper_link);
							}
							hlts.next();
						}
						String output_path = "../../TestFiles/Output/";
						doc.save((output_path + "credit card numbers_linked.pdf"), SDFDoc.e_linearized, null);
						break;
					}
				}
				else if ( result.getCode() == TextSearchResult.e_page )
				{
					//you can update your UI here, if needed
				}
				else
				{
					break;
				}
			}
			
			doc.close();
		}
		catch (PDFNetException e)
		{
			System.out.println(e);
		}
		
		PDFNet.terminate();
	}
}

Is there any way I can get just the number from the string and store it in a separate file?
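
For the "number alone" part, one possible approach is to run a second, plain java.util.regex pass over whatever result.getResultStr() returns. The sketch below assumes the matched text looks like "(c) 2010 Elsevier Ltd" and that the year is a four-digit number starting with 19 or 20. The class and method names are hypothetical and are not part of PDFNet.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFNet: pulls a four-digit year out of
// a matched copyright string such as "(c) 2010 Elsevier Ltd".
final class YearExtractor {

    // Assumption: years start with 19 or 20 and are not part of a longer run of digits.
    private static final Pattern YEAR =
            Pattern.compile("(?<!\\d)((?:19|20)\\d{2})(?!\\d)");

    // Returns the first year found, or an empty OptionalInt if there is none.
    static OptionalInt extractYear(String text) {
        Matcher m = YEAR.matcher(text);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        OptionalInt year = extractYear("(c) 2010 Elsevier Ltd");
        if (year.isPresent()) {
            System.out.println(year.getAsInt());
        }
    }
}
```

Returning an OptionalInt rather than a sentinel such as -1 makes the "no year found" case explicit for the caller.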

java pdf string-search

dantinkakkar 19 Junior Poster

12 Years Ago

So basically you want to store the number in a separate PDF file after extracting it or what?

dantinkakkar 19 Junior Poster

12 Years Ago

You're looking for file input and output. The Java Tutorials would help you here.

This link is one of the best available, I guess:

http://docs.oracle.com/javase/tutorial/essential/io/

JamesCherrill 4,733 Most Valuable Poster

12 Years Ago

How far have you got? Have you been able to use the code above as a base to find the year for a single pdf and just print it out? If not, that would be a good place to start.

karthikprs 0 Newbie Poster · Answer 1 · 2012-02-28T15:50:45+00:00

Yes, exactly, but in a text file.

karthikprs 0 Newbie Poster · Answer 2 · 2012-02-28T22:46:20+00:00

My actual goal is to sort the journals in a folder by year of publication. For that, I'm extracting the year, and I will then use it in the sorting step.

karthikprs 0 Newbie Poster · Answer 3 · 2012-02-29T14:29:27+00:00

JamesCherrill wrote: "How far have you got? Have you been able to use the code above as a base to find the year for a single pdf and just print it out? If not, that would be a good place to start."

Yes, my friend. I got the year for almost 200 files using a regex. Some files don't print the year, because in them the copyright symbol comes through as a garbage character.

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 4 · 2012-02-29T15:11:57+00:00

Great. Writing them to a text file is just like printing; you just need a PrintStream, e.g. new PrintStream("myfile.txt"), and print to that instead.
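
A minimal sketch of that idea, assuming one line per PDF with the file name and the year separated by a tab. The class name, the method name, and the file names "years.txt" and "example.pdf" are placeholders.

```java
import java.io.FileNotFoundException;
import java.io.PrintStream;

// Hypothetical helper: writes extracted years to any PrintStream,
// so the same code works for System.out and for a file.
final class YearWriter {

    // Writes one "fileName<TAB>year" line to the given stream.
    static void writeYear(PrintStream out, String pdfName, int year) {
        out.println(pdfName + "\t" + year);
    }

    public static void main(String[] args) throws FileNotFoundException {
        // try-with-resources closes the stream, which also flushes it to disk.
        try (PrintStream out = new PrintStream("years.txt")) {
            writeYear(out, "example.pdf", 2010);
        }
    }
}
```

Passing the PrintStream in as a parameter means the call that currently prints to System.out can stay unchanged apart from the stream it receives.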

karthikprs 0 Newbie Poster · Answer 5 · 2012-02-29T15:46:50+00:00

Fine. but how to sort the files with one of the years as input.

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 6 · 2012-02-29T15:52:58+00:00

That's easier to do in memory, after reading the PDFs but before writing your output text file.
There are many ways to sort data in Java, but here's one simple way:
As you process the PDFs, put the results into a TreeMap (documentation in the usual places) with the year as the key and the file name as the value. (TreeMaps are kept sorted in key order.)
When all the PDFs are processed, you can write the contents of the TreeMap to the file.
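
The TreeMap approach could be sketched as follows. One assumption goes beyond the description above: because a map holds only one value per key, the sketch stores a list of file names for each year, so that several journals from the same year do not overwrite each other. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index of PDF file names, grouped by publication year.
final class JournalsByYear {

    // TreeMap iterates its keys in ascending order, so years come out sorted.
    private final Map<Integer, List<String>> byYear = new TreeMap<>();

    // Records one file under its year, creating the list on first use.
    void add(int year, String fileName) {
        byYear.computeIfAbsent(year, y -> new ArrayList<>()).add(fileName);
    }

    // Returns "year<TAB>fileName" lines, ordered by year, then by insertion order.
    List<String> sortedLines() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : byYear.entrySet()) {
            for (String name : entry.getValue()) {
                lines.add(entry.getKey() + "\t" + name);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        JournalsByYear index = new JournalsByYear();
        index.add(2010, "b.pdf");
        index.add(2006, "a.pdf");
        index.add(2010, "c.pdf");
        index.sortedLines().forEach(System.out::println);
    }
}
```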

But do you mean sort or search?

karthikprs 0 Newbie Poster · Answer 7 · 2012-02-29T15:57:05+00:00

Yes, I mean sort.

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 8 · 2012-02-29T16:13:54+00:00

OK then, TreeMap is a simple way to go.
If this is going to get any more complicated than just year+file you should think about creating a small class to hold all the info about each journal.
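
Such a class might look like the sketch below. The field names, and the choice of publisher as the extra piece of information, are assumptions made for illustration. Sorting uses a Comparator that orders by year and then by file name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value class holding everything known about one journal PDF.
final class Journal {
    final String fileName;
    final int year;
    final String publisher; // assumed extra field, e.g. "Elsevier Ltd"

    // Orders journals by year, breaking ties by file name.
    static final Comparator<Journal> BY_YEAR =
            Comparator.comparingInt((Journal j) -> j.year)
                      .thenComparing(j -> j.fileName);

    Journal(String fileName, int year, String publisher) {
        this.fileName = fileName;
        this.year = year;
        this.publisher = publisher;
    }

    @Override
    public String toString() {
        return year + "\t" + fileName + "\t" + publisher;
    }

    public static void main(String[] args) {
        List<Journal> journals = new ArrayList<>();
        journals.add(new Journal("b.pdf", 2010, "Elsevier Ltd"));
        journals.add(new Journal("a.pdf", 2006, "Elsevier Ltd"));
        journals.sort(BY_YEAR);
        journals.forEach(System.out::println);
    }
}
```

With a class like this, a plain list plus a comparator can replace the map, and adding another sort order later only needs another comparator.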

karthikprs 0 Newbie Poster · Answer 9 · 2012-03-06T17:14:16+00:00

Hi James, I tried sorting the PDFs using a map as you suggested, and it gave good output.
I need your help with this regex: [&] \\d* [a-z A-z \\.]*[Allrightsreserved \\.]. It is meant to find the text "©2006 Elsevier Ltd. All rights reserved.". In the extracted text, the copyright symbol comes out as a different character depending on the file: &, q, r, ' or ©. How do I modify the regex so that it recognizes all of these symbols? Please help me with this.
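
One possible direction, sketched with java.util.regex on text that has already been extracted (PDFNet's own regular-expression mode may use a different syntax, so this is not a drop-in pattern for txt_search.setPattern). Two things in the regex above are worth noting: [Allrightsreserved \\.] is a character class, so it matches a single character rather than the phrase, and A-z also covers the punctuation characters that sit between Z and a. The sketch lists the observed stand-ins for the copyright symbol in one character class. The class name is hypothetical, and the pattern assumes a four-digit year followed by a publisher name made of letters, spaces and dots.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the year in lines such as
// "©2006 Elsevier Ltd. All rights reserved." when the © may be garbled.
final class CopyrightLine {

    private static final Pattern COPYRIGHT = Pattern.compile(
            "(?<![A-Za-z])"          // the stand-in must not be the tail of a word
            + "[&qr'\u00A9]"         // any one of the observed stand-ins for ©
            + "\\s*(\\d{4})\\s+"     // the year, captured as group 1
            + "[A-Za-z .]+?"         // publisher name, matched lazily
            + "\\s*All\\s*rights\\s*reserved\\.");

    // Returns the year from the first matching copyright line, if any.
    static OptionalInt year(String text) {
        Matcher m = COPYRIGHT.matcher(text);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        OptionalInt y = year("q 2006 Elsevier Ltd. All rights reserved.");
        if (y.isPresent()) {
            System.out.println(y.getAsInt());
        }
    }
}
```

If further stand-in characters turn up in other files, they can be added to the first character class without touching the rest of the pattern.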