Pattern recognition among records

Question

prakash89.gitam -3 Newbie Poster

12 Years Ago

Hi

I have a task to do. I will just brief it in a sentence,

I have a file of 20000 lines. Now task is to recognize number of similar lines.

ex1 : "The quick brown fox jumps on a lazy dog"
ex2 : "The quick brown dog jumps on a very lazy fox"

I want these two sentences as one group. So among 20000 lines i have to recognize how many groups are present.

Time complexity is major issue. Please tell me is there any approach i can go ahead with??

java regex seo

3 Contributors
2 Replies
181 Views
2 Days Discussion Span
Latest Post 12 Years Ago Latest Post by Taywin

NormR1 563 Posting Sage

12 Years Ago

An idea: assign each new word a number and save the numbers representing the words in each line in an array for that line. Comparing numbers should be faster than comparing Strings.

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

Taywin 312 Posting Virtuoso · Answer 1 · 2012-08-13T15:43:57+00:00

When you talk about Time complexity, do you consider space as well? If not, then you may try...

/*
LineDataObject
  - lineNumber (int)
  - content (String[])  // each string is a word of the line
LineDataComparison
  - lineObj (LineDataObject)
  - duplicatedLineNumber (ArrayList<Integer>)
LineComparison
  - allLines (ArrayList<LineDataComparison>)
lineComp <- a LineComparison object
lineNumber <- 0
read <- each line of the file starting from the first line

while read!=EOF
  lineNumber <- lineNumber + 1
  aLine <- read the line
  lObj <- create LineDataComparison using the line data (aLine) and lineNumber
  if lObj is not one of lineComp line data
    add lObj to lineComp
  else  // discard the newly created object but add line number instead
    add the line number to the matched LineDataComparison in lineComp
  end if
  read <- next line
end while
*/

The reading file part will be O(n). The cost (big O) for time is going to depend on the comparison of lObj with lineComp because the worst case scenario would be O(nlogn*m) where m is the number of words in a line. The cost for space is the whole content of the file (big O) but it may not be up to the whole file because of duplicated. Using break once it found a match could improve the performance too.