Hi,

I have a requirement to identify duplicates in a file based on the first 6 characters (it is a fixed-width file with 12-character records). Whenever a duplicate row is found, the last 2 characters of the original row and the duplicate row should both be set to 00 if they are not the same; if they already match, they should be left as is.

I can get the results using multiple loops, but I need something faster.

Here is the sample input and output:

input:
1251233Y1234
1221249N8821
1231116Y9945
1231113Y2123
1231109Y3212
1231123N1214
1231126N1214

output should be:
1251233Y1234
1221249N8821
1231116Y9900
1231113Y2100
1231109Y3212
1231123N1214
1231126N1214 (since the last 2 digits are the same, nothing changed)

Any help in achieving the above result using either awk or sed will be greatly appreciated.

Thanks,
Faraway

Are you required to use awk/sed? This would be much easier if you could employ something like perl/ruby/python.
In general, the loop body would look something like:

if first_line
    previous_line = current_line   # nothing to compare against yet
    continue
end if

saved_current_line = current_line   # keep an unmodified copy for the next comparison

if previous_line[0..5] == current_line[0..5]          # first 6 chars match (inclusive ranges)
    if previous_line[-2..-1] != current_line[-2..-1]  # last 2 chars actually differ
        previous_line[-2..-1] = "00"
        current_line[-2..-1] = "00"
    end if
end if

previous_line = saved_current_line

output previous_line
output current_line

Of course, this completely ignores the case where you have an odd number of sequential lines with a matching prefix - you'd have to add logic to handle that.
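
As an aside, to make the slicing concrete in Python (one of the languages suggested above - note that Python slices are end-exclusive, so the first 6 characters are [:6]):

line = "1231116Y9945"       # one 12-character record from the sample input
prefix = line[:6]           # "123111" -- the duplicate key (first 6 chars)
suffix = line[-2:]          # "45"     -- the last 2 chars
zeroed = line[:-2] + "00"   # "1231116Y9900" -- strings are immutable, so build a new one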

Whoops. The last part of that should output before reassigning previous_line, and it should check for the last line instead of blindly printing current_line on every loop iteration.

So something like:

output previous_line

previous_line = saved_current_line

if last_line
    output current_line
end if
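
Putting the corrected version together, here is a minimal Python sketch. It assumes duplicates are always adjacent (as in the sample) and that every record is exactly 12 characters, read from stdin. One deliberate deviation from the pseudocode above: it carries the possibly-zeroed line forward instead of the unmodified saved copy, since otherwise the second line of a zeroed pair would be printed unmodified; the prefix comparison is unaffected because only the last 2 characters ever change.

import sys

def zero_last_two(s):
    # Python strings are immutable, so "set the last 2 chars to 00" builds a new string
    return s[:-2] + "00"

previous = None
for raw in sys.stdin:
    current = raw.rstrip("\n")
    if previous is None:                   # first line: nothing to compare yet
        previous = current
        continue
    if previous[:6] == current[:6]:        # first 6 chars match -> duplicate pair
        if previous[-2:] != current[-2:]:  # last 2 chars differ -> zero both
            previous = zero_last_two(previous)
            current = zero_last_two(current)
    print(previous)                        # output the settled line before moving on
    previous = current                     # carry the (possibly zeroed) line forward

if previous is not None:                   # flush the final line
    print(previous)

Run it as, say, python3 dedup.py < input.txt (the script name is just an example); against the sample input above it produces the expected output.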