Member Avatar for brakeb

I have been beating my head into the desk with this issue, and I don't think it's a simple 'uniq' or 'sort' issue.

I have a file with many duplicate values in it.

File

dog
dog
cat
owl
owl
turkey
weasel
giraffe
giraffe
rooster

The output I am looking for would only have the following from the above file:

Output:

cat
turkey
weasel
rooster

Everything I've found so far removes the duplicates but keeps one copy of 'owl' or 'dog', which is not what I need. If a value is duplicated, I don't want it in the output at all. The file I have is one I merged from two other files, each with nearly 50,000 lines, so you can understand why doing this by hand takes so long.

I could do this simply in C, C++, or Java. You keep a map where the key is the data and the value is the number of times you have seen it. For each input line, you look up the data in the map. If it is not found, you add a new entry with a value of 1; if it is found, you increment the value. When you are done reading the data, you walk through the map and output only the entries with a value of 1.

Maps are normal constructs for C++ and Java. For C you would use a structure with a character array (or pointer) for the data, and an integer for the value, and use an array of these structs to substitute for a map.
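For what it's worth, the same idea can be written as a single awk pass, since awk's associative arrays are essentially maps. A rough sketch, assuming the data is in a file called animals.txt (a placeholder name; substitute your own):

# count[$0] counts how many times each line has been seen;
# at the end, print only the lines seen exactly once.
# (Output order is arbitrary; pipe through sort if you want it ordered.)
awk '{ count[$0]++ } END { for (line in count) if (count[line] == 1) print line }' animals.txt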

Member Avatar for brakeb

So it's going to be way more complicated than I thought :( Guess I'll stick with the spreadsheet method. I'm under a bit of a time crunch.

Thank you anyway.

You can easily do this in a shell. If your file is named foo.in then you would get what you want with:

sort foo.in | 
   uniq -c | 
   awk '($1 == 1) {print $2}'

This gives you the entries that occur exactly once in the input file.

Now, depending on how large your file is, it may take some time (50K records is not going to be bad at all).
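For example, with the sample data from the first post saved as foo.in, that pipeline should print the four singletons, in sorted order since sort runs first:

cat
rooster
turkey
weasel

One caveat: the awk step prints $2, which is fine here because every line is a single word. If your lines can contain spaces, sort foo.in | uniq -u gives the same result and keeps the whole line intact.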

Member Avatar for brakeb

My issue was more complicated than I originally guessed, but I figured out a solution. It's probably a bit convoluted, but it works for me, and that is what's important.

I was doing a firewall ACL audit. My latest ACL list had hitcounts on each line, which was throwing off my attempts to compare it against an older ACL list, since every line had a different hitcount or a different line number.

Obviously, comparing whole lines wouldn't work, as 'diff' and 'comm' see the 'hitcnt' or 'line' number and report the line as different. So what I did was grep out just the hashes from each file:

grep -o '0x[0-9A-Fa-f]\{4,\}' old_acl_list.txt | sort >> hashes_old
grep -o '0x[0-9A-Fa-f]\{4,\}' new_acl_list.txt | sort >> hashes_new

Then I ran 'diff' on those. I found a way for 'diff' to show only the lines unique to the newer file here: http://www.linuxquestions.org/questions/linux-newbie-8/comparing-two-linux-files-for-diffirences-and-similarities-822245/

diff --changed-group-format='%<' --unchanged-group-format='' hashes_new hashes_old >> final_hashes
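Since both hash files are already sorted, 'comm' should give the same set (it needs sorted input; -2 suppresses lines found only in the old file and -3 suppresses lines common to both, leaving only the hashes unique to the new file):

comm -23 hashes_new hashes_old >> final_hashes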

Then all I did was grep for each of the unique values in the new ACL list:

for line in `cat final_hashes`; do
    grep -i "$line" new_acl_list.txt >> final_acl_audit.txt
done
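A possible shortcut: grep can read all of the patterns from a file at once with -f, and -F treats the hashes as fixed strings instead of regular expressions, so the loop shouldn't be needed at all:

grep -iFf final_hashes new_acl_list.txt >> final_acl_audit.txt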

I thought I'd post this, as I have seen too many posts in forums where you see 'nevermind, I figured it out' without any info. Yea, it's not pretty, but if you're doing PCI firewall audits, and have to compare two firewall ACL lists, this will do it...

commented: Kudos for giving back the answer! +9