Comparing 2 columns of 2 different .tsv files and write the output in a new tsv file

Question

red711 9 Newbie Poster

12 Years Ago

Hi,
I'm new to perl, and need help about comparing 2 columns within 2 different .tsv files.
I have search through the forum for some similar case of mine, such as http://www.daniweb.com/software-development/perl/threads/335711, http://www.daniweb.com/software-development/perl/threads/336421 and http://www.daniweb.com/software-development/perl/threads/311399 , I have tried to modified them according to my needs but somehow the final output is not right, so I decided to make a new thread.

I have 2 input tsv files,
The first one is archive.tsv. It is a paper database with the format of
<Paper ID>\t<Paper Title>\t<Author>\t<url>

P08-1016	Lexicalized Phonotactic Word Segmentation	Margaret M. Fleck	[url]http://aclweb.org/anthology-new/P/P08/P08-1016.pdf[/url]
P08-1021	Correcting Misuse of Verb Forms	John Lee; Stephanie Seneff	[url]http://aclweb.org/anthology-new/P/P08/P08-1021.pdf[/url]
P08-1030	Refining Event Extraction through Cross-Document Inference	Heng Ji; Ralph Grishman	[url]http://aclweb.org/anthology-new/P/P08/P08-1030.pdf[/url]
P08-1038	A Logical Basis for the D Combinator and Normal Form in CCG	Frederick Hoyt; Jason Baldridge	[url]http://aclweb.org/anthology-new/P/P08/P08-1038.pdf[/url]
P08-1039	Parsing Noun Phrase Structure with CCG	David Vadas; James R. Curran	[url]http://aclweb.org/anthology-new/P/P08/P08-1039.pdf[/url]
P08-1040	Sentence Simplification for Semantic Role Labeling	David Vickrey; Daphne Koller	[url]http://aclweb.org/anthology-new/P/P08/P08-1040.pdf[/url]
P08-1042	Ad Hoc Treebank Structures	Markus Dickinson	[url]http://aclweb.org/anthology-new/P/P08/P08-1042.pdf[/url]
P08-3003	Inferring Activity Time in News through Event Modeling	Vladimir Eidelman	[url]http://aclweb.org/anthology-new/P/P08/P08-3003.pdf[/url]
P08-5003	Semi-Supervised Learning for Natural Language Processing	John Blitzer; Xiaojin Jerry Zhu	[url]http://aclweb.org/anthology-new/P/P08/P08-5003.pdf[/url]
P08-5004	Advanced Online Learning for Natural Language Processing	Koby Crammer	[url]http://aclweb.org/anthology-new/P/P08/P08-5004.pdf[/url]

and the second file is program.tsv. It is a conference program database with the format of
<Program Session>\t<Paper Title>\t<Author>

Information Extraction 2	Refining Event Extraction through Cross-Document Inference	Ji, Heng;  Ralph Grishman
Syntax & Parsing 1	A Logical Basis for the D Combinator and Normal Form in CCG	Hoyt, Frederick;  Jason Baldridge
Speech Processing	Lexicalized Phonotactic Word Segmentation	Fleck, Margaret M.
Syntax & Parsing 1	Parsing Noun Phrase Structure with CCG	Vadas, David;  James R. Curran
Syntax & Parsing 1	Sentence Simplification for Semantic Role Labeling	Vickrey, David;  Daphne Koller
Student Research Workshop	A Supervised Learning Approach to Automatic Synonym Identification Based on Distributional Features	Hagiwara, Masato
Student Research Workshop	An Integrated Architecture for Generating Parenthetical Constructions	Banik, Eva

These datas that I posted are only some of the data that I think will represent my questions.
My objectives:
1. I want to match the Paper Title each line of the archive.tsv file with the Paper Title in the program.tsv files.
If the title in the archive file exists in the program file, I should print exactly each line in the archive file, then added the respective session from the program file.
If the title in the archive file does not exist in the program file, I should print exactly each line in the archive file, then added *NA* in the session column (added in the last column).
2. If there is a Title that is available in the program.tsv files but not available in the archive.tsv files, it should be ignored.

Here's my code:

#!/usr/bin/perl

open(FILE1,"<archive.tsv");
open(FILE2,"<program.tsv");
open (OUT,">combine.tsv");
my @array1=<FILE1>;
my @array2=<FILE2>;
close FILE1;
close FILE2;

for (@array2){
	chomp;
	my($session,$titleprog,$authorprog)=split(/\t/);
	push(@sesname,"$session"); 
	push(@title2, "$titleprog");
}

for (@array1){
	chomp;
	my($paperid,$titlearcv,$authorarcv,$url)=split(/\t/);
	push(@all, "$_");
	push(@title1, "$titlearcv");
}
$count = 0;
foreach my $progline (@title2){
	foreach my $arcline (@all){
		if ($arcline =~ m/$progline/i){
			print OUT "$arcline\t$sesname[$i]\n";
			$count++;
		}
		else {$count++;}print OUT "$arcline\t*NA*\n" if ($count == $#all);
	}
	$i++;
}
close OUT;

and I got the following output:

P08-1030	Refining Event Extraction through Cross-Document Inference	Heng Ji; Ralph Grishman	[url]http://aclweb.org/anthology-new/P/P08/P08-1030.pdf[/url]	Information Extraction 2
P08-5003	Semi-Supervised Learning for Natural Language Processing	John Blitzer; Xiaojin Jerry Zhu	[url]http://aclweb.org/anthology-new/P/P08/P08-5003.pdf[/url]	*NA*
P08-1038	A Logical Basis for the D Combinator and Normal Form in CCG	Frederick Hoyt; Jason Baldridge	[url]http://aclweb.org/anthology-new/P/P08/P08-1038.pdf[/url]	Syntax & Parsing 1
P08-1016	Lexicalized Phonotactic Word Segmentation	Margaret M. Fleck	[url]http://aclweb.org/anthology-new/P/P08/P08-1016.pdf[/url]	Speech Processing
P08-1039	Parsing Noun Phrase Structure with CCG	David Vadas; James R. Curran	[url]http://aclweb.org/anthology-new/P/P08/P08-1039.pdf[/url]	Syntax & Parsing 1
P08-1040	Sentence Simplification for Semantic Role Labeling	David Vickrey; Daphne Koller	[url]http://aclweb.org/anthology-new/P/P08/P08-1040.pdf[/url]	Syntax & Parsing 1

My desired output should be like this:
format: <Paper ID>\t<Paper Title>\t<Author>\t<url>\t<Program Session>

P08-1016	Lexicalized Phonotactic Word Segmentation	Margaret M. Fleck	[url]http://aclweb.org/anthology-new/P/P08/P08-1016.pdf[/url]	Speech Processing
P08-1021	Correcting Misuse of Verb Forms	John Lee; Stephanie Seneff	[url]http://aclweb.org/anthology-new/P/P08/P08-1021.pdf[/url]	*NA*
P08-1030	Refining Event Extraction through Cross-Document Inference	Heng Ji; Ralph Grishman	[url]http://aclweb.org/anthology-new/P/P08/P08-1030.pdf[/url]	Information Extraction 2
P08-1038	A Logical Basis for the D Combinator and Normal Form in CCG	Frederick Hoyt; Jason Baldridge	[url]http://aclweb.org/anthology-new/P/P08/P08-1038.pdf[/url]	Syntax & Parsing 1
P08-1039	Parsing Noun Phrase Structure with CCG	David Vadas; James R. Curran	[url]http://aclweb.org/anthology-new/P/P08/P08-1039.pdf[/url]	Syntax & Parsing 1
P08-1040	Sentence Simplification for Semantic Role Labeling	David Vickrey; Daphne Koller	[url]http://aclweb.org/anthology-new/P/P08/P08-1040.pdf[/url]	Syntax & Parsing 1
P08-1042	Ad Hoc Treebank Structures	Markus Dickinson	[url]http://aclweb.org/anthology-new/P/P08/P08-1042.pdf[/url]	*NA*
P08-3003	Inferring Activity Time in News through Event Modeling	Vladimir Eidelman	[url]http://aclweb.org/anthology-new/P/P08/P08-3003.pdf[/url]	*NA*
P08-5003	Semi-Supervised Learning for Natural Language Processing	John Blitzer; Xiaojin Jerry Zhu	[url]http://aclweb.org/anthology-new/P/P08/P08-5003.pdf[/url]	*NA*
P08-5004	Advanced Online Learning for Natural Language Processing	Koby Crammer	[url]http://aclweb.org/anthology-new/P/P08/P08-5004.pdf[/url]	*NA*

Can anyone please help me, what is wrong with my code?

compare file-manipulation perl

4 Contributors
7 Replies
547 Views
3 Years Discussion Span
Latest Post 9 Years Ago Latest Post by 2teez

d5e5 commented: Good sample data and script wrapped in code tags. +9

d5e5 109 Master Poster

12 Years Ago

I'll have another look when I get some time but the first thing I'd suggest:

use strict;
use warnings;

belong in every non-trivial Perl script. Add them to your script and declare any undeclared variables, such as @sesname, @title1, @all, @title2 and $i.

Another general suggestion: when opening files, use lexical filehandles instead of bare words and test for success or failure to open by including an or die "blah blah: $!" clause.

open my $fh1, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh2, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";

Edited 12 Years Ago by d5e5 because: n/a

d5e5 109 Master Poster

12 Years Ago

Because you only need to read each record from the archive file once, you don't need to save it into an array. Just iterate through it.

Because each program title should occur no more than once in the program file, you can read it into a hash for more efficient lookup than having to iterate through an array each time you want to search for a program title. The following works for me:

#!/usr/bin/perl
#match_archive.pl
use strict;
use warnings;

my %programs;#Hash to save program title => program session for each program rec
open my $fh1, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
foreach (<$fh1>){
    chomp;#Remove newline character from end of line
    my ($ps, $pt) = split(/\t/);
    $programs{$pt} = $ps;
}
close $fh1;

open my $fh2, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";#Output
foreach (<$fh2>){
    chomp;#Remove newline character from end of line
    my $pt = (split(/\t/))[1];#Paper title
    if (exists $programs{$pt}){
        print $fh3 "$_\t$programs{$pt}\n";
    }
    else{
        print $fh3 "$_\t*NA*\n";
    }
}

red711 commented: useful suggestions, nice and efficient codes. +0

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

red711 9 Newbie Poster · Answer 1 · 2011-09-01T05:21:38+00:00

Hi, d5e5!
thanks, it works like a charm.
I have one question regarding the code,
During the first 'foreach' loop, you split each column into $pt and $ps , and then assign the values of $programs{$pt} into $ps .

foreach (<$fh1>){
    chomp;#Remove newline character from end of line
    my ($ps, $pt) = split(/\t/);
    $programs{$pt} = $ps;
}

$programs{$pt} refers to the content of program session, am I correct?
and, during the 'if' loop in the second 'foreach' loop,

foreach (<$fh2>){
    chomp;#Remove newline character from end of line
    my $pt = (split(/\t/))[1];#Paper title
    if (exists $programs{$pt}){
        print $fh3 "$_\t$programs{$pt}\n";
    }
    else{
        print $fh3 "$_\t*NA*\n";
    }
}

if (exists $programs{$pt}){ isn't that mean you're trying to check the program session whether it exists in the archive file?
I can't understand the difference between $programs{$pt} in if (exists $programs{$pt}){ and in print $fh3 "$_\t$programs{$pt}\n";

d5e5 109 Master Poster · Answer 2 · 2011-09-02T00:24:55+00:00

During the first 'foreach' loop, you split each column into $pt and $ps...

I split each row and assign the first two elements of the resulting list to the variables $pt (program title) and $ps (program session).

...and then assign the values of $programs{$pt} into $ps

No, I assign the content of $ps to the $programs{$pt} entry in the %programs hash. A hash contains one or more key-value pairs, where the key must be unique for the hash. This allows us to look up any value in the hash by specifying its key. The key of this particular hash entry is $pt and the value is $ps.

if (exists $programs{$pt}){ isn't that mean you're trying to check the program session whether it exists in the archive file?

No, I'm checking to see if there is an entry in the %programs hash having that program title as key. Remember that the %program hash contains all data read from program.tsv so in effect I'm checking to see if that program title (from the archive file) exists in the program file.

I can't understand the difference between $programs{$pt} in if (exists $programs{$pt}){ and in print $fh3 "$_\t$programs{$pt}\n";

There is no difference. I check if $programs{$pt} exists before trying to print $programs{$pt} because if I try to print a hash value for a key that doesn't exist in the hash perl will give me an error message. For an introduction to using hashes in Perl see http://www.perltutorial.org/perl-hash.aspx

red711 9 Newbie Poster · Answer 3 · 2011-09-02T04:12:01+00:00

Thanks so much for the detailed explanations and the tutorial! I really appreciate it.

alexei_ 0 Newbie Poster · Answer 4 · 2014-09-21T02:28:36+00:00

Hi,

First thanks a lot for the code, it matches something I am trying to reach.

I know this topic is old but I thought someone good at perl might connect :)

I am kinda stuck with this script trying to do something very similar.

My first tsv file is a list of servers / hostnames that looks like this (groups.tsv):

Groupname | Servername

group1      server1     
group2      server2     
group3      server3 server4 server5
group4      server6     
group5      server7     
group6      server7

Second one is a list of IPs / hostnames (addresses.tsv):

Hostname | IP | Purpose

server1    192.168.1.29 255.255.255.255    application1
server2    172.16.9.38 255.255.255.255  
server3    10.50.110.28 255.255.255.255 
server4    10.0.0.1 255.255.255.255        application2
server5    192.168.1.1 255.255.255.255  
server6    172.16.32.32 255.255.255.255 
server7    10.0.0.3 255.255.255.255 
server8    1.1.1.1 255.255.255.255         application3
server9    127.0.0.1 255.255.255.255    
server10   10.28.28.28 255.255.255.255

I am trying to obtain a new tsv file like this:

Hostname | IP | Purpose | Groupname

server1    192.168.1.29 255.255.255.255    application1     group1
...

At some point I have partly retrieved groupnames but it was because I have it confused with inverted key / values. I believe I would rather have to check on the values of the hash instead of the key, but I got no success so far.

For now I am having this result.tsv:

server1 192.168.1.29 255.255.255.255    application1    *NA*
server2 172.16.9.38 255.255.255.255     *NA*
server3 10.50.110.28 255.255.255.255    *NA*
....

In the same spirit, I will then have a policy.tsv

Groupname | Service   
group1  ALL
group2  HTTP
group3  HTTP HTTPS TCP_3389
group3 group4   SSH

So my final goal would be to have a final tsv file like this:

Hostname | IP | Purpose | Groupname | Service

server1    192.168.1.29 255.255.255.255    application1     group1      ALL
...

http://www46.zippyshare.com/v/30272792/file.html

But I assume it might be easier to adapt as soon as the first part is OK.

Anybody clear with perl / hash could put me on a lead ?

Thanks in advance for the help!

2teez 43 Posting Whiz · Answer 5 · 2014-09-25T00:37:39+00:00

Hi Alexei,
As you saw on this thread. This particular question was solved 3 years ago.
You should start a new thread where others could see your question clearly and be able to help you out.