Hi,
I'm new to Perl and need help comparing two columns across two different .tsv files.
I have searched the forum for cases similar to mine, such as http://www.daniweb.com/software-development/perl/threads/335711, http://www.daniweb.com/software-development/perl/threads/336421 and http://www.daniweb.com/software-development/perl/threads/311399. I tried to modify them to fit my needs, but the final output is still not right, so I decided to start a new thread.
I have 2 input tsv files,
The first one is archive.tsv. It is a paper database with the format of
<Paper ID>\t<Paper Title>\t<Author>\t<url>
P08-1016 Lexicalized Phonotactic Word Segmentation Margaret M. Fleck [url]http://aclweb.org/anthology-new/P/P08/P08-1016.pdf[/url]
P08-1021 Correcting Misuse of Verb Forms John Lee; Stephanie Seneff [url]http://aclweb.org/anthology-new/P/P08/P08-1021.pdf[/url]
P08-1030 Refining Event Extraction through Cross-Document Inference Heng Ji; Ralph Grishman [url]http://aclweb.org/anthology-new/P/P08/P08-1030.pdf[/url]
P08-1038 A Logical Basis for the D Combinator and Normal Form in CCG Frederick Hoyt; Jason Baldridge [url]http://aclweb.org/anthology-new/P/P08/P08-1038.pdf[/url]
P08-1039 Parsing Noun Phrase Structure with CCG David Vadas; James R. Curran [url]http://aclweb.org/anthology-new/P/P08/P08-1039.pdf[/url]
P08-1040 Sentence Simplification for Semantic Role Labeling David Vickrey; Daphne Koller [url]http://aclweb.org/anthology-new/P/P08/P08-1040.pdf[/url]
P08-1042 Ad Hoc Treebank Structures Markus Dickinson [url]http://aclweb.org/anthology-new/P/P08/P08-1042.pdf[/url]
P08-3003 Inferring Activity Time in News through Event Modeling Vladimir Eidelman [url]http://aclweb.org/anthology-new/P/P08/P08-3003.pdf[/url]
P08-5003 Semi-Supervised Learning for Natural Language Processing John Blitzer; Xiaojin Jerry Zhu [url]http://aclweb.org/anthology-new/P/P08/P08-5003.pdf[/url]
P08-5004 Advanced Online Learning for Natural Language Processing Koby Crammer [url]http://aclweb.org/anthology-new/P/P08/P08-5004.pdf[/url]
and the second file is program.tsv. It is a conference program database with the format of
<Program Session>\t<Paper Title>\t<Author>
Information Extraction 2 Refining Event Extraction through Cross-Document Inference Ji, Heng; Ralph Grishman
Syntax & Parsing 1 A Logical Basis for the D Combinator and Normal Form in CCG Hoyt, Frederick; Jason Baldridge
Speech Processing Lexicalized Phonotactic Word Segmentation Fleck, Margaret M.
Syntax & Parsing 1 Parsing Noun Phrase Structure with CCG Vadas, David; James R. Curran
Syntax & Parsing 1 Sentence Simplification for Semantic Role Labeling Vickrey, David; Daphne Koller
Student Research Workshop A Supervised Learning Approach to Automatic Synonym Identification Based on Distributional Features Hagiwara, Masato
Student Research Workshop An Integrated Architecture for Generating Parenthetical Constructions Banik, Eva
The data posted here is only a sample, which I think is enough to illustrate my question.
My objectives:
1. I want to match the Paper Title on each line of archive.tsv against the Paper Title in program.tsv.
If a title in the archive file exists in the program file, I should print the archive line exactly as it is, then append the corresponding session from the program file as the last column.
If a title in the archive file does not exist in the program file, I should print the archive line exactly as it is, then append *NA* as the session (again as the last column).
2. If a title appears in program.tsv but not in archive.tsv, it should be ignored.
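As a point of reference, the two rules above amount to a lookup keyed on the title. The following is only a minimal sketch of that behaviour, under a few assumptions that are not confirmed anywhere in this thread: titles match exactly apart from letter case, each line has already been chomped, and the first session found for a title wins. The sub name add_sessions is hypothetical, and the two short arrays stand in for the real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the linked threads): takes two array refs of
# tab-separated lines and returns the archive lines with a session appended.
sub add_sessions {
    my ($archive_lines, $program_lines) = @_;

    # Build a lookup table: lowercased title => session.
    my %session_for;
    for my $line (@$program_lines) {
        my ($session, $title) = split /\t/, $line;
        next unless defined $title;
        my $key = lc($title);
        # Assumption: keep the first session seen for a given title.
        $session_for{$key} = $session unless exists $session_for{$key};
    }

    # Walk the archive once, in its original order, so every archive line
    # is printed exactly once and program-only titles are never emitted.
    my @out;
    for my $line (@$archive_lines) {
        my (undef, $title) = split /\t/, $line;
        my $session = defined($title) ? $session_for{ lc($title) } : undef;
        push @out, join("\t", $line, (defined($session) ? $session : '*NA*'));
    }
    return @out;
}

# Two archive lines and two program lines taken from the samples above.
my @archive = (
    "P08-1016\tLexicalized Phonotactic Word Segmentation\tMargaret M. Fleck\thttp://aclweb.org/anthology-new/P/P08/P08-1016.pdf",
    "P08-1021\tCorrecting Misuse of Verb Forms\tJohn Lee; Stephanie Seneff\thttp://aclweb.org/anthology-new/P/P08/P08-1021.pdf",
);
my @program = (
    "Speech Processing\tLexicalized Phonotactic Word Segmentation\tFleck, Margaret M.",
    "Student Research Workshop\tAn Integrated Architecture for Generating Parenthetical Constructions\tBanik, Eva",
);

print "$_\n" for add_sessions(\@archive, \@program);
```

With these four lines, the first archive line should come out with Speech Processing appended and the second with *NA*, while the Student Research Workshop entry is dropped because its title is not in the archive.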
Here's my code:
#!/usr/bin/perl
open(FILE1,"<archive.tsv");
open(FILE2,"<program.tsv");
open(OUT,">combine.tsv");
my @array1=<FILE1>;
my @array2=<FILE2>;
close FILE1;
close FILE2;
for (@array2){
    chomp;
    my($session,$titleprog,$authorprog)=split(/\t/);
    push(@sesname,"$session");
    push(@title2, "$titleprog");
}
for (@array1){
    chomp;
    my($paperid,$titlearcv,$authorarcv,$url)=split(/\t/);
    push(@all, "$_");
    push(@title1, "$titlearcv");
}
$count = 0;
foreach my $progline (@title2){
    foreach my $arcline (@all){
        if ($arcline =~ m/$progline/i){
            print OUT "$arcline\t$sesname[$i]\n";
            $count++;
        }
        else {$count++;}
        print OUT "$arcline\t*NA*\n" if ($count == $#all);
    }
    $i++;
}
close OUT;
and I got the following output:
P08-1030 Refining Event Extraction through Cross-Document Inference Heng Ji; Ralph Grishman [url]http://aclweb.org/anthology-new/P/P08/P08-1030.pdf[/url] Information Extraction 2
P08-5003 Semi-Supervised Learning for Natural Language Processing John Blitzer; Xiaojin Jerry Zhu [url]http://aclweb.org/anthology-new/P/P08/P08-5003.pdf[/url] *NA*
P08-1038 A Logical Basis for the D Combinator and Normal Form in CCG Frederick Hoyt; Jason Baldridge [url]http://aclweb.org/anthology-new/P/P08/P08-1038.pdf[/url] Syntax & Parsing 1
P08-1016 Lexicalized Phonotactic Word Segmentation Margaret M. Fleck [url]http://aclweb.org/anthology-new/P/P08/P08-1016.pdf[/url] Speech Processing
P08-1039 Parsing Noun Phrase Structure with CCG David Vadas; James R. Curran [url]http://aclweb.org/anthology-new/P/P08/P08-1039.pdf[/url] Syntax & Parsing 1
P08-1040 Sentence Simplification for Semantic Role Labeling David Vickrey; Daphne Koller [url]http://aclweb.org/anthology-new/P/P08/P08-1040.pdf[/url] Syntax & Parsing 1
My desired output should be like this:
format: <Paper ID>\t<Paper Title>\t<Author>\t<url>\t<Program Session>
P08-1016 Lexicalized Phonotactic Word Segmentation Margaret M. Fleck [url]http://aclweb.org/anthology-new/P/P08/P08-1016.pdf[/url] Speech Processing
P08-1021 Correcting Misuse of Verb Forms John Lee; Stephanie Seneff [url]http://aclweb.org/anthology-new/P/P08/P08-1021.pdf[/url] *NA*
P08-1030 Refining Event Extraction through Cross-Document Inference Heng Ji; Ralph Grishman [url]http://aclweb.org/anthology-new/P/P08/P08-1030.pdf[/url] Information Extraction 2
P08-1038 A Logical Basis for the D Combinator and Normal Form in CCG Frederick Hoyt; Jason Baldridge [url]http://aclweb.org/anthology-new/P/P08/P08-1038.pdf[/url] Syntax & Parsing 1
P08-1039 Parsing Noun Phrase Structure with CCG David Vadas; James R. Curran [url]http://aclweb.org/anthology-new/P/P08/P08-1039.pdf[/url] Syntax & Parsing 1
P08-1040 Sentence Simplification for Semantic Role Labeling David Vickrey; Daphne Koller [url]http://aclweb.org/anthology-new/P/P08/P08-1040.pdf[/url] Syntax & Parsing 1
P08-1042 Ad Hoc Treebank Structures Markus Dickinson [url]http://aclweb.org/anthology-new/P/P08/P08-1042.pdf[/url] *NA*
P08-3003 Inferring Activity Time in News through Event Modeling Vladimir Eidelman [url]http://aclweb.org/anthology-new/P/P08/P08-3003.pdf[/url] *NA*
P08-5003 Semi-Supervised Learning for Natural Language Processing John Blitzer; Xiaojin Jerry Zhu [url]http://aclweb.org/anthology-new/P/P08/P08-5003.pdf[/url] *NA*
P08-5004 Advanced Online Learning for Natural Language Processing Koby Crammer [url]http://aclweb.org/anthology-new/P/P08/P08-5004.pdf[/url] *NA*
Can anyone please help me figure out what is wrong with my code?