Hi guys,
I am writing a Perl script.
I have a text file, vec.txt, which looks something like this:
<T> chemical- and bio-terrorism </T>
<T> <C> nerve agents </C> </T> , <T> toxic proteins </T>
<T> toxic protein </T>
<T> terroristic chemical attacks </T>
<T> chemical terroristic attacks </T>
<T> terroristic bombing </T> . To fight <T> chemical terrorism </T>
<T> antiterroristic </T>
<T> terroristic chemical attack </T>
<T> chemical terrorism </T>
<T> chemical-related events </T>
<T> terrorism attacks </T>
<T> human poisoning </T>
<T> toxicology </T>
<T> terrorist group </T>
<T> casualties </T> living close to the <C> sarin </C> release <T> died </T>
<T> kill </T>
<T> murdered </T>
<T> coordinated attack </T>
<T> "casualties" </T>
<T> poisoned </T>
<T> mass casualty </T>
<T> terrorism preparedness </T>
<T> chemical </T>
<T> radiological </T>
<T> anthrax, chemical, and radiological exposures </T>
<T> chemical </T>
<T> terrorism </T>
<T> radiological terrorism </T>
<T> radiological terrorism </T>
<T> suicide scenarios </T>
<T> Weapons of mass destruction </T>
<T> weapons of mass destruction </T>
<T> chemical terrorism </T>
<T> radiological terrorism </T>
<T> chemical, or radiological terrorism </T>
<T> radiological terrorism </T>
<T> acts of terrorism </T>
<T> terrorism </T> scenario is the use of a conventional <T> explosive </T>
The script should extract whatever is inside each <T> ... </T> pair and write it to a separate file, vec2.txt, one entry per line.
The script is as follows:
#!/usr/bin/perl
#PERL SCRIPT BEGINS
open(FILE,"vec.txt");
open(FF,">vec2.txt");
while (<FILE>)
{
    chomp($_);
    @arr=split("",$_);
    $len=@arr;
    for ($i=0;$i<$len;$i++)
    {
        if (($arr[$i]=="<")&&($arr[$i+1]=="T")&&($arr[$i+2]==">"))
        {
            $line="";
            chomp($line);
            $i=$i+4;
            do
            {
                $line=$line.$arr[$i++];
                chomp($line);
            }
            while (!(($arr[$i]=="<")&&($arr[$i+1]=="/")&&($arr[$i+2]=="T")&&($arr[$i+3]==">")));
            print FF "$line\n";
        }
        else
        {
            next;
        }
    }
}
close FILE;
close FF;
Now I expect the output to be something like:
chemical- and bio-terrorism
<C> nerve agents </C>
toxic proteins
toxic protein
terroristic chemical attacks
chemical terroristic attacks
terroristic bombing
chemical terrorism
antiterroristic
terroristic chemical attack
chemical terrorism
chemical-related events
.....
But instead what I get is:
a
d
t
i
T
<
r
e
/
T
T
i
t
<
t
p
n
t
i
c
a
a
/
c
a
r
i
a
/
t
i
b
g
.....
Please tell me what is going wrong.
Any help would be appreciated.
Thanks.
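
A note on the likely cause, for anyone hitting the same thing: in Perl, == compares numbers, while eq compares strings. Non-numeric strings such as "<", "T" and ">" are all treated as 0 by ==, so the tests in the script are true at almost every position, and the loop ends up printing single characters. Replacing == with eq in those comparisons should fix the character-by-character version (running with "use warnings" reports the problem as a "isn't numeric" warning). Below is a minimal sketch of a shorter alternative that uses a regular expression instead of walking the characters. The sub name extract_t_spans, the script name extract_t.pl and the use of command-line arguments for the file names are my own assumptions for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the text found
# between each <T> and </T> pair on one line, without the surrounding spaces.
sub extract_t_spans {
    my ($line) = @_;
    my @spans;
    # Non-greedy (.*?) keeps two <T> pairs on the same line separate.
    while ($line =~ m{<T>\s*(.*?)\s*</T>}g) {
        push @spans, $1;
    }
    return @spans;
}

# Assumed usage: perl extract_t.pl vec.txt vec2.txt
# The file conversion runs only when both file names are given.
if (@ARGV == 2) {
    my ($in_name, $out_name) = @ARGV;
    open(my $in,  '<', $in_name)  or die "Cannot open $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot open $out_name: $!";
    while (my $line = <$in>) {
        chomp $line;
        # One extracted entry per output line.
        print {$out} "$_\n" for extract_t_spans($line);
    }
    close $in;
    close $out or die "Cannot close $out_name: $!";
}
```

With this sketch, a line containing two <T> pairs yields two output lines, and anything nested inside a pair, such as <C> nerve agents </C>, is kept as it is. Text outside the <T> tags is dropped.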