Hi,
I am trying to parse a collection of text files that look like the sample below. There are a number of them, and not all are structured exactly the same way, but if I can figure out how to parse this file I can handle the rest. I am a novice in bash, and I want to extract only the size-in-bytes column. The document is not uniform, though, and even the delimiters are not the same. For example, between the collection name and the number of documents the delimiter is <tab,tab,space>, but the delimiter option accepts only a single character. I would really like to know how to handle this kind of parsing.
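One possible starting point, as a rough sketch and not a definitive solution: by default awk splits fields on any run of spaces and tabs, so a mixed delimiter such as <tab,tab,space> does not have to be spelled out. The function name `extract_sizes` is made up for illustration. The sketch assumes that every table row has exactly four fields (`<COMMENT>`, the collection name, the document count, and the size) and that sizes contain only digits and commas. Files with a different layout would need a different filter.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the SizeInBytes value of each table row,
# one per line, with the thousands separators removed.
# Assumption: table rows have exactly four whitespace-separated fields.
extract_sizes() {
  awk '
    # Default field splitting treats any run of spaces/tabs as one delimiter.
    # Keep only rows whose 3rd field is a plain integer and whose 4th field
    # consists of digits and commas; header and summary lines do not match.
    $1 == "<COMMENT>" && NF == 4 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9,]+$/ {
      gsub(/,/, "", $4)   # 805,137 becomes 805137
      print $4
    }
  ' "$@"
}
```

It could be called as `extract_sizes somefile.txt` (the file name is a placeholder), or used at the end of a pipeline, since it reads standard input when no file is given.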
<COMMENT> TestbedName: trec123-100-sample300-callan99
<COMMENT>
<COMMENT> RevisionHistory:
<COMMENT> v1a, October 28, 1999:
<COMMENT> - Initial release.
<COMMENT>
<COMMENT> NumberOfCollections: 100
<COMMENT> NumberOfDocuments: 30,000
<COMMENT> SizeInBytes: 270,145,023
<COMMENT>
<COMMENT> CollectionName NumberOfDocuments SizeInBytes
<COMMENT> ap88_1 300 805,137
<COMMENT> ap88_2 300 762,665
<COMMENT> ap88_3 300 787,242
<COMMENT> ap88_4 300 823,265
<COMMENT> ap88_5 300 728,748
<COMMENT> ap88_6 300 835,178
<COMMENT> ap88_7 300 751,507
<COMMENT> ap88_8 300 788,955
<COMMENT> ap89_1 300 853,772
<COMMENT> ap89_2 300 824,850
<COMMENT> ap89_3 300 804,463
<COMMENT> ap89_4 300 807,761
<COMMENT> ap89_5 300 885,290
<COMMENT> ap89_6 300 838,144
<COMMENT> ap89_7 300 793,021
<COMMENT> ap89_8 300 834,519
<COMMENT> ap90_1 300 774,047
<COMMENT> ap90_2 300 875,806
<COMMENT> ap90_3 300 874,445
<COMMENT> ap90_4 300 845,463
<COMMENT> ap90_5 300 776,059
<COMMENT> ap90_6 300 820,116
<COMMENT> ap90_7 300 810,200
<COMMENT> ap90_8 300 794,558
<COMMENT> doe_1 300 229,975
<COMMENT> doe_2 300 245,362
<COMMENT> doe_3 300 255,834
<COMMENT> doe_4 300 236,867
<COMMENT> doe_5 300 255,995
<COMMENT> doe_6 300 255,932
<COMMENT> fr88_1 300 5,227,664
<COMMENT> fr88_2 300 12,109,095
<COMMENT>
<COMMENT> ap88_1: 300 docs, 805137 bytes
AP880224-0321 ap88_1
AP880218-0282 ap88_1
AP880225-0251 ap88_1
AP880217-0097 ap88_1
AP880324-0012 ap88_1
AP880322-0004 ap88_1
AP880217-0216 ap88_1
AP880220-0003 ap88_1
AP880309-0328 ap88_1
AP880319-0122 ap88_1
AP880321-0192 ap88_1
AP880225-0287 ap88_1
AP880319-0135 ap88_1
AP880322-0152 ap88_1
AP880222-0259 ap88_1
AP880222-0246 ap88_1
AP880223-0121 ap88_1
AP880225-0047 ap88_1
AP880312-0124 ap88_1
AP880311-0326 ap88_1
AP880219-0203 ap88_1
AP880319-0036 ap88_1
AP880316-0152 ap88_1
AP880219-0037 ap88_1