amnakhan786 0 Newbie Poster

Hi,
I am trying to parse a collection of text files that look like the following. There are a number of them, but not all are structured exactly the same way. If I can figure out how to parse this file, I can handle the rest. I am a novice in bash, and I want to extract only the SizeInBytes column, but the document is not uniform; even the delimiters are not consistent. For example, between the collection name and the number of documents the delimiter is <tab,tab,space>, but the delimiter option accepts only a single character. I would really like to know how to handle this kind of irregular parsing.

<COMMENT> TestbedName:          trec123-100-sample300-callan99
<COMMENT>
<COMMENT> RevisionHistory:
<COMMENT> v1a, October 28, 1999:
<COMMENT>   - Initial release.
<COMMENT>
<COMMENT> NumberOfCollections:            100
<COMMENT> NumberOfDocuments:           30,000
<COMMENT> SizeInBytes:            270,145,023
<COMMENT>
<COMMENT> CollectionName   NumberOfDocuments  SizeInBytes
<COMMENT>    ap88_1      300        805,137
<COMMENT>    ap88_2      300        762,665
<COMMENT>    ap88_3      300        787,242
<COMMENT>    ap88_4      300        823,265
<COMMENT>    ap88_5      300        728,748
<COMMENT>    ap88_6      300        835,178
<COMMENT>    ap88_7      300        751,507
<COMMENT>    ap88_8      300        788,955
<COMMENT>    ap89_1      300        853,772
<COMMENT>    ap89_2      300        824,850
<COMMENT>    ap89_3      300        804,463
<COMMENT>    ap89_4      300        807,761
<COMMENT>    ap89_5      300        885,290
<COMMENT>    ap89_6      300        838,144
<COMMENT>    ap89_7      300        793,021
<COMMENT>    ap89_8      300        834,519
<COMMENT>    ap90_1      300        774,047
<COMMENT>    ap90_2      300        875,806
<COMMENT>    ap90_3      300        874,445
<COMMENT>    ap90_4      300        845,463
<COMMENT>    ap90_5      300        776,059
<COMMENT>    ap90_6      300        820,116
<COMMENT>    ap90_7      300        810,200
<COMMENT>    ap90_8      300        794,558
<COMMENT>    doe_1       300        229,975
<COMMENT>    doe_2       300        245,362
<COMMENT>    doe_3       300        255,834
<COMMENT>    doe_4       300        236,867
<COMMENT>    doe_5       300        255,995
<COMMENT>    doe_6       300        255,932
<COMMENT>    fr88_1      300      5,227,664
<COMMENT>    fr88_2      300     12,109,095
<COMMENT>
<COMMENT>  ap88_1:  300 docs, 805137 bytes
AP880224-0321 ap88_1
AP880218-0282 ap88_1
AP880225-0251 ap88_1
AP880217-0097 ap88_1
AP880324-0012 ap88_1
AP880322-0004 ap88_1
AP880217-0216 ap88_1
AP880220-0003 ap88_1
AP880309-0328 ap88_1
AP880319-0122 ap88_1
AP880321-0192 ap88_1
AP880225-0287 ap88_1
AP880319-0135 ap88_1
AP880322-0152 ap88_1
AP880222-0259 ap88_1
AP880222-0246 ap88_1
AP880223-0121 ap88_1
AP880225-0047 ap88_1
AP880312-0124 ap88_1
AP880311-0326 ap88_1
AP880219-0203 ap88_1
AP880319-0036 ap88_1
AP880316-0152 ap88_1

AP880219-0037 ap88_1
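
One possible approach, offered as a minimal sketch rather than a definitive answer: awk's default field splitting treats any run of spaces and tabs as a single separator, so the mixed <tab,tab,space> delimiters stop being a problem. The function name `extract_sizes` is made up for illustration, and the sketch assumes that every table row has exactly four fields (`<COMMENT>`, collection name, document count, size) with the last two made of digits and commas only, as in the sample above. Files that are laid out differently would need a different test.

```bash
#!/usr/bin/env bash

# extract_sizes (hypothetical helper): print the SizeInBytes value of each
# per-collection table row, one per line, with the thousands commas removed.
# Reads the files given as arguments, or standard input if there are none.
extract_sizes() {
    awk '
        # Default splitting: any mix of spaces and tabs separates fields.
        # Assumption: a table row is "<COMMENT> name count size".
        $1 == "<COMMENT>" && NF == 4 && $3 ~ /^[0-9,]+$/ && $4 ~ /^[0-9,]+$/ {
            gsub(/,/, "", $4)   # 805,137 -> 805137
            print $4
        }
    ' "$@"
}
```

It could be called as `extract_sizes testbed.txt`, where `testbed.txt` stands in for one of the real files. The header row, the revision history lines and the document ID lines are skipped because they do not match the four-field numeric pattern.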
