Extract multiple columns from multiple files; skip rows and include filenames; awk

Question

biobee07 0 Newbie Poster

15 Years Ago

Hello,

I am trying to write a bash shell script that does the following, and I would really appreciate it if someone could help me correct the code I have written below:

1. Finds all *.txt files within my directory of interest (the files are in sub-directories)
2. Reads each of the files (25 files) one by one (all are tab-delimited and share the same data format)
3. Skips the first 10 rows of each file
4. Extracts columns 2, 14, and 15 and prints them into one output file
5. Adds a new column to the final output file with the name of the txt file from which the data was extracted

I have written a shell script, but it is not working properly and does not yet include the code to skip the 10 rows.

Below I have pasted a sample input file, the expected output file, and my code.
Input file format: The actual data starts from the line:
DATA 1 1 1 0

TYPE	text	text	text	text	integer	float	float	text	text	text	integer	integer	integer	integer
FEPARAMS	Protocol_Name	Protocol_date	Scan_Date	Scan_ScannerName	Scan_NumChannels	Scan_MicronsPerPixelX	Scan_MicronsPerPixelY	Scan_OriginalGUID	Grid_Name	Grid_Date	Grid_NumSubGridRows	Grid_NumSubGridCols	Grid_NumRows	Grid_NumCols
DATA	miRNA-v1_95_May07 (Read Only)	5/2/2007 12:14	1/26/2008 11:25	Agilent Technologies Scanner G2505B US45102930	1	5	5	a18d8bd4-628a-4054-b2ba-45c7a66de583	016436_D_20070426	4/26/2007 0:00	1	1	192	82
*														
TYPE	float	float	float	integer	integer	float	integer	float	float	float	integer	float	float	integer
STATS	gDarkOffsetAverage	gDarkOffsetMedian	gDarkOffsetStdDev	gDarkOffsetNumPts	gSaturationValue	gAvgSig2BkgNegCtrl	gNumSatFeat	gLocalBGInlierNetAve	gLocalBGInlierAve	gLocalBGInlierSDev	gLocalBGInlierNum	gGlobalBGInlierAve	gGlobalBGInlierSDev	gGlobalBGInlierNum
DATA	26.709	27	5.44777	1000	1203179	1.11899	0	38.7173	65.4263	2.95429	12029	65.4263	2.95429	12029
*														
TYPE	integer	integer	integer	text	integer	text	integer	integer	text	text	text	text	float	float
FEATURES	FeatureNum	Row	Col	chr_coord	SubTypeMask	SubTypeName	ProbeUID	ControlType	ProbeName	GeneName	SystematicName	Description	PositionX	PositionY
DATA	1	1	1		0		0	1	miRNABrightCorner30	miRNABrightCorner30	miRNABrightCorner30		6774.29	228.723
DATA	2	1	2		66	Structural	2	1	DarkCorner	DarkCorner	DarkCorner		6800.2	229.421
DATA	3	1	3	chr14:100595916-100595897	0		3	0	A_25_P00010115	hsa-miR-154*	hsa-miR-154*	NA	6826.51	228.385
DATA	4	1	4	chr8:135881995-135882010	0		5	0	A_25_P00010390	hsa-miR-30b	hsa-miR-30b	NA	6850.48	228.853

Output format: tab-delimited file. The last column shows the name of the file from which the data was extracted.

1 6774.29 228.723 ABC.txt 
2 6800.2 229.421 ABC.txt 
3 6826.51 228.385 DEF.txt 
4 6850.48 228.853 DEF.txt 
5 6875.37 228.408 XYZ.txt 
6 6900.98 229.321 XYZ.txt

My incomplete code: it is missing the row-skipping step. It also throws an error:

'test1.sh: line 3: syntax error near unexpected token `do
'test1.sh: line 3: `do

for filename in $(find -iname '*.txt') 
do
 awk -F"\t" ' 
    BEGIN {OFS="|"} {print $2,$14,$15,FILENAME}
    ' $filename > output.txt
done

shell-scripting

sknake 1,622 Senior Poster Featured Poster · Answer 1 · 2009-08-20T11:34:48+00:00

The data actually starts on line 14 from what I could tell, not line 10. Try this:

sk@sk:/tmp/txt$ cat ba.sh
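#!/bin/bash
# Loop over every .txt file under the current directory (case-insensitive match)
for i in `find ./ -iname \*.txt`;
do
  # Start at line 14, drop empty lines, then print columns 2, 14, 15 and the file name
  more +14 "${i}" | egrep -v '^$' | awk -v filename="${i}" -F"\t" '  BEGIN {OFS="|"} {print $2,$14,$15,filename} '
done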

sk@sk:/tmp/txt$ ./ba.sh
1|6774.29|228.723|./file.txt
2|6800.2|229.421|./file.txt
3|6826.51|228.385|./file.txt
4|6850.48|228.853|./file.txt
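
For what it's worth, the same result can probably be had from a single awk call, which also uses tabs instead of | as the output separator, as the question asked. This is only a sketch under a few assumptions: the function name extract_columns is a placeholder of mine, the header is taken to be exactly 10 lines as described in the question (pass 13 as the second argument if the data really starts on line 14), and the files do not have Windows line endings.

```shell
#!/bin/bash
# Sketch: print columns 2, 14 and 15 plus the source file name for every
# *.txt file under a directory, skipping the first `skip` lines of each file.
# extract_columns is a hypothetical helper name, not from the original post.
extract_columns() {
  local dir=$1 skip=${2:-10}
  # find passes the file names straight to awk, so names with spaces are safe.
  # FNR restarts at 1 for each input file; the NF test drops empty lines.
  find "$dir" -type f -iname '*.txt' -exec awk -F'\t' -v skip="$skip" '
    BEGIN { OFS = "\t" }
    FNR > skip && NF { print $2, $14, $15, FILENAME }
  ' {} +
}

# Example call (both paths are assumed names). Write the result outside the
# searched directory so find does not pick up the output file itself:
# extract_columns /path/to/data 10 > /tmp/output.txt
```

Redirecting once, outside any loop, is what puts everything into one output file; a redirect inside a for loop, as in the original attempt, overwrites the file on every pass. As for the syntax error near `do`, the stray quote at the start of that message is often a sign that the script was saved with Windows (CRLF) line endings, in which case something like tr -d '\r' < test1.sh > fixed.sh may be enough to clear it.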