Dear Sir,
I would like to extract only the terms that are unique across all sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File, Array Design REF and so on are repeated in most of the sdrf.txt files, so I do not want them printed as unique terms. Could you also please tell me how to ignore case in Python? At present Characteristics[OrganismPart] is printed as a distinct term from Characteristics[organism part], and likewise Characteristics[Sex] from Characteristics[sex]. I look forward to your support and reply.
Regards,
Haobijam
#!/usr/bin/python
import glob

outfile = open('output.txt', 'w')
files = glob.glob('*.sdrf.txt')
previous = set()  # terms already seen in earlier files
for file in files:
    print('\n' + file)
    infile = open(file)
    # previous = set()  # uncomment this if terms do not need to be unique between the files
    for line in infile:
        # Only the header row, which starts with 'Source Name', is of interest
        if not line.startswith('Source Name'):
            continue
        header = line.rstrip('\n')
        outfile.write('%s\t\n' % header)
        # Column names in this header that have not appeared in a previous file
        uniqwords = set(word.strip() for word in header.split('\t')
                        if word.strip() and word.strip() not in previous)
        print('The %i unique terms are:\n\t%s' % (len(uniqwords), '\n\t'.join(sorted(uniqwords))))
        previous |= uniqwords
    infile.close()
outfile.close()
print('=' * 80)
print('The %i terms are:\n\t%s' % (len(previous), '\n\t'.join(sorted(previous))))
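
One possible direction, offered only as a rough sketch: fold each term to a common form before comparing it, and count how many files each term occurs in, so that only terms found in a single file are reported. The helper names normalize and terms_in_one_file and the sample header lines are hypothetical and are not part of the script above. The sketch also assumes that differences in spacing (OrganismPart versus organism part) should be ignored along with case.

```python
import re
from collections import Counter

def normalize(term):
    """Fold case and drop whitespace so spelling variants compare equal."""
    return re.sub(r'\s+', '', term).lower()

def terms_in_one_file(headers_by_file):
    """Return the normalized terms that occur in exactly one file's header."""
    counts = Counter()
    for header in headers_by_file.values():
        # A set ensures each file contributes at most once per term.
        counts.update({normalize(t) for t in header.split('\t') if t.strip()})
    return sorted(term for term, n in counts.items() if n == 1)

# Hypothetical header lines, for illustration only.
headers = {
    'a.sdrf.txt': 'Source Name\tCharacteristics[OrganismPart]\tArray Data File',
    'b.sdrf.txt': 'Source Name\tCharacteristics[organism part]\tCharacteristics[Sex]',
}
print(terms_in_one_file(headers))
```

In the real script the dictionary would presumably be filled from the 'Source Name' line of each file returned by glob.glob. Note that the reported terms are in their normalized form; keeping the original spelling would need an extra mapping from normalized to original term.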