Dear All

I am working with tab-delimited files. The first two columns are pairs of identifiers, and the third column is a floating-point number (0-1) denoting the interaction strength between column1 and column2. A sample file is also attached.


For example:

column1 column2 column3

john steve 0.67588
john matt 1.00
red blue 0.90
yellow steve 0.02

and so on...

From this file I need to know, for each unique identifier in the first column, how many connections it has in the second column (for example, john has two connections, and red and yellow have one each) at different thresholds on column3 from 0 to 1, and then write the output to a file.

The final output should look something like this:

for column3 >= 0.6

john 2
red 1
yellow 0

Thanks..

From your description, column2 seems to be unnecessary. If that is the case, you
can read and store your data in a dictionary in the structure below:

connections = {'john': [0.67588, 1.00],
               'red': [0.9],
               'yellow': [0.02],
              }

To print the number of connections per identifier with column3 > 0.06 (substitute whatever threshold you need):

for name in connections:
    print(name, len([x for x in connections[name] if x > 0.06]))
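The reading step is not shown above. Here is a minimal sketch in Python 3, assuming a tab-delimited file with a header row like the sample. The function names `load_connections` and `count_at_least` and the file name `interactions.tsv` are made up for illustration, and the comparison uses >= as in the question rather than > as in the snippet above.

```python
import csv
from collections import defaultdict

def load_connections(path, has_header=True):
    """Map each column1 identifier to the list of its column3 strengths."""
    connections = defaultdict(list)
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the "column1 column2 column3" line
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            connections[row[0]].append(float(row[2]))
    return connections

def count_at_least(connections, threshold):
    """Count, per identifier, the strengths that are >= threshold."""
    return {name: sum(1 for s in strengths if s >= threshold)
            for name, strengths in connections.items()}
```

Something like `count_at_least(load_connections("interactions.tsv"), 0.6)` would then give the per-identifier counts at that threshold. Identifiers with no strength above the threshold still appear, with a count of 0.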

There are several ways to do this. One is a dictionary that maps each key to a list holding the number of records found for that key and the interaction strengths found, which you can then test against a threshold (if I am reading the question correctly). You could also use two dictionaries, one as the counter and one to hold the strengths, if that is easier to understand. An SQL database would be in order if this is a large data set. You could also create a class instance for each unique name, but that is probably more trouble and more confusing than the other solutions. A simple example:

test_list = [
"john steve 0.67588",
"john matt 1.00",
"red blue 0.90",
"yellow steve 0.02" ]

test_dict = {}
for rec in test_list:
    substrs = rec.split()
    key = substrs[0]
    if key not in test_dict:
        ## add the new key, counter=1, and a list containing the strength
        test_dict[key] = [1, [float(substrs[2])]]
    else:
        test_dict[key][0] += 1     ## add one to the counter (position zero in the list)
        test_dict[key][1].append(float(substrs[2]))   ## interaction strength

for key in test_dict:
    ## print each strength > 0.5 with its key and the key's total record count
    values = test_dict[key][1]
    for v in values:
        if v > 0.5:
            print(key, v, test_dict[key][0])
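For the large-data-set case mentioned above, here is a minimal sketch using Python 3's built-in `sqlite3` module. The table name `interactions`, its column names, and the helper `counts_for` are made up for illustration, and an in-memory database stands in for a real file.

```python
import sqlite3

rows = [("john", "steve", 0.67588),
        ("john", "matt", 1.00),
        ("red", "blue", 0.90),
        ("yellow", "steve", 0.02)]

conn = sqlite3.connect(":memory:")  # use a file path instead for real data
conn.execute("CREATE TABLE interactions (name TEXT, partner TEXT, strength REAL)")
conn.executemany("INSERT INTO interactions VALUES (?, ?, ?)", rows)

def counts_for(conn, threshold):
    """Return (name, count) pairs; names with no match still appear with 0."""
    # In SQLite a comparison evaluates to 1 or 0, so SUM counts the matches.
    query = ("SELECT name, SUM(strength >= ?) FROM interactions "
             "GROUP BY name ORDER BY name")
    return conn.execute(query, (threshold,)).fetchall()

print(counts_for(conn, 0.6))
```

Summing the comparison, rather than filtering with WHERE, is what keeps identifiers such as yellow in the result with a count of 0.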

This would produce your desired output:

import itertools as it

data="""john steve 0.67588
john matt 1.00
red blue 0.90
yellow steve 0.02"""

data_pairs = sorted((first, value)
                    for first,second,value in (d.split()
                                               for d in data.splitlines()))
limit = 0.6
for name, group in it.groupby(data_pairs, lambda x: x[0]):
    print(name, len([value for _, value in group if float(value) >= limit]))
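The question also asks for several thresholds from 0 to 1 and for the result to be written to a file, which none of the snippets above do. Here is a rough Python 3 sketch of that part, assuming steps of 0.1. The helper `write_report`, the step size, and the output file name `connection_counts.txt` are all assumptions for illustration.

```python
from collections import defaultdict

data = """john steve 0.67588
john matt 1.00
red blue 0.90
yellow steve 0.02"""

# Collect the column3 strengths for each column1 identifier.
strengths = defaultdict(list)
for line in data.splitlines():
    first, _second, value = line.split()
    strengths[first].append(float(value))

def write_report(strengths, thresholds, path):
    """Write one block per threshold: a heading, then 'name count' lines."""
    with open(path, "w") as out:
        for limit in thresholds:
            out.write("for column3 >= %.1f\n" % limit)
            for name in sorted(strengths):
                count = sum(1 for v in strengths[name] if v >= limit)
                out.write("%s %d\n" % (name, count))
            out.write("\n")

# Thresholds 0.0, 0.1, ..., 1.0, built from integers to avoid float drift.
thresholds = [i / 10 for i in range(11)]
write_report(strengths, thresholds, "connection_counts.txt")
```

Building the thresholds as `i / 10` instead of repeatedly adding 0.1 avoids accumulated rounding error, which could otherwise push a threshold slightly past a value such as 1.00 and change a count.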