This post is a follow-up to an older discussion on record managing. I am writing this to share with colleagues, so it may be a bit fluffy and not straight to the point. I apologize to all the experts.
Good programs always start by managing data in a flexible and robust manner. Python has great builtin container datatypes (lists, tuples, dictionaries), but often we need to go beyond these and create flexible data handlers that are custom-fitted to our analysis. To this end, the programmer really does become an architect, and a myriad of approaches are possible. This is a double-edged sword, though, as inexperienced coders (like me) will tend to go down the wrong avenues and implement poor solutions. Having emerged from such a trajectory, I really want to share my experience and introduce what I feel is a very easy-to-use and broadly applicable data container.
PyTony in the aforementioned link really turned me on to the underused Python data container, namedtuple. A namedtuple is an immutable array container just like a normal tuple, except that namedtuples have field designations. Therefore, elements can be accessed by attribute lookup as well as item lookup (i.e. x.a or x[0]), whereas tuples have no concept of attribute lookup. Namedtuples are a really great option for storing custom datatypes for these reasons:
- They are lightweight (take up very little memory)
- They allow for manual creation, but are also seamlessly interfaced to file or SQL database input (see reference; a short CSV sketch follows this list)
- They have many basic utilities builtin, such as the ability to instantiate directly from lists and dictionaries, or simple means for subclassing and prototyping.
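For instance, the standard library documentation shows namedtuples being populated straight from a CSV reader. Here is a minimal sketch of that pattern (the file name people.csv is hypothetical):

import csv
from collections import namedtuple

Person = namedtuple('Person', ['name', 'age', 'height'])

# each row of the hypothetical people.csv looks like: bret,15,50
with open('people.csv') as f:
    people = [Person._make(row) for row in csv.reader(f)]

Note that everything read this way arrives as a string, which foreshadows the typing discussion below.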
Therefore, named tuples really are ideal for managing data which may come from various files, databases, or manual construction; they are not limited to a specific import domain. They also take up less memory than subclasses of Python's object (there is a great example of this right here on DaniWeb), and they have many builtin methods which object subclassing would require the programmer to write herself. One should note that if data mutability (e.g. changing the data attributes in the program directly) is paramount to the analysis, object subclassing is probably the way to go. For all the advantages of namedtuples, I came to realize that they do have some shortcomings. My biggest gripes with namedtuples are:
- Namedtuples don't inherently understand default field values.
- Namedtuples don't typecheck field values.
Let me demonstrate this with an example. I'm going to define a namedtuple class called "Person" which has three fields: name, age and height. I will then make a Person record from it.
In [1]: from collections import namedtuple
In [2]: Person=namedtuple('Person', 'name, age, height')
In [3]: bret=Person(name='bret', age=15, height=50)
In [4]: bret
Out[4]: Person(name='bret', age=15, height=50)
This is a nice record. I can access values by attribute lookup and I can use builtin methods to do nice things like return a dictionary without building any extra code.
In [5]: bret.age, bret.name, bret.height
Out[5]: (15, 'bret', 50)
In [6]: bret._asdict()
Out[6]: OrderedDict([('name', 'bret'), ('age', 15), ('height', 50)])
Ok, so this works nicely, but what if we want to read in records with no height column? This is the first place where namedtuple will fail you.
In [59]: ted=Person(name='ted', age=50)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-59-23acd5446d52> in <module>()
----> 1 ted=Person(name='ted', age=50)
TypeError: __new__() takes exactly 4 arguments (3 given)
There are many instances when this behavior is desirable; however, there are also instances when it is not. For example, if we were storing data input from a survey and certain fields were left blank, do we really want this to crash the program? The alternative is to populate the missing fields with null or default data manually, so wouldn't it be great if namedtuples understood this implicitly? One can think of many other instances where defaulting is important, and it is especially helpful when fields have very obscure or misleading datatypes, which may confuse anyone else using your codebase.
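Before getting to my solution, it is worth noting one common workaround: keep a prototype instance that holds the defaults and build new records with _replace(), which copies the prototype with only the given fields overridden. A minimal sketch:

from collections import namedtuple

Person = namedtuple('Person', ['name', 'age', 'height'])

# prototype record that carries the default values
default_person = Person(name='unnamed', age=0, height=0.0)

# _replace() returns a new tuple with only the given fields changed,
# so anything left out silently falls back to the defaults
ted = default_person._replace(name='ted', age=50)
# -> Person(name='ted', age=50, height=0.0)

This works, but the defaults live in a loose variable rather than in the class itself, and nothing stops anyone from calling Person() directly and hitting the same TypeError.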
The second thing namedtuples don't do is enforce field types. Consider again our Person class. The attribute "name" implies that a string should be entered, but there's nothing to enforce this. The same is true for height; that is, certain information is presumed on the user's part.
In [10]: kevin=Person(name=32, age='string input', height=['a', 'list', 'has been entered'])
In [11]: kevin
Out[11]: Person(name=32, age='string input', height=['a', 'list', 'has been entered'])
Because a named tuple is a very basic container, it really doesn't care what types of objects you pass into the fields. Without getting into a philosophical argument about duck typing, I think we can all agree that there are times when this behavior is undesirable. Imagine you were going to share your codebase with someone unfamiliar with the subject; field names might not be so obvious. Additionally, if you built your analysis assuming the height attribute had a very particular format, e.g. (6 foot 9 inches), everyone's life would be easier if the namedtuple knew about it.
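To make the problem concrete, here is the kind of hand-rolled check one might bolt on before building each record (the expected mapping and the checked_person helper are hypothetical names, used only for illustration):

from collections import namedtuple

Person = namedtuple('Person', ['name', 'age', 'height'])
expected = {'name': str, 'age': int, 'height': float}

def checked_person(**fields):
    # reject any field whose value is not of the expected type
    for key, value in fields.items():
        if not isinstance(value, expected[key]):
            raise TypeError('%s should be %s, got %r' % (key, expected[key], value))
    return Person(**fields)

Doing this at every call site gets tedious fast, which is exactly why I wanted the checking pushed into a dedicated class.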
At the end of the day, I think all of these considerations fall under the umbrella of record keeping in Python. It is an interesting topic and certainly warrants further discussion.
Now let me get into my solution and why I think it's elegant. First, I should mention that one cannot directly subclass namedtuple; namedtuple is a function which builds classes, not a class in and of itself. Modifying what is returned by namedtuple requires altering the source code directly, which is rather messy. The previous discussion actually was in regard to this. My solution, then, was pretty simple:
- Write a class that natively understands default values and types.
- Make sure the class can typecheck and fill in missing data in a light and syntactically nice way.
- Pass the adjusted data into a namedtuple, and hide most of this under the hood.
This way, the new class does all of its field typechecking before initializing any namedtuples. Default data is passed at instantiation, and the default values and types are stored from then on. A rough sketch of the idea follows, and then I will demonstrate by example: we will create a Person namedtuple with the new class, defining our stringent fields and passing them right in to the class instantiation.
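To be clear about what I mean, here is a minimal, hypothetical reconstruction of the approach. This is my own simplification for illustration, not the actual recordmanager source; the class name RecordManagerSketch and all of the internals shown here are assumptions.

from collections import namedtuple

class RecordManagerSketch(object):
    """Simplified illustration: typecheck and fill defaults up front,
    then delegate storage to a plain namedtuple."""

    def __init__(self, typename, fields):
        # fields is a list of (fieldname, default) pairs; the expected
        # type of each field is inferred from its default value
        self._fieldnames = [name for name, _ in fields]
        self._defaults = dict(fields)
        self._types = dict((name, type(default)) for name, default in fields)
        # keep the real namedtuple class as an attribute so it can
        # still be used directly, bypassing all checks
        setattr(self, typename, namedtuple(typename, self._fieldnames))
        self._tupleclass = getattr(self, typename)

    def _check(self, name, value, warning=False):
        expected = self._types[name]
        if isinstance(value, expected):
            return value
        try:
            recast = expected(value)          # e.g. int(40.0) -> 40
        except (TypeError, ValueError):
            raise TypeError('Argument: %r to %s' % (value, expected))
        if warning:
            print('Recasting %r to %s as %r' % (value, expected, recast))
        return recast

    def _make(self, *values, **options):
        values = list(values)
        if options.get('extend_defaults', False):
            # fill missing trailing fields, left to right, with defaults
            values += [self._defaults[n] for n in self._fieldnames[len(values):]]
        checked = [self._check(n, v, options.get('warning', False))
                   for n, v in zip(self._fieldnames, values)]
        return self._tupleclass(*checked)

    def dict_make(self, warning=False, **fields):
        # start from the defaults, overwrite with whatever was supplied
        data = dict(self._defaults)
        data.update(fields)
        checked = dict((n, self._check(n, v, warning)) for n, v in data.items())
        return self._tupleclass(**checked)

The real recordmanager module does more than this, but the pattern is the same: validate and default first, then hand the clean values to an ordinary namedtuple. Now, on to the demo.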
In [12]: from recordmanager import RecordManager
In [16]: personfields=[('name', 'unnamed'), ('age', int()), ('height', float())]
In [17]: personmanager=RecordManager('Person', personfields)
By passing in personfields, we implicitly are telling RecordManager the default value and type of each field. The personmanager can now make named tuples from lists or dictionaries in a similar manner as an ordinary namedtuple.
In [26]: bill=personmanager._make('Billy', 32, 10000.00)
In [27]: bill
Out[27]: Person(name='Billy', age=32, height=10000.0)
At first glance, this looks no different from the standard namedtuple _make() method; however, this _make() method is being called on the RecordManager class, so it will typecheck the fields. We can make the typechecking verbose with a keyword, "warning".
In [29]: jill=personmanager._make('Jill', 40.0, 50, warning=True)
Recasting 40.0 to <type 'int'> as 40
Recasting 50 to <type 'float'> as 50.0
In [30]: jill
Out[30]: Person(name='Jill', age=40, height=50.0)
Of course, certain types can't be recast; in that case, an error will come up showing exactly why.
In [31]: adam=personmanager._make('Adam', 'teststring', 40.0)
Out[31]: TypeError: Argument: teststring to <type 'int'>
All of the returned records are still namedtuples, so all the standard methods work natively.
In [33]: bill._asdict()
Out[33]: OrderedDict([('name', 'Billy'), ('age', 32), ('height', 10000.0)])
I've also incorporated an optional way to create records from incomplete lists (i.e. missing fields). A namedtuple will be returned with field defaults for non-entered fields; however, this ASSUMES one enters fields in their correct order from left to right.
In [35]: joe=personmanager._make(name='Joe', extend_defaults=True)
In [36]: joe
Out[36]: Person(name='Joe', age=0, height=0.0)
The namedtuple class itself is stored as an attribute, so it can still be accessed directly. The Person attribute of the personmanager class bypasses any typechecking and functions like an ordinary namedtuple; I can pass bad input in and it won't care.
In [38]: jenny=personmanager.Person(name='jenny', age='string input', height=30)
In [39]: jenny
Out[39]: Person(name='jenny', age='string input', height=30)
Therefore, one is by no means obligated to use these special class methods. Strictly typechecked namedtuples can be generated alongside default ones with no extra hassle.
Namedtuples have a really nice feature of instantiating from a dictionary. Let me demonstrate this first by accessing the standard namedtuple directly:
In [43]: d={'name':'Larry', 'age':50, 'height':90}
In [44]: Larry=personmanager.Person(**d)
In [45]: Larry
Out[45]: Person(name='Larry', age=50, height=90)
Again, this has no concept of defaults or typed fields. To incorporate them, personmanager has a method called dict_make, which lets users pass incomplete fields with type recasting.
In [55]: d={'name':'Fred', 'age':30.0}
In [56]: Fred=personmanager.dict_make(warning=True, **d)
Recasting 30.0 to <type 'int'> as 30
In [57]: Fred
Out[57]: Person(name='Fred', age=30, height=0.0)
Notice that the default height was filled in and the age was recast.
Eventually I will add a method to subclass within this framework, and then I think this will completely mimic namedtuple functionality. I hope that you found this useful, and I look forward to feedback.