I'm using Python 2.5.2 on an Ubuntu box for a research project based on data from the Fatality Analysis Reporting System (FARS) database (1975-), available at
http://www-fars.nhtsa.dot.gov/Main/index.aspx.
So far I have found 115 characters of the form "\xzz" -- hex escapes, presumably -- in the card-image records (4+ per incident) for about 300k incidents from 1975-1981. Each record is supposed to contain up to 88 alphanumeric characters, and the record/card/field layout was constant for those years.
This is (obviously) not a huge problem, since 115 oddballs is a tiny fraction of the roughly 80 * 1.2m characters involved, but I'd like to figure out whether those characters have some other interpretation before replacing them with spaces, question marks, or something else.
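For reference, if they do turn out to be garbage, the replacement itself is a one-liner. A minimal sketch (modern Python syntax rather than my 2.5 loop; the name `scrub` is mine):

```python
import string

PRINTABLE = set(string.printable)

def scrub(record, replacement="?"):
    """Replace every character outside string.printable with `replacement`."""
    return "".join(ch if ch in PRINTABLE else replacement for ch in record)
```

Passing `replacement=" "` would give the space-for-oddball variant instead.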
My processing loop begins with:
    for line in fileinput.input(path_file):
All other returned characters are in string.printable.
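In case it helps anyone reproduce this, here is roughly how the oddballs can be located -- a sketch in modern Python syntax rather than my actual 2.5 loop, and `find_oddballs` is a name I made up:

```python
import string

PRINTABLE = set(string.printable)

def find_oddballs(path):
    """Return (line_number, column, char) triples for every character
    in the file at `path` that falls outside string.printable."""
    hits = []
    with open(path, "rb") as f:           # read bytes, so nothing is decoded away
        for lineno, raw in enumerate(f, 1):
            line = raw.decode("latin-1")  # 1:1 byte-to-character mapping
            for col, ch in enumerate(line):
                if ch not in PRINTABLE:
                    hits.append((lineno, col, ch))
    return hits
```

Note that since each record ends with "\r" rather than "\n", binary-mode line iteration sees the whole file as one "line"; the column offsets still pinpoint the odd bytes.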
Each record ends with "\r", so I'm guessing they were created on a Macintosh.
93 of the unexpected characters occur in the field for VIN (vehicle identification number) values. The VIN coding scheme is well documented; see
http://www.autoinsurancetips.com/decoding-your-vin,
for example. These 93 instances involve:
1 time each: \x01, \x08, \x10, \x12, \x9b, and \xf9
3 times: \x19
5 times: \xf2
79 times: \x1b
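A tally like the one above can be produced along these lines (sketched with `collections.Counter`, which is newer than 2.5; the function name is hypothetical):

```python
import string
from collections import Counter

PRINTABLE = set(string.printable)

def tally_oddballs(records):
    """Count every character outside string.printable across all records."""
    counts = Counter()
    for record in records:
        counts.update(ch for ch in record if ch not in PRINTABLE)
    return counts
```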
The other 12 are in two other records (#1144677 (1979) and #1452856 (1980) of the 2.whatever million read so far), which return \x00 (ten times each) and \x01 (once each) in the same fields. Those fields report vehicle body type, truck characteristics (fuel, weight, series), and motorcycle engine displacement -- not all of which apply to the same vehicle :).
Any thoughts on interpreting these strange character codes, or other places to look? Or should I conclude that they are random garbage?
Thanks very much!
HatGuy