Here is the CSV file that I am working with:
`"A","B","C","D","E","F","G","H","I","J"
"88",18,1,"<Req TID=""34"" ReqType=""MS""><IISO /><CID>2</CID><MemID>0000</MemID><MemPass /><RequestData><S>[REMOVED]</S><Na /><La /><Card>[REMOVED]</Card><Address /><HPhone /><Mail /></ReqData></Req>","<Response T=""3"" RequestType=""MS""><MS><Memb><PrivateMembers /><Ob>0-12-af</Ob><Locator /></Memb><S>[REMOVED]</S><CNum>[REMOVED]</CNum><FName /><LaName /><Address /><HPhone /><Email /><IISO /><MemID /><MemPass /><T /><CID /><T /></MS></Response>",0-JAN-10 12.00.02 AM,27-JUN-15 12.00.00 AM,"26",667,0
"22",22,1,"<Req TID=""45"" ReqType=""MS""><IISO /><CID>4</CID><MemID>0000</MemID><MemPass /><RequestData><S>[REMOVED]</S><Na /><La /><Card>[REMOVED]</Card><Address /><HPhone /><Mail /></ReqData></Req>","<Response T=""10"" RequestType=""MS""><MS><Memb><PrivateMembers /><Ob>0-12-af</Ob><Locator /></Memb><S>[REMOVED]</S><CNum>[REMOVED]</CNum><FName /><LaName /><Address /><HPhone /><Email /><IISO /><MemID /><MemPass /><T /><CID /><T /></MS></Response>",0-JAN-22 12.00.02 AM,27-JUN-22 12.00.00 AM,"26",667,0
"32",22,1,"<Req TID=""15"" ReqType=""MS""><IISO /><CID>45</CID><MemID>0000</MemID><MemPass /><RequestData><S>[REMOVED]</S><Na /><La /><Card>[REMOVED]</Card><Address /><HPhone /><Mail /></ReqData></Req>","<Response T=""10"" RequestType=""MS""><MS><Memb><PrivateMembers /><Ob>0-12-af</Ob><Locator /></Memb><S>[REMOVED]</S><CNum>[REMOVED]</CNum><FName /><LaName /><Address /><HPhone /><Email /><IISO /><MemID /><MemPass /><T /><CID /><T /></MS></Response>",0-JAN-20 12.00.02 AM,27-JUN-34 12.00.00 AM,"26",667,0`
The below function is annotated. Briefly,the function get_clientresponses_two
reads the above CSV, selects column E's data instances (XML data).There are two two generator functions to parse the XML data in **column E **in order to convert the XML tags and their text into a Python dictionary. Specifically, the flatten_dict(
) function returns an iterable sequence of (key, value) pairs. One can turn this to a list of pairs with list(flatten_dict(root))
.
The output, as it is written so far is generate a dictionary. Then, def allocate_and_write_data_
then takes those and creates two different collections. One is a set that is updated using the keys from flatten_dict(
) . This is to ensure that the element tags from the XML are included in the headers (along with their corresponding values) in the newly written CSV. The code is written such to maintain the integrity of the headers (no duplicates) and allow for new element tags to be converted into headers (along with their values). Moreover, the headers and values that already exist should be flexible enough to be updated with new instances (again - unique, as well). In addition all the other rows are intended to be stored and updated. I then conver the headers into a list and ensure that any missing data instances are accounted for (with a ' ') using the list comprehension data
import csv
from collections import OrderedDict
from xml.etree.ElementTree import ParseError
import collections
from __future__ import print_function
def get_clientresponses_two(filename = 's.csv'):
with open(filename, 'rU') as infile:
reader = csv.DictReader(infile) # read the file as a dictionary for each row ({header : value})
data = {}
for row in reader:
for header, value in row.items():
try:
data[header].append(value)
except KeyError:
data[header] = [value]
client_responses = data['E'] #returns a list
for client_response in client_responses:
xml_string = (''.join(client_response))
xml_string = xml_string.replace('&', '')
try:
root = ElementTree.XML(xml_string)
print(root) #check that I am getting the root
return dict(flatten_dict(root) ####return generator function with output as XML root as dictionary
except ET.ParseError:
print("catastrophic failure")
continue
def allocate_and_write_data_2(get_clientresponses_two):
with open(filename, 'r') as infile:
reader = csv.DictReader(infile) # read the file as a dictionary for each row ({header : value})
header = set()
results = []
# data = {} # this is not needed for the purpose of this organization
for row in reader:
xml_data = get_clientresponses_two()
row.update(xml_data) # just for XML data
results.append(row) # everything else
headers.update(row.keys()) # can't forget headers
# print(row) # returns dictionary of key values pairs (headers : values)
# print(results) # returns list wrapper for dictionary
# print(headers) #returns set of all headers
headers_list = list(headers)
# print(headers_list) #list form of set
with open('csv_output.csv', 'wt') as f:
writer = csv.writer(f)
writer.writerow(headers_list)
for row in results:
data = [row.get(x, '') for x in headers_list]
writer.writerow(data)
The output is the following:
C,HPhone,Locator,IISO,E,S,FName,LaName,J,D,MemID,ResponseRequestType,T,Email,I,Ob,G,MemPass,Address,A,PrivateMembers,H,CNum,ResponseT,CID,B,F
1,,,,"<Response T=""3"" RequestType=""MS""><MS><Memb><PrivateMembers /><Ob>0-12-af</Ob><Locator /></Memb><S>[REMOVED]</S><CNum>[REMOVED]</CNum><FName /><LaName /><Address /><HPhone /><Email /><IISO /><MemID /><MemPass /><T /><CID /><T /></MS></Response>",[REMOVED],,,0,"<Req TID=""34"" ReqType=""MS""><IISO /><CID>2</CID><MemID>0000</MemID><MemPass /><RequestData><S>[REMOVED]</S><Na /><La /><Card>[REMOVED]</Card><Address /><HPhone /><Mail /></ReqData></Req>",,MS,,,667,0-12-af,27-JUN-15 12.00.00 AM,,,88,,26,[REMOVED],10,,18,0-JAN-10 12.00.02 AM
1,,,,"<Response T=""10"" RequestType=""MS""><MS><Memb><PrivateMembers /><Ob>0-12-af</Ob><Locator /></Memb><S>[REMOVED]</S><CNum>[REMOVED]</CNum><FName /><LaName /><Address /><HPhone /><Email /><IISO /><MemID /><MemPass /><T /><CID /><T /></MS></Response>",[REMOVED],,,0,"<Req TID=""45"" ReqType=""MS""><IISO /><CID>4</CID><MemID>0000</MemID><MemPass /><RequestData><S>[REMOVED]</S><Na /><La /><Card>[REMOVED]</Card><Address /><HPhone /><Mail /></ReqData></Req>",,MS,,,667,0-12-af,27-JUN-22 12.00.00 AM,,,22,,26,[REMOVED],10,,22,0-JAN-22 12.00.02 AM
1,,,,"<Response T=""10"" RequestType=""MS""><MS><Memb><PrivateMembers /><Ob>0-12-af</Ob><Locator /></Memb><S>[REMOVED]</S><CNum>[REMOVED]</CNum><FName /><LaName /><Address /><HPhone /><Email /><IISO /><MemID /><MemPass /><T /><CID /><T /></MS></Response>",[REMOVED],,,0,"<Req TID=""15"" ReqType=""MS""><IISO /><CID>45</CID><MemID>0000</MemID><MemPass /><RequestData><S>[REMOVED]</S><Na /><La /><Card>[REMOVED]</Card><Address /><HPhone /><Mail /></ReqData></Req>",,MS,,,667,0-12-af,27-JUN-34 12.00.00 AM,,,32,,26,[REMOVED],10,,22,0-JAN-20 12.00.02 AM
However, my intention is to minimify, refactor and modularize the code. I would like to ideally have the processes still written to memory. Moreover, when I try to iterate through the generator, I receive an error:
TypeError: 'function' object is not iterable
. I would like guidance and feedback with specifically how to best rectify this.