I really need some help. I am a relatively new Python programmer, but I am not that familiar with XML. I have been tasked with creating an XML file from an Excel spreadsheet. The structure of the XML file is set by my vendor. I have a sample XML file, as well as an XSD file, but I am not sure how to progress.
I have been able to pull the data out of the Excel file in to a CSV using lxrd, but I am stuck there. I have created a header row in my CSV to create the appropriate tags in XML, but I cannot get the format to match the sample XML.
My next thought was to replace all the data in the sample XML file with variables (e.g. $2C), and to do a search and replace using the CSV. This seems to be very awkward (and truthfully, I don't know how to script that).
I am not opposed to doing the work, and I would love to actually learn how to do this, but I could really use some assistance.
# xlrd usage, see: http://scienceoss.com/read-excel-files-from-python/
import xlrd
# Open the XLS file as a "workbook" and select the first sheet as our source.
# We'll get our tag names from the sheet's first row contents.
wb = xlrd.open_workbook("my_excel_file.xls")
sh = wb.sheet_by_index(0)
tags = [n.replace(" ", "").lower() for n in sh.row_values(0)]
# This is going to come out as a string, which will write to a file in the end.
result = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<myItems>\n"
# Now, we'll just create a string that looks like an XML node for each row
# in the sheet. Of course, a lot of things will depend on the prescribed XML
# format but since we have no idea what it is, we'll just do this:
for row in range(1, sh.nrows):
result += " <item>\n"
for i in range(len(tags)):
tag = tags[i].encode("utf-8")
val = sh.row_values(row)[i].encode("utf-8")
result += " <%s>%s</%s>\n" % (tag, val, tag)
result += " </item>\n"
# Close our pseudo-XML string.
result += "</myItems>"
# Write the result string to a file using the standard I/O.
f = open("myfile.xml", "w")
f.write(result)
f.close()
But as I put in the comments, a lot depends on the prescribed XML format. Once we know that, we might (and I guess, should) use a real XML-processing module like xml.dom.minidom instead of writing strings out to file like this.
Thanks for the response Samhain!! If I sent you the XML and XLS, would that be more helpful?? I also have an XSD for validation (but that is a lesser issue at this point).
Woohoo, now that's an example! Anyway, it was a good exercise.
First things first: in order to keep my sanity, I needed to write a helper class that provides a few methods for manipulating the XML (and the nodes) that we have to build. I've attached the script as a TXT file, you should replace the extension to .py if you want to try it.
I also attached a sample usage script (just in case I get disconnected and not be able to get back right away). It has lots and lots of comments, which I'll just be repeating here in parts. Like the helper script, just change the extension to .py.
from xs2ml import xs2ml
# xs2ml takes the XLS filename as an argument on __init__.
x = xs2ml("sample.xls")
# Give this method the root node name and it's attribute-values, if any.
x.create_root_node("PIES", xmlns="http://www.aftermarket.org/eCommerce/Pies")
# Create our basic node structure. This method takes in the target parent node
# and the node names that need to be created under it. We'll get a list in return.
root_nodes = x.add_nodes(x.dom_.firstChild, "Header", "Items", "Trailer")
Start filling-up the Items node. This is a bit complicated because the values are coming from multiple worksheets, and some values are text nodes while others are attribute-value nodes. Anyway, we'll just use the `Main Item Data Tab` as the basis for our row count. Also, we'll start at row[2] because we know that's were the real data starts.
Code:
for i in range(2, x.sheets["Main Item Data Tab"].nrows):
# Create the Item node in Items. We'll modify this node as we go along.
item = x.add_nodes(root_nodes[1], "Item")[0]
# Create an attribute called MaintenanceType. We know that the data for
# this node is in the Digital Assets Tab, column[1].
x.cell_to_node(item, "Digital Assets Tab", col=1, row=i,
attr_name="MaintenanceType")
# It seems the MaintenanceType value is reusable, so let's save it.
MaintenanceType = item.attributes["MaintenanceType"].value
# Create the ItemLevelGTIN node. We know the value is in column[8] of
# Main Item Data. We also know about its GTINQualifier attribute in col[9].
ItemLevelGTIN = x.add_nodes(item, "ItemLevelGTIN")[0]
x.cell_to_node(ItemLevelGTIN, "Main Item Data Tab", col=8, row=i)
x.cell_to_node(ItemLevelGTIN, "Main Item Data Tab", col=9, row=i,
attr_name="GTINQualifier")
# This is going to be a long for loop... but you get the idea.
# One last example, Descriptions nodes:
Descriptions = x.add_nodes(item, "Descriptions")[0]
d_nodes = x.add_nodes(Descriptions, "Description", "Description")
# We know what DescriptionCode to put in. LanguageCode, I don't know
# where to find it but either use the setAttribute way or the cell_to_node
# way if you know the sheet and cell the value is in.
d_nodes[0].setAttribute("DescriptionCode", "DES")
d_nodes[1].setAttribute("DescriptionCode", "SHO")
# MaintenanceType, we'll just get from item since we know that already.
d_nodes[0].setAttribute("MaintenanceType", MaintenanceType)
d_nodes[1].setAttribute("MaintenanceType", MaintenanceType)
# Short and long descriptions, according to the XLS file.
x.cell_to_node(d_nodes[0], "Main Item Data Tab", col=16, row=i)
x.cell_to_node(d_nodes[1], "Main Item Data Tab", col=15, row=i)
The cell_to_node method, by the way, takes the following arguments:
parent node, worksheet name, column, row, and attribute name (optional, None is default)
cell_to_node reads the cell value of column:row. By default, this value is turned into a text node and appended to the parent node, which results in something like <parent>value</parent>. But if the attribute name is given, an attribute is created within parent and given the cell value, like: <parent attribute_name="value" />
---
Composing the Header and Trailer is more straight forward because their child node values come from a single worksheet and we can easily where the values are coming from. We'll just use some dictionaries here that contain each element's respective column and row. Then we'll feed these maps to a convenience method for generating simple, mapped nodes.
Wow!! Thank you so much!! I have read through the code, and it makes sense. I will try it out, and modify the code to accept the rest of the data. I cannot express how much I appreciate the help on this. I have read so many blogs, how-to sites, etc, but it didn't make a whole lot of sense.
Ok, so I have taken the code, and completed the project. I had to make a couple of small tweaks to the code, but all in all the class you created worked beautifully!!
The only question I had was concerning looping. In one area, I need to manually set the Language Code (as an attribute) to EN. Currently, I am using this:
Code:
# Manually set the language code to EN
e_nodes[0].setAttribute("LanguageCode", "EN")
e_nodes[1].setAttribute("LanguageCode", "EN")
e_nodes[2].setAttribute("LanguageCode", "EN")
e_nodes[3].setAttribute("LanguageCode", "EN")
e_nodes[4].setAttribute("LanguageCode", "EN")
e_nodes[5].setAttribute("LanguageCode", "EN")
e_nodes[6].setAttribute("LanguageCode", "EN")
e_nodes[7].setAttribute("LanguageCode", "EN")
e_nodes[8].setAttribute("LanguageCode", "EN")
e_nodes[9].setAttribute("LanguageCode", "EN")
I attempted to create a loop by:
Code:
for node in e_nodes:
e_node(node).setAttribute("LanguageCode", "EN")
However, I recieved an error. What would be the proper way to loop this?? Or did I do it correctly??
Once again, I can't thank you enough for your assistance!!!!
^Cool! This was actually a very enlightening side project, I'm now working on something that gets stuff from OpenDocument Spreadsheets. Putting what I've learned from here to practice.