Python SimpleXML conversion to nested arrays of dictionaries

For a long time, I was looking for a Python XML conversion tool that produces a structure of nested arrays of dictionaries, as SimpleXML (the XML::Simple module) does in Perl.

I found some attempts to convert the XML structure into nested objects. All of them have drawbacks: either data access is too complicated (it requires calling a function), or it is too restricted (when everything is exposed through attribute access, there is no way to distinguish between attributes, text, and sub-elements).

Beautiful Soup (BS) is a good alternative, but it is most useful for HTML. Although there is a features="xml" option, it does not make a great difference. The main drawback of BS is that its tree contains so many elements that are virtually empty. If you are interested in XML, only the data between an opening and a closing tag (<tag>data</tag>) is necessary, not all the newlines and spaces between a closing tag and the next opening tag (</tag1> <tag2>). In addition, BS does not allow tag sorting. That makes perfect sense in HTML, but in XML it is a nice feature, as XML files with sorted tags can easily be compared by any text-file comparison software. Besides, Perl's SimpleXML provides this feature.
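The whitespace problem is easy to see with any parser that keeps every text node. The following is only a minimal sketch: it uses xml.dom.minidom from the standard library as a stand-in for BS (an assumption made to avoid a third-party dependency), and the tiny inline document is made up for the illustration.

```python
from xml.dom import minidom

# Two child elements, separated only by indentation and newlines.
doc = minidom.parseString("<a>\n\t<b>1</b>\n\t<c>2</c>\n</a>")
children = doc.documentElement.childNodes

# Real elements versus text nodes that contain nothing but whitespace.
elements = [n for n in children if n.nodeType == n.ELEMENT_NODE]
blanks = [n for n in children
          if n.nodeType == n.TEXT_NODE and n.data.strip() == '']

print(len(elements))  # the two elements <b> and <c>
print(len(blanks))    # whitespace-only text nodes kept by the parser
```

Every one of those whitespace-only nodes has to be skipped by hand when walking the tree, which is the overhead described above.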

So I set out to develop one on my own.

Given a simple XML text file (simple1.xml):

<a>
	<b cer="leap" dup="go">
		<c>1</c>
		<c>2</c>
		<c>3</c>
		<c>4</c>
	</b>
	<b cer="jump" dup="leave">
		<c len="11">cite</c>
		<c len="22">cute</c>
		<c len="33">diff</c>
		<c len="44">ex</c>
		<l></l>
	</b>
	<d>dummy</d>
	<q lang="Esperanto"></q>
</a>

This is my Python code to convert the XML into a nested structure of arrays and dictionaries.

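import xml.sax
import xml.sax.handler

class TreeBuilder(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.stack = []   # one (name, data, text) tuple per open element
        self.root = {}    # the finished tree, set when the root element closes

    def startElement(self, name, attrs):
        # Attributes, if any, become the first dictionary of the element's list.
        if attrs:
            data = [{k: v for k, v in attrs.items()}]
        else:
            data = []
        self.stack.append((name, data, ''))

    def endElement(self, name):
        (s_name, s_data, s_text) = self.stack.pop()
        # Drop whitespace that only surrounds the text (indentation, newlines).
        s_text = s_text.strip()
        if len(s_data) == 0:
            # No attributes and no sub-elements: the element is just its text.
            s_data.append(s_text)
        elif s_text != '':
            # Text next to attributes or sub-elements goes under 'content'.
            s_data[-1].update({u'content': s_text})

        if len(self.stack) == 0:
            # The root element has no parent to attach to.
            self.root = {name: s_data}
            return

        (p_name, p_data, p_text) = self.stack.pop()
        if len(p_data) > 0 and s_name in p_data[-1]:
            # Repeated tag: extend the list that already exists in the parent.
            p_data[-1][s_name].extend(s_data)
        elif len(p_data) != 0:
            # First occurrence of this tag in a parent that has a dictionary.
            p_data[-1].update({s_name: s_data})
        else:
            p_data.append({s_name: s_data})
        self.stack.append((p_name, p_data, p_text))

    def characters(self, content):
        # The parser may deliver one text run in several chunks, so collect
        # everything here and strip only once, in endElement.
        (name, data, text) = self.stack.pop()
        self.stack.append((name, data, text + content))

builder = TreeBuilder()
xml.sax.parse('simple1.xml', builder)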

The data can be accessed through the root attribute of the builder instance. The code below produces (newlines were added manually for readability):

>>> print builder.root
{u'a': [
        {u'q': [
                {u'lang': u'Esperanto'}
                ], 
         u'b': [
                {
                 u'dup': u'go', 
                 u'c': [
                        u'1', 
                        u'2', 
                        u'3', 
                        u'4'
                        ], 
                 u'cer': u'leap'
                 }, 
                {
                 u'dup': u'leave', 
                 u'c': [
                        {
                         u'content': u'cite', 
                         u'len': u'11'
                         }, 
                        {
                         u'content': u'cute', 
                         u'len': u'22'
                         }, 
                        {
                         u'content': u'diff', 
                         u'len': u'33'
                         }, 
                        {
                         u'content': u'ex', 
                         u'len': u'44'
                         }
                        ], 
                 u'cer': u'jump', 
                 u'l': [
                        ''
                        ]
                 }
                ], 
         u'd': [u'dummy']
         }
    ]
}

This is virtually the same as the result of Perl's XML::Simple:

use XML::Simple;
use Data::Dumper;

$infile ="simple1.xml";
$xmlcontent = XMLin($infile);
print Dumper($xmlcontent);
exit;

$VAR1 = {
          'q' => {
                 'lang' => 'Esperanto'
               },
          'b' => [
                 {
                   'c' => [
                          '1',
                          '2',
                          '3',
                          '4'
                        ],
                   'dup' => 'go',
                   'cer' => 'leap'
                 },
                 {
                   'l' => {},
                   'c' => [
                          {
                            'len' => '11',
                            'content' => 'cite'
                          },
                          {
                            'len' => '22',
                            'content' => 'cute'
                          },
                          {
                            'len' => '33',
                            'content' => 'diff'
                          },
                          {
                            'len' => '44',
                            'content' => 'ex'
                          }
                        ],
                   'dup' => 'leave',
                   'cer' => 'jump'
                 }
               ],
          'd' => 'dummy'
        };
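Comparing the two dumps, the Perl result differs in three ways: the root element is dropped, a tag that occurs only once is not wrapped in an array, and an empty element becomes an empty hash instead of an empty string. (In XML::Simple the first two are governed by the KeepRoot and ForceArray options, which are off by default.) The sketch below maps the Python structure onto the Perl shape. The helper name perl_like and the shortened sample tree are hypothetical, and the mapping is only checked against this sample, not against XML::Simple in general.

```python
def perl_like(value):
    """Collapse the Python tree into the shape of the Perl dump above."""
    if isinstance(value, list):
        # An empty text element is '' in Python and {} in the Perl dump.
        items = [{} if v == '' else perl_like(v) for v in value]
        # A tag that occurs only once is not wrapped in an array.
        return items[0] if len(items) == 1 else items
    if isinstance(value, dict):
        return dict((k, perl_like(v)) for k, v in value.items())
    return value

# Shortened literal version of the tree printed earlier.
tree = {'a': [{'q': [{'lang': 'Esperanto'}],
               'b': [{'cer': 'leap', 'c': ['1', '2']},
                     {'cer': 'jump', 'l': ['']}],
               'd': ['dummy']}]}

# Starting below the root key mirrors the dropped root element.
simplified = perl_like(tree['a'])
print(simplified['d'])
print(simplified['q'])
```

The price of the Perl shape is that the access path depends on how often a tag occurs, whereas the Python structure always uses a list, so tree['a'][0]['d'][0] works whether there is one <d> or several.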

Data can be accessed simply, as with any other nested array and dictionary.

>>> tree = builder.root
>>> print tree['a'][0]['b'][0]['c'][2]
3
>>> print tree['a'][0]['b'][1]['c'][2]['len']
33

The tree structure can be easily manipulated with Python's list and dictionary methods, append() and update() for instance.
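As a small illustration, here is a sketch that edits a literal copy of part of the tree. The added values ('5', 'stay', the attribute 'new' and the tag 'e') are made up for the example.

```python
# Literal copy of part of the tree (the second <b> is left out for brevity).
tree = {'a': [{'b': [{'cer': 'leap', 'dup': 'go',
                      'c': ['1', '2', '3', '4']}],
               'd': ['dummy'],
               'q': [{'lang': 'Esperanto'}]}]}

first_b = tree['a'][0]['b'][0]

# append() adds a fifth <c> element to the first <b>.
first_b['c'].append('5')

# update() sets or overwrites attributes: plain string values are attributes.
first_b.update({'dup': 'stay', 'new': 'yes'})

# A new sub-element is a key whose value is a list.
tree['a'][0]['e'] = ['extra']

print(first_b['c'])
print(sorted(first_b.keys()))
```

The only convention to keep in mind is that string values are attributes (or text, under the key 'content'), while list values are sub-elements.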

To get the XML data back as a text file with tags in alphabetical order, I wrote this little routine. Admittedly, it still looks a bit clumsy. I'm working on it.

def xmlyfy(tree):
    # The result is collected in the module-level string xml_ret.
    def extract_list(sublist, tag, depth):
        # Write every item of the list stored under one tag.
        pre = "\n" + "  " * depth
        global xml_ret
        if len(sublist) == 0 or isinstance(sublist[0], basestring):
            # Plain text elements: <tag>text</tag>
            for content in sublist:
                xml_ret = xml_ret + pre + "<" + tag + ">" + content + "</" + tag + ">"
        else:
            # Dictionaries: elements with attributes and/or sub-elements
            for item in sublist:
                extract_subtree(item, tag, depth)

    def extract_subtree(subtree, tag, depth):
        pre = "\n" + "  " * depth
        global xml_ret
        content = ''
        xml_ret = xml_ret + pre + "<" + tag
        # First pass: non-list values are attributes, except the text
        # stored under 'content'.
        for k in sorted(subtree.keys()):
            if not isinstance(subtree[k], list):
                if k != 'content':
                    xml_ret = xml_ret + " " + k + "=\"" + subtree[k] + "\""
                else:
                    content = subtree[k]

        xml_ret = xml_ret + ">"
        if content != '':
            # Element with text: close it on the same line.
            xml_ret = xml_ret + content + "</" + tag + ">"
            return
        # Second pass: list values are sub-elements, written in sorted order.
        for k in sorted(subtree.keys()):
            if isinstance(subtree[k], list):
                content = pre   # puts the closing tag on a line of its own
                if len(subtree[k]) == 0:
                    xml_ret = xml_ret + pre + "  <" + k + "></" + k + ">"
                else:
                    extract_list(subtree[k], k, depth+1)
        xml_ret = xml_ret + content + "</" + tag + ">"

    global xml_ret
    xml_ret = ''
    for k in sorted(tree):
        extract_list(tree[k], k, 0)
    return xml_ret

Which produces:

>>> tree = builder.root
>>> print xmlyfy(tree)
<a>
  <b cer="leap" dup="go">
    <c>1</c>
    <c>2</c>
    <c>3</c>
    <c>4</c>
  </b>
  <b cer="jump" dup="leave">
    <c len="11">cite</c>
    <c len="22">cute</c>
    <c len="33">diff</c>
    <c len="44">ex</c>
    <l></l>
  </b>
  <d>dummy</d>
  <q lang="Esperanto"></q>
</a>
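One possible way to make the routine less clumsy is to collect the output lines in a list instead of a global string. The following is only a sketch under a few assumptions: the name to_xml is hypothetical, text and attribute values are escaped with xml.sax.saxutils (the routine above writes them as they are), and an element that has both text and sub-elements writes the text first and then the sub-elements. It is written so that it also runs under Python 3.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(tree, indent="  "):
    """Serialize nested lists and dictionaries to XML with sorted tags."""
    lines = []

    def emit(tag, items, depth):
        pad = indent * depth
        if not items:
            # An empty list still stands for one empty element.
            lines.append("%s<%s></%s>" % (pad, tag, tag))
            return
        for item in items:
            if not isinstance(item, dict):
                # Plain text element: <tag>text</tag>
                lines.append("%s<%s>%s</%s>" % (pad, tag, escape(item), tag))
                continue
            # Non-list values are attributes, except the key 'content'.
            attrs = "".join(" %s=%s" % (k, quoteattr(item[k]))
                            for k in sorted(item)
                            if k != 'content'
                            and not isinstance(item[k], list))
            children = [k for k in sorted(item) if isinstance(item[k], list)]
            text = escape(item.get('content', ''))
            if not children:
                lines.append("%s<%s%s>%s</%s>" % (pad, tag, attrs, text, tag))
                continue
            lines.append("%s<%s%s>%s" % (pad, tag, attrs, text))
            for k in children:
                emit(k, item[k], depth + 1)
            lines.append("%s</%s>" % (pad, tag))

    for tag in sorted(tree):
        emit(tag, tree[tag], 0)
    return "\n".join(lines)

# Made-up sample with a character that needs escaping.
sample = {'a': [{'d': ['Tom & Jerry'], 'q': [{'lang': 'Esperanto'}]}]}
print(to_xml(sample))
```

For the sample this prints <a>, then <d>Tom &amp; Jerry</d> and <q lang="Esperanto"></q> indented by two spaces, then </a>. Unlike the routine above, the result has no leading newline.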

(c) Mato Nagel, Weißwasser 2004-2013