Formatting XML using Python or PHP

To format XML files I usually use option 3 described in How to pretty print XML on GNU/Linux. Recently, though, I needed to work with XML files that, in addition to being obfuscated, have part of their content encoded as HTML entities. To start, suppose we have an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Edit_Mensaje SYSTEM "Edit_Mensaje.dtd" >
<Edit_Mensaje>
  <Mensaje>
    <Remitente>
      <Nombre>Nombre del remitente</Nombre>
      <Mail>Correo del remitente</Mail>
    </Remitente>
    <Destinatario>
      <Nombre>Nombre del destinatario</Nombre>
      <Mail>Correo del destinatario</Mail>
    </Destinatario>
    <Texto>
      <Asunto>
        Este es mi documento con una estructura muy sencilla
        no contiene atributos ni entidades…
      </Asunto>
      <Parrafo>
        Este es mi documento con una estructura muy sencilla
        no contiene atributos ni entidades…
      </Parrafo>
    </Texto>
  </Mensaje>
</Edit_Mensaje>

Note: File retrieved from Wikipedia: Extensible Markup Language

Now suppose that, instead of the layout shown above, the file arrives like this:

<Edit_Mensaje><Mensaje>&lt;Remitente&gt;&lt;Nombre&gt;Nombre del remitente&lt;/Nombre&gt;&lt;Mail&gt;Correo del remitente&lt;/Mail&gt;&lt;/Remitente&gt;&lt;Destinatario&gt;&lt;Nombre&gt;Nombre del destinatario&lt;/Nombre&gt;&lt;Mail&gt;Correo del destinatario&lt;/Mail&gt;&lt;/Destinatario&gt;&lt;Texto&gt;&lt;Asunto&gt;Este es mi documento con una estructura muy sencilla no contiene atributos ni entidades...&lt;/Asunto&gt;&lt;Parrafo&gt;Este es mi documento con una estructura muy sencilla no contiene atributos ni entidades... &lt;/Parrafo&gt;&lt;/Texto&gt;</Mensaje></Edit_Mensaje>

Not very nice, right? We also cannot apply the solutions offered in format XML files, since the XML contains HTML entities. So I developed a script in Python and then one in PHP.
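
To see why an ordinary pretty printer does not help here, the sketch below parses a shortened version of the escaped file with Python 3's xml.dom.minidom. The sample string and the variable names are made up for illustration. Parsed as-is, the escaped markup is just text inside Mensaje, so there are no child elements to indent. Decoding the entities first exposes the real structure.

```python
import html
import xml.dom.minidom as minidom

# Shortened, hypothetical version of the escaped file shown above
escaped = (
    "<Edit_Mensaje><Mensaje>"
    "&lt;Remitente&gt;&lt;Nombre&gt;Nombre del remitente&lt;/Nombre&gt;&lt;/Remitente&gt;"
    "</Mensaje></Edit_Mensaje>"
)

# Parsed as-is, <Mensaje> holds only text and no child elements
mensaje = minidom.parseString(escaped).getElementsByTagName("Mensaje")[0]
children_before = [node.nodeName for node in mensaje.childNodes]

# After decoding the entities, the same parser finds the nested element
decoded = html.unescape(escaped)
mensaje = minidom.parseString(decoded).getElementsByTagName("Mensaje")[0]
children_after = [node.nodeName for node in mensaje.childNodes]

print(children_before)
print(children_after)
```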

Scripts

Both scripts follow the same logic:

  1. Read the file name from the first command-line argument
  2. Check that the file exists and can be read
  3. Store the contents of the file in a variable
  4. Decode the HTML entities
  5. Format the XML
  6. Print the XML in a readable format

Python script

#!/usr/bin/env python3
import html
import os
import re
import sys
import xml.dom.minidom as minidom

try:
    # Read the file name from the first command-line argument
    filename = sys.argv[1]
    if os.path.isfile(filename) and os.access(filename, os.R_OK):
        # Open the file in read-only mode, read it and decode HTML entities
        with open(filename, 'r') as file:
            xml = html.unescape(file.read())

        # Prettify the XML
        xml = minidom.parseString(xml).toprettyxml()

        # Handle an issue with CDATA sections: minidom can add extra
        # whitespace before/after CDATA
        xml = re.sub(r'>\s+<!', '><!', xml)
        xml = re.sub(r']>\s+<', ']><', xml)

        # Remove empty lines
        # Thanks to http://stackoverflow.com/questions/1140958/whats-a-quick-one-liner-to-remove-empty-lines-from-a-python-string
        print("".join([s for s in xml.strip().splitlines(True) if s.strip()]))
    else:
        print("File is missing or is not readable!")
except IndexError:
    print("You must specify a file name!")
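
The two re.sub calls in the script are easier to follow on a small input. The sketch below runs them on a made-up one-line document with a CDATA section, using Python 3. Whether the raw toprettyxml() output actually contains the extra whitespace depends on the Python version, so the example only reports whether the substitutions changed anything instead of assuming a fixed result.

```python
import re
import xml.dom.minidom as minidom

# Hypothetical one-line input with a CDATA section
source = "<r><a><![CDATA[x < y]]></a></r>"

raw = minidom.parseString(source).toprettyxml()

# Same substitutions as in the script: drop whitespace around the CDATA section
cleaned = re.sub(r'>\s+<!', '><!', raw)
cleaned = re.sub(r']>\s+<', ']><', cleaned)

print(cleaned)
print("substitutions changed the output:", raw != cleaned)
```

Either way, after the substitutions the CDATA section sits directly between its opening and closing tags.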
    

PHP script

#!/usr/bin/env php
<?php

// Check the script is called with an argument
if (empty($argv[1])) {
    die('You must specify a file!');
}

// Set the file name
$file = $argv[1];

// Verify the file exists and is readable
if (!is_readable($file)) {
    die('File is missing or is not readable!');
}

// Get the file content
$content = file_get_contents($file);

// Verify the content is not empty
if (empty($content)) {
    die('File is empty nothing to do ;)');
}

// Decode html entities
$content = html_entity_decode($content);

// Parse the xml and format it
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($content);

// Print the result
echo $doc->saveXML();

We can now integrate the scripts with gedit. In this case we will use the PHP script:

  1. Run gedit
  2. Menu > Tools > Manage External Tools
  3. Add a new external tool and set the values as shown in the figure

[Figure: gedit-external-tool]
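
The settings in the figure are not reproduced here, so the following is only a rough sketch and not the configuration used in this post. gedit's External Tools plugin can send the current document to a tool's standard input and replace the document with the tool's output. Assuming a tool configured that way (Input: Current document, Output: Replace current document), a Python 3 variant could read from standard input instead of a file name. The names format_escaped_xml and main are hypothetical.

```python
import html
import io
import sys
import xml.dom.minidom as minidom


def format_escaped_xml(text):
    """Decode HTML entities in text, then return it as indented XML."""
    decoded = html.unescape(text)
    pretty = minidom.parseString(decoded).toprettyxml(indent="  ")
    # Drop any blank lines left in the pretty-printed output
    return "\n".join(line for line in pretty.splitlines() if line.strip())


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Assumption: the editor pipes the document in and reads the result back
    stdout.write(format_escaped_xml(stdin.read()) + "\n")


# Demonstration with in-memory streams; as an external tool, call main() instead
out = io.StringIO()
main(io.StringIO("<a>&lt;b&gt;text&lt;/b&gt;</a>"), out)
print(out.getvalue())
```

Keeping the formatting in a function that takes and returns a string makes it easy to try outside the editor before wiring it to a shortcut.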

We open our XML file, which we will call garbage.xml:

[Figure: garbage-xml]

We then format it using the key combination we assigned when we set up the external tool in gedit. The result is shown in the figure.

[Figure: ppxml]

Further reading

* Formatted XML files



