Formatting XML using Python or PHP

To format XML files I usually use option 3 described in How to pretty print XML on GNU/Linux, but recently I have had to work with XML files that, in addition to being obfuscated, have part of their content encoded as HTML entities. To start, suppose we have an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Edit_Mensaje SYSTEM "Edit_Mensaje.dtd" >
<Edit_Mensaje>
  <Mensaje>
    <Remitente>
      <Nombre>Nombre del remitente</Nombre>
      <Mail>Correo del remitente</Mail>
    </Remitente>
    <Destinatario>
      <Nombre>Nombre del destinatario</Nombre>
      <Mail>Correo del destinatario</Mail>
    </Destinatario>
    <Texto>
      <Asunto>
        Este es mi documento con una estructura muy sencilla
        no contiene atributos ni entidades…
      </Asunto>
      <Parrafo>
        Este es mi documento con una estructura muy sencilla
        no contiene atributos ni entidades…
      </Parrafo>
    </Texto>
  </Mensaje>
</Edit_Mensaje>

Note: File retrieved from Wikipedia: Extensible Markup Language

However, instead of looking like the listing above, the file arrives like this:

<Edit_Mensaje><Mensaje>&lt;Remitente&gt;&lt;Nombre&gt;Nombre del remitente&lt;/Nombre&gt;&lt;Mail&gt;Correo del remitente&lt;/Mail&gt;&lt;/Remitente&gt;&lt;Destinatario&gt;&lt;Nombre&gt;Nombre del destinatario&lt;/Nombre&gt;&lt;Mail&gt;Correo del destinatario&lt;/Mail&gt;&lt;/Destinatario&gt;&lt;Texto&gt;&lt;Asunto&gt;Este es mi documento con una estructura muy sencilla no contiene atributos ni entidades...&lt;/Asunto&gt;&lt;Parrafo&gt;Este es mi documento con una estructura muy sencilla no contiene atributos ni entidades... &lt;/Parrafo&gt;&lt;/Texto&gt;</Mensaje></Edit_Mensaje>

Not very nice, right? The solutions offered in Format XML files do not help here, because the inner markup is escaped as HTML entities and a formatter treats it as plain text. So I wrote a script, first in Python and then in PHP.
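To see why an ordinary formatter is not enough, here is a minimal Python sketch. The shortened sample string and the variable names are mine, not part of the original file. The parser sees the escaped markup inside Mensaje as a single text node, so pretty printing leaves it untouched:

```python
from xml.dom.minidom import parseString

# Shortened version of the obfuscated file (illustrative sample)
escaped = (
    "<Edit_Mensaje><Mensaje>"
    "&lt;Remitente&gt;&lt;Nombre&gt;Nombre del remitente&lt;/Nombre&gt;&lt;/Remitente&gt;"
    "</Mensaje></Edit_Mensaje>"
)

dom = parseString(escaped)
mensaje = dom.getElementsByTagName("Mensaje")[0]

# Mensaje has exactly one child, and it is text rather than a Remitente element
only_child_is_text = (
    len(mensaje.childNodes) == 1
    and mensaje.firstChild.nodeType == mensaje.TEXT_NODE
)
print(only_child_is_text)

# The escaped markup is written back as entities, still on one line
pretty = dom.toprettyxml(indent="  ")
print(pretty)
```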

Scripts

Both scripts follow the same logic:

  1. Read the file name from the command line
  2. Check that the file exists and can be read
  3. Store the contents of the file in a variable
  4. Decode the HTML entities
  5. Format the XML
  6. Print the XML in a readable format
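
One caveat about step 4: decoding the entities of the whole file also decodes entities that legitimately belong to the data, for example an &amp; inside a text value, and the result may then no longer be well-formed. A possible alternative, sketched here under my own assumptions (the function name expand_escaped_markup and the wrapper element are hypothetical, and this is not what the scripts in this post do), is to let the XML parser decode the entities and then re-parse only the text that turns out to contain markup:

```python
from xml.dom.minidom import Node, parseString
from xml.parsers.expat import ExpatError


def expand_escaped_markup(xml_text: str) -> str:
    """Re-parse text-only elements whose text is itself XML markup."""
    dom = parseString(xml_text)
    # getElementsByTagName("*") returns every element that exists right now
    for element in dom.getElementsByTagName("*"):
        children = list(element.childNodes)
        # Only look at elements that contain text and nothing else
        if not children or any(c.nodeType != Node.TEXT_NODE for c in children):
            continue
        text = "".join(c.data for c in children)
        if "<" not in text:
            continue
        try:
            # Hypothetical wrapper element so several siblings can be parsed at once
            fragment = parseString(f"<wrapper>{text}</wrapper>")
        except ExpatError:
            continue  # not markup after all, keep it as plain text
        for old in children:
            element.removeChild(old)
        for child in list(fragment.documentElement.childNodes):
            element.appendChild(dom.importNode(child, True))
    return dom.toprettyxml(indent="  ")


sample = (
    "<Edit_Mensaje><Mensaje>"
    "&lt;Remitente&gt;&lt;Nombre&gt;Nombre del remitente&lt;/Nombre&gt;&lt;/Remitente&gt;"
    "</Mensaje></Edit_Mensaje>"
)
print(expand_escaped_markup(sample))
```

For the sample file in this post, which contains no such entities, decoding everything up front as the scripts do is simpler and works fine.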

Python script

#!/usr/bin/env python3

import html
import sys
from argparse import ArgumentParser
from pathlib import Path
from xml.dom.minidom import parseString

# The file to format is passed with -f/--file
parser = ArgumentParser(description='Format an XML file')
parser.add_argument('-f', '--file', type=Path, dest='file', required=True,
                    help='File to format/pretty print')
args = parser.parse_args()

file = args.file

# is_file() is False for missing paths and for directories
if not file.is_file():
    print(f"File {file} does not exist or is not a regular file!", file=sys.stderr)
    sys.exit(1)

content = file.read_text()
if not content:
    print(f"File {file} is empty, nothing to do!", file=sys.stderr)
    sys.exit(2)

# Decode the HTML entities first, then parse and pretty print the result
xml = parseString(html.unescape(content))
print(xml.toprettyxml())
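
One thing to watch for: when the input already contains line breaks and indentation (like the first listing in this post), minidom keeps those whitespace-only text nodes and toprettyxml() emits extra blank lines around them. A small workaround, shown here as a sketch (the helper name pretty_without_blank_lines is hypothetical), is to drop the blank lines afterwards:

```python
from xml.dom.minidom import parseString


def pretty_without_blank_lines(xml_text: str, indent: str = "  ") -> str:
    """Pretty print XML and remove the blank lines minidom adds for
    whitespace-only text nodes already present in the input."""
    pretty = parseString(xml_text).toprettyxml(indent=indent)
    # Limitation: blank lines inside multi-line text content are dropped too
    return "\n".join(line for line in pretty.splitlines() if line.strip())


already_indented = "<a>\n  <b>x</b>\n</a>"
print(pretty_without_blank_lines(already_indented))
```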

PHP script

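#!/usr/bin/env php
<?php

// The file to format is the first command-line argument
if (empty($argv[1])) {
    fwrite(STDERR, "The file to format is required\n");
    exit(1);
}

$file_path = $argv[1];

if (!is_readable($file_path)) {
    fwrite(STDERR, "The file $file_path does not exist or cannot be read\n");
    exit(2);
}

if (empty($content = file_get_contents($file_path))) {
    fwrite(STDERR, "The file $file_path is empty\n");
    exit(3);
}

$xml = new DOMDocument();

// preserveWhiteSpace only takes effect if it is set before the document is loaded
$xml->preserveWhiteSpace = false;
$xml->formatOutput = true;

// Decode the HTML entities, then parse the result as XML
if (!$xml->loadXML(html_entity_decode($content))) {
    fwrite(STDERR, "The file $file_path is not well-formed XML after decoding\n");
    exit(4);
}

$xml_formatted = $xml->saveXML();

echo $xml_formatted;

//file_put_contents('formatted.xml', $xml_formatted);

exit(0);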

We can now integrate the previous scripts with gedit (in this case, the PHP script). To do so:

  1. Run gedit
  2. Menu > Tools > Manage External Tools...
  3. Add a new external tool and set the values as shown in the figure

[Figure: gedit external tool settings]
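
The exact settings are only visible in the figure, so the following is just a rough sketch of how such a tool could look, written in Python rather than PHP. It assumes the external tool is configured with Input set to "Current document" and Output set to "Replace current document" (my assumption, not necessarily the settings used here), so that gedit pipes the document to the tool's standard input and replaces it with whatever the tool prints. The function name format_escaped_xml is hypothetical:

```python
#!/usr/bin/env python3
import html
import sys
from xml.dom.minidom import parseString


def format_escaped_xml(text: str) -> str:
    """Decode HTML entities, then pretty print the resulting XML."""
    return parseString(html.unescape(text)).toprettyxml(indent="  ")


if __name__ == "__main__":
    document = sys.stdin.read()
    # Leave an empty document alone instead of failing on it
    if document.strip():
        sys.stdout.write(format_escaped_xml(document))
```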

Next, open the XML file, which here is called garbage.xml.

[Figure: garbage.xml open in gedit]

Then format it with the key combination assigned when the external tool was set up, as shown in the figure.

[Figure: ppxml (formatted result)]

Further readings

* Formatted XML files

