To format XML files I usually use option 3 described in How to pretty print XML on GNU/Linux. Recently, however, I have needed to work with XML files that, in addition to being obfuscated, have one part encoded with HTML entities. To start, suppose we have an XML file like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Edit_Mensaje SYSTEM "Edit_Mensaje.dtd" >
<Edit_Mensaje>
  <Mensaje>
    <Remitente>
      <Nombre>Nombre del remitente</Nombre>
      <Mail>Correo del remitente</Mail>
    </Remitente>
    <Destinatario>
      <Nombre>Nombre del destinatario</Nombre>
      <Mail>Correo del destinatario</Mail>
    </Destinatario>
    <Texto>
      <Asunto>
        Este es mi documento con una estructura muy sencilla
        no contiene atributos ni entidades…
      </Asunto>
      <Parrafo>
        Este es mi documento con una estructura muy sencilla
        no contiene atributos ni entidades…
      </Parrafo>
    </Texto>
  </Mensaje>
</Edit_Mensaje>
Note: file retrieved from Wikipedia: Extensible Markup Language.
Now suppose that, instead of being laid out as shown above, the file looks like this:
<Edit_Mensaje><Mensaje><Remitente><Nombre>Nombre del remitente</Nombre><Mail>Correo del remitente</Mail></Remitente><Destinatario><Nombre>Nombre del destinatario</Nombre><Mail>Correo del destinatario</Mail></Destinatario><Texto><Asunto>Este es mi documento con una estructura muy sencilla no contiene atributos ni entidades...</Asunto><Parrafo>Este es mi documento con una estructura muy sencilla no contiene atributos ni entidades... </Parrafo></Texto></Mensaje></Edit_Mensaje>
Not very nice, right? We cannot apply the solutions for formatting XML files mentioned above, because the XML contains HTML entities, so I developed a script in Python and then in PHP.
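To illustrate the problem, here is a minimal Python 3 sketch. It assumes the entities in question are named HTML entities such as &nbsp;, which XML does not predefine; the fragment and the helper name parses_as_xml are made up for the example. A plain XML parser rejects the raw text and accepts it once the entities are decoded:

```python
import html
import xml.dom.minidom as minidom
from xml.parsers.expat import ExpatError


def parses_as_xml(text):
    """Return True if minidom accepts text as well-formed XML."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


# Hypothetical fragment: &nbsp; is an HTML entity that XML does not predefine
raw = "<Asunto>Este es&nbsp;mi documento</Asunto>"

print(parses_as_xml(raw))                 # parser sees an undefined entity
print(parses_as_xml(html.unescape(raw)))  # entity decoded before parsing
```

If the entities instead hide markup (for example &lt;Nombre&gt;), the parser accepts the file but treats that part as plain text, so a formatter leaves it untouched.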
Scripts
Both scripts follow the same logic:
- Read the file name from the first command-line argument
- Check that the file exists and is readable
- Store the contents of the file in a variable
- Decode the HTML entities
- Format the XML
- Print the XML in a readable format
Python script
#!/usr/bin/python
# Python 2 script (uses the HTMLParser module and the print statement)
import os
import re
import HTMLParser as parser
import xml.dom.minidom as minidom
import sys

try:
    # Read the file name from the first command-line argument
    filename = sys.argv[1]
    if os.path.isfile(filename) and os.access(filename, os.R_OK):
        # Open the file in read-only mode
        file = open(filename, 'r')
        # Read the file and decode HTML entities
        xml = parser.HTMLParser().unescape(file.read())
        # Prettify the XML
        xml = minidom.parseString(xml).toprettyxml()
        # Work around minidom adding extra whitespace
        # before/after CDATA sections
        xml = re.sub(r'>\s+<!', '><!', xml)
        xml = re.sub(r']>\s+<', ']><', xml)
        # Remove empty lines
        # Thanks to http://stackoverflow.com/questions/1140958/whats-a-quick-one-liner-to-remove-empty-lines-from-a-python-string
        print "".join([s for s in xml.strip().splitlines(True) if s.strip()])
    else:
        print "File is missing or is not readable!"
except IndexError:
    print "You must specify a file name!"
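The script above targets Python 2 (the HTMLParser module and the print statement). As a rough sketch of the same steps for Python 3, assuming html.unescape as the replacement for HTMLParser().unescape and assuming the file is UTF-8 encoded, it could look like the following. The function names pretty_xml and main are introduced here only for illustration:

```python
#!/usr/bin/env python3
import html
import os
import re
import sys
import xml.dom.minidom as minidom


def pretty_xml(text):
    """Decode HTML entities in text, then return it as indented XML."""
    decoded = html.unescape(text)
    pretty = minidom.parseString(decoded).toprettyxml()
    # minidom adds whitespace around CDATA sections; remove it again
    pretty = re.sub(r">\s+<!", "><!", pretty)
    pretty = re.sub(r"\]>\s+<", "]><", pretty)
    # Drop lines that contain only whitespace
    return "".join(
        line for line in pretty.strip().splitlines(True) if line.strip()
    )


def main(argv):
    """Mirror the original flow: check the argument, check the file, print."""
    if len(argv) < 2:
        print("You must specify a file name!")
        return
    filename = argv[1]
    if not (os.path.isfile(filename) and os.access(filename, os.R_OK)):
        print("File is missing or is not readable!")
        return
    # Assumption: the input file is UTF-8 encoded
    with open(filename, "r", encoding="utf-8") as handle:
        print(pretty_xml(handle.read()))


if __name__ == "__main__":
    main(sys.argv)
```

Saved as, say, format_xml.py (a hypothetical name), it would be run as python3 format_xml.py garbage.xml. Note that decoding the whole file also decodes entities such as &amp; that were escaped on purpose, which can leave the result not well-formed; in that case minidom raises a parse error.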
PHP script
#!/usr/bin/env php
<?php
// Check that the script is called with an argument
if (empty($argv[1])) {
    die('You must specify a file!');
}

// Set the file name
$file = $argv[1];

// Verify that the file exists and is readable
if (!is_readable($file)) {
    die('File is missing or is not readable!');
}

// Get the file content
$content = file_get_contents($file);

// Verify the content is not empty
if (empty($content)) {
    die('File is empty, nothing to do ;)');
}

// Decode HTML entities
$content = html_entity_decode($content);

// Parse the XML and format it
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($content);

// Print the result
echo $doc->saveXML();
We can now integrate these scripts with gedit. To do so (in this case using the script developed in PHP):
- Run gedit
- Go to Menu > Tools > Manage External Tools
- Add a new external tool and set the values as shown in the figure
Open the XML file, which we will call garbage.xml.
Then format it using the key combination assigned when the external tool was set up in gedit, as shown in the figure.
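The same integration could be done with a Python tool. The sketch below is an assumption rather than the configuration from the figure: it relies on the GEDIT_CURRENT_DOCUMENT_PATH environment variable that gedit's External Tools plugin sets for the tools it runs, and the function name format_text is made up for the example. With the tool's output set to replace the current document, gedit would show the formatted XML in place:

```python
#!/usr/bin/env python3
import html
import os
import sys
import xml.dom.minidom as minidom


def format_text(text):
    """Decode HTML entities and return the text as indented XML."""
    pretty = minidom.parseString(html.unescape(text)).toprettyxml()
    # Drop lines that contain only whitespace
    return "".join(line for line in pretty.splitlines(True) if line.strip())


if __name__ == "__main__":
    # Set by gedit's External Tools plugin; unset when run from a plain shell
    path = os.environ.get("GEDIT_CURRENT_DOCUMENT_PATH")
    if path:
        # Assumption: the document is saved on disk and UTF-8 encoded
        with open(path, "r", encoding="utf-8") as handle:
            sys.stdout.write(format_text(handle.read()))
    else:
        sys.stderr.write("GEDIT_CURRENT_DOCUMENT_PATH is not set\n")
```

Because the sketch reads the file from disk, the document should be saved before the tool runs; otherwise unsaved changes are not included in the result.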