RancidMeat: XMLStarlet

Occasionally I have to parse XML, which is often not grep-friendly. Luckily, there’s XMLStarlet (the actual command’s named ‘xml’, though some distro packages install it as ‘xmlstarlet’) to make it relatively easy, if long-winded for a prompt. It’s in most Linux repositories, and there’s a Windows version.
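
Since the command name varies between installs, a script can look it up once instead of hardcoding it. This is only a minimal sketch: the helper name first_available_command and the variable xml_cmd are made up for illustration, and it assumes the binary is called either ‘xmlstarlet’ or ‘xml’.

```shell
# Print the first of the given command names that exists on PATH.
# (Hypothetical helper, not part of XMLStarlet.)
first_available_command() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Assumption: the binary is named either "xmlstarlet" or "xml" here.
if xml_cmd=$(first_available_command xmlstarlet xml); then
  echo "XMLStarlet command: $xml_cmd"
else
  echo "XMLStarlet not found on PATH" >&2
fi
```

After that, "$xml_cmd" sel ... can stand in for xml sel ... in any of the commands below.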

For this post, I’ll be referring to an example file, futurama_cast.xml, that looks like this (only one of its eight Character entries is shown).

<?xml version="1.0"?>
<Cast>
<Character gender='M'>
  <Name>Fry</Name>
  <Job>Delivery Boy</Job>
  <Species>Human</Species>
  <From>
    <Location>Earth</Location>
    <Location>New York</Location>
    <Location>Brooklyn</Location>
  </From>
</Character>

</Cast>

 
 
1) Loop through all characters, printing the value of their name tag and a newline.

xml sel -t -m "/Cast/Character" \
-v "Name" \
-n futurama_cast.xml

# Zapp
# Leela
# Amy
# Zoidberg
# Fry
# Bender
# Professor
# Kif

 

2) Count the characters (print the value returned by count()).

xml sel -t -v "count(/Cast/Character)" futurama_cast.xml

# 8

 
 
3) Loop through characters based on an attribute.

xml sel -t -m "/Cast/Character[@gender='F']" \
-v "Name" \
-n futurama_cast.xml

# Leela
# Amy

 
 
4) Loop through characters based on a child tag.

xml sel -t -m "/Cast/Character[Job='Captain']" \
-v "Name" \
-n futurama_cast.xml

# Zapp
# Leela

 
 
5) Find doctors’ home planets (loop over the From tags of characters with a certain job, and print the first Location in each).

xml sel -t -m "/Cast/Character[Job='Doctor']/From" \
-v "Location" \
-n futurama_cast.xml

# Decapod 10

 
 
6) Print multiple values, with a hardcoded string between them.*

xml sel -t -m "/Cast/Character" \
-v "Name" -o ": " \
-v "Species" \
-n futurama_cast.xml

# Zapp: Human
# Leela: Mutant Human
# Amy: Human
# Zoidberg: Decapodian
# Fry: Human
# Bender: Robot
# Professor: Human
# Kif: Amphibiosan

 
 
7) For each character, print the name and a string, then begin a nested loop over any locations, printing their text and a string, then break out of that nest-level.

xml sel -t -m "/Cast/Character" \
-v "Name" -o ": " \
-m "From/Location" -v "text()" -o "," -b \
-n futurama_cast.xml

# Zapp: Earth,
# Leela: Earth,New New York,New New York City,
# Amy: Mars,
# Zoidberg: Decapod 10,
# Fry: Earth,New York,Brooklyn,
# Bender: Earth,Mexico,Tijuana,
# Professor: Earth,New New York,Manhattan,
# Kif: Amphibios 9,

 
 
8) Same as above, but this time nesting an IF to only print the hardcoded string on all except the last location, then break out of the IF.

xml sel -t -m "/Cast/Character" \
-v "Name" -o ": " \
-m "From/Location" -v "text()" \
-i "not(position()=last())" -o ", " -b \
-b \
-n futurama_cast.xml

# Zapp: Earth
# Leela: Earth, New New York, New New York City
# Amy: Mars
# Zoidberg: Decapod 10
# Fry: Earth, New York, Brooklyn
# Bender: Earth, Mexico, Tijuana
# Professor: Earth, New New York, Manhattan
# Kif: Amphibios 9

 
 
The args are XPath 1.0 expressions. -m loops over the set of nodes its expression matches. -v prints the value of its expression, taking the first node when it matches several. -i and [...] are also XPath, but they’re evaluated as booleans. It’s a query language, so any serious text manipulation needs to be done with other commands/scripts later in the pipeline.
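
As a rough illustration of pushing the text manipulation further down the pipeline, here is a sketch that tallies characters per species from the "Name: Species" lines of example 6. The function names are made up, and name_species_lines just replays that example’s output so the snippet runs without the XML file; in practice the xml sel command would be piped in instead.

```shell
# Stand-in for the output of example 6 (Name: Species lines).
name_species_lines() {
  printf '%s\n' \
    'Zapp: Human' 'Leela: Mutant Human' 'Amy: Human' 'Zoidberg: Decapodian' \
    'Fry: Human' 'Bender: Robot' 'Professor: Human' 'Kif: Amphibiosan'
}

# Count lines per species (the text after ": "), most common first,
# ties ordered by species name.
count_by_species() {
  awk -F': ' '{ count[$2]++ } END { for (s in count) print count[s], s }' \
    | sort -k1,1nr -k2
}

name_species_lines | count_by_species
# First line of output: 4 Human
```

XMLStarlet only has to do the selecting; awk and sort handle the grouping and ordering.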
 
 
* I normally go with -o "\t" as a delimiter; XMLStarlet prints a literal backslash-t, necessitating a pipe through sed for real tabs.
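
A sketch of that sed step, written so it doesn’t depend on sed understanding \t in the replacement (the function name unescape_tabs is made up, and the printf line is a stand-in for XMLStarlet output):

```shell
# A real tab character, produced portably.
tab=$(printf '\t')

# Replace each literal backslash-t pair on stdin with a real tab.
# Inside double quotes the shell turns \\\\t into \\t, which sed reads as
# "a backslash followed by t".
unescape_tabs() {
  sed "s/\\\\t/${tab}/g"
}

# Stand-in for a line printed by xml sel with -o "\t".
printf '%s\n' 'Fry\tHuman' | unescape_tabs
```

In bash, passing -o $'\t' should hand XMLStarlet a real tab in the first place and make the sed step unnecessary, though that relies on a bash-specific quoting form.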

Example: Parsing the Freshmeat Atom feed…
Fetch (one try with a 2 sec timeout) the Atom feed (whose tags are in a namespace, bound here to the prefix ‘x’), loop over the entries, and print the titles and links.

wget -t 1 -T 2 -O - -q \
"http://freshmeat.net/?format=atom" \
| xml sel -N x="http://www.w3.org/2005/Atom" -t \
-m "/x:feed/x:entry" \
-v "x:title" -o "\t" -v "x:link/@href" -n \
| sed 's/\\t/\t/g'  # single quotes: sed sees \\t (a literal backslash-t) and swaps in a tab (GNU sed)
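
Once the output is tab-separated, the rest of the pipeline can split it back into fields. A minimal sketch, assuming one "title<TAB>link" pair per line; the sample titles and example.com links are placeholders rather than real feed data, and the function names are made up.

```shell
tab=$(printf '\t')

# Hypothetical stand-in for the pipeline's output: title<TAB>link per line.
sample_feed_lines() {
  printf '%s\t%s\n' \
    'Example Project 1.0' 'http://example.com/projects/example' \
    'Another Tool 2.3' 'http://example.com/projects/another'
}

# Split each line on the tab and print the link indented under its title.
format_entries() {
  while IFS="$tab" read -r title link; do
    printf '%s\n  %s\n' "$title" "$link"
  done
}

sample_feed_lines | format_entries
```

A tab is a handy delimiter here because titles often contain spaces and commas but rarely tabs.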
