
Bash – Remove XML nodes if the attribute value of a child node does not equal a specific value?

I have an RSS feed like this:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>my feed</title>
  <link rel="self" href="http://myhomesite.com/articles/feed/"/>
  <updated>2019-11-04T12:45:00Z</updated>
  <id>http://myhomesite.com/articles/feed/?dt=2019-11-04T12:45:00Z</id>
  <entry>
    <id>id0</id>
    <link rel="alternate" type="text/html" href="https://yandex.ru/link123"/>
    <author>
      <name/>
    </author>
    <published>2019-11-04T12:45:00Z</published>
    <updated>2019-11-04T12:45:00Z</updated>
    <title type="html"><![CDATA[foo bar foo bar]]></title>
    <content type="html"><![CDATA[]]></content>
  </entry>
  <entry>
    <id>id2</id>
    <link rel="alternate" type="text/html" href="https://myhomesite.com"/>
    <author>
      <name/>
    </author>
    <published>2019-11-04T09:45:00Z</published>
    <updated>2019-11-04T09:45:00Z</updated>
    <title type="html"><![CDATA[foo bar foo bar]]></title>
    <content type="html"><![CDATA[]]></content>
  </entry>
....

I want to remove all nodes (/feed/entry) where link href != http://myhomesite.com.

How do I remove an XML node whose attribute value does not start with specified characters using Bash?

Answer

Bash features by themselves are not well suited to parsing XML.

This renowned Bash FAQ states the following:

Do not attempt [to extract data from an XML file] with sed, awk, grep, and so on (it leads to undesired results).

Consider using an XML-specific command line tool, such as XMLStarlet. See download info here if you don’t already have XMLStarlet installed.


Solution:

Using XMLStarlet you can run the following command to output the desired results to your terminal:

xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss

Note: The /path/to/file.rss part at the end of the command shown above should be substituted with the real pathname to the actual .rss file.

Explanation:

The parts of the aforementioned command break down as follows:

  • xml – invoke the XMLStarlet command.

  • ed – Edit/Update the XML document.

  • -N x="http://www.w3.org/2005/Atom" – The -N option binds the namespace, i.e. http://www.w3.org/2005/Atom, to a prefix that we’ve arbitrarily named x.

  • -d – delete node(s) that are matched.

  • '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' The expression used to find/match the appropriate nodes as specified in your question.

    all nodes (/feed/entry) where link href != http://myhomesite.com.

    As you can see, in the XPath expression we prepend the x prefix to the element node names, i.e. x:entry and x:link to ensure we address the elements in the correct namespace.

  • /path/to/file.rss – A pathname to the source .rss file.
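
One way to check the command before pointing it at a real feed is to run it against a small throwaway file. The following is a minimal sketch, assuming the XMLStarlet binary is installed as either xmlstarlet or xml; the sample feed is a cut-down version of the one in the question, and the variable names (feed, ns, xs, remaining_ids) are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: build a two-entry Atom feed, delete the non-matching entries,
# then list the ids of the entries that are left.
# Assumption: the XMLStarlet binary is installed as "xmlstarlet" or "xml".

feed=$(mktemp)
cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>my feed</title>
  <entry>
    <id>id0</id>
    <link rel="alternate" type="text/html" href="https://yandex.ru/link123"/>
  </entry>
  <entry>
    <id>id2</id>
    <link rel="alternate" type="text/html" href="https://myhomesite.com"/>
  </entry>
</feed>
EOF

ns='x=http://www.w3.org/2005/Atom'
remaining_ids=""
if xs=$(command -v xmlstarlet || command -v xml); then
  # "ed" writes the edited document to stdout; "sel" reads it from stdin.
  remaining_ids=$(
    "$xs" ed -N "$ns" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' "$feed" |
      "$xs" sel -N "$ns" -t -v '//x:entry/x:id'
  )
  echo "remaining entry ids: $remaining_ids"
else
  echo "XMLStarlet not found; nothing was run" >&2
fi
rm -f "$feed"
```

With this sample feed, only the entry whose link points at https://myhomesite.com (id2) should survive the delete.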

Saving the resultant XML (RSS)

To save the resultant XML you can either:

  1. Add the --inplace option to the aforementioned command – this will overwrite the original .rss with the desired result. For instance:

     xml ed --inplace -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss
    
  2. Or, use the redirection operator (>) and specify a pathname to the location at which to save the output. For instance, the following compound command will save the results to a new file:

     xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss > /path/to/results.rss
    

    Note: The /path/to/results.rss at the end of the aforementioned compound command should be substituted with a real pathname to where you want to save the new file.
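
A plain redirect creates (or truncates) the output file even when the command fails, for example on a malformed feed. A more defensive variant is sketched below: it writes to a temporary file first and only moves it into place when XMLStarlet exits successfully. The helper name filter_feed is hypothetical, and the binary is assumed to be installed as xmlstarlet or xml.

```shell
#!/usr/bin/env bash
# Sketch: write the filtered feed to a temporary file and replace the
# destination only if XMLStarlet succeeded.
# "filter_feed" is an illustrative helper, not part of XMLStarlet.
filter_feed() {
  local src=$1 dst=$2 xs tmp
  xs=$(command -v xmlstarlet || command -v xml) || {
    echo "XMLStarlet not found" >&2
    return 127
  }
  tmp=$(mktemp) || return 1
  if "$xs" ed -N x="http://www.w3.org/2005/Atom" \
       -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' \
       "$src" > "$tmp"; then
    mv "$tmp" "$dst"
  else
    # Leave the destination untouched when the edit fails.
    rm -f "$tmp"
    return 1
  fi
}

# Usage: filter_feed /path/to/file.rss /path/to/results.rss
```

Passing the same pathname for both arguments gives an effect similar to --inplace, with the difference that the original is left alone if the edit fails.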

XPath with local-name():

Given that your example source XML (RSS) uses only a default namespace, with no prefixed element names, it’s also possible to use XPath’s local-name() function. This removes the need to bind the namespace using XMLStarlet’s -N option. For example:

xml ed -d '//*[local-name() = "entry" and not(child::*[local-name() = "link"][@href="https://myhomesite.com"])]' /path/to/file.rss

IMPORTANT: You may need to substitute the leading xml part in all the example commands shown in this post with xmlstarlet instead. For example:

xmlstarlet ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss
^^^^^^^^^^
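
If a script has to run on machines where the command name differs, it can look the name up once and reuse it. This is a small sketch, assuming the binary is called either xmlstarlet or xml; find_xmlstarlet is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: pick whichever XMLStarlet command name exists on this system.
# Assumption: the binary is called either "xmlstarlet" or "xml".
find_xmlstarlet() {
  local name
  for name in xmlstarlet xml; do
    if command -v "$name" >/dev/null 2>&1; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1
}

if xs=$(find_xmlstarlet); then
  echo "using: $xs"
  # Then, for example: "$xs" ed -d '...' /path/to/file.rss
else
  echo "XMLStarlet does not appear to be installed" >&2
fi
```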

Edit:

Given your example XML it’s also possible to use a simplified syntax for the default namespace, which is to use _: instead of x:. By using an underscore (_) you don’t need the -N option to bind the namespace to a prefix. Refer to the section titled 1.3. A More Convenient Solution in the XMLStarlet documentation for further information regarding this feature.

For instance:

xml ed -d '//_:entry[not(child::_:link[@href="https://myhomesite.com"])]' /path/to/file.rss

To further understand using XMLStarlet when your source XML uses namespaces I suggest also reading Namespaces and default namespace in the documentation.


Edit 2:

The author of the OP subsequently wrote the following in the comments:

One question more. Condition [not(child::_:link[@href="myhomesite.com"])] is strict. I wanna be something like start with myhomesite.com but URI not important i.e. myhomesite.com**anything**. It’s possible? [sic]

something like this.. xmlstarlet ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[matches(@href, '^https://myhomesite.com/' )]/@href)]' feed.rs

The matches() function belongs to XPath 2.0, whereas XMLStarlet supports XPath 1.0 only. Consider using XPath’s starts-with() function instead, with any one of the previously given examples. For example:

  • Using the -N option and starts-with():

    xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[starts-with(@href, "https://myhomesite.com")])]' file.rss
    
  • Using the local-name() and starts-with():

    xml ed -d '//*[local-name() = "entry" and not(child::*[local-name() = "link"][starts-with(@href, "https://myhomesite.com")])]' file.rss
    
  • Using the simplified syntax for the default namespace, i.e. an underscore, and starts-with():

    xml ed -d '//_:entry[not(child::_:link[starts-with(@href, "https://myhomesite.com")])]' file.rss
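
When the prefix comes from a shell variable rather than being typed into the expression, the XPath string can be assembled before it is passed to XMLStarlet. The sketch below assumes the underscore syntax shown above; build_xpath and prefix are illustrative names, not part of XMLStarlet. XPath 1.0 string literals have no escape syntax, so the helper rejects a value containing a double quote instead of trying to escape it.

```shell
#!/usr/bin/env bash
# Sketch: build the delete expression from a shell variable.
# "build_xpath" and "prefix" are illustrative names, not part of XMLStarlet.
build_xpath() {
  local prefix=$1
  # A double quote would terminate the XPath string literal early, so refuse it.
  case $prefix in
    *'"'*)
      echo "prefix must not contain a double quote" >&2
      return 1
      ;;
  esac
  printf '//_:entry[not(child::_:link[starts-with(@href, "%s")])]' "$prefix"
}

prefix='https://myhomesite.com'
if xpath=$(build_xpath "$prefix"); then
  echo "$xpath"
  # Then, for example: xmlstarlet ed -d "$xpath" file.rss
fi
```

Note that a prefix of https://myhomesite.com also matches hosts such as https://myhomesite.com.example.org. Adding a trailing slash to the prefix tightens the match, but it would then no longer match the bare https://myhomesite.com link in the sample feed.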
    
User contributions licensed under: CC BY-SA