Bash – Remove XML nodes if the attribute value of a child node does not equal a specific value?

Question

I have an RSS feed, like this:

I want to remove all nodes (/feed/entry) where link href != http://myhomesite.com. How do I remove XML nodes based on whether a value starts with specified characters, using Bash?

Accepted Answer

Bash features by themselves are not well suited to parsing XML. This renowned Bash FAQ states the following: "Do not attempt [to extract data from an XML file] with sed, awk, grep, and so on (it leads to undesired results)."

Consider using an XML-specific command-line tool, such as XMLStarlet. See the XMLStarlet download information if you don't already have it installed.

Solution:

Using XMLStarlet you can run the following command to output the desired result to your terminal:

    xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss

Note: Substitute the `/path/to/file.rss` part at the end of the command with the real pathname of the actual .rss file. Also make sure the URL in the expression, including its scheme (http or https), matches the href values in your feed.

Explanation:

The parts of the command break down as follows:

- `xml`: invokes the XMLStarlet command.
- `ed`: edits/updates the XML document.
- `-N x="http://www.w3.org/2005/Atom"`: the `-N` option binds the namespace, i.e. `http://www.w3.org/2005/Atom`, to a prefix that we've arbitrarily named `x`.
- `-d`: deletes the node(s) that are matched.
- `'//x:entry[not(child::x:link[@href="https://myhomesite.com"])]'`: the XPath expression used to match the nodes specified in your question, i.e. "all nodes (/feed/entry) where link href != http://myhomesite.com". In the XPath expression we prepend the `x` prefix to the element node names, i.e. `x:entry` and `x:link`, to ensure we address the elements in the correct namespace.
- `/path/to/file.rss`: a pathname to the source .rss file.

Saving the resultant XML (RSS)

To save the resultant XML you can either:

- Add the `--inplace` option to the command. This overwrites the original .rss file with the desired result. For instance:

      xml ed --inplace -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss

- Or, use the redirection operator (`>`) and specify a pathname to the location at which to save the output. For instance, the following command saves the result to a new file:

      xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss > /path/to/results.rss

  Note: Substitute `/path/to/results.rss` with a real pathname to where you want to save the new file.

XPath with `local-name()`:

Given that your example source XML (RSS) does not include any prefixed QNames, it's also possible to use XPath's `local-name()` function. This removes the need to bind the namespace using XMLStarlet's `-N` option. For example:

    xml ed -d '//*[local-name() = "entry" and not(child::*[local-name() = "link"][@href="https://myhomesite.com"])]' /path/to/file.rss

IMPORTANT: You may need to replace the leading `xml` in all the example commands shown in this post with `xmlstarlet`. For example:

    xmlstarlet ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss
    ^^^^^^^^^^

Edit:

Given your example XML, it's also possible to use a simplified syntax for the default namespace, which is to use `_:` instead of `x:`. By using an underscore (`_`) you don't need the `-N` option to bind the namespace to a prefix. Refer to the section titled "1.3. A More Convenient Solution" in the XMLStarlet documentation for further information regarding this feature.

For instance:

    xml ed -d '//_:entry[not(child::_:link[@href="https://myhomesite.com"])]' /path/to/file.rss

To further understand using XMLStarlet when your source XML uses namespaces, I suggest also reading "Namespaces and default namespace" in the documentation.

Edit 2:

The author of the OP subsequently wrote the following in the comments:

    One question more. Condition [not(child::_:link[@href="myhomesite.com"])] is strict. I wanna be something like start with myhomesite.com but URI not important i.e. myhomesite.com**anything**. It's possible? [sic]

    something like this..

    xmlstarlet ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[matches(@href, '^https://myhomesite.com/' )]/@href)]' feed.rs

XMLStarlet is built on libxml2, which implements XPath 1.0, so the XPath 2.0 `matches()` function is not available. Consider using XPath's `starts-with()` function with any one of the previously given examples. For example:

Using the `-N` option and `starts-with()`:

    xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[starts-with(@href, "https://myhomesite.com")])]' file.rss

Using `local-name()` and `starts-with()`:

    xml ed -d '//*[local-name() = "entry" and not(child::*[local-name() = "link"][starts-with(@href, "https://myhomesite.com")])]' file.rss

Using the simplified syntax for the default namespace, i.e. an underscore, and `starts-with()`:

    xml ed -d '//_:entry[not(child::_:link[starts-with(@href, "https://myhomesite.com")])]' file.rss
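
If you run this filter regularly, the pieces above can be wrapped in a small Bash helper. The following is only a sketch under stated assumptions: the function names `xmlstarlet_cmd`, `build_entry_filter`, and `filter_feed` are hypothetical, the feed is assumed to use a default namespace (so the `_:` syntax applies), and the URL prefix is assumed to contain no double quote, since XPath 1.0 string literals cannot escape one.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of XMLStarlet.

# Print the name of whichever XMLStarlet executable is available
# (some systems install it as "xmlstarlet", others as "xml").
xmlstarlet_cmd() {
  local candidate
  for candidate in xmlstarlet xml; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Build the XPath that matches every entry lacking a link whose href
# starts with the prefix given as $1.
# Assumption: the prefix contains no double quote.
build_entry_filter() {
  local prefix=$1
  case $prefix in
    *\"*)
      echo "prefix must not contain a double quote" >&2
      return 1
      ;;
  esac
  printf '//_:entry[not(child::_:link[starts-with(@href, "%s")])]' "$prefix"
}

# Delete the non-matching entries from the feed file $2 and write the
# resulting XML to standard output; the source file is left untouched.
filter_feed() {
  local prefix=$1 feed=$2 cmd xpath
  cmd=$(xmlstarlet_cmd) || { echo "XMLStarlet not found" >&2; return 127; }
  xpath=$(build_entry_filter "$prefix") || return 1
  "$cmd" ed -d "$xpath" "$feed"
}
```

After sourcing the functions, a call such as `filter_feed "https://myhomesite.com" /path/to/file.rss > /path/to/results.rss` would save the filtered feed to a new file, mirroring the redirection approach described earlier. Writing to standard output rather than using `--inplace` keeps the original feed intact while you check the result.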
