Remove newline before a match – Linux

I want to remove the newline before the </script> in my HTML file with a Linux command (sed, awk…).

Sample input:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>JavaScript Ders 2</title>
        <script type="text/javascript" src="script1.js" language="javascript"> 
        </script>
        <script type="text/javascript" src="script2.js" language="javascript"> 
        </script>
        <script>
            // script kodumuz buraya yazılacak
        </script>
    </head>
    <body>
        <script type="text/javascript" src="script3.js" language="javascript"> 
        </script>
    </body>
</html>

Sample output:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>JavaScript Ders 2</title>
        <script type="text/javascript" src="script1.js" language="javascript"> </script>
        <script type="text/javascript" src="script2.js" language="javascript"> </script>
        <script>
        // script kodumuz buraya yazılacak</script>
    </head>
    <body>
        <script type="text/javascript" src="script3.js" language="javascript"> </script>
    </body>
</html>

I tried several different approaches, but none of them worked.

Answer

First of all, as mentioned in the comments: don't parse XML with regex! Never do it, never think about it. Make it a habit not to think about it! Sometimes it might look like a simple task that can be done with sed, awk, or any other regex tool, but no …

What you can do, on the other hand, if you really want to use awk or sed, is process the file with xmlstarlet first and convert it into the PYX format.

The PYX format is a line-oriented representation of XML documents that is derived from the SGML ESIS format. (see ESIS – ISO 8879 Element Structure Information Set spec, ISO/IEC JTC1/SC18/WG8 N931 (ESIS))
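
As a rough illustration of what "line-oriented" means here, the sketch below labels a few PYX lines by their leading character. The sample lines are hand-written rather than captured from xmlstarlet, and the names pyx_sample and describe_pyx are made up for this example.

```shell
# Hand-written PYX lines (an assumption standing in for real `xmlstarlet pyx` output).
# Newlines inside text are written as a literal backslash followed by "n".
pyx_sample() {
  printf '%s\n' \
    '(script' \
    'Asrc script1.js' \
    '- \n        ' \
    ')script'
}

# Label each line by its first character. One node per line is what makes
# the format easy to handle with line-based tools such as awk or sed.
describe_pyx() {
  awk '
    /^\(/ { print "start tag: " substr($0, 2); next }
    /^\)/ { print "end tag:   " substr($0, 2); next }
    /^A/  { print "attribute: " substr($0, 2); next }
    /^-/  { print "text:      " substr($0, 2); next }
          { print "other:     " $0 }
  '
}

pyx_sample | describe_pyx
```

The whitespace-only text line sitting right before the end tag is the "newline before </script>" from the question.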

So what you really want to do is something like:

$ xmlstarlet pyx file.html | do_your_magic_here | xmlstarlet depyx > file.new.html

In your case this would be something like:

$ xmlstarlet pyx file.html \
  | awk 'c~/^- *\\n *$/&&/^\)script$/{c=$0;next}{print c; c=$0}END{print c}' \
  | xmlstarlet depyx

This will output:

<html>
    <head>
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta>
        <title>JavaScript Ders 2</title>
        <script language="javascript" src="script1.js" type="text/javascript"></script>
        <script language="javascript" src="script2.js" type="text/javascript"></script>
        <script>
            // script kodumuz buraya yazılacak
        </script>
    </head>
    <body>
        <script language="javascript" src="script3.js" type="text/javascript"></script>
    </body>
</html>
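
To try the filtering step without xmlstarlet installed, here is a minimal sketch that feeds hand-written PYX lines through a spread-out version of the awk filter. The sample lines are an assumption, not real xmlstarlet output, and pyx_sample and join_script_close are hypothetical names used only here.

```shell
# Hand-written PYX lines (assumed, not real xmlstarlet output): one script
# element with whitespace-only text, and one whose text holds a comment.
pyx_sample() {
  printf '%s\n' \
    '(script' \
    'Asrc script1.js' \
    '- \n        ' \
    ')script' \
    '(script' \
    '-\n            // comment\n        ' \
    ')script'
}

# Hold each line back in `c`. When the held line is whitespace-only text and
# the current line closes a script element, drop the held line instead of
# printing it. The NR > 1 guard avoids printing an empty first line.
join_script_close() {
  awk '
    NR > 1 && !(c ~ /^- *\\n *$/ && /^\)script$/) { print c }
    { c = $0 }
    END { if (NR) print c }
  '
}

pyx_sample | join_script_close
```

Only the whitespace-only text line is dropped. The text line carrying the comment passes through untouched, which matches the third script element keeping its line breaks in the output above.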
User contributions licensed under: CC BY-SA