
Compare two websites and see if they are “equal”?

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure, to confirm that pages render the same on the new server as on the old one. Does anyone know of a tool that can assist with this?

Answer

Get the formatted output of both sites (here we use w3m, but lynx can also work):

w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
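
For a migration check that covers more than one page, the dump step could be wrapped in a loop. This is only a minimal sketch: the hosts old.example.com and new.example.com, the path list, and the helper names `path_to_name` and `dump_pages` are all made up for illustration.

```shell
#!/bin/sh
# Sketch: dump the rendered text of several paths from two servers.
# Host names, paths and function names are placeholders, not from the answer.

# Turn a URL path such as /about/team into a flat file name (about_team.txt).
# The root path "/" becomes index.txt.
path_to_name() {
    name=$(printf '%s' "$1" | sed 's|^/*||; s|/*$||; s|/|_|g')
    printf '%s.txt\n' "${name:-index}"
}

# dump_pages OLD_BASE NEW_BASE OUT_DIR PATH...
# Writes OUT_DIR/old/<name>.txt and OUT_DIR/new/<name>.txt for every PATH.
dump_pages() {
    old_base=$1
    new_base=$2
    out_dir=$3
    shift 3
    mkdir -p "$out_dir/old" "$out_dir/new"
    for path in "$@"; do
        file=$(path_to_name "$path")
        w3m -dump "$old_base$path" 2>/dev/null > "$out_dir/old/$file"
        w3m -dump "$new_base$path" 2>/dev/null > "$out_dir/new/$file"
    done
}
```

It might be called as `dump_pages http://old.example.com http://new.example.com /tmp/sitecmp / /about /contact`, after which each pair of files under /tmp/sitecmp/old and /tmp/sitecmp/new can be compared as described next.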

Then use wdiff, which can report what percentage of words the two texts have in common (-s prints the statistics, -i ignores case, -n keeps changes from spanning line breaks):

wdiff -nis /tmp/1.html /tmp/2.html

The differences can also be easier to see with colordiff:

wdiff -nis /tmp/1.html /tmp/2.html | colordiff

Excerpt of output:

Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion

                           Google [hp1] [hp2]
                                  [hp3] [-Français-] {+Deutschland+}

           [                                                         ] Recherche
                                                                       avancéeOutils
                      [Recherche Google][J'ai de la chance]            linguistiques


/tmp/1.html: 43 words  39 90% common  3 6% deleted  1 2% changed
/tmp/2.html: 49 words  39 79% common  9 18% inserted  1 2% changed

(Funnily enough, google.com was actually served in French here.)

The "common" percentages show how similar the two texts are. You can also easily see the differences word by word, rather than line by line, which tends to be cluttered.
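
To automate the check, the "common" percentages could be pulled out of the statistics lines and compared against a threshold. The following is a rough sketch, not part of wdiff itself: the function names `common_pcts` and `similar_enough` and the threshold value are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: read wdiff -s output on stdin and judge similarity.
# Function names and the threshold are hypothetical.

# Print the "common" percentage from every statistics line on stdin,
# e.g. "/tmp/1.html: 43 words  39 90% common ..." yields 90.
common_pcts() {
    sed -n 's/.* \([0-9][0-9]*\)% common.*/\1/p'
}

# similar_enough MIN_PCT
# Succeeds only if at least one percentage was found and all of them
# are greater than or equal to MIN_PCT.
similar_enough() {
    common_pcts | awk -v min="$1" '
        $1 < min { bad = 1 }
        END { exit (bad || NR == 0) }'
}
```

A possible use is `wdiff -nis /tmp/1.html /tmp/2.html 2>&1 | similar_enough 75 || echo "pages differ too much"`. With the statistics shown above (90% and 79% common) a threshold of 75 would pass and 85 would fail. Adding wdiff's -1 -2 -3 options should suppress the word output so that only the statistics are parsed.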

User contributions licensed under: CC BY-SA