Killing HTML nodes from shell

I need a way to remove nodes like <footer>foobar</footer> and <div class="nav"></div> from a large number of HTML files.

I want to dump a site to disk without the menus and footers and what not. Ideally I would accomplish this task using basic unix tools like sed. Since it's not XML I can't use xmlstarlet.

Could anyone please suggest recipes? Ideally I would have a script that takes arguments like 'div class="toplinks"' *.html and prunes the bits I don't want. Thank you.

sed is based on regular expressions. Parsing HTML with regular expressions is a topic that comes up over and over again here on SO; see e.g. "regular expression to extract text from HTML" or, even better, "Can you provide any examples of why it is hard to parse XML and HTML with a regex?". That said, if the HTML pages are all written in a similar way, you may still be able to construct a regexp that does the job. Just be prepared for the fact that it is impossible (yes, provably impossible in theory) to build a complete fix that works in all cases using regexps, because elements can nest to arbitrary depth.
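If a real parser is acceptable, the pruning can be sketched with nothing but Python's standard library. This is only a minimal sketch, not a polished tool: the names Pruner, prune and should_drop are made up for illustration, and it assumes the pages are reasonably well formed (every dropped element has a closing tag). It re-emits everything it sees except the elements selected by the should_drop callback, and it counts nested tags of the same name so that a <div> inside a dropped <div> does not end the skip too early.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; dropping one must not start a skip.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Pruner(HTMLParser):
    """Re-emit HTML, leaving out elements chosen by should_drop (hypothetical helper)."""

    def __init__(self, should_drop):
        # convert_charrefs=False keeps &amp; and friends as written in the source
        super().__init__(convert_charrefs=False)
        self.should_drop = should_drop  # callable(tag, attrs_dict) -> bool
        self.out = []
        self.skip_tag = None  # name of the element currently being dropped
        self.depth = 0        # open tags named skip_tag inside the dropped element

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.depth += 1
        elif self.should_drop(tag, dict(attrs)):
            if tag not in VOID:
                self.skip_tag, self.depth = tag, 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> never change the nesting depth.
        if self.skip_tag is None and not self.should_drop(tag, dict(attrs)):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.depth -= 1
                if self.depth == 0:
                    self.skip_tag = None
            return
        self.out.append(f"</{tag}>")

    def _emit(self, text):
        if self.skip_tag is None:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def prune(html, should_drop):
    parser = Pruner(should_drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def is_unwanted(tag, attrs):
    # Example rule for the nodes in the question: <footer> and <div class="nav">
    classes = (attrs.get("class") or "").split()
    return tag == "footer" or (tag == "div" and "nav" in classes)
```

Calling prune(text, is_unwanted) on the contents of each file and writing the result back would cover the *.html case; the rule function is the only part that needs to change per site.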


Just to drive you regex haters nuts, try this on for size:

sed ':a;$!N;$!ba;s/B/-B/g;s/A/BB/g;s/<\/foo>/A/g;:b;s/<foo>[^A]*A//;tb;s/BB/A/g;s/-B/B/g' foo.html

With foo.html being:
<header> keep me <foo>gtg</foo> </header> <foo> delete me</foo> <foo>gtg</foo> <foo>gtg</foo> 
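For anyone puzzling over the one-liner: it slurps the whole file into the pattern space, frees up the letter A so that it can stand in for </foo> as a single character, and then uses [^A]* to get what is effectively a non-greedy match up to the nearest closing tag. As far as I can tell the A/B escaping is not airtight (an input containing "BA" comes back as "-AB"). In a regex dialect with lazy quantifiers the same idea needs no placeholder at all. Here is a rough Python equivalent; strip_flat is a made-up name, and unlike the sed version it also accepts attributes on the opening tag. It shares the same limitation: nested elements of the same name are cut at the first closing tag.

```python
import re


def strip_flat(html, tag):
    """Remove <tag>...</tag> spans with a lazy regex (no nesting support)."""
    name = re.escape(tag)
    # re.DOTALL lets '.' cross newlines, which is what the sed slurp loop achieves;
    # '.*?' stops at the nearest closing tag, like [^A]*A in the sed version.
    pattern = re.compile(rf"<{name}\b[^>]*>.*?</{name}>", re.DOTALL)
    return pattern.sub("", html)


sample = ("<header> keep me <foo>gtg</foo> </header> <foo> delete me</foo> "
          "<foo>gtg</foo> <foo>gtg</foo> ")
cleaned = strip_flat(sample, "foo")
print(cleaned)
```

On the sample only the header and its "keep me" text remain, plus leftover whitespace. Feeding it <foo>a<foo>b</foo>c</foo> leaves c</foo> behind, which is exactly the kind of case where a parser is the safer choice.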
Otherwise, has anyone written a command-line HTML5 parser, please? Thanks. x.

