Killing HTML nodes from shell





I need a way to remove nodes like <footer>foobar</footer> and <div class="nav"></div> from many HTML files.

I want to dump a site to disk without the menus, footers and whatnot. Ideally I would do this with basic Unix tools like sed. Since the pages are not XML, I can't use xmlstarlet.

Could anyone suggest recipes so that I can run a script such as kill-node.sh 'div class="toplinks"' *.html to prune the bits I don't want? Thank you.



1:

sed is based on regular expressions. Parsing HTML with regular expressions is a topic that comes up over and over again here on SO; see e.g. "regular expression to extract text from HTML" or, even better, "Can you provide any examples of why it is hard to parse XML and HTML with a regex?". That said, if the HTML pages are all written in a similar way, you may still be able to construct a regexp that does the job. Just be prepared for the fact that it is impossible (yes, provably impossible in theory) to build a complete solution that works in all cases using regexps.
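To make the limitation concrete, here is a minimal sketch. The sample markup and the helper name naive_strip are made up for illustration and are not from the question. A pattern that handles a flat nav block silently leaves a nested one untouched:

```shell
#!/bin/sh
# Naive attempt: delete <div class="nav">TEXT</div>, where TEXT contains no tags.
# naive_strip is a hypothetical helper name used only for this illustration.
naive_strip() {
    sed 's/<div class="nav">[^<]*<\/div>//g'
}

# Made-up sample inputs: one flat nav block, one with a nested <div>.
flat='<div class="nav">menu</div><p>keep</p>'
nested='<div class="nav"><div>menu</div></div><p>keep</p>'

flat_out=$(printf '%s\n' "$flat" | naive_strip)
nested_out=$(printf '%s\n' "$nested" | naive_strip)

printf 'flat:   %s\n' "$flat_out"
printf 'nested: %s\n' "$nested_out"
```

The flat case is reduced to the paragraph alone, while the nested case comes back unchanged, because the pattern has no way of pairing each opening tag with its own closing tag.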

2:

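Just to drive you regex haters nuts, try this on for size:

sed ':a;$!N;$!ba;s/_/_u/g;s/A/_a/g;s/<\/foo>/A/g;:b;s/<foo>[^A]*A//;tb;s/_a/A/g;s/_u/_/g' foo.html

The script slurps the whole file into the pattern space, escapes every literal A so that a single A can stand in for </foo>, deletes each <foo>...A span in a loop, and finally undoes the escaping.

With foo.html being:

<header> keep me <foo>gtg</foo> </header> <foo> delete me</foo> <foo>gtg</foo> <foo>gtg</foo>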
Otherwise, does anyone know of a command-line HTML5 parser? Thanks.
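The sed one-liner above can be wrapped into something closer to the kill-node.sh the question asks for. This is only a sketch under stated assumptions: the function name kill_tag is hypothetical, it takes a plain element name rather than a full 'div class="toplinks"' selector, it expects lower-case tags, and it still cannot cope with elements of the same name nested inside each other. The marker letter A and the _u/_a escapes are arbitrary choices, and [^A] is assumed to match newlines in the pattern space, as it does in GNU sed.

```shell
#!/bin/sh
# kill_tag NAME: copy HTML from stdin to stdout, dropping every <NAME ...>...</NAME>.
# Hypothetical helper; assumes no nesting of NAME inside NAME.
kill_tag() {
    tag=$1
    case $tag in
        ''|*[!A-Za-z0-9]*)
            echo "kill_tag: expected a plain element name, got: $tag" >&2
            return 2 ;;
    esac
    # 1. Slurp the whole input into the pattern space.
    # 2. Escape "_" and "A" so that a bare "A" is free to act as a marker.
    # 3. Replace each closing tag with the marker.
    # 4. Repeatedly delete from an opening tag to the nearest marker.
    # 5. Undo the escaping.
    sed -e ':a' -e '$!N' -e '$!ba' \
        -e 's/_/_u/g' -e 's/A/_a/g' \
        -e "s/<\/$tag>/A/g" \
        -e ':b' -e "s/<$tag\( [^>]*\)*>[^A]*A//" -e 'tb' \
        -e 's/_a/A/g' -e 's/_u/_/g'
}
```

To prune a set of files, one option is to write to a temporary file and move it back, for example: for f in *.html; do kill_tag footer <"$f" >"$f.tmp" && mv "$f.tmp" "$f"; done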

