Converting HTML to Text

6 March 2007

html2text is a small command-line program available in many Linux distros, including Gentoo and Ubuntu. As you might expect, it converts HTML to plain text.

To convert an HTML file to text, you can use:

html2text input.html

This prints the result to the screen. To save it to a file instead, redirect the output:

html2text input.html > output.txt

Another way is to name the output file with the -o option:

html2text -o output.txt input.html

It has to be in that order: -o output.txt first, then the input file. I got confused by this because I put -o output.txt at the end. Thanks to a guy called uniplex on IRC who solved all the mysteries.
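One way to avoid tripping over the argument order is to wrap the call in a small shell function. The sketch below is only an illustration: the name html_to_txt is made up here, and it assumes an html2text that accepts -o before the input file, as in the command above.

```shell
# html_to_txt INPUT [OUTPUT]
# Hypothetical helper (not part of html2text): always passes
# -o OUTPUT before the input file. If OUTPUT is omitted, the name is
# derived from INPUT by replacing its last extension with .txt.
html_to_txt() {
    if [ "$#" -lt 1 ]; then
        echo "usage: html_to_txt INPUT [OUTPUT]" >&2
        return 2
    fi
    # ${1%.*} strips the shortest trailing ".something" from INPUT.
    html2text -o "${2:-${1%.*}.txt}" "$1"
}
```

With this in your shell startup file, html_to_txt input.html writes input.txt, and html_to_txt input.html notes.txt writes notes.txt.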

What if you want to convert a whole directory of HTML files into text files? The program does not come with a batch option, but we can fake a batch mode with a shell loop:

for file in *.html; do html2text -o "${file%.*}.txt" "$file" ; done

I found that this can convert an extremely large number of files in a very short time.
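If you run the loop above often, it may be worth turning it into a function. What follows is a rough sketch rather than anything that ships with html2text: the name convert_dir is invented for this example, and it assumes the same -o usage as above. It also skips the case where no .html files exist and reports any file that fails to convert.

```shell
# convert_dir [DIR]
# Hypothetical helper: converts every *.html file in DIR (default: the
# current directory) to a .txt file next to it. The body runs in a
# subshell, so its variables do not leak into your session.
convert_dir() (
    dir=${1:-.}
    failed=0
    for file in "$dir"/*.html; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$file" ] || continue
        if ! html2text -o "${file%.html}.txt" "$file"; then
            echo "failed: $file" >&2
            failed=$((failed + 1))
        fi
    done
    # Exit status is non-zero if any file could not be converted.
    [ "$failed" -eq 0 ]
)
```

Calling convert_dir with no argument behaves like the one-liner, while convert_dir ~/pages works on another directory without changing into it.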

1 Al says...

Heh, that's a pretty useful utility. I've been doing a fair bit with converting websites to documents at work recently, so this will come in handy. I might combine it with wget to make it even easier. Fedora Core 5 doesn't appear to have any knowledge of it, but it's incredibly simple to compile from source by following the install file given. A note for anyone using Fedora/Red Hat: the install file tells you to copy the man files to /usr/local/man/man1 and /usr/local/man/man5, but on FC/RH the correct location will be /usr/share/man/en/man1 and /usr/share/man/en/man5. Cheers.

Posted at 3:30 p.m. on March 6, 2007


2 michael says...

I used html2text for a while, but in the meantime I have become a great fan of w3m because it does the best rendering of frames. It can also be used on the command line, even within a pipe, to do a text dump of a page.

Posted at 4:23 a.m. on July 8, 2008



About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology. It looks at our experiences of computing, especially using GNU/Linux, the Python programming language, the command line, and issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too, which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!
