troff2html

v0.21 25 Oct 94

About troff2html

This is the v0.21 release of troff2html, a Perl program to convert troff markup to HTML (HyperText Markup Language) for the World Wide Web. Despite its name and my good intentions, v0.21 of troff2html only supports the me macros. It has not been tested with groff.

I'm working on support for ms and man macros. Since I don't use troff much anymore (I've switched to LaTeX), I'm working very slowly on this. Any feedback on the current state of the program will motivate me to get the rest of this out the door and off my back. Comments, bugs, patches, and wish-lists please to troyer@cgl.ucsf.edu.

The troff2html information page is http://www.cmpharm.ucsf.edu/~troyer/troff2html/

troff2html is available via ftp and WWW.

Features

Coming Soon

Wishlist

I am not likely to implement these any time soon.

Known bugs

Installing troff2html

  1. troff2html requires perl. Perl is available for almost any machine. Consult your local archie. troff2html was written for Perl 4.036, but has been tested and seems to run with Perl 5.000.
  2. Get the gzipped tar file (instructions above). Ungzip and untar it; it will create its own troff2html directory.
  3. Edit the troff2html file. Don't worry if you don't know perl. First make sure that the
    #!/usr/bin/perl
    
    line points to your local copy of perl.

    Go down to the section called

    #### Start configuration information
           
    and follow the instructions. The only thing you must do is edit the following line to point to the directory where troff2html is going to live:
    # Change this to the directory where troff2html and macro files reside
    $troff2htmldir = "/home/celeste/troyer/troff2html";
           
  4. If you want inline GIF files of tables, equations, etc., you will need ghostscript, which is easily obtainable at any GNU software archive and compiles on most machines.
  5. The file troff2html will need to be installed somewhere in your path. The rest of the files can stay in their own directory.
  6. If you are going to use icons as navigation links, you will need to install them somewhere in your web-accessible directories and edit the configuration to point to them. No icons are distributed with this version.

Running troff2html

The basic syntax is simply troff2html inputfile.me. Make sure you're in a separate, writable directory; troff2html assumes freedom to write a lot of files. troff2html will process all files on the command line as a single document.

A simple document will probably be best translated as:

  troff2html -nosplit -notoc simple.me
which gives a single-page document simple.html with no table of contents. It's a good idea to tell troff2html explicitly -nosplit -notoc if that's what you want; strange things can happen otherwise.

A bare troff2html will give the following usage statement:

usage: troff2html [arguments] input files ...
    troff2html translates [tng]roff to HTML
    -name string        basename for output html files [inputfilename]
    -title string       main title (not header) added to all pages []
    -split number       section at which to split HTML files [999]
                        -nosplit or -split 0 gives a single page
    -numbersection      section titles will have roff-like numbers 
    -nonumbersection    no section numbers [default]
    -nofill <method>    what to do with nofill regions [pre]
                        (supported methods are 'pre','br')
    -toc                Table of Contents generated  [default]
    -notoc              no Table of Contents generated
    -toc page           ToC gets separate page, even if -split 0
    -toc bottom         ToC is on bottom of first page [top]
    -headerlink         headers are links back to ToC [default if split = 0]
    -headerlink explicit explicit links are put near headers back to ToC
    -index              include link to index in navigation bar [no]
    -noindex            don't print any indices
    -notes              include link to endnotes in navigation bar [no]
    -notes <string>     the link to notes will be labelled ["Notes"]
    -nonotes            don't print any notes
    -nav                include a navigation bar [default if split > 0]
    -nonav              no navigation bar included [default if split = 0]
    -nav words          navigation bar will use word links [default]
    -nav icons          navigation bar will use icons
    -tbl <method>       how to handle tables [pre]
    -eqn <method>       how to handle equations [pre]
    -pic <method>       how to handle pictures [pre]
                        (currently supported: 'omit','pre','nroff', 'ps')
    -warn               warn about all ignored macros [yes]
    -macros             print names of all non-ignored macros and exit
    -perl <file>        additional macros defined in perl file
    -entity             use entity names, not numbers, for isolatin chars [no]

    arguments can be abbreviated, defaults in square brackets

troff2html arguments

All arguments can be abbreviated to their shortest unambiguous string. Thus, -split 2 can be abbreviated -s 2. The option parsing package I am using does not recognize -s2.
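
The abbreviation rule can be sketched roughly as follows. This is not the option parsing package troff2html uses; the resolve_option subroutine and the shortened option list are made up for illustration, and the sketch assumes a reasonably modern Perl 5.

```perl
use strict;
use warnings;

# Illustrative subset of option names taken from the usage message.
my @options = qw(name title split nosplit numbersection nonumbersection
                 nofill toc notoc);

# Hypothetical helper: return the one option that $abbrev stands for,
# or undef if it matches nothing or matches more than one option.
sub resolve_option {
    my ($abbrev, @known) = @_;
    my @exact = grep { $_ eq $abbrev } @known;
    return $exact[0] if @exact;            # a full option name always wins
    my @hits = grep { index($_, $abbrev) == 0 } @known;
    return @hits == 1 ? $hits[0] : undef;  # unique prefix, or nothing
}

for my $abbrev (qw(s nu toc no)) {
    my $full = resolve_option($abbrev, @options);
    printf "-%s => %s\n", $abbrev, defined $full ? "-$full" : 'ambiguous';
}
```

With this list, -s resolves to -split and -nu to -numbersection, while -no is ambiguous because several options begin with "no".
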
-name string
If name is not given, the base name for all output files will be the base name (no extension) of the first input file. The top entry point to the document will always be "basename.html". If you want troff2html to read from standard input, supply a -name argument so it knows what to call the output files.
-title string
It is useful to have a descriptive HTML <title> so that browser histories are more meaningful. Normally files are titled after the basename or after their section title. This argument will be prepended to all document titles, which are normally seen only in a browser window or browser history. This text will not appear in the visible section headers.
-split number
Split into separate files at this level of section headers. -split 0 or -nosplit means no splitting at all; the resulting document will be a single page. -split 1 will split at the top level section headers (.sh 1 or .NH 1). troff2html tries very hard not to leave a single header alone on a page if you do something like
       .sh 1 Introduction
       .sh 2 In The Beginning
       
Right now, unnumbered headers are always counted as being one below the last numbered header. This should probably be configurable.
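
As a rough illustration of how header levels interact with -split, here is a small sketch. The subroutine names are hypothetical and this is not troff2html's code. In particular, it assumes that a header starts a new file whenever its level is numerically no greater than the split level, and it ignores the special handling that keeps a header from sitting alone on a page.

```perl
use strict;
use warnings;

# Return the effective level of each header in document order.
# A numbered header is given as its level; an unnumbered header is
# given as undef and counts as one below the last numbered header.
sub effective_levels {
    my @levels = @_;
    my $last_numbered = 0;
    my @out;
    for my $level (@levels) {
        if (defined $level) {
            $last_numbered = $level;
            push @out, $level;
        } else {
            push @out, $last_numbered + 1;
        }
    }
    return @out;
}

# Assumption: split at every header whose level is <= the -split value;
# a split value of 0 (-nosplit) never splits.
sub starts_new_file {
    my ($level, $split) = @_;
    return ($split > 0 && $level <= $split) ? 1 : 0;
}

my @levels = effective_levels(1, 2, undef, 1, undef);
print "levels: @levels\n";
print "new file at: ",
      join(' ', map { starts_new_file($_, 1) } @levels), "\n";
```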

-numbersection
Normally, section headers in the HTML document are not numbered. This follows the convention of latex2html, with the rationale that the individual files are usable as separate entities that way. -numbersection turns on explicit section numbering to mimic troff behavior.
-nofill method
What to do with the troff command .nf. Should be one of pre or br. The default is pre, which uses the HTML <pre> command. This will preserve indentation and line breaks, but it will change the font to fixed-pitch. Specifying br will use the HTML command <br> to put a line break after each line of the troff input. The font thus remains normal and proportionally-spaced, but you will lose any indentation, tabs, or internal spacing you had in the .nf section.
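
The difference between the two methods can be sketched like this. The render_nofill and escape_html subroutines are hypothetical and only approximate what the description above implies; they are not taken from troff2html.

```perl
use strict;
use warnings;

# Minimal HTML escaping for text that came from the troff input.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Render the lines of a .nf region with the 'pre' or 'br' method.
sub render_nofill {
    my ($method, @lines) = @_;
    my @escaped = map { escape_html($_) } @lines;
    if ($method eq 'br') {
        # Browsers collapse runs of spaces in normal text, so the
        # indentation and column alignment are lost when displayed.
        return join('', map { "$_<br>\n" } @escaped);
    }
    # Default: keep spacing and line breaks, in a fixed-pitch font.
    return "<pre>\n" . join("\n", @escaped) . "\n</pre>\n";
}

my @region = ('Name      Size', '  foo       12');
print render_nofill('pre', @region);
print render_nofill('br',  @region);
```
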
-toc
Add a Table of Contents (defined by section headers) to the document. -notoc turns off the Table of Contents. This is on by default if you have any section headers in your document. If no section headers are present, no Table of Contents will be printed. You may need to use -notoc even if you have no section headers, since if you use any footnotes troff2html will try to make a 'Notes' section.
-toc page
If you give this option a page argument, the Table of Contents will have its own page (be a separate file), even if -nosplit is given.
-toc bottom
Places the Table of Contents at the bottom of the first page instead of the top.
-headerlink
If this option is given, section headings will be hyperlinks back to the Table of Contents. This is mainly useful for a single page document to jump up to the Table of Contents, and is enabled by default only if -nosplit. It is not enabled by default for multi-file documents, since each file/page/section has a navigation bar that also jumps back to the Table of Contents.
-headerlink explicit
If you give this option an explicit argument, the links back to the Table of Contents are an explicit phrase after each section header, instead of the section header itself. The exact phrase is configurable, but the default is "[To Table of Contents]".
-index
troff2html doesn't know ahead of time if you have an index, so no link to the index is included in the navigation bar on each page. Use this option to include a link to the index on each page. If you have index entries but don't explicitly print the index, troff2html will print it at the end of the document. -noindex means don't print this final index, although you can still print one explicitly with .xp.
-notes
troff2html doesn't know ahead of time if you have footnotes or references, so no link to the endnotes is included in the navigation bar on each page. Use this option to include a link to the endnotes on each page. -nonotes means don't print any endnotes.
-notes string
Since all the footnotes and references are put at the end of the document, they're really endnotes. The default name for this section is simply "Notes". If you want something else, use this argument (e.g. -notes References).
-nav
Include a navigation bar on each page. (Which links are included, and in what order, is configurable; see the installation instructions.) This is the default if the document is going to be more than one HTML file (i.e., if -split 1 or higher). -nonav turns off the navigation bar.
-nav method
Method should be words or icons. This specifies whether the navigation bar uses words (the default) or icons. The words and icons are configurable.
-tbl method
How to treat sections marked for the tbl preprocessor. Method should be one of
       omit           omit the section entirely
       pre            leave the raw tbl commands in the document and
                        include them as a preformatted section 
       nroff          run it through 'tbl | nroff' and include as a
                        preformatted section
       ps             render as postscript (with 'tbl | psroff'), 
                        use ghostscript to convert to GIF, and include
                        as an inline image.
       
-eqn method
As tbl, for the eqn equation preprocessor.
-pic method
As tbl, for the pic picture preprocessor.
-warn
Warn about all unknown macros. Use -nowarn to turn warnings off.
-macros
Print the name of all understood macros and exit. Many macros that deal with formatting are understood in the sense that we know enough to ignore them.
-perl filename
Although you can add your own macro definitions and code when you configure troff2html (see the installation instructions), you can also add perl code to be evaluated at run-time with this argument. This is useful, for instance, to add heuristic code to detect section headers from font size changes for a specific file.
-entity
Many web browsers, Mosaic and NetScape among them, do support all of the iso-8859-1 character set, but do not support entity names (like &plusmn; for the plus/minus symbol) for all of this set. By default, troff2html translates these characters to their numeric codes instead (&#177; in this case). Use -entity in the future when all browsers support all entity names.
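
A minimal sketch of the two encodings, assuming the input has already been turned into ISO 8859-1 characters. The encode_latin1 and entity_for subroutines and the three-entry name table are hypothetical; troff2html's own tables are not shown here.

```perl
use strict;
use warnings;

# A few ISO 8859-1 entity names keyed by character code (illustrative subset).
my %entity_name = (169 => 'copy', 177 => 'plusmn', 233 => 'eacute');

# Choose the named form only when asked for and when a name is known.
sub entity_for {
    my ($code, $use_names) = @_;
    return "&$entity_name{$code};" if $use_names && exists $entity_name{$code};
    return "&#$code;";
}

# Encode every character in the upper half of ISO 8859-1 (codes 160-255).
sub encode_latin1 {
    my ($text, $use_names) = @_;
    $text =~ s/([\xA0-\xFF])/entity_for(ord($1), $use_names)/ge;
    return $text;
}

print encode_latin1("5 \xB1 2", 0), "\n";   # numeric form, the default
print encode_latin1("5 \xB1 2", 1), "\n";   # named form, as with -entity
```

The first call prints "5 &#177; 2" and the second prints "5 &plusmn; 2".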

Hints and Quirks

Some of these you might consider bugs. I prefer to think of them as unfortunate details of the implementation, and I'm not likely to fix them soon, although I'm willing to help if you'd like to try.

Logical vs. visual markup

In the context of trying to convert to a structured markup language, troff is a nasty language. (Troff may be nasty in other ways as well, but that's a subject for another time.) Although higher-level markup, such as the me and ms macros, is available, most of the troff documents I see continue to use low-level, presentation-based markup instead of logical markup. For example, people use centering and font size changes instead of .sh. I've tried to do the best I can under such impossible conditions.

Without simulating the troff internals in their entirety, all I can do is fudge and make good guesses. However, troff2html is not a shining example of computer science or elegant programming; it's more like a house of cards. I'm also not a troff expert. Feel free to hack on the code and send me the changes.

In working with documents in the World Wide Web, I've found that keeping them up-to-date is the hardest part. I hope that troff2html can be useful to you in this respect. It's almost certain that troff2html will not translate your document perfectly. You have two choices: you can change the resulting HTML or you can change the troff source.

If you even suspect you're going to update the paper (troff) copy of your document, it's worth your while to make the translation automatic, so that in future turning out an up-to-date HTML version is as painless as possible. If you're like me, editing a raw HTML file every time quickly becomes tedious and gets put on the bottom of the todo list.

My suggestions if you have problems:

Add your own macros

This can be done in two ways.
  1. For macros you're going to want for every file, put the code in a file (called something like "localmacros.pl") and add that file to the @localmacros array. See the customization instructions.
  2. Otherwise, use the -perl command line argument to add code to a particular run.

In this added perl code, you add a new macro in two steps. Say you're adding a macro called .Ph that originally took elements of path names as arguments and wrapped them, omitting hyphens. For HTML, we just want to cram all the arguments together.

First add an entry to the macro subroutine table.

$macrosubs{'Ph'} = 'Phmacro';

Normally subroutines will use $command for the macro called, $args for any arguments on the same line, @args for their shellworded equivalent, and $text for the normal text following this command up to the next dot command. Remember that the following text isn't necessarily related to your command. Hands off unless appropriate.

If you want an argument passed to the subroutine, use a colon followed by an eval-able expression: foo:i passes the string 'i', and foo:$i passes the variable $i. See the font examples in macrotable.pl.

The $command and the $args will not be printed. Most of the effects of macros should be appended to a variable called $pretext, which will be printed before $text. If $pretext contains characters that actually get printed (instead of just formatting changes like <b> or headers), set $pretextprints = 1;. This is just for bookkeeping to make sure a header doesn't get printed all by itself on a page.

Then add the subroutine in the code to be included.

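sub Phmacro {
  $args =~ s/\s+//g;      # cram the path elements together
  $pretext .= "$args ";   # printed before the following $text
  $pretextprints = 1;     # $pretext contains visible characters
}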

If your macro should reset fonts or lists to their default, use $pretext .= &pop_all_fonts; or $pretext .= &pop_all_lists;.
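
To try a macro subroutine outside troff2html, a throwaway harness along these lines may help. The variable names and the Phmacro subroutine come from the text above, but the run_macro dispatcher is purely a stand-in: how troff2html really reads lines, splits arguments, and calls the subroutines is not shown here, so treat this as an assumption.

```perl
use strict;
use warnings;

# Globals named as in the text above; everything else is a stand-in.
our (%macrosubs, $command, $args, @args, $text, $pretext, $pretextprints);

$macrosubs{'Ph'} = 'Phmacro';

sub Phmacro {
    $args =~ s/\s+//g;
    $pretext .= "$args ";
    $pretextprints = 1;
}

# Hypothetical dispatcher: parse one dot command, look up its handler
# by macro name, call it, and return $pretext followed by $text.
sub run_macro {
    my ($line, $following) = @_;
    ($command, $args) = $line =~ /^\.(\S+)\s*(.*)$/ or return $following;
    @args = split ' ', $args;    # crude substitute for shell-style word splitting
    ($text, $pretext, $pretextprints) = ($following, '', 0);
    my $handler = $macrosubs{$command};
    if (defined $handler) {
        no strict 'refs';
        &$handler();             # call the subroutine by its name
    }
    return $pretext . $text;
}

print run_macro('.Ph /usr local/ bin', "is the install directory.\n");
```

This prints "/usrlocal/bin is the install directory.", with the three arguments crammed together as described.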

HTML 3.0 support

Eventually equations and tables will be in HTML 3.0. Mosaic 2.5b already has table support. Somebody will have to write eqn and tbl translators when enough browsers support this. I don't know if Greek is part of the HTML 3.0 equation spec or not, though.

As of October 1994, NetScape supports centering. This is implemented in a nonstandard way and not supported in most browsers, and so I haven't included it in troff2html. When the standard comes along, a lot of nasty code can be eliminated.


john <troyer@cgl.ucsf.edu> (Tue Oct 25 22:31:30 1994)