troff2html
v0.21 25 Oct 94
About troff2html
This is the v0.21 release of troff2html, a perl program to convert
troff markup to HTML (hypertext markup language) for the World Wide
Web. Despite its name and my good intentions, v0.2 of troff2html only
supports the me macros. It has not been tested with
groff.
I'm working on support for ms and man macros.
Since I don't use troff much anymore (I've switched to LaTeX), I'm working
very slowly on this. Any feedback on the current state of the program
will motivate me to get the rest of this out the door and off my back.
Comments, bugs, patches, and wish-lists please to troyer@cgl.ucsf.edu.
The troff2html information page is
http://www.cmpharm.ucsf.edu/~troyer/troff2html/
troff2html is available via ftp and WWW.
Features
- understands -me macros
- understands strings and sourced files
- output from preprocessors (eqn, tbl, etc.) can be run through
nroff or inlined as GIF files.
- translates all iso-8859-1 entities
- Table of Contents configurable
- Navigation bar configurable
- splitting by section level configurable
- (relatively) easy to add your own macros
Coming Soon
- understands -ms, -man macros
- recognizes cross-references in man pages
- html-only/troff-only text for slightly different
hypertext/paper documents
- consolidated, documented ways to add new preprocessors and fonts
to be recognized.
Wishlist
I am not likely to implement these any time soon.
- second pass: add children node, navigation bar with titles
- add heuristics for non-number footnote mark recognition
- add ability to accumulate a single index, references, or TOC from
runs of separate files
- add heuristics for header recognition
- parse troff macro definitions or number registers
- option to put footnotes at the bottom of each page or on a
separate page for each page
Known bugs
- The icon navigation bar has not been tested, and no icons are
supplied with this release.
- list-guessing heuristics do not always work. Please send examples
of failures.
- nroff'd tables may not come out correctly.
- formatting in a .ds defined string is not always interpreted correctly
- groff features, like the CW font, will not work. I'll be glad to
put them in, though, if you tell me about them.
Installing troff2html
- troff2html requires perl. Perl is available for almost any
machine. Consult your local archie. troff2html was written for
Perl 4.036, but has been tested and seems to run with Perl 5.000.
- Get the gzipped tar file (instructions above). Ungzip and untar
it; it will create its own troff2html directory.
- Edit the troff2html file. Don't worry if you don't know perl.
First make sure that the
#!/usr/bin/perl
line points to your local copy of perl.
Go down to the section called
#### Start configuration information
and follow the instructions.
The only thing you must do is edit the following line to
point to the directory where troff2html is going to live
# Change this to the directory where troff2html and macro files reside
$troff2htmldir = "/home/celeste/troyer/troff2html";
- If you want inline GIF files of tables, equations, etc., you will
need ghostscript, which is easily obtainable at any GNU
software archive and compiles on most machines.
- The file troff2html will need to be installed somewhere
in your path. The rest of the files can stay in their own directory.
- If you are going to use icons as navigation links, you will need
to install them somewhere in your web-accessible directories and
edit the configuration to point to them. No icons are distributed
with this version.
Running troff2html
The basic syntax is simply troff2html inputfile.me.
Make sure you're in a separate, writable directory; troff2html assumes
freedom to write a lot of files. troff2html will process all files on the
command line as a single document.
A simple document will probably be best translated as:
troff2html -nosplit -notoc simple.me
which gives a single page document simple.html
with no table of contents. It's a good idea to tell troff2html explicitly
-nosplit -notoc if that's what you want; strange things can
happen otherwise.
A bare troff2html will give the following usage statement:
usage: troff2html [arguments] input files ...
  troff2html translates [tng]roff to HTML
  -name string          basename for output html files [inputfilename]
  -title string         main title (not header) added to all pages []
  -split number         section at which to split HTML files [999]
  -nosplit              or -split 0 gives a single page
  -numbersection        section titles will have roff-like numbers
  -nonumbersection      no section numbers [default]
  -nofill <method>      what to do with nofill regions [pre]
                        (supported methods are 'pre','br')
  -toc                  Table of Contents generated [default]
  -notoc                no Table of Contents generated
  -toc page             ToC gets separate page, even if -split 0
  -toc bottom           ToC is on bottom of first page [top]
  -headerlink           headers are links back to ToC [default if split = 0]
  -headerlink explicit  explicit links are put near headers back to ToC
  -index                include link to index in navigation bar [no]
  -noindex              don't print any indices
  -notes                include link to endnotes in navigation bar [no]
  -notes <string>       the link to notes will be labelled ["Notes"]
  -nonotes              don't print any notes
  -nav                  include a navigation bar [default if split > 0]
  -nonav                no navigation bar included [default if split = 0]
  -nav words            navigation bar will use word links [default]
  -nav icons            navigation bar will use icons
  -tbl <method>         how to handle tables [pre]
  -eqn <method>         how to handle equations [pre]
  -pic <method>         how to handle pictures [pre]
                        (currently supported: 'omit','pre','nroff', 'ps')
  -warn                 warn about all ignored macros [yes]
  -macros               print names of all non-ignored macros and exit
  -perl <file>          additional macros defined in perl file
  -entity               use entity names, not numbers, for isolatin chars [no]
  arguments can be abbreviated, defaults in square brackets
troff2html arguments
All arguments can be abbreviated by their shortest unambiguous string.
Thus, -split 2 can be abbreviated -s 2. The option
parsing package I am using does not recognize -s2.
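To make "shortest unambiguous string" concrete, here is a minimal Perl sketch of prefix matching over the option names from the usage message. It is not the option-parsing package troff2html actually uses, and the helper name resolve_option is made up for this illustration.

```perl
use strict;
use warnings;

# Option names copied from the usage message.
my @options = qw(name title split nosplit numbersection nonumbersection
                 nofill toc notoc headerlink index noindex notes nonotes
                 nav nonav tbl eqn pic warn macros perl entity);

# resolve_option is a hypothetical helper, not part of troff2html.
# It returns the full option name for an abbreviation, or undef when the
# abbreviation matches no option or more than one.
sub resolve_option {
    my ($abbrev) = @_;
    return $abbrev if grep { $_ eq $abbrev } @options;    # exact match wins
    my @hits = grep { index($_, $abbrev) == 0 } @options; # prefix matches
    return @hits == 1 ? $hits[0] : undef;
}

for my $abbrev (qw(s to n)) {
    my $full = resolve_option($abbrev);
    print "-$abbrev => ",
          (defined $full ? "-$full" : "ambiguous or unknown"), "\n";
}
```

With this option list, -s resolves to -split and -to to -toc, while -n is rejected because several options (name, nosplit, nav, and others) begin with "n".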
- -name string
- If name is not given, the base name for all output files
will be the base name (no extension) of the first input file.
The top entry point to the document will always be "basename.html".
If you want troff2html to read from standard input, supply a
-name argument so it knows what to call the output files.
- -title string
- It is useful to have a descriptive HTML <title> so that
browser histories are more meaningful. Normally files are titled
after the basename or after their section title. This argument
will be prepended to all document titles, which are normally seen
only in a browser window or browser history.
This text will not appear in the visible section headers.
- -split number
- Split into separate files at this level of section headers.
-split 0 or -nosplit
means no splitting at all; the resulting document
will be a single page. -split 1 will split at the
top level section headers (.sh 1 or .NH 1). troff2html tries
very hard not to leave a single header alone on a page if you do
something like
.sh 1 Introduction
.sh 2 In The Beginning
Right now, unnumbered headers are always counted as being one
below the last numbered header. This should probably be configurable.
- -numbersection
- Normally, sections in the HTML document are not numbered.
This follows the convention of latex2html, with the rationale that
the individual files are usable as separate entities that way.
-numbersection turns on explicit section numbering to
mimic troff behavior.
- -nofill method
- What to do with the troff command .nf. Should be one of
pre or br. The default is pre, which uses
the HTML <pre> command. This will preserve indentation
and line breaks, but it will change the font to
fixed-pitch. Specifying br will use the HTML
command <br> to put a line break after each line of the
troff input. The font thus remains normal and
proportionally-spaced, but you will lose any indentation, tabs,
or internal spacing you had in the .nf section.
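The difference between the two methods can be sketched in a few lines of Perl. This only mimics the documented output for pre and br; the helper name render_nofill is hypothetical, and troff2html's own code may well differ.

```perl
use strict;
use warnings;

# render_nofill is a hypothetical helper that imitates the two documented
# -nofill methods. It is not taken from troff2html itself.
sub render_nofill {
    my ($method, @lines) = @_;
    for (@lines) {              # escape HTML special characters in our copy
        s/&/&amp;/g;
        s/</&lt;/g;
        s/>/&gt;/g;
    }
    # 'pre': keep the lines as they are inside a preformatted block.
    return "<pre>\n" . join("\n", @lines) . "\n</pre>\n" if $method eq 'pre';
    # 'br': end every input line with a line break; browsers collapse the
    # leading spaces, so the indentation is lost when displayed.
    return join('', map("$_<br>\n", @lines)) if $method eq 'br';
    die "unsupported -nofill method: $method\n";
}

my @region = ('if (a < b)', '    swap(a, b);');
print render_nofill('pre', @region);
print render_nofill('br',  @region);
```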
- -toc
- Add a Table of Contents (defined by section headers) to the
document. -notoc turns off the Table of Contents.
This is on by default if you have any section headers in
your document. If no section headers are present, no Table of
Contents will be printed. You may need to use -notoc
even if you have no section headers, since if you use any
footnotes troff2html will try to make a 'Notes' section.
- -toc page
- If you give this option a page argument, the Table of
Contents will have its own page (be a separate file), even if
-nosplit is defined.
- -toc bottom
- Places the Table of Contents at the bottom of the first page
instead of the top.
- -headerlink
- If this option is given, section headings will be hyperlinks back
to the Table of Contents. This is mainly useful for a single page
document to jump up to the Table of Contents, and is enabled by
default only if -nosplit. It is not enabled by default
for multi-file documents, since each file/page/section has a
navigation bar that also jumps back to the Table of Contents.
- -headerlink explicit
- If you give this option an explicit argument, the links
back to the Table of Contents are an explicit phrase after each
section header, instead of the section header itself. The exact
phrase is configurable, but the default is "[To Table of Contents]".
- -index
- troff2html doesn't know ahead of time if you have an index, so no
link to the index is included in the navigation bar on each page.
Use this option to include a link to the index on each page.
If you have index entries but don't explicitly
print the index, troff2html will print it at the end of
the document. -noindex means don't print this final
index, although you can still print one explicitly with .xp.
- -notes
- troff2html doesn't know ahead of time if you have footnotes or
references, so no link to the endnotes is included in the
navigation bar on each page. Use this option to
include a link to the endnotes on each page. -nonotes means
don't print any endnotes.
- -notes string
- Since all the footnotes and references are put at the end of the
document, they're really endnotes. The default name for this
section is simply "Notes". If you want something else, use this
argument (e.g. -notes References).
- -nav
- Include a navigation bar on each page. (Which links are included and
in what order is configurable; see the installation instructions.)
This is the default if the document is going to be more than one
HTML file (i.e., if -split 1 or higher). -nonav
turns off the navigation bar.
- -nav method
- Method should be words or icons.
This specifies whether the navigation bar uses words (the default)
or icons. The words and icons are configurable.
- -tbl method
- How to treat sections marked for the tbl preprocessor.
Method should be one of
omit    omit the section entirely
pre     leave the raw tbl commands in the document and
        include them as a preformatted section
nroff   run it through 'tbl | nroff' and include as a
        preformatted section
ps      render as postscript (with 'tbl | psroff'),
        use ghostscript to convert to GIF, and include
        as an inline image.
- -eqn method
- As tbl, for the eqn equation preprocessor.
- -pic method
- As tbl, for the pic picture preprocessor.
- -warn
- Warn about all unknown macros. Use -nowarn to turn warnings
off.
- -macros
- Print the name of all understood macros and exit. Many macros that
deal with formatting are understood in the sense that we know enough
to ignore them.
- -perl filename
- Although you can add your own macro definitions and code when you
configure troff2html (see the installation instructions), you can also
add perl code to be evaluated at run-time with this argument.
This is useful, for instance, to add heuristic code to detect
section headers from font size changes for a specific file.
- -entity
- Many web browsers, Mosaic and NetScape among them, support
all of the iso-8859-1 character set but do not support
entity names (like &plusmn; for the plus/minus symbol) for
all of this set. By default, troff2html translates these characters
to their numeric character references instead (&#177; in this case).
Use -entity in the future when all browsers support
all entity names.
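As an illustration of the two output styles, the following Perl sketch encodes ISO 8859-1 characters either as numeric references or as entity names. The helper encode_latin1 and its three-entry name table are assumptions made for this example; they are not troff2html's actual translation code.

```perl
use strict;
use warnings;

# A few ISO 8859-1 code points and their HTML entity names.
my %entity_name = (169 => 'copy', 177 => 'plusmn', 233 => 'eacute');

# encode_latin1 is a hypothetical helper, not troff2html's own routine.
# With $use_names false it emits numeric references (the default behavior
# described above); with $use_names true it prefers entity names and
# falls back to numbers for characters missing from the small table.
sub encode_latin1 {
    my ($text, $use_names) = @_;
    $text =~ s{([\xA0-\xFF])}{
        my $code = ord $1;
        $use_names && $entity_name{$code} ? "&$entity_name{$code};"
                                          : "&#$code;";
    }ge;
    return $text;
}

my $sample = "5 " . chr(177) . " 1";     # chr(177) is the plus/minus sign
print encode_latin1($sample, 0), "\n";   # numeric reference form
print encode_latin1($sample, 1), "\n";   # entity name form
```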
Hints and Quirks
Some of these you might consider bugs. I prefer to think of them as
unfortunate details of the implementation, and I'm not likely to fix
them soon, although I'm willing to help if you'd like to try.
Logical vs. visual markup
In the context of trying to convert to a structured markup
language, troff is a nasty language. (Troff may be nasty in other ways
as well, but that's a subject for another time.) Although higher-level
markup, such as the me and ms macros, is available,
most of the troff documents I see continue to use low-level,
presentation-based markup instead of logical markup. For example,
people use centering and font size changes instead of .sh.
I've tried to do the best I can under such impossible conditions.
Without simulating the troff
internals in their entirety, all I can do is fudge and make good
guesses. However, troff2html is not a shining example of computer
science or elegant programming; it's more like a house of cards. I'm
also not a troff expert. Feel free to hack on the code and send me the
changes.
In working with documents in the World Wide Web, I've found that
keeping them up-to-date is the hardest part. I hope that troff2html
can be useful to you in this respect. It's almost certain that
troff2html will not translate your document perfectly. You have
two choices: you can change the resulting HTML or you can change
the troff source.
If you even suspect you're going to update the paper (troff)
copy of your document, it's worth your while to make the translation
automatic, so that in future turning out an up-to-date HTML version
is as painless as possible. If you're like me, editing a raw HTML file
every time quickly becomes tedious and gets put on the bottom of the
todo list.
My suggestions if you have problems:
- If you think that troff2html is not rendering something correctly,
report it to me as a bug.
- Change your troff to use more logical markup.
- Write some perl code for this special case and include it using
-perl on the command line.
- Use conditional HTML and troff text (not implemented in v0.2) to
tailor the output for the different formats.
- Write some postprocessing filter to munge the HTML. (Well, that's
not elegant, but at least it's automatic.)
- Break down and learn LaTeX.
Add your own macros
This can be done in two ways.
- For macros you're going to want for every file, put the code in
a file (called something like "localmacros.pl") and add that
file to the @localmacros array. See the customization
instructions.
- Otherwise, use the -perl command line argument to
add code to a particular run.
In this added perl code, you add a new macro in two steps. Say
you're adding a macro called .Ph that originally took
elements of path names as arguments and wrapped them, omitting hyphens.
For HTML, we just want to cram all the arguments together.
First add an entry to the macro subroutine table.
$macrosubs{'Ph'} = 'Phmacro';
Normally subroutines will use $command for the macro
called, $args for any arguments on the same line,
@args for their shellworded equivalent, and $text
for the normal text following this command up to the next dot
command. Remember that the following text isn't necessarily
related to your command. Hands off unless appropriate.
If you want an argument passed to the subroutine, use a colon
followed by an eval-able expression: foo:i passes the string
'i', and foo:$i passes the variable $i. See the font examples
in macrotable.pl.
The $command and the $args will not be printed. Most of the
effects of macros should be appended to a variable called $pretext,
which will be printed before $text. If $pretext contains characters
that actually get printed (instead of just formatting changes like
<b> or headers), set $pretextprints = 1. This is
just for bookkeeping to make sure a header doesn't get printed all by
itself on a page.
Then add the subroutine in the code to be included.
sub Phmacro {
$args =~ s/\s+//g;
$pretext .= "$args ";
$pretextprints = 1;
}
If your macro should reset fonts or lists to their default, use
$pretext .= &pop_all_fonts;
or
$pretext .= &pop_all_lists;
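To try out the .Ph example without running troff2html, the following self-contained Perl sketch wires the table entry and the subroutine to a stand-in dispatcher. The variable names come from the description above, but run_macro and the way the globals are filled in are assumptions for illustration only; troff2html's real dispatcher is not shown in this document, and its shellwording of @args is more careful than the plain whitespace split used here.

```perl
use strict;
use warnings;

# Stand-ins for troff2html's globals; the real program sets these itself.
our (%macrosubs, $command, $args, @args, $text, $pretext, $pretextprints);

# Step 1: register the macro in the subroutine table.
$macrosubs{'Ph'} = 'Phmacro';

# Step 2: define the subroutine that handles it.
sub Phmacro {
    $args =~ s/\s+//g;        # cram the path elements together
    $pretext .= "$args ";
    $pretextprints = 1;       # $pretext now holds printable characters
}

# run_macro is a hypothetical dispatcher, not troff2html's own code.
# It parses one dot command, fills in the globals, and calls the handler
# named in %macrosubs. It returns 1 if a handler ran, 0 otherwise.
sub run_macro {
    my ($line) = @_;
    ($command, $args) = $line =~ /^\.(\S+)\s*(.*)$/ or return 0;
    my $sub = $macrosubs{$command} or return 0;
    @args = split ' ', $args;             # simplified shellwording
    ($pretext, $pretextprints) = ('', 0);
    no strict 'refs';                     # the table stores sub names
    &$sub();
    return 1;
}

run_macro('.Ph /usr/ local/ bin');
print "$pretext\n";
```

Storing subroutine names as strings follows the $macrosubs{'Ph'} = 'Phmacro' convention shown earlier, which is why the sketch turns off strict refs for the call.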
HTML 3.0 support
Eventually equations and tables will be in HTML 3.0. Mosaic 2.5b
already has table support. Somebody will have to write eqn
and tbl translators when enough browsers support this.
I don't know if Greek is part of the HTML 3.0 equation spec or not, though.
As of October 1994, NetScape supports centering. This is
implemented in a nonstandard way and not supported in most browsers,
and so I haven't included it in troff2html. When the standard comes
along, a lot of nasty code can be eliminated.
john <troyer@cgl.ucsf.edu>
(Tue Oct 25 22:31:30 1994)