For an application
see htxp.
1. Introduction
Word processors
record the formatting of a document as markups,
which they store along with the document text in a marked-up
file. Most WYSIWYG word processors (a notable exception is Ami Pro)
use extended-ASCII characters
as markups, which are not meant for the user to edit directly.
Non-WYSIWYG text formatting languages, such as TeX and HTML,
use special ASCII character sequences to form markups. A user prepares
a file with an ordinary ASCII editor. A program must then be
used to process the file to produce the final manuscript.
The rationale for a non-WYSIWYG design
may differ from case to case. For instance,
TeX uses this approach to achieve machine independence and
to give the user precise and complete control over the formatting.
On the other hand, HTML gives the user only control over the
logical attributes of portions of the document and relies on the
browser to interpret the attributes to determine the final appearance
of the manuscript. We shall be dealing exclusively with
text-based marked-up documents.
Typically, markup commands are embedded among the regular document
text and are distinguished from the latter by the convention
that they start with one of several designated characters.
They are instructions telling the formatter
what to do with the text.
The text itself is free-flowing, with most implicit and
conventional formatting ignored -- these include line breaks,
blank lines, extra spaces, etc. Because of this, explicit
instructions are needed to specify a lot of formatting detail,
resulting in repetitive use of markups.
In this document we propose a generic all-purpose preprocessor
"magxp" that
allows a user to save time by using
abbreviations and command shorthands to prepare a raw file,
which is then expanded to a regular marked-up document of the specified
type. All abbreviations and almost all
"magxp" commands are user-modifiable.
We will call them "magxp" macros collectively, although
they are not full-fledged programmable macros like those in TeX.
A special subset of the command shorthands is
designated as built-in; these
should be designed specifically to suit the particular
application in mind and be general enough for all users.
The generality of "magxp" means that a user can use the
same program to preprocess different kinds of documents including,
in particular, plain ASCII text files (such as emails,
notes, and program code). Different document types are
processed by pointing to different initialization files.
We refer to the processing of one particular document type
by "magxp" as an application.
One can develop a core collection of abbreviations to be used
for all applications, and augment it with a more specialized
set for each document type.
I have a prototype C version of
"magxp" that implements most of the
features discussed in this article, but what I am
proposing here is more of a framework for an
ideal program.
In the following, I outline the various ingredients of the
preprocessor, and the rationale for the specifications.
2. Objectives
- Easy to learn.
- User does not have to know everything about the preprocessor
before being able to start using it. Each command should be
independent of others as much as possible.
- A user can modify macros by simply editing some initialization
files. No recompiling of the program is needed.
- Macro definitions are independent of the programming language
used to write the code. A user does not have to know C
or C++ to be able to write his/her macros.
- Available on any platform.
- In the simplest case, a
command line executable is sufficient for most purposes.
- Incorporation into an editor with
on-the-fly expansion once the cursor leaves a
line would be neat - with an undo option that can take the
cursor back to the line and revert to the unexpanded line -
in case the user discovers a mistake and wishes to edit it
out.
- Macros are compatible with regular markups, so that a command is
(99%) interchangeable with the markup that it replaces.
- This means that one can edit any regular marked-up
document of the type being processed and use macros in the
altered text. When expanded, the unedited part of
the document should remain intact. (This can never be
100% guaranteed because of the use of meta-characters. But with
good common sense in designing the commands, surprises can
be kept to a minimum. The problem can also be solved by
first running the original document under a checker that
can flag any incompatible lines, and then mark them up
to suppress accidental expansion.)
- It also means that a document generated by expanding a
"magxp" file should not change when run under "magxp" a
second time.
- Never frustrate a user with error messages. Expansion
will always be completed, albeit sometimes into something
different from what the user wants.
- There is no such thing as "command not found". If a command does
not exist (usually because the user mistypes the command or
remembers the wrong command), it is simply treated as
text and duplicated.
- The preprocessor will try to do something even when it suspects
an error.
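The "no such thing as command not found" rule can be illustrated with a minimal C sketch. The table layout and the names "struct macro", "find_macro", and "expand_or_copy" are hypothetical and are not taken from the prototype; the point is only that a failed lookup falls back to copying the text unchanged.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical macro table entry: a code and the text it expands to. */
struct macro {
    const char *code;
    const char *text;
};

/* Return the table entry whose code matches, or NULL when none does. */
static const struct macro *find_macro(const struct macro *table, size_t n,
                                      const char *code)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].code, code) == 0)
            return &table[i];
    }
    return NULL;
}

/*
 * Write the expansion of `word` into `out`. An unknown code is not an
 * error: the word is simply copied through unchanged.
 */
static void expand_or_copy(const struct macro *table, size_t n,
                           const char *word, char *out, size_t outsz)
{
    const struct macro *m = find_macro(table, n, word);
    snprintf(out, outsz, "%s", m ? m->text : word);
}
```

With a table that defines only ";ab", a call with ";ab" yields the substitution text, while a call with the mistyped ";zz" yields ";zz" itself, so the user sees the typo in the expanded document rather than an error message.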
Being cryptic and other issues
- Too many abbreviations and macros make the document cryptic and hard to follow
by others.
The "magxp" raw files are not intended for other people to read. They
exist to save the user time. Show other people the expanded
regular markup file.
- "magxp" provides an intermediate stage, in which all abbreviations, but not
macros, are expanded. The rationale is that there are usually not
too many macros and they may actually be easier to read than some
of the regular markups. On the other hand, a user's favorite
abbreviations tend to vary from file to file and from time to time.
- After expansion, some lines become annoyingly long and look ugly.
With practice and good design of the substitution text, one
can learn to make good line breaks in the raw file so that
the expanded document will not look too terrible. After all,
editing a few ugly lines still takes less time than typing
everything in long hand.
3. Initialization Files
"magxp" commands can be viewed as markups of yet another markup language.
There is, however, one difference between "magxp" markups and conventional
markups. Even though "magxp" macros always start with some special characters,
it is not true that any string that starts with those
characters must be a macro. If a string of characters has not been defined
as a "magxp" macro, either by default as a built-in command or by the user,
the string is treated as ordinary text.
Different document types are handled by pointing to different
initialization files. We use the HTML document type as a
concrete example of one such application.
- These files contain all the user-modifiable abbreviations
and commands, and special characters (chosen by the user)
used in the macros.
- As of now, there are no other options to be
specified by the user. If such options are needed
in future versions, they can go into these files.
- These files are to be placed in a designated directory and their
names have the same root, indicative of the application they belong to.
For our example of HTML application, the files have the names "html.ab1",
"html.ab2", "html.dcm", "html.ucm", and "html.rc".
- The file "html.dcm" contains the definitions of
the built-in "magxp" commands (dot commands). These commands should be
specially designed by an expert in the document type with
the general user in mind, and
they should cover comprehensively the commonly used markups
of the application. Once the design of these commands is
frozen, other users should refrain from modifying them.
- User-defined abbreviations and commands can be lumped together into one
file, "html.rc", or separately contained in both "html.ab1" and "html.rc", or
in the three files "html.ab1", "html.ab2", and "html.ucm". The precise
rules for each method will be given below. A user can choose
whichever alternative he/she prefers.
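To make the naming rule concrete, here is a minimal C sketch that builds the name of an initialization file from the application root. The function name, the "DIR/root.ext" path layout, and the order of the extensions are assumptions made for illustration; they are not taken from the prototype.

```c
#include <stdio.h>

/* Extensions named in the text; their order here is arbitrary. */
static const char *const init_ext[] = { "ab1", "ab2", "dcm", "ucm", "rc" };
enum { INIT_EXT_COUNT = 5 };

/*
 * Build "<dir>/<root>.<ext>" for the i-th initialization file.
 * Returns 0 on success, -1 if i is out of range or the buffer is too small.
 */
static int init_file_path(const char *dir, const char *root, int i,
                          char *out, size_t outsz)
{
    if (i < 0 || i >= INIT_EXT_COUNT)
        return -1;
    int n = snprintf(out, outsz, "%s/%s.%s", dir, root, init_ext[i]);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

For the HTML application, calling init_file_path("DIR", "html", 2, buf, sizeof buf) leaves "DIR/html.dcm" in buf. Switching to another document type only changes the root argument.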
4. Abbreviations
Some of the abbreviation features listed below are not attainable by
the abbreviation features of "vi" or "emacs".
- Mainly used for straight text substitution.
- Requested by an abbreviation code of the
form ";ab".
- Starts with an abbreviation request symbol ";".
- Followed by a 1 or 2-character code.
- The code is case-sensitive.
So ";ab" is different from ";Ab" and from ";aB".
- Two lists, each with its own symbol.
- There are more codes available with two lists than with one.
- One list can be used for the core collection. The other
can be reserved for application-specific jargon.
- The symbols are user chosen and specified in the initialization
files. The symbols can also be changed dynamically (partway
through the document) using a dot command (of the form ".a1 \").
- Format of the definition lists.
- Blank lines can be inserted for easy reading.
- The first non-blank line contains the abbreviation symbol.
- Subsequent lines contain the definitions in the simplest possible form, a code followed by its substitution text, as in "co SUBSTITUTION TEXT":
- The code "co" must start at column 1.
- "\" is the only special character in the "SUBSTITUTION TEXT".
- "SUBSTITUTION TEXT" can contain more than one word,
- or even more than one line, "\n" specifies a new line.
- "\\" gives "\".
- "\w" means that the rest of the substitution text is inserted after the next
alphanumeric "word". One blank space before
"word" is eaten up.
- This is used to construct abbreviations with non-local effects.
An example is to have ";bf word" expand to "boldface{word}"
when the definition is given by "bf boldface{\w}".
- "\W" means that the rest of the substitution text is inserted after the next
space-delimited "WORD". One blank space before
"WORD" is eaten up.
- (Maybe also "\e" for inserting at the end of the line.)
- "\12x" expands to 12 "x"'s. A useful example is when "x" is
a SPACE.
- Other suggestions?
- ";;" means capitalization/decapitalization (of the initial letter).
For instance,
if ";ab" is defined to be "abbreviation", then ";;ab" expands
to "Abbreviation", and if
"$u" is defined to be "University", then "$$u" expands to
"university". The effect is on the first letter of the text
only. Hence, a substitution text whose first character is not
a letter, such as "(abbreviation)", will not be capitalized.
- With practice and care, an abbreviation can be used to replace part of a word.
- Alphabetical sorting of the definition list is not necessary but
advisable for the ease of maintaining the list from a user's
standpoint.
- 2-character codes have precedence over 1-character codes.
- A definition further down the list has precedence over an earlier one.
- Definitions can be added to the second list dynamically within the
document using a dot command (of the form ".ab co sub.text").
- The definition lists are specified in the two initialization files "html.ab1"
and "html.ab2".
If the file "html.rc" exists, the first list is constructed by
appending to "html.ab1" the first part of "html.rc". The second
list comes from the second part of "html.rc". "html.ab2", if present,
will be ignored.
- This feature is provided so the user can use "html.ab1" to hold the
core abbreviations used for all applications - the ".ab1" files for different
applications are simply links to the same file.
- If, halfway through editing, a user decides that a new abbreviation
is needed, it can be added in-doc, or it can be added to the list
prior to expansion.
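The precedence rules and the doubled-symbol rule described in this section can be sketched in C as follows. This is an illustration under stated assumptions, not the prototype's code: the in-memory list "struct abbrev" and the function names are hypothetical, and the caller is assumed to have already recognized and skipped the request symbol.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one definition line. */
struct abbrev {
    const char *code;   /* 1 or 2 characters, case-sensitive */
    const char *text;   /* substitution text */
};

/* Scan from the end so that a later definition overrides an earlier one. */
static const struct abbrev *find_code(const struct abbrev *list, size_t n,
                                      const char *code, size_t len)
{
    for (size_t i = n; i-- > 0; ) {
        if (strlen(list[i].code) == len &&
            strncmp(list[i].code, code, len) == 0)
            return &list[i];
    }
    return NULL;
}

/*
 * Match the text that follows the request symbol. A 2-character code is
 * tried first; a 1-character code is used only if no 2-character code
 * matches. *used receives the number of characters consumed.
 */
static const struct abbrev *lookup(const struct abbrev *list, size_t n,
                                   const char *s, size_t *used)
{
    const struct abbrev *a = NULL;
    size_t avail = strlen(s);

    if (avail >= 2 && (a = find_code(list, n, s, 2)) != NULL) {
        *used = 2;
    } else if (avail >= 1 && (a = find_code(list, n, s, 1)) != NULL) {
        *used = 1;
    } else {
        *used = 0;
    }
    return a;
}

/* Effect of a doubled symbol: flip the case of the first character only. */
static void toggle_initial(char *text)
{
    unsigned char c = (unsigned char)text[0];

    if (islower(c))
        text[0] = (char)toupper(c);
    else if (isupper(c))
        text[0] = (char)tolower(c);
}
```

Because toggle_initial() changes only a leading letter, a substitution text such as "(abbreviation)" passes through untouched, which matches the behavior described for ";;".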
Abbreviation suppression
- On occasions when some document text must contain special
characters used by "magxp", there should be "magxp" directives to
temporarily suppress expansion.
- The existence of
the string "NO XP" in a line suppresses expansion
of the following line.
- The existence of the
string "NO EXPANSION" in a line suppresses expansion
of subsequent lines, until encountering another line containing
the string "RESUME EXPANSION". Such strings can be
embedded within comments.
- For document types that do not have comment markups (such as plain
ASCII files), there is an obvious problem. Any suggestions?
- Dot command equivalents of these are provided. Each command will
generate a comment line with the appropriate directive.
- This is so that when the expanded file runs through "magxp" again,
suppression of expansion will still be in effect; the
original dot command will no longer be present in the expanded
document.
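The suppression directives amount to a small state machine carried from line to line. The C sketch below is one possible reading, with assumptions stated plainly: "NO XP" is taken to affect only the line after the one that contains it, a directive never changes the treatment of its own line, and the names "struct xp_state" and "should_expand" are hypothetical.

```c
#include <string.h>

/* Suppression state carried from one input line to the next. */
struct xp_state {
    int skip_next;   /* set by "NO XP": leave the following line alone */
    int suspended;   /* set by "NO EXPANSION" until "RESUME EXPANSION" */
};

/*
 * Decide whether `line` should be expanded, then update the state from
 * any directive the line contains. Returns 1 to expand, 0 to copy as is.
 */
static int should_expand(struct xp_state *st, const char *line)
{
    int expand = !st->skip_next && !st->suspended;

    st->skip_next = 0;
    if (strstr(line, "RESUME EXPANSION") != NULL)
        st->suspended = 0;
    else if (strstr(line, "NO EXPANSION") != NULL)
        st->suspended = 1;
    if (strstr(line, "NO XP") != NULL)
        st->skip_next = 1;
    return expand;
}
```

Since the directives are found with a plain substring search, they work equally well inside a comment markup, and an expanded file that still carries the comment is protected in the same way on a second run.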
5. Built-in dot commands and ? abbreviations
- A few "magxp" commands are non-application-specific. These include commands
to change abbreviation and other special symbols, and to add
macros dynamically. These are truly built-in in the
sense that they cannot be modified without altering the
source code of the program.
- The other so-called built-in commands are a comprehensive set
of shorthands covering most of the basic markups used
in the application. Although such commands can be modified by the user,
they ought to be designed by an expert in the application and frozen
before distribution to general users. The set should only be
modified for necessary and substantial changes, such as to
accommodate new versions of the document type.
- Built-in commands and user-defined commands have the same format
except for the command request symbol.
- Built-in commands use the symbol "." which must be the first
nonblank character in the line. (Indentation is allowed for
easy reading.)
- The dot "." is followed by a command code of 1 or 2 characters.
The rest of the line is taken as arguments to the command,
and it may (or may not) be modified by the command.
- When expanded, the effect of a command
is to add text to the beginning of the
line, after each of the first few words, to the end of the line,
or at other locations specified by the user. Examples:
.t This is the title.
.1 Heading #1
.i abc.gif Do you like this picture?
.ah //www.mcs.anl.gov/ Click here.
expand (under "htxp") to
This is the title.
Heading #1
Do you like this picture?
Click here.
- Definitions of modifiable dot commands are given in "html.dcm".
- Definitions have the form
t __
i
ah __
In general, the syntax is
co [-o] Text1\wText2_Text3_Text4__Text5
- "co" is the command code.
- "[-o]" stands for optional flags, discussed below.
- "Text1" is to be added to the beginning of the line.
- "Text2" is to be inserted after the first pseudospace-delimited
word.
- "Text3" is to be inserted after the first space-delimited word.
- "Text4" after the second space-delimited word.
- "Text5" at the end of the line.
- "Text1", "Text2", and "Text3" eat up all
the spaces before the word they precede. So if you really would like
a space between them and the word, add the extra trailing spaces
to "Text1", etc. in the definition.
- "Text4", however, eats up only a single space.
- Special requests like "\\", "\n", and "\12x"
(but not "\W"; note also the slightly different meaning
of "\w") are honored as in abbreviation expansion.
- The pseudospace feature is proposed to handle a situation such as
.ah xyz.html,name="anchor" Click here.
"," is a pseudospace which will be changed into a space in
the expansion. It is used to mark where the extra " should be
added. The true space (after "anchor") triggers the
addition of ">" that encloses the HTML tag.
- If you need a true "," inside the string, use ",,", or
change the pseudospace symbol to another character.
- Options specified after the command code in the definition are used to
request additional actions. For now, there are only
-x suppress expansion for the next line
-v suppress expansion for subsequent lines
-V resume expansion for subsequent lines
The idea is to use these to alter some flags or parameters
that can change the action of subsequent commands until the
parameters are restored later by a complementary command.
- Future enhancement of the preprocessor may allow a command to have
alternative definitions depending on the internal flags.
- As an example, a "." alone can be used to produce a bullet within a
list environment, but it can mean something else while outside.
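Since dot-command definitions honor the same special requests as abbreviations, a single routine could serve both. The following C sketch handles "\\", "\n", and the repeat form "\12x". It is an illustration, not the prototype's code: the function name is hypothetical, the position markers "\w" and "\W" are assumed to have been dealt with by the caller beforehand, and any other escaped character is copied without its backslash.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Expand the special requests "\\", "\n", and "\<count><char>" in `src`
 * into `out`. Returns the output length, or -1 if `out` is too small.
 */
static int expand_requests(const char *src, char *out, size_t outsz)
{
    size_t o = 0;

    if (outsz == 0)
        return -1;
    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        size_t repeat = 1;

        if (c == '\\' && src[i + 1] != '\0') {
            i++;
            if (src[i] == 'n') {
                c = '\n';
            } else if (isdigit((unsigned char)src[i])) {
                repeat = 0;
                while (isdigit((unsigned char)src[i]))
                    repeat = repeat * 10 + (size_t)(src[i++] - '0');
                if (src[i] == '\0')
                    break;      /* a count with no character to repeat */
                c = src[i];
            } else {
                c = src[i];     /* "\\" gives "\" */
            }
        }
        if (repeat > outsz - 1 - o)
            return -1;          /* keep room for the terminating NUL */
        while (repeat-- > 0)
            out[o++] = c;
    }
    out[o] = '\0';
    return (int)o;
}
```

For instance, the substitution text "[\3 ]" comes out as a pair of brackets enclosing three spaces, which is the kind of use suggested for the repeat request.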
Additional built-in 1-character abbreviations provide shorthands for
commonly used markup
attributes. For example, "?l" expands to
"align="left"" and "?a" to "alt=""". These are also
defined in "html.dcm".
6. User commands
- These are defined (in "html.ucm" or "html.rc")
and used in exactly the same way as dot commands, except that
they start with a different symbol chosen by the user.
- It is handy to create macros as shorthand for long chains
of markups, in particular, those needed to produce new environments.
- The use of macros has the desirable benefit of letting a user
modify the environment by changing one macro
instead of having to change the long string of markups
everywhere it appears.
- The flexibility of having a large choice of attributes to include
in a markup makes it hard to design built-in dot commands that cover every possible
situation. But a user usually has his/her own favorite
and more or less fixed choices of attributes
for most situations. Macros can be defined
to take care of such frequent choices. For instance, if an image is
almost always aligned at the middle and an alt is always present,
a macro like this can be defined
i
so that
]i abc.gif This is a picture of abc.
will be expanded into
7. Implementation
The C code of my implementation is contained in
the same source code that produces the command
"magicxp",
which is used to construct
"htxp".
You also need the header file
"magicxp.h".
If the macro MAGXP is defined at the beginning of the code,
then "magxp" is produced instead of "magicxp". The
easiest way is to create the file "magxp.c" that contains
only the following two lines:
#define MAGXP
#include "magicxp.c"
Then compile with
You need some initialization files before the program can work.
Retrieve the following sample initialization files:
"html.dcm",
"html.ab1",
and
"html.rc".
Put these files in a directory called DIR, and
make the alias
Expanding a raw file "file" is done by using the command
To expand only the abbreviations in "file" use the command
Suggestions for improvements and enhancements are always welcome.
Mathematics and Computer Science Division
Argonne National Laboratory
9700 S Cass Ave
Argonne, IL 60439