Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center
Revision date: 21 May 1997
troffcvt turns troff input files into a more easily
parsed intermediate format to assist in the process of developing
troff-to-XXX translators. To provide further assistance,
the troffcvt distribution contains code for a library that
sequences troffcvt output into tokens. The library is called
the troffcvt reader (which means that it reads troffcvt
output, not that it is used by troffcvt). The combination
of troffcvt and the troffcvt reader essentially
turns troff files into a typed token stream. This simplifies
the job of writing postprocessors. Generally a postprocessor sits
on one side of a pipe reading the input from troffcvt,
which sits on the other side of the pipe. The reader code is linked
into the postprocessor and is called by it to get the next token
from the pipe.
This document describes troffcvt output format and discusses
how to write postprocessors that convert such output into some
target format. If you decide not to use the reader when writing
a postprocessor, you must understand how to interpret troffcvt
files. If you do use the reader, then you don't need to know as
much about troffcvt format since the reader tokenizes everything
for you. However, it's still useful to have at least a rudimentary
knowledge of the format.
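troffcvt writes three kinds of lines: control lines, which begin
with a \ character; special text lines, which begin with an @
character and name a special character; and plain text lines.
The overall layout of the output is: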
    \setup-begin
    \resolution N
    other setup lines...
    \setup-end
    rest of document...

The first portion of the file consists of a setup section bracketed
by \setup-begin and \setup-end lines. The lines in between indicate
the initial document layout. The first line of this information is
\resolution N, where N is the number of basic units per inch. This
indicates the resolution at which troffcvt performed its calculations.
Numbers obtained from other control lines may be converted to ems,
points, etc., using this resolution. For instance, if the resolution
is 1440, the control line \spacing 240 indicates a baseline spacing
of 1/6 inch.
The default resolution used by troffcvt is 432, but it can
be changed to whatever you want. Probably the lowest resolution
you want to use is the least common multiple of 72 and the resolution
you expect to use in the target format. Otherwise you may end
up with ugly round-off errors when you convert units back to ems,
points, etc.
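As a rough illustration of the least-common-multiple rule, the sketch below computes the smallest resolution that is a multiple of both 72 and a target resolution. The function names and the target_res parameter are mine, chosen for illustration; nothing like this is part of the troffcvt distribution.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm. */
static long gcd_long(long a, long b)
{
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Smallest resolution (basic units per inch) that is a multiple of
 * both 72 (points per inch) and target_res. target_res is a
 * hypothetical parameter: units per inch in the target format.
 */
long min_resolution(long target_res)
{
    assert(target_res > 0);
    /* lcm(72, t) = 72 / gcd(72, t) * t; divide first to limit overflow. */
    return 72 / gcd_long(72, target_res) * target_res;
}
```

For a target format that works at 300 units per inch, this gives 1800; for one that works at 1440, the target resolution itself already suffices.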
If the resolution is r, other common troff units
may be calculated as follows. (S is the current point size.)
Unit | Name | Number of basic units |
i | inch | r |
c | centimeter | r × 50/127 |
P | pica = 1/6 inch | r/6 |
m | em = S points | S × r/72 |
n | en = em/2 | S × r/144 |
p | point = 1/72 inch | r/72 |
u | basic unit | 1 |
v | vertical line space | varies; set by \spacing N |
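The conversions in the table can be sketched in C as follows. These helper names are hypothetical; r is the value taken from the \resolution line and s is the current point size.

```c
#include <assert.h>

/* Basic units to points: one point is r/72 basic units. */
double units_to_points(long units, long r)
{
    assert(r > 0);
    return (double)units * 72.0 / (double)r;
}

/* Basic units to inches: one inch is r basic units. */
double units_to_inches(long units, long r)
{
    assert(r > 0);
    return (double)units / (double)r;
}

/* Number of basic units in one em at point size s: S x r / 72. */
double em_in_units(long s, long r)
{
    assert(r > 0);
    return (double)s * (double)r / 72.0;
}
```

With a resolution of 1440, units_to_points(240, 1440) yields 12 points, which matches the 1/6 inch baseline spacing from the \spacing 240 example.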
The rest of the lines in the setup section contain information
for the page length, page width, indents, etc.
troffcvt knows the names of special characters from two
sources of information. The input and output sequences
for a given special character are either built in and recognized
implicitly, or taken from an action file that is read at runtime.
Special characters with an input sequence of the form \(xx
or \[xxx] are always in the latter category.
Built-in characters form a short list. Most of these are listed
in the "Escape Sequences for Characters, Indicators, and
Functions" section of the troff Summary and Index
document.
Input Sequence | Output Sequence | Note |
\e | @backslash | affected by .ec |
` | @quoteleft | |
' | @quoteright | |
`` | @quotedblleft | |
'' | @quotedblright | |
\& | @zerospace | |
\^ | @twelfthspace | |
\| | @sixthspace | |
\0 | @digitspace | |
\(space) | @hardspace | |
\- | @minus | |
\` | @grave | |
\' | @acute | |
\% | @opthyphen | affected by .hc |
\a, SOH | @leader | |
\t, TAB | @tab | |
\(backspace) | @backspace | |
varies | @fieldbegin | affected by .fc |
varies | @fieldend | affected by .fc |
varies | @fieldpad | affected by .fc |
The output sequence for \e actually depends on the current
escape character, which may be changed with .ec. The input
sequence for the optional hyphenation character may be changed
with .hc.
The characters @ and \ have special meaning in troffcvt
files, so they are indicated in troffcvt output by the
specials @at and @backslash where they are to appear
in the final output literally. Postprocessors should convert them
back to @ and \ characters.
The field delimiter and field pad characters defined with .fc
are written out as @fieldbegin or @fieldend, and
@fieldpad. (Odd delimiters begin fields; even ones end
fields).
The number of non-built-in characters is not fixed. All the special
characters listed in the Ossanna troff manual are defined
in the default action file supplied with the troffcvt distribution,
but the list may be modified as necessary to reflect extra special
characters available in your local version(s) of troff.
The default action file as distributed with troffcvt includes
a number of special characters known by groff.
The troffcvt file reader reads troffcvt output and
tokenizes it, setting several global variables in the process:
    tcrClass     token class
    tcrMajor     token major number
    tcrMinor     token minor number
    tcrArgv[]    token text vector
    tcrArgc      number of elements in tcrArgv[] vector

All tokens are assigned to a class. The other variables are set
or not depending on the class. The classes are:

    tcrEOF       end of input
    tcrControl   control line
    tcrText      plain text character
    tcrSText     special text character

Elements of tcrArgv[] are null-terminated strings. There are tcrArgc
elements in the vector. For plain text and special text tokens,
tcrArgc is always 1. For control tokens, the rest of the line is
automatically parsed to find any following arguments and these are
placed into tcrArgv[1] through tcrArgv[tcrArgc-1]. tcrArgv[tcrArgc]
is NULL in all cases.
All numbers on control lines are written as integers. Because
numbers may be quite large, postprocessors generally should convert
them to long rather than to short or int.
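One way to convert a control-line argument to a long is the standard strtol() function, as in the sketch below. The function name parse_long_arg is hypothetical; the text argument would be one of the argument strings from a control line.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a control-line argument as a long.
 * Returns 1 and stores the value in *out on success; returns 0 if the
 * text is not a complete decimal integer or does not fit in a long.
 */
int parse_long_arg(const char *text, long *out)
{
    char *end;
    long value;

    assert(text != NULL && out != NULL);
    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return 0;
    *out = value;
    return 1;
}
```

Checking both the end pointer and errno catches empty arguments, trailing junk and out-of-range values, so a malformed line can be reported rather than silently read as 0.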
To use the reader, call TCRInit(), then call TCRGetToken()
repeatedly. TCRGetToken() returns the token class value,
which it also stores in the variable tcrClass. When TCRGetToken()
returns tcrEOF, the input stream is exhausted and the postprocessor
can finish up.
Here is how the global variables are set for the various token
classes.
tcrClass = tcrEOF:
    none of the other variables are set

tcrClass = tcrControl:
    tcrMajor     control major number (see tcr.h)
    tcrMinor     major number subtype (not set for all control
                 words, see tcr.h)
    tcrArgv[i]   for i = 0, text of control word, including
                 leading \ character
                 for i > 0, argument following control word
    tcrArgc      number of arguments, including control word

tcrClass = tcrText:
    tcrMajor     ASCII value of character (each character is a
                 separate token)
    tcrArgv[0]   one-byte string containing the character
    tcrArgc      = 1

tcrClass = tcrSText:
    tcrMajor     usually tcrSTUnknown, but see below
    tcrArgv[0]   text of special character name, including
                 leading @ character
    tcrArgc      = 1

All the built-in special characters are recognized and assigned
distinct major numbers. Other specials are assigned the major number
tcrSTUnknown and the postprocessor must examine the text of the token
(tcrArgv[0]) to determine what it is and what to do with it. There
is no way for the reader to assign fixed numbers to these since the
set of special characters understood by troffcvt isn't fixed. One
way of dealing with the problem is to read at runtime a file of all
the special character names you expect to see. (Usually the same set
of names specified in the action file used with troffcvt.)
Note: although built-in special characters do have fixed
major numbers assigned, there is nothing to prevent you from processing
them like other specials, i.e., by examining the token text. It
may be more convenient to treat all specials uniformly.
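Treating all specials uniformly might look something like the following lookup by token text. This is only a sketch: the structure, the function name and the output strings are made up for illustration, and the table is compiled in here only to keep the example short. A real postprocessor would more likely fill it from a text file at runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry: special character name and the text to emit for it. */
struct special_map {
    const char *name;    /* token text, including the leading @ */
    const char *output;  /* hypothetical target-format sequence */
};

/* Illustrative entries only. */
static const struct special_map specials[] = {
    { "@at",        "@"  },
    { "@backslash", "\\" },
    { "@minus",     "-"  },
    { "@hardspace", " "  },
};

/* Return the output sequence for a special, or NULL if it is unknown. */
const char *lookup_special(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof specials / sizeof specials[0]; i++) {
        if (strcmp(specials[i].name, name) == 0)
            return specials[i].output;
    }
    return NULL;
}
```

A NULL result gives the postprocessor a single place to warn about specials that are missing from its list.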
The reader changes the characteristics of the default token scanner.
This is done in TCRInit(). If you use the token scanning
library for other purposes in your application, you need to change
the scanner's characteristics to what you want and then restore
them, or TCRGetToken() may not work correctly.
The variable tcrLineNumber holds the current input line number.
This may be useful when printing error messages. Be aware that
this is not the line number of the original troff input
given to troffcvt; it's the line number of the output from
troffcvt.
I assume in this section that you use the troffcvt reader
to write a postprocessor. It's not necessary that you do so, but
if you don't, most of the following comments don't apply.
It's best that you examine the source for some of the postprocessors
supplied in the troffcvt distribution before trying to
write one of your own. You should also read the document troffcvt
-- Notes, Bugs, Deficiencies to acquaint yourself with
troffcvt's many limitations.
A postprocessor can be set up this way:
How will you specify what to do with special characters? Remember
that the reader assigns distinct major numbers to only those special
characters for which recognition is built into troffcvt.
Generally, postprocessors read in a list of special characters
that parallels the list given in the action file used by troffcvt.
If the action file list is changed, all the lists used by various
postprocessors need to be changed, too. This is a headache, but
at least the changes can be made by editing text files rather
than by recompiling programs.
To get a list of all the special character names, run this command
in the misc directory:
    % chk-specials /dev/null > junk

This puts into junk all the special character names that are not
found in /dev/null, which, since that file is empty, will be all the
names. You can use the contents of junk as a basis for constructing
the output sequences you want the postprocessor to emit for various
special characters.
To test the postprocessor, you can run list-specials, another
command in the misc directory that generates a troff-format
listing of all the special characters and their names. By running
the output of list-specials through troffcvt and
the postprocessor, you can see how each special character is actually
treated.
Text centering, filling, and adjusting interact in troff.
My understanding of how this works is indicated below. Since my
conceptual scheme is instantiated in the code, let's hope it's
correct.
Centering (.ce) takes precedence over filling and adjustment.
When centering is not on, no-fill mode (.nf) suspends filling
and adjustment; input lines are copied to the output, left justified.
If centering is off and filling is on (.fi), input lines
are joined as necessary to fill output lines, which are then adjusted
according to the current adjustment specified by .ad. Adjustment
may be suspended with .na.
Turning off filling merely suspends adjustment. The adjustment
setting is remembered and goes back into effect when filling is
turned back on. Similarly, centering doesn't change the filling
or adjustment settings; they are suspended while centering is
in effect and resume when centering terminates.
troffcvt removes the need for postprocessors to handle
these centering, filling and adjusting (CFA) interactions, by
always explicitly writing out which CFA control code to use. This
means the postprocessor only need remember the most recent one.
If troffcvt did not do this, postprocessors would need
to maintain a bunch of state variables (currently centering? currently
filling? currently adjusting? which type of adjustment?).
The CFA control words are:
    \adjust-center
    \adjust-full
    \adjust-left
    \adjust-right
    \center
    \nofill

When \center occurs, centering should be turned on. All text up to
a \break should be placed on a single output line and centered.
Centering continues until a different CFA control occurs.
When \nofill occurs, no-fill mode should be turned on.
All text up to a \break should be placed on a single output
line, left-justified. No-fill mode continues until a different
CFA control occurs.
If neither centering nor no-fill is in effect, filling is on
and one of the adjustment modes \adjust-left, \adjust-right,
\adjust-full or \adjust-center will be issued. All
text up to the next \break should be used to fill output
lines. All output lines in a paragraph except the last should
be adjusted in the proper way.
Postprocessors can likely treat \center and \adjust-center
as equivalent. Ditto for \nofill and \adjust-left.
Note that there are no control words such as \nocenter,
\fill or \noadjust. Centering is turned off by \nofill
and the adjustment indicators. Filling is turned on by the adjustment
indicators. The troff no-adjust request .na seems
functionally equivalent to left-adjustment and so is indicated
with \adjust-left. The reason for the .na request
seems to be so that .na can be followed by .ad (with
no argument) to resume whatever adjustment mode was in effect
prior to the .na. Since troffcvt keeps track of
adjustment modes it can write out the proper indicator explicitly.
It is not the case that troffcvt output will contain a
single line of text corresponding to each input line when no-fill
or centering is in effect. For example, when input contains special
characters, each of these appears on a separate output line. Thus,
it's important to read text until a \break is seen.
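Because troffcvt always writes the CFA control explicitly, a postprocessor can get by with a single state variable. The sketch below shows one possible arrangement; the enum and function names are hypothetical, and the control word is assumed to arrive as a string with its leading backslash.

```c
#include <assert.h>
#include <string.h>

enum cfa_mode {
    CFA_NONE,            /* not a CFA control word */
    CFA_CENTER,
    CFA_NOFILL,
    CFA_ADJUST_LEFT,
    CFA_ADJUST_RIGHT,
    CFA_ADJUST_FULL,
    CFA_ADJUST_CENTER
};

/* Map a control word (with leading backslash) to a CFA mode. */
enum cfa_mode cfa_from_control(const char *word)
{
    assert(word != NULL);
    if (strcmp(word, "\\center") == 0)        return CFA_CENTER;
    if (strcmp(word, "\\nofill") == 0)        return CFA_NOFILL;
    if (strcmp(word, "\\adjust-left") == 0)   return CFA_ADJUST_LEFT;
    if (strcmp(word, "\\adjust-right") == 0)  return CFA_ADJUST_RIGHT;
    if (strcmp(word, "\\adjust-full") == 0)   return CFA_ADJUST_FULL;
    if (strcmp(word, "\\adjust-center") == 0) return CFA_ADJUST_CENTER;
    return CFA_NONE;
}

/* Remember only the most recent CFA mode; other controls leave it alone. */
enum cfa_mode cfa_update(enum cfa_mode current, const char *word)
{
    enum cfa_mode next = cfa_from_control(word);

    return next == CFA_NONE ? current : next;
}
```

The mode in force when a \break arrives then decides how the accumulated text is laid out.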
Some document formats indicate paragraphs when they begin, others
when they end. The postprocessor will need to follow whichever
convention is used in the target format. This should be a simple
matter since paragraph beginnings and endings both are readily
located in troffcvt output. \break corresponds to
paragraph endings. Beginnings are easily found also: the first
text line begins one, and every time a \break occurs, the
following text line begins one. (Remember that there may be other
non-text lines between the \break and the following text
line, though.)
Paragraph text should be treated conceptually as one unbroken
string of text, even though it may appear physically on several
lines of troffcvt output. Thus, successive text lines (either
plain or special) should be considered to be part of the same
paragraph until a \break control line occurs. The postprocessor
should perform line filling and wrapping according to the most
recent centering, filling or adjustment control line (one of \center,
\nofill, \adjust-left, \adjust-right, \adjust-full
or \adjust-center).
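The paragraph rule can be sketched without the reader, working directly on lines of troffcvt output that have already been read into memory with their linefeeds removed. The function name is hypothetical. Two things are assumptions on my part: that empty lines can be skipped, and that only a bare \break line needs to be treated as a paragraph ending here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count paragraphs in an array of troffcvt output lines. Text lines
 * (plain or @special) belong to the current paragraph until a \break
 * control line is seen; the next text line then starts a new one.
 */
size_t count_paragraphs(const char *const lines[], size_t n)
{
    size_t i, count = 0;
    int in_paragraph = 0;

    assert(lines != NULL || n == 0);
    for (i = 0; i < n; i++) {
        const char *line = lines[i];

        if (line[0] == '\0')
            continue;              /* assumption: empty lines carry no text */
        if (line[0] == '\\') {
            /* Control line: only \break closes the paragraph here. */
            if (strcmp(line, "\\break") == 0)
                in_paragraph = 0;
        } else if (!in_paragraph) {
            /* First text line after a break begins a paragraph. */
            in_paragraph = 1;
            count++;
        }
    }
    return count;
}
```

Testing the first character is enough to tell control lines from text, since a literal backslash in the text is written as @backslash.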
All characters on plain text lines are significant except the
terminating linefeed, which should be ignored. Postprocessors
should not treat leading or trailing spaces as extraneous without
a good reason. Postprocessors also should not insert space characters
between successive text lines; where necessary, spaces will already
have been placed within the text itself. One exception is that
the decision as to whether to put one or two spaces between sentences
is left to the postprocessor. The main difficulty is determining
when a sentence ends. If the usual suggested style for creation
of troff input files is followed (i.e., that each sentence
should begin on a new line), sentence-terminating periods, question
marks and exclamation points will occur at the ends of lines.
This property is preserved in troffcvt output. Postprocessors
thus can locate sentence endings and have the information they
need for determining whether to insert extra spaces, should they
wish to do so.
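A postprocessor that wants to detect sentence endings this way could use a test like the one below. It is a minimal sketch with a hypothetical function name, and it assumes the line has already had its terminating linefeed removed.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if a plain text line ends with a period, question mark
 * or exclamation point, 0 otherwise.
 */
int ends_sentence(const char *line)
{
    size_t len;
    char last;

    assert(line != NULL);
    len = strlen(line);
    if (len == 0)
        return 0;
    last = line[len - 1];
    return last == '.' || last == '?' || last == '!';
}
```

This deliberately ignores harder cases, such as a line that ends with an abbreviation or with a closing quote after the punctuation.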
Font handling can be a difficult issue. How do troff fonts
correspond to the fonts available in your target format? One problem
is that you cannot predict in advance which fonts might be used in
a troff document (although you can probably determine which
ones are available at your site).
Another problem is that the way fonts are treated in troff
doesn't correspond well to the way they're treated in other document
formats (at least in my experience). In troff one switches
from plain text to italic or boldface by switching fonts, e.g.,
from R to I, or from R to B. It is
evident that troff collapses the two dimensions of typeface
and style onto a single-dimensional font namespace. For some formats
this can be handled by leaving the typeface the same but applying
different style attributes to it.
For purposes of font support in the troffcvt reader it
may be more fruitful to map troff font names onto typeface-style
pairs, where the typeface is the font family a given font derives
from and the style indicates those attributes that need to be
applied to the plain font in that family to produce the effect
of the troff font. For instance, the default troff
fonts R, I and B can be described as follows:
Font | Typeface | Style |
R | Times | plain |
I | Times | italic |
B | Times | bold |
Treating fonts this way allows troff fonts to be manipulated
so that "font" changes that really correspond to style
changes can be handled as such.
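To make the typeface-style idea concrete, here is a small sketch of such a map held in memory. It is not the format of the tcr-fonts file nor the interface of r-font.c; the structure, the style flags and the function name are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Style attributes, combinable as bit flags. */
#define STYLE_PLAIN  0u
#define STYLE_ITALIC 1u
#define STYLE_BOLD   2u

struct font_map {
    const char *troff_name;  /* troff font name, e.g. "R" */
    const char *typeface;    /* font family */
    unsigned style;          /* attributes applied to the plain font */
};

/* The three default fonts from the table above. */
static const struct font_map fonts[] = {
    { "R", "Times", STYLE_PLAIN  },
    { "I", "Times", STYLE_ITALIC },
    { "B", "Times", STYLE_BOLD   },
};

/* Return the map entry for a troff font name, or NULL if unknown. */
const struct font_map *lookup_font(const char *troff_name)
{
    size_t i;

    assert(troff_name != NULL);
    for (i = 0; i < sizeof fonts / sizeof fonts[0]; i++) {
        if (strcmp(fonts[i].troff_name, troff_name) == 0)
            return &fonts[i];
    }
    return NULL;
}
```

With entries like these, a switch from R to I keeps the typeface and changes only the style flags, so the target format can apply an italic attribute instead of selecting a different font.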
A simple font to typeface-style map is included in the distribution
(the tcr-fonts file). This file should be modified as necessary
to reflect fonts available locally at your site and installed
into the troffcvt library directory. r-font.c contains
the code to use the font map. The sample postprocessor tc2rtf.c
shows one way to use it.
Tab stops should be interpreted relative to the current indent,
not the page offset. This means that if tab stops are set
and then the indent is changed, the effective tab stops relative
to the page offset change. Some postprocessors may need to reset
tabs in the target format when that happens.
This section documents the syntax of all control lines produced
by troffcvt. The descriptions are grouped according to
the section of the Ossanna troff manual to which they are
most closely related. The exceptions are section 0, which contains
descriptions for miscellaneous controls that don't correspond
to anything in the troff manual, and section 15, which
describes controls for table processing.
Unless otherwise indicated, numeric values on control lines are
specified in basic units.
\setup-begin
\setup-end
\resolution N
\comment string
\pass string
\line filename linenumber
\other string
\font F
\constant-width F
\noconstant-width F
\embolden F N
\embolden-special F N
\point-size N
\space-size N
\begin-page [N]
\offset N
\page-length N
\page-number N
\need N
\mark
\adjust-center
\adjust-full
\adjust-left
\adjust-right
\nofill
\center
\break
\break-spread
\spacing N
\line-spacing N
\space N
\extra-space N
\indent N
\line-length N
\temp-indent N
\diversion-begin name
\diversion-append name
\diversion-end name
\reset-tabs
\first-tab N c
\next-tab N c
\tab-char [c]
\leader-char [c]
\underline
\cunderline
\nounderline
\underline-font F
\motion N c
\line N c
\bracket-begin
\bracket-end
\overstrike-begin
\overstrike-end
\zero-width c
\hyphenate N
\title-length N
\title-begin c
    \title-begin l
    text of left title part
    \title-end
    \title-begin m
    text of middle title part
    \title-end
    \title-begin r
    text of right title part
    \title-end

If no text occurs between the \title-begin and \title-end lines, it
means the specified title part is empty. No control words will occur
between the \title-begin and \title-end lines.
\title-end
The troffcvt language contains special controls to indicate
table structure. These result when tblcvt is used to preprocess
troffcvt input. The controls should be written by troffcvt
in a particular order, but troffcvt itself does no checking
to verify the ordering. It relies on tblcvt to generate
table-related requests that specify table elements in the proper
sequence. For more details, see the document tblcvt
-- A troffcvt Postprocessor.
For testing tblcvt, see the tblcvt/tests directory,
which contains the tables from the Lesk tbl document, one
table per file.
\table-begin rows cols header-rows align expand box allbox doublebox
\table-end
\table-column-info width sep equal
\table-row-begin
\table-row-end
\table-row-line N
    \table-row-line 1    Table-width single line
    \table-row-line 2    Table-width double line

There is no end marker for this control, as none is needed.
\table-cell-info type vspan hspan vadjust border
    L    Left-justified
    R    Right-justified
    C    Centered
    N    Numeric (align to decimal point)
    A    Alphanumeric

vspan and hspan are the number of rows and columns spanned by the
cell, including itself. Interpret these values as follows:
  | hspan = 0 | hspan > 0 |
vspan = 0 | spanned both ways | spanned from above |
vspan > 0 | spanned from left | not spanned |
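The span table translates directly into code. In this sketch the enum and function names are hypothetical; vspan and hspan are the values taken from a \table-cell-info line.

```c
#include <assert.h>

enum span_kind {
    SPAN_NONE,        /* not spanned */
    SPAN_FROM_LEFT,   /* covered by a cell to the left */
    SPAN_FROM_ABOVE,  /* covered by a cell above */
    SPAN_BOTH         /* spanned both ways */
};

/* Classify a cell from its vspan and hspan values. */
enum span_kind classify_span(long vspan, long hspan)
{
    assert(vspan >= 0 && hspan >= 0);
    if (vspan == 0 && hspan == 0)
        return SPAN_BOTH;
    if (vspan == 0)
        return SPAN_FROM_ABOVE;
    if (hspan == 0)
        return SPAN_FROM_LEFT;
    return SPAN_NONE;
}
```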
Bits | Value | Meaning |
0-1 | 1 | Left border, single line |
0-1 | 3 | Left border, double line |
2-3 | 1 | Right border, single line |
2-3 | 3 | Right border, double line |
4-5 | 1 | Top border, single line |
4-5 | 3 | Top border, double line |
6-7 | 1 | Bottom border, single line |
6-7 | 3 | Bottom border, double line |
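The border value packs four two-bit fields, so each side can be pulled out with a shift and a mask. The names below are hypothetical. Reading a field value of 0 as "no border" is my assumption, since the table lists only the values 1 and 3.

```c
#include <assert.h>

/* Sides in bit-field order: bits 0-1, 2-3, 4-5, 6-7. */
enum side { SIDE_LEFT = 0, SIDE_RIGHT = 1, SIDE_TOP = 2, SIDE_BOTTOM = 3 };

/*
 * Extract the two-bit field for one side of a cell border value.
 * Per the table, 1 means a single line and 3 a double line.
 */
unsigned border_field(unsigned long border, enum side s)
{
    assert((int)s >= 0 && (int)s <= 3);
    return (unsigned)((border >> (2 * (unsigned)s)) & 3u);
}
```

For example, a border value of 49 (binary 00110001) decodes to a single line on the left and a double line on top.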
\table-cell-begin
\table-cell-end
\table-empty-cell
\table-spanned-cell
\table-cell-line N
    \table-cell-line 0    Column-data-width single line
    \table-cell-line 1    Column-width single line
    \table-cell-line 2    Column-width double line

There is no end marker for this control, as none is needed.