Simple translations with Cost

Joe English
Last updated: Sunday 27 June 1999, 15:31 PDT



1 Introduction

Cost is a powerful but somewhat complex system. The Simple module provides a simplified, high-level interface for developing translation specifications.

2 Getting started

A large number of SGML translation tasks involve nothing more than

  • inserting some text around each element,
  • replacing each SDATA entity reference with a suitable output format representation, and
  • ``escaping'' certain characters or sequences of characters that might be interpreted as markup in the target language.

The Simple module is designed to handle these types of translations. It makes a single pass through the document, inserting text and optionally calling a user-specified script at the beginning and end of each element. The translated document is written to standard output.

To load this module, put the command

    require Simple.tcl

at the beginning of the specification script. Next, define a translation specification as follows:

    specification translate {
        specification-rules...
    }

The specification-rules argument is a paired list matching queries with parameter lists. The queries are used to select element nodes, and are typically of the form

    {element GI}
or
    {elements "GI GI..."}
where each GI is the generic identifier or element type name of the elements to select.

Any Cost query may be used, including complex rules like

    {element TITLE in SECTION withattval SECURITY RESTRICTED}
or simple ones like
    {el}
The latter query -- el -- matches all element nodes; it can be used to specify default parameters for elements which don't match any earlier query.

The parameter lists are also paired lists, matching parameters to values. The Simple module translation process uses the following parameters:

before
    Text to insert before the element (before evaluating startAction)
startAction
    Tcl statements to execute at the beginning of the element
prefix
    Text to insert at the beginning of the element (after evaluating startAction)
suffix
    Text to insert at the end of the element (before evaluating endAction)
endAction
    Tcl statements to execute at the end of the element
after
    Text to insert after the element (after evaluating endAction)
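
As a sketch of how queries and parameter lists pair up, the following specification uses only the prefix and suffix parameters. The queries are the ones shown above; the output strings are illustrative assumptions, not anything Cost prescribes. Every rule sets the same two parameters, so each element takes both from the first query it matches, and the final {el} rule supplies empty defaults for everything else.

```tcl
require Simple.tcl

specification translate {
    {element TITLE in SECTION withattval SECURITY RESTRICTED} {
        prefix "\n*** "
        suffix " (restricted) ***\n"
    }
    {element TITLE} {
        prefix "\n*** "
        suffix " ***\n"
    }
    {el} {
        prefix ""
        suffix ""
    }
}
```

This fragment needs Cost to run; it is not a standalone Tcl script.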

Tcl variable, backslash, and command substitution are performed on the before, after, prefix, and suffix parameters. This takes place when the element is processed, not when the specification is defined. The values of these parameters are not passed through the cdataFilter command before being output.

NOTE -- Remember to ``protect'' all Tcl special characters by prefixing them with a backslash if they are to appear in the output. The special characters are: dollar signs $, square brackets [], and backslashes \. See the Tcl documentation on the subst command for more details.
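
The effect of protecting special characters can be tried in plain Tcl with the subst command, without Cost. In this sketch the variable gi and the sample strings are hypothetical; they only stand in for text that might appear in a before, after, prefix, or suffix parameter.

```tcl
# Plain Tcl; runs in tclsh without Cost.
# "gi" is a hypothetical variable standing in for state that exists
# when an element is processed.
set gi "TITLE"

# Unprotected dollar sign: the variable is substituted.
set expanded [subst {<!-- $gi -->}]

# Protected special characters come through literally.
set protected [subst {Cost: \$5 \[net\] in C:\\tmp}]

# Unprotected special characters that do not name a real variable or
# command make the substitution fail; catch returns 1 in that case.
set failed [catch {subst {Cost: $5 [net]}} message]

puts $expanded
puts $protected
puts "substitution failed: $failed"
```

Here expanded holds the text with TITLE filled in, while protected holds the dollar sign, brackets, and backslash exactly as they should appear in the output.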

cdataFilter
    A filter procedure for character data
sdataFilter
    A filter procedure for system data (SDATA entity references)

The cdataFilter parameter is the name of a filter procedure. This is a one-argument Tcl command. Cost passes each chunk of character data to this procedure, and outputs whatever the procedure returns. The initial value of cdataFilter is the identity command, which simply returns its input:

proc identity {text} {return $text}

The sdataFilter parameter works just like cdataFilter, except that it is used for system data (the replacement text of SDATA entity references). The initial sdataFilter is also identity.

The cdataFilter and sdataFilter parameters are inherited by subelements; that is, if they are not specified for a particular element then the currently active filter procedure will be used by default.
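
A filter procedure is an ordinary one-argument Tcl command, so it can be written and tested outside Cost. The sketch below escapes character data for an HTML-like target; the name htmlEscape and the choice of target format are assumptions made for illustration.

```tcl
# Hypothetical cdataFilter: escape characters that an HTML-like
# target format would otherwise read as markup.
proc htmlEscape {text} {
    # Handle '&' first so the entities added below are not re-escaped.
    # In the replacement string, \& stands for a literal ampersand.
    regsub -all {&} $text {\&amp;} text
    regsub -all {<} $text {\&lt;} text
    regsub -all {>} $text {\&gt;} text
    return $text
}

puts [htmlEscape {a < b & c}]
```

Naming htmlEscape as the cdataFilter of an outer element would, by the inheritance rule above, apply it to the character data of all subelements that do not specify a filter of their own.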

3 Other utilities

The translateContent procedure works just like the Cost built-in command content, except that the content of CDATA and SDATA nodes is filtered through the current cdataFilter and sdataFilter, respectively.
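
A minimal sketch of where translateContent might be used, assuming it is called without arguments in the same way as content. The element name and the message text are illustrative only.

```tcl
require Simple.tcl

specification translate {
    {element TITLE} {
        startAction {
            # Report the title text as it will appear after filtering
            puts stderr "Title: [translateContent]"
        }
    }
}
```

This fragment needs Cost to run; it is not a standalone Tcl script.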

4 Example

The following specification translates a subset of HTML to nroff -man macros. (Well, actually it doesn't do anything useful, it's just to give an idea of the syntax.)

require Simple.tcl

specification translate {
	{element H1} {
		prefix 	"\n.SH "
		suffix 	"\n"
		cdataFilter	uppercase
	}
	{element H2} {
		prefix 	"\n.SS "
		suffix 	"\n"
	}
	{elements "H3 H4 H5 H6"} {
		prefix "\n.SS "
		suffix "\n"
		startAction {
		    # nroff -man only has two heading levels
		    puts stderr "Mapping [query gi] to second-level heading"
		}
	}
	{element DT} {
		prefix	"\n.IP \""
		suffix	"\"\n"
	}
	{element PRE} {
		prefix "\n.nf\n"
		suffix "\n.fi\n"
	}
	{elements "EM I"} {
		prefix "\\fI"
		suffix "\\fP"
	}
	{elements "STRONG B"} {
		prefix "\\fB"
		suffix "\\fP"
	}

	{element HEAD} {
		cdataFilter nullFilter
	}
	{element BODY} {
		cdataFilter nroffEscape
	}
}

proc nullFilter {text} {
    return ""
}

proc nroffEscape {text} {
    # change backslashes to '\e'
    regsub -all {\\} $text {\\e} output
    return $output
}

proc uppercase {text} {
    return [nroffEscape [string toupper $text]]
}

5 Notes

The specification order is important: queries are tested in the order specified, so more specific queries must appear before more general ones.

Parameters are evaluated independently of one another: each parameter takes its value from the first matching rule that specifies it. For example,

specification translate {
    {element "TITLE"} {
	cdataFilter uppercase
    }
    {element TITLE in SECT in SECT in SECT} {
	prefix "<H3>"
	suffix "</H3>\n"
    }
    {element TITLE in SECT in SECT} {
	prefix "<H2>"
	suffix "</H2>\n"
    }
    {element TITLE in SECT} {
	prefix "<H1>"
	suffix "</H1>\n"
	startAction {
	    puts $tocfile [content]
	}
    }
}

Here the cdataFilter uppercase parameter applies to every TITLE element, regardless of where it occurs, and the startAction parameter applies to any TITLE that is a child of a SECT, even if an earlier matching rule specified a prefix or suffix.

As its name implies, the Simple module is not very sophisticated, but it should be enough to get you started.