

                                         COST 2 REFERENCE MANUAL
                                                     
   
   
   Joe English
   Last updated: Fri Oct 13 15:15:23 PDT 1995
     ____________________________________________________________________________________________
   
     * 1 Introduction
     * 2 Running CoST
     * 3 Getting Started
     * 4 Element Structure
          + 4.1 General properties
          + 4.2 Element nodes
          + 4.3 Data nodes
          + 4.4 Entities
          + 4.5 Attributes
     * 5 Queries
          + 5.1 Syntax
          + 5.2 Query commands
          + 5.3 Navigational clauses
          + 5.4 Addressing
          + 5.5 Miscellaneous clauses
     * 6 Event handlers
     * 7 Specifications
     * 8 Application Properties
     * 9 Miscellaneous utilities
          + 9.1 Environments
          + 9.2 Substitutions
     * 10 Examples
          + 10.1 Query examples
          + 10.2 A spell-checker
          + 10.3 Outline and index
          + 10.4 LaTeX conversion
     * 11 Changes from the B4 release
       
   
     ____________________________________________________________________________________________
   
1 Introduction

   
   
   CoST is a general-purpose SGML post-processing tool. It is a structure-controlled SGML
   application; that is, it operates on the element structure information set (ESIS) representation
   of SGML documents.
   
   CoST is implemented as a Tcl extension, and works in conjunction with the sgmls and/or nsgmls
   parsers.
   
   CoST provides a flexible set of low-level primitives upon which sophisticated applications can be
   built. These include
     * A powerful query language for navigating the document tree and extracting ESIS information;
     * An event-driven programming interface;
     * A specification mechanism which binds properties to nodes based on queries;
       
   
   
   CoST is a low-level programming tool. A working knowledge of SGML, Tcl, and [incr tcl] is
   necessary to use it effectively.
   
2 Running CoST

   
   
   Normally costsh is used in a pipeline with sgmls:

sgmls [ options ] sgml-document ...
    | costsh -S specfile [ script-options ... ]

   
   
   The -S flag specifies that costsh is to operate as a filter: it reads a parsed document instance
   from standard input, then evaluates the Tcl script specfile. The remaining script-options ... are
   available in the global list argv. Finally, costsh calls the Tcl procedure main if one was
   defined in specfile, then exits. main should take zero arguments.
   
   Calling costsh with no arguments starts an interactive shell:

costsh

   
   
   The Tcl command loadsgmls reads a document into memory:

loadsgmls filehandle

   
   
   Reads an ESIS event stream in sgmls format from filehandle and constructs the internal document
   tree. The current node is set to the root of the document. filehandle must be a Tcl file handle
   such as stdin or the return value of open.
   
   CoST provides two convenience functions as wrappers around loadsgmls. loadfile file reads a
   pre-parsed ESIS stream from a file and is essentially the same as

set fp [open "filename" r]
loadsgmls $fp
close $fp

   
   
   loaddoc invokes sgmls as a subprocess:

loaddoc args...

   
   
   Invokes sgmls with the arguments args... and reads the ESIS output stream. If the
   SGML_DECLARATION environment variable is set, passes that as the first argument to sgmls.
   
3 Getting Started

   
   
   NOTE -- CoST is a powerful but somewhat complex system. The Simple module provides a simplified,
   high-level interface for developing translation specifications.
   
   A large number of SGML translation tasks involve nothing more than
     * inserting some text around each element;
     * replacing each SDATA entity reference with a suitable output format representation;
     * ``escaping'' certain characters or sequences of characters that might be interpreted as
       markup in the target language.
       
   
   
   The Simple module is designed to handle these types of translations. It makes a single pass
   through the document, inserting text and optionally calling a user-specified script at the
   beginning and end of each element. The translated document is written to standard output.
   
   To load this module, put the command

require Simple.tcl

   at the beginning of the specification script. Next, define a translation specification as
   follows:

specification translate {
    specification-rules...
}

   
   
   The specification-rules form a paired list matching queries with parameter lists. The queries are
   used to select elements, and are typically of the form

    {element GI}

   or

    {elements "GI GI..."}

   where each GI is the generic identifier (or ``element type name'') of the elements to select.
   
   Any CoST query may be used, including complex rules like

    {element TITLE in SECTION withattval SECURITY RESTRICTED}

   
   
   The parameter lists are also paired lists, matching parameters to values. See below for the list
   of parameters used.
   
   For example, the following specification translates a subset of HTML to nroff -man macros. (It
   doesn't actually do anything useful; it's just meant to give an idea of the syntax.)

require Simple.tcl

specification translate {
        {element H1} {
                prefix  "\n.SH "
                suffix  "\n"
        }
        {element H2} {
                prefix  "\n.SS "
                suffix  "\n"
        }
        {elements "H3 H4 H5 H6"} {
                prefix "\n.SS "
                suffix "\n"
                startAction {
                    # nroff -man only has two heading levels
                    puts stderr "Mapping [query gi] to second-level heading"
                }
        }
        {element DT} {
                prefix  "\n.IP \""
                suffix  "\"\n"
        }
        {element PRE} {
                prefix "\n.nf\n"
                suffix "\n.fi\n"
        }
        {elements "EM I"} {
                prefix "\\fI"
                suffix "\\fP"
        }
        {elements "STRONG B"} {
                prefix "\\fB"
                suffix "\\fP"
        }

        {element HEAD} {
                cdataFilter nullFilter
        }
        {element BODY} {
                cdataFilter nroffEscape
        }
}

proc nullFilter {text} {
        return ""
}

proc nroffEscape {text} {
    # change backslashes to '\e'
    regsub -all {\\} $text {\\e} output
    return $output
}
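   For reference, here is how the nroffEscape filter defined above behaves on a sample string. This
   is plain Tcl and can be tried in any tclsh, independent of CoST:

```tcl
# The nroffEscape filter from the example above, applied standalone.
proc nroffEscape {text} {
    # change backslashes to '\e'
    regsub -all {\\} $text {\\e} output
    return $output
}

puts [nroffEscape {a \ b}]     ;# -> a \e b
```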

   
   
   The Simple module translation process uses the following parameters:
   
   startAction
          Tcl statements to execute at the beginning of the element
          
   endAction
          Tcl statements to execute at the end of the element
          
   before
          Text to insert before the element (before evaluating startAction)
          
   prefix
          Text to insert at the beginning of the element (after evaluating startAction)
          
   suffix
          Text to insert at the end of the element (before evaluating endAction)
          
   after
          Text to insert after the element (after evaluating endAction)
          
   cdataFilter
          The name of a one-argument Tcl command procedure; CoST passes all character data to this
          procedure and outputs the return value while the current element is active (unless
          overridden by another cdataFilter parameter for a child element).
          
   sdataFilter
          A filter procedure for system data (SDATA entity references).
          
   
   
   Tcl variable, backslash, and command substitution is performed on the before, after, prefix, and
   suffix parameters. This takes place when the element is processed, not when the specification is
   defined. The values of these parameters are not passed through the cdataFilter command before
   being output.
   
   NOTE -- Remember to ``protect'' all Tcl special characters by prefixing them with a backslash if
   they are to appear in the output. The special characters are: dollar signs $, square brackets [],
   and backslashes \. See the Tcl documentation on the subst command for more details.
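   To illustrate, the following plain-Tcl snippet (runnable in any tclsh, independent of CoST)
   demonstrates the subst behavior these parameters rely on, and how backslashes protect the
   special characters:

```tcl
set name "world"

# Unprotected text: $name and the bracketed command are substituted.
puts [subst {Hello $name [string toupper hi]}]   ;# -> Hello world HI

# Protected text: backslashes keep $, [ and ] literal in the output.
puts [subst {Cost: \$5 \[approx\]}]              ;# -> Cost: $5 [approx]
```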
   
   The default value of cdataFilter is the identity command, which simply returns its input. The
   default sdataFilter is also identity. The substitution command is useful for defining new
   character data filter procedures.
   
   As its name implies, the Simple module is not very sophisticated, but it should be enough to get
   you started. To do more powerful things with CoST, read on...
   
4 Element Structure

   
     ____________________________________________________________________________________________
   
     * 4.1 General properties
     * 4.2 Element nodes
     * 4.3 Data nodes
     * 4.4 Entities
     * 4.5 Attributes
       
   
     ____________________________________________________________________________________________
   
   
   
   An SGML document is represented in CoST as a hierarchical collection of nodes. Each node has an
   ordered list of children, and an unordered set of named attributes. Every node except the root
   node has a unique parent.
   
   There are several types of nodes, each with a different set of characteristics:
   
   SD
          An SGML document or subdocument
          
   EL
          An element
          
   PEL
          A ``pseudo-element'' or data container
          
   CDATA
          A sequence of data characters (excluding record-ends)
          
   RE
          A record-end character
          
   SDATA
          System data, from an SDATA entity reference
          
   ENTREF
          A data entity reference
          
   PI
          A processing instruction
          
   ENTITY
          An entity
          
   AT
          An attribute or data attribute
          
   
   
   The root node of a document is always an SD node. Elements are represented by EL nodes. Data
   content matched by a #PCDATA content model token is represented by a PEL node. Collectively,
   these three node types are called tree nodes.
   
   Sequences of characters other than record-ends are represented by CDATA nodes, and record-end
   characters appear as RE nodes.
   
   NOTE -- Technically, record-ends are character data, but it is often useful to process them
   separately, so CoST creates distinguished nodes for them.
   
   PI nodes represent processing instructions (and references to PI entities).
   
   SDATA nodes represent internal system data entity references, and ENTREF nodes represent external
   data entity references. (References to other types of entities are expanded by the parser and are
   not directly represented as tree nodes.)
   
   CDATA, RE, SDATA, and ENTREF nodes always appear as children of PEL nodes; PI nodes may appear
   anywhere in the tree.
   
   AT and ENTITY nodes do not appear as children of any node in the tree; instead, they are accessed
   by name.
   
   Node properties are accessed with queries.
   
   NOTE -- In the following sections, node properties are described as subcommands of the query
   command; however, they may be used wherever a query clause is appropriate.
   
  4.1 GENERAL PROPERTIES

query nodetype

   
   
   Returns the node type of the current node (SD, EL, PEL, et cetera).
   
   Specific node types may be selected with the sd, el, pel, cdata, sdata, re, and pi query clauses.
   These test the type of the current node, and fail if it does not match.
   
  4.2 ELEMENT NODES

query? el

   
   
   Tests if the current node is an EL node.

query gi

   
   
   Returns the generic identifier (element type name) of the current node. Fails if the current node
   is not an EL node.

query withgi gi

   
   
   Tests if the current node is an EL node with generic identifier gi. Matching is case-insensitive.

query element gi

   
   
   Synonym for query withgi gi

query elements "gi..."

   
   
   The argument gi... is a space-separated list of name tokens. Succeeds if the current node's
   generic identifier is any one of the listed tokens. Matching is case-insensitive.
   
   Element nodes may also have a dcn (data content notation) property. The DCN of an element is the
   value of the attribute (if any) with declared value NOTATION.
   
  4.3 DATA NODES
  
   
   
   Data nodes are those which directly contain data. This includes CDATA, SDATA, RE, PI, and AT
   nodes (but not PEL nodes, which are containers for data nodes).

query content

   
   
   Returns the character data content of the current node. For RE nodes, this is always a newline
   character (\n). For SDATA nodes it is the system data of the referenced entity. For PI nodes it
   is the system data of the processing instruction. For AT nodes it is the attribute value. Fails
   for all other node types.
   
   The content query clause only returns the content of data nodes. The content command returns the
   character data content of any node:

content

   
   
   If the current node is a data node, equivalent to query content. Otherwise, equivalent to join
   [query* subtree textnode content] "", i.e., returns the text content of the current node.
   
   The textnode clause filters out data nodes which are not part of the document's ``primary
   content'' (e.g., processing instructions).

query textnode

   
   
   Tests if the current node is a CDATA (character data), RE (record end), or SDATA (system data)
   node.
   
  4.4 ENTITIES

query dataent

   
   
   Tests if the current node is an ENTITY (data entity) or ENTREF (entity reference) node.
   
   ENTREF nodes appear in the document tree at the point of a data entity reference. ENTITY nodes
   represent the entity itself and do not appear as children of any tree node.
   
   All properties of ENTITY nodes (including their content and data attributes) are accessible from
   ENTREF nodes which reference them.
   
   The entity query clause navigates directly to an ENTITY node:

query entity ename

   
   
   Selects the ENTITY node corresponding to the entity named ename in the current subdocument, if
   any. The entity name is case-sensitive.
   
   ENTITY nodes will only be present for external data entities which are referenced in the
   document, and data entities named in an attribute with declared value ENTITY or ENTITIES.

query ename

   
   
   Returns the entity name of the current node if it is an ENTITY or ENTREF node; fails otherwise.
   
   Note that the entity name is not available for SDATA nodes.
   
   The content command returns the replacement text of internal data entity nodes.
   
   External entities have a system identifier, a public identifier, or both.

query sysid

   
   
   Returns the system identifier of the entity referenced by the current node if one was declared;
   fails otherwise.

query pubid

   
   
   Like sysid but returns the public identifier of the entity referenced by the current node.
   
   External data entities have an associated data content notation.
   
   NOTE -- Elements (EL nodes) may also have a data content notation. This is determined by the
   value of an attribute with declared value NOTATION if one is specified for the element.

query dcn

   
   
   Returns the name of the current node's data content notation, if any.

query withdcn name

   
   
   Tests if the current node's data content notation is defined and is equal to name. Comparison is
   case-insensitive.
   
   External data entities may also have data attributes if any are declared for the entity's
   associated data content notation. Data attributes are accessed in the same way as regular
   attributes.
   
  4.5 ATTRIBUTES
  
   
   
   AT nodes do not appear in the tree directly; instead, they are accessed by name from their parent
   node.
   
   Only EL nodes and ENTITY nodes have attributes.

query attval attname

   
   
   Returns the value of attribute attname on the current node. If the attribute has an implied
   value, returns the empty string. Fails if attname is not a declared attribute of the current
   node.

query hasatt attname

   
   
   Tests if the current node has an attribute named attname with a non-implied value (i.e., the
   attribute was specified in the start-tag or a default value appeared in the <!ATTLIST>
   declaration).

query withattval attname value

   
   
   Tests if the value of the attribute attname on the current node has the value value. Comparison
   is case-insensitive.
   
   The attribute and attlist clauses navigate to AT nodes.

query attribute attname

   
   
   Selects the attribute named attname of the current node. Fails if no such attribute is present.

query attlist

   
   
   Selects each attribute (AT node) of the current node, in an unspecified order.

query attname

   
   
   Returns the attribute name of the current node, if it is an AT node.
   
   The content query clause returns the attribute value of the current node if it is an AT node.
   
5 Queries

   
     ____________________________________________________________________________________________
   
     * 5.1 Syntax
     * 5.2 Query commands
     * 5.3 Navigational clauses
     * 5.4 Addressing
     * 5.5 Miscellaneous clauses
       
   
     ____________________________________________________________________________________________
   
   
   
   The CoST query language is used in several places:
     * Accessing and testing node properties with the query, query*, and query? commands;
     * Locating nodes for processing with the withNode and foreachNode commands;
     * In the ``match'' part of specification clauses;
       
   
   
   CoST queries are similar to Prolog statements or ``generators'' in the Icon programming language.
   
  5.1 SYNTAX
  
   
   
   A query consists of a sequence of clauses. Each clause begins with an identifying keyword, and
   may contain further arguments. Clause keywords are case-insensitive. Arguments may or may not be
   case-sensitive depending on the clause.

query ::= clause [ clause ... ] ;
clause ::= keyword [ arg ...] ;

   
   
   Note that there is no ``punctuation'': clauses and arguments are delimited by spaces as per the
   usual Tcl parsing rules. Since each clause takes a fixed number of arguments, there is no
   ambiguity.
   
   Queries are evaluated from left to right, evaluating each clause in turn. Each clause is
   evaluated in the context of a current node.
   
   Clauses may take one of four actions:
     * succeed, possibly selecting a new current node
     * fail, causing the query to backtrack
     * return a value to the caller
     * abort, signalling an error
       
   
   
   If a clause succeeds, evaluation continues with the next clause. If it fails, evaluation
   backtracks to the previous clause, which will in turn either fail or select a new current node
   and continue again.
   
   When the query is complete, the original current node is restored.
   
   For example, the command

query ancestor attval "ID"

   is evaluated as follows:
     * the current node becomes the source node for the ancestor clause;
     * the ancestor clause passes the source node to the attval ID clause;
     * if the current node has an ID attribute, attval returns that to the query command;
     * otherwise, the clause fails and evaluation backtracks to the ancestor clause, and
     * ancestor sets the current node to the source node's parent and continues again; then its
       parent's parent, and so on until the rest of the query succeeds or the root node is reached.
     * If ancestor reaches the root node, then the whole query fails.
       
   
   
   Note that failure does not signal an error -- the query command just returns the empty string in
   this case.
   
  5.2 QUERY COMMANDS

query clause...

   
   
   Evaluates the query clause..., and returns the first successful result. If the query fails or
   does not return a value, returns the empty string. q is a synonym for query.

query? clause...

   
   
   Evaluates the query clause..., and returns 1 if the query succeeds, 0 otherwise.

query* clause...

   
   
   Returns a list of all values produced by the query clause....

countq clause...

   
   
   Returns the number of nodes selected or results returned by the query clause....

withNode clause...  { stmts }

   
   
   Evaluates stmts as a Tcl script with the current node set to the first node produced by the query
   clause.... If the query fails, does nothing.

foreachNode clause...  { stmts }

   
   
   Evaluates stmts with the current node set to every node produced by the query clause... in order.
   
   withNode and foreachNode both restore the original current node when evaluation is complete. The
   selectNode command sets the current node in the calling context:

selectNode clause...

   
   
   Sets the current node to the first node produced by evaluating the query clause....
   
  5.3 NAVIGATIONAL CLAUSES
  
Ancestors


query parent

   
   
   Selects the source node's parent.

query ancestor

   
   
   Selects all ancestors of the source node, beginning with the source node and ending with the root
   node.

query rootpath

   
   
   Selects all ancestors of the source node, beginning with the root node and ending with the source
   node.
   
   Note that a node is considered to be an ancestor of itself.
   
Siblings


query left

   
   
   Selects the source node's immediate left (preceding) sibling. Fails if the source node is the
   first child of its parent.

query right

   
   
   Selects the source node's immediate right (following) sibling. Fails if the source node is the
   last child of its parent.
   
   left and right only select a single node. prev and next select multiple siblings:

query prev

   
   
   Selects all earlier siblings of the source node, starting with the immediate left sibling and
   continuing backwards to the first child.

query next

   
   
   Selects all later siblings of the source node.
   
   The prev query clause selects nodes in ``reverse order''; the esib (``elder siblings'') clause
   selects them in the same order as they appear in the document:

query esib

   
   
   Selects all earlier siblings of the source node, starting with the first child node and ending
   with the immediate left sibling.
   
   The ysib (``younger siblings'') clause is present for symmetry with esib. It is a synonym for
   next.

query ysib

   
   
   Selects all later siblings of the source node.
   
   To select all of a node's siblings (including the node itself), use query parent child.
   
Descendants


query child

   
   
   Selects all children of the source node in order.

query subtree

   
   
   Selects all descendants of the source node in preorder traversal (document) order. Note that a
   node is considered to be a member of its subtree.

query descendant

   
   
   Preorder traversal. This is like subtree, but does not include the source node.
   
  5.4 ADDRESSING
  
   
   
   Every tree node (EL and PEL nodes) has a unique node address. This is an opaque string by which
   the node may be referenced.

query address

   
   
   Returns the node address of the current node; fails if the current node is not a tree node.

query node addr

   
   
   Selects the node whose address is addr.
   
   NOTE -- Actually, the node address is only semi-opaque: it is composed of the first and third
   components of the node's pathloc document position, separated by a colon.
   
  5.5 MISCELLANEOUS CLAUSES

query docroot

   
   
   Selects the root node of the document.
   
   The root node of a document is always an SD node. The top-level document element may be selected
   with query docroot child el.

query doctree

   
   
   Selects every node in the document. Equivalent to query docroot subtree.

query in gi

   
   
   Selects the parent node if it is an EL node with generic identifier gi, fails otherwise.
   Shorthand for parent withGI gi.

query within gi

   
   
   Selects all ancestor EL nodes with generic identifier gi. Equivalent to ancestor withGI gi.
   
6 Event handlers


process cmd

   
   
   Performs a preorder traversal of the subtree rooted at the current node, calling cmd for each
   ESIS event. cmd is invoked with one argument, the name of the event, with the current node set to
   the active node.
   
   The process command implements an event-driven processing model, calling a user-specified event
   handler at each node. It essentially reconstructs the source ESIS event stream for a particular
   subtree.
   
   The event handler may be any Tcl command, including an [incr tcl] object or a specification
   command. The handler is called with one argument, which is one of the following event types:
   
   START
          Invoked when entering an EL (element) node. The current node is set to the EL node.
          
   END
          Invoked when leaving an EL node. The current node is set to the EL node.
          
   CDATA
          Invoked for each CDATA (character data) node.
          
   RE
          Invoked for each RE (record end or ``newline'') node.
          
   SDATA
          Invoked for each SDATA (system data entity reference) node.
          
   PI
          Invoked for each PI (processing instruction) node.
          
   DATAENT
          Invoked for each ENTREF (data entity reference) or ENTITY (data entity) node.
          
   
   
   Most event types correspond directly to data node types. Two events are generated for each EL
   node, one at the start of the element and one at the end. No events are generated for PEL nodes
   (events are generated for each data node child, however).
   
   [incr tcl] classes which are to be used as event handlers should inherit from the EventHandler
   base class, which defines a do-nothing method for each event type.
   
Example



# File: printtree.spec
# Sample event handler
# Prints an indented listing of the tree structure

proc printtree {event} {
    global level
    switch $event {
        START {
            indent $level;
            puts "<[query gi]>";
            incr level;
        }
        END {
            incr level -1;
            indent $level;
            puts "</[query gi]>";
        }
        CDATA {
            indent $level; puts "\"[query content]\""
        }
        SDATA {
            indent $level; puts "|[query content]|"
        }
        RE {
            indent $level; puts "RE"
        }
        DATAENT {
            indent $level; puts "&[query ename];"
        }
        default {
            indent $level ; puts "$event"
        }
    }
}

global level; set level 0

proc main {} {
    process printtree
}

proc indent {n} {
    while {$n > 0} { puts -nonewline stdout "   " ; incr n -1 }
}

7 Specifications

   
   
   Specifications assign parameters to document nodes based on queries.

specification specName {
    { query }  { name value name value ... }
    { query }  { name value ... }
    ...
}

   
   
   Defines a new specification associating each query to the matching list of name-value pairs.
   Creates a Tcl access command named specName.
   
   Evaluating a specification tests each query in sequence, and looks for a matching name in the
   parameter list associated with every query that succeeds. Comparison is case-sensitive. All the
   names in a single parameter list must be unique.

specName has name

   
   
   Tests if there is a binding for name associated with the current node in specName. Returns 0 if
   no such binding exists, 1 otherwise.

specName get name [ default ]

   
   
   Returns the value paired with name associated with the current node in specName. If there is no
   such binding, then if a default argument was supplied, returns default; otherwise signals an
   error.
   
   Parameter bindings may also be Tcl scripts. The do subcommand is a convenient way to define
   ``methods'' for document nodes.

specName do name

   
   
   Equivalent to eval [specName get name ""] -- retrieves the binding (if any) of name in specName
   associated with the current node and evaluates it as a Tcl script. If no match is found, does
   nothing.
   
   As a special case, specName event is equivalent to specName do event for each event type (START,
   END, CDATA, etc.). This allows specification commands to be used as event handlers by the process
   command.
   
   The order of entries in a specification is significant. More specific queries should appear
   before more general ones. For example, {element P withattval SECURITY TOP} {hide 1} must appear
   before {element P} {hide 0}, or else the {hide 0} binding will always take precedence.
   
   NOTE -- Tcl-style comments -- beginning with a # and extending to the end of the line -- may not
   be used inside the specification definition.
   
8 Application Properties

   
   
   Document nodes may be annotated with application-defined properties. Property values are strings
   (like everything in Tcl), and are accessed by name.

setprop propname propval

   
   
   Assigns propval to the property propname on the current node.
   
   Property values are retrieved with queries:

query propval propname

   
   
   Returns the value of the property propname on the current node; fails if no such property has
   been assigned.

query hasprop propname

   
   
   Succeeds if the current node has been assigned a property named propname, fails otherwise.
   
   Property names are case-sensitive. Property names beginning with a hash sign (#, the SGML RNI
   delimiter) are reserved for internal use by CoST.
   
9 Miscellaneous utilities

   
     ____________________________________________________________________________________________
   
     * 9.1 Environments
     * 9.2 Substitutions
       
   
     ____________________________________________________________________________________________
   
  9.1 ENVIRONMENTS
  
   
   
   An environment is a set of name-value bindings, much like an associative array. Bindings may be
   saved and restored dynamically, similar to TeX's grouping mechanism. It is possible to create
   multiple independent environments.

environment envname [ name value ...]

   
   
   Creates a new environment and a Tcl access command named envname. The optional name and value
   argument pairs define initial bindings in the environment.

envname set name value [ name value... ]

   
   
   Adds the name-value pairs to the environment envname, overwriting the current binding of each
   name if it is already present.

envname get name [ default ]

   
   
   Returns the value currently bound to name in the environment envname. If no binding for name
   currently exists in envname and the default argument is present, returns that instead; otherwise
   signals an error.

envname save [ name value ... ]

   
   
   Saves the current set of name-value bindings in envname. If name and value argument pairs are
   supplied, adds new bindings to the environment after saving the current bindings.

envname restore

   
   
   Restores the bindings in envname to their settings at the time of the last call to envname save.
   
   If the set and save subcommands are passed one extra argument, it is treated as a list of
   name-value bindings.
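   The save/restore behavior can be sketched in plain Tcl using a dict for the bindings and a list
   as the save stack. This is a simplified emulation for illustration, not CoST's implementation,
   and the command names here are hypothetical:

```tcl
# Current bindings and a stack of saved snapshots.
set bindings [dict create font roman size 10]
set saved {}

proc env-set {name value} {
    global bindings
    dict set bindings $name $value
}
proc env-get {name} {
    global bindings
    return [dict get $bindings $name]
}
proc env-save {} {
    global bindings saved
    lappend saved $bindings          ;# snapshot the current bindings
}
proc env-restore {} {
    global bindings saved
    set bindings [lindex $saved end]
    set saved [lrange $saved 0 end-1]
}

env-save                 ;# like TeX's opening brace
env-set font italic
puts [env-get font]      ;# -> italic
env-restore              ;# like TeX's closing brace
puts [env-get font]      ;# -> roman
```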
   
  9.2 SUBSTITUTIONS
  
   
   
   When translating SGML documents to other formats (including other SGML document types), it is
   often necessary to ``escape'' or ``protect'' character data that might be interpreted as markup
   in the result language. For example, HTML requires all occurrences of <, > and & to be entered as
   entity references &lt;, &gt; and &amp;. TeX and LaTeX have many special characters which must be
   entered as control sequences.
   
   The substitution command provides an easy and efficient way to apply fixed-string substitutions.

substitution substName {
    string replacement
    string replacement
    ...
}

   
   
   Defines a new Tcl command substName which takes a single argument and returns a copy of the input
   with each occurrence of any string replaced with the corresponding replacement. Substitutions are
   applied left to right; at each position, the longest matching string takes precedence.
   
Example


substitution entify {
        {<} {&lt;}
        {>} {&gt;}
        {&} {&amp;}
        {<=} {&le;}
        {>=} {&ge;}
}
entify "a < b && b >= c"
# returns "a &lt; b &amp;&amp; b &ge; c"

10 Examples

   
     ____________________________________________________________________________________________
   
     * 10.1 Query examples
     * 10.2 A spell-checker
     * 10.3 Outline and index
     * 10.4 LaTeX conversion
       
   
     ____________________________________________________________________________________________
   
   
   
   NOTE -- Many of these examples are based on HTML; some familiarity with that document type is
   assumed.
   
  10.1 QUERY EXAMPLES
  
   
   
   Here is a simple query which returns a list of all of the hyperlinks (HREF attribute values) in
   an HTML document:

query* doctree element A attval HREF

   
   
   A slightly better version is:

query* doctree element A hasatt HREF attval HREF

   The hasatt HREF clause filters out elements whose HREF attribute is implied (that is, absent).
   Without this clause, the returned list would contain an empty member for each A element that is a
   destination anchor (A NAME=... instead of A HREF=...).
   
   The next example builds a cross-reference list from an HTML document, printing the anchor name of
   each destination anchor and the target URL of each source anchor, along with the anchor text:

puts stdout "Destination anchors:"
foreachNode doctree element A hasatt NAME {
    puts stdout "\t#[query attval NAME]: [content]"
}
puts stdout "Source anchors:"
foreachNode doctree element A hasatt HREF {
    puts stdout "\t<URL:[query attval HREF]>: [content]"
}

   
   
   A similar listing could also be produced with an event-driven specification:

specification printAnchors {
    {element A hasatt HREF} {
        START   { puts stdout "<URL:[query attval HREF]>: " nonewline }
    }
    {element A hasatt NAME} {
        START   { puts stdout "Anchor #[query attval NAME]: " nonewline }
    }
    {element A} {
        END     { puts "" }
    }
    {textnode within A} {
        CDATA   { puts stdout [query content] nonewline }
        RE      { puts stdout " " nonewline }
    }
}

   
   
   This would be invoked as:

withNode docroot { process printAnchors }

   
   
   The next example demonstrates a multi-step navigational query. (Each query clause is listed on a
   separate line for clarity.)

proc xreftext {refid} {
    withNode \
            doctree \
            element SECT \
            withattval ID $refid \
            child \
            element TITLE  \
    {
        return [content]
    }
    error "No such section $refid"
}

   
   
   doctree element SECT selects all the SECT elements. withattval ID $refid tests if the source node
   has the right ID. child element TITLE navigates to the first TITLE subelement, and then the
   withNode body returns the content of that element. (This could be used to generate
   cross-reference text from an ID reference, for example.)
   
   Another way to do this is:

join [query* doctree element SECT withattval ID $refid \
        child element TITLE subtree textnode content]

   
   
   NOTE -- The join command is necessary if the TITLE element contains subelements or SDATA nodes,
   in which case query* ... subtree textnode content returns a list with more than one member.
   
  10.2 A SPELL-CHECKER
  
   
   
   If you've ever tried to run the Unix utility ispell on an SGML document, you've probably noticed
   that it doesn't do a very good job, since it tries to ``correct'' the spelling of all the tags
   and other markup. (It's programmed to understand LaTeX and nroff markup, but it doesn't know
   anything about SGML.)
   
   This example, implemented as an [incr tcl] event handler, simply extracts the character data from
   the input document and filters it through ispell, producing a list of potentially misspelled
   words on standard output.
   
   It's not as fancy as ispell's interactive mode, but it works well enough. It does have one extra
   feature useful for technical documentation: you can specify a list of elements which should not
   be spell-checked.
   
   The SpellChecker event handler class works with any document type, modulo the list of suppressed
   elements. It recognizes one processing instruction: <?spelling word word...> adds the listed
   words to a local dictionary.
   
   Here is the specification used to spell-check this document:


require specs/Spell.tcl

SpellChecker spellChecker \
    -suppress "AUTHOR DATE EDNOTE LISTING EXAMPLE SYNOPSIS LIT SAMP VAR
                ATTR CLASS CMD ELEM ENV EVENT NODETYPE QC SUBCMD TAG
                ARG OPTARG OPTION"

proc main {} {
    spellChecker begin
    process spellChecker
    spellChecker end
}

   
   
   And here is the implementation of the SpellChecker class:

# Spell.tcl
# CoST wrapper around 'ispell'

needExtension ITCL

itcl_class SpellChecker {
    inherit EventHandler;

    public dictfile ""
    public suppress "";         # list of elements not to spellcheck
    public tmpfile "/tmp/costspell.tmp";

    protected ispellpipe;       # pipe to 'ispell' process
    protected suppressing 0;    # flag: currently suppressing output?
    protected wordlist "";      # local dictionary

    constructor {config} {
        # make sure suppress GI list is all uppercase
        set suppress [string toupper $suppress]
    }

    method suppress? {} {
        # suppress checking for current element?
        return [expr [lsearch $suppress [query gi]] != -1]
    }

    # The START and END tag handlers just set the 'suppressing' flag,
    # and make sure there's whitespace between element boundaries.
    method START {} {
        if [suppress?] { incr suppressing }
    }
    method END {} {
        if [suppress?] { incr suppressing -1 }
        puts $ispellpipe ""
    }

    # Feed character data to 'ispell':
    method CDATA {} {
        if !$suppressing { puts $ispellpipe [content] }
    }

    method PI {} {
        # Is this a <?spelling ...> instruction?
        if {[lindex [query content] 0] == "spelling"} {
            # Yep; add to local dictionary:
            append wordlist " [lrange [query content] 1 end]"
        }
    }

    method begin {} {
        set cmd "ispell -l"
        if {$dictfile != ""} { append cmd " -p $dictfile" }
        set ispellpipe [open "|$cmd | sort | uniq > $tmpfile" w]
        set suppressing 0;
    }

    method end {} {
        close $ispellpipe
        # Read words back from temporary file
        set fp [open $tmpfile r]
        while {[gets $fp word] >= 0} {
            # see if it's in local dictionary:
            if {[lsearch $wordlist $word] == -1} {
                # nope; report it:
                puts stdout $word
            }
        }
        close $fp
    }
}

  10.3 OUTLINE AND INDEX
  
   
   
   This is a utility which I've found useful in preparing this reference manual. It builds an
   outline from the section titles, and produces an index of every command and query clause
   mentioned in the document, cross-referenced to the section in which it appears.
   
   The DTD uses a recursive model for sections:

    <!element sect      - O (title,(%m.sect;)*,subsecs?) >
    <!element subsecs   - - (sect+)>

   Each SECT element contains a TITLE (the section heading), followed by any number of block-level
   elements (%m.sect;), and an optional SUBSECS element, which in turn contains other sections.
   
   Commands are tagged with the CMD element, and query clauses are tagged with the QC element. (The
   spell-checker in the previous example helps make sure that commands and query clauses are
   properly marked up!)

#
# outline.spec
# Build a table of contents and command/query clause index
# from the main document.
#

proc main {args} {
#
# The first pass prints the table of contents
# and annotates each SECT element with properties
# that are used in the second pass:
#
    process outline
    nl; nl;
#
# The second pass builds and prints an index of each command
# (CMD elements) and query clause (QC elements) used in the document,
# printing the section number(s) where they appear.
#
    puts stdout "Commands:"
    listall CMD
    nl;
    puts stdout "Query clauses:"
    listall QC
}


#
# Pass 1: table of contents
# Lists all <SECT>ion <TITLE>s and <H>eadings,
# assigning section number properties ('secnum') on the way.
#
global secdepth ;       # current nesting level
global secctrs ;        # array: secdepth -> current section number

set secdepth 1
set secctrs($secdepth) 0

specification outline {

    {element SECT} {
        START {
            global secdepth secctrs
            incr secctrs($secdepth)

            # Set node properties:
            setprop secdepth $secdepth
            setprop secctr $secctrs($secdepth)
            setprop secnum [join [query* rootpath propval secctr] "." ]

            # Set up for subsections:
            incr secdepth
            set secctrs($secdepth) 0
        }
        END {
            incr secdepth -1
        }
    }

    {element TITLE} {
        START {
            global secdepth
            indent $secdepth
            output "[query parent propval secnum] "
        }
        END {
            nl
        }
    }
    {textnode within TITLE} {
        CDATA { output [content] }
    }

    {element H} {
        START { indent [expr $secdepth + 1] }
        END { nl }
    }
    {textnode within H} {
        CDATA { output [content] }
    }
}


#
# Pass 2: build and print an index of terms.
# 'gi' is the generic identifier of the element to be indexed.
#
proc listall {gi} {
    foreachNode doctree element $gi {
        set term [content]
        set where [query ancestor propval secnum]
        lappend tindex($term) $where
    }

    foreach term [lsort [array names tindex]] {
        set tindex($term) [luniq $tindex($term)]
        indent 1
        output "$term ([join $tindex($term) ", "])";  nl
    }
}

#
# Miscellaneous utilities:
#
proc output {data} { puts stdout $data nonewline; }
proc nl {} { puts stdout ""; }
proc indent {n} {
    while {$n > 0} {
        output "    ";
        incr n -1;
    }
}

  10.4 LATEX CONVERSION
  
   
   
   This is a comprehensive example which uses most of the core CoST features, including event
   handlers, specifications, environments and substitutions. It is used to convert this reference
   manual to LaTeX.
   
   It is split in two parts: there is a generic ``architectural form'' processor implemented as an
   [incr tcl] event handler class, and there is a DTD-specific specification which maps elements to
   architectural forms and supplies parameters.
   
   latex.spec contains the DTD-specific part.

# File: specs/latex.spec
# costdoc->LaTeX specification
#

require specs/Latex.tcl

proc main {} { process tolatex }

LatexProcess tolatex -specification {

    {element COSTDOC} {
        arcform         document
        documentclass   article
        classoptions    {}
        packages        {t1enc alltt myfmt}
        preamble        { [makePreamble] }
        prefix          {\maketitle\tableofcontents}
    }

    {element P}         { arcform paragraph }
    {element SECT}      { arcform division }
    {element TITLE in SECT} { arcform divheading }
    {element H}         { arcform macro   macro "paragraph*" }

    {elements "EMPH DFN DPH"} {
        arcform macro
        macro "emph"
    }

    {elements "CMD SUBCMD QC ENV NODETYPE EVENT CLASS ATTR ELEM"} {
        arcform macro
        macro "texttt"
    }

    {element TAG} {
        arcform macro
        macro texttt
        prefix {<}
        suffix {>}
    }

    {element ENVAR} {
        arcform macro
        macro texttt
        prefix {\$}
    }

    {elements "OPTION OPTARG ARG"} { arcform macro }
    {element OPTION}    { macro textbf }
    {element OPTARG}    { macro textit }
    {element ARG}       { macro textit }

    {element LIT} {
        arcform macro
        macro "texttt"
    }
    {elements "SAMP"} {
        arcform macro
        macro "texttt"
        before "`"
        after "'"
    }

    {element CMDDEF}            { arcform environment  envname "framed" }
    {elements "TIP NOTE"}       { arcform environment  envname "quote" }
    {element EDNOTE}            { arcform environment  envname "ednote" }

    {element DLIST}             { arcform list   envname "description" }
    {element DT}                { arcform itemtag }
    {element DD}                { arcform itembody }

    {element LIST}              { arcform list  envname "itemize" }
    {element ENUM}              { arcform list  envname "enumerate" }
    {element IT}                { arcform item }

    {element SYNOPSIS} {
        arcform verbatim
        envname alltt
        cdataFilter allttFilter
    }
    {elements "EXAMPLE LISTING" hasatt SOURCE} {
        arcform special
        before {\begin{verbatim}}
        content {[readFile [query entity [query attval SOURCE] sysid]]}
        suffix {\end}
        after  {{verbatim}}
    }

    {elements "EXAMPLE LISTING"} {
        arcform verbatim
        envname alltt
        cdataFilter allttFilter
    }

    {elements "HEAD TITLEPG"} { arcform ignore }
}

#
# Extra utilities:
#

proc makePreamble {} {
    set txt ""
    withNode docroot subtree element HEAD {
        withNode child element title  { append txt "\\title{[content]}\n" }
        withNode child element author { append txt "\\author{[content]}\n" }
        withNode child element date   { append txt "\\date{[content]}\n" }
    }
    return $txt
}

proc readFile {filename} {
    # %%% for BSD systems, use '|expand' instead of '|newform ...'
    if [catch {set fp [open "|newform -i-8 -o-0 $filename" r]}] {
        warning "Can't process $filename"
        return;
    }
    set content [read $fp]
    close $fp
    return $content
}

# %%% doesn't do tabs quite right
substitution allttFilter {
    "\{"        "\\\{"
    "\}"        "\\\}"
    "\\"        "\\char\"5C{}"
    "\t"        "        "
}

   
   
   Latex.tcl contains the general-purpose architectural forms processing.

# Latex.tcl
# CoST SGML->LaTeX (2e) conversion utilities
#

needExtension ITCL

itcl_class LatexProcess {
    inherit EventHandler

    # supply either -specname <name> or -specification <spec-args>
    public specname ;
    public specification ;
    public verbose 0;
    public secheadings "section subsection subsubsection paragraph"
    public preamble ""; # extra preamble, inserted before \begin{document}

    protected spec;
    protected env;
    protected undefinedGIs "";
    protected blanklines -1;
    protected seclevel -1;

    constructor {config} {
        if [info exists specname] {
            set spec $specname
        } else {
            set spec [specification ${this}Assoc $specification]
        }
        set env [environment "${this}Env" \
                        cdataFilter latexEscape \
                        sdataFilter sdataFilter \
                        markupFilter identityFilter \
                        suppress 0 \
                ]
    }

#
# Output methods:
# Blank lines are significant in TeX (they cause a paragraph break),
# so these methods keep track of newlines.
# Use 'output' for character data (escaped), 'insert' for LaTeX markup.
#
    method output {text} {
        if {$text != ""} {
            puts stdout $text nonewline
            set blanklines -1
        }
    }
    method insert {text} {
        output [[$env get markupFilter] $text]
    }
#
# 'nl n' appends 'n' blank lines; if n==0, just break the line.
# NB: 'output' assumes that its argument contains no newlines.
#
    method nl {{n 0}} {
        while {$blanklines < $n} {
            puts stdout "";
            incr blanklines
        }
    }

#
# Event handler methods:
#
    method CDATA {} { output [[$env get cdataFilter] [content]] }
    method SDATA {} { output [[$env get sdataFilter] [content]] }
    method RE {} { nl 0 }
    method DATAENT {} { specialStart }

#
# START and END methods just call the architectural form handlers,
# inserting the 'before', 'after', 'prefix' and 'suffix' parameters verbatim
#
    method START {} {
        insert [$spec get before ""]
        "[$spec get arcform undefined]Start"
        insert [$spec get prefix ""]
    }

    method END {} {
        insert [$spec get suffix ""]
        "[$spec get arcform undefined]End"
        insert [$spec get after ""]
    }

#
# LaTeX "architectural form" processing:
# New forms may be defined by subclassing the LatexProcess class,
# adding xxxStart and xxxEnd methods for each new form 'xxx'
#

    # none: no special processing
    method noneStart {} { }
    method noneEnd {} { }

    # ignore: suppress content of element
    # parameters:
    #   content: replacement content; $variables and [commands] expanded
    method ignoreStart {} {
        output [subst [$spec get content ""]]
        $env save cdataFilter nullFilter \
                    sdataFilter nullFilter \
                    markupFilter nullFilter \
                    suppress 1
    }
    method ignoreEnd {} {
        $env restore
    }

    # undefined: called as a default if no form specified.
    # Prints a warning to that effect if -verbose 1 specified
    method undefinedStart {} {
        if ![$env get suppress] {
            set gi [query gi]
            if { [lsearch $undefinedGIs $gi] == -1 } {
                lappend undefinedGIs $gi
                if $verbose { warning "No mapping for $gi element" }
            }
        }
    }
    method undefinedEnd {} { }
    method queryUndefined {} { return $undefinedGIs }
    method resetUndefined {} { set undefinedGIs "" }

    # special: special processing
    # parameters:
    #   content:        content of element, with Tcl $variable
    #                   and [command] substitution.
    method specialStart {} {
        output [subst [$spec get content ""]]
    }
    method specialEnd {} { }

    # environment: maps to  \begin{xxx} ... \end{xxx}
    # parameters:
    #   envname:        name of environment
    #   envargs:        extra arguments to \begin{envname}, incl. {}[]s.
    method environmentStart {} {
        nl;
        insert "\\begin\{[$spec get envname]\}[$spec get envargs ""]"
        nl 0
    }
    method environmentEnd {} {
        nl 0; insert "\\end\{[$spec get envname]\}"; nl;
    }

    # macro: maps to \xxx{ ... }
    # parameters:
    #   macro:          control sequence name (w/o backslash)
    method macroStart {} { insert "\\[$spec get macro]\{" }
    method macroEnd {} { insert "\}" }

    # paragraph: a paragraph.
    method paragraphStart {} { nl 1 }
    method paragraphEnd {} { nl 1 }

    # document: document body
    # parameters:
    #   documentclass:  LaTeX document class
    #   classoptions:   list of class options
    #   packages:       list of packages (.sty files) used
    #
    method documentStart {} {
        set class [$spec get documentclass article]
        set options [$spec get classoptions ""]
        if {$options != ""} { set options "\[$options\]" }
        output "% Generated by CoST"; nl
        output "\\documentclass${options}{$class}"; nl
        foreach package [$spec get packages {}] {
            output "\\usepackage{$package}"; nl
        }
        nl; output $preamble; nl;
        output [subst [$spec get preamble ""]]; nl;
        output "\\begin{document}"; nl
    }
    method documentEnd {} {
        nl; output "\\end{document}"; nl
    }
    # %%% Need: author, title, date, other metainfo

    # division: hierarchical division (section, chapter, etc.)
    # In LaTeX, sections are specified by heading elements, not containers.
    # The actual heading is produced by the 'divheading' form element
    method divisionStart {} {
        nl 1; incr seclevel;
        setprop seclevel $seclevel
        if [query? hasatt ID] { setprop reflabel [query attval ID] }
    }
    method divisionEnd {} {
        nl 1; incr seclevel -1
    }

    # divheading: hierarchical division title
    # parameters:
    #   macro:  macro name for division heading;
    #           (default: determined by section level and $secheadings list)
    method divheadingStart {} {
        if [$spec has macro] {
            macroStart
        } else {
            insert "\\[lindex $secheadings $seclevel]\{"
        }
    }
    method divheadingEnd {} {
        if [$spec has macro] {
            macroEnd
        } else {
            insert "\}"
        }
        # add \label{ID} if containing SECTion element has an ID:
        if [query? parent hasprop reflabel] {
            insert " \\label{[query parent propval reflabel]}"
        }
        nl;
    }
    # verbatim: special case of 'environment';
    # does not escape special characters in content.
    # parameters:
    #   cdataFilter:    content filter procedure (default: tabexFilter)
    #   envname:        environment name (default: verbatim)
    method verbatimStart {} {
        nl; insert "\\begin{[$spec get envname verbatim]}"; nl;
        # should markupFilter == nullFilter?
        $env save cdataFilter [$spec get cdataFilter tabexFilter] \
                        sdataFilter nullFilter \
                        markupFilter identityFilter
    }
    method verbatimEnd {} {
        $env restore
        nl; insert "\\end{[$spec get envname verbatim]}"; nl
    }

    # lists: special case of environment.
    # should only contain item, itemtag, and itembody form elements.
    method listStart {}         { environmentStart }
    method listEnd {}           { environmentEnd }

    # item: members of single-part lists like itemize and enumerate
    method itemStart {}         { nl; insert "\\item{}"; nl }
    method itemEnd {}           { nl 1 }

    # itemtag/itembody: members of two-part lists like description
    # should occur in pairs.
    method itemtagStart {}      { nl; insert "\\item\[" }
    method itemtagEnd {}        { insert "\]" }
    method itembodyStart {}     { }
    method itembodyEnd {}       { }

    # xref: cross-references.
    # Don't ask.
    method xrefStart {} {
        set linkend [query attval REFID];
        if {$linkend == ""} { warning "XREF has no linkend"; return }
        set reflabel ""
        set refprefix ""
        withNode doctree withattval ID $linkend {
            set refprefix [$spec get refprefix [query propval refprefix]]
            set reftype [$spec get reftype [query propval reftype]]
            set reflabel [query propval reflabel]
        }
        set refprefix [$spec get refprefix $refprefix]
        set reftype [$spec get reftype $reftype]
        if {$reflabel == ""} { set reflabel $linkend }
        if {$reflabel == ""} { warning "XREF to $linkend has no label" }
        insert "$refprefix\\$reftype{$reflabel}"
    }
    method xrefEnd {} { }

}

# Substitution for text mode:
# %%% This only works with T1 encoding
# %%% Trying to do this with OT1 is too horrifying to think about
substitution latexEscape {
        "\{"    "\\\{"
        "\}"    "\\\}"
        "["     "\{[\}"
        "]"     "\{]\}"
        "\\"    "\\char\"5C{}"
        "#"     "\\#"
        "$"     "\\char`$"
        "%"     "\\%"
        "&"     "\\char`&"
        "~"     "$\\sim$"
        "_"     "\\char`_"
        "^"     "\\char`^"
        "TeX"   " \\TeX{}"
        "LaTeX" "\\LaTeX{}"
}

proc identityFilter {str} { return $str }
proc nullFilter {str} { return "" }
proc sdataFilter {str} { return $str }

# filter for 'verbatim' environment: expand tabs,
# don't substitute anything else.
# %%% this doesn't work: subelements will throw off column count
proc tabexFilter {line} {
    if {[string first "\t" $line] == -1} {return $line}
    set spaces 0
    set tabstop 8
    set result ""
    set col 0
    foreach chunk [split $line "\t"] {
        incr col $spaces
        while {$spaces > 0} { append result " "; incr spaces -1 }
        append result $chunk
        incr col [string length $chunk]
        set spaces [expr $tabstop - $col % $tabstop]
    }
    return $result
}

11 Changes from the B4 release

   
   
   costsh is a standalone process which reads the output from SGMLS; it is not a modified version of
   SGMLS as the B4 version was. costsh can be run as an interactive shell, which has proven to be
   very useful for debugging and for exploring the document structure.
   
   The CoST kernel has been completely reimplemented in C, and is, except in spirit, almost
   completely different from the B4 release.
   
   NOTE -- I had planned to reimplement all documented facilities of the B4 release on top of the
   new primitives. This is turning out to be rather difficult to do, so the B4 release will still be
   available and maintained as a separate package.
   
   In CoST B4, all tree nodes were represented as [incr tcl] objects. The new release stores the
   document internally and provides access to data through queries.
   
   The previous version of CoST processed documents in a single pass by default, with an optional
   ``tree mode'' that allowed two passes over specific subtrees. In the new release, documents may
   be processed in any order with any number of passes.
   
   The new release is considerably faster than before. It's still not blazingly fast, but it's
   reasonable. There is still room for improvement; specifications and queries could be optimized in
   several ways. Tcl and [incr Tcl] still seem to be the main speed bottleneck. [incr Tcl] 2.0 will
   reportedly be much faster than 1.5, and that should help as well.

