HTGREP


Htgrep is a generic search engine for HTTP servers contributed by Oscar Nierstrasz at the University of Geneva. Htgrep allows you to query any document accessible to your server on a paragraph-by-paragraph basis. Htgrep understands HTML documents, plain text files, and refer bibliography files.

Htgrep is an evolution of parscan, a package written specifically for use with the Plexus http server. Htgrep is a cgi-bin script written in perl, and can be used with any http server that supports cgi-bin scripts. Htgrep has also been rewritten to understand arguments passed to it through form interfaces. A sample form is at the end of this document.

The package htgrep.pl and the script htgrep can be obtained by ftp or http from:

http://cui_www.unige.ch/ftp/PUBLIC/oscar/scripts/README.html
ftp://cui.unige.ch:/PUBLIC/oscar/scripts/
A couple of alternative interfaces are provided.


Querying HTML Documents

Suppose that http://site/path is the URL of an HTML document available at your site. Then you can query this document by simply using the URL:

http://site/cgi-bin/htgrep/file=path
An example application is the W3catalog, a searchable catalog of WWW resources. It is normally called by another script that hides htgrep's arguments, but it can also be accessed directly by the URL:

http://cui_www.unige.ch/cgi-bin/htgrep/file=W3catalog/cat.html
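
The URL pattern above can be assembled mechanically. The following is a minimal sketch in Python, not part of htgrep itself; the function name htgrep_url is hypothetical, and it assumes that additional tags are appended with "&" as in the style=ol example in this document.

```python
def htgrep_url(server, file, **tags):
    """Build a URL of the form http://server/cgi-bin/htgrep/file=path&tag=value.

    Assumption: extra tags are joined to the file tag with '&'.
    """
    parts = [f"file={file}"]
    # Keyword arguments keep their call order, so tags appear as written.
    parts += [f"{name}={value}" for name, value in tags.items()]
    return f"http://{server}/cgi-bin/htgrep/" + "&".join(parts)


print(htgrep_url("site", "path"))
print(htgrep_url("cui_www.unige.ch", "W3catalog/cat.html"))
```

The sketch does no escaping of special characters in tag values; a real caller would need to decide how the server expects those to be encoded.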

If the file contents are list items rather than standalone HTML blocks, htgrep can be instructed to bracket the results of the search with <DL> and </DL>, <OL> and </OL> or <UL> and </UL>. The tag style=style must be included in the call to htgrep, where style is one of pre, dl, ol or ul. For example, we may query a list of titles of HTML documents and cause the resulting entries to be numbered as follows:

http://cui_www.unige.ch/cgi-bin/htgrep/file=unige-pages.html&style=ol
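
To illustrate what bracketing the results means, here is a rough Python sketch of the idea. It is not htgrep's own code; bracket_results and LIST_TAGS are hypothetical names, and the sketch assumes that each matching record is already an HTML list item.

```python
# Map each list style to the HTML element that surrounds the results.
LIST_TAGS = {"dl": "DL", "ol": "OL", "ul": "UL"}


def bracket_results(records, style="none"):
    """Join matching records, surrounded by list tags if the style asks for them."""
    body = "\n".join(records)
    tag = LIST_TAGS.get(style)
    if tag is None:
        # Any other style: return the records without list brackets.
        return body
    return f"<{tag}>\n{body}\n</{tag}>"


print(bracket_results(["<LI>First page", "<LI>Second page"], style="ol"))
```

With style=ol the browser numbers the entries, which is the effect described for the list of titles above.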

Providing your own Cover Page

By default, htgrep produces only a minimal title and introduction to a searchable document. You can produce your own cover page as follows:

If a header file base.hdr exists, htgrep will print that instead of the default header. In addition, if base.qry exists, it will be used whenever a non-empty query is given. (Normally base.hdr will be a cover page with introductory information, whereas base.qry will only contain the title and main headline.)

The header pages can also be specified with the tags hdr=file and qry=file.
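
The selection rule for cover pages might be sketched as follows. This is an illustration only: choose_header is a hypothetical name, and the sketch assumes that "base" is the searched file's name without its extension, that the hdr and qry tags override the default names, and that base.hdr is still used for a non-empty query when no base.qry exists. None of these details is confirmed by the text above.

```python
import os


def choose_header(path, query, hdr=None, qry=None, exists=os.path.exists):
    """Return the header file to print, or None for the default header."""
    base, _ext = os.path.splitext(path)
    hdr_file = hdr or base + ".hdr"   # assumed default name
    qry_file = qry or base + ".qry"   # assumed default name
    # A non-empty query prefers the shorter query header, if there is one.
    if query and exists(qry_file):
        return qry_file
    if exists(hdr_file):
        return hdr_file
    return None  # caller falls back to the minimal default title


# The exists parameter lets the rule be tried without touching the disk.
print(choose_header("W3catalog/cat.html", "perl",
                    exists=lambda p: p.endswith((".hdr", ".qry"))))
```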


Querying Plain Text

The tag style=pre can be used if the source document is a plain text file. This will cause special characters to be escaped and each paragraph to be surrounded by <PRE> and </PRE>.
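
As a rough illustration of what style=pre does to a plain text file, the Python sketch below escapes the HTML special characters and wraps each paragraph. The name format_pre is hypothetical, and the sketch assumes that paragraphs are separated by blank lines.

```python
import html


def format_pre(text):
    """Escape &, < and > and surround each paragraph with <PRE> and </PRE>."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "\n".join(
        f"<PRE>{html.escape(p, quote=False)}</PRE>" for p in paragraphs
    )


print(format_pre("if a < b then\n  swap(a, b)\n\nAT&T syntax"))
```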

The tag grab=yes will cause htgrep to search for URLs and ftp pointers and convert them into hypertext links. This is most interesting in combination with the tag style=pre to query plain text files. An example is the Free Compilers List.
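
The kind of conversion that grab=yes performs could look roughly like the Python sketch below. The pattern and the name grab_links are assumptions made for illustration; the sketch only recognises full http and ftp URLs, and htgrep's own matching rules may differ.

```python
import re

# Assumed pattern: a scheme followed by anything up to whitespace or a bracket.
URL_RE = re.compile(r'(?:https?|ftp)://[^\s<>"]+')


def grab_links(text):
    """Turn URLs found in plain text into hypertext links."""
    def link(match):
        url = match.group(0)
        trailing = ""
        # Sentence punctuation after a URL is not part of the link.
        while url and url[-1] in ".,;:)":
            trailing = url[-1] + trailing
            url = url[:-1]
        return f'<A HREF="{url}">{url}</A>{trailing}'

    return URL_RE.sub(link, text)


print(grab_links("Sources are at ftp://cui.unige.ch/PUBLIC/oscar/scripts/."))
```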


Querying Refer Bibliography Files

Htgrep can also be used to query a database of refer(1) style bibliography entries. Use the tag refer=plain.

See, for example, the OO Bibliography Database.

The tag refer=abstract is used internally by htgrep and is automatically generated when a bibliography entry contains an abstract (%X field). A link to a new call to htgrep is then generated, which will cause the abstract for a given entry to be displayed. Links to ftpable papers are also generated, if the refer entry contains a line of the form:

%% ftp: site:file
For example, see the list of OO papers available by ftp in the same OO bibliography database.

If the tag ftpstyle=dir is used, the link will be to the containing directory rather than to the file itself (to facilitate exploration).
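
A sketch of how such a line could be turned into a link, in Python. The names ftp_link and FTP_LINE are hypothetical, the file name paper.ps in the usage lines is invented for the example, and the exact whitespace rules of the "%% ftp:" line are an assumption.

```python
import posixpath
import re

# Assumed layout: "%% ftp: site:file", with the site containing no colon.
FTP_LINE = re.compile(r"^%%\s*ftp:\s*([^\s:]+):(\S+)\s*$")


def ftp_link(line, ftpstyle="file"):
    """Return an ftp URL for a '%% ftp: site:file' line, or None if no match."""
    m = FTP_LINE.match(line)
    if m is None:
        return None
    site, path = m.groups()
    path = "/" + path.lstrip("/")
    if ftpstyle == "dir":
        # Link to the containing directory instead of the file itself.
        path = posixpath.dirname(path).rstrip("/") + "/"
    return f"ftp://{site}{path}"


line = "%% ftp: cui.unige.ch:/PUBLIC/oscar/paper.ps"  # hypothetical entry
print(ftp_link(line))
print(ftp_link(line, ftpstyle="dir"))
```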


Other HTGREP Tags

Normally a maximum of 250 records will be retrieved. This can be controlled with the tag max=number.
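
The effect of the record limit on a paragraph-by-paragraph search can be sketched in Python as follows. This is not htgrep's code: search_paragraphs is a hypothetical name, paragraphs are assumed to be separated by blank lines, and Python's regular expressions differ in places from Perl's.

```python
import re


def search_paragraphs(text, pattern, max_records=250):
    """Return the paragraphs that match pattern, stopping at max_records."""
    regex = re.compile(pattern)
    hits = []
    for paragraph in re.split(r"\n\s*\n", text):
        if paragraph.strip() and regex.search(paragraph):
            hits.append(paragraph)
            if len(hits) >= max_records:
                break  # the limit corresponds to the max tag
    return hits


sample = "perl scripts\n\nplexus server\n\nperl regexps"
print(search_paragraphs(sample, "perl", max_records=1))
```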

In some cases, this package is not called by htgrep but by another script that is responsible for setting the tags. You can tell the package to use a different URL when generating new requests by using the tag htgrep=path. This is used, for example, by W3catalog to hide the actual arguments to htgrep.

Finally, the tag linemode=yes causes htgrep to retrieve refer records on a line-by-line basis, for files in which the fields of a record are separated by ^A instead of "\n". (This is mainly interesting for the CUI library database.)
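
To make the line-mode layout concrete, the Python sketch below splits such a file into records and fields. The name split_line_records and the sample field values are invented for illustration; ^A is the control character with code 1.

```python
FIELD_SEPARATOR = "\x01"  # ^A


def split_line_records(text):
    """Treat each non-empty line as one record whose fields are ^A-separated."""
    return [
        line.split(FIELD_SEPARATOR)
        for line in text.split("\n")
        if line
    ]


data = "%A First Author\x01%T First Title\n%A Second Author\x01%T Second Title\n"
for fields in split_line_records(data):
    print(fields)
```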

Summary of htgrep tags

	file      -- file to search
	isindex   -- query string
	hdr       -- header file (to precede output)
	qry       -- query file (alternative header for non-empty query)
	style     -- [none/pre/ol/ul/dl] format of records
	max       -- max records to return (default 250)
	grab      -- [no/yes] convert URLs to hypertext (in plain text)
	refer     -- [plain/abstract] format
	ftpstyle  -- [file/dir] make link to ftp file or dir (for refer)
	linemode  -- [no/yes] each record is a single line
	htgrep    -- alternative URL to use for self-calls
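
As an illustration of how the tags in this summary fit together, the following Python sketch parses a tag string into a table of settings. The name parse_tags is hypothetical. Only the default of 250 for max is stated in this document; treating the first alternative listed for style, grab, ftpstyle and linemode as the default is an assumption.

```python
# Assumed defaults; only max=250 is documented explicitly.
DEFAULTS = {
    "style": "none",
    "max": "250",
    "grab": "no",
    "ftpstyle": "file",
    "linemode": "no",
}


def parse_tags(tag_string):
    """Parse 'file=path&style=ol' into a dict, filling in the defaults."""
    tags = dict(DEFAULTS)
    for item in tag_string.split("&"):
        if "=" in item:
            name, value = item.split("=", 1)
            tags[name] = value
    tags["max"] = int(tags["max"])  # raises ValueError if max is not a number
    return tags


print(parse_tags("file=unige-pages.html&style=ol"))
```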

HTGREP Form

Here is a sample form for specifying the arguments to htgrep.

File to search (relative to WWW home)
Header file
Alternative header (when query is supplied)
Search string (perl regexp)

Output style
Make URLs live (in plain text)

Bibliography format (search refer files)
FTP style
Max records to return


Alternative Interfaces

For backwards compatibility, another cgi-bin script, parscan, provides an alternative interface to the htgrep package, translating parscan's arguments to htgrep tags. If you alias requests for /parscan to /cgi-bin/parscan, they should be correctly interpreted by htgrep.

Note that htgrep takes the file to search directly from the URL. Although the package takes pains to ensure that only files visible to the http server may be searched, there is presently no further support for access control. It is, however, possible to restrict the set of files that may be searched through an alternative interface that hard-wires the parameters to the htgrep package for a variety of search engines. See cuisearch for an example of such a front-end.
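
The idea of a front-end that hard-wires the parameters can be sketched in Python as below. This is not how cuisearch is implemented; the table ENGINES, the engine names and the function tags_for are hypothetical, while the two file names are taken from the examples earlier in this document.

```python
# Hypothetical table: each public engine name maps to fixed htgrep tags.
ENGINES = {
    "w3catalog": {"file": "W3catalog/cat.html"},
    "pages": {"file": "unige-pages.html", "style": "ol"},
}


def tags_for(engine):
    """Return the fixed tags for a known engine; refuse anything else."""
    if engine not in ENGINES:
        raise ValueError(f"unknown search engine: {engine!r}")
    # Return a copy so that callers cannot alter the table.
    return dict(ENGINES[engine])


print(tags_for("pages"))
```

Because the caller chooses only an engine name and never a file path, the set of searchable files is limited to those listed in the table.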


14 June, 1994 oscar@cui.unige.ch