Proposed Search verb ver. 2.0

This is a proposal for a Search verb ver. 2.0. Search ver. 1.0 allows searches on bibliographic fields and full text, but the full-text "region" is treated as a single structure (in effect, there is only one structure, maindocument). Thus, the only full-text Boolean query a user can construct in ver. 1.0 is one for two terms occurring within the same maindocument (which in our case is an entire monograph). We recognize that this limits query precision and thus the usefulness of full-text searching.

However, abstracting a search protocol so that substructural regions can be identified across native document types is not trivial. We have struggled, particularly, with the distinctions between concurrent structures (especially logical and physical), but there are plenty of other problems, partially masked by the uniformity of our document types (monographs). A further complication is how a query service "understands" and makes use of structural relationships (parent|child, etc.) in order to construct meaningful searches, when these relationships may be different at each local repository. For example, one repository may have chapters with pages as child elements, while another has chapters with no child elements or with different ones. The query service faces significant data-gathering and interpretation demands merely to build an accurate search query form.

The basic premise of Search ver. 2.0 is that we go ahead and push 'abstraction' further, to include not only search query mechanics (as we've done in ver. 1.0) but document structures as well.

Search ver. 2.0 overview

Search ver. 2.0 would allow for documents with four abstract structural elements (docStruct values: maindocument, div-high, div-mid, and div-low), two of which are required (meaning that searches within these structures must be supported).

The local repository maps native document structures to abstract structures. Communities may have strict guidelines for such mapping, although it could be left to local decisions. For this project, I imagine a mapping to look something like this:

Again, mapping all of these is not required. If you have no high-level document structures, you don't use the docStruct value "div-high". If your maindocument is an article, you'd most likely offer maindocument and div-low as the only available structures. If you've got a lot of mid-level divisions (chapters, sections, divs within divs), you'd decide which one to map to div-mid. Multiple mappings could be allowed (div-mid mapping to local chapter, section, and so on), but would mean more local expense (and perhaps redundant searching?).
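
To make the mapping idea concrete, here is a minimal sketch of how a local repository might record its mapping and work out which abstract structures it offers. The native names ("monograph", "chapter", "section", "page"), the choice of which native structure goes with which docStruct value, and the helper name supported_doc_structs are all assumptions for illustration; none of them are defined by this proposal.

```python
# The four abstract docStruct values named in this proposal, in an assumed
# outermost-to-innermost order.
DOC_STRUCTS = ("maindocument", "div-high", "div-mid", "div-low")

# Hypothetical local mapping for a monograph repository.
# Native structure names are illustrative assumptions.
LOCAL_MAPPING = {
    "maindocument": ["monograph"],
    # div-high is omitted: this repository has no high-level divisions.
    "div-mid": ["chapter", "section"],  # a multiple mapping
    "div-low": ["page"],
}


def supported_doc_structs(mapping):
    """Return the abstract structures a repository maps, in canonical order."""
    unknown = set(mapping) - set(DOC_STRUCTS)
    if unknown:
        raise ValueError(f"unknown docStruct values: {sorted(unknown)}")
    # A structure counts as supported only if it maps to at least one
    # native structure.
    return [name for name in DOC_STRUCTS if mapping.get(name)]


print(supported_doc_structs(LOCAL_MAPPING))
```

With a multiple mapping such as div-mid above, a search scoped to div-mid would have to run against both local chapters and local sections, which is the extra local expense mentioned earlier.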

The verb ListDocStruct would list the abstract document structures supported by a repository.
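
As a rough illustration of what a ListDocStruct response could carry, the sketch below serializes a repository's supported structures as XML. The response format is not specified in this proposal; the element names ListDocStruct and docStruct, and the function name list_doc_struct_response, are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET


def list_doc_struct_response(supported):
    """Build a hypothetical ListDocStruct response body.

    `supported` is the list of abstract docStruct values the repository
    offers, for example ["maindocument", "div-low"].
    """
    root = ET.Element("ListDocStruct")  # assumed element name
    for name in supported:
        ET.SubElement(root, "docStruct").text = name  # assumed element name
    return ET.tostring(root, encoding="unicode")


# An article repository offering only maindocument and div-low.
print(list_doc_struct_response(["maindocument", "div-low"]))
```

A query service could use such a response to decide which structural scopes to show on its search form for each repository.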

A parent|child|sibling relationship is assumed among the abstract document structures.
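
One way a query service might use this assumption is sketched below. It assumes the nesting order maindocument, div-high, div-mid, div-low (suggested by the names but not stated outright here), and it treats two instances of the same docStruct value as siblings. The function names are hypothetical.

```python
# Assumed nesting order, outermost first.
ORDER = ["maindocument", "div-high", "div-mid", "div-low"]


def _levels(supported):
    """Supported structures in nesting order, skipping unmapped ones."""
    return [name for name in ORDER if name in supported]


def parent_of(doc_struct, supported):
    """Nearest supported enclosing structure, or None at the top."""
    levels = _levels(supported)
    i = levels.index(doc_struct)
    return levels[i - 1] if i > 0 else None


def child_of(doc_struct, supported):
    """Nearest supported enclosed structure, or None at the bottom."""
    levels = _levels(supported)
    i = levels.index(doc_struct)
    return levels[i + 1] if i + 1 < len(levels) else None


# A repository without div-high or div-mid: div-low sits directly
# under maindocument.
supported = ["maindocument", "div-low"]
print(parent_of("div-low", supported))
print(child_of("maindocument", supported))
```

Because the relationships are derived from the abstract values alone, the query service would not need to learn each repository's native hierarchy to build a search form.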

dwr, 2003-03-20