Philip J. Smith Workshop Discussion Paper

Philip J. Smith
Cognitive Systems Engineering Laboratory
The Ohio State University

Overview

There has been an explosion in public access to information. There has not, however, been a matching increase in the availability of powerful tools to explore this information. This paper focuses on three themes:

  1. For the foreseeable future, the types of search tools that are likely to be available will be based on several already identified conceptual approaches;

  2. Each of these alternative approaches has particular strengths and weaknesses. One of the most fruitful areas for progress is therefore likely to be the design of search environments that provide integrated access to such tools;

  3. Some of these conceptual approaches require the development of substantial knowledge bases. Tools are therefore needed to support the efficient and effective creation of such knowledge bases.

Alternative Search Tools

Text searches based on character-string matches (using Boolean and proximity operators) will continue to be one of the major conceptual approaches. The strength of this approach is that, for material that is in the form of text, it is relatively inexpensive to implement. The primary barriers to its use arise in cross-database searching, where there are often incompatibilities in the availability or definition of operators, or where the amount of text to be searched becomes prohibitive. There also remain the classic difficulties that novice users have in understanding how to use such operators correctly. The first barrier, incompatible operators, is mostly a practical problem: there needs to be agreement on the standardization of such operators. An incentive to promote such standardization might be provided by including such tools in the functionality of widely used clients such as Web browsers.
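As an illustration of the two operator types mentioned above, the following is a minimal Python sketch of a Boolean AND and a proximity ("near") match over tokenized text. The function names, the tokenization rule, and the two sample documents are assumptions made for this example; they do not describe any particular retrieval system or any standardized operator definition.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, term):
    """Return the token offsets at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def matches_and(tokens, terms):
    """Boolean AND: every term must occur somewhere in the document."""
    present = set(tokens)
    return all(t in present for t in terms)

def matches_near(tokens, a, b, max_gap):
    """Proximity: `a` and `b` occur within `max_gap` tokens of each other."""
    pos_b = positions(tokens, b)
    return any(abs(i - j) <= max_gap
               for i in positions(tokens, a)
               for j in pos_b)

# Hypothetical two-document collection, invented for illustration only.
docs = {
    "d1": "Acid rain is caused by sulfur and nitrogen oxides.",
    "d2": "Rain fell all week; the acid spill was unrelated.",
}
tokenized = {name: tokenize(text) for name, text in docs.items()}

and_hits = [n for n, toks in tokenized.items()
            if matches_and(toks, ["acid", "rain"])]
near_hits = [n for n, toks in tokenized.items()
             if matches_near(toks, "acid", "rain", 1)]
```

Here the AND query matches both documents, while the proximity query matches only the document in which the two words are adjacent. A different choice of tokenization or gap definition would change the results, which is the kind of incompatibility that arises across databases.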

The second problem, excessive quantities of text to search, can probably be handled only by creating abstractions that represent contents in a more concise form. To support character-string searching, such surrogates need not be readable text; they need only contain reasonably ordered sets of terms (i.e., meaningful phrases should be present so that proximity searching can be used). Thus, statistical techniques, knowledge-based systems approaches and natural language processing approaches, as well as traditional human indexing, are all viable methods for creating such abstractions.

The third problem, usability, probably has no truly good solution. Improvements can probably be made, however, by providing intelligent critics that monitor for questionable queries and retrieval sets, and that actively explore modifications of these queries to generate suggestions for query refinement.
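One way to picture such a critic is sketched below in Python. It flags a retrieval set as questionable when it is very small or very large, then tries one-term modifications of the query and reports how many documents each would retrieve. The thresholds, the function names, and the toy inverted index are all assumptions chosen for illustration; a real critic would need far richer heuristics.

```python
def search(index, terms):
    """Boolean AND over a toy inverted index (term -> set of document ids)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def critique(index, terms, too_few=2, too_many=100):
    """Return the hit count and a list of (suggestion, resulting hit count)."""
    hits = len(search(index, terms))
    suggestions = []
    if hits < too_few:
        # Too few results: try dropping each term in turn.
        for t in terms:
            relaxed = [u for u in terms if u != t]
            if relaxed:
                n = len(search(index, relaxed))
                if n > hits:
                    suggestions.append((f"drop '{t}'", n))
    elif hits > too_many:
        # Too many results: try adding each unused index term in turn.
        for t in sorted(index):
            if t not in terms:
                n = len(search(index, terms + [t]))
                if 0 < n < hits:
                    suggestions.append((f"add '{t}'", n))
    return hits, suggestions

# Hypothetical inverted index, invented for illustration only.
INDEX = {
    "pollution": {1, 2, 3, 4},
    "europe": {1, 2, 3},
    "cesium": {1},
}

narrow_report = critique(INDEX, ["pollution", "europe", "cesium"])
broad_report = critique(INDEX, ["pollution"], too_few=1, too_many=3)
```

With this toy index, the three-term query retrieves a single document, and the critic suggests dropping 'cesium' to retrieve three.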

Browsing

Without question, the biggest increase has been in the design of databases organized for browsing. Because such databases are relatively easy to create and search, this approach dominates much of the information explosion at the moment. An important area for extension of this browsing capability, however, is again the need for abstract representations that allow quicker scanning to identify relevant information. This time, though, it is the user rather than a computer that must be supported. What is probably needed are more powerful browsers that help users to explore thesauri, as well as tools to help database creators to develop effective thesauri. These latter tools could include clustering algorithms to help in the identification of the organization(s) implicit in document sets.

Statistical Methods

A third conceptual approach, which is quite different from the use of character-string searches and browsing, involves the use of term co-occurrence relationships to infer semantic relationships. Because it is so different, it offers a complementary way to support exploration and search.

A number of questions need further exploration, however, including:

  1. What are the strengths and weaknesses of such methods in different types of document sets, especially sets distributed over heterogeneous databases;

  2. How well do users understand the strengths and weaknesses of such approaches;

  3. How do different interface design concepts, such as "star" displays, influence the user's ability to use such an approach effectively?
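To make the idea of inferring relationships from term co-occurrence concrete, the following Python sketch counts how often pairs of terms appear in the same document and scores each pair with a Jaccard coefficient. The choice of Jaccard, the function names, and the four sample documents are assumptions made for this example; many other association measures could be substituted.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count document frequencies for single terms and for term pairs."""
    term_df = Counter()
    pair_df = Counter()
    for terms in docs:
        unique = sorted(set(terms))
        term_df.update(unique)
        # Pairs are stored in sorted order so each pair has one key.
        pair_df.update(combinations(unique, 2))
    return term_df, pair_df

def jaccard(term_df, pair_df, a, b):
    """Documents containing both terms divided by documents containing either."""
    both = pair_df[tuple(sorted((a, b)))]
    either = term_df[a] + term_df[b] - both
    return both / either if either else 0.0

def related_terms(term_df, pair_df, term, k=3):
    """Return up to k terms most strongly associated with `term`."""
    scores = [(jaccard(term_df, pair_df, term, other), other)
              for other in term_df if other != term]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [(other, round(score, 2)) for score, other in ranked[:k] if score > 0]

# Hypothetical indexed documents, invented for illustration only.
DOCS = [
    ["acid", "rain", "sulfur"],
    ["acid", "rain", "nitrogen"],
    ["sulfur", "oxides", "acid"],
    ["fallout", "cesium"],
]

term_df, pair_df = cooccurrence(DOCS)
top_for_acid = related_terms(term_df, pair_df, "acid", k=2)
```

In this toy collection, "rain" and "sulfur" come out as the terms most closely associated with "acid". Whether such associations remain meaningful across heterogeneous databases is exactly the first question raised above.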

Knowledge-Based Systems Techniques

Another radically different approach is to develop true knowledge-based systems to support search. Such systems offer the potential to support exploration in true semantic spaces, actively generating interactions like the following:

  User: I'm interested in pollution from Strontium-90 and Cesium-137 in Europe.
  Computer: That only generates 16 documents. If you broaden your search by looking for information on pollution from fallout in Europe, you will retrieve the original 16 documents plus 23 additional documents.

  User: I'd like documents on the prevention of acid rain in North America.
  Computer: That produces 87 documents. Another 194 are available if you search on the control of sulfur and nitrogen oxides in North America.

Such interactions mimic the assistance provided by expert human intermediaries. While such tools appear potentially powerful, there are few data to date with which to evaluate their effectiveness. In addition, sizable human effort is required to produce both the supporting knowledge base and database. However, I would argue that any new efforts to index documents would be well-advised to index them semantically. The effort is about the same as, or less than, that of conventional indexing, and the potential benefit is significant.
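The broadening step in interactions of this kind can be sketched as a walk up a concept hierarchy. The Python example below assumes documents are indexed by concept and that the knowledge base records a single broader concept for each narrower one. The hierarchy, the document index, and the function names are hypothetical, and the geographic facet of the example queries is omitted for brevity.

```python
def narrower(concept, parent):
    """Return the concept together with everything below it in the hierarchy."""
    result = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, par in parent.items():
            if par == current and child not in result:
                result.add(child)
                frontier.append(child)
    return result

def retrieve(doc_concepts, concepts, parent):
    """Documents indexed under any query concept or one of its narrower concepts."""
    wanted = set()
    for c in concepts:
        wanted |= narrower(c, parent)
    return {d for d, cs in doc_concepts.items() if cs & wanted}

def suggest_broadening(doc_concepts, concepts, parent):
    """Replace each concept by its broader concept and count the extra documents."""
    original = retrieve(doc_concepts, concepts, parent)
    # Concepts with no recorded broader concept are kept unchanged.
    broader = {parent.get(c, c) for c in concepts}
    widened = retrieve(doc_concepts, broader, parent)
    return sorted(broader), len(original), len(widened - original)

# Hypothetical knowledge base: narrower concept -> broader concept.
PARENT = {
    "strontium-90": "fallout",
    "cesium-137": "fallout",
    "iodine-131": "fallout",
}

# Hypothetical semantic index: document id -> set of concepts.
DOC_CONCEPTS = {
    "a": {"strontium-90"},
    "b": {"cesium-137"},
    "c": {"fallout"},
    "d": {"iodine-131"},
    "e": {"acid rain"},
}

suggestion = suggest_broadening(DOC_CONCEPTS, {"strontium-90", "cesium-137"}, PARENT)
```

The returned tuple gives the broader concepts, the size of the original retrieval set, and the number of additional documents, which is the information the computer reports in the dialogues above.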

Additional Methods

Relevance rankings based on crude term weightings (as in WAIS), and methods based on true natural language processing, represent the other two major conceptual approaches that are available.
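For readers unfamiliar with term-weighted ranking, the following Python sketch scores each document by summing, over the query terms, the term's frequency in the document multiplied by an inverse document frequency weight. This is a generic weighting scheme chosen for illustration and is not a description of how WAIS computes its rankings. The sample documents and function name are assumptions.

```python
import math
from collections import Counter

def rank(docs, query):
    """Order document names by a simple tf * idf score for the query terms."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        # Terms absent from the collection contribute nothing.
        score = sum(tf[t] * math.log(n / df[t]) for t in query if df[t])
        if score > 0:
            scores[name] = score
    # Highest score first; ties broken by document name.
    return sorted(scores, key=lambda name: (-scores[name], name))

# Hypothetical tokenized documents, invented for illustration only.
DOCS = {
    "d1": ["acid", "rain", "acid", "control"],
    "d2": ["rain", "forecast"],
    "d3": ["sulfur", "control"],
}

ranking = rank(DOCS, ["acid", "rain"])
```

Unlike a Boolean match, this produces an ordering: the document mentioning both terms ranks first, the document mentioning only the more common term ranks second, and the document mentioning neither is omitted.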

Hybrid Systems

Relatively little is really understood about the impact of different database characteristics and different types of queries on each of these alternative conceptual approaches alone. Even less is known, though, about how to combine them, and about how to design interfaces that allow users to make effective use of a repertoire of such tools. Because each such approach alone has serious weaknesses, this hybrid approach seems to merit much more serious consideration than it has received to date.

