Search Capabilities for Users of Digital Libraries:
Tools and Paradigms

Ray R. Larson
University of California, Berkeley


Introduction

This discussion paper outlines some considerations for end-user search needs in digital libraries. It draws on a number of research projects underway at Berkeley, as well as a brief review of the literature. At issue is the question of how users will be able to gain access to the contents of digital libraries, and of how information sources in a digital library can or should be identified, organized and retrieved to best support the users who need access to them.

My objectives for the workshop are to

This paper does not attempt a final synthesis of these issues; instead it offers a rough outline of areas for further consideration.

Assumptions

My primary assumption in the following discussion is that Digital Libraries are not single, stand-alone, repositories of digital data. Instead they are a heterogeneous collection of network-based repositories using a variety of protocols for user (and repository) interaction, data encoding and transmission. The repositories may range from small personal collections of information housed on a PC and offered to the network using the WWW HTTP protocol, to multi-terabyte repositories of remote sensing data downloaded from satellites and made available to researchers via FTP or specialized data transfer protocols. These repositories include existing online library catalogs using protocols like Z39.50 and publishers providing full-text of documents via HTTP.

Within individual repositories (databases, web sites, etc.) a variety of information structures and retrieval and navigation techniques currently exist, or could be developed and applied. These low-level support mechanisms for searching and browsing are very important, but should be considered in the wider context of network access by both end-users and their software "agents".

Questions and Issues

There are any number of questions and issues that might be raised in discussing end-user searching capabilities for digital libraries. A few of these are outlined here:

Search Paradigms

Obviously, the list of questions presented above is much too large for adequate discussion in a short discussion paper (or even at a two-day workshop). Of particular concern, to my mind, are the interaction paradigms available in digital libraries, and the way those paradigms interact in today's and future end-user tools for searching digital libraries.

Interaction Paradigms

What types of end-user search interaction are, or might be, made available in a digital library context? The major dichotomy usually found in discussions of IR systems is that of browsing vs. searching. The primary emphasis in IR research over the years has been on searching systems. However, research on information seeking behavior has characterized a variety of paradigms for information seeking. Bates[1] provides a discussion of these paradigms and their interactions. She uses a 2x2 matrix of characteristics of information seeking behavior, with DIRECTED and UNDIRECTED on the horizontal axis and ACTIVE and PASSIVE on the vertical. Figure 1, shown below, is an adaptation of Bates' Figure 2 in [1].

[Figure 1: Forms of Information Seeking]

While one can argue for a continuum of control on one axis and a continuum of activity on the other, this discrete division of the area is a useful starting point for discussion.

Most conventional search systems and search behaviors fall under the "ACTIVE/DIRECTED" cell of this matrix, where the user has a good notion of what is wanted and seeks it in a focussed, purposeful way. As Bates points out:

The search is active, that is, the user is committing various actions with the intent of acquiring information, and directed, in that the searcher could, if asked, describe some features of the information being sought. [1, p. 92]

The primary emphasis in IR research over the past 30 or more years has been on the development of systems to foster this paradigm of active searching. Conventional methods (i.e., keyword and Boolean) and advanced information retrieval methods (such as vector and probabilistic methods) share a model of searching in which the user submits some sort of query and the retrieval system formulates a response. That response represents a mapping from some statement of the user's information needs, as expressed in the query language of the system, to the set of items in the database that provide an exact match or the best match (according to the mapping function or retrieval algorithm employed). Exact-match functions are commonly based on Boolean logic; best-match functions commonly use advanced retrieval methods based on the vector model or probabilistic models of information retrieval[14, 11]. These latter methods attempt to present the results of a search in a ranked order based on the system's estimate of how relevant a given document in the database might be to the need expressed in the query. These types of search systems are important and necessary components of digital libraries. Some advanced techniques that fall under this paradigm, especially relevance feedback[15], are likely to be very important for supporting this type of directed searching in digital libraries. Most current WWW indexes (Lycos, Inktomi, WebCrawler, Alta Vista), for example, use some form of ranked retrieval (though often not particularly effective algorithms) to provide a rough ordering of documents based on term frequency statistics.
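The difference between exact-match and best-match retrieval can be made concrete with a small sketch. The code below is illustrative only: the toy documents, the function names, and the particular tf-idf weighting are assumptions made for this example, not the method of any system cited here. It contrasts a Boolean AND match with a cosine ranking under the vector model.

```python
import math
from collections import Counter

# Toy collection; the documents and their ids are invented for illustration.
DOCS = {
    "d1": "digital library search interfaces",
    "d2": "boolean search in online library catalogs",
    "d3": "satellite remote sensing data archives",
}

def tokenize(text):
    return text.lower().split()

def boolean_and(query, docs):
    """Exact match: ids of documents that contain every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))

def tfidf_vectors(docs):
    """Build tf-idf document vectors and the idf table used to weight queries."""
    tokens = {d: tokenize(text) for d, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    idf = {term: math.log(len(docs) / count) + 1.0 for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for d, toks in tokens.items()
    }
    return vectors, idf

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ranked(query, docs):
    """Best match: ids of documents with a non-zero score, best first."""
    vectors, idf = tfidf_vectors(docs)
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(tokenize(query)).items()}
    scores = [(cosine(q, vec), d) for d, vec in vectors.items()]
    return [d for score, d in sorted(scores, key=lambda p: (-p[0], p[1])) if score > 0]
```

For the query "library search catalogs", `boolean_and` returns only `d2`, the single document containing all three terms, while `ranked` returns `d2` followed by `d1`, which matches two of the three terms and would be lost under strict Boolean matching.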

Too often, information retrieval systems have been designed without taking into account the user's ability to recognize and exploit relevance, or without allowing the user to exercise control or judgement over the search methods used. They have been designed as "black boxes" where a single query/response pair is supposed to satisfy the needs of the user. This is untenable in an environment as diverse and complex as digital libraries. Tools to help the user focus a search, or filter the results of a search to fit criteria not recognized or implemented by the original search engine (as in OASIS[3]), are sadly lacking in most of today's search interfaces. This is a fertile area for research and development.

Bates characterizes the "ACTIVE/UNDIRECTED" cell of the matrix as "browsing." She notes that under this paradigm:

The searcher is committing actions in an effort to acquire information, but the information seeking behaviors are not directed to any readily specifiable information. The searcher cannot say what is being sought because there is no particular thing that is wanted. [1, p. 92]

This sort of browsing may be seen as the primary mode of interaction with many current internet-based digital libraries. In particular, the World-Wide Web is noted for providing a highly browsable interface using hypertext links with "point and click" access. Another retrieval method that could be used to foster the sort of interactive, iterative query formulation that characterizes browsing is the Scatter/Gather clustering method developed at Xerox PARC[4, 5]. This method of automatically and interactively classifying sets of documents could be used to help the "browser" select terms or suggest interesting topics by consulting the structure revealed in the whole database, partial retrieval results, or portions of the data.
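The scatter, inspect, gather cycle can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm of [4, 5]: it assumes a farthest-first choice of seed documents and Jaccard similarity on term sets in place of the clustering procedures described in those papers, and the documents and function names are invented for the example.

```python
from collections import Counter

# Toy collection; ids and texts are invented for illustration.
DOCS = {
    "a": "ocean temperature satellite data",
    "b": "satellite ocean imagery",
    "c": "library catalog search",
    "d": "online library catalog",
    "e": "ocean salinity data",
}

def jaccard(a, b):
    """Overlap of two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def scatter(docs, k):
    """Partition docs (id -> text) into at most k clusters around seed documents."""
    terms = {d: set(text.lower().split()) for d, text in docs.items()}
    ids = sorted(terms)
    seeds = [ids[0]]
    while len(seeds) < min(k, len(ids)):
        # Next seed: the document least similar to every seed chosen so far.
        rest = [d for d in ids if d not in seeds]
        seeds.append(min(rest, key=lambda d: max(jaccard(terms[d], terms[s]) for s in seeds)))
    clusters = {s: [] for s in seeds}
    for d in ids:
        nearest = max(seeds, key=lambda s: jaccard(terms[d], terms[s]))
        clusters[nearest].append(d)
    return list(clusters.values())

def summarize(cluster, docs, n=3):
    """Describe a cluster by the n terms that occur in the most of its documents."""
    counts = Counter(t for d in cluster for t in set(docs[d].lower().split()))
    return [t for t, _ in sorted(counts.items(), key=lambda p: (-p[1], p[0]))[:n]]

def gather(clusters, chosen, docs):
    """Merge the clusters the user selected into a new, smaller working set."""
    return {d: docs[d] for i in chosen for d in clusters[i]}
```

A browsing session alternates the three steps: `scatter(DOCS, 2)` separates the oceanographic items from the library items, `summarize` gives the user a few terms with which to judge each cluster, and `gather(clusters, [0], DOCS)` keeps only the chosen cluster so that it can be scattered again at a finer grain.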

Browsing support for non-textual databases, such as databases of scientific data, is much more problematic than conventional or hypertext browsing. Support for effective browsing of scientific data requires that the data be visualized in some form. Work on visualization for scientific data has been carried out in conjunction with the Sequoia 2000 project at the University of California and the San Diego Supercomputer Center[6, 9, 10]. Some of the work on NASA's EOSDIS also includes data querying methods that support a browsing paradigm, such as presenting data files as icons or other representations placed according to the geographic coordinates of the data on a map or Geographic Information System interface. A number of "information space" interfaces have been developed that use a 3-D visualization of information in some database for browsing or data mining. Most of these interfaces have never been used by anyone other than the designer (and some probably can't be). But geographic and spatial browsing provides a new paradigm for user interaction that should be evaluated with "real users". User interface design studies are very important both to understand how users use a system and to discover what additional features are needed in a system. Such studies require multi-disciplinary teams of information/computer scientists, social scientists, design specialists and end-users.
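The geographic browsing idea mentioned above, in which data files are placed on a map according to their coordinates, reduces at its simplest to deciding which dataset footprints fall inside the region the user is viewing. The sketch below is a minimal illustration under stated assumptions: footprints are latitude/longitude bounding boxes that do not cross the 180-degree meridian, and the class, function and dataset names are hypothetical rather than taken from EOSDIS or Sequoia 2000.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Footprint:
    """Bounding box of one dataset, in decimal degrees (hypothetical record layout)."""
    name: str
    west: float
    south: float
    east: float
    north: float

def intersects(f, west, south, east, north):
    """True if the footprint overlaps the viewport rectangle."""
    return f.west <= east and f.east >= west and f.south <= north and f.north >= south

def visible(footprints, west, south, east, north):
    """Names of the datasets whose icons should be drawn in the current map view."""
    return [f.name for f in footprints if intersects(f, west, south, east, north)]

# Invented example footprints.
FOOTPRINTS = [
    Footprint("sf_bay", -123.0, 37.0, -121.5, 38.5),
    Footprint("sierra", -120.5, 36.0, -118.0, 39.0),
    Footprint("amazon", -70.0, -10.0, -50.0, 0.0),
]
```

Panning or zooming the map then amounts to calling `visible` with a new viewport; a production interface would presumably use a spatial index rather than a linear scan.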

There are many interface design issues that need to be addressed in the design of both searching and browsing tools for digital library systems. As a simple example, when system response will be slow (because data is on tertiary storage or the response will take a long time to compute), it is recommended practice in standalone systems for the system to provide an estimate of the expected time to complete the task. How is this sort of information to be collected and conveyed to the user in a heterogeneous distributed environment?

The "PASSIVE/DIRECTED" cell of the matrix defines the set of information seeking behaviors where the searcher knows what they want to find, but does nothing in particular to find that information. This is characterized by Bates as follows:

The searcher remains intellectually aware enough to recognize the desired information if it should be encountered, but engages in other behaviors besides active efforts to find the information. [1, pp. 92-93]

This paradigm of information seeking can be supported by Selective Dissemination of Information (SDI) or information filtering systems[2]. In these systems the user registers an interest profile with the system, and the system then compares the profile to each new item that is inserted into the database, notifying the user when there is a match with the profile. In effect, the user delegates the "seeking" of the information to the system and takes no further directed action to find items. Some work done at Stanford University provides an example of how this sort of SDI/filtering system can be applied in a distributed digital library environment[16].
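The register-then-notify cycle described above can be sketched as follows. This is a minimal illustration, not the index structures of [16]: it assumes profiles are plain sets of terms, that a match means a fixed fraction of the profile terms appears in the new item, and that profiles are scanned linearly. The class name, method names and threshold are hypothetical.

```python
class ProfileFilter:
    """Toy SDI service: users register interest profiles, new items are matched on arrival."""

    def __init__(self, threshold=0.5):
        # Fraction of a profile's terms that must occur in an item (an assumed rule).
        self.threshold = threshold
        self.profiles = {}  # user -> set of profile terms

    def register(self, user, interests):
        """Store (or replace) a user's interest profile."""
        self.profiles[user] = set(interests.lower().split())

    def insert(self, item_text):
        """Compare a newly inserted item with every profile; return the users to notify."""
        terms = set(item_text.lower().split())
        notify = []
        for user, profile in sorted(self.profiles.items()):
            overlap = len(profile & terms) / len(profile) if profile else 0.0
            if overlap >= self.threshold:
                notify.append(user)
        return notify
```

After `register("ann", "digital library retrieval")`, a call such as `insert("new retrieval methods for the digital library")` returns `["ann"]` without any further action on the user's part, which is the delegation of "seeking" that characterizes this cell of the matrix. With many profiles, the linear scan is the part that an indexing scheme would replace.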

The "PASSIVE/UNDIRECTED" cell of the information seeking paradigm matrix does not involve any particular effort on the part of the searcher. As Bates points out:

PASSIVE UNDIRECTED is the means by which we acquire most of the information in our minds; we are passively receptive to the information contained in our experiences, retaining and recording what we learn as we go through our lives. [1, p. 93]

This form of information seeking (if it can be called that) is actually a part of all of the other forms of information seeking. It is this sort of awareness that can lead to serendipitous discovery of information, or plant a memory ("I've seen something like that someplace") that may prove useful in later, more active, information seeking.

Observations and Issues

Some of the more interesting developments in search mechanisms for users are happening at the intersections of these categories of information seeking. In the Cheshire II system[12], for example, a search can combine both probabilistic and Boolean elements. But the search is considered just a first step toward further interaction, refinement and browsing by the user. Any record seen by the user can become the basis for relevance feedback, and within each record selected elements (such as author names and subject headings) become hypertext links so that the user can browse through related records. A history mechanism allows the user to backtrack, and tools are provided to collect and save items seen along the way.
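Using a record the user has seen as the basis for relevance feedback can be illustrated with a Rocchio-style query update from the vector model. This sketch is not Cheshire II's feedback algorithm, which the source describes only as combining probabilistic and Boolean elements; the function name, the `alpha` and `beta` values, and the omission of a non-relevant term are assumptions made to keep the example short.

```python
from collections import Counter

def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Return new query weights: alpha * query + beta * centroid of the relevant records.

    query          -- dict mapping term -> weight
    relevant_docs  -- texts of records the user marked as relevant
    """
    updated = Counter({term: alpha * weight for term, weight in query.items()})
    for doc in relevant_docs:
        for term, tf in Counter(doc.lower().split()).items():
            # Each relevant record contributes its share of the centroid.
            updated[term] += beta * tf / len(relevant_docs)
    return dict(updated)
```

Starting from the one-term query `{"catalog": 1.0}` and two records marked relevant, "online catalog design" and "catalog evaluation", the update raises the weight of "catalog" to 1.75 and adds "online", "design" and "evaluation" at 0.375 each, so the next search is pulled toward records resembling the ones the user selected.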

Other areas that deserve further research include examination of iterative query development for scientific queries and queries against metadata. What is the interaction between the user's knowledge, the database, the search tools, and what is learned during a search? A possible area for research is to examine whether knowledge-based systems can assist the user in formulating or refining queries. Given an "expert system" on a particular topic, how might a user consult it and how might it aid a user's search?

Presently there are thousands of repositories and users must select among them without much in the way of guidance or assistance. We need better information resource discovery mechanisms to characterize and index digital library resources.

Given this chaotic diversity of resources on the network, how will users authenticate sources? (Is this really the author? Is this the true original form of the item?) Who will be trusted to perform the "gatekeeper" functions carried out by editors and publishers in the print environment?

Research Agenda

There are many possible research directions for end-user search capabilities that should be considered for funding by agencies concerned with digital library research. Some important research areas for end-user searching of digital libraries are summarized below.

References

  1. Bates, Marcia J. (1986) "An exploratory paradigm for online information retrieval" In: Brookes, B.C. (ed.) Intelligent Information Systems for the Information Society. Amsterdam: North-Holland, 1986.
  2. Belkin, Nicholas J. and Croft, W. Bruce (1992) "Information filtering and information retrieval: Two sides of the same coin?" Communications of the ACM 35(12):29-38.
  3. Buckland, Michael K.; Butler, Mark H.; Norgard, Barbara A.; and Plaunt, Christian J. (1993) "OASIS: Prototyping graphical interfaces to networked information" In: Bonzi, Susan (Ed.) Integrating Technologies, Converging Professions: ASIS '93, Proceedings of the 56th American Society for Information Science Annual Meeting, Columbus, Ohio, Oct. 24-28, 1993: pp. 204-210.
  4. Cutting, Douglass R.; Pedersen, Jan O.; Karger, David; and Tukey, John W. (1992) "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections" In: Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, June 21-24, 1992: pp. 318-329.
  5. Cutting, Douglass R.; Karger, David R.; Pedersen, Jan O. (1993) "Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections" In: Proceedings of the 16th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA., June 27-July 1, 1993: pp. 126-134.
  6. Dozier, Jeff (1992). How Sequoia 2000 Addresses Issues in Data and Information Systems for Global Change. Sequoia 2000 Technical Report S2K-92-14. ftp://cs-tr.cs.berkeley.edu/pub/sequoia/tech-reports/s2k-92-14
  7. Ferguson, William; Bareiss, Ray; Birnbaum, Lawrence; and Osgood, Richard (1992). "ASK Systems: An approach to the realization of story-based teachers" The Journal of the Learning Sciences 1(2):95-134.
  8. Gravano, Luis; Garcia-Molina, Hector; Tomasic, Anthony (1993). The Efficacy of GlOSS for the Text Database Discovery Problem. Stanford University Technical Note Number STAN-CS-TN-93-002. ftp://db.stanford.edu/pub/gravano/1993/stan.cs.tn.93.002.ps
  9. Kochevar, Peter, et al. (1993). A Visualization Architecture for the Sequoia 2000 Project. Sequoia 2000 Technical Report S2K-93-35. ftp://cs-tr.cs.berkeley.edu/pub/sequoia/tech-reports/s2k-93-35
  10. Kochevar, Peter; Ahmed, Zahid (1994). An Intelligent Assistant for Creating Data Flow Visualization Networks. Sequoia 2000 Technical Report S2K-94-52. ftp://cs-tr.cs.berkeley.edu/pub/sequoia/tech-reports/s2k-94-52
  11. Larson, Ray R. (1992) "Evaluation of Advanced Retrieval Techniques in an Experimental Online Catalog" Journal of the American Society for Information Science 43(1):34-53.
  12. Larson, Ray R.; Moon, Ralph; McDonough, Jerome; Kuntz, Lucy and O'Leary, Paul. (1995) "Cheshire II: Design and Evaluation of a Next-Generation Online Catalog System" In: Kinney, Tom (ed.) ASIS '95: Proceedings of the 58th American Society for Information Science Annual Meeting, Chicago, Oct. 9-12, 1995: pp. 215-225.
  13. Robertson, George G.; Card, Stuart K.; and Mackinlay, Jock D. (1993) "Information visualization using 3D interactive animation" Communications of the ACM 36(4):57-71.
  14. Salton, Gerard (1989). Automatic Text Processing: The transformation, analysis and retrieval of information by computer. New York: Addison-Wesley, 1989.
  15. Salton, Gerard and Buckley, Chris (1990). "Improving retrieval performance by relevance feedback" Journal of the American Society for Information Science 41(4):288-297.
  16. Yan, Tak W. and Garcia-Molina, Hector (1993). "Index Structures for Information Filtering Under the Vector Space Model" Stanford University Technical Note Number STAN-CS-93-1494. ftp://db.stanford.edu/pub/yan/1993/sdi-vector-model-tr.ps
