A few weeks ago I announced that Thanksgiving Day would come December 8 at my house. Well, I can’t wait to celebrate, because I just uploaded my final paper for the semester to the D2L dropbox. Because of the nature of this paper and how it relates to us “just communicating,” and in this case, sharing information via online searching, I’ve decided to post it here. You’ll have to forgive the fact that part of the paper includes the entirety of my October “Metacrap happens” post, as this was the impetus for my paper theme.
Zipf’s Law of Least Effort and Marcia Bates’ Berrypicking Model: An Inquiry into Online Search Methods
Ease of access to desired information has long been, and remains, a central goal for information professionals. How to achieve it has intrigued library and behavioral scientists for decades. Rapidly changing technology has revolutionized cataloguing and indexing, applying mathematical algorithms and promising access to more relevant information with less effort. Search engines like Google entice users with their single search-box format and PageRank process, while academic databases like ProQuest and LexisNexis regularly refine the search tools for their inverted indexing schemas, looking for new ways to match information needs with information retrieved. Taxonomies and ontologies are being challenged by social tagging and folksonomies. This paper examines whether online searching better fits George Zipf’s Law of Least Effort or Marcia Bates’ Berrypicking Model, and how the examination of information seeking behavior offers opportunities to develop increasingly user-friendly online information retrieval systems.
Information Seeking Theory
Case (2007) explains that Zipf’s Law of Least Effort functions as a paradigm or grand theory for studies in information seeking behavior. According to Zipf (1949), each individual will adopt a course of action involving the least probable average expenditure of work, in other words, the least effort. Originally formulated as a statistical analysis of language use, Zipf’s theory was first published in his book Human Behavior and the Principle of Least Effort. His “law” has since been tested by linguists and behavioral scientists in a number of studies. Herbert Poole validated Zipf’s supposition in Theories of the Middle Range, reporting that his analysis of the information seeking literature found that 40 of the 51 studies he sampled lent support to Zipf’s principle (Bates 2002).
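Zipf’s statistical observation is that when the words of a text are ranked by frequency, frequency falls off roughly in inverse proportion to rank, so rank times frequency stays roughly constant. A minimal sketch in Python (the sample sentence is invented, and a text this small only hints at the pattern that large corpora show cleanly):

```python
# Rank words by frequency and inspect the Zipfian rank-frequency pattern:
# frequency falls off roughly as 1/rank, so rank * frequency stays roughly
# constant in large corpora.
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog and the dog barks "
        "at the fox while the quick fox runs over the brown hill")
ranked = Counter(text.split()).most_common()  # sorted by descending frequency

for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(f"rank {rank}: {word!r} appears {freq} times (rank*freq = {rank * freq})")
```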
It appears that Google owners Larry Page and Sergey Brin have capitalized on a derivative of Zipf’s rank-frequency distribution. “Zipfian distributions are presented as rank-frequency and frequency-size models. For instance, a rank-frequency distribution of number of index terms assigned in a database displays the frequencies of application of the terms against their rank, with the most frequent being the highest rank (i.e., first)” (Bates 1998). Google describes its PageRank as a system that relies on the democratic nature of the web, using its vast link structure as an indicator of an individual page’s value and providing matches based on its assessment of pages’ relative importance.
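The link-based scoring described above can be sketched as a toy power iteration: a page’s score is fed by the scores of the pages that link to it. The four-page link graph and damping factor below are invented for illustration, not drawn from Google’s actual implementation:

```python
# Toy PageRank by power iteration: each page's score is the damped sum of
# the scores of its in-linking pages, split across their outgoing links.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start with equal scores

for _ in range(50):  # iterate until the scores settle
    rank = {
        p: (1 - damping) / len(pages)
        + damping * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        for p in pages
    }

# "C" collects links from A, B, and D, so it ends up with the top score.
top = max(rank, key=rank.get)
```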
As an alternative to “least effort,” Bates offers a theoretical framework describing a more focused, evolving search. She likens it to berrypicking:
“…with each different conception of the query, the user may identify useful information and references. In other words, the query is satisfied not by a single final retrieved set, but by a series of selections of individual references and bits of information at each stage of the ever-modifying search. A bit-at-a-time retrieval of this sort is here called berrypicking. This term is used by analogy to picking huckleberries or blueberries in the forest. The berries are scattered on the bushes; they do not come in bunches. One must pick them one at a time. One could do berrypicking of information without the search need itself changing (evolving).”
Bates (1998), subsequent to her landmark berrypicking article (1989), recognized Zipfian distributions in search methods, noting that information-related searching, like other human activity, falls under Zipf’s Law of Least Effort. A comparison of the two models of information seeking is therefore warranted.
Indeed, many theorists have speculated about the motives and methods of searching for information. Robert Taylor (1968) posited that “the ease of access to an information system is more significant than the amount or quality of information.” Whether online searching follows Zipf’s “law” or looks more like Bates’ “berrypicking” remains an open question, and while naming a black or white polar extreme may be easier (Zipfian), a shade of grey is likely more accurate. The two models, then, are not an either/or, mutually exclusive account of online searching, but a mix of “how” and “why” questions that information retrieval system developers would be wise to ask. This examination should benefit online information seekers, because it is ultimately their wants and needs that will drive how search engine and database search tools and retrieval methods are modified.
Online Retrieval Systems
Systems developers who recognize that both ease of access (how) and relevance to information needs (why) influence usage would do well to blend the information seeking behaviors of least effort and focused berrypicking. Bates’ evolving search could perhaps be accomplished (with more efficiency and satisfaction) in a Google-like, i.e. Zipfian, information retrieval environment. “Good information retrieval design requires just as much expertise about information and systems of organization as it does about the technical aspects of systems” (Bates, 2007). Improving information retrieval on the Internet is an important goal for information professionals in our knowledge society.
A 2004 paper examining the results of 163 studies (reported in 175 individual articles) covering two decades of research on end-user behaviors in online, mostly academic environments classified “searching techniques” and “satisfaction with results” as targeted areas of study. One question dominating the studies was how to make search and retrieval systems better (Ondrusek, 2004). This review seemed germane, given the topic of this paper and class discussions throughout the semester. Putting end-user behaviors under the microscope of behavioral science theories like least effort and berrypicking could assist in systems design.
Personal experience and a review of the literature suggest that the method of information retrieval depends on population, information need, information setting, and time constraints. Master’s students are scholars, family members, and friends all at once. Class polls fell within the norm of information seeking behavior, indicating that most often we look to someone we know for our answers. According to Case (2007), decades of studies document a strong preference among information seekers for interpersonal sources, which are easier and more accessible than authoritative printed sources. Sometimes it simply takes less effort to ask someone for information than to find it ourselves. Other times, the situation begs for more investigation, i.e. focus and effort, as is the case for academic research and writing a final paper [ 🙂 ] . Academic databases, when searched with meaningful terms, are proven resources for relevant information retrieval, ripe for berrypicking, and often lead researchers from one source to another as their query evolves. Indeed, the way databases are structured to offer additional search terms and suggestions for successful retrieval gives credence to Bates’ model of online searching.
However, even in academic settings, there is plenty of evidence that scholars sometimes exert the least effort. “A recent study found that 63 percent of the search terms employed by scholar searchers of databases were single-word terms” (Bates, 1998). While this statistic is consistent with Zipf’s principle, Bates goes on to say that with digital libraries and the Internet, where tens of millions of records are involved, as many as 100,000 hits per word can easily become the minimum number of matches in any given database. Sorting through such results is hardly the least effort; rather, it begs a berrypicking approach requiring great effort to focus on relevance.
Regardless of online searching methods and the resultant information retrieved, in the end, there is only so much time, so many answers, and only a probability that a definitive answer can be found. At some point a decision to end the query must be made and an information need satisfied, accepting a “good is good enough” state of mind, i.e. being “satisficed,” as Herbert Simon named it in 1957.
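Simon’s satisficing can be read as a stopping rule: rather than scoring every result to find the single best one, the seeker accepts the first result that clears a “good enough” threshold. A minimal sketch, with invented result titles and relevance scores:

```python
# Satisficing as a stopping rule: accept the first result whose relevance
# clears a "good enough" threshold, instead of scanning every result for
# the single best one. Result titles and scores are invented.

def satisfice(results, threshold):
    """Return the first result meeting the threshold, examining as few
    results as possible; fall back to the best seen if none qualifies."""
    best = None
    for title, relevance in results:
        if relevance >= threshold:
            return title  # good enough: stop searching here
        if best is None or relevance > best[1]:
            best = (title, relevance)
    return best[0] if best else None

hits = [("result 1", 0.40), ("result 2", 0.75), ("result 3", 0.90)]
choice = satisfice(hits, threshold=0.7)  # stops at "result 2"
```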
A New Paradigm
Perhaps Bates’ (2002) “Modes of Information Seeking” diagram is a good way to view the spectrum of online research methods: one with Zipf’s law of least effort on one end and focused searching, which requires great effort, on the other (Case, 2007). She ranks the gathering of information in a progressive pattern from undirected awareness (least effort) to directed searching (berrypicking), recognizing states of browsing and monitoring that fall between the two extremes.
Google certainly lends itself to both ease and relevance (albeit in astronomical proportions), encouraging online searching and berrypicking at its best. The variety of retrieved information often facilitates an evolving search, acting as a thesaurus of sorts. The entry of Google Scholar into the scholarly research environment suggests that attention to behavioral science principles like Mooers’ Law, which postulates that no one will use an information system if using it is more trouble than it is worth (Case, 2007), might be shaping the future of online retrieval systems. This view was certainly echoed in our class. Whether retrieval system developers can create an online research environment that is both easy to use (Zipfian) and provides a manageable choice of reliable sources from which to berrypick is a question of great interest to me, a Google junkie looking for quick and easy, yet relevant and authoritative, information sources.
This desire certainly drew the attention of the information science field when a post I uploaded to my Just Communicate! blog spawned a flurry of responses to my rhetorical questioning of the disparity between user-friendly (easy) Google online searching and that of formal, academic databases:
“Today has been a surreal graduate student experience about metadata, classification and cataloging… basically how we find information or enable information to be found so we can use it. (Isn’t that just another way to define communication?)
It all started with a Melissa Gross-style imposed query by one of my favorite professors, Dr. Brown, who literally ‘set us up’ to use databases like ProQuest’s ABI/INFORM, OCLC FirstSearch’s ArticleFirst, and EBSCO Host’s Academic Search Elite to find an obscure article from an unnamed source with little ‘real’ information to go on. Of course, the exercise was engineered for us to learn ways to use search terms and limiters and stop words and the like to assist in a search for information. (Did you pick up on those stop words?)
Well, a half hour and a lot of frustration later, I employed my competitive intelligence skills and did a little googling to find the answer. I think many of the legitimate searchers in our class are still searching, and in order to successfully finish my assignment, I’m using the answer to go back and recreate a ‘legitimate’ search.
Is this the way it’s supposed to be?
Does information have to be so properly indexed and classified? Why is ‘googling’ at the academic level a bad thing? Is it the means or the end?
Too many unanswered questions! So I turned to Doc Martens’ focus on ‘established classification’ for a break from the imposed query experience, only to read Cory Doctorow’s article on ‘Metacrap’ and listen to his interview with David Weinberger at WIRED about ‘explicit’ and ‘implicit’ metadata. To paraphrase, Cory postulates that metadata just isn’t the answer to organizing our information. But perhaps the way Google makes links ‘could’ be. In the interview, Cory explained that Google only works well when using implicit metadata, and as soon as you get ‘explicit,’ the searching for information starts to break down.
I keep thinking about my open source blog post as it relates to the common good and Cory’s explanation that Google’s type of classification is based on a mass who naively makes links and discovers accidentalness… and all at a very low search cost. He went on to discuss how tagging moves us to convergence, and that the resulting technical and social incentive creates mass collaboration.
He used examples like the ‘decay’ tag at Flickr and the value of using misspelled (is that mispelled?) words on eBay to get a deal. All of this seems to reinforce the absurdity of having to get everything right in a formal database search in order to find your ‘answer.’ Herbert Simon’s satisficing theory suggests that what one finds to satisfy an inquiry may be as simple as finding something that’s ‘good enough.’ (The ACRL might disagree.) Who has time to thoroughly review every possible answer?
The concept of folksonomy makes everyone an amateur taxonomist. Does this cooperative classification take the mystery out of finding information and put the fun back into it?
I find it interesting that after I write this post, I’m going to tag it the WordPress way, so hopefully someone else can pull my thoughts out of cyberspace. Again, Cory Doctorow, co-editor of the Boing Boing blog, postulates that collectively we can achieve our goal: the ‘communication of information,’ as I like to call it. In its simplest form, isn’t communication all about getting information from the sender to the receiver without a bunch of noise? (Thank you Shannon and Weaver, 1949.) As information professionals, it should be our charge to make that communication connection without letting information get lost in classification metacrap. Whether a web designer or reference librarian, if the goal is to ‘Just Communicate!’ then we need to find ways to reduce the crap, I mean noise.”
The literature review of behavioral research articles and reports examined for this paper (and found via both Google and The University of Oklahoma LORA’s consortium of databases) failed to find a definitive consensus as to whether Zipf’s Law of Least Effort or Bates’ Berrypicking Model is a more predominant method or accurate description of online searching and information retrieval on the Internet. I predict that the question will continue to intrigue information seeking behavioral scientists, eliciting ongoing investigation of online research of end-users, as new technology is developed to satisfy the ever-growing need for information organization and retrieval. “Changes in our developing understanding of human cognition in information seeking situations, and changes in vocabulary can relatively quickly be accommodated in a user-oriented front-end, without requiring the re-indexing of giant databases. Having reviewed these various issues, we come to the next natural question: How can these things we have learned be applied to improve IR system design for the new digital information world?” (Bates, 2002).
References
Bates, Marcia. 2007. After the Dot-Bomb: Getting Web Information Retrieval Right This Time. First Monday. http://firstmonday.org/issues/issue7_7/bates/ (accessed November 16, 2007).
Bates, Marcia. 2007. What is browsing—really? A model drawing from behavioral science research. Information Research, 12(4). http://informationr.net/ir/12-4/paper330.html (accessed November 16, 2007).
Bates, Marcia. 2002. Toward an integrated model of information seeking and searching. Keynote presented at the International Conference on Information Needs, Seeking and Use in Different Contexts, Lisbon, Portugal, September 2002.
Bates, Marcia. 1998. Indexing and access for digital libraries and the Internet: Human, database, and domain factors, Journal of the American Society for Information Science, 49(13), 1185-1205.
Borgman, Christine, Hirsh, Sandra, et al. 1995. Children’s searching behavior on browsing and keyword online catalogs: The Science Library Catalog project, Journal of the American Society for Information Science, 46(9), 663-684.
Case, Donald. 2007. Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior, 2nd ed. London: Academic Press.
Henner, Terry. 2006. Evaluation and research, Medical Reference Services Quarterly, 25(2), 109-117.
Holscher, Christoph and Strube, Gerhard. 2007. Web search behavior of Internet experts and newbies. http://www9.org/w9cdrom/81/81.html (accessed November 16, 2007).
Kim, Chai. 1982. Retrieval language of social sciences and natural sciences: A statistical investigation, Journal of the American Society for Information Science, 33(1), 1-7.
Ondrusek, Anita. 2004. The attributes of research on end-user online searching behavior: A retrospective review and analysis, Library and Information Science Research 26, 221-265.
Powers, David. 1998. Applications and Explanations of Zipf’s Law. http://acl.ldc.upenn.edu/W/W98/W98-1218.pdf (accessed December 2, 2007).
Taylor, R.S. 1968. Question-negotiation and information seeking in libraries, College and Research Libraries, 29(3), 178-194.