I'd like to start our tour of book and library information-management techniques with a glance at the humble back-of-book index. I started the USDA's excellent indexing course back in the day. It became clear fairly quickly that I don't have the chops to be a good indexer, so I never finished the course, but I surely learned to respect those who do have indexing chops. It's not an easy job.
Go find a book with an index and flip through it. Seriously, go ahead. I'll wait. Just bask in the lovely indentedness and order of it all.
Now answer me a question: Should Google be calling that huge mass of crawled web data it computes upon an index?
Arguably, it shouldn't, though this is absolutely a lost battle; the word "index" is polysemous and always will be. What Google has is more along the lines of a concordance of the web. What's a concordance, you ask? A list of words in a given corpus of text, along with pointers to where those words are used in the corpus. Way back in the day, compiling concordances to important literature (e.g. religious texts) was considered worthy scholarly work. Today, of course, feeding a text into a computer can yield a concordance in seconds—I'm no great shakes as a programmer, but even I could hack up some concordance software if I had to.
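To make that concrete, here is roughly the sort of thing I mean: a minimal Python sketch, not anybody's production code. The function name build_concordance, the crude tokenizing rule, and the choice of line numbers as the "pointers" are all my own assumptions for illustration.

```python
import re
from collections import defaultdict

def build_concordance(text):
    """Map each lowercased word to the line numbers where it appears."""
    concordance = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Crude tokenizer: any run of letters or apostrophes counts as a word.
        # set() keeps a word from being listed twice for the same line.
        for word in set(re.findall(r"[a-z']+", line.lower())):
            concordance[word].append(line_no)
    return dict(concordance)

sample = "In the beginning was the Word\nand the Word was with God"
conc = build_concordance(sample)
print(conc["word"])  # [1, 2]
print(conc["god"])   # [2]
```

Notice what it does not do: it has no idea which mentions matter, and no idea that two different words might mean the same thing. Hold that thought.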
Google's index is a bit more than a straight-up concordance: they do stemming and some n-gram analysis and other fancy-pants tricks. But it is still qualitatively different from a back-of-book index. How? I'll adduce three major differences: human intervention, terminological insularity, and intentional grouping.
There is a standard documenting what an index is for and how to create one. I'm not paying over $250 to own it, but I'll happily give you the gist.
An indexer presented with a book reads it at least twice, with concentrated attention. She is looking for concepts that the book treats usefully and/or in some depth, because an index containing every passing mention of everything is usually useless to someone asking "does this book have useful, original information on topic X?"
(I did say "usually." Sometimes a topic is so terribly remote or abstruse that even the slightest mention is useful. That's when a concordance can be superior to an index. Google Books is a godsend to lovers of minutiae.)
Please note that I said concepts, not "words" or even "phrases." A recurring problem in information management is that human language is truly annoying about using different words for the same thing, in more ways than this already-too-long post can discuss in depth. Suffice it to say that part of the indexer's job is to tease out concepts in the text that aren't necessarily labeled consistently or even labeled at all. A text on web design may never actually use the word "usability," for example, but that doesn't mean it has nothing to say about the subject! A good indexer will work that out.
So how does an indexer label the concepts she finds? Well, ideally, the text has done that for her; that's why an index is more insular than Google, which makes considerable use of other people's labels for web pages insofar as those are discoverable through links. (That's what Googlebombing is all about, of course.) The indexer is not slavishly bound to the text's language, however. She is allowed to take into account the text's readers, and (what she believes to be) their language use.
An indexer will not lightly discard the text's usage. What she will do is use "See" entries to connect likely reader usage to the text's usage. If the aforementioned web-design text casually throws in "HCI" without ever expanding it (shame on the editor! but it does happen), a smart indexer will throw in an entry "Human-computer interaction. See HCI." Remember this trick. We will see it in other forms later.
A See entry is not the same as a See also entry. See entries are intended for more-or-less synonymous terms. Rather than wastefully repeat the entire litany of page numbers for every synonym of a given term, the indexer picks the likeliest term (probably the text's most-often-used term, but again, the indexer has some discretion) and points the other synonyms to it. See also entries are for related terms: other information that in the indexer's judgment a reader might be interested in.
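If you think better in data structures, the difference might be sketched like this in Python. Everything here is hypothetical: the entries, the page numbers, the INDEX layout, and the look_up helper are made up for illustration, and real indexing software surely does it differently.

```python
# Hypothetical toy index; the entries and page numbers are invented.
INDEX = {
    "HCI": {"pages": [12, 47, 103], "see_also": ["usability"]},
    # A See entry carries no page numbers of its own, only a redirect.
    "human-computer interaction": {"see": "HCI"},
    "usability": {"pages": [47, 48, 150], "see_also": []},
}

def look_up(term, index=INDEX):
    """Return (heading, pages, see_also), following one See redirect if present."""
    heading = term
    entry = index[heading]
    if "see" in entry:
        # Synonym: jump to the preferred heading and use its page list.
        heading = entry["see"]
        entry = index[heading]
    return heading, entry["pages"], entry.get("see_also", [])

print(look_up("human-computer interaction"))
# ('HCI', [12, 47, 103], ['usability'])
```

The See entry replaces the page list outright, while the See also list rides along with it as a suggestion. The page numbers live in exactly one place, which is the whole point.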
See also entries are another example of the grouping function of an index, alongside the basic idea of gathering mentions of the same concept that are scattered throughout the text into a single entry. Google does not do this save by accident. A few other search engines try (have a look at Clusty), but the results typically don't make entire sense. And why should they? They're using algorithms to do a human's job!
Purely mechanical questions such as page count enter into index compilation as well; publishers reserve a certain number of pages for the index (or in the hurly-burly of typesetting, a certain number of pages become available), and the index must be chopped to fit. You can imagine, I'm sure, that it's much harder to do a short index than a longer one!
Indexing electronic books introduces user-interface and workflow questions. The print book has the immensely convenient "page" construct to use as a pointer. The electronic book may have pages—or it may scroll, or the page boundaries may change according to font size, or… you see the UI problem, I trust. It's not insoluble, but it's annoying. The workflow problem is simple: how (and when in the production process) does the poor indexer mark the places a given entry should point to?
When I was doing ebooks back in the day, these problems hadn't been solved yet. I worry sometimes that if they remain unsolved, the noble art of book indexing will wither and die—and the search engine, as I hope you now understand, is not an entire replacement.
Go back and flip through that book index again. Appreciate it a little more? Excellent.