25 November 2009

The Invisible Web


The world is full of mostly invisible things,
And there is no way but putting the mind’s eye,
Or its nose, in a book, to find them out.
—Howard Nemerov




Intro


Internet search engines, not readily available to the general public until the mid-1990s, have in a few short years made themselves part of our everyday lives. It’s hard to imagine going about our daily routines without them. Indeed, one study from the fall of 2000 on how people seek answers found that search engines were the top information resource consulted, used nearly one-third of the time.

Of course, it’s common to hear gripes about search engines. Almost like bad weather, our failures in locating information with them provide a common experience that everyone can commiserate with. Such complaints overlook the fact that we do indeed tend to find what we are looking for most of the time with search engines. If not, they would have long been consigned to the Internet’s recycle bin and replaced with something better. Nevertheless, it is the search failures that live in our memories, not the successes. “What a stupid search engine! How could it not have found that?” we ask ourselves. The reasons why are multifold.

  • Sometimes we don’t ask correctly, and the search engine cannot interrogate us to better understand what we want.
  • Sometimes we use the wrong search tool, for example, looking for current news headlines on a general-purpose Webwide search engine. It’s the cyberspace equivalent of trying to drive a nail into a board with a screwdriver. Use the right tool, and the job is much easier.
  • Sometimes the information isn’t out there at all, and so a search engine simply cannot find it. Despite the vast resources of the World Wide Web, it does not contain the answers to everything. During such times, turning to information resources such as books and libraries, which have served us valiantly for hundreds of years, may continue to be the best course of action.
  • Of course, sometimes the information is out there but simply hasn’t been accessed by search engines. Web site owners may not want their information to be found. Web technologies may pose barriers to search engine access. Some information simply cannot be retrieved until the right forms are processed. These are all examples of information that is essentially “invisible” to search engines, and if we had a means to access this “Invisible (or Deep) Web” then we might more readily find the answers we are looking for.
The good news is that the Invisible Web is indeed accessible to us, though we might need to look harder to find it. Though we can’t see it easily, there’s nothing to fear from the Invisible Web and plenty to gain from discovering it.

If the Web has become an integral part of your daily life, you enjoy search engines and Web directories: these pathfinders are crucial guides that help you navigate through an exploding universe of constantly changing information. Yet you also hate them, because all too often they fail miserably at answering even the most basic questions or satisfying the simplest queries. They waste your time, they exasperate and frustrate, even provoking an extreme reaction, known as “Web rage,” in some people. It’s fair to ask, “What’s the problem here? Why is it so difficult to find the information I’m looking for?”

The problem is that vast expanses of the Web are completely invisible to general-purpose search engines like AltaVista, HotBot, and Google. Even worse, this “Invisible Web” is in all likelihood growing significantly faster than the visible Web that you’re familiar with. It’s not
that the search engines and Web directories are “stupid” or even badly engineered. Rather, they simply can’t “see” millions of high-quality resources that are available exclusively on the Invisible Web.

So what is this Invisible Web and why aren’t search engines doing anything about making it visible? Good question. There is no dictionary definition for the Invisible Web. Several studies have attempted to map the entire Web, including parts of what we call the Invisible Web. We have found little consensus, however, among the professional Web search community regarding the cartography of the Invisible Web.

Many people—even those “in the know” about Web searching—make assumptions about the scope and thoroughness of Web search engines’ coverage that are simply untrue. In a nutshell, the Invisible Web consists of material that general-purpose search engines either cannot or, perhaps more importantly, will not include in their collections of Web pages (called indexes or indices). The Invisible Web contains vast amounts of authoritative and current information that’s accessible to you, using your Web browser or add-on utility software—but you have to know where to find it ahead of time, since you simply cannot locate it using a search engine like HotBot or Lycos.

Why? There are several reasons. One is technical—search engine technology is actually quite limited in its capabilities, despite its tremendous usefulness in helping searchers locate text documents on the Web. Another reason relates to the costs involved in operating a comprehensive search engine. It’s expensive for search engines to locate Web resources and maintain up-to-date indices. Search engines must also cope with unethical Web page authors who seek to subvert their indexes with millions of bogus “spam” pages—pages that, like
their unsavory e-mail kin, are either junk or offer deceptive or misleading information. Most of the major engines have developed strict guidelines for dealing with spam, which sometimes has the unfortunate effect of excluding legitimate content. These are just a few of the reasons the Invisible Web exists.

The bottom line for the searcher is that understanding the Invisible Web and knowing how to access its treasures can save both time and frustration, often yielding high-quality results that aren’t easily found any other way.


The paradox of the Invisible Web is that it’s easy to understand why it exists, but it’s very hard to actually define in concrete, specific terms. In a nutshell, the Invisible Web consists of content that’s been excluded from general-purpose search engines and Web directories such as Lycos and LookSmart. There’s nothing inherently “invisible” about this content. But since this content is not easily located with the information-seeking tools used by most Web users, it’s effectively invisible because it’s so difficult to find unless you know exactly where to look.

The visible Web is easy to define. It’s made up of HTML Web pages that the search engines have chosen to include in their indices. It’s no more complicated than that. The Invisible Web is much harder to define and classify for several reasons.
  • First, many Invisible Web sites are made up of straightforward Web pages that search engines could easily crawl and add to their indices, but do not, simply because the engines have decided against including them. This is a crucial point—much of the Invisible Web is hidden because search engines have deliberately chosen to exclude some types of Web content. We’re not talking about unsavory “adult” sites or blatant spam sites—quite the contrary! Many Invisible Web sites are first-rate content sources. These exceptional resources simply cannot be found by using general-purpose search engines because they have been effectively locked out. There are a number of reasons for these exclusionary policies. But keep in mind that should the engines change their policies in the future, sites that today are part of the Invisible Web will suddenly join the mainstream as part of the visible Web.
  • Second, it’s relatively easy to classify some sites as either visible or Invisible based on the technology they employ. Some sites using database technology, for example, are genuinely difficult for current generation search engines to access and index. These are “true” Invisible Web sites. Other sites, however, use a variety of media and file types, some of which are easily indexed, and others that are incomprehensible to search engine crawlers. Web sites that use a mixture of these media and file types aren’t easily classified as either visible or Invisible. Rather, they make up what we call the “opaque” Web.
  • Finally, search engines could theoretically index some parts of the Invisible Web, but doing so would simply be impractical, either from a cost standpoint, or because data on some sites is ephemeral and not worthy of indexing—for example, current weather information, moment-by-moment stock quotes, airline flight arrival times, and so on.
Now we define the Invisible Web, and delve into the reasons search engines can’t “see” its content. We also discuss the four different “types” of invisibility, ranging from the “opaque” Web, which is relatively accessible to the searcher, to the truly invisible Web, which requires specialized finding aids to access effectively.


Invisible Web Defined


The definition given above is deliberately very general, because the general-purpose search engines are constantly adding features and improvements to their services. What may be invisible today may become visible tomorrow, should the engines decide to add the capability to index things that they cannot or will not currently index.

Let’s examine the two parts of our definition in more detail. First, we’ll look at the technical reasons search engines can’t index certain types of material on the Web. Then we’ll talk about some of the other non-technical but very important factors that influence the policies that guide search engine operations. At their most basic level, search engines are designed to index Web pages. Search engines use programs called crawlers to find and retrieve Web pages stored on servers all over the world. From a Web server’s standpoint, it doesn’t make any difference if a request for a page comes from a person using a Web browser or from an automated search engine crawler. In either case, the server returns the desired Web page to the computer that requested it.
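To make this concrete, here is roughly what such a request looks like at the protocol level. This is a minimal, hypothetical HTTP request (the path, host name, and crawler name are all invented); a browser and a crawler send essentially the same thing, differing mainly in the User-Agent name they announce:

    GET /reports/index.html HTTP/1.1
    Host: www.example.org
    User-Agent: ExampleCrawler/1.0

A Web browser would send the same kind of request with its own User-Agent string; in either case the server simply returns the requested page.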

A key difference between a person using a browser and a search engine crawler is that the person is able to manually type a URL into the browser window and retrieve that Web page. Search engine crawlers lack this capability. Instead, they’re forced to rely on links they find on Web pages to find other pages. If a Web page has no links pointing to it from any other page on the Web, a search engine crawler can’t find it. These “disconnected” pages are the most basic part of the Invisible Web. There’s nothing preventing a search engine from crawling and indexing disconnected pages—there’s simply no way for a crawler to discover and fetch them.
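The link-following behavior described above can be sketched in a few lines of Python. This is purely illustrative and reflects no particular engine’s crawler; the seed list, page limit, and crude link extraction are all assumptions:

    import re
    import urllib.request
    from urllib.parse import urljoin

    def crawl(seed_urls, max_pages=100):
        # Pages enter the queue only when some already-fetched page links to them.
        queue = list(seed_urls)
        seen = set(queue)
        indexed = {}          # url -> raw HTML handed to the indexer
        while queue and len(indexed) < max_pages:
            url = queue.pop(0)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue      # unreachable pages are simply skipped
            indexed[url] = html
            # Crude link extraction for illustration; real crawlers parse HTML properly.
            for href in re.findall(r'href="([^"#]+)"', html, re.IGNORECASE):
                absolute = urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        # A page that no crawled page links to never enters the queue, so it stays
        # "disconnected" and invisible to this kind of crawler.
        return indexed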

Disconnected pages can easily leave the realm of the Invisible and join the visible Web in one of two ways.
  1. First, if a connected Web page links to a disconnected page, a crawler can discover the link and spider the page.
  2. Second, the page author can request that the page be crawled by submitting it to search engine “add URL” forms.
Technical problems begin to come into play when a search engine crawler encounters an object or file type that’s not a simple text document. Search engines are designed to index text, and are highly optimized to perform search and retrieval operations on text. But they don’t do very well with non-textual data, at least in the current generation of tools. Some engines, like AltaVista and HotBot, can do limited searching for certain kinds of non-text files, including images, audio, or video files. But the way they process requests for this type of material is reminiscent of early Archie searches, typically limited to a filename or the minimal alternative (ALT) text that’s sometimes used by page authors in the HTML image tag. Text surrounding an image, sound, or video file can give additional clues about what the file contains. But keyword searching with images and sounds is a far cry from simply telling the search engine to “find me a picture that looks like Picasso’s Guernica” or “let me hum a few bars of this song and you tell me what it is.” Pages that consist primarily of images, audio, or video, with little or no text, make up another type of Invisible Web content. While the pages may actually be included in a search engine index, they provide few textual clues as to their content, making it highly unlikely that they will ever garner high relevance scores. Researchers are working to overcome these limitations.
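To see how little an engine has to work with, consider a hypothetical image tag; the file name and the optional ALT text are usually the only indexable clues:

    <img src="guernica-study.jpg" alt="Black-and-white study of Picasso's Guernica">
    <!-- the file name and ALT value above are invented for illustration -->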

While search engines have limited capabilities to index pages that are primarily made up of images, audio, and video, they have serious problems with other types of non-text material. Most of the major general-purpose search engines simply cannot handle certain types of formats. These formats include:
  • PDF or Postscript (Google excepted)
  • Flash
  • Shockwave
  • Executables (programs)
  • Compressed files (.zip, .tar, etc.)
The problem with indexing these files is that they aren’t made up of HTML text. Technically, most of the formats in the list above can be indexed. The search engines choose not to index them for business reasons.
  • For one thing, there’s much less user demand for these types of files than for HTML text files.
  • These formats are also “harder” to index, requiring more computing resources. For example, a single PDF file might consist of hundreds or even thousands of pages.
  • Indexing non-HTML file formats tends to be costly.
Pages consisting largely of these “difficult” file types currently make up a relatively small part of the Invisible Web. However, we’re seeing a rapid expansion in the use of many of these file types, particularly for some kinds of high-quality, authoritative information. For example, to comply with federal paperwork reduction legislation, many U.S. government agencies are moving to put all of their official documents on the Web in PDF format. Most scholarly papers are posted to the Web in Postscript or compressed Postscript format. For the searcher, Invisible Web content made up of these file types poses a serious problem. We discuss a partial solution to this problem later.

The biggest technical hurdle search engines face lies in accessing information stored in databases. This is a huge problem, because there are thousands—perhaps millions—of databases containing high-quality information that are accessible via the Web.
Web content creators favor databases because they offer flexible, easily maintained development environments. And increasingly, content-rich databases from universities, libraries, associations, businesses, and government agencies are being made available online, using Web interfaces as front-ends to what were once closed, proprietary information systems.
Databases pose a problem for search engines because every database is unique in both the design of its data structures, and its search and retrieval tools and capabilities. Unlike simple HTML files, which search engine crawlers can simply fetch and index, content stored in databases is trickier to access, for a number of reasons that we’ll describe in detail here.

Search engine crawlers generally have no difficulty finding the interface or gateway pages to databases, because these are typically pages made up of input fields and other controls. These pages are formatted with HTML and look like any other Web page that uses interactive forms. Behind the scenes, however, are the knobs, dials, and switches that provide access to the actual contents of the database, which are literally incomprehensible to a search engine crawler.

Although these interfaces provide powerful tools for a human searcher, they act as roadblocks for a search engine spider. Essentially, when an indexing spider comes across a database, it’s as if it has run smack into the entrance of a massive library with securely bolted doors. A crawler can locate and index the library’s address, but because the crawler cannot penetrate the gateway it can’t tell you anything about the books, magazines, or other documents it contains.

These Web-accessible databases make up the lion’s share of the Invisible Web. They are accessible via the Web, but may or may not actually be on the Web. To search a database you must use the powerful search and retrieval tools offered by the database itself. The advantage to this direct approach is that you can use search tools that were specifically designed to retrieve the best results from the database. The disadvantage is that you need to find the database in the first place, a task the search engines may or may not be able to help you with.

There are several different kinds of databases used for Web content, and it’s important to distinguish between them. Just because Web content is stored in a database doesn’t automatically make it part of the Invisible Web. Indeed, some Web sites use databases not so much for their sophisticated query tools, but rather because database architecture is more robust and makes it easier to maintain a site than if it were simply a collection of HTML pages.
  • One type of database is designed to deliver tailored content to individual users. Examples include My Yahoo!, Personal Excite, Quicken.com’s personal portfolios, and so on. These sites use databases that generate “on the fly” HTML pages customized for a specific user. Since this content is tailored for each user, there’s little need to index it in a general-purpose search engine.
  • A second type of database is designed to deliver streaming or real-time data—stock quotes, weather information, airline flight arrival information, and so on. This information isn’t necessarily customized, but is stored in a database due to the huge, rapidly changing quantities of information involved. Technically, much of this kind of data is indexable because the information is retrieved from the database and published in a consistent, straight HTML file format. But because it changes so frequently and has value for such a limited duration (other than to scholars or archivists), there’s no point in indexing it. It’s also problematic for crawlers to keep up with this kind of information. Even the fastest crawlers revisit most sites monthly or even less frequently. Staying current with real-time information would consume so many resources that it is effectively impossible for a crawler.
  • The third type of Web-accessible database is optimized for the data it contains, with specialized query tools designed to retrieve the information using the fastest or most effective means possible. These are often “relational” databases that allow sophisticated querying to find data that is “related” based on criteria specified by the user. The only way of accessing content in these types of databases is by directly interacting with the database. It is this content that forms the core of the Invisible Web. Let’s take a closer look at these elements of the Invisible Web, and demonstrate exactly why search engines can’t or won’t index them.


Why Search Engines Can’t See the Invisible Web
Text—more specifically hypertext—is the fundamental medium of the Web. The primary function of search engines is to help users locate hypertext documents of interest. Search engines are highly tuned and optimized to deal with text pages, and even more specifically, text pages that have been encoded with the HyperText Markup Language (HTML). As the Web evolves and additional media become commonplace, search engines will undoubtedly offer new ways of searching for this information. But for now, the core function of most Web search engines is to help users locate text documents.

HTML documents are simple. Each page has two parts: a “head” and a “body,” which are clearly separated in the source code of an HTML page.
  • The head portion contains a title, which is displayed (logically enough) in the title bar at the very top of a browser’s window. The head portion may also contain some additional metadata describing the document, which can be used by a search engine to help classify the document. For the most part, other than the title, the head of a document contains information and data that help the Web browser display the page but is irrelevant to a search engine.
  • The body portion contains the actual document itself. This is the meat that the search engine wants to digest; a minimal example of the two parts follows below.
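A minimal, made-up page illustrates the two parts. The title and optional metadata sit in the head; the body holds the text the engine actually indexes:

    <html>
      <head>
        <title>Annual Rainfall Statistics</title>
        <meta name="description" content="Rainfall data by region and year">
      </head>
      <body>
        <p>The document text itself goes here; this is what the indexer digests.</p>
      </body>
    </html>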
The simplicity of this format makes it easy for search engines to retrieve HTML documents, index every word on every page, and store them in huge databases that can be searched on demand. Problems arise when content doesn’t conform to this simple Web page model. To understand why, it’s helpful to consider the process of crawling and the factors that influence whether a page either can or will be successfully crawled and indexed.
  • The first determination a crawler attempts to make is whether access to pages on a server it is attempting to crawl is restricted. Webmasters can use three methods to prevent a search engine from indexing a page. Two methods use blocking techniques specified in the Robots Exclusion Protocol that most crawlers voluntarily honor, and one creates a technical roadblock that cannot be circumvented. The Robots Exclusion Protocol is a set of rules that enables a Webmaster to specify which parts of a server are open to search engine crawlers, and which parts are off-limits. The Webmaster simply creates a list of files or directories that should not be crawled or indexed, and saves this list on the server in a file named robots.txt (a short example appears below). This optional file, stored by convention at the top level of a Web site, is nothing more than a polite request to the crawler to keep out, but most major search engines respect the protocol and will not index files specified in robots.txt.

  • The second means of preventing a page from being indexed works in the same way as the robots.txt file, but is page-specific. Webmasters can prevent a page from being crawled by including a “noindex” meta tag instruction in the “head” portion of the document. Either robots.txt or the noindex meta tag can be used to block crawlers. The only difference between the two is that the noindex meta tag is page specific, while the robots.txt file can be used to prevent indexing of individual pages, groups of files, or even entire Web sites.

  • Password protecting a page is the third means of preventing it from being crawled and indexed by a search engine. This technique is much stronger than the first two because it uses a technical barrier rather than a voluntary standard. Why would a Webmaster block crawlers from a page using the Robots Exclusion Protocol rather than simply password protecting the pages? Password-protected pages can be accessed only by the select few users who know the password. Pages excluded from engines using the Robots Exclusion Protocol, on the other hand, can be accessed by anyone except a search engine crawler. The most common reason Webmasters block pages from indexing is that their content changes so frequently that the engines cannot keep up.
Pages using any of the three methods described here are part of the Invisible Web. In many cases, they contain no technical roadblocks that prevent crawlers from spidering and indexing the page. They are part of the Invisible Web because the Webmaster has opted to keep them out of the search engines.
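The syntax behind the two voluntary methods is brief. The following robots.txt file and meta tag are invented examples (the directory and file names are placeholders); the robots.txt file sits at the top level of the server, while the meta tag goes in the head of an individual page:

    # robots.txt: a polite, voluntary request that most major crawlers honor
    User-agent: *
    Disallow: /drafts/
    Disallow: /internal-report.html

    <!-- page-specific alternative, placed in the head of a single document -->
    <meta name="robots" content="noindex">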

Once a crawler has determined whether it is permitted to access a page, the next step is to attempt to fetch it and hand it off to the search engine’s indexer component. This crucial step determines whether a page is visible or invisible. Let’s examine some variations that crawlers encounter as they discover pages on the Web, using the same logic they do to determine whether a page is indexable.

  • Case 1. The crawler encounters a page that is straightforward HTML text, possibly including basic Web graphics. This is the most common type of Web page. It is visible and can be indexed.
  • Case 2. The crawler encounters a page made up of HTML, but it’s a form consisting of text fields, check boxes, or other components requiring user input. It might be a sign-in page, requiring a user name and password. It might be a form requiring the selection of one or more options. The form itself, since it’s made up of simple HTML, can be fetched and indexed. But the content behind the form (what the user sees after clicking the submit button) may be invisible to a search engine. There are two possibilities here:


    • The form is used simply to select user preferences. Other pages on the site consist of straightforward HTML that can be crawled and indexed (presuming there are links from other pages elsewhere on the Web pointing to the pages). In this case, the form and the content behind it are visible and can be included in a search engine index. Quite often, sites like this are specialized search sites. A good example is Hoover’s Business Profiles, which provides a form to search for a company, but presents company profiles in straightforward HTML that can be indexed.
    • The form is used to collect user-specified information that will generate dynamic pages when the information is submitted. In this case, although the form is visible the content “behind” it is invisible. Since the only way to access the content is by using the form, how can a crawler—which is simply designed to request and fetch pages—possibly know what to enter into the form? Since forms can literally have infinite variations, if they function to access dynamic content they are essentially roadblocks for crawlers. A good example of this type of Invisible Web site is The World Bank Group’s Economics of Tobacco Control Country Data Report Database, which allows you to select any country and choose a wide range of reports for that country. It’s interesting to note that this database is just one part of a much larger site, the bulk of which is fully visible. So even if the search engines do a comprehensive job of indexing the visible part of the site, this valuable information still remains hidden to all but those searchers who visit the site and discover the database on their own.
    In the future, forms will pose less of a challenge to search engines. Several projects are underway aimed at creating more intelligent crawlers that can fill out forms and retrieve information. One approach uses preprogrammed “brokers” designed to interact with the forms of specific databases. Other approaches combine brute force with artificial intelligence to “guess” what to enter into forms, allowing the crawler to “punch through” the form and retrieve information. However, even if general-purpose search engines do acquire the ability to crawl content in databases, it’s likely that the native search tools provided by each database will remain the best way to interact with them.
  • Case 3. The crawler encounters a dynamically generated page assembled and displayed on demand. The telltale sign of a dynamically generated page is the “?” symbol appearing in its URL. Technically, these pages are part of the visible Web. Crawlers can fetch any page that can be displayed in a Web browser, regardless of whether it’s a static page stored on a server or generated dynamically. A good example of this type of Invisible Web site is Compaq’s experimental SpeechBot search engine, which indexes audio and video content using speech recognition, and converts the streaming media files to viewable text. Somewhat ironically, one could make a good argument that most search engine result pages are themselves Invisible Web content, since they generate dynamic pages on the fly in response to user search terms.
    Dynamically generated pages pose a challenge for crawlers. Dynamic pages are created by a script, a computer program that selects from various options to assemble a customized page. Until the script is actually run, a crawler has no way of knowing what it will actually do. The script should simply assemble a customized Web page. Unfortunately, unethical Webmasters have created scripts to generate millions of similar but not quite identical pages in an effort to “spamdex” the search engine with bogus pages. Sloppy programming can also result in a script that puts a spider into an endless loop, repeatedly retrieving the same page.
    These “spider traps” can be a real drag on the engines, so most have simply made the decision not to crawl or index URLs that generate dynamic content. They’re “apartheid” pages on the Web—separate but equal, making up a big portion of the “opaque” Web that potentially can be indexed but is not. Inktomi’s FAQ about its crawler, named “Slurp,”
    offers this explanation:



    “Slurp now has the ability to crawl dynamic links or dynamically generated documents. It will not, however, crawl them by default. There are a number of good reasons for this. A couple of reasons are that dynamically generated documents can make up infinite URL spaces, and that dynamically generated links and documents can be different for every retrieval so there is no use in indexing them.”
    As crawler technology improves, it’s likely that one type of dynamically generated content will increasingly be crawled and indexed. This is content that essentially consists of static pages that are stored in databases for production efficiency reasons. As search engines learn which sites providing dynamically generated content can be trusted not to subject crawlers to spider traps, content from these sites will begin to appear in search engine indices. For now, most dynamically generated content is squarely in the realm of the Invisible Web.
  • Case 4. The crawler encounters an HTML page with nothing to index. There are thousands, if not millions, of pages that have a basic HTML framework, but which contain only Flash, images in the .gif, .jpeg, or other Web graphics format, streaming media, or other non-text content in the body of the page. These types of pages are truly parts of the Invisible Web because there’s nothing for the search engine to index. Specialized multimedia search engines, such as ditto.com and WebSeek are able to recognize some of these non-text file types and index minimal information about them, such as file name and size, but these are far from keyword searchable solutions.
  • Case 5. The crawler encounters a site offering dynamic, real-time data. There are a wide variety of sites providing this kind of information, ranging from real-time stock quotes to airline flight arrival information. These sites are also part of the Invisible Web, because these data streams are, from a practical standpoint, unindexable. While it’s technically possible to index many kinds of real-time data streams, the value would only be for historical purposes, and the enormous amount of data captured would quickly strain a search engine’s storage capacity, so it’s a futile exercise. A good example of this type of Invisible Web site is TheTrip.com’s Flight tracker, which provides real-time flight arrival
    information taken directly from the cockpit of in-flight airplanes.
  • Case 6. The crawler encounters a PDF or Postscript file. PDF and Postscript are text formats that preserve the look of a document and display it identically regardless of the type of computer used to view it. Technically, it’s a straightforward task to convert a PDF or Postscript file to plain text that can be indexed by a search engine. However, most
    search engines have chosen not to go to the time and expense of indexing files of this type. One reason is that most documents in these formats are technical or academic papers, useful to a small community of scholars but irrelevant to the majority of search engine users, though this is changing as governments increasingly adopt the PDF format for their official documents. Another reason is the expense of conversion to plain text. Search engine companies must make business decisions on how best to allocate resources, and typically they elect not to work with these formats.
    An experimental search engine called ResearchIndex, created by computer scientists at the NEC Research Institute, not only indexes PDF and Postscript files, it also takes advantage of the unique features that commonly appear in documents using the format to improve search results. For example, academic papers typically cite other documents, and include lists of references to related material. In addition to indexing the full text of documents, ResearchIndex also creates a citation index that makes it easy to locate related documents. It also appears that citation searching has little overlap with keyword searching, so combining the two can greatly enhance the relevance of results. We hope that the major search engines will follow Google’s example and gradually adopt the pioneering work being done by the developers of ResearchIndex. Until then, files in PDF or Postscript format remain firmly in the realm of the Invisible Web.
  • Case 7. The crawler encounters a database offering a Web interface. There are tens of thousands of databases containing extremely valuable information available via the Web. But search engines cannot index the material in them. Although we present this as a unique case, Web-accessible databases are essentially a combination of Cases 2 and 3. Databases generate Web pages dynamically, responding to commands issued through an HTML form. Though the interface to the database is an HTML form, the database itself may have been created before the development of HTML, and its legacy system is incompatible with protocols used by the engines, or they may require registration to access the data. Finally, they may be proprietary, accessible only to select users, or users who have paid a fee for access. Ironically, the original HTTP specification developed by Tim Berners-Lee included a feature called format negotiation that allowed a client to say
    what kinds of data it could handle and allow a server to return data in any acceptable format. Berners-Lee’s vision encompassed the information in the Invisible Web, but this vision—at least from a search engine stand-point—has largely been unrealized.
These technical limitations give you an idea of the problems encountered by search engines when they attempt to crawl Web pages and compile indices. There are other, non-technical reasons why information isn’t included in search engines. We look at those next.



What You See Is Not What You Get
In theory, the results displayed in response to a search engine query accurately reflect the pages that are deemed relevant to the query. In practice, however, this isn’t always the case. When a search index is out of date, search results may not match the current content of the page simply because the page has been changed since it was last indexed. But there’s a more insidious problem: spiders can be fooled into crawling one page that’s masquerading as another. This technique is called “cloaking” or, more technically, “IP delivery.”
By convention, crawlers have unique names, and they identify themselves by name whenever they request pages from a server, allowing servers to deny them access during particularly busy times so that human users won’t suffer performance consequences. The crawler’s name also provides a means for Webmasters to contact the owners of spiders that put undue stress on servers. But the identification codes
also allow Webmasters to serve pages that are created specifically for spiders in place of the actual page the spider is requesting.
This is done by creating a script that monitors the IP (Internet Protocol) addresses making page requests. All entities, whether Web browsers or search engine crawlers, have their own unique IP addresses. IP addresses are effectively “reply to” addresses—the Internet address to which pages should be sent. Cloaking software watches for the unique signature of a search engine crawler (its IP address), and feeds specialized versions of pages to the spider that aren’t identical to the ones that will be seen by anyone else.
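A minimal sketch of the decision such cloaking software makes, written in Python purely for illustration; the addresses and function name are invented, and real IP-delivery packages are far more elaborate:

    # Hypothetical server-side cloaking logic (illustration only).
    KNOWN_CRAWLER_IPS = {"192.0.2.10", "192.0.2.11"}   # invented crawler addresses

    def choose_page(request_ip, normal_page, crawler_page):
        """Serve a page tailored for spiders when the request comes from a known crawler IP."""
        if request_ip in KNOWN_CRAWLER_IPS:
            return crawler_page   # the version the search engine will index
        return normal_page        # the version human visitors actually see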
Cloaking allows Webmasters to “break all the rules” by feeding specific information to the search engine that will cause a page to rank well for specific search keywords. Used legitimately, cloaking can solve the problem of unscrupulous people stealing metatag source code from a high-ranking page. It can also help sites that are required by law to have a “search-unfriendly” disclaimer page as their home page. For example, pharmaceutical companies Eli Lilly and Schering-Plough use IP delivery techniques to assure that their pages rank highly for their specific products, which would be impossible if the spiders were only able to index the legalese on pages required by law.
Unfortunately, cloaking also allows unscrupulous Webmasters to employ a “bait and switch” tactic designed to make the search engine think the page is about one thing when in fact it may be about something completely different. This is done by serving a totally bogus page to a crawler, asserting that it’s the actual content of the URL, while in fact the content at the actual URL of the page may be entirely different. This sophisticated trick is favored by spammers seeking to lure unwary searchers to pornographic or other unsavory sites.
IP delivery is difficult for search crawlers to recognize, though a careful searcher can often recognize the telltale signs by comparing the title and description with the URL in a search engine result. For example, look at these two results for the query “child toys”:
    Dr. Toy’s Guide: Information on Toys and Much More
    Toy Information! Over 1,000 award winning toys and children’s products are fully described with company phone numbers, photos and links to useful resources...
    URL: www.drtoy.com/

    AAA BEST TOYS
    The INTERNET’S LARGEST ULTIMATE TOY STORE for children of all ages.
    URL: 196.22.31.6/xxx-toys.htm
In the first result, the title, description, and URL all suggest a reputable resource for children’s toys. In the second result, there are several clues that suggest that the indexed page was actually served to the crawler via IP delivery. The use of capital letters and a title beginning with “AAA” (a favorite but largely discredited trick of spammers) are blatant red flags. What really clinches it is the use of a numeric URL, which makes it difficult to know what the destination is, and the actual filename of the page, suggesting something entirely different from wholesome toys for children. The important thing to remember about this method is that the titles and descriptions, and even the content of a page, can be faked using IP delivery, but the underlying URL cannot. If a search result looks dubious, pay close attention to the URL before clicking on it. This type of caution can save you both frustration and potential embarrassment.

Four Types of Invisibility
Technical reasons aside, there are other reasons why some kinds of material that can be accessed on or via the Internet are not included in search engines. There are really four “types” of Invisible Web content. We make these distinctions not so much to draw hard and fast boundaries between the types, but rather to help illustrate the amorphous boundary of the Invisible Web that makes defining it in concrete terms so difficult. The four types of invisibility are:
  • The Opaque Web
  • The Private Web
  • The Proprietary Web
  • The Truly Invisible Web

The Opaque Web
The Opaque Web consists of files that can be, but are not, included in search engine indices. The Opaque Web is quite large, and presents a unique challenge to a searcher. Whereas the deep content in many truly Invisible Web sites is accessible if you know how to find it, material on the Opaque Web is often much harder to find.

The biggest part of the Opaque Web consists of files that the search engines can crawl and index, but simply do not. There are a variety of reasons for this; let’s look at them.


DEPTH OF CRAWL
Crawling a Web site is a resource-intensive operation. It costs money for a search engine to crawl and index every page on a site. In the past, most engines would merely sample a few pages from a site rather than performing a “deep crawl” that indexed every page, reasoning that a sample provided a “good enough” representation of a site that would satisfy the needs of most searchers. Limiting the depth of crawl also reduced the cost of indexing a particular Web site.

In general, search engines don’t reveal how they set the depth of crawl for Web sites. Increasingly, there is a trend to crawl more deeply and index as many pages as possible. As the cost of crawling and indexing goes down, and the size of search engine indices continues to be a competitive issue, depth of crawl is becoming less of a concern for searchers. Nonetheless, simply because one, fifty, or five thousand pages from a site are crawled and made searchable, there is no guarantee that every page from that site will be crawled and indexed. This problem gets little attention and is one of the top reasons why useful material may be all but invisible to those who rely only on general-purpose search tools to find Web materials.


FREQUENCY OF CRAWL
The Web is in a constant state of dynamic flux. New pages are added constantly, and existing pages are moved or taken off the Web. Even the most powerful crawlers can visit only about 10 million pages per day, a fraction of the entire number of pages on the Web. This means that each search engine must decide how best to deploy its crawlers, creating a schedule that determines how frequently a particular page or site is visited.

Web search researchers Steve Lawrence and Lee Giles, writing in the July 8, 1999, issue of Nature state that “indexing of new or modified pages by just one of the major search engines can take months” (Lawrence, 1999). While the situation appears to have improved since their study, most engines only completely “refresh” their indices monthly or even less frequently.

It’s not enough for a search engine to simply visit a page once and then assume it’s still available thereafter. Crawlers must periodically return to a page to not only verify its existence, but also to download the freshest copy of the page and perhaps fetch new pages that have been added to a site. According to one study, it appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years. Put differently, this means that if a crawler returned to a site spidered two years ago it would contain the same number of URLs, but only half of the original pages would still exist, having been replaced by new ones (Koehler, 2000).

New sites are the most susceptible to oversight by search engines because relatively few other sites on the Web will have linked to them compared to more established sites. Until search engines index these new sites, they remain part of the Invisible Web.


MAXIMUM NUMBER OF VIEWABLE RESULTS
It’s quite common for a search engine to report a very large number of results for any query, sometimes into the millions of documents. However, most engines also restrict the total number of results they will display for a query, typically between 200 and 1,000 documents. For queries that return a huge number of results, this means that the majority of pages the search engine has determined might be relevant are inaccessible, since the result list is arbitrarily truncated. Those pages that don’t make the cut are effectively invisible.

Good searchers are aware of this problem, and will take steps to circumvent it by using a more precise search strategy and using the advanced filtering and limiting controls offered by many engines. However, for many inexperienced searchers this limit on the total number of viewable hits can be a problem. What happens if the answer you need is available (with a more carefully crafted search) but cannot be viewed using your current search terms?


DISCONNECTED URLS
For a search engine crawler to access a page, one of two things must take place. Either the Web page author uses the search engine’s “Submit URL” feature to request that the crawler visit and index the page, or the crawler discovers the page on its own by finding a link to the page on
some other page. Web pages that aren’t submitted directly to the search engines, and that don’t have links pointing to them from other Web pages, are called “disconnected” URLs and cannot be spidered or indexed simply because the crawler has no way to find them.

Quite often, these pages present no technical barrier for a search engine. But the authors of disconnected pages are clearly unaware of the requirements for having their pages indexed. A May 2000 study by IBM, AltaVista, and Compaq discovered that the total number of disconnected URLs makes up about 20 percent of the potentially indexable Web, so this
isn’t an insignificant problem (Broder et al., 2000).

In summary, the Opaque Web is large, but not impenetrable. Determined searchers can often find material on the Opaque Web, and search engines are constantly improving their methods for locating and indexing Opaque Web material. The other three types of invisibility are more problematic, as we’ll see.


The Private Web
The Private Web consists of technically indexable Web pages that have deliberately been excluded from search engines. There are three ways that Webmasters can exclude a page from a search engine:
  • Password protect the page. A search engine spider cannot go past the form that requires a username and password.
  • Use the robots.txt file to disallow a search spider from accessing the page.
  • Use the “noindex” meta tag to prevent the spider from reading past the head portion of the page and indexing the body.
For the most part, the Private Web is of little concern to most searchers. Private Web pages simply use the public Web as an efficient delivery and access medium, but in general are not intended for use beyond the people who have permission to access the pages.

There are other types of pages that have restricted access that may be of interest to searchers, yet they typically aren’t included in search engine indices. These pages are part of the Proprietary Web, which we describe next.


The Proprietary Web
Search engines cannot for the most part access pages on the Proprietary Web, because they are only accessible to people who have agreed to special terms in exchange for viewing the content.
Proprietary pages may simply be content that’s only accessible to users willing to register to view them. Registration in many cases is free, but a search crawler clearly cannot satisfy the requirements of even the simplest registration process.

Examples of free proprietary Web sites include The New York Times, Salon’s “The Well” community, Infonautics’ “Company Sleuth” site, and countless others.

Other types of proprietary content are available only for a fee, whether on a per-page basis or via some sort of subscription mechanism. Examples of proprietary fee-based Web sites include the Electric Library, Northern Light’s Special Collection Documents, and The Wall Street Journal Interactive Edition.

Proprietary Web services are not the same as traditional online information providers, such as Dialog, LexisNexis, and Dow Jones. These services offer Web access to proprietary information, but use legacy database systems that existed long before the Web came into being. While the content offered by these services is exceptional, they are not considered to be Web or Internet providers.


The Truly Invisible Web
Some Web sites or pages are truly invisible, meaning that there are technical reasons that search engines can’t spider or index the material they have to offer. A definition of what constitutes a truly invisible resource must necessarily be somewhat fluid, since the engines are
constantly improving and adapting their methods to embrace new types of content. But at the end of 2001, truly invisible content consisted of several types of resources.

The simplest, and least likely to remain invisible over time, are Web pages that use file formats that current generation Web crawlers aren’t programmed to handle. These file formats include PDF, Postscript, Flash, Shockwave, executables (programs), and compressed files. There are two reasons search engines do not currently index these types of files.
  1. First, the files have little or no textual context, so it’s difficult to categorize them, or compare them for relevance to other text documents. The addition of metadata to the HTML container carrying the file could solve this problem, but it would nonetheless be the metadata description that got indexed rather than the contents of the file itself.
  2. The second reason certain types of files don’t appear in search indices is simply because the search engines have chosen to omit them. They can be indexed, but aren’t. You can see a great example of this in action with the ResearchIndex engine, which retrieves and indexes PDF, Postscript, and even compressed files in real time, creating a searchable database that’s specific to your query. AltaVista’s Search Engine product for creating local site search services is capable of indexing more than 250 file formats, but the flagship public search engine includes only a few of these formats. The barrier is typically a lack of willingness, not an inability to handle these file formats.
More problematic are dynamically generated Web pages. Again, in some cases, it’s not a technical problem but rather unwillingness on the part of the engines to index this type of content. This is particularly true when a non-interactive script is used to generate a page: such scripts produce static HTML that the engine could easily spider. The problem is that unscrupulous use of scripts can also lead crawlers into “spider traps” where the spider is literally trapped within a huge site of thousands, if not millions, of pages designed solely to spam the search engine. This is a major problem for the engines, so they’ve simply opted not to index URLs that contain script commands.

Finally, information stored in relational databases, which cannot be extracted without a specific query to the database, is truly invisible. Crawlers aren’t programmed to understand either the database structure or the command language used to extract information.

Now that you know the reasons that some types of content are effectively invisible to search engines, let’s move on and see how you can apply this knowledge to actual sites on the Web, and use this understanding to become a better searcher.


Visible or Invisible?


How can you determine whether what you need is found on the visible or Invisible Web? And why is this important? Learning the difference between visible and Invisible Web resources
is important because it will save you time, reduce your frustration, and often provide you with the best possible results for your searching efforts. It’s not critical that you immediately learn to determine whether a resource is visible or invisible—the boundary between visible and invisible sources isn’t always clear, and search services are continuing their efforts to make the invisible visible. Your ultimate goal should be to satisfy your information need in a timely manner using all that the Web has to offer.

The key is to learn the skills that will allow you to determine where you will likely find the best results—before you begin your search. With experience, you’ll begin to know ahead of time the types of resources that will likely provide you with best results for a particular type of search.


Navigation vs. Content Sites
Before you even begin to consider whether a site is invisible or not, it’s important to determine what kind of site you’re viewing. There are two fundamentally different kinds of sites on the Web:
  • Sites that provide content
  • Sites that facilitate Web navigation and resource discovery
All truly invisible sites are fundamentally providers of content, not portals, directories, or even search engines, though most of the major portal sites offer both content and navigation. Navigation sites may use scripts in the links they create to other sites, which may make them
appear invisible at first glance. But if their ultimate purpose is to provide links to visible Web content, they aren’t really Invisible Web sites because there’s no “there” there. Navigation sites using scripts are simply taking advantage of database technology to facilitate a process of
pointing you to other content on the Web, not to store deep wells of content themselves.

On the other hand, true Invisible Web sites are those where the content is stored in a database, and the only way of retrieving it is via a script or database access tool. How the content is made available is key—if the content exists in basic HTML files and is not password protected or restricted by the robots exclusion protocol, it is not invisible content. The content must be stored in the database and must only be accessible using the database interface for content to be truly invisible to search engines.

Some sites have both visible and invisible elements, which makes categorizing them all the more challenging. For example, the U.S. Library of Congress maintains one of the largest sites on the Web. Much of its internal navigation relies on sophisticated database query and retrieval tools. Much of its internal content is also contained within databases, making it effectively invisible. Yet the Library of Congress site also features many thousands of basic HTML pages that can be and have been indexed by the engines. Later we’ll look more closely at the Library of Congress site, pointing out its visible and invisible parts.

Some sites offer duplicate copies of their content, storing pages both in databases and as HTML files. These duplicates are often called “mirror” or “shadow” sites, and may actually serve as alternate content access points that are perfectly visible to search engines. The Education Resource Information Clearinghouse (ERIC) database of educational resource documents on the Web is a good example of a site that does this, with some materials in its database also appearing in online journals, books, or other publications.

In cases where visibility or invisibility is ambiguous, there’s one key point to remember: where you have a choice between using a general-purpose search engine or the query and retrieval tools offered by a particular site, you’re usually better off using the tools offered by the site. Local site search tools are often finely tuned to the underlying data; because they search only that data, they won’t include the “noise” you’ll invariably get in results from a general search engine. That said, let’s take a closer look at how you can tell the difference between visible and Invisible Web sites and pages.


Direct vs. Indirect URLs

The easiest way to determine if a Web page is part of the Invisible Web is to examine its URL. Most URLs are direct references to a specific Web page. Clicking a link containing a direct URL causes your browser to explicitly request and retrieve a specific HTML page. A search engine crawler follows exactly the same process, sending a request to a Web server to retrieve a specific HTML page.

Examples of direct URLs:
  • http://www.yahoo.com (points to Yahoo!’s home page)
  • http://www.invisible-web.net/about.htm (points to the information page of the companion Web site for this text)
  • http://www.forbes.com/forbes500/ (points to the top-level page for the Forbes 500 database. Though this page is visible, the underlying database is an Invisible Web resource)
Indirect URLs, on the other hand, often don't point to a specific physical page on the Web. Instead, they contain information that will be executed by a script on the server, and this script is what generates the page you ultimately end up viewing. Search engine crawlers typically won't follow URLs that appear to contain calls to scripts. The key tip-offs that a page can't or won't be crawled are symbols or words indicating that the page will be dynamically generated by assembling its component parts from a database. The most common such symbol is the question mark, but be careful: although question marks often signal scripts that generate dynamic pages, they are also frequently used simply as "flags" to alert the server that additional information is being passed along in the variables that follow. These variables can be used to track your route through a site, represent items in a shopping cart, and serve many other purposes that have nothing to do with Invisible Web content. URLs containing the words "cgi-bin" or "javascript" will typically also execute a script to generate a page, but you can't assume that a page is invisible on this evidence alone; it's important to investigate further.

Examples of indirect URLs:
  • http://us.imdb.com/Name?Hitchcock,+Alfred (points to the listing for Alfred Hitchcock in the Internet Movie Database)
  • http://www.sec.gov/cgi-bin/srch-edgar?cisco+adj+systems (points to a page showing results for a search on Cisco Systems in the SEC EDGAR database)
  • http://adam.ac.uk/ixbin/hixserv?javascript:go_to('0002',current_level+1) (points to a top-level directory in the ADAM Art Resources database)
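To make these tip-offs concrete, here is a minimal Python sketch (our own illustration, not part of any actual crawler) that flags URLs showing the warning signs discussed above. The marker list and the function name are assumptions chosen for the example; a flagged URL is only a candidate for invisibility and still needs the further checks described below.

from urllib.parse import urlparse

# Words that often signal a script will be executed to build the page
DYNAMIC_MARKERS = ("cgi-bin", "javascript")

def looks_indirect(url):
    """Return True if a URL shows the tell-tale signs of dynamically generated content."""
    if urlparse(url).query:              # anything after a "?" may be a script call or just a flag
        return True
    return any(marker in url.lower() for marker in DYNAMIC_MARKERS)

# The IMDb and EDGAR examples above are flagged; Yahoo!'s home page is not.
for url in ("http://www.yahoo.com",
            "http://us.imdb.com/Name?Hitchcock,+Alfred",
            "http://www.sec.gov/cgi-bin/srch-edgar?cisco+adj+systems"):
    print(url, "->", "possibly invisible" if looks_indirect(url) else "likely visible")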

The URL Test
If a URL appears to be indirect, and looks like it might execute a script, there’s a relatively easy test to determine if the URL is likely to be crawled or not.
  1. Place the cursor in the address window immediately to the left of the question mark, and erase the question mark and everything to the right of it.
  2. Then press your computer’s Enter key to force your browser to attempt to fetch this fragment of the URL.
  3. Does the page still load as expected? If so, it’s a direct URL. The question mark is being used as a flag to pass additional information to the server, not to execute a script. The URL points to a static HTML page that can be crawled by a search engine spider.
  4. If a page other than the one you expected appears, or you see some sort of error message, it likely means that the information after the question mark is needed by a script in order to dynamically generate the page. Without that information, the server doesn't know what data to fetch from the database to create the page, so these URLs represent content that is part of the Invisible Web. Note carefully: most crawlers are technically able to read past the question mark and fetch the page, just as your browser can, but they decline to do so for fear of spider traps.
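If you'd rather automate the test, the following rough Python sketch mirrors the steps above: it strips the question mark and everything after it, fetches the bare URL, and reports what the server returns. This is our own illustration rather than anything from the book's companion site, and since a server can happily return a page that is not the one you expected, step 4 still calls for looking at the result yourself.

import urllib.error
import urllib.request

def url_test(url):
    """Strip the query string and see whether the bare URL still returns a page."""
    bare = url.split("?", 1)[0]                                       # step 1: erase the "?" and everything after it
    try:
        with urllib.request.urlopen(bare, timeout=10) as response:    # step 2: fetch the fragment
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code
    except urllib.error.URLError:
        return "could not connect; inspect by hand"
    if status == 200:                                                 # step 3: the bare URL loads
        return "bare URL loads; likely a direct, crawlable URL"
    return "server returned %d; the query string is probably needed (Invisible Web)" % status

print(url_test("http://www.sec.gov/cgi-bin/srch-edgar?cisco+adj+systems"))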
Sometimes it's trickier to determine whether a URL points to content that will be generated dynamically. Many browsers save information about a page in variables that are hidden from the user. Clicking "refresh" may simply send the data used to build the page back to the server, recreating it; alternately, the page may have been cached on your computer. The best way to test URLs that you suspect are invisible is to start up another instance of your browser, cut and paste the URL into the new browser's address box, and try to load the page. The new instance of the browser won't have the same previously stored information, so you'll likely see a different page or an error message if the page is invisible. Browsable directories, given their hierarchical layout, may appear at first glance to be part of the visible Web. Test the links in these directories by simply holding your cursor over a link and examining its structure. If the links have question marks indicating that scripts generate the new pages, you have a situation where the top level of the directory, including its links and annotations, may be visible, but the deeper directory pages it links to are invisible. In other words, the directory's own content beyond the top level is invisible, while the external resources those pages ultimately point to may not be. Human Resources Development Canada's Labour Market Information directory is an example of this phenomenon. It's important to do these tests, because to access most material on the Invisible Web you'll need to go directly to the site providing it. Many huge, content-specific sites may at first glance appear to be part of the Invisible Web, when in fact they're nothing more than specialized search sites. Let's look at this issue in more detail.


Specialized vs. Invisible
There are many specialized search directories on the Web that share characteristics of an Invisible Web site, but are perfectly visible to the search engines. These sites often are structured as hierarchical directories, designed as navigation hubs for specific topics or categories of information, and usually offer both sophisticated search tools and the ability to browse a structured directory. But even if these sites consist of hundreds, or even thousands, of HTML pages, many aren't part of the Invisible Web, since search engine spiders generally have no problem finding and retrieving the pages. In fact, these sites typically have an extensive internal link structure that makes the spider's job even easier. That said, remember our earlier warning about depth of crawl: just because a site is easy to index doesn't mean that search engines have spidered it thoroughly or recently.

Many sites that claim to have large collections of invisible or “deep” Web content actually include many specialized search services that are perfectly visible to search spiders. They make the mistake of equating a sophisticated search mechanism with invisibility. Don’t get us wrong—we’re all in favor of specialized sites that offer powerful search tools and robust interfaces. It’s just that many of these specialized sites aren’t invisible, and to label them as such is misleading.

For example, we take issue with a highly publicized study performed by Bright Planet claiming that the Invisible Web is currently 400 to 550 times larger than the commonly defined World Wide Web (Bright Planet, 2000). Many of the search resources cited in the study are excellent specialized directories, but they are perfectly visible to search engines. Bright Planet also includes in its estimates ephemeral data, such as weather and astronomy measurements, that serves little practical purpose for searchers. Excluding specialized search tools and data irrelevant to searchers, we estimate that the Invisible Web is between 2 and 50 times larger than the visible Web.

How can you tell the difference between a specialized search resource and an Invisible Web resource? Always start by browsing the directory, not searching. Search programs, by their nature, use scripts, and often return results that contain indirect URLs. This does not mean, however, that the site is part of the Invisible Web. It's simply a byproduct of how some search tools function.
  • As you begin to browse the directory, click on category links and drill down to a destination URL that leads away from the directory itself. As you’re clicking, examine the links. Do they appear to be direct or indirect URLs? Do you see the telltale signs of a script being executed? If so, the page is part of the Invisible Web—even if the destination URLs have no question marks. Why? Because crawlers wouldn’t have followed the links to the destination URLs in the first place.
  • But if, as you drill down the directory structure, you notice that all of the links are direct URLs, the site is almost certainly part of the visible Web, and can be crawled and indexed by search engines.
This may sound confusing, but it’s actually quite straightforward. To illustrate this point, let’s look at some examples in several categories. We’ll put an Invisible Web site side-by-side with a high-quality specialized directory and compare the differences between them.


Visible vs. Invisible
The Gateway to Educational Materials Project is a directory of collections of high-quality educational resources for teachers, parents, and others involved in education. The Gateway features annotated links to more than 12,000 education resources.
  • Structure: Searchable directory, part of the Visible Web. Browsing the categories reveals all links are direct URLs. Although the Gateway’s search tool returns indirect URLs, the direct URLs of the directory structure and the resulting offsite links provide clear linkages for search engine spiders to follow.
AskERIC allows you to search the ERIC database, the world’s largest source of education information. ERIC contains more than one million citations and abstracts of documents and journal articles on education research and practice.
  • Structure: Database, limited browsing of small subsets of the database available. These limited browsable subsets use direct URLs; the rest of the ERIC database is only accessible via the AskERIC search interface, making the contents of the database effectively invisible to search engines.
Very important point: Some of the content in the ERIC database also exists in the form of plain HTML files; for example, articles published in the ERIC digest. This illustrates one of the apparent paradoxes of the Invisible Web. Just because a document is located in an Invisible Web database doesn’t mean there aren’t other copies of the document existing elsewhere on visible Web sites. The key point is that the database containing the original content is the authoritative source, and searching the database will provide the highest probability of retrieving a document. Relying on a general-purpose search engine to find documents that may have copies on visible Web sites is unreliable.

The International Trademark Association (INTA) Trademark Checklist is designed to assist authors, writers, journalists/editors, proofreaders, and fact checkers with proper trademark usage. It includes listings for nearly 3,000 registered trademarks and service marks with their generic terms and indicates capitalization and punctuation.
  • Structure: Simple HTML pages, broken into five extensively cross-linked pages of alphabetical listings. The flat structure of the pages combined with the extensive crosslinking make these pages extremely visible to the search engines.
The Delphion Intellectual Property Network allows you to search for, view, and analyze patent documents and many other types of intellectual property records. It provides free access to a wide variety of data collections and patent information including United States patents, European patents and patent applications, PCT application data from the World Intellectual Property Office, Patent Abstracts of Japan, and more.
  • Structure: Relational database, browsable, but links are indirect and rely on scripts to access information from the database. Data contained in the Delphion Intellectual Property Network database is almost completely invisible to Web search engines.
Key point: Patent searching and analysis is a very complex process. The tools provided by the Delphion Intellectual Property Network are finely tuned to help patent researchers home in on only the most relevant information pertaining to their search, excluding all else. Search engines are simply inappropriate tools for searching this kind of information. In addition, new patents are issued weekly or even daily, and the Delphion Intellectual Property Network is constantly refreshed. Search engines, with gaps of a month or more between recrawls of a Web site, couldn't possibly keep up with this flood of new information.

Hoover’s Online offers in-depth information for businesses about companies, industries, people, and products. It features detailed profiles of hundreds of public and private companies.
  • Structure: Browsable directory with powerful search engine. All pages on the site are simple HTML; all links are direct (though the URLs appear complex). Note: some portions of Hoover’s are only available to subscribers who pay for premium content.
Thomas Register features profiles of more than 155,000 companies, including American and Canadian companies. The directory also allows searching by brand name, product headings, and even some supplier catalogs. As an added bonus, material on the Thomas Register
site is updated constantly, rather than on the fixed update schedules of the printed version.
  • Structure: Database access only. Further, access to the search tool is available to registered users only. This combination of database-only access available to registered users puts the Thomas Register squarely in the universe of the Invisible Web.
WebMD aggregates health information from many sources, including medical associations, colleges, societies, government agencies, publishers, private and non-profit organizations, and for-profit corporations.
  • Structure: The WebMD site features a browsable table of contents to access its data, using both direct links and JavaScript-based relative links to many of the content areas on the site. However, the site also provides a comprehensive site map using direct URLs, allowing search engine spiders to index most of the site.
The National Health Information Center’s Health Information Resource Database includes 1,100 organizations and government offices that provide health information upon request. Entries include contact information, short abstracts, and information about publications and services that the organizations provide.
  • Structure: You may search the database by keyword, or browse the keyword listing of resources in the database. Each keyword link is an indirect link to a script that searches the database for results. The database is entirely an Invisible Web site.
As these examples show, it’s relatively easy to determine whether a resource is part of the Invisible Web or not by taking the time to examine its structure. Some sites, however, can be virtually impossible to classify since they have both visible and invisible elements. Let’s look at
an example.


The Library of Congress Web Site: Both Visible and Invisible
The U.S. Library of Congress is the largest library in the world, so it’s fitting that its site is also one of the largest on the Web. The site provides a treasure trove of resources for the searcher. In fact, it’s hard to even call it a single site, since several parts have their own domains or subdomains.

The library's home page has a simple, elegant design with links to the major sections of the site. Mousing over the links to all of the sections reveals only one that might be invisible: the link to the America's Library site. If you follow the link to the American Memory collection, you see a screen that allows you to access more than 80 collections featured on the site. Some of the links, such as those to "Today in History" and the "Learning Page," are direct URLs that branch to simple HTML pages. However, if you select the "Collection Finder" you're presented with a directory-type menu for all of the topics in the collection. Each of the links on this page is not only indirect but also contains a large amount of information used to create new dynamic pages. However, once those pages are created, they include mostly direct links to simple HTML pages.

The point of this exercise is to demonstrate that even though the ultimate content available at the American Memory collection consists of content that is crawlable, following the links from the home page leads to a “barrier” in the form of indirect URLs on the Collection Finder directory page. Because they generally don’t crawl indirect URLs, most crawlers would simply stop spidering once they encounter those links, even though they lead to perfectly acceptable content.

Though this makes much of the material in the American Memory collection technically invisible, it’s also probable that someone outside of the Library of Congress has found the content and linked to it, allowing crawlers to access the material despite the apparent roadblocks. In other words, any Web author who likes content deep within the American Memory collection is free to link to it—and if crawlers find those links on the linking author’s page, the material may ultimately be crawled, even if the crawler couldn’t access it through the “front door.” Unfortunately, there’s no quick way to confirm that content deep within a major site like the Library of Congress has been crawled in this manner, so the searcher should utilize the Library’s own internal search and directory services to be assured of getting the best possible results.


The Robots Exclusion Protocol
Many people assume that all Webmasters want their sites indexed by search engines. This is not the case. Many sites that feature timely content that changes frequently do not want search engines to index their pages. If a page changes daily and a crawler only visits the page monthly, the result is essentially a permanently inaccurate page in a search index. Some sites make content available for free for only a short period before moving it into archives that are available to paying customers only—the online versions of many newspaper and media sites are good examples of this.

To block search engine crawlers, Webmasters employ the Robots Exclusion Protocol. This is simply a set of rules that enables a Webmaster to tell a crawler which parts of a server are off-limits. The Webmaster creates a list of files or directories that should not be crawled or indexed, and saves this list in a file called robots.txt. CNN, the Canadian Broadcasting Corporation, the London Times, and the Los Angeles Times all use robots.txt files to exclude some or all of their content from crawling.

Here’s an example of the robots.txt file used by the Los Angeles Times:
User-agent: *
Disallow: /RealMedia
Disallow: /archives
Disallow: /wires/
Disallow: /HOME/
Disallow: /cgi-bin/
Disallow: /class/realestate/dataquick/dqsearch.cgi
Disallow: /search
The User-agent field specifies which spiders must pay attention to the instructions that follow. The asterisk (*) is a wildcard, meaning all crawlers must read and respect the contents of the file. Each "Disallow" command is followed by the name of a specific directory or file on the Los Angeles Times Web server that spiders are prohibited from accessing and crawling. In this case, the spider is blocked from reading streaming media files, archives, real estate listings, and so on.
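Python's standard urllib.robotparser module reads these files the same way a polite crawler does. The sketch below is a generic illustration (the live file at the URL shown may have changed since the example above was captured), and it requires network access to run.

from urllib.robotparser import RobotFileParser

# Download and parse a site's robots.txt, then ask whether a crawler may fetch specific URLs.
rp = RobotFileParser()
rp.set_url("http://www.latimes.com/robots.txt")
rp.read()

# "*" matches the wildcard User-agent rule shown in the example above.
print(rp.can_fetch("*", "http://www.latimes.com/archives/some-story.html"))  # False if /archives is disallowed
print(rp.can_fetch("*", "http://www.latimes.com/news/some-story.html"))      # True if no rule blocks /news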

It’s also possible to prevent a crawler from indexing a specific page by including a “noindex” meta tag instruction in the “head” portion of the document. Here’s an example:
<head>
<title>Keep Out, Search Engines!</title>
<meta name="robots" content="noindex, nofollow">
</head>
Either the robots.txt file or the noindex meta tag can be used to block crawlers. The main difference between the two is that the noindex meta tag is page specific, while a robots.txt file can be used to prevent indexing of individual pages, groups of files, or even entire Web sites.
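Here is a small, hedged sketch of how a crawler might check a fetched page for that meta tag, using Python's built-in html.parser module. The class name and the sample page are our own, and a production crawler would of course do considerably more.

from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the content of any <meta name="robots"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append(attrs.get("content") or "")

page = ('<head><title>Keep Out, Search Engines!</title>'
        '<meta name="robots" content="noindex, nofollow"></head>')
finder = RobotsMetaFinder()
finder.feed(page)
print(finder.directives)   # ['noindex, nofollow'] -- a well-behaved crawler would skip this page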

As you can see, it's important to look closely at a site and its structure to determine whether it's visible or invisible. One of the wonderful things many Invisible Web resources can do is help you focus your search and let you manipulate a "subject oriented" database in ways that would not be possible with a general-purpose search tool. Many resources allow you to organize your results by various criteria, or are much more up to date than a general search tool or the print versions of the same material. For example, the lists published by Forbes and Fortune give the searcher all kinds of ways to sort, limit, or filter data, which is simply impossible with the print-based versions. You also get a much smaller haystack of "focused" data to search through to find the necessary "needles" of information. In a later section we'll show you some specific cases where resources on the Invisible Web provide a superior, if not the only, means of locating important and dependable information online.


Using the Invisible Web


How do you decide when the Invisible Web is likely to be your best source for the information you’re seeking? After all, Invisible Web resources aren’t always the solution for satisfying an information need. Although we’ve made a strong case for the value of the resources available on the Invisible Web, we’re not suggesting that you abandon the general-purpose search engines like AltaVista, HotBot, and Google. Far from it! Rather, we’re advocating that you gain an understanding of what’s available on the Invisible Web to make your Web searching time more efficient. By expanding the array of tools available to you, you’ll learn to select the best available tool for every particular searching task.

In this section, we’ll examine the broad issue of why you might choose to use Invisible Web resources instead of a general-purpose search engine or Web directory. Then we’ll narrow our focus and look at specific instances of when to use the Invisible Web. To illustrate these specifics, we’ve compiled a list of 25 categories of information where you’ll likely get the best results from Invisible Web resources. Then we’ll look at what’s not available on the Web, visible or Invisible.

It’s easy to get seduced by the ready availability and seeming credibility of online information. But just as you would with print materials, you need to evaluate and assess the quality of the information you find on the Invisible Web. Even more importantly, you need to watch out for bogus or biased information that’s put online by charlatans more interested in pushing their own point of view than publishing accurate information.

The Invisible Web, by its very nature, is highly dynamic. What is true on Monday might not be accurate on Thursday. Keeping current with the Invisible Web and its resources is one of the biggest challenges faced by the searcher. We’ll show you some of the best sources for keeping up with the rapidly changing dynamics of the Invisible Web. Finally, as you begin your own exploration of the Invisible Web, you should begin to assemble your own toolkit of trusted resources. As your personal collection of Invisible Web resources grows, your confidence in choosing the appropriate tool for every search task will grow in equal proportions.


Why Use the Invisible Web?
General-purpose search engines and directories are easy to use, and respond rapidly to information queries. Because they are so accessible and seemingly all-powerful, it's tempting to simply fire up your favorite Web search engine, punch in a few keywords that are relevant to your search, and hope for the best. But the general-purpose search engines are essentially mass audience resources, designed to provide something for everyone. Invisible Web resources tend to be more focused, and often provide better results for many information needs. Consider how a publication like Newsweek would cover a story on Boeing compared to an aviation industry trade magazine such as Aviation Week and Space Technology, or how a general newsmagazine like Time would cover a story on currency trading vs. a business magazine like Forbes or Fortune.

In making the decision whether to use an Invisible Web resource, it helps to consider the point of view of both the searcher and the provider of a search resource. The goal for any searcher is relatively simple: to satisfy an information need in a timely manner.

Of course, providers of search resources also strive to satisfy the information needs of their users, but they face other issues that complicate the equation. For example, there are always conflicts between speed and accuracy. Searchers demand fast results, but if a search engine has a large, comprehensive index, returning results quickly may not allow for a thorough search of the database.
For general-purpose search engines, there’s a constant tension between finding the correct answer vs. finding the best answer vs. finding the easiest answer. Because they try to satisfy virtually any information need, general-purpose search engines resolve these conflicts by making compromises. It costs a significant amount of money to crawl the Web, index pages, and handle search queries. The bottom line is that general-purpose search engines are in business to make a profit, a goal that often works against the mission to provide comprehensive results for searchers with a wide variety of information needs. On the other hand, governments, academic institutions, and other organizations that aren’t constrained by a profit-making motive operate many Invisible Web resources. They don’t feel the same pressures to be everything to everybody. And they can often afford to build comprehensive search resources that allow searchers to perform exhaustive research within a specific subject area, and keep up-to-date and current.

Why select an Invisible Web resource over a general-purpose search engine or Web directory? Here are several good reasons:
  • Specialized content focus = more comprehensive results. Like the focused crawlers and directories, Invisible Web resources tend to be focused on specific subject areas. This is particularly true of the many databases made available by government agencies and academic institutions. Your search results from these resources will be more comprehensive than those from most visible Web resources for two reasons.
  1. First, there are generally no limits imposed by databases on how quickly a search must be completed—or if there are, you can generally select your own time limit that will be reached before a search is cut off. This means that you have a much better chance of having all relevant results returned, rather than just those results that were found fastest.
  2. Second, people who go to the trouble of creating a database-driven information resource generally try to make the resource as comprehensive as possible, including as many relevant documents as they are able to find. This is in stark contrast to general-purpose search engine crawlers, which often arbitrarily limit the depth of crawl for a particular Web site. With a database, there is no depth of crawl issue—all documents in the database will be searched by default.
  • Specialized search interface = more control over search input and output. Here's a question to get you thinking. Let's assume that everything on the Web could be located and accessed via a general search tool like Google or HotBot. How easy and efficient would it be to use one of these general-purpose engines when a specialized tool was available? Would you begin a search for a person's phone number with a search of an encyclopaedia? Of course not. Likewise, even if the general-purpose search engines suddenly provided the capability to find specialized information, they still couldn't compete with search services specifically designed to find and easily retrieve specialized information. Put differently, searching with a general-purpose search engine is like using a shotgun, whereas searching with an Invisible Web resource is more akin to taking a highly precise rifle-shot approach.
    As an added bonus, most databases provide customized search fields that are subject-specific. History databases will allow limiting searches to particular eras, for example, and biology databases to particular species or genomic parameters. Invisible Web databases also often provide extensive control over how results are formatted. Would you like documents to be sorted by relevance, by date, by author, or by some other criterion of your own choosing? Contrast this flexibility with the general-purpose search engines, where what you see is what you get.
  • Increased precision and recall. Consider two informal measures of search engine performance: recall and precision.


    Recall represents the total number of relevant documents retrieved in response to a search query, divided by the total number of relevant documents in the search engine’s entire index. One hundred percent recall means that the search engine was able to retrieve every document in its index that was relevant to the search terms.
    Measuring recall alone isn’t sufficient, however, since the engine could always achieve 100 percent recall simply by returning every document in its index. Recall is balanced by precision.


    Precision is the number of relevant documents retrieved divided by the total number of documents retrieved. If 100 pages are retrieved, and only 20 are relevant, the precision is 20/100, or 20 percent.
    Relevance, unfortunately, is strictly a subjective measure. The searcher ultimately determines relevance after fully examining a document and deciding whether it meets the information need. To maximize potential relevance, search engines strive to maximize recall and precision simultaneously. In practice, this is difficult to achieve. As the size of a search engine index increases, there are likely to be more relevant documents for any given query, leading to a higher recall percentage. As recall increases, precision tends to decrease, making it harder for the searcher to locate relevant documents. Because they are often limited to specific topics or subjects, many Invisible Web and specialized search services offer greater precision even while increasing total recall. Narrowing the domain of information means there is less extraneous or irrelevant information for the search engine to process. Because Invisible Web resources tend to have smaller databases, recall can be high while still offering a great deal of precision, leading to the best of all possible worlds: higher relevance and greater value to the searcher.
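To keep the two measures straight, here is a tiny worked example in Python with made-up numbers: suppose the index holds 50 documents relevant to a query, the engine returns 100 documents, and 20 of those turn out to be relevant.

# Hypothetical numbers, for illustration only
relevant_in_index = 50       # relevant documents in the entire index
retrieved = 100              # documents the engine returned
relevant_retrieved = 20      # returned documents that were actually relevant

recall = relevant_retrieved / relevant_in_index    # 20 / 50  = 0.40 -> 40 percent recall
precision = relevant_retrieved / retrieved         # 20 / 100 = 0.20 -> 20 percent precision

print("Recall: %.0f%%, Precision: %.0f%%" % (recall * 100, precision * 100))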

  • Invisible Web resources = highest level of authority. Many Invisible Web resources are maintained by institutions or organizations with a legitimate claim to being an unquestioned authority on a particular subject. Unlike with many sites on the visible Web, it's relatively easy to determine the authority of most Invisible Web sites. Most offer detailed information about the credentials of the people responsible for maintaining the resource. Others feature awards, citations, or other symbols of recognition from acknowledged subject authorities. Many Invisible Web resources are produced by book or journal publishers with sterling reputations among libraries and scholars.

  • The answer may not be available elsewhere. The explosive growth of the Web, combined with the relative ease of finding many things online, has led to the widely held but wildly inaccurate belief that “if it’s not on the Web, it’s not online.” There are a number of reasons this belief simply isn’t true. For one thing, there are vast amounts of information available exclusively via Invisible Web resources. Much of this information is in databases, which can’t be directly accessed by search engines, but it is definitely online and often freely available.

When to Use the Invisible Web
It's not always easy to know when to use an Invisible Web resource as opposed to a general search tool. As you become more familiar with the landscape of the Invisible Web, though, you can apply several rules of thumb when deciding whether to turn to an Invisible Web resource.
  • When you’re familiar with a subject. If you know a particular subject well, you’ve likely already discovered one or more Invisible Web resources that offer the kind of information you need. Familiarity with a subject also offers another advantage: knowledge of which search terms will find the “best” results in a particular search resource, as well as methods for locating new resources.
  • When you’re familiar with specific search tools. Some Invisible Web resources cover multiple subjects, but since they often offer sophisticated interfaces you’ll still likely get better results from them compared to general-purpose search tools. Restricting your search through the use of limiters, Boolean logic, or other advanced search functions generally makes it easier to pull a needle from a haystack.
  • When you're looking for a precise answer. When you're looking for a simple answer to a question, the last thing you want is a list of hundreds of possible results. Yet an abundance of potential answers is exactly what you'll end up with if you use a general-purpose search engine, and you'll have to spend time scanning the result list to find what you need. Many Invisible Web resources are designed to perform what are essentially lookup functions, for when you need a particular fact, phone number, name, bibliographic record, and so on.
  • When you want authoritative, exhaustive results. General-purpose search engines will never be able to return the kind of authoritative, comprehensive results that Invisible Web resources can. Depth of crawl, timeliness, and the lack of selective filtering fill any result list from a general-purpose engine with a certain amount of noise. And, because the haystack of the Web is so huge, a certain number of authoritative documents will inevitably be overlooked.
  • When timeliness of content is an issue. Invisible Web resources are often more up-to-date than general-purpose search engines and directories.


Top 25 Invisible Web Categories
To give you a sense of what's available on the Invisible Web, we've put together a list of categories where, in general, you'll be far better off searching an Invisible Web resource than a general-purpose search engine. Our purpose here is simply to provide a quick overview of each category, noting one or two good Invisible Web resources for each. Detailed descriptions of and annotated links to many more resources for all of these categories can be found in the online directory available at the companion Web site.
  1. Public Company Filings. The U.S. Securities and Exchange Commission (SEC) and regulators of equity markets in many other countries require publicly traded companies to file certain documents on a regular schedule or whenever an event may have a material effect on the company. These documents are available in a number of locations, including company Web sites. While many of these filings may be visible and findable by a general-purpose search engine, a number of Invisible Web services have built comprehensive databases incorporating this information. FreeEDGAR, 10K Wizard, and SEDAR are examples of services that offer sophisticated searching and limiting tools as well as the assurance that the database is truly comprehensive. Some also offer free e-mail alert services to notify you that the companies you choose to monitor have just filed reports.
  2. Telephone Numbers. Just as telephone white pages serve as the quickest and most authoritative offline resource for locating telephone numbers, a number of Invisible Web services exist solely to find telephone numbers. InfoSpace, Switchboard.com, and AnyWho offer additional capabilities like reverse-number lookup or correlating a phone number with an e-mail address. Because these databases vary in currency it is often important to search more than one to obtain the most current information.
  3. Customized Maps and Driving Directions. While some search engines, like Northern Light, have a certain amount of geographical "awareness" built in, none can actually generate a map of a particular street address and its surrounding neighborhood. Nor do they have the capability to take a starting and ending address and generate detailed driving directions, including exact distances between landmarks and estimated driving time (nowadays all of that is possible; translator's note). Invisible Web resources such as Mapblast and Mapquest are designed specifically to provide these interactive services.
  4. Clinical Trials. Clinical trials by their very nature generate reams of data, most of which is stored from the outset in databases. For the researcher, sites like the New Medicines in Development database are essential. For patients searching for clinical trials to participate in, ClinicalTrials.gov and CenterWatch’s Clinical Trials Listing Service are invaluable.
  5. Patents. Thoroughness and accuracy are absolutely critical to the patent searcher. Major business decisions involving significant expense or potential litigation often hinge on the details of a patent search, so using a general-purpose search engine for this type of search is effectively out of the question. Many government patent offices maintain Web sites, but Delphion’s Intellectual Property Network allows full-text searching of U.S. and European patents and abstracts of Japanese patents simultaneously. Additionally, the United States Patent Office provides patent information dating back to 1790, as well as U.S. Trademark data.
  6. Out of Print Books. The growth of the Web has proved to be a boon for bibliophiles. Countless out of print booksellers have established Web sites, obliterating the geographical constraints that formerly limited their business to local customers. Simply having a Web presence, however, isn’t enough. Problems with depth of crawl issues, combined with a continually changing inventory, make catalog pages from used booksellers obsolete or inaccurate even if they do appear in the result list of a general-purpose search engine. Fortunately, sites like Alibris and Bibliofind allow targeted searching over hundreds of specialty and used bookseller sites.
  7. Library Catalogs. There are thousands of Online Public Access Catalogs (OPACs) available on the Web, from national libraries like the U.S. Library of Congress and the Bibliothèque Nationale de France to academic libraries, local public libraries, and many other important archives and repositories. OPACs allow searches for books in a library by author, title, subject, keywords, or call number, often providing other advanced search capabilities. webCATS, Library Catalogs on the World Wide Web (now at http://www.lights.ca/webcats/) is an excellent directory of OPACs around the world. OPACs are great tools for verifying the title or author of a book.
  8. Authoritative Dictionaries. Need a word definition? Go directly to an authoritative online dictionary. Merriam-Webster’s Collegiate and the Cambridge International Dictionary of English are good general dictionaries. Scores of specialized dictionaries also provide definitions of terms from fields ranging from aerospace to zoology. Some Invisible Web dictionary resources even provide metasearch capability, checking for definitions in hundreds of online dictionaries simultaneously. OneLook is a good example.
  9. Environmental Information. Need to know who’s a major polluter in your neighborhood? Want details on a specific country’s position in the Kyoto Treaty? Try the Envirofacts multiple database search.
  10. Historical Stock Quotes. Many people consider stock quotes to be ephemeral data, useful only for making decisions at a specific point in time. Stock market historians and technical analysts, however, can use historical data to compile charts of trends that some even claim have a certain amount of predictive value. There are numerous resources available that contain this information. One of our favorites is from BigCharts.com.
  11. Historical Documents and Images. You’ve seen that general-purpose search engines don’t handle images well. This can be a problem with historical documents, too, as many historical documents exist on the Web only as scanned images of the original. The U.S. Library of Congress American Memory Project is a wonderful example of a continually expanding digital collection of historical documents and images. The American Memory Project also illustrates that some data in a collection may be “visible” while other portions are “invisible.”
  12. Company Directories. Competitive intelligence has never been easier thanks to the Web. We wrote about Hoover’s and the Thomas Register. There are numerous country or region specific company directories, including the Financial Times’ European Companies Premium Research (http://www.globalarchive.ft.com/cb/cb_search.html) and
    Wright Investors’ Services (http://profiles.wisi.com/profiles/comsrch.htm).
  13. Searchable Subject Bibliographies. Bibliographies are gold mines for scholars and other researchers. Because bibliographies generally conform to rigid formats specified by the MLA or the APA, most are stored in searchable online databases, covering subjects ranging from Architecture to Zoology. The Canadian Music Periodical Index provided by the National Library of Canada is a good example, as it contains a vast number of citations.
  14. Economic Information. Governments and government agencies employ entire armies of statisticians to monitor the pulse of economic conditions. This data is often available online, but rarely in a form visible to most search engines. RECON-Regional Economic Conditions is an interactive database from the Federal Deposit Insurance Corporation that illustrates this point.
  15. Award Winners. Who won the Nobel Peace Prize in 1937? You might be able to learn that it was Viscount Cecil of Chelwood (Lord Edgar Algernon Robert Gascoyne Cecil) via a general-purpose search engine, but the Nobel e-museum site will provide the definitive answer. Other Invisible Web databases have definitive information on major winners of awards ranging from the Oscars (http://www.oscars.org/awards_db/) to the Peabody Awards (http://www.peabody.uga.edu/recipients/search.html).
  16. Job Postings. Looking for work? Or trying to find the best employee for a job opening in your company? Good luck finding what you're looking for using a general-purpose search engine. You'll be far better off searching one of the many job-posting databases, such as CareerBuilder.com, the contents of which are part of the Invisible Web. Better yet, try one of our favorites, the oddly named Flipdog. Flipdog is unique in that it scours both company Web sites and other job posting databases to compile what may be the most extensive collection of job postings and employment offers available on the Web.
  17. Philanthropy and Grant Information. Show me the money! If you’re looking to give or get funding, there are literally thousands of clearinghouses on the Invisible Web that exist to match those in need with those willing and able to give. The Foundation Finder from the Foundation Center is an excellent place to begin your search.
  18. Translation Tools. Web-based translation services are not search tools in their own right, but they provide a valuable service when a search has turned up documents in a language you don't understand. Translation tools accept a URL, fetch the underlying page, translate it into the desired language, and deliver it as a dynamic document. AltaVista provides such a service. Note that these tools have many limitations and translation errors are frequent; while far from perfect, they will continue to improve with time. Another example of an Invisible Web translation tool is EuroDicAutom, described as "the multilingual terminological database of the European Commission's Translation Service."
  19. Postal Codes. Even though e-mail is rapidly overtaking snail mail as the world’s preferred method of communication, we all continue to rely on the postal service from time to time. Many postal authorities such as the Royal Mail in the United Kingdom provide postal code look-up tools.
  20. Basic Demographic Information. Demographic information from the U.S. Census and other sources can be a boon to marketers or anyone needing details about specific communities. One of many excellent starting points is the American FactFinder. The utility this site provides seems almost endless!
  21. Interactive School Finders. Before the Web, finding the right university or graduate school often meant a trek to the library and hours scanning course catalogs. Now it’s easy to locate a school that meets specific criteria for academic programs, location, tuition costs, and many other variables. Peterson’s GradChannel is an excellent example of this type of search resource for students, offered by a respected provider of school selection data.
  22. Campaign Financing Information. Who's really buying (or stealing) the election? Now you can find out by accessing the actual forms filed by anyone contributing to a major campaign. The Federal Elections Commission provides several databases (http://www.fec.gov/finance_reports.html), while a private concern called Fecinfo.com "massages" government-provided data for greater utility. Fecinfo.com has a great deal of free material available in addition to several fee-based resources. Many states are also making this type of data available.
  23. Weather Data. If you don’t trust your local weatherman, try an Invisible Web resource like AccuWeather. This extensive resource offers more than 43,000 U.S. 5-day forecasts, international forecasts, local NEXRAD Doppler radar images, customizable personal pages, and fee-based premium services. Weather information clearly illustrates the vast amount of real-time data available on the Internet that the general search tools do not crawl. Another favorite is Automated Weather Source. This site allows you to view local weather conditions in real-time via instruments placed at various sites (often located at schools) around the country.
  24. Product Catalogs. It can be tricky to determine whether pages from many product catalogs are visible or invisible. One of the Web’s largest retailers, Amazon.com, is largely a visible Web site. Some general-purpose search engines include product pages from Amazon.com’s catalogs in their databases, but even though this information is visible, it may not be relevant for most searches. Therefore, many engines either demote the relevance ranking of product pages or ignore them, effectively rendering them invisible.
    However, in some cases general search tools have arrangements with major retailers like Amazon to provide a “canned” link for search terms that attempt to match products in a retailer’s database.
  25. Art Gallery Holdings. From major national exhibitions to small co-ops run by artists, countless galleries are digitizing their holdings and putting them online. An excellent way to find these collections is to use ADAM, the Art, Design, Architecture & Media Information Gateway. ADAM is a searchable catalogue of more than 2,500 Internet resources whose entries are all invisible. For example, the Van Gogh Museum in Amsterdam provides a digital version of the museum's collection that is invisible to general search tools.


What’s NOT on the Web—Visible or Invisible

There’s an entire class of information that’s simply not available on the Web, including the following:
  • Proprietary databases and information services. These include Thomson’s Dialog service, LexisNexis, and Dow Jones, which restrict access to their information systems to paid subscribers.
  • Many government and public records. Although the U.S. government is the most prolific publisher of content both on the Web and in print, there are still major gaps in online coverage. Some proprietary services such as KnowX offer limited access to public records for a fee. Coverage of government and public records is similarly spotty in other countries around the world. While there is a definite trend toward moving government information and public records online, the sheer mass of information will prevent all of it from ever going online. There are also privacy concerns that may prevent certain types of public records from going digital in a form that might compromise an individual's rights.
  • Scholarly journals or other "expensive" information. Thanks in part to the "publish or perish" imperative at modern universities, publishers of scholarly journals or other information that's viewed as invaluable for certain professions have succeeded in creating a virtual "lock" on the market for their information products. It's a very profitable business for these publishers, and they wield an enormous amount of control over what information is published and how it's distributed. Despite ongoing, increasingly acrimonious struggles with information users, especially libraries, who often have insufficient funding to acquire all of the resources they need, publishers of premium content see little need to change the status quo. As such, it's highly unlikely that this type of content will be widely available on the Web any time soon. There are some exceptions. Northern Light's Special Collection, for example, makes available a wide array of reasonably priced content that previously was only available via expensive subscriptions or site licenses from proprietary information services. ResearchIndex can retrieve copies of scholarly papers posted on researchers' personal Web sites, bypassing the "official" versions appearing in scholarly journals. But this type of semi-subversive "Napster-like" service may come under attack in the future, so it's too early to tell whether it will provide a viable alternative to the official publications or not. For the near future, public libraries are one of the best sources for this information, made available to community patrons and paid for by tax dollars.
  • Full text of all newspapers and magazines. Very few newspapers or magazines offer full-text archives. For those publications that do, the content only goes back a limited time, 10 or 20 years at the most. There are several reasons for this. Publishers are very aware that the content they have published quite often retains value over time, but few economic models have emerged as yet that allow publishers to unlock that value. Authors' rights are another concern. Many authors retained most re-use rights to the materials printed in magazines and newspapers, and for content published more than two decades ago, reprints in digital format were not envisioned or legally accounted for. It will take time for publishers and authors to forge new agreements and for consumers of Web content to become comfortable with the notion that not everything on the Web is free. New micropayment systems or "all you can eat" subscription services should emerge to remove some of the current barriers keeping magazine and newspaper content off the Web. Some newspapers are already placing archives of their content on the Web; often the search function is free but retrieval of full text is fee based (the services offered by Newslibrary, for example).
And finally, perhaps the reason users cannot find what they are looking for on either the visible or Invisible Web is simply that it's just not there. While much of the world's print information has migrated to the Web, there are and always will be millions of documents that will never be placed online. The only way to locate these printed materials will be via traditional methods: using libraries or asking for help from people who have physical access to the information.


Spider Traps, Damned Lies, and Other Chicanery
Though there are many technical reasons the major search engines don't index the Invisible Web, there are also "social" reasons having to do with the validity, authority, and quality of online information. Because the Web is open to everybody and anybody, a good deal of its content is published by non-experts or, even worse, by people with a strong bias that they seek to conceal from readers. Search engines must also cope with unethical Web page authors who seek to subvert their indexes with millions of bogus "spam" pages. Most of the major engines have developed strict guidelines for dealing with spam, and these guidelines sometimes have the unfortunate effect of excluding legitimate content.

No matter whether you’re searching the visible or Invisible Web, it’s important always to maintain a critical view of the information you’re accessing. For some reason, people often lower their guard when it comes to information on the Internet. People who would scoff if asked to participate in an offline chain-mail scheme cast common sense to the wind and willingly forward hoax e-mails to their entire address books. Urban legends and all manner of preposterous stories abound on the Web.

Here are some important questions to ask and techniques to use for assessing the validity and quality of online information, regardless of its source.
  • Who Maintains the Content? The first question to ask of any Web site is who’s responsible for creating and updating it. Just as you would with any offline source of information, you want to be sure that the author and publishers are credible and the information they are providing can be trusted.
    Corporate Web sites should provide plenty of information about the company and its products and services. But corporate sites will always seek to portray the company in the best possible light, so you'll need to use other information sources to balance the favorable bias. If you're unfamiliar with a company, try searching for information about it using
    Hoover’s. For many companies, AltaVista provides a link to a page with additional “facts about” the company, including a capsule overview, news, details of Web domains owned, and financial information.
    Information maintained by government Web sites or academic institutions is inherently more trustworthy than other types of Web content, but it’s still important to look at things like the authority of the institution or author. This is especially true in the case of academic institutions, which often make server space available to students who may publish anything they like without worrying about its validity.
    If you’re reading a page created by an individual, who is the author? Do they provide credentials or some other kind of proof that they write with authority? Is contact information provided, or is the author hiding behind the veil of anonymity? If you can’t identify the author or maintainer of the content, it’s probably not a good idea to trust the resource, even if it appears to be of high quality in all other respects.
  • What Is the Content Provider’s Authority? Authority is a measure of reputation. When you’re looking at a Web site, is the author or producer of the content a familiar name? If not, what does the site provide to assert authority?
    For an individual author, look for a biography of the author citing previous work or awards, a link to a resume or other vita that demonstrates experience, or similar relevant facts that prove the author has authority. Sites maintained by companies should provide a corporate profile, and some information about the editorial standards used to select or commission work.
    Some search engines provide an easy way to check on the authority of an author or company. Google, for example, tries to identify authorities by examining the link structure of the entire Web to gauge how often a page is cited in the form of a link by other Web page authors. It also checks to see if there are links to these pages from "important" sites on the Web that have authority. Results in Google for a particular query therefore provide an informal gauge of authority. Beware, though, that this is only an informal measure: even a page created by a Nobel laureate may not rank highly on Google if other important pages on the Web don't link to it.
  • Is There Bias? Bias can be subtle, and can be easily camouflaged in sites that deal with seemingly non-controversial subjects. Bias is easy to spot when it takes the form of a one-sided argument. It's harder to recognize when it dons a Janusian mask of two-sided "argument" in which one side consistently (and seemingly reasonably) prevails. Bias is particularly insidious on so-called "news" sites that exist mainly to promote specific issues or agendas. The key to avoiding bias is to look for balanced writing.
    Another form of bias on the Web appears when a page appears to be objective, but is sponsored by a group or organization with a hidden agenda that may not be apparent on the site. It’s particularly important to look for this kind of thing in health or consumer product information sites. Some large companies fund information resources for specific
    health conditions, or advocate a particular lifestyle that incorporates a particular product. While the companies may not exert direct editorial influence over the content, content creators nonetheless can’t help but be aware of their patronage, and may not be as objective as they might be. On the opposite side of the coin, the Web is a powerful medium for activist groups with an agenda against a particular company or industry. Many of these groups have set up what appear to be objective Web sites presenting seemingly balanced information when in fact they are extremely one-sided and biased.
    There’s no need to be paranoid about bias. In fact, recognizing bias can be very useful in helping understand an issue in depth from a particular point of view. The key is to acknowledge the bias and take steps to filter, balance, and otherwise gain perspective on what is likely to be a complex issue.
  • Examine the URL. URLs can contain a lot of useful clues about the validity and authority of a site. Does the URL seem “appropriate” for the content? Most companies, for example, use their name or a close approximation in their primary URL. A page stored on a free service like Yahoo’s GeoCities or Lycos-Terra’s Tripod is not likely to be an official company Web site. URLs can also reveal bias.
    Deceptive page authors can also feed search engine spiders bogus content using cloaking techniques, but once you've actually retrieved a page in your browser, its URL cannot be spoofed. If a URL contains words that seem suspicious or irrelevant to the topic it supposedly represents, it's likely a spurious source of information.

  • Examine Outbound Links. The hyperlinks included in a document can also provide clues about the integrity of the information on the page. Hyperlinks were originally created to help authors cite references, and can provide a sort of online “footnote” capability. Does a page link to other credible sources of information? Or are most of the links to other content on the same Web site?
    Well-balanced sites have a good mix of internal and external links. For complex or controversial issues, external links are particularly important. If they point to other authorities on a subject, they allow you to easily access alternative points of view from other authors. If they point to less credible authors, or ones that share the same point of view as the author, you can be reasonably certain you’ve uncovered bias, whether subtle or blatant.
  • Is the Information Current? Currency of information is not always important, but for timely news, events, or for subject areas where new research is constantly expanding a field of knowledge, currency is very important.
    Look for dates on a page. Be careful—automatic date scripts can be included on a page so that it appears current when in fact it may be quite dated. Many authors include “dateline” or “updated” fields somewhere on the page.
    It’s also important to distinguish between the date shown in search results and the date a document was actually published. Some search engines include a date next to each result. These dates often have nothing to do with the document itself—rather, they are the date the search engine’s crawler last spidered the page. While this can give you a good idea of the freshness of a search engine’s database, it can be misleading to assume that the document’s creation date is the same. Always check the document itself if the date is an important part of your evaluation criteria (the sketch after this list shows how to ask a server for a page’s reported date).
  • Use Common Sense. Apply the same filters to the Web as you do to other sources of information in your life. Ask yourself: “How would I respond to this if I were reading it in a newspaper, or in a piece of junk mail?” Just because something is on the Web doesn’t mean you should believe it—quite the contrary, in many cases. For more on evaluating the quality of Web resources, we recommend Genie Tyburski’s excellent guide, Evaluating the Quality of Information on the Internet.
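To make the link-analysis idea mentioned under “What Is the Content Provider’s Authority?” concrete, here is a minimal sketch, in Python, of a simplified PageRank-style calculation over a tiny invented link graph. The page names, damping factor, and graph are made up for illustration; Google’s production ranking is far more elaborate than this.

# Toy illustration of link-based authority scoring: pages earn score by
# being linked to, especially from pages that are themselves well linked.
# The graph below is invented for the example.

links = {
    "nobel-laureate-homepage": [],                              # links to no one
    "university.edu":          ["nobel-laureate-homepage", "news-site.com"],
    "news-site.com":           ["university.edu"],
    "obscure-blog.net":        ["nobel-laureate-homepage"],
}

def authority_scores(links, damping=0.85, iterations=50):
    """Simplified PageRank-style power iteration."""
    pages = list(links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread its score evenly
                for p in pages:
                    new[p] += damping * scores[page] / n
            else:
                for target in outlinks:
                    new[target] += damping * scores[page] / len(outlinks)
        scores = new
    return scores

for page, score in sorted(authority_scores(links).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))

Even in this toy example, the laureate’s homepage only scores well because other pages choose to link to it, which is exactly the informal caveat noted above.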

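Below is a second minimal sketch, again in Python, of two of the mechanical checks described under “Examine the URL” and “Is the Information Current?”: pulling the host out of a URL and asking the Web server what modification date it reports. The URL is a placeholder, and neither check replaces critical reading of the page itself.

# Inspect a URL's host name and the date the server reports for a page.
# The example URL is a placeholder; Last-Modified is a hint, not proof,
# since pages can be regenerated automatically.

from urllib.parse import urlparse
from urllib.request import urlopen

url = "https://www.example.com/"            # substitute the page you are evaluating

parts = urlparse(url)
print("Host:", parts.netloc)                                  # does it match the claimed publisher?
print("Top-level domain:", parts.netloc.rsplit(".", 1)[-1])   # .com, .edu, .gov, ...
print("Free-hosting host?", any(h in parts.netloc for h in ("geocities", "tripod")))

with urlopen(url, timeout=10) as response:
    print("Last-Modified:", response.headers.get("Last-Modified", "not reported"))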

Keeping Current with the Invisible Web

Just as with the visible Web, new Invisible Web resources are being made available all the time. How do you keep up with potentially useful new additions? Several high-quality current awareness services publish newsletters that cover Invisible Web resources. These newsletters don’t limit themselves to the Invisible Web, but the news and information they provide is exceptionally useful for all serious Web searchers. All of these newsletters are free.
  • The Scout Report The Scout Report provides the closest thing to an “official” seal of approval for quality Web sites. Published weekly, it offers organized summaries of the most valuable and authoritative Web resources available. The Scout Report Signpost provides full-text search of nearly 6,000 of these summaries. The Scout Report staff is made up of librarians and information professionals, and their standards for inclusion in the report are quite high.
  • Librarians’ Index to the Internet (LII) This searchable, annotated directory of Web resources, maintained by Carole Leita and a volunteer team of more than 70 reference librarians, is organized into categories including “best of,” “directories,” “databases,” and “specific resources.” Most of the Invisible Web content reviewed by LII falls in the “databases” and “specific resources” categories. Each entry also includes linked cross-references, making it a browser’s delight.
    Leita also publishes a weekly newsletter that includes 15-20 of the resources added to the Web site during the previous week.
  • ResearchBuzz ResearchBuzz is designed to cover the world of Internet research. To
    that end this site provides almost daily updates on search engines, new data-managing software, browser technology, large compendiums of information, Web directories, and Invisible Web databases. If in doubt, the final question is, “Would a reference librarian find it useful?” If the answer’s yes, in it goes.
    ResearchBuzz’s creator, Tara Calishain, is author of numerous Internet research books, including Official Netscape Guide to Internet Research. Unlike most of the other current awareness services described here, Calishain often writes in-depth reviews and analyses of new resources, pointing out both useful features and flaws in design or implementation.
  • Free Pint Free Pint is an e-mail newsletter dedicated to helping you find reliable Web sites and search the Web more effectively. It’s written by and for knowledge workers who can’t afford to spend valuable time sifting through junk on the Web in search of a few nuggets of e-gold. Each issue of Free Pint has several regular sections. William Hann, Managing Editor, leads off with an overview of the issue and general news announcements, followed by a “Tips and Techniques” section, where professionals share their best searching tips and describe their favorite Web sites.
    The Feature Article covers a specific topic in detail. Recent articles have been devoted to competitive intelligence on the Internet, central and eastern European Web sources, chemistry resources, Web sites for senior citizens, and a wide range of other topics. Feature articles run between 1,000 and 2,000 words and are packed with useful background information, in addition to numerous annotated links to vetted sites in the article’s subject area. Quite often these are Invisible Web resources. One nice aspect of Free Pint is that it often covers European resources that aren’t always well known in North America or other parts of the world.
  • Internet Resources Newsletter Internet Resources Newsletter’s mission is to raise awareness of new sources of information on the Internet, particularly for academics, students, engineers, scientists, and social scientists. Published monthly, it is edited by Heriot-Watt University Library staff and published by the Heriot-Watt University Internet Resource Centre.


Build Your Own Toolkit

As you become more familiar with what’s available on the Invisible Web, it’s important to build your own collection of resources. Knowing what is available before beginning your search is in many ways the greatest challenge in mastering the Invisible Web. But isn’t this a paradox? If Invisible Web resources can’t be found using general-purpose search tools, how do you go about finding them?

A great way to become familiar with Invisible Web resources is to do preemptive searching, a process much like the one professional librarians use in collection development.

  • Explore the Invisible Web gateways, cherry-picking resources that seem relevant to your information needs, asking yourself what kinds of questions each resource might answer in the future.
  • As your collection grows, spend time organizing and reorganizing it for easier access.
  • Be selective—choose Invisible Web resources the same way you build your personal collection of reference works.
  • Consider saving your collection of Invisible Web resources with a remote bookmark service such as Backflip, Delicious, or Hotlinks. This will give you access to your collection from any Web-accessible computer. (A minimal do-it-yourself alternative is sketched below.)

Your ultimate goal in building your own toolkit should draw on one of the five laws of library science: save the time of the reader. Paradoxically, as you become a better searcher and are able to build your own high-quality toolkit, you’ll actually need to spend less time exercising your searching skills, since in many cases you’ll already have the resources you need close at hand. With your own collection of the best of the Invisible Web, you’ll be able to boldly—and quickly—go where no search engine has gone before.
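For searchers who prefer to keep their toolkit locally rather than on a remote bookmark service, here is a minimal sketch, in Python, of a tagged collection stored in a plain JSON file and searched by keyword. The file name, tags, and sample entries are invented for illustration; the point is simply to record, for each resource, the kinds of questions it can answer.

# A bare-bones personal Invisible Web toolkit: a tagged list of resources
# kept in a local JSON file, plus a tiny keyword search over it.
# File name and sample entries are invented for this example.

import json
from pathlib import Path

TOOLKIT = Path("invisible_web_toolkit.json")

sample = [
    {"name": "U.S. Patent database", "url": "http://www.uspto.gov",
     "tags": ["patents", "government"], "answers": "Has this invention already been patented?"},
    {"name": "Librarians' Index to the Internet", "url": "http://lii.org",
     "tags": ["directory", "databases"], "answers": "Is there a vetted database on this topic?"},
]

if not TOOLKIT.exists():                       # create the collection on first run
    TOOLKIT.write_text(json.dumps(sample, indent=2))

def find(keyword):
    """Return toolkit entries whose name, tags, or 'answers' note match the keyword."""
    keyword = keyword.lower()
    return [entry for entry in json.loads(TOOLKIT.read_text())
            if keyword in entry["name"].lower()
            or keyword in entry["answers"].lower()
            or any(keyword in tag for tag in entry["tags"])]

for entry in find("patent"):
    print(entry["name"], "->", entry["url"])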


The Best of the Invisible Web


You face a challenge similar to the one confronted by early explorers of Terra Incognita. Without the benefit of a search engine to guide you, where exactly do you begin your search for information on the Invisible Web?

In this section, we discuss several Invisible Web pathfinders that make excellent starting points for the exploration of virtually any topic. We also introduce our directory. This introduction takes the form of the familiar “Frequently Asked Questions” (FAQ) section you see on many Web sites. We talk about the structure of the directory, how we selected our resources, and how to get the most out of the directory for doing your own searching.

Finally, we’ll leave you with a handy “pocket reference” that you can refer to on your explorations—the top ten concepts to understand about the Invisible Web.


Invisible Web Pathfinders
Invisible Web pathfinders are, for the most part, Yahoo!-like directories with lists of links to Invisible Web resources. Most of these pathfinders, however, also include links to searchable resources that aren’t strictly invisible. Nonetheless, they are useful starting points for finding and building your own collection of Invisible Web resources.
  • direct search direct search is a growing compilation of links to the search interfaces of resources that contain data not easily or entirely searchable/accessible from general search tools like AltaVista, Google, and HotBot. The goal of direct search is to get as close as possible to the search form offered by a Web resource (rather than having to click through one or two pages to get there); hence the name “direct search.”
  • InvisibleWeb The InvisibleWeb Catalog contains over 10,000 databases and searchable sources that are frequently overlooked by traditional search engines. Each source is analyzed and described by editors to ensure that every user of the catalog will find reliable information on hundreds of topics, from Air Fares to Yellow Pages. All of this material can be accessed easily through Quick or Advanced Search features or a browsable index. Unlike general-purpose search engines, InvisibleWeb takes you directly to the searchable source within a Web site, even generating a search form for you to perform your query.
  • Librarians’ Index to the Internet The Librarians’ Index to the Internet is a searchable, annotated subject directory of more than 7,000 Internet resources selected and evaluated by librarians for their usefulness to users of public libraries. LII only includes links to the very best Net content. While not a “pure” Invisible Web pathfinder, LII categorizes each resource as Best Of, Directories, Databases, and Specific Resources. Databases, of course, are Invisible Web resources. By using LII’s advanced search feature, you can limit your search to return only databases in the results list. Advanced search
    also lets you restrict your results to specific fields of the directory (author name, description, title, URL, etc.). In effect, the Librarians’ Index to the Internet is a laser-sharp searching tool for finding Invisible Web databases.
  • WebData General portal Web sites like Yahoo!, Excite, Infoseek, Lycos, and Goto.com are page-oriented search engine sites (words on pages are indexed), whereas WebData.com’s searches are content-oriented (forms and databases on Web sites are indexed). WebData.com and the traditional search engines are often confused with each other when compared side by side because they look alike. However, searches on WebData.com return databases, while the others return Web pages that may or may not be what a user is looking for.
  • AlphaSearch The primary purpose of AlphaSearch is to access the finest Internet
    “gateway” sites. The authors of these gateway sites have spent significant time gathering into one place all relevant sites related to a discipline, subject, or idea. You have instant access to hundreds of sites by entering just one gateway site. http://www.calvin.edu/library/searreso/internet/as/
  • ProFusion ProFusion is a meta search engine from Intelliseek, the same company that runs InvisibleWeb.com. In addition to providing a sophisticated simultaneous search capability for the major general-purpose search engines, ProFusion provides direct access to the Invisible Web with the ability to search over 1,000 targeted sources of information, including sites like TerraServer, Adobe PDF Search, Britannica.com, The New York Times, and the U.S. Patent database. http://www.profusion.com


An Invisible Web Directory
In general, we like the idea of comparing the resources available on the Invisible Web to a good collection of reference works. The challenge is to be familiar with some key resources prior to needing them. Information professionals have always done this with canonical reference books, and often with traditional, proprietary databases like Dialog and LexisNexis. We encourage you to approach the Invisible Web in the same way—consider each specialized search tool as you would an individual reference resource.


In Summary: The Top 10 Concepts to Understand about the Invisible Web
As you begin your exploration and charting of the Invisible Web, here’s a list of the top ten concepts that you should understand about the Invisible Web.
  1. In most cases, the data found in an Invisible Web database or opaque Web database cannot be accessed entirely or easily via a general-purpose search engine.
  2. The Invisible Web is not the sole solution to all of one’s information needs. For optimal results, Invisible Web resources should be used in conjunction with other information resources, including general-purpose Web search engines and directories.
  3. Because many Invisible Web databases (as well as opaque databases) search a limited universe of material, the opportunity for a more precise and relevant search is greater than when using a general search tool.
  4. Often, Invisible Web and Opaque Web databases will have the most current information available online, since they are updated more frequently than most general-purpose search engines.
  5. In many cases, Invisible Web resources clearly identify who is providing the information, making it easy to judge the authority of the content and its provider.
  6. Material accessible “on the Invisible Web” is not the same as what is found in proprietary databases, such as Dialog or Factiva. In many cases, material on the Invisible Web is free or available for a small fee. In some cases material is available in multiple formats.
  7. Targeted crawlers, which commonly focus on Opaque Web resources, often offer more comprehensive coverage of their subject, since they crawl more pages of each site that they index and crawl them more often than a general-purpose search engine.
  8. To use the Invisible Web effectively, you must make some effort to have an idea of what is available prior to searching. Consider each resource as if it were a traditional reference book. Ask yourself, “What questions can this resource answer?” Think less of an entire site and more of the tools that can answer specific types of questions.
  9. Invisible Web databases can make non-textual material searchable and accessible.
  10. Invisible Web databases offer specialized interfaces that enhance the utility of the information they access. Even if a general-purpose search engine could somehow access Invisible Web data, the shotgun nature of its search interface simply is no match for the rifle-shot approach offered by most Invisible Web tools.



to be continued...



Google tricks

Google is clearly the best general-purpose search engine on the Web. But most people don't use it to its best advantage. Do you just plug in a keyword or two and hope for the best? That may be the quickest way to search, but with more than 3 billion pages in Google's index, it's still a struggle to pare results to a manageable number.
    Google's search options go beyond simple keyword queries, beyond ordinary Web pages, and, through its API, even beyond Google's own programmers. Let's look at some of Google's lesser-known options.


Syntax Search Tricks

Using a special syntax is a way to tell Google that you want to restrict your searches to certain elements or characteristics of Web pages. Google has a fairly complete list of its syntax elements. Here are some advanced operators that can help narrow down your search results.
  • Intitle: at the beginning of a query word or phrase (intitle:"Three Blind Mice") restricts your search results to just the titles of Web pages.
  • Intext: does the opposite of intitle:, searching only the body text, ignoring titles, links, and so forth. Intext: is perfect when what you're searching for might commonly appear in URLs. If you're looking for the term HTML, for example, and you don't want to get results such as www.mysite.com/index.html, you can enter intext:html.
  • Link: lets you see which pages are linking to your Web page or to another page you're interested in. For example, try typing link:http://www.pcmag.com
  • Site: restricts results to a specific site or top-level domain (for example, site:pcmag.com or site:edu).
  • Daterange: (start date-end date) lets you restrict your searches to pages that were indexed within a certain time period. Daterange: searches by when Google indexed a page, not when the page itself was created. This operator can help you ensure that results have fresh content (by using recent dates), or you can use it to avoid a topic's current-news blizzard and concentrate only on older results.
Daterange: is actually more useful if you go elsewhere to take advantage of it, because daterange: requires Julian dates, not standard Gregorian dates. You can find converters on the Web, and several sites offer forms that will build a daterange: search for you; the sketch below shows how the conversion works.
If one special syntax element is good, two must be better, right? Sometimes. Though some operators can't be mixed (you can't use the link: operator with anything else), many can be, quickly narrowing your results to a less overwhelming number.
Try using site: with intitle: to find certain types of pages. For example, get scholarly pages about Mark Twain by searching for intitle:"Mark Twain" site:edu. Experiment with mixing various elements; you'll develop several strategies for finding the stuff you want more effectively. The site: command is very helpful as an alternative to the mediocre search engines built into many sites.
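Here is a minimal sketch, in Python, of the Gregorian-to-Julian conversion that daterange: needs, combined with a query that mixes the operators described above. The offset comes from the standard Julian day number definition; the query terms are just examples.

# Convert Gregorian dates to Julian day numbers and compose a Google query
# that mixes daterange:, intitle:, and site:.
# The 1721425 offset maps Python's date ordinal (1 = 0001-01-01)
# to the integer Julian day number that daterange: expects.

from datetime import date

def julian_day(d):
    """Integer Julian day number for a Gregorian date."""
    return d.toordinal() + 1721425

start = julian_day(date(2003, 1, 1))       # 2452641
end   = julian_day(date(2003, 3, 31))      # 2452730

query = 'intitle:"Mark Twain" site:edu daterange:%d-%d' % (start, end)
print(query)
# intitle:"Mark Twain" site:edu daterange:2452641-2452730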


Swiss Army Google

Google has a number of services that can help you accomplish tasks you may never have thought to use Google for. For example, the new calculator feature lets you do both math and a variety of conversions from the search box. For extra fun, try the query "Answer to life the universe and everything."
    Let Google help you figure out whether you've got the right spelling and the right word for your search. Enter a misspelled word or phrase into the query box (try "thre blund mise") and Google may suggest a proper spelling. This doesn't always succeed; it works best when the word you're searching for can be found in a dictionary. Once you search for a properly spelled word, look at the results page, which repeats your query. (If you're searching for "three blind mice," underneath the search window will appear a statement such as Searched the web for "three blind mice.") You'll discover that you can click on each word in your search phrase and get a definition from a dictionary.
    Suppose you want to contact someone and don't have his phone number handy. Google can help you with that, too. Just enter a name, city, and state. (The city is optional, but you must enter a state.) If a phone number matches the listing, you'll see it at the top of the search results along with a map link to the address. If you'd rather restrict your results, use rphonebook: for residential listings or bphonebook: for business listings.
If you'd rather use a search form for business phone listings, try Yellow Search.


Extended Googling

Google offers several services that give you a head start in focusing your search.
You're probably used to using Google in your browser. But have you ever thought of using Google outside your browser?
    Google Alert  monitors your search terms and e-mails you information about new additions to Google's Web index. (Google Alert is not affiliated with Google; it uses Google's Web services API to perform its searches.)
    If you're more interested in news stories than general Web content, check out the beta version of Google News Alerts. This service (which is affiliated with Google) will monitor up to 50 news queries per e-mail address and send you information about news stories that match your query. (Hint: Use the intitle: and source: syntax elements with Google News to limit the number of alerts you get.)
    Google on the telephone? Yup. This service is brought to you by the folks at Google Labs, a place for experimental Google ideas and features (which may come and go, so what's there at this writing might not be there when you decide to check it out).
    With Google Voice Search, you dial the Voice Search phone number, speak your keywords, and then click on the indicated link. Every time you say a new search term, the results page will refresh with your new query (you must have JavaScript enabled for this to work). Remember, this service is still in an experimental phase, so don't expect 100 percent success.
    In 2002, Google released the Google API (application programming interface), a way for programmers to access Google's search engine results without violating the Google Terms of Service. A lot of people have created useful (and occasionally not-so-useful but interesting) applications not available from Google itself, such as Google Alert. For many applications, you'll need an API key, which is available free here.
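As a rough illustration of what programming against the Google API looked like, here is a hedged Python sketch using the SOAPpy library against the GoogleSearch.wsdl service, the approach most early API applications took. The parameter order of doGoogleSearch and the result fields follow our reading of the API documentation of the time, so treat them as assumptions to verify; you would also need your own free API key in place of the placeholder.

# Sketch of a query against the SOAP-era Google Web APIs via SOAPpy.
# WSDL location, doGoogleSearch parameter order, and result fields are
# taken from the period documentation and should be verified before use.
# YOUR_KEY_HERE is a placeholder for a free Google API key.

from SOAPpy import WSDL

WSDL_URL = "http://api.google.com/GoogleSearch.wsdl"
API_KEY = "YOUR_KEY_HERE"

server = WSDL.Proxy(WSDL_URL)

# doGoogleSearch(key, query, start, maxResults, filter, restrict,
#                safeSearch, languageRestrict, inputEncoding, outputEncoding)
results = server.doGoogleSearch(
    API_KEY, 'intitle:"invisible web" site:edu', 0, 10,
    False, "", False, "", "utf-8", "utf-8")

for element in results.resultElements:
    print(element.title, "-", element.URL)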
    Thanks to its many different search properties, Google goes far beyond a regular search engine. You'll be amazed at how many different ways Google can improve your Internet searching.



More Google API Applications

CapeMail is an e-mail search application that allows you to send an e-mail to google@capeclear.com with the text of your query in the subject line and get the first ten results for that query back. Maybe it's not something you'd do every day, but if your cell phone does e-mail and doesn't do Web browsing, this is a very handy address to know.
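To show the mechanics behind a search-by-e-mail service like CapeMail, here is a minimal Python sketch that sends a message with the query in the subject line. The sender address and SMTP relay are placeholders for your own mail settings; the CapeMail address comes from the description above.

# Send a query to CapeMail: the query text goes in the subject line,
# and the first ten results come back by return e-mail.
# Replace the sender address and SMTP host with your own mail settings.

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "you@example.com"               # placeholder sender
msg["To"] = "google@capeclear.com"            # CapeMail's query address
msg["Subject"] = '"invisible web" databases'  # the search query itself
msg.set_content("")                           # the query is read from the subject line

with smtplib.SMTP("smtp.example.com") as smtp:   # placeholder SMTP relay
    smtp.send_message(msg)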






Resources


The Invisible Web: Uncovering Information Sources Search Engines Can’t See
by Chris Sherman and Gary Price (2001)
ISBN 0-910965-51-X

How To Find and Search the Invisible Web
Google Hacks
