The Deep Web refers to the part of the World Wide Web that cannot be found via standard search engines. In contrast to the Deep Web, the pages that search engines can reach are called the Clear Web, Visible Web, or Surface Web. The Deep Web consists largely of topic-specific databases and websites. In summary, this is content that is not freely accessible, content that is not indexed by search engines, or content that is not intended to be indexed.
Experts commonly distinguish five types of the invisible web:
- Opaque Web
- Private Web
- Proprietary Web
- Invisible Web
- Truly Invisible Web
The opaque web comprises websites that could be indexed but currently are not, for reasons of technical performance or cost-benefit considerations (crawl depth, visit frequency).
Search engines do not take all directory levels and subpages of a website into account. Web crawlers reach new pages by following links from pages they have already visited; they cannot navigate a site on their own. In complex directory structures they can get lost and then fail to capture pages (collecting text, images, links, and other relevant data for indexing) or to find their way back to the home page. For this reason, search engines often consider at most five or six directory levels. Extensive, and therefore potentially relevant, documents may sit at deeper levels of the hierarchy, where this limited indexing depth prevents search engines from finding them.
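The effect of a limited crawl depth can be illustrated with a minimal sketch. The "site" here is a hypothetical in-memory dictionary of pages (a real crawler would fetch URLs over HTTP), but the link-following logic is the same:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(pages, start, max_depth):
    """Breadth-first crawl over an in-memory site, following links
    for at most max_depth levels below the start page."""
    indexed = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            parser = LinkExtractor()
            parser.feed(pages.get(url, ""))
            for link in parser.links:
                if link in pages and link not in indexed:
                    indexed.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return indexed

# A linear chain of pages: / -> /l1 -> /l2 -> /l3
pages = {
    "/":   '<a href="/l1">next</a>',
    "/l1": '<a href="/l2">next</a>',
    "/l2": '<a href="/l3">next</a>',
    "/l3": "relevant document, but three levels deep",
}

shallow = crawl(pages, "/", max_depth=2)  # "/l3" is never reached
```

With `max_depth=2`, the crawler indexes only `/`, `/l1`, and `/l2`; the document at `/l3` exists and is linked, but falls below the cut-off and stays invisible to the index.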
In addition, some file formats can only be partially captured (PDF files, for example: Google indexes only part of a PDF file and makes that content available as HTML).
Constantly updated databases, such as online measurement data, are affected as well, and so are websites without hyperlinks or navigation systems: unlinked websites, "hermit URLs", or "orphan sites".
The private web describes web pages that could be indexed but are not, because of access restrictions imposed by the webmaster.
This includes:
- websites on an intranet (internal websites),
- password-protected data (registration with login and password),
- access restricted to certain IP addresses,
- protection against indexing via the Robots Exclusion Standard (also known as robots.txt),
- protection against indexing via the meta tag values "noindex", "nofollow", and "noimageindex" in the source code of the website.
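A compliant crawler checks such exclusion rules before fetching a page. A minimal sketch using Python's standard `urllib.robotparser` (the rules and URLs are invented for illustration):

```python
import urllib.robotparser

# Hypothetical exclusion rules, as they would appear in /robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

blocked = rp.can_fetch("*", "https://example.com/private/report.html")  # False
allowed = rp.can_fetch("*", "https://example.com/index.html")           # True
```

The meta tag mechanism works the other way round: the crawler does fetch the page, but a `<meta name="robots" content="noindex">` in its header instructs the search engine to keep it out of the index.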
The proprietary web refers to websites that could be indexed but can only be accessed after accepting terms of use or entering a password, whether access is free or paid.
Such websites are usually accessible only after identification (web-based specialist databases, paywalled online media).
The invisible web includes websites that could technically be indexed, but are not indexed for commercial or strategic reasons – for example, databases whose content can only be reached through a web form, which a crawler cannot fill in and submit.
The truly invisible web refers to websites that cannot (yet) be indexed for technical reasons. These include database formats that predate the WWW (some hosts), documents that cannot be displayed directly in a browser, non-standard formats (e.g. Flash), and file formats that are too complex to capture (graphic formats). Added to this are compressed data and web pages that can only be operated through user navigation, graphics (image maps), or scripts.