A search engine is a program used to search documents stored on a computer, a computer network, or the World Wide Web. After a user submits a search query, often by typing one or more search terms, the search engine returns a list of references to potentially relevant documents, usually presented with a title and a short excerpt of the respective document. Various search methods are used. The essential components, or areas of responsibility, of a search engine are:
- Creation and maintenance of an index (data structure with information about documents),
- Processing search queries (finding and sorting results) and
- Preparation of the results in the most meaningful form possible.
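The three areas of responsibility listed above can be sketched as a minimal pipeline. This is an illustrative toy, not how any real engine is implemented; all names and documents are made up:

```python
# Minimal sketch of the three core components of a search engine.
# Documents and function names are illustrative only.

documents = {
    1: "the quick brown fox",
    2: "the lazy dog sleeps",
}

# 1. Creation and maintenance of an index: map each term to the
#    set of documents that contain it.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# 2. Processing a search query: find documents containing all terms.
def search(query):
    terms = query.lower().split()
    hits = set.intersection(*(index.get(t, set()) for t in terms)) if terms else set()
    return sorted(hits)

# 3. Preparation of the results: a "title" plus a short excerpt.
def render(doc_ids):
    return [f"Doc {i}: {documents[i][:20]}..." for i in doc_ids]

print(render(search("the fox")))  # ['Doc 1: the quick brown fox...']
```

In practice each of these stages is a large subsystem of its own; the following sections discuss them in more detail.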
As a rule, data is obtained automatically: on the Internet by web crawlers, and on a single computer by regularly reading all files in user-specified directories of the local file system.
Search engines are meta-media: they find and provide access to content from other media. They can be categorized according to a number of characteristics. These characteristics are largely independent of one another, so when designing a search engine one can choose an option from each group without constraining the choices in the other groups.
Features of a Search Engine
Type of data
Different search engines can search different types of data. These can first be divided roughly into “document types” such as text, image, sound, video, and others; result pages are designed according to this genre. When searching for text documents, you will usually see a snippet of text containing the search terms, while image search engines display thumbnails of the matching images. A people search engine finds publicly available information about names and persons, presented as a list of links. Other specialized types of search engines include job search engines, industry searches, and product search engines. The latter are primarily used by online price comparison services, though local offer searches now also display products and offers from brick-and-mortar retailers online.
A finer distinction concerns data-specific properties that not all documents within a genre share. Staying with the example of text: in Usenet posts one can search for specific authors, and in websites in HTML format one can search for the title of the document.
Depending on the data category, the search can be restricted to a subset of all data in that category. This is generally achieved through additional search parameters that filter the search results from the collected data, for example with Boolean expressions (AND, OR, NOT) or by language, country, time period, or file format. Alternatively, a search engine can limit itself from the outset to including only suitable documents. Examples include a search engine for weblogs (instead of the entire web), or search engines that only process documents from universities, or only documents from a specific country, in a specific language, or in a specific file format.
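Such post-filtering by search parameters can be sketched as follows. The metadata fields used here (language, file type, and so on) are assumptions for the example, not a fixed standard; each additional parameter narrows the result set (an implicit AND):

```python
# Sketch: restricting a result set with additional search parameters.
# The metadata fields below are illustrative assumptions.

docs = [
    {"id": 1, "lang": "en", "country": "US", "filetype": "html", "year": 2021},
    {"id": 2, "lang": "de", "country": "DE", "filetype": "pdf",  "year": 2019},
    {"id": 3, "lang": "en", "country": "DE", "filetype": "pdf",  "year": 2022},
]

def filter_results(results, **params):
    """Keep only documents matching all given parameters (Boolean AND)."""
    return [d for d in results if all(d.get(k) == v for k, v in params.items())]

print([d["id"] for d in filter_results(docs, lang="en", filetype="pdf")])  # [3]
```

The alternative mentioned in the text, restricting the collection itself, would instead apply such a filter at indexing time, so that unsuitable documents never enter the index at all.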
Another characteristic for categorization is the source from which the data collected by the search engine originates. In most cases, the name of the search engine type already describes the source.
- Web search engines capture documents from the World Wide Web.
- Vertical search engines look at a selected area of the World Wide Web and only capture web documents on a specific topic, such as football, health, or law.
- Usenet search engines capture contributions from the globally distributed discussion medium Usenet.
- Intranet search engines are limited to the computers of a company’s intranet.
- Enterprise search engines enable a central search across various data sources within a company, such as file servers, wikis, databases, and intranets.
- Desktop search engines are programs that make the local database of a single computer searchable.
If the data is obtained manually, by registration or by human editors, the result is referred to as a catalogue or directory. In directories such as the Open Directory Project, documents are organized hierarchically by topic in a table of contents.
Search engines also differ in how their operation is realized. The most important group today is index-based search engines. They read in suitable documents and build an index: a data structure that is consulted later when answering search queries. The disadvantage is the time-consuming maintenance and storage of the index; the advantage is the acceleration of the search process. The most common form of this structure is an inverted index.
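An inverted index maps each term to the set of documents containing it, so that answering a query becomes a lookup and intersection of posting lists rather than a scan over all documents. A minimal sketch (the documents are made up for illustration):

```python
# Minimal inverted index: term -> set of document IDs containing it.

from collections import defaultdict

docs = {
    1: "search engines build an index",
    2: "an inverted index maps terms to documents",
    3: "web crawlers collect documents",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

# AND query: intersect the posting lists of all query terms.
def query_and(*terms):
    postings = [inverted[t] for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(query_and("index"))           # [1, 2]
print(query_and("index", "terms"))  # [2]
```

Real implementations store sorted posting lists with compression and auxiliary data (term frequencies, positions) to support ranking and phrase queries, but the lookup-and-intersect principle is the same.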
Metasearch engines send search queries to several index-based search engines in parallel and combine the individual results. Their advantages are the larger amount of data covered and the easier implementation, since no index needs to be maintained. Their disadvantage is the relatively long time needed to process requests. In addition, the ranking is of questionable value, since it rests on simple majority voting across the engines, and the quality of the results may sink to that of the worst search engine queried. Metasearch engines are particularly useful for rare search terms.
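The core of a metasearch engine, querying several engines in parallel and merging by majority voting, can be sketched like this. The three "engines" here are stand-in functions returning fixed URLs; a real metasearcher would issue network requests:

```python
# Sketch of a metasearch engine: query several (here faked) engines in
# parallel and rank merged results by how many engines returned each URL.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def engine_a(q): return ["example.org/a", "example.org/b"]
def engine_b(q): return ["example.org/b", "example.org/c"]
def engine_c(q): return ["example.org/b", "example.org/a"]

def metasearch(query, engines):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda e: e(query), engines))
    votes = Counter(url for results in result_lists for url in results)
    # Majority voting: most-returned URLs first; alphabetical tie-break.
    return [url for url, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]

print(metasearch("test", [engine_a, engine_b, engine_c]))
# ['example.org/b', 'example.org/a', 'example.org/c']
```

The sketch also makes the ranking problem visible: vote counts say nothing about how relevant each engine actually judged a result to be, which is why majority-based ranking is of limited value.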
Hybrid forms also exist. These have their own, often relatively small index, but also query other search engines and finally combine the individual results. So-called real-time search engines only start the indexing process after a query has been made. The pages found are therefore always up to date, but the quality of the results is poor due to the lack of a broad database, especially for less common search terms.
A relatively new approach is distributed, or federated, search engines. Here a search query is forwarded to a large number of individual computers, each of which operates its own search engine, and the results are merged. The advantages are the high reliability that comes with decentralization and, depending on one's point of view, the impossibility of central censorship. What remains difficult, however, is the ranking: sorting the fundamentally suitable documents according to their relevance to the query.
A special type of distributed search engine is based on the peer-to-peer principle and builds a distributed index. On each peer, independent crawlers capture, in a censorship-resistant manner, the parts of the web that the respective peer operator defines through simple local configuration. Apart from some mainly academic projects, the best-known system is YaCy, free software licensed under the GNU GPL.
Interpretation of the input
A user’s search query is interpreted before the actual search and converted into a form understandable to the internally used search algorithm. This keeps the syntax of the query as simple as possible while still allowing complex queries. Many search engines support combining search words with Boolean operators and searching for exact words or phrases enclosed in quotation marks. This makes it possible to find web pages that contain certain terms but not others.
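A simple interpretation step of this kind can be sketched as a small parser. The syntax assumed here, quoted phrases, `-` for exclusion, and implicit AND between plain terms, is a widespread convention rather than the grammar of any specific engine:

```python
# Sketch of query interpretation: extract exact phrases in quotation
# marks, exclusions prefixed with '-', and plain terms (implicit AND).
# The syntax is a common convention, not that of a specific engine.

import re

def parse_query(query):
    phrases = re.findall(r'"([^"]+)"', query)       # exact-phrase parts
    rest = re.sub(r'"[^"]+"', "", query)            # query minus phrases
    terms = [t for t in rest.split() if not t.startswith("-")]
    excluded = [t[1:] for t in rest.split() if t.startswith("-")]
    return {"phrases": phrases, "terms": terms, "excluded": excluded}

print(parse_query('"search engine" index -spam'))
# {'phrases': ['search engine'], 'terms': ['index'], 'excluded': ['spam']}
```

The output of such a parser is then handed to the internal search algorithm, which matches phrases exactly, requires all plain terms, and rejects documents containing excluded terms.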
A more recent development is the ability of a number of search engines to infer implicit information from the context of the search query itself and to take it into account as well. Ambiguities in the query, typical of incomplete search queries, can thus be reduced and the relevance of the search results (i.e., their correspondence with the conscious or unconscious expectations of the searcher) increased. From the semantic similarities of the entered search terms, one or more underlying meanings of the query are inferred (see also: Semantic Search), and the result set is expanded to include hits for semantically related search terms that were not explicitly entered. As a rule, this improves the results not only quantitatively but also qualitatively (in relevance), particularly for incomplete queries and poorly chosen search terms, because the search intentions that such terms only blurrily convey are in practice reproduced surprisingly well by the statistical methods the search engines use.
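The expansion of the result set to semantically related terms can be illustrated with a toy version of query expansion. The synonym table here is hand-made for the example; as the text notes, real engines derive such relations statistically (for instance from co-occurrence data or word embeddings):

```python
# Sketch of semantic query expansion: each entered term is expanded to
# an OR group of related terms; the groups are then combined with AND.
# The synonym table is hand-made for illustration; real engines derive
# these relations statistically.

synonyms = {
    "car": {"automobile", "vehicle"},
    "cheap": {"inexpensive", "affordable"},
}

def expand(query):
    return [sorted({term} | synonyms.get(term, set()))
            for term in query.lower().split()]

print(expand("cheap car"))
# [['affordable', 'cheap', 'inexpensive'], ['automobile', 'car', 'vehicle']]
```

A document matching "affordable vehicle" would thus be found for the query "cheap car", even though neither entered term appears in it.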
Implicitly provided information (such as location data and other details transmitted with queries from the mobile network) and “meaning preferences” inferred from a user’s stored search history are further examples of information that is not explicitly contained in the entered search terms but is used by several search engines to modify and improve the results.
There are also search engines that can only be queried with strictly formalized query languages, but which can usually answer even very complex queries with high precision.