As explained in our earlier article, the browser cache is a buffer in which the web browser stores copies of resources that have already been retrieved onto the user’s computer. When a resource is required again, it can be served from the cache. A web server should provide cache information for each individual resource so that the user sees an up-to-date display and communication effort stays as low as possible for both users and servers.
The operator of the web server benefits by not having to answer repeated queries for unchanged resources and spend computing and network capacity on them. On the other hand, to keep counting page visits and collecting information about readers, very small embedded snippets are used for which caching is deliberately prevented, so that they are requested again every time the page is displayed. The following principles apply on the part of the browser or proxy server:
- No system is obliged to honor the caching hints supplied by the origin of the resource.
- To keep their product attractive, browser developers have an interest in evaluating cache-management information so that pages are displayed both quickly and up to date.
For cache management, one of two identifiers is used for each individual resource, if provided by the server:
- Time of last modification: a plain timestamp in UTC. Field: Last-Modified
- Version identifier: distinguishes between substantially different versions of a resource. In principle it can also be a timestamp, but also, for example, a sequentially incremented version number or a hash code. Field: ETag
If such information is missing, the cache management only knows the time of the last successful retrieval.
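As an illustration of the two identifiers, here is a minimal Python sketch of how a server might derive them. The function name `make_validators` and the choice of a truncated SHA-256 hash for the ETag are assumptions made for this example; real servers are free to build the ETag differently.

```python
import hashlib
from email.utils import formatdate


def make_validators(body: bytes, mtime: float) -> dict[str, str]:
    """Build the two validator headers for a resource (illustrative only)."""
    return {
        # HTTP dates are expressed in GMT, i.e. UTC.
        "Last-Modified": formatdate(mtime, usegmt=True),
        # An ETag is an opaque quoted string; a content hash is one option
        # (assumption: 16 hex digits are enough for this sketch).
        "ETag": '"' + hashlib.sha256(body).hexdigest()[:16] + '"',
    }
```

Two different bodies yield different ETag values, while the same body always yields the same one, which is what lets a cache tell versions apart.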
Methods of Browser Cache Management
Cache management uses its own algorithms to determine which resources should remain and which should be removed. These heuristics matter in particular when the resource has not been provided with appropriate information.
- Keep resources indefinitely as long as the allotted disk space is sufficient; clean up only when space runs short.
- Resources that were used a short time ago may soon be needed again. Resources that have not been accessed for a long time are candidates for deletion.
- Resources that are requested frequently should be retained. Resources that are requested rarely and have gone unrequested for a long time are candidates for deletion.
- File size – delete very large, long-unused and infrequently used resources first; prefer keeping many small resources over one large one.
- Stability – content that has changed multiple times is a candidate for deletion. Resources that have remained unchanged over years and multiple validity checks should be kept if they are used frequently.
- A short remaining lifetime is an indication of volatility, even though passing the expiration time does not necessarily mean that the resource is unusable.
- Requests that carry POST data in addition to the URL are usually not repeatable, and their responses are usually not cached. Caching them would also be quite dangerous, because such data often changes the content of the resulting page, and a required confirmation would never be sent to the server.
- If a query string was recognized in the URL (e.g. for a database query), algorithms originally refrained from storing the response, because combining query parameters produced many different URLs that were never reused. Increasingly, however, content management systems present all their pages, including static ones, in this URL format, so this assumption is no longer reliable.
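The recency, frequency and file-size criteria above can be combined into a single ranking. The following Python sketch is one possible way to do that, not how any particular browser works; the `Entry` fields and the scoring formula are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    url: str
    size: int         # bytes on disk
    last_used: float  # seconds since the epoch
    hits: int         # how often the entry was served from the cache


def eviction_order(entries: list[Entry], now: float) -> list[Entry]:
    """Sort entries so that the best deletion candidates come first."""
    def score(e: Entry) -> float:
        idle = now - e.last_used
        # Assumed formula: large, long-idle, rarely used entries score highest.
        return idle * e.size / (1 + e.hits)
    return sorted(entries, key=score, reverse=True)


def evict(entries: list[Entry], now: float, budget: int) -> list[Entry]:
    """Remove candidates until the total size fits the disk budget."""
    kept = list(entries)
    for victim in eviction_order(entries, now):
        if sum(e.size for e in kept) <= budget:
            break  # enough space: leave the remaining resources alone
        kept.remove(victim)
    return kept
```

Nothing is deleted while the budget is met, which mirrors the first rule in the list: resources stay indefinitely until space runs short.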
The resource is provided with an expiration time or a lifetime counted from retrieval (e.g. “three days”), from which the time of invalidity can be calculated:
- Expires: with a specific date and time
- Cache-Control: max-age= with a value in seconds as a relative indication
Example: weather report; always valid for the following 15 minutes.
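To make the calculation concrete, here is a small Python sketch that derives the time of invalidity from the two fields. The function name `expiry_time` and the dictionary of headers are assumptions for this example; when both fields are present, max-age is given priority, as the HTTP caching specification prescribes.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def expiry_time(headers: dict[str, str], fetched_at: datetime) -> Optional[datetime]:
    """Return when the response stops being fresh, or None if unspecified."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            # Relative indication: seconds counted from the time of retrieval.
            return fetched_at + timedelta(seconds=int(value))
    if "Expires" in headers:
        try:
            # Absolute indication: a specific date and time.
            return parsedate_to_datetime(headers["Expires"])
        except (TypeError, ValueError):
            # An unreadable date is treated as already expired.
            return fetched_at
    return None


fetched = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
weather = {"Cache-Control": "max-age=900"}  # 15 minutes, as in the example
print(expiry_time(weather, fetched))
```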
- However, expiry does not necessarily lead to deletion from the cache, only to a validation check, which can extend the validity period if the content has remained unchanged.
- If the time specified in Expires is already in the past at the time of the query, this version cannot be placed in the cache; any stored information about this URL would have to be deleted.
- If the server supplies no information about the validity period, the cache can deduce from the time of the last change, and if necessary from behavior it has recorded itself, whether the resource changes frequently or is stable. If it was last changed three years ago, the resource is probably quite stable; if it is only a quarter of an hour old, or has changed twice during the last day, it should be rechecked at short intervals. How exactly cache management deals with missing meta information is left to the ingenuity of the programmers. It would be clumsy, and a waste of time and network capacity, to retrieve a large file from the web server every time just because this information is missing.
- The user configuration may also specify a maximum age, for example two weeks.
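One way to turn the time of the last change into a lifetime is to treat a fraction of the resource’s age as its remaining validity. The Python sketch below assumes a fraction of 10 percent, a value the HTTP caching specification mentions as typical, and a two-week ceiling standing in for the user-configured maximum age; both constants and the function name are assumptions for this example.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE_CAP = timedelta(weeks=2)  # assumed user-configured maximum age
FRACTION = 0.1                    # assumed share of the time since last change


def heuristic_lifetime(last_modified: datetime, fetched_at: datetime) -> timedelta:
    """Estimate how long a resource without expiry information stays fresh."""
    # A resource that changed recently gets a short lifetime,
    # one that has been stable for years gets the maximum.
    age = max(fetched_at - last_modified, timedelta(0))
    return min(age * FRACTION, MAX_AGE_CAP)
```

With these assumptions, a resource changed a quarter of an hour ago is rechecked after about 90 seconds, while one unchanged for three years is kept for the full two weeks.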
A resource can announce that it should not be kept in the cache and must be retrieved fresh from the server every time a page is loaded:
- Cache-Control: no-cache
- Pragma: no-cache
Traditionally, these two fields are transmitted together, although they are redundant, in the hope that the browser will understand at least one of them. This is often combined with an Expires date of January 1, 1970, which has the same effect.
Example: stock market prices; they change every second.
More precisely, a browser would be allowed to keep the resource in the cache; however, it would have to make sure that the copy is still up to date by checking with the server before each use. It is at the discretion of the browser implementers whether to handle it this way or simply not to cache information that is likely to become obsolete in the first place.
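A cache can detect this request with a simple header check. The Python sketch below assumes the two fields in question are Cache-Control: no-cache and Pragma: no-cache; the function name is made up for this example.

```python
def must_revalidate_each_time(headers: dict[str, str]) -> bool:
    """True if the response asks to be checked with the server before every use."""
    # Cache-Control may carry several comma-separated directives.
    cache_control = {
        part.strip().lower().partition("=")[0]
        for part in headers.get("Cache-Control", "").split(",")
    }
    # Pragma is the older, redundant way of saying the same thing.
    pragma = headers.get("Pragma", "").strip().lower()
    return "no-cache" in cache_control or pragma == "no-cache"
```

The function only reports what the server asked for; whether the browser stores the copy and revalidates it, or skips caching altogether, remains the implementer’s decision.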
Once the expiration time has passed, the browser has to assume that the resource currently in the cache may be outdated. It then has two options:
- Query only the short HEAD information of the resource from the server (initially without the complete content), evaluate the result itself and then, if necessary, request the content (GET).
- Send the known version information (Last-Modified/ETag) to the server. The server either responds with the HTTP status code 304 Not Modified (the version is still valid) or sends a new version; in the worst case, an error such as 404 Not Found is sent. See HTTP Error Codes.
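The second option can be sketched from the server’s point of view. In HTTP, the browser sends the known ETag in the If-None-Match request field and the known modification time in If-Modified-Since. The Python function below is a simplified illustration under stated assumptions: the name `respond` and its parameters are hypothetical, and weak ETags are not handled.

```python
from email.utils import parsedate_to_datetime


def respond(request_headers: dict[str, str], etag: str,
            last_modified: str, body: bytes) -> tuple[int, dict[str, str], bytes]:
    """Answer a conditional GET with 304 Not Modified or a full 200 response."""
    validators = {"ETag": etag, "Last-Modified": last_modified}
    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_none_match is not None:
        # The ETag comparison takes precedence when the client sent one.
        unchanged = etag in [t.strip() for t in if_none_match.split(",")]
    elif if_modified_since is not None:
        unchanged = (parsedate_to_datetime(last_modified)
                     <= parsedate_to_datetime(if_modified_since))
    else:
        unchanged = False
    if unchanged:
        return 304, validators, b""  # cached copy is still valid, no body sent
    return 200, validators, body     # new version with complete content
```

A 304 response carries no content, so validating an unchanged resource costs only a short exchange of headers.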
If one of the comparatively rare HTTP request methods POST, PUT, or DELETE (used, for example, in forms) is sent to a URL, the cache entry for this URL has to be deleted, because the request could have changed the resource on the server.
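This invalidation rule can be expressed in a few lines. The Python sketch below uses a hypothetical `TinyCache` class and a caller-supplied `fetch` function that stands in for the actual network request; expiry handling is deliberately left out.

```python
from typing import Callable

UNSAFE_METHODS = {"POST", "PUT", "DELETE"}


class TinyCache:
    """Minimal URL-keyed cache that forgets entries after unsafe requests."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def handle(self, method: str, url: str,
               fetch: Callable[[str, str], str]) -> str:
        method = method.upper()
        if method in UNSAFE_METHODS:
            # The request may have changed the resource on the server,
            # so the stored copy can no longer be trusted.
            self._store.pop(url, None)
            return fetch(method, url)
        if method == "GET":
            if url not in self._store:
                self._store[url] = fetch(method, url)
            return self._store[url]
        return fetch(method, url)  # other methods bypass the cache
```

After a POST to a URL, the next GET for the same URL goes back to the server instead of being answered from the cache.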
By default, each successfully transferred resource is stored as a single file on the hard disk. If the transfer is interrupted or the computer even crashes, page loading can be resumed from the intermediate results. In the early years of browsers, physical main memory was also limited and the networks were slow, so this approach could hardly be avoided. This applies to all resources of the currently displayed pages, regardless of whether a cache is used.
- Once the page display is complete, those files that can be reused later are transferred to the data structure of the cache; all other files will be marked as temporary and deleted in due course.
- In the case of particularly sensitive resources (e.g. financial transactions, account data), traces could remain on the hard drive; for example, because “deleting” a file only means removing it from the visible file system, but not immediately physically overwriting it. If the browser or computer crashes, files could be left behind; even after apparent deletion, the physical hard drive could still contain sensitive information that would still be readable when logging into the user account in question.
- A browser should therefore not write such resources to the hard disk at all, but keep them only in volatile main memory.
- However, there is a loophole in this concept: an application can rarely obtain main memory from the operating system that is guaranteed never to be swapped out to the swap file on disk.