crawler information #
schist is a decentralized search engine. It is currently alpha software.
The schist project is committed to maintaining schist as a cooperative crawler. If your website is receiving excessive traffic from a schist instance, or requests which do not match the behavior described here, please open an issue.
user agent string #
When retrieving resources from the web, schist uses the User-Agent header to identify itself and to indicate the purpose of the retrieval. schist user agent strings look like this:
Mozilla/5.0 (compatible; schist/<version>; +https://schist.net/crawlerinfo) (<purpose>)
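As an illustrative sketch (not part of schist itself), a site operator could pick schist requests out of access logs with a pattern matching this format; the version "0.3.1" below is a made-up example:

```python
import re

# Matches the schist User-Agent format described above. The <version>
# and <purpose> fields are filled in by the crawler at runtime.
SCHIST_UA = re.compile(
    r"Mozilla/5\.0 \(compatible; schist/(?P<version>[^;]+); "
    r"\+https://schist\.net/crawlerinfo\) \((?P<purpose>[^)]+)\)"
)

ua = "Mozilla/5.0 (compatible; schist/0.3.1; +https://schist.net/crawlerinfo) (automatic crawling)"
m = SCHIST_UA.match(ua)
if m:
    version, purpose = m.group("version"), m.group("purpose")
```

The named groups let log-processing tools separate the crawler version from the retrieval purpose in a single pass.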
denied and allowed resources #
schist recognizes two methods of indicating which resources are denied to crawlers and which resources are allowed to crawlers:
- robots.txt, per the Robots Exclusion Protocol.
- <meta name="robots">, for HTML resources only. The resource must:
  - Have a Content-Type header of text/html or application/xhtml+xml.
  - Be syntactically valid according to the indicated Content-Type.
  - Follow HTML's permitted content/permitted parents rules.
If a given resource is not covered by either, schist will assume it is intended to be allowed to crawlers.
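As an illustrative sketch (not schist's implementation), the two checks can be expressed like this; the user agent token "schist" and the noindex-only interpretation of the meta directives are assumptions:

```python
from html.parser import HTMLParser
from urllib import robotparser

def robots_txt_allows(robots_txt: str, resource_url: str) -> bool:
    """Check a resource against already-fetched robots.txt rules."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # "schist" as the user agent token is an assumption for this sketch.
    return rp.can_fetch("schist", resource_url)

class MetaRobots(HTMLParser):
    """Collect the directives of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

def meta_allows(body: str, content_type: str) -> bool:
    # The meta check only applies to the HTML content types listed above.
    if content_type.split(";")[0].strip() not in ("text/html", "application/xhtml+xml"):
        return True
    parser = MetaRobots()
    parser.feed(body)
    return "noindex" not in parser.directives
```

A resource would then be treated as allowed only if both checks pass; anything not covered by either check defaults to allowed, as stated above.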
retrieval purposes #
There are a few different reasons schist might make a request for a resource on your website:
user-initiated retrievals: In this case, schist assumes the role of a browser. It retrieves resources only when instructed to by a user. User-initiated retrievals are performed even on resources which are denied to crawlers—just like a browser would retrieve and display a denied-to-crawlers webpage and allow the user to bookmark it.
If a resource is denied to crawlers, schist will only allow it to be indexed privately, for the single user who bookmarked it: it will not appear in search results for anyone else.
automatic crawling: In this case, schist assumes the role of a crawler. It retrieves and indexes resources by recursively following links, but only resources which are allowed to crawlers, as you would probably expect.
crawling instructions retrieval: This purpose is only used when retrieving robots.txt. robots.txt is always allowed to crawlers.
automatic self-tests: schist's test suite performs retrievals of built-in resources. These should be served by the test suite itself; they should never result in requests to any website.
public service usage: schist relies on information from certain sources in order to operate. This information is cached in order to limit requests to a reasonable volume. If you are the owner of one of the following websites and want schist to stop using it as a public service, please open an issue.
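The automatic-crawling purpose described above amounts to a breadth-first traversal that skips denied resources. A minimal sketch, assuming two hypothetical helpers fetch() and allowed_to_crawlers() that are not part of schist:

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed, fetch, allowed_to_crawlers, limit=100):
    """Recursively follow links from `seed`, indexing only allowed resources.

    `fetch(url)` -> (body, link_urls) and `allowed_to_crawlers(url)` -> bool
    are assumed helpers supplied by the caller.
    """
    seen, queue, indexed = {seed}, deque([seed]), []
    while queue and len(indexed) < limit:
        url = queue.popleft()
        if not allowed_to_crawlers(url):
            continue  # denied resources are never retrieved during crawling
        body, links = fetch(url)
        indexed.append(url)
        for link in links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return indexed
```

Note that the allowed check happens before any retrieval, so a denied resource discovered via a link is never requested at all.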
Regardless of the purpose, schist always rate-limits retrievals.
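The exact limits are not specified here, but per-host rate limiting in this spirit can be sketched as follows; the interval value is an assumption:

```python
import time
from urllib.parse import urlparse

class HostRateLimiter:
    """Allow at most one request per `interval` seconds to each host.

    The 5-second default is illustrative, not schist's actual limit.
    """
    def __init__(self, interval: float = 5.0):
        self.interval = interval
        self.last_request: dict[str, float] = {}

    def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        earliest = self.last_request.get(host, 0.0) + self.interval
        now = time.monotonic()
        if now < earliest:
            time.sleep(earliest - now)
        self.last_request[host] = time.monotonic()
```

Calling wait() before each retrieval blocks only when the same host was contacted too recently; requests to different hosts proceed independently.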