crawler information #

schist is a decentralized search engine. It is currently alpha software.

The schist project is committed to maintaining schist as a cooperative crawler. If your website is receiving excessive traffic from a schist instance, or requests which do not match the behavior described here, please open an issue.

user agent string #

When retrieving resources from the web, schist uses the User-Agent header to identify itself and indicate the purpose of the retrieval. schist user agent strings look like this:

Mozilla/5.0 (compatible; schist/<version>; +https://schist.net/crawlerinfo) (<purpose>)
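As a sketch, the template above can be filled in like this; the version and purpose values here are hypothetical examples, not real schist releases:

```python
# Illustrative: composing a schist-style User-Agent string from the
# template documented above. Version and purpose are example values.
def build_user_agent(version: str, purpose: str) -> str:
    return (
        f"Mozilla/5.0 (compatible; schist/{version}; "
        f"+https://schist.net/crawlerinfo) ({purpose})"
    )

ua = build_user_agent("0.3.1", "automatic crawling")
print(ua)
# Mozilla/5.0 (compatible; schist/0.3.1; +https://schist.net/crawlerinfo) (automatic crawling)
```

A site operator can match on the substring `schist/` (or on the `+https://schist.net/crawlerinfo` link) to identify schist traffic in server logs.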

denied and allowed resources #

schist recognizes two methods of indicating which resources are denied to crawlers and which resources are allowed to crawlers:

  • robots.txt

  • <meta name="robots">, for HTML resources only. The resource must:

    • Have a Content-Type header of text/html or application/xhtml+xml.
    • Be syntactically valid according to the indicated Content-Type.
    • Follow HTML's permitted content/permitted parents rules.

If a given resource is not covered by either method, schist assumes it is intended to be allowed to crawlers.
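A minimal sketch of checking both signals, using only the Python standard library; this is an illustration of the two methods described above, not schist's actual implementation:

```python
# Illustrative sketch: the two crawler signals described above.
# robots.txt is checked with the standard-library parser, and
# <meta name="robots"> with a minimal HTML scan.
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

def robots_txt_allows(robots_txt: str, user_agent: str, path: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

class MetaRobotsParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

def meta_allows_indexing(html: str) -> bool:
    p = MetaRobotsParser()
    p.feed(html)
    return "noindex" not in p.directives

robots = "User-agent: *\nDisallow: /private/\n"
print(robots_txt_allows(robots, "schist", "/private/page"))  # False
print(meta_allows_indexing('<meta name="robots" content="noindex, nofollow">'))  # False
```

Note that a real crawler would also have to verify the `Content-Type` and validity requirements listed above before trusting the `<meta>` tag.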

retrieval purposes #

There are a few different reasons schist might make a request for a resource on your website:

  • user-initiated retrievals: In this case, schist assumes the role of a browser. It retrieves resources only when instructed to by a user. User-initiated retrievals are performed even on resources which are denied to crawlers, just as a browser would retrieve and display a denied-to-crawlers webpage and allow the user to bookmark it.

    If a resource is denied to crawlers, schist will only allow it to be indexed privately, for the single user who bookmarked it: it will not appear in search results for anyone else.

  • automatic crawling: In this case, schist assumes the role of a crawler. It retrieves and indexes resources by recursively following links, but only resources which are allowed to crawlers, as you would probably expect.

  • crawling instructions retrieval: This purpose is only used when retrieving robots.txt. robots.txt is always allowed to crawlers.

  • automatic self-tests: schist's test suite performs retrievals of built-in resources. These should be served by the test suite itself; they should never result in requests to any website.

  • public service usage: schist relies on information from certain sources in order to operate. This information is cached in order to limit requests to a reasonable volume. If you are the owner of one of the following websites and want schist to stop using it as a public service, please open an issue.

Regardless of the purpose, schist always rate-limits retrievals.
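Per-host rate limiting of this kind can be sketched as a minimum delay between successive requests to the same host; the interval below is a made-up value for illustration, not schist's actual policy:

```python
# Illustrative sketch of per-host rate limiting; the interval is an
# example value, not schist's real configuration.
import time

class HostRateLimiter:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_request = {}  # host -> timestamp of previous retrieval

    def wait(self, host: str) -> None:
        """Block until at least min_interval has passed for this host."""
        now = time.monotonic()
        elapsed = now - self.last_request.get(host, float("-inf"))
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[host] = time.monotonic()

limiter = HostRateLimiter(min_interval=1.0)
limiter.wait("example.com")  # first request: no delay
limiter.wait("example.com")  # second request: sleeps ~1 second
```

Requests to different hosts do not delay one another, so the limiter bounds the load placed on any single website.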