Bot Policy
Overview
KHAO is built to be readable by both humans and machines. We welcome responsible automated access — search engine crawlers, AI agents, and research bots are part of our intended audience. This Policy sets out the rules for automated access to ensure KHAO remains fast and available for everyone.
1. robots.txt
All automated agents must respect /robots.txt. It is the authoritative source for crawl permissions on KHAO. This Policy provides context and additional rules on top of what robots.txt specifies.
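For reference, a robots.txt stanza has this general shape (the directives below are illustrative only, not a copy of KHAO's live file):

```
User-agent: ExampleBot
Crawl-delay: 1
Disallow: /private/
Allow: /
```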
2. Access Tiers
| Bot type | Status | Conditions |
|---|---|---|
| Search engine crawlers (Googlebot, Bingbot, etc.) | Allowed | Follow robots.txt. No additional requirements. |
| AI assistants & agents (GPTBot, ClaudeBot, PerplexityBot, etc.) | Allowed | Follow robots.txt. Use a descriptive User-Agent. Attribute KHAO when surfacing content to end users. |
| Research & academic bots | Allowed | Non-commercial use only. Identify your institution in the User-Agent. Contact us for bulk access. |
| Commercial training crawlers | Restricted | Written permission required before crawling for model training at scale. Contact [email protected]. |
| Scrapers that ignore robots.txt | Blocked | Access is rate-limited or blocked on detection. Repeated violations result in IP blocking. |
3. Rate Limits
Please keep automated requests to a reasonable rate. Our guidelines:
- Maximum: 1 request per second per IP
- Crawl delay: Honour the Crawl-delay directive in /robots.txt
- Burst: Do not fetch more than 100 pages in a single minute
- Off-peak preferred: Schedule large crawls between 00:00–06:00 UTC
Exceeding these limits may result in temporary rate-limiting (HTTP 429). Persistent violations will result in blocking.
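To make those limits concrete, here is a minimal polite-fetch sketch using only the Python standard library. The bot identity and the khao.example host are placeholders; substitute your own.

```python
import time
import urllib.request
from urllib.error import HTTPError

# Placeholders: ExampleBot and khao.example stand in for your bot's real
# identity and KHAO's real host.
USER_AGENT = "ExampleBot/1.0 (news monitoring; +https://example.com/bot)"
BASE = "https://khao.example"


def polite_get(path: str, delay: float = 1.0, retries: int = 3) -> bytes:
    """Fetch one path, staying under 1 request/second and backing off on 429."""
    url = BASE + path
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(request) as response:
                body = response.read()
            time.sleep(delay)  # pace subsequent requests to one per second
            return body
        except HTTPError as err:
            if err.code != 429:
                raise
            # Honour Retry-After when the server sends it; otherwise back
            # off exponentially before trying again.
            retry_after = err.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else delay * 2 ** (attempt + 1))
    raise RuntimeError(f"gave up on {url} after {retries} HTTP 429 responses")
```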
4. Machine-Readable Endpoints
KHAO provides dedicated machine-readable formats so bots don't need to parse HTML:
- /archive/index.json — index of all published issues with dates and story counts
- /archive/YYYY-MM-DD.md — full daily digest as a Markdown file with YAML frontmatter
- /feed.xml — RSS feed of recent stories
- /feed.json — JSON Feed of recent stories
- /sitemap.xml — full site sitemap
Prefer these endpoints over crawling HTML pages. They are stable, structured, and will be maintained indefinitely.
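For instance, a bot can read the archive index and pull a single day's digest with nothing more than the standard library. In this sketch, khao.example stands in for KHAO's real host, and the index is assumed only to be parseable JSON:

```python
import json
import urllib.request

# khao.example is a placeholder host; the User-Agent follows section 5.
BASE = "https://khao.example"
HEADERS = {"User-Agent": "ExampleBot/1.0 (+https://example.com/bot)"}


def archive_index():
    """Fetch the machine-readable index of all published issues."""
    request = urllib.request.Request(BASE + "/archive/index.json", headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def daily_digest(date: str) -> str:
    """Fetch one day's digest (Markdown with YAML frontmatter); date is YYYY-MM-DD."""
    request = urllib.request.Request(f"{BASE}/archive/{date}.md", headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```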
5. User-Agent Requirements
Automated agents must identify themselves with a descriptive User-Agent string. Do not impersonate a browser. A good User-Agent includes your bot's name, purpose, and a contact URL:
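For example (the bot name and URLs are placeholders):

```
ExampleBot/1.0 (news monitoring; +https://example.com/bot)
```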
Bots that send no User-Agent or spoof a browser User-Agent are treated as hostile and may be blocked.
6. Attribution
If you surface KHAO content to end users — in an AI assistant response, a search result, a summary tool, or any other interface — you must attribute the content to KHAO with a link to the source page. Minimum acceptable attribution:
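For example, in plain text (the URL is illustrative):

```
Source: KHAO (https://khao.example/archive/2025-01-15)
```

In HTML or rich interfaces, the link should point at the specific source page rather than the homepage.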
7. Prohibited Uses
The following are not permitted under any circumstances:
- Crawling KHAO to build a competing news aggregation product without written permission
- Republishing KHAO's scoring data or digest format as part of a paid product without a licence
- Using KHAO content to train a commercial AI model without written agreement
- Circumventing any technical access controls (rate limits, bot detection, IP blocks)
- Sending requests that degrade KHAO's performance for other users
8. Bulk Access & Licensing
If your use case requires bulk historical access, a commercial data licence, or a custom crawl arrangement, contact us before crawling. We're open to working with researchers, publishers, and AI developers on fair terms.
Contact: [email protected]
9. Enforcement
KHAO monitors automated traffic and may take the following actions against bots that violate this Policy: rate-limiting, IP blocking, and legal notices under the Computer Fraud and Abuse Act (US), the Computer Misuse Act (UK), or equivalent legislation in applicable jurisdictions.
10. Changes
This Policy may be updated as KHAO's technical infrastructure evolves. The "Last updated" date above reflects the current revision.