Bot Policy
Overview
KHAO is built to be readable by both humans and machines. We welcome responsible automated access — search engine crawlers, AI agents, and research bots are part of our intended audience. This Policy sets out the rules for automated access to ensure KHAO remains fast and available for everyone.
1. robots.txt
All automated agents must respect /robots.txt. It is the authoritative source for crawl permissions on KHAO. This Policy provides context and additional rules on top of what robots.txt specifies.
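For reference, a robots.txt stanza has this general shape (the directives below are illustrative only, not a copy of KHAO's live file):

```
User-agent: ExampleBot
Crawl-delay: 1
Disallow: /private/
Allow: /
```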
2. Access Tiers
| Bot type | Status | Conditions |
|---|---|---|
| Search engine crawlers (Googlebot, Bingbot, etc.) | Allowed | Follow robots.txt. No additional requirements. |
| AI assistants & agents (GPTBot, ClaudeBot, PerplexityBot, etc.) | Allowed | Follow robots.txt. Use a descriptive User-Agent. Attribute KHAO when surfacing content to end users. |
| Research & academic bots | Allowed | Non-commercial use only. Identify your institution in the User-Agent. Contact us for bulk access. |
| Commercial training crawlers | Restricted | Written permission required before crawling for model training at scale. Contact [email protected]. |
| Scrapers that ignore robots.txt | Blocked | Access is rate-limited or blocked on detection. Repeated violations result in IP blocking. |
3. Rate Limits
Please keep automated requests to a reasonable rate. Our guidelines:
- Maximum: 1 request per second per IP
- Crawl delay: Honour the Crawl-delay directive in /robots.txt
- Burst: Do not fetch more than 100 pages in a single minute
- Off-peak preferred: Schedule large crawls between 00:00–06:00 UTC
Exceeding these limits may result in temporary rate-limiting (HTTP 429). Persistent violations will result in blocking.
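To make those limits concrete, here is a minimal polite-fetch sketch using only the Python standard library. The bot identity and the khao.example host are placeholders; substitute your own.

```python
import time
import urllib.request
from urllib.error import HTTPError

# Placeholders: ExampleBot and khao.example stand in for your bot's real
# identity and KHAO's real host.
USER_AGENT = "ExampleBot/1.0 (news monitoring; +https://example.com/bot)"
BASE = "https://khao.example"


def polite_get(path: str, delay: float = 1.0, retries: int = 3) -> bytes:
    """Fetch one path, staying under 1 request/second and backing off on 429."""
    url = BASE + path
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(request) as response:
                body = response.read()
            time.sleep(delay)  # pace subsequent requests to one per second
            return body
        except HTTPError as err:
            if err.code != 429:
                raise
            # Honour Retry-After when the server sends it; otherwise back
            # off exponentially before trying again.
            retry_after = err.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else delay * 2 ** (attempt + 1))
    raise RuntimeError(f"gave up on {url} after {retries} HTTP 429 responses")
```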
4. Machine-Readable Endpoints
KHAO provides dedicated machine-readable formats so bots don't need to parse HTML:
- /archive/index.json — index of all published issues with dates and story counts
- /archive/YYYY-MM-DD.md — full daily digest as a Markdown file with YAML frontmatter
- /feed.xml — RSS feed of recent stories
- /feed.json — JSON Feed of recent stories
- /sitemap.xml — full site sitemap
Prefer these endpoints over crawling HTML pages. They are stable, structured, and will be maintained indefinitely.
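For instance, a bot can read the archive index and pull a single day's digest with nothing more than the standard library. In this sketch, khao.example stands in for KHAO's real host, and the index is assumed only to be parseable JSON:

```python
import json
import urllib.request

# khao.example is a placeholder host; the User-Agent follows section 5.
BASE = "https://khao.example"
HEADERS = {"User-Agent": "ExampleBot/1.0 (+https://example.com/bot)"}


def archive_index():
    """Fetch the machine-readable index of all published issues."""
    request = urllib.request.Request(BASE + "/archive/index.json", headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def daily_digest(date: str) -> str:
    """Fetch one day's digest (Markdown with YAML frontmatter); date is YYYY-MM-DD."""
    request = urllib.request.Request(f"{BASE}/archive/{date}.md", headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```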
5. User-Agent Requirements
Automated agents must identify themselves with a descriptive User-Agent string. Do not impersonate a browser. A good User-Agent includes your bot's name, purpose, and a contact URL:
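For example (the bot name and URLs are placeholders):

```
ExampleBot/1.0 (news monitoring; +https://example.com/bot)
```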
Bots that send no User-Agent or spoof a browser User-Agent are treated as hostile and may be blocked.
6. Attribution
If you surface KHAO content to end users — in an AI assistant response, a search result, a summary tool, or any other interface — you must attribute the content to KHAO with a link to the source page. Minimum acceptable attribution:
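For example, in plain text (the URL is illustrative):

```
Source: KHAO (https://khao.example/archive/2025-01-15)
```

In HTML or rich interfaces, the link should point at the specific source page rather than the homepage.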
7. Prohibited Uses
The following are not permitted under any circumstances:
- Crawling KHAO to build a competing news aggregation product without written permission
- Republishing KHAO's scoring data or digest format as part of a paid product without a licence
- Using KHAO content to train a commercial AI model without written agreement
- Circumventing any technical access controls (rate limits, bot detection, IP blocks)
- Sending requests that degrade KHAO's performance for other users
8. Bulk Access & Licensing
If your use case requires bulk historical access, a commercial data licence, or a custom crawl arrangement, contact us before crawling. We're open to working with researchers, publishers, and AI developers on fair terms.
Contact: [email protected]
9. Enforcement
KHAO monitors automated traffic and may take the following actions against bots that violate this Policy: rate-limiting, IP blocking, and legal notices under the Computer Fraud and Abuse Act (US), the Computer Misuse Act (UK), or equivalent legislation in applicable jurisdictions.
10. Changes
This Policy may be updated as KHAO's technical infrastructure evolves. The "Last updated" date above reflects the current revision.