# KHAO Data Schema

KHAO publishes a daily AI & technology news digest in multiple machine-readable formats. This document describes each format's fields so external tools, RAG pipelines, and AI agents can ingest the data without guesswork.

---

## 1. JSON Feed (`/feed.json`)

Format: [JSON Feed 1.1](https://jsonfeed.org/version/1.1)

Top 40 stories from the most recent digest, sorted by importance score descending.

### Feed-level fields

| Field | Type | Description |
|---|---|---|
| `version` | string | Always `"https://jsonfeed.org/version/1.1"` |
| `title` | string | `"KHAO · Daily AI Signal"` |
| `home_page_url` | string | Canonical site URL |
| `feed_url` | string | Self-referencing feed URL |
| `description` | string | One-line description |
| `language` | string | Always `"en"` |
| `items` | array | Array of story objects (see below) |

### Item-level fields

| Field | Type | Description |
|---|---|---|
| `id` | string | Canonical URL (story page or original article) |
| `url` | string | Same as `id` |
| `title` | string | Story headline, emoji-stripped |
| `content_text` | string | Summary + key facts, plain text, ≤500 chars |
| `date_published` | string | ISO 8601 timestamp |
| `tags` | array | Category + topic tags (e.g. `["models", "openai"]`) |
| `_importance` | int | KHAO importance score, 0–100 (see section 6) |
| `_source` | string | Publishing outlet name (e.g. `"Anthropic"`) |
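
As an illustration of the item fields above, here is a minimal Python sketch that filters a feed by `_importance`. The sample payload, its URLs and outlet names, the `high_signal_items` helper, and the default threshold of 60 are all assumptions made for the example; only the field names come from the tables above.

```python
import json

# Hypothetical sample shaped like the tables above; all values are made up.
SAMPLE_FEED = """
{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "KHAO · Daily AI Signal",
  "items": [
    {"id": "https://example.com/a", "url": "https://example.com/a",
     "title": "Story A", "content_text": "Summary A.",
     "date_published": "2026-04-21T09:30:00Z",
     "tags": ["models", "openai"], "_importance": 82,
     "_source": "Example Outlet"},
    {"id": "https://example.com/b", "url": "https://example.com/b",
     "title": "Story B", "content_text": "Summary B.",
     "date_published": "2026-04-21T08:00:00Z",
     "tags": ["business"], "_importance": 41,
     "_source": "Another Outlet"}
  ]
}
"""


def high_signal_items(feed_text: str, min_importance: int = 60) -> list:
    """Return the items whose `_importance` is at or above the threshold."""
    feed = json.loads(feed_text)
    # `_importance` is a KHAO extension field; treat a missing value as 0.
    return [
        item for item in feed.get("items", [])
        if item.get("_importance", 0) >= min_importance
    ]


for item in high_signal_items(SAMPLE_FEED):
    print(item["_importance"], item["title"], item["_source"])
```

In a real pipeline the text would come from fetching `/feed.json` rather than from an inline string.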

---

## 2. RSS Feed (`/feed.xml`)

Format: RSS 2.0 with Atom and KHAO namespace extensions.

Namespace: `xmlns:khao="https://khao.ai/ns/1.0"`

### Standard item elements

| Element | Description |
|---|---|
| `<title>` | Story headline |
| `<link>` | Canonical URL |
| `<guid>` | Same as link, `isPermaLink="true"` |
| `<pubDate>` | RFC 822 timestamp |
| `<category>` | Category (Models, Research, Business, etc.; see section 7 for the full taxonomy) |
| `<description>` | Summary text, ≤500 chars |

### KHAO extension elements

| Element | Description |
|---|---|
| `<khao:importance>` | Importance score 0–100 (integer) |
| `<khao:source>` | Publishing outlet name |
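
The extension elements live in the KHAO namespace, so a parser has to map the `khao` prefix to the namespace URI. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample document and the `parse_items` helper are hypothetical, built only from the element names listed above.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Prefix-to-URI mapping for the KHAO extension namespace.
KHAO_NS = {"khao": "https://khao.ai/ns/1.0"}

# Hypothetical sample; values are made up for illustration.
SAMPLE_RSS = """<rss version="2.0" xmlns:khao="https://khao.ai/ns/1.0">
  <channel>
    <title>KHAO</title>
    <item>
      <title>Story A</title>
      <link>https://example.com/a</link>
      <guid isPermaLink="true">https://example.com/a</guid>
      <pubDate>Tue, 21 Apr 2026 09:30:00 GMT</pubDate>
      <category>Models</category>
      <description>Summary A.</description>
      <khao:importance>82</khao:importance>
      <khao:source>Example Outlet</khao:source>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text: str) -> list:
    """Extract standard and KHAO extension fields from each <item>."""
    root = ET.fromstring(rss_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # RFC 822 dates parse with the standard email utilities.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
            "category": item.findtext("category"),
            # Namespaced lookups need the prefix mapping passed explicitly.
            "importance": int(item.findtext(
                "khao:importance", default="0", namespaces=KHAO_NS)),
            "source": item.findtext("khao:source", namespaces=KHAO_NS),
        })
    return stories


for story in parse_items(SAMPLE_RSS):
    print(story["importance"], story["title"], story["source"])
```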

---

## 3. Daily Markdown Digests (`/archive/YYYY-MM-DD.md`)

One file per day. YAML frontmatter followed by story blocks.

### YAML Frontmatter

```yaml
---
title: "KHAO — Tuesday, April 21, 2026"
issue_date: 2026-04-21
kind: daily-digest
story_count: 30
source_count: 18
generated_at: "2026-04-21T06:12:00Z"
---
```
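
A minimal sketch of splitting the frontmatter from the story blocks, using only the Python standard library. It assumes the frontmatter stays a flat list of `key: value` pairs as in the example above; the `parse_frontmatter` helper and the sample text are hypothetical, and a full YAML parser would be more robust if the format grows.

```python
# Hypothetical digest excerpt, trimmed to a few frontmatter keys.
SAMPLE_DIGEST = """\
---
issue_date: 2026-04-21
kind: daily-digest
story_count: 30
source_count: 18
generated_at: "2026-04-21T06:12:00Z"
---
## Section · Models  (8)
"""


def parse_frontmatter(text: str) -> tuple:
    """Return (metadata dict, remaining body); values are kept as strings."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter present
    meta = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the body.
            return meta, "\n".join(lines[index + 1:])
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # unterminated frontmatter: treat as absent


meta, body = parse_frontmatter(SAMPLE_DIGEST)
print(meta["issue_date"], int(meta["story_count"]))
```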

### Story Block Format

```markdown
## Section · Models  (8)

## ★★ 82 · Story Title Here
**Source Name · 2026-04-21T09:30:00Z**
Two-sentence summary of the story. Second sentence with key detail.
Tags: `tag1`, `tag2`
Read: <https://original-article-url>

### Body
Full article body text, paragraphs separated by blank lines.

---
```

### Story Block Fields

| Field | Description |
|---|---|
| `★★ 82` | Star glyphs (visual indicator) followed by the importance score (0–100) |
| Title | Headline, cleaned of emoji and attribution kickers |
| Source line | `**Outlet · ISO-timestamp**` |
| Summary | 1–3 sentences, plain English |
| Tags | Comma-separated topic tags in backticks |
| Read | Original article URL |
| Body | Article body as an extractive condensation (40–60% of the original length) |
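
The story header lines can be picked apart with a few regular expressions. The following Python sketch is an assumption about how a consumer might do it, based solely on the block layout shown above; the patterns and the `parse_story_headers` helper are hypothetical, not part of KHAO's tooling.

```python
import re

# Story heading: "## <stars> <score> · <title>". Section headings such as
# "## Section · Models  (8)" do not match because they lack the star prefix.
HEADING_RE = re.compile(r"^## (?P<stars>★+) (?P<score>\d{1,3}) · (?P<title>.+)$")
# Source line: "**<outlet> · <ISO timestamp>**"
SOURCE_RE = re.compile(r"^\*\*(?P<source>.+) · (?P<timestamp>\S+)\*\*$")

SAMPLE_BLOCK = """\
## Section · Models  (8)

## ★★ 82 · Story Title Here
**Source Name · 2026-04-21T09:30:00Z**
Two-sentence summary of the story. Second sentence with key detail.
Tags: `tag1`, `tag2`
Read: <https://original-article-url>
"""


def parse_story_headers(markdown_text: str) -> list:
    """Collect score, title, source, tags, and URL for each story block."""
    stories = []
    current = None
    for line in markdown_text.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            current = {
                "stars": len(heading["stars"]),
                "importance": int(heading["score"]),
                "title": heading["title"],
            }
            stories.append(current)
            continue
        if current is None:
            continue  # still before the first story heading
        source = SOURCE_RE.match(line)
        if source:
            current["source"] = source["source"]
            current["published"] = source["timestamp"]
        elif line.startswith("Tags: "):
            current["tags"] = re.findall(r"`([^`]+)`", line)
        elif line.startswith("Read: "):
            current["url"] = line[len("Read: "):].strip("<>")
    return stories


for story in parse_story_headers(SAMPLE_BLOCK):
    print(story)
```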

---

## 4. Archive Index (`/archive/index.json`)

Machine-readable catalogue of all available daily digests.

```json
{
  "digests": [
    {
      "date": "2026-04-21",
      "issue": 388,
      "story_count": 30,
      "md_url": "https://khao.ai/archive/2026-04-21.md",
      "html_url": "https://khao.ai/archive.html#2026-04-21"
    }
  ]
}
```

| Field | Type | Description |
|---|---|---|
| `date` | string | ISO 8601 date (YYYY-MM-DD) |
| `issue` | int | Sequential issue number (starts at 100) |
| `story_count` | int | Number of stories in that digest |
| `md_url` | string | Direct URL to the Markdown digest file |
| `html_url` | string | Deep link to the archive page anchored to that date |
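
One possible use of the index is to locate the newest digest before fetching its Markdown file. The sketch below assumes nothing about the order of entries and simply takes the maximum date; the second sample entry (issue 387) and the `latest_digest` helper are hypothetical.

```python
import json
from datetime import date

# First entry mirrors the example above; the second is made up.
SAMPLE_INDEX = """
{
  "digests": [
    {"date": "2026-04-20", "issue": 387, "story_count": 28,
     "md_url": "https://khao.ai/archive/2026-04-20.md",
     "html_url": "https://khao.ai/archive.html#2026-04-20"},
    {"date": "2026-04-21", "issue": 388, "story_count": 30,
     "md_url": "https://khao.ai/archive/2026-04-21.md",
     "html_url": "https://khao.ai/archive.html#2026-04-21"}
  ]
}
"""


def latest_digest(index_text: str):
    """Return the entry with the most recent `date`, or None if empty."""
    digests = json.loads(index_text).get("digests", [])
    if not digests:
        return None
    # Compare parsed dates rather than relying on the array order.
    return max(digests, key=lambda entry: date.fromisoformat(entry["date"]))


newest = latest_digest(SAMPLE_INDEX)
print(newest["issue"], newest["md_url"])
```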

---

## 5. Search Index (`/search-index.json`)

Flat array of every story across all digests. It powers the archive page's client-side search and is also suitable for RAG ingestion.

```json
[
  {
    "t": "Story headline",
    "s": "Source Name",
    "x": "Two-sentence excerpt.",
    "d": "2026-04-21",
    "i": 82,
    "u": "https://original-url",
    "c": "models"
  }
]
```

| Key | Field | Type | Description |
|---|---|---|---|
| `t` | title | string | Story headline |
| `s` | source | string | Publishing outlet |
| `x` | excerpt | string | Summary text (1–2 sentences) |
| `d` | date | string | Issue date (YYYY-MM-DD) |
| `i` | importance | int | Importance score 0–100 |
| `u` | url | string | Original article URL |
| `c` | category | string | Category slug (see section 7) |
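
Because the keys are abbreviated, consumers may want to expand them before indexing. The Python sketch below maps the short keys to the field names in the table and runs a naive substring search; the second sample record, `expand_record`, and `search` are hypothetical and only illustrate the shape of the data.

```python
# Short key -> descriptive field name, taken from the table above.
KEY_NAMES = {
    "t": "title", "s": "source", "x": "excerpt", "d": "date",
    "i": "importance", "u": "url", "c": "category",
}

# First record mirrors the example above; the second is made up.
SAMPLE_RECORDS = [
    {"t": "Story headline", "s": "Source Name", "x": "Two-sentence excerpt.",
     "d": "2026-04-21", "i": 82, "u": "https://original-url", "c": "models"},
    {"t": "Another headline", "s": "Other Source", "x": "Short excerpt.",
     "d": "2026-04-20", "i": 45, "u": "https://example.com/b", "c": "business"},
]


def expand_record(record: dict) -> dict:
    """Rename abbreviated keys; unknown keys pass through unchanged."""
    return {KEY_NAMES.get(key, key): value for key, value in record.items()}


def search(records: list, query: str, min_importance: int = 0) -> list:
    """Case-insensitive match on title or excerpt, highest score first."""
    needle = query.casefold()
    hits = [
        expand_record(record) for record in records
        if record.get("i", 0) >= min_importance
        and (needle in record.get("t", "").casefold()
             or needle in record.get("x", "").casefold())
    ]
    return sorted(hits, key=lambda hit: hit.get("importance", 0), reverse=True)


for hit in search(SAMPLE_RECORDS, "headline", min_importance=40):
    print(hit["importance"], hit["title"], hit["category"])
```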

---

## 6. Importance Score (0–100)

KHAO scores every story on a 0–100 scale by combining seven weighted factors. Scores fall into the following bands:

| Score range | Meaning |
|---|---|
| 0–19 | Trivial or ultra-niche: vendor notes, minor version bumps |
| 20–39 | Low signal: incremental updates, regional news |
| 40–59 | Moderate: notable but not urgent |
| 60–74 | High signal: major model releases, significant research |
| 75–89 | Front-page: landmark results, widely covered events |
| 90–100 | Historic: paradigm-shifting breakthroughs, major regulation |

Scoring factors, with nominal weights (actual weights vary slightly from day to day):
- **Editorial merit** (35%) — Claude's standalone assessment of the story
- **Cross-source agreement** (20%) — how many independent outlets covered it
- **Source strength** (10%) — originator vs. rewriter prior
- **Velocity** (10%) — cluster size relative to day's biggest cluster
- **Entity prominence** (10%) — coverage of tracked major entities
- **Topic rarity** (10%) — inverse tag frequency (rare topics score higher)
- **Recency decay** (5%) — exponential decay, half-life 18h
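
The source does not state how the factors are combined, so the following Python sketch is only one plausible reading: a plain weighted sum of factor values normalized to the range 0 to 1, using the nominal weights listed above. The `combine`, `recency_factor`, and `band` helpers and the factor key names are hypothetical; the band boundaries and the 18-hour half-life come from this section.

```python
# Nominal weights in percent, as listed above (they sum to 100).
WEIGHTS = {
    "editorial_merit": 35,
    "cross_source_agreement": 20,
    "source_strength": 10,
    "velocity": 10,
    "entity_prominence": 10,
    "topic_rarity": 10,
    "recency_decay": 5,
}

# Lower bound of each band, highest first, from the score-range table.
BANDS = [
    (90, "Historic"),
    (75, "Front-page"),
    (60, "High signal"),
    (40, "Moderate"),
    (20, "Low signal"),
    (0, "Trivial or ultra-niche"),
]


def recency_factor(age_hours: float, half_life_hours: float = 18.0) -> float:
    """Exponential decay: the value halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def combine(factors: dict) -> int:
    """ASSUMPTION: weighted sum of factors in [0, 1], clamped to 0..100."""
    total = sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
    return max(0, min(100, round(total)))


def band(score: int) -> str:
    """Map a 0..100 score to its label in the score-range table."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    raise ValueError("score must be between 0 and 100")


example = {name: 0.5 for name in WEIGHTS}
example["recency_decay"] = recency_factor(age_hours=18)
print(combine(example), band(combine(example)))
```

Published scores should be read from the feeds rather than recomputed; this sketch only shows how the listed weights and bands could fit together.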

---

## 7. Category Taxonomy

| Slug | Description |
|---|---|
| `models` | New model releases, capabilities, benchmarks |
| `research` | Papers, findings, safety research, alignment |
| `business` | Funding, M&A, earnings, policy, regulation, infrastructure |
| `tech` | Developer tools, hardware, cloud platforms |
| `world` | International AI news, geopolitics, non-US coverage |

---

## Questions / API access

Contact: contact@khao.ai

See also: [llms.txt](/llms.txt) for AI crawler guidance.
