# DNS Analytics — Home Server Foundation
`dns-analytics.codeandcore.dev` · Two-server home origin · Growth path to ISP multi-node

---

## Your Actual Infrastructure

| Role | Hostname | IP | Network served |
|---|---|---|---|
| Primary DNS + Analytics | `dns1.codeandcore.home` | `192.168.238.128` | `192.168.238.0/24` |
| Secondary DNS + Analytics | `dns2.codeandcore.home` | `10.10.10.250` | `10.10.10.0/24` |

Both servers run the full stack: BIND9 → Python ingestor → PostgreSQL. They are
independent resolvers for their respective LAN segments, with the option to federate
their analytics databases into a unified view later (see Section 4).

---

## Architecture Diagram

```
192.168.238.0/24 LAN                    10.10.10.0/24 LAN
────────────────────                    ─────────────────
  Clients                                  Clients
  192.168.238.x                           10.10.10.x
       │                                       │
       ▼                                       ▼
  BIND9 Resolver                         BIND9 Resolver
  192.168.238.128                        10.10.10.250
  dns1.codeandcore.home                  dns2.codeandcore.home
       │                                       │
       ▼                                       ▼
  Python Ingestor                        Python Ingestor
       │                                       │
       ▼                                       ▼
  PostgreSQL                             PostgreSQL
  dns_analytics (primary)                dns_analytics (secondary)
       │                                       │
       └─────────────── (future) ─────────────┘
                  Federated analytics view
                  (postgres_fdw or ETL job)
```

Upstream forwarders for both servers:
```
8.8.8.8  8.8.4.4  1.1.1.1  9.9.9.9
```

---

## Section 1 — Server 1: 192.168.238.128

### named.conf.options

```named.conf
// DNS Analytics Server 1 — dns1.codeandcore.home
// Network: 192.168.238.0/24
// David Egwell · codeandcore

acl "trusted" {
    127.0.0.1;
    192.168.238.0/24;   // Primary home LAN
    10.10.10.0/24;      // Secondary home LAN (cross-query allowed)
    10.0.0.0/8;
    172.16.0.0/12;
};

options {
    directory "/var/cache/bind";
    listen-on { 127.0.0.1; 192.168.238.128; };
    listen-on-v6 { any; };
    allow-query     { trusted; };
    allow-recursion { trusted; };
    forwarders { 8.8.8.8; 8.8.4.4; 1.1.1.1; 9.9.9.9; };
    forward only;
    dnssec-validation auto;
    allow-transfer { none; };
    notify no;
    recursion yes;
    max-cache-size 256m;
    min-cache-ttl 60;
    max-cache-ttl 86400;
    rate-limit { responses-per-second 50; window 5; };
    pid-file "/run/named/named.pid";
};
```

### named.conf.local (logging + local zone)

```named.conf
// Authoritative zone for codeandcore.home
zone "codeandcore.home" {
    type master;
    file "/etc/bind/db.codeandcore.home";
    allow-update { none; };
    // Allow dns2 to transfer the zone (configured as a slave zone in Section 2)
    allow-transfer { 10.10.10.250; };
};

logging {
    channel queries_log {
        file "/var/log/named/queries.log"
            versions 7 size 500m;
        severity dynamic;
        print-time yes;
        print-severity no;
        print-category no;
    };
    channel named_log {
        file "/var/log/named/named.log"
            versions 4 size 50m;
        severity info;
        print-time yes; print-severity yes; print-category yes;
    };
    channel errors_log {
        file "/var/log/named/query-errors.log"
            versions 4 size 100m;
        severity dynamic;
        print-time yes;
        print-severity no;
        print-category no;
    };
    category queries       { queries_log; };
    category query-errors  { errors_log; };   // Separate channel — fixes BUG-01
    category default       { named_log; };
    category general       { named_log; };
    category config        { named_log; };
    category network       { named_log; };
    category security      { named_log; };
    category lame-servers  { null; };
    category dnssec        { null; };
    category resolver      { null; };
    category cname         { null; };
    category xfer-in       { null; };
    category xfer-out      { null; };
    category notify        { null; };
    category client        { null; };
    category unmatched     { null; };
    category dispatch      { null; };
    category edns-disabled { null; };
    category rpz           { null; };
    category rate-limit    { null; };
};
```

> **BUG-01 fix applied here.** `query-errors` now writes to a separate
> `query-errors.log` channel so the ingestor can parse NXDOMAIN and SERVFAIL
> responses correctly. See Section 3 for the updated ingestor.

### Zone file: db.codeandcore.home (hosted on dns1)

```dns-zone
$TTL 86400
@   IN  SOA ns1.codeandcore.home. admin.codeandcore.home. (
        2026051301 ; serial — update when records change
        3600       ; refresh
        1800       ; retry
        1209600    ; expire
        86400 )    ; minimum

; Name servers
@       IN  NS  ns1.codeandcore.home.
@       IN  NS  ns2.codeandcore.home.

; Name server A records
ns1     IN  A   192.168.238.128
ns2     IN  A   10.10.10.250

; dns1 network (192.168.238.x)
dns1    IN  A   192.168.238.128
router1 IN  A   192.168.238.1

; dns2 network (10.10.10.x)
dns2    IN  A   10.10.10.250
router2 IN  A   10.10.10.10

; Shared services — point to whichever server hosts them
nas     IN  A   192.168.238.20
cam1    IN  A   192.168.238.30
cam2    IN  A   10.10.10.30
printer IN  A   192.168.238.40
```

### Verify dns1

```bash
sudo named-checkconf
sudo named-checkzone codeandcore.home /etc/bind/db.codeandcore.home
sudo systemctl restart bind9

# Test from within 192.168.238.0/24
dig @192.168.238.128 google.com A
dig @192.168.238.128 nas.codeandcore.home A
dig @192.168.238.128 dns2.codeandcore.home A

# Confirm query log is writing
sudo tail -10 /var/log/named/queries.log

# Confirm error log is separate and writing
sudo tail -10 /var/log/named/query-errors.log
```

---

## Section 2 — Server 2: 10.10.10.250

### named.conf.options

```named.conf
// DNS Analytics Server 2 — dns2.codeandcore.home
// Network: 10.10.10.0/24
// David Egwell · codeandcore

acl "trusted" {
    127.0.0.1;
    10.10.10.0/24;          // Primary LAN for this server
    192.168.238.0/24;       // Server 1 network (cross-query allowed)
    10.0.0.0/8;
    172.16.0.0/12;
};

options {
    directory "/var/cache/bind";
    listen-on { 127.0.0.1; 10.10.10.250; };
    listen-on-v6 { any; };
    allow-query     { trusted; };
    allow-recursion { trusted; };
    forwarders { 8.8.8.8; 8.8.4.4; 1.1.1.1; 9.9.9.9; };
    forward only;
    dnssec-validation auto;
    allow-transfer { none; };
    notify no;
    recursion yes;
    max-cache-size 256m;
    min-cache-ttl 60;
    max-cache-ttl 86400;
    rate-limit { responses-per-second 50; window 5; };
    pid-file "/run/named/named.pid";
};
```

### named.conf.local (logging + slave zone)

```named.conf
// Slave zone — receives transfers from dns1
zone "codeandcore.home" {
    type slave;
    masters { 192.168.238.128; };
    file "/var/cache/bind/db.codeandcore.home.slave";
};

logging {
    channel queries_log {
        file "/var/log/named/queries.log"
            versions 7 size 500m;
        severity dynamic;
        print-time yes;
        print-severity no;
        print-category no;
    };
    channel named_log {
        file "/var/log/named/named.log"
            versions 4 size 50m;
        severity info;
        print-time yes; print-severity yes; print-category yes;
    };
    channel errors_log {
        file "/var/log/named/query-errors.log"
            versions 4 size 100m;
        severity dynamic;
        print-time yes;
        print-severity no;
        print-category no;
    };
    category queries       { queries_log; };
    category query-errors  { errors_log; };
    category default       { named_log; };
    category general       { named_log; };
    category config        { named_log; };
    category network       { named_log; };
    category security      { named_log; };
    category lame-servers  { null; };
    category dnssec        { null; };
    category resolver      { null; };
    category cname         { null; };
    category xfer-in       { null; };
    category xfer-out      { null; };
    category notify        { null; };
    category client        { null; };
    category unmatched     { null; };
    category dispatch      { null; };
    category edns-disabled { null; };
    category rpz           { null; };
    category rate-limit    { null; };
};
```

### MikroTik DHCP config (10.10.10.0/24 router at 10.10.10.10)

```routeros
/ip dhcp-server network set [find address="10.10.10.0/24"] \
    gateway=10.10.10.10 \
    dns-server=10.10.10.250 \
    domain=codeandcore.home

/ip dns set allow-remote-requests=no servers=10.10.10.250

/ip dns static add name=router2.codeandcore.home address=10.10.10.10 ttl=1d
/ip dns static add name=dns2.codeandcore.home   address=10.10.10.250 ttl=1d
```

### Verify dns2

```bash
sudo named-checkconf
sudo systemctl restart bind9

# Confirm slave zone transferred from dns1
dig @10.10.10.250 nas.codeandcore.home A
dig @10.10.10.250 google.com A

# Confirm query log writing
sudo tail -10 /var/log/named/queries.log
sudo tail -10 /var/log/named/query-errors.log
```

---

## Section 3 — Updated Python Ingestor (BUG-01 fix + BUG-03 + BUG-04)

This version tails **two** log files simultaneously: `queries.log` (NOERROR traffic)
and `query-errors.log` (NXDOMAIN / SERVFAIL responses). It runs as a dedicated
system user and reads the DB password from the environment.

Deploy this version to **both servers** — same script, same systemd unit, different
database on each host.

### /etc/dns-ingestor/db.env (chmod 600)

```bash
DB_HOST=127.0.0.1
DB_PORT=5432
DB_NAME=dns_analytics
DB_USER=dns_user
DB_PASSWORD=YourActualPassword
```

### /opt/dns-ingestor/ingestor.py

```python
#!/usr/bin/env python3
"""
DNS Query Log Ingestor — codeandcore
Tails queries.log and query-errors.log simultaneously.
Resolves BUG-01 (response_rcode), BUG-03 (root), BUG-04 (hardcoded password).
"""
import re, time, logging, os, signal, sys, threading
from datetime import datetime
import psycopg2
from psycopg2.extras import execute_values

DB = {
    'dbname':   os.environ['DB_NAME'],
    'user':     os.environ['DB_USER'],
    'password': os.environ['DB_PASSWORD'],
    'host':     os.environ.get('DB_HOST', '127.0.0.1'),
    'port':     int(os.environ.get('DB_PORT', 5432)),
}

QUERIES_LOG     = '/var/log/named/queries.log'
ERRORS_LOG      = '/var/log/named/query-errors.log'
BATCH_SIZE      = 100
FLUSH_INTERVAL  = 5

# Standard query line — produces NOERROR records
QUERY_RE = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d\.a-fA-F:]+)#(?P<port>\d+)'
    r'\s+\([^)]+\):\s+query:\s+(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+)'
    r'\s+(?P<flags>[+\-\w\s]*?)(?:\s+\([\d.]+\))?$'
)

# Error line — produces NXDOMAIN / SERVFAIL records
ERROR_RE = re.compile(
    r'^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})\.\d+'
    r'\s+client\s+(?:@\S+\s+)?(?P<ip>[\d\.a-fA-F:]+)#(?P<port>\d+)'
    r'.*?(?P<domain>[\w.\-]+)\s+IN\s+(?P<qtype>\w+).*?'
    r'(?P<rcode>NXDOMAIN|SERVFAIL|REFUSED|FORMERR)'
)

def apex(d):
    # Naive apex: keep the last two labels. Fine for home traffic, but
    # multi-label public suffixes (e.g. co.uk) will be over-truncated.
    p = d.rstrip('.').split('.')
    return '.'.join(p[-2:]) if len(p) >= 2 else d

def parse_ts(s):
    try:    return datetime.strptime(s, '%d-%b-%Y %H:%M:%S')
    except ValueError: return datetime.utcnow()

def parse_query(line):
    m = QUERY_RE.match(line.strip())
    if not m: return None
    d = m.groupdict()
    dom = d.get('domain', '').lower().rstrip('.')
    if not dom: return None
    flags = d.get('flags', '').strip()
    return {
        'queried_at':    parse_ts(d['ts']),
        'client_ip':     d.get('ip', '0.0.0.0'),
        'client_port':   int(d.get('port', 0)),
        'domain':        dom,
        'apex_domain':   apex(dom),
        'qtype':         d.get('qtype', 'A').upper(),
        'flags':         flags,
        'response_rcode':'NOERROR',
        'cached':        'CL' in flags,
    }

def parse_error(line):
    m = ERROR_RE.search(line.strip())
    if not m: return None
    d = m.groupdict()
    dom = d.get('domain', '').lower().rstrip('.')
    if not dom: return None
    return {
        'queried_at':    parse_ts(d['ts']),
        'client_ip':     d.get('ip', '0.0.0.0'),
        'client_port':   int(d.get('port', 0)),
        'domain':        dom,
        'apex_domain':   apex(dom),
        'qtype':         d.get('qtype', 'A').upper(),
        'flags':         '',
        'response_rcode':d.get('rcode', 'NXDOMAIN'),
        'cached':        False,
    }

def connect():
    while True:
        try:
            c = psycopg2.connect(**DB)
            c.autocommit = False
            logging.info('DB connected')
            return c
        except psycopg2.OperationalError as e:
            logging.error(f'DB: {e}')
            time.sleep(10)

def flush(conn, batch):
    if not batch: return 0
    sql = (
        'INSERT INTO dns_queries '
        '(queried_at,client_ip,client_port,domain,apex_domain,'
        'qtype,flags,response_rcode,cached) VALUES %s ON CONFLICT DO NOTHING'
    )
    rows = [(r['queried_at'], r['client_ip'], r['client_port'],
             r['domain'], r['apex_domain'], r['qtype'],
             r['flags'], r['response_rcode'], r['cached']) for r in batch]
    try:
        with conn.cursor() as cur:
            execute_values(cur, sql, rows, page_size=500)
        conn.commit()
        return len(batch)
    except psycopg2.Error as e:
        logging.error(f'Insert: {e}')
        conn.rollback()
        return 0

# Shared batch and lock across both tailer threads
_lock  = threading.Lock()
_batch = []

def tailer(log_file, parser_fn, label):
    """Tail a log file and append parsed records to shared batch."""
    inode, f = 0, None
    logging.info(f'[{label}] watching {log_file}')
    while True:
        try:
            cur_inode = os.stat(log_file).st_ino
        except FileNotFoundError:
            time.sleep(5)
            continue
        if cur_inode != inode:
            try:
                new_f = open(log_file, 'r')
                new_f.seek(0, 2)   # start at EOF; only new lines are ingested
                if f:
                    f.close()      # release the rotated-out handle
                f, inode = new_f, cur_inode
            except IOError:
                time.sleep(5)
                continue
        while True:
            line = f.readline()
            if not line: break
            rec = parser_fn(line)
            if rec:
                with _lock:
                    _batch.append(rec)
        time.sleep(0.5)

def flusher(conn):
    """Periodically flush the shared batch to PostgreSQL."""
    total = 0
    while True:
        time.sleep(FLUSH_INTERVAL)
        with _lock:
            if not _batch: continue
            work = _batch[:]
            _batch.clear()
        n = flush(conn, work)
        total += n
        if n:
            logging.info(f'Flushed {n} | total {total:,}')

def main():
    os.makedirs('/var/log/dns-ingestor', exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s [%(levelname)s] %(message)s',
        handlers=[
            logging.StreamHandler(sys.stdout),
            logging.FileHandler('/var/log/dns-ingestor/ingestor.log'),
        ]
    )
    def bye(s, f): sys.exit(0)
    signal.signal(signal.SIGTERM, bye)
    signal.signal(signal.SIGINT,  bye)

    conn = connect()

    # Thread 1: tail queries.log (NOERROR traffic)
    t1 = threading.Thread(target=tailer, args=(QUERIES_LOG, parse_query, 'QUERY'), daemon=True)
    # Thread 2: tail query-errors.log (NXDOMAIN / SERVFAIL)
    t2 = threading.Thread(target=tailer, args=(ERRORS_LOG,  parse_error,  'ERROR'), daemon=True)
    t1.start()
    t2.start()

    # Main thread flushes
    flusher(conn)

if __name__ == '__main__':
    main()
```

### /etc/systemd/system/dns-ingestor.service

```ini
[Unit]
Description=DNS Query Log Ingestor — codeandcore
After=network.target postgresql.service bind9.service
Requires=postgresql.service

[Service]
Type=simple
User=dns-ingestor
Group=dns-ingestor
EnvironmentFile=/etc/dns-ingestor/db.env
WorkingDirectory=/opt/dns-ingestor
ExecStart=/opt/dns-ingestor/venv/bin/python3 /opt/dns-ingestor/ingestor.py
Restart=always
RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

### Deploy on each server

```bash
# Create system user
sudo useradd --system --no-create-home --shell /usr/sbin/nologin dns-ingestor

# Set up credential file
sudo mkdir -p /etc/dns-ingestor
sudo nano /etc/dns-ingestor/db.env        # paste DB credentials
sudo chmod 600 /etc/dns-ingestor/db.env
sudo chown dns-ingestor:dns-ingestor /etc/dns-ingestor/db.env

# Grant log access
sudo apt-get install -y acl
sudo setfacl -m u:dns-ingestor:rx /var/log/named
sudo setfacl -m u:dns-ingestor:r  /var/log/named/queries.log
sudo setfacl -m u:dns-ingestor:r  /var/log/named/query-errors.log

# Install the script and its Python environment
sudo mkdir -p /opt/dns-ingestor
sudo cp ingestor.py /opt/dns-ingestor/        # from wherever you staged the files in this guide
sudo python3 -m venv /opt/dns-ingestor/venv
sudo /opt/dns-ingestor/venv/bin/pip install psycopg2-binary

# Ingestor log directory (the service user cannot create it under /var/log itself)
sudo mkdir -p /var/log/dns-ingestor
sudo chown dns-ingestor:dns-ingestor /var/log/dns-ingestor

# Deploy the unit and start
sudo cp dns-ingestor.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now dns-ingestor
sudo journalctl -u dns-ingestor -f
```
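
Once the journal shows flushes, a quick sanity check from `psql` confirms rows are
landing. A minimal sketch, using the column names from the `INSERT` in `ingestor.py`:

```sql
-- Run on either server: recent ingestion volume and the rcode split
SELECT response_rcode, COUNT(*) AS queries, MAX(queried_at) AS newest
FROM dns_queries
WHERE queried_at > now() - INTERVAL '15 minutes'
GROUP BY response_rcode
ORDER BY queries DESC;
```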

---

## Section 4 — Unified Analytics View (Two Servers → One Dashboard)

Both servers write to their own local `dns_analytics` PostgreSQL database. To see
combined traffic from both networks in one dashboard, use the PostgreSQL foreign
data wrapper (`postgres_fdw`) to query dns2 from dns1 without moving data.

### On dns2 (10.10.10.250) — allow remote connection from dns1

```sql
-- In postgresql.conf on dns2, make sure PostgreSQL listens beyond localhost
-- (the default is localhost only), e.g.:
-- listen_addresses = 'localhost,10.10.10.250'
--
-- In pg_hba.conf on dns2, add:
-- host  dns_analytics  dns_user  192.168.238.128/32  md5
--
-- Then restart (listen_addresses needs a restart; pg_hba.conf alone only needs a reload):
-- sudo systemctl restart postgresql
```

### On dns1 (192.168.238.128) — mount dns2 as foreign tables

```sql
-- Run as postgres superuser on dns1
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER dns2_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '10.10.10.250', port '5432', dbname 'dns_analytics');

CREATE USER MAPPING FOR dns_user
    SERVER dns2_server
    OPTIONS (user 'dns_user', password 'YourActualPassword');

-- Import the stats tables from dns2 as foreign tables
CREATE FOREIGN TABLE dns2_daily_stats (
    stat_date DATE,
    domain TEXT,
    apex_domain TEXT,
    qtype VARCHAR(10),
    hit_count BIGINT,
    unique_clients INTEGER,
    nxdomain_count BIGINT,
    servfail_count BIGINT,
    first_seen TIMESTAMPTZ,
    last_seen TIMESTAMPTZ
)
SERVER dns2_server
OPTIONS (schema_name 'public', table_name 'dns_daily_stats');
```
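
Defining the column list by hand, as above, lets the local copy carry a distinct name
(`dns2_daily_stats`). If you would rather let `postgres_fdw` copy the remote definition,
`IMPORT FOREIGN SCHEMA` works too, but it keeps the remote table name, so it has to land
in a separate schema to avoid clashing with the local `dns_daily_stats` (the `dns2`
schema name here is just an example):

```sql
-- Alternative to the manual CREATE FOREIGN TABLE above
CREATE SCHEMA IF NOT EXISTS dns2;

IMPORT FOREIGN SCHEMA public
    LIMIT TO (dns_daily_stats)
    FROM SERVER dns2_server
    INTO dns2;

-- The remote data is then queryable as dns2.dns_daily_stats
```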

### Combined top domains query (both networks)

```sql
-- Query both servers in one statement from dns1
SELECT apex_domain,
    SUM(hit_count) AS total_queries,
    SUM(unique_clients) AS total_clients,
    'combined' AS source
FROM (
    SELECT apex_domain, hit_count, unique_clients FROM dns_daily_stats
    WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days'
    UNION ALL
    SELECT apex_domain, hit_count, unique_clients FROM dns2_daily_stats
    WHERE stat_date >= CURRENT_DATE - INTERVAL '7 days'
) combined
GROUP BY apex_domain
ORDER BY total_queries DESC
LIMIT 50;
```

---

## Section 5 — Growth Path to ISP Multi-Server

This is how the two home servers grow into the SprintUG/SprintTZ production model.
No architecture changes — only scale changes.

| Stage | Servers | Transport | Database |
|---|---|---|---|
| **Home now** | dns1 (192.168.238.128) + dns2 (10.10.10.250) | File tail | PostgreSQL local + pg_fdw |
| **Small ISP PoP** | 2–4 BIND9 nodes per PoP | File tail → shared NFS log mount | PostgreSQL primary + read replica |
| **Multi-PoP ISP** | 8–16 resolvers across PoPs | dnstap binary → Kafka topic | TimescaleDB cluster |
| **Full ISP scale** | Unbound anycast cluster | dnstap → Kafka → Faust workers | TimescaleDB + continuous aggregates |

### What stays the same at every stage

- PostgreSQL schema (same tables, same rollup SQL, same analytics queries)
- Python ingestor logic (same parse → batch → insert loop)
- cron rollup jobs (same SQL files; see the sketch after this list)
- Dashboard queries (same SQL, more rows)
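
The rollup jobs referenced above are plain SQL run from cron. A minimal sketch of the
daily rollup, assuming `dns_daily_stats` matches the column list used for the foreign
table in Section 4 (the original guide's SQL files remain the source of truth):

```sql
-- Roll yesterday's raw queries into dns_daily_stats
INSERT INTO dns_daily_stats
    (stat_date, domain, apex_domain, qtype, hit_count, unique_clients,
     nxdomain_count, servfail_count, first_seen, last_seen)
SELECT
    queried_at::date,
    domain,
    apex_domain,
    qtype,
    COUNT(*),
    COUNT(DISTINCT client_ip),
    COUNT(*) FILTER (WHERE response_rcode = 'NXDOMAIN'),
    COUNT(*) FILTER (WHERE response_rcode = 'SERVFAIL'),
    MIN(queried_at),
    MAX(queried_at)
FROM dns_queries
WHERE queried_at >= CURRENT_DATE - INTERVAL '1 day'
  AND queried_at <  CURRENT_DATE
GROUP BY 1, 2, 3, 4;
```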

### What changes at each stage

- Transport: file tail → shared mount → Kafka
- Resolver: BIND9 → Unbound + anycast VIP
- Database: local → replicated → TimescaleDB hypertables (see the sketch below)
- Ingestor: single process → Faust distributed workers
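
For the database step the schema itself does not change, which is why the migration is
small. A minimal sketch of converting the existing table into a hypertable, assuming
the TimescaleDB extension is installed on the target server:

```sql
-- One-time conversion: dns_queries becomes a hypertable chunked on queried_at.
-- migrate_data => true copies any existing rows into the chunked layout.
CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('dns_queries', 'queried_at', migrate_data => true);
```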

### ISP node registration table (future schema addition)

When you go multi-server at ISP scale, add a node registry to the schema:

```sql
CREATE TABLE dns_nodes (
    node_id     SERIAL PRIMARY KEY,
    node_name   TEXT NOT NULL,          -- e.g. 'raxio-dns1', 'airtel-dns1'
    node_ip     INET NOT NULL,
    location    TEXT NOT NULL,          -- e.g. 'Raxio DSM', 'Airtel House KLA'
    network     CIDR NOT NULL,          -- e.g. '196.43.0.0/16'
    active      BOOLEAN DEFAULT TRUE,
    added_at    TIMESTAMPTZ DEFAULT now()
);

INSERT INTO dns_nodes VALUES
(DEFAULT, 'home-dns1', '192.168.238.128', 'Home Lab LAN1', '192.168.238.0/24', TRUE, now()),
(DEFAULT, 'home-dns2', '10.10.10.250',    'Home Lab LAN2', '10.10.10.0/24',    TRUE, now());

-- When ISP nodes are added:
-- (DEFAULT, 'raxio-dns1',  '196.x.x.x', 'Raxio Namanve DSM',  '...',  TRUE, now()),
-- (DEFAULT, 'airtel-dns1', '196.x.x.x', 'Airtel House KLA',   '...',  TRUE, now()),
```

Then add `node_id` to `dns_queries` when multi-node ingestion begins, so every
query row carries its origin node. The dashboard can then filter by node, location,
or network — from two home servers all the way to eight PoPs.
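
A minimal sketch of that schema change and a per-node query, assuming the ingestor's
batch insert is extended to supply `node_id` at the same time:

```sql
-- Tag each query row with its origin node (rows ingested before the change stay NULL)
ALTER TABLE dns_queries
    ADD COLUMN node_id INTEGER REFERENCES dns_nodes(node_id);

-- Example: today's traffic broken down by node
SELECT n.node_name, n.location, COUNT(*) AS queries_today
FROM dns_queries q
JOIN dns_nodes n USING (node_id)
WHERE q.queried_at >= CURRENT_DATE
GROUP BY n.node_name, n.location
ORDER BY queries_today DESC;
```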

---

## IP Reference (update guide throughout)

The following IPs in the original guide need replacing. Search and replace:

| Old value | Replace with | Context |
|---|---|---|
| `192.168.234.128` | `192.168.238.128` | Server 1 throughout |
| `Server IP 192.168.234.128` | `dns1: 192.168.238.128 / dns2: 10.10.10.250` | Overview metadata block |
| `dig @192.168.234.128` | `dig @192.168.238.128` (dns1) or `dig @10.10.10.250` (dns2) | Test commands |
| `10.10.10.10` (DNS server in MikroTik chapter) | `10.10.10.250` | MikroTik DHCP dns-server |

---

*Foundation document for dns-analytics.codeandcore.dev*
*Servers: dns1 @ 192.168.238.128 · dns2 @ 10.10.10.250*
*Author: David Egwell · codeandcore*