@Elizafox · Last active April 3, 2025 22:26
SDAT — Sensor Data Archive & Transport

Author: Elizabeth Ashford <elizabeth.jennifer.myers@gmail.com>
Version: 1
Date: 3 April 2025
License: CC0 (Public Domain Dedication)

SDAT v1 - Sensor Data Archive & Transport

Version: 1.0
Extension: .sdat
Purpose: Efficient, resilient, and flash-friendly storage of high-frequency sensor logs with block-level compression and random-access support.

Overview

SDAT is a binary file format optimised for embedded systems like the ESP32 family. It is designed for:

  • Append-only, streaming-safe writes
  • Flash and SD card compatibility and safety
  • High-frequency logging (e.g., every 5 seconds)
  • Per-block compression (Zstandard or raw)
  • Indexing for fast seeking and temporal queries
  • Recovery from partial writes and power loss

Records can also be streamed and decoded linearly without the block table, although the block table allows for improved integrity verification and efficient indexing.

File Structure

All major sections in the file are aligned to 4096-byte (0x1000) boundaries. This improves flash memory wear-levelling and performance on block devices, and simplifies implementation by providing predictable offsets for seeking.

+-----------------------------+ 0x0000
| Primary Header             |
+-----------------------------+ 0x1000
| Secondary Header           |
+-----------------------------+ 0x2000
| Block Table (num_blocks)   |
+-----------------------------+ (data_offset, aligned)
| Block 0 (raw or compressed)|
| Block 1                    |
| ...                        |
+-----------------------------+
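Since 0x1000 is a power of two, rounding an offset up to the next boundary reduces to a mask operation. A minimal helper:

```c
#include <assert.h>
#include <stdint.h>

/* Round an offset up to the next 4096-byte (0x1000) boundary.
 * Works because 0x1000 is a power of two. */
static uint32_t sdat_align_up(uint32_t offset) {
    return (offset + 0xFFFu) & ~0xFFFu;
}
```

Writers can apply this to the end of the block table (and to each flushed block) to obtain the next valid section offset.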

Primary & Secondary Headers (64 bytes each)

struct Header {
    uint32_t magic;            // 'SDAT' = 0x54414453

    uint8_t  version;          // Format version (1)
    uint16_t flags;            // HeaderFlags (see below)
    uint8_t  reserved0;        // Padding and future use (must be zero)

    uint32_t header_crc;       // CRC32 of this header with crc field set to 0
    uint32_t data_crc;         // CRC32 of block table + data (optional)

    uint32_t sample_interval;  // In seconds
    uint32_t num_samples;      // Total number of samples

    uint32_t num_blocks;       // Number of entries in block table
    uint32_t data_offset;      // Start of data, 0x1000-aligned

    uint32_t sequence_number;  // Used to detect newer headers

    uint8_t  reserved1[28];    // Padding and future use (must be zero)
};

Both headers share the same layout. On load, the valid header with the highest sequence number is used; if both are valid and equivalent, either may be used.

All fields marked reserved must be zero when written. This convention aids compression and allows future extensibility.

Header flags

enum HeaderFlags {
    HEADER_FLAG_FINALIZED = 0x01  // File has been finalised and all CRCs are valid
};

Verification

The header CRC is computed as if the header_crc field is zero, and then added to the header. To verify, copy the CRC out of the structure, set it to zero, and run the checksum again. If the two values are equal, the header passes the checksum.
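A minimal verification sketch, assuming a packed header layout (which places header_crc at byte offset 8) and the standard CRC-32 as used by zlib; the spec does not pin down the CRC variant, so both are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected, as used by zlib).
 * Assumed variant; the spec does not specify one. */
static uint32_t crc32_ieee(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* Verify a 64-byte header: copy out the stored header_crc, recompute
 * the CRC with that field zeroed, and compare the two values. */
static int sdat_header_crc_ok(const uint8_t header[64]) {
    uint8_t tmp[64];
    uint32_t stored;
    memcpy(&stored, header + 8, 4);  /* header_crc at offset 8 (packed) */
    memcpy(tmp, header, 64);
    memset(tmp + 8, 0, 4);           /* compute as if the field were zero */
    return crc32_ieee(tmp, 64) == stored;
}
```

Writing follows the same steps in reverse: compute the CRC over the header with the field zeroed, then store the result into header_crc.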

Block Table (16 bytes per entry)

struct BlockTableEntry {
    uint32_t timestamp;       // Start time of block
    uint32_t offset;          // Offset of block in file
    uint16_t record_count;    // Number of samples in block
    uint16_t flags;           // See BlockFlags
    uint32_t block_crc;       // CRC32 of block contents (compressed or raw)
};

The block table is stored at offset 0x2000, immediately after the two headers. The region has a fixed size; its length in entries is stored in the num_blocks field of the header, and the region is padded so that the data area following it remains 0x1000-aligned. The total size of the region is:

$$ Length = 16n + ((-16n) \bmod 4096) $$

Where $n$ is the number of entries.
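The formula rounds 16n up to the next 4096-byte boundary, and can be implemented directly:

```c
#include <assert.h>
#include <stdint.h>

/* Size of the block-table region: 16 bytes per entry, padded up to
 * the next 4096-byte boundary so the data area stays 0x1000-aligned. */
static uint32_t sdat_table_size(uint32_t num_blocks) {
    uint32_t raw = 16u * num_blocks;
    return (raw + 4095u) & ~4095u;
}
```

For example, up to 256 entries fit in a single 4096-byte region; a 257th entry grows the region to 8192 bytes.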

Block flags

enum BlockFlags {
    BLOCK_FLAG_FINALIZED   = 0x01, // Block is complete and not subject to change
    BLOCK_FLAG_COMPRESSED  = 0x02, // Block data is compressed with Zstd
    BLOCK_FLAG_TOMBSTONE   = 0x04  // Block is invalid or discarded during recovery
};

Data Blocks

Each block consists of either:

  • A full sensor record followed by delta-encoded entries and optional checkpoints, or
  • A full sensor record only

Compressed blocks use Zstandard. Raw blocks are uncompressed.
Note: block CRCs are computed by SDAT independently; Zstandard's optional internal checksum is not used and is omitted from compressed frames.

If a compressed block is larger than the raw block, it is stored uncompressed instead.

Record Structure

Records are stored in the following format:

  • Each record begins with a 2-byte marker.
  • The record type is stored in bits 13–15:
    • 0b100 – Full record
    • 0b000 – Delta record
    • 0b001 – Checkpoint record
    • All other values are reserved for future use
  • The field mask is stored in bits 0–8, where each bit corresponds to one field:
    • Bit 0: pm1_0
    • Bit 1: pm2_5
    • Bit 2: pm10
    • Bit 3: voc
    • Bit 4: pressure
    • Bit 5: temperature
    • Bit 6: humidity
    • Bit 7: co2
    • Bit 8: aqi

Note: this field assignment is the one used by HomeSensor; other applications may define a different field mask.

Only fields marked in the field mask are encoded. For full readings, the field mask may be ignored.
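Parsing the marker reduces to two bit operations on the type and mask positions given above; a short sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Record types stored in bits 13-15 of the 2-byte marker. */
#define SDAT_REC_FULL       0x4  /* 0b100 */
#define SDAT_REC_DELTA      0x0  /* 0b000 */
#define SDAT_REC_CHECKPOINT 0x1  /* 0b001 */

/* Extract the 3-bit record type from bits 13-15. */
static unsigned sdat_marker_type(uint16_t marker) {
    return (marker >> 13) & 0x7u;
}

/* Extract the 9-bit field mask from bits 0-8. */
static unsigned sdat_marker_mask(uint16_t marker) {
    return marker & 0x1FFu;
}
```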

Full Record (32 bytes)

struct FullRecord {
    uint32_t timestamp;       // UNIX timestamp (UTC)

    uint16_t pm1_0;           // PM1.0 (µg/m³)
    uint16_t pm2_5;           // PM2.5 (µg/m³)
    uint16_t pm10;            // PM10.0 (µg/m³)

    uint16_t voc;             // Volatile Organic Compounds (PPM)

    uint32_t pressure;        // Atmospheric pressure (Pa)
    int16_t  temperature;     // Temperature (°C * 100)
    uint16_t humidity;        // Relative Humidity (% * 100)

    uint16_t co2;             // CO₂ concentration (PPM)
    uint8_t  aqi;             // Air Quality Index (0–5)

    uint8_t  reserved[9];     // Padding and alignment (must be zero)
};

Note: The structure could be adapted to other sensor types. This is the structure used by HomeSensor.

All fields marked reserved must be zero when written. This convention aids compression and allows future extensibility.

Delta Record (variable length)

Delta records allow compact storage of sensor data by only storing fields that have changed significantly from the previous record.

They are encoded dynamically with the following format:

  • Only fields set in the field mask are present in the record.
  • Each present field is ZigZag-encoded, then VarInt-encoded.
  • Fields are encoded in the fixed order listed above.

The values from the delta record (which may be positive or negative) are added to the last known value to give the latest readings. The new readings become the last known value.
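A decoding sketch for a single delta field; this assumes LEB128-style VarInts (7 data bits per byte, high bit as continuation flag, as in Protocol Buffers), since the spec does not name a specific VarInt variant:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one LEB128-style VarInt (assumed variant: 7 bits per byte,
 * MSB = continuation). Returns bytes consumed, or 0 if truncated. */
static size_t varint_decode(const uint8_t *p, size_t len, uint32_t *out) {
    uint32_t v = 0;
    for (size_t i = 0; i < len && i < 5; i++) {
        v |= (uint32_t)(p[i] & 0x7F) << (7 * i);
        if (!(p[i] & 0x80)) { *out = v; return i + 1; }
    }
    return 0;  /* ran out of bytes before the terminating byte */
}

/* Undo ZigZag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ... */
static int32_t zigzag_decode(uint32_t v) {
    return (int32_t)(v >> 1) ^ -(int32_t)(v & 1);
}
```

To apply a delta record, decode one value per mask bit in order, ZigZag-decode it, and add it to the corresponding last known field value.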

Filtering recommendations
  • To avoid encoding noise, only include deltas that exceed a minimum threshold (e.g., ±2 for pm1_0).
  • If a field changes by less than this threshold, omit it from the delta.
  • Ensure you account for the need to "catch up" for slow drift of values.
  • Periodically write full records (e.g., once per block or once per 5 minutes) to ensure decoding sync.
  • Catch-up logic can be used to encode small accumulated differences after several suppressed deltas.

This method reduces file size while maintaining fidelity over time.
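The filtering rules above can be sketched as a small per-field state machine. Thresholds and the catch-up bookkeeping are application choices, not part of the format, and the names here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-field delta filter: suppress deltas below a threshold, but track
 * the suppressed drift so slow changes are not lost forever. */
typedef struct {
    int32_t last_written;  /* value the decoder currently knows */
    int32_t pending;       /* accumulated suppressed drift */
    int32_t threshold;     /* minimum delta worth encoding */
} DeltaFilter;

/* Returns 1 and sets *delta when a delta should be emitted;
 * returns 0 when the change is below the threshold. */
static int delta_filter_step(DeltaFilter *f, int32_t value, int32_t *delta) {
    int32_t d = value - f->last_written;  /* drift accumulates here */
    if (abs(d) < f->threshold) {
        f->pending = d;    /* remember drift; emit nothing */
        return 0;
    }
    *delta = d;            /* catch-up: emits all accumulated drift */
    f->last_written = value;
    f->pending = 0;
    return 1;
}
```

Because the delta is always computed against the last value actually written, a run of suppressed small changes is automatically caught up as soon as the accumulated drift crosses the threshold.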

Checkpoint Record (16 bytes)

A checkpoint is a lightweight checksum of the current full sensor state. It provides integrity markers for delta decoding and can assist recovery.

Checkpoint entries:

  • Appear periodically (e.g., every 10 deltas)
  • Have their own marker with type field set to 0b001 in bits 13–15
  • Contain the following structure:
struct Checkpoint {
    uint32_t timestamp;      // Timestamp of the checkpoint
    uint32_t record_crc;     // CRC32 of the reconstructed FullRecord
    uint16_t record_count;   // Number of records since start of block
    uint8_t  reserved[6];    // Reserved for alignment and future use (must be zero)
};

All fields marked reserved must be zero when written. This convention aids compression and allows future extensibility.

Checkpoints are optional but recommended to improve resilience.

Compression

  • Each data block may be compressed with Zstandard
  • Compression is per-block, not per-file
  • Zstandard is used purely as a codec — not for CRC or framing
  • If compression inflates the block size, the block is stored uncompressed instead

Application notes

Writing Strategy

  • Data is appended to an internal buffer representing the current block
  • Once a block reaches a set number of records or size, it is compressed
  • A new BlockTableEntry is written with offset, record count, and CRC
  • The block is flushed to disk at a 0x1000-aligned offset
  • Only finalised blocks are included in the block table
  • During shutdown, the current block should be finalised and flushed to ensure integrity
  • The num_samples and num_blocks fields are updated only when the file is finalised
  • A new full record is written at the start of each block
  • Incomplete or in-progress blocks are skipped during recovery unless their CRCs match

Reading Strategy

  • Read the primary and secondary headers and select the one with the valid CRC and highest sequence number
  • Read the block table up to num_blocks entries
  • Seek to the appropriate block using the timestamp and offset
  • Decompress block if BLOCK_FLAG_COMPRESSED is set
  • Iterate over full + delta records
  • Optional: Use checkpoint or block-aligned reads to optimise range queries
  • If no block table is available, records can still be streamed sequentially by parsing full and delta records in order

Recovery Strategy

  • Use the most recent valid header (valid CRC + highest sequence number)
  • If CRC of a block does not match, mark it as BLOCK_FLAG_TOMBSTONE
  • Finalised blocks are assumed valid
  • Incomplete blocks may be retained if CRC matches after skipping trailing deltas
  • The block table should not be extended unless the block is finalised and flushed
  • It is recommended to finalise and flush blocks periodically (e.g. on shutdown or idle)
  • Checkpoints can be used to validate decoded deltas without decompressing the whole block

Optimal block table entry interval

It is recommended that block table entries be written no more often than the application requires, both to conserve space in the table and to achieve better compression ratios. In HomeSensor, for example, a block is written every five minutes.

Extension & Compatibility

  • Backward compatible within major version (e.g., v1.x)
  • Secondary header allows for dual-write safety
  • Block table allows for optional checkpointing and per-block metadata

Suggested File Naming

<application>_YYYYMMDD.sdat
example: sensorlog_20250403.sdat

Status

Current Version: SDATv1
Maintainer: Elizabeth Ashford
Contact: elizabeth.jennifer.myers@gmail.com
License: CC0 (Public Domain Dedication)
