@Elizafox · Last active April 3, 2025 22:26
SDAT — Sensor Data Archive & Transport

Author: Elizabeth Ashford <elizabeth.jennifer.myers@gmail.com>
Version: 1
Date: 3 April 2025
License: CC0 (Public Domain Dedication)

SDAT v1 - Sensor Data Archive & Transport

Version: 1.0
Extension: .sdat
Purpose: Efficient, resilient, and flash-friendly storage of high-frequency sensor logs with block-level compression and random-access support.

Overview

SDAT is a binary file format optimised for embedded systems like the ESP32 family. It is designed for:

  • Append-only, streaming-safe writes
  • Flash and SD card compatibility and safety
  • High-frequency logging (e.g., every 5 seconds)
  • Per-block compression (Zstandard or raw)
  • Indexing for fast seeking and temporal queries
  • Recovery from partial writes and power loss

Records can also be streamed and decoded linearly without the block table, although the block table allows for improved integrity verification and efficient indexing.

File Structure

All major sections in the file are aligned to 4096-byte (0x1000) boundaries. This improves flash memory wear-levelling and performance on block devices, and simplifies implementation by providing predictable offsets for seeking.

+-----------------------------+ 0x0000
| Primary Header             |
+-----------------------------+ 0x1000
| Secondary Header           |
+-----------------------------+ 0x2000
| Block Table (num_blocks)   |
+-----------------------------+ (data_offset, aligned)
| Block 0 (raw or compressed)|
| Block 1                    |
| ...                        |
+-----------------------------+
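Since 0x1000 is a power of two, rounding an offset up to the next boundary reduces to a mask operation. A minimal helper:

```c
#include <assert.h>
#include <stdint.h>

/* Round an offset up to the next 4096-byte (0x1000) boundary.
 * Works because 0x1000 is a power of two. */
static uint32_t sdat_align_up(uint32_t offset) {
    return (offset + 0xFFFu) & ~0xFFFu;
}
```

Writers can apply this to the end of the block table (and to each flushed block) to obtain the next valid section offset.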

Primary & Secondary Headers (64 bytes each)

struct Header {
    uint32_t magic;            // 'SDAT' = 0x54414453

    uint8_t  version;          // Format version (1)
    uint16_t flags;            // HeaderFlags (see below)
    uint8_t  reserved0;        // Padding and future use (must be zero)

    uint32_t header_crc;       // CRC32 of this header with crc field set to 0
    uint32_t data_crc;         // CRC32 of block table + data (optional)

    uint32_t sample_interval;  // In seconds
    uint32_t num_samples;      // Total number of samples

    uint32_t num_blocks;       // Number of entries in block table
    uint32_t data_offset;      // Start of data, 0x1000-aligned

    uint32_t sequence_number;  // Used to detect newer headers

    uint8_t  reserved1[28];    // Padding and future use (must be zero)
};

Both headers share the same layout. On load, the valid header with the highest sequence number is used; if both are valid and equivalent, either may be used.

All fields marked reserved must be zero when written. This convention aids compression and allows future extensibility.

Header flags

enum HeaderFlags {
    HEADER_FLAG_FINALIZED = 0x01  // File has been finalised and all CRCs are valid
};

Verification

The header CRC is computed as if the header_crc field is zero, and then added to the header. To verify, copy the CRC out of the structure, set it to zero, and run the checksum again. If the two values are equal, the header passes the checksum.
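A minimal verification sketch, assuming a packed header layout (which places header_crc at byte offset 8) and the standard CRC-32 as used by zlib; the spec does not pin down the CRC variant, so both are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected, as used by zlib).
 * Assumed variant; the spec does not specify one. */
static uint32_t crc32_ieee(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* Verify a 64-byte header: copy out the stored header_crc, recompute
 * the CRC with that field zeroed, and compare the two values. */
static int sdat_header_crc_ok(const uint8_t header[64]) {
    uint8_t tmp[64];
    uint32_t stored;
    memcpy(&stored, header + 8, 4);  /* header_crc at offset 8 (packed) */
    memcpy(tmp, header, 64);
    memset(tmp + 8, 0, 4);           /* compute as if the field were zero */
    return crc32_ieee(tmp, 64) == stored;
}
```

Writing follows the same steps in reverse: compute the CRC over the header with the field zeroed, then store the result into header_crc.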

Block Table (16 bytes per entry)

struct BlockTableEntry {
    uint32_t timestamp;       // Start time of block
    uint32_t offset;          // Offset of block in file
    uint16_t record_count;    // Number of samples in block
    uint16_t flags;           // See BlockFlags
    uint32_t block_crc;       // CRC32 of block contents (compressed or raw)
};

The block table is stored at offset 0x2000, immediately after the two headers. The region has a fixed size; its length in entries is stored in the num_blocks field of the header, and the region is padded so that the data area following it remains 0x1000-aligned. The total size of the region is:

$$ Length = 16n + ((-16n) \bmod 4096) $$

Where $n$ is the number of entries.
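The formula rounds 16n up to the next 4096-byte boundary, and can be implemented directly:

```c
#include <assert.h>
#include <stdint.h>

/* Size of the block-table region: 16 bytes per entry, padded up to
 * the next 4096-byte boundary so the data area stays 0x1000-aligned. */
static uint32_t sdat_table_size(uint32_t num_blocks) {
    uint32_t raw = 16u * num_blocks;
    return (raw + 4095u) & ~4095u;
}
```

For example, up to 256 entries fit in a single 4096-byte region; a 257th entry grows the region to 8192 bytes.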

Block flags

enum BlockFlags {
    BLOCK_FLAG_FINALIZED   = 0x01, // Block is complete and not subject to change
    BLOCK_FLAG_COMPRESSED  = 0x02, // Block data is compressed with Zstd
    BLOCK_FLAG_TOMBSTONE   = 0x04  // Block is invalid or discarded during recovery
};

Data Blocks

Each block consists of either:

  • A full sensor record followed by delta-encoded entries and optional checkpoints, or
  • A full sensor record only

Compressed blocks use Zstandard. Raw blocks are uncompressed.
Note: block CRCs are computed by SDAT independently; Zstandard's optional internal checksum is not used and is omitted from compressed frames.

If a compressed block is larger than the raw block, it is stored uncompressed instead.

Record Structure

Records are stored in the following format:

  • Each record begins with a 2-byte marker.
  • The record type is stored in bits 13–15:
    • 0b100 – Full record
    • 0b000 – Delta record
    • 0b001 – Checkpoint record
    • All other values are reserved for future use
  • The field mask is stored in bits 0–8, where each bit corresponds to one field:
    • Bit 0: pm1_0
    • Bit 1: pm2_5
    • Bit 2: pm10
    • Bit 3: voc
    • Bit 4: pressure
    • Bit 5: temperature
    • Bit 6: humidity
    • Bit 7: co2
    • Bit 8: aqi

Note: this field assignment is the one used by HomeSensor; other applications may define a different field mask.

Only fields marked in the field mask are encoded. For full readings, the field mask may be ignored.
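Parsing the marker reduces to two bit operations on the type and mask positions given above; a short sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Record types stored in bits 13-15 of the 2-byte marker. */
#define SDAT_REC_FULL       0x4  /* 0b100 */
#define SDAT_REC_DELTA      0x0  /* 0b000 */
#define SDAT_REC_CHECKPOINT 0x1  /* 0b001 */

/* Extract the 3-bit record type from bits 13-15. */
static unsigned sdat_marker_type(uint16_t marker) {
    return (marker >> 13) & 0x7u;
}

/* Extract the 9-bit field mask from bits 0-8. */
static unsigned sdat_marker_mask(uint16_t marker) {
    return marker & 0x1FFu;
}
```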

Full Record (32 bytes)

struct FullRecord {
    uint32_t timestamp;       // UNIX timestamp (UTC)

    uint16_t pm1_0;           // PM1.0 (µg/m³)
    uint16_t pm2_5;           // PM2.5 (µg/m³)
    uint16_t pm10;            // PM10.0 (µg/m³)

    uint16_t voc;             // Volatile Organic Compounds (PPM)

    uint32_t pressure;        // Atmospheric pressure (Pa)
    int16_t  temperature;     // Temperature (°C * 100)
    uint16_t humidity;        // Relative Humidity (% * 100)

    uint16_t co2;             // CO₂ concentration (PPM)
    uint8_t  aqi;             // Air Quality Index (0–5)

    uint8_t  reserved[9];     // Padding and alignment (must be zero)
};

Note: The structure could be adapted to other sensor types. This is the structure used by HomeSensor.

All fields marked reserved must be zero when written. This convention aids compression and allows future extensibility.

Delta Record (variable length)

Delta records allow compact storage of sensor data by only storing fields that have changed significantly from the previous record.

They are encoded dynamically with the following format:

  • Only fields set in the field mask are present in the record.
  • Each present field is ZigZag-encoded, then VarInt-encoded.
  • Fields are encoded in the fixed order listed above.

The values from the delta record (which may be positive or negative) are added to the last known value to give the latest readings. The new readings become the last known value.
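A decoding sketch for a single delta field; this assumes LEB128-style VarInts (7 data bits per byte, high bit as continuation flag, as in Protocol Buffers), since the spec does not name a specific VarInt variant:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one LEB128-style VarInt (assumed variant: 7 bits per byte,
 * MSB = continuation). Returns bytes consumed, or 0 if truncated. */
static size_t varint_decode(const uint8_t *p, size_t len, uint32_t *out) {
    uint32_t v = 0;
    for (size_t i = 0; i < len && i < 5; i++) {
        v |= (uint32_t)(p[i] & 0x7F) << (7 * i);
        if (!(p[i] & 0x80)) { *out = v; return i + 1; }
    }
    return 0;  /* ran out of bytes before the terminating byte */
}

/* Undo ZigZag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ... */
static int32_t zigzag_decode(uint32_t v) {
    return (int32_t)(v >> 1) ^ -(int32_t)(v & 1);
}
```

To apply a delta record, decode one value per mask bit in order, ZigZag-decode it, and add it to the corresponding last known field value.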

Filtering recommendations
  • To avoid encoding noise, only include deltas that exceed a minimum threshold (e.g., ±2 for pm1_0).
  • If a field changes by less than this threshold, omit it from the delta.
  • Ensure you account for the need to "catch up" for slow drift of values.
  • Periodically write full records (e.g., once per block or once per 5 minutes) to ensure decoding sync.
  • Catch-up logic can be used to encode small accumulated differences after several suppressed deltas.

This method reduces file size while maintaining fidelity over time.
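The filtering rules above can be sketched as a small per-field state machine. Thresholds and the catch-up bookkeeping are application choices, not part of the format, and the names here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-field delta filter: suppress deltas below a threshold, but track
 * the suppressed drift so slow changes are not lost forever. */
typedef struct {
    int32_t last_written;  /* value the decoder currently knows */
    int32_t pending;       /* accumulated suppressed drift */
    int32_t threshold;     /* minimum delta worth encoding */
} DeltaFilter;

/* Returns 1 and sets *delta when a delta should be emitted;
 * returns 0 when the change is below the threshold. */
static int delta_filter_step(DeltaFilter *f, int32_t value, int32_t *delta) {
    int32_t d = value - f->last_written;  /* drift accumulates here */
    if (abs(d) < f->threshold) {
        f->pending = d;    /* remember drift; emit nothing */
        return 0;
    }
    *delta = d;            /* catch-up: emits all accumulated drift */
    f->last_written = value;
    f->pending = 0;
    return 1;
}
```

Because the delta is always computed against the last value actually written, a run of suppressed small changes is automatically caught up as soon as the accumulated drift crosses the threshold.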

Checkpoint Record (16 bytes)

A checkpoint is a lightweight checksum of the current full sensor state. It provides integrity markers for delta decoding and can assist recovery.

Checkpoint entries:

  • Appear periodically (e.g., every 10 deltas)
  • Have their own marker with type field set to 0b001 in bits 13–15
  • Contain the following structure:
struct Checkpoint {
    uint32_t timestamp;      // Timestamp of the checkpoint
    uint32_t record_crc;     // CRC32 of the reconstructed FullRecord
    uint16_t record_count;   // Number of records since start of block
    uint8_t  reserved[6];    // Reserved for alignment and future use (must be zero)
};

All fields marked reserved must be zero when written. This convention aids compression and allows future extensibility.

Checkpoints are optional but recommended to improve resilience.

Compression

  • Each data block may be compressed with Zstandard
  • Compression is per-block, not per-file
  • Zstandard is used purely as a codec — not for CRC or framing
  • If compression inflates the block size, the block is stored uncompressed instead

Application notes

Writing Strategy

  • Data is appended to an internal buffer representing the current block
  • Once a block reaches a set number of records or size, it is compressed
  • A new BlockTableEntry is written with offset, record count, and CRC
  • The block is flushed to disk at a 0x1000-aligned offset
  • Only finalised blocks are included in the block table
  • During shutdown, the current block should be finalised and flushed to ensure integrity
  • The num_samples and num_blocks fields are updated only when the file is finalised
  • A new full record is written at the start of each block
  • Incomplete or in-progress blocks are skipped during recovery unless their CRCs match

Reading Strategy

  • Read the primary and secondary headers and select the one with the valid CRC and highest sequence number
  • Read the block table up to num_blocks entries
  • Seek to the appropriate block using the timestamp and offset
  • Decompress block if BLOCK_FLAG_COMPRESSED is set
  • Iterate over full + delta records
  • Optional: Use checkpoint or block-aligned reads to optimise range queries
  • If no block table is available, records can still be streamed sequentially by parsing full and delta records in order

Recovery Strategy

  • Use the most recent valid header (valid CRC + highest sequence number)
  • If CRC of a block does not match, mark it as BLOCK_FLAG_TOMBSTONE
  • Finalised blocks are assumed valid
  • Incomplete blocks may be retained if CRC matches after skipping trailing deltas
  • The block table should not be extended unless the block is finalised and flushed
  • It is recommended to finalise and flush blocks periodically (e.g. on shutdown or idle)
  • Checkpoints can be used to validate decoded deltas without decompressing the whole block

Optimal block table entry interval

It is recommended that block table entries be written no more often than the application requires, both to conserve space in the table and to achieve better compression ratios. In HomeSensor, for example, a block is written every five minutes.

Extension & Compatibility

  • Backward compatible within major version (e.g., v1.x)
  • Secondary header allows for dual-write safety
  • Block table allows for optional checkpointing and per-block metadata

Suggested File Naming

<application>_YYYYMMDD.sdat
example: sensorlog_20250403.sdat

Status

Current Version: SDATv1
Maintainer: Elizabeth Ashford
Contact: elizabeth.jennifer.myers@gmail.com
License: CC0 (Public Domain Dedication)
