Comparing v006 and v007 ATL03 Reads

Overview

When using equivalent methods, the read times of v007 and v006 data are similar, with v007 being slightly faster. For both v006 and v007 data, using the blocksize fsspec parameter results in a substantial speedup.
Methods

Ten v007 ATL03 files of varying file size were compared for read speeds on Cryocloud (an AWS us-west-2 based JupyterHub) using one group of data (gt1l) and the h5py Python library. The h_ph variable and three relevant coordinate variables (lat_ph, lon_ph, and delta_time) were read. Each file was read 7 times. The first read was excluded from the averages because it was often much slower than the subsequent reads (often 2x slower), likely due to an S3 optimization that speeds up sequential reads. Times shown are the mean of the remaining 6 independent reads, and error bars show the standard deviation of those times. Additionally, each file was read 6 times consecutively before the timed reads began, to decrease the effect of any warm-up time from AWS.
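The timing procedure described above can be sketched as follows. This is a minimal illustration, not the script used for these results: the function name `time_reads` and its arguments are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_reads(
    read_once: Callable[[], object],
    n_reads: int = 7,
    n_discard: int = 1,
) -> Tuple[float, float, List[float]]:
    """Time `read_once` n_reads times and summarize all but the first n_discard.

    Returns (mean, standard deviation, kept durations), all in seconds.
    """
    if n_reads - n_discard < 2:
        raise ValueError("need at least two kept reads to compute a standard deviation")

    durations = []
    for _ in range(n_reads):
        start = time.perf_counter()
        read_once()
        durations.append(time.perf_counter() - start)

    # Drop the first read(s), which the text reports as often ~2x slower.
    kept = durations[n_discard:]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(kept), statistics.stdev(kept), kept
```

With the defaults, `read_once` is called 7 times and the statistics cover the last 6 calls, matching the counts given in the text.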
Two data versions are compared, which differ in how the data files are organized. v007 data was 1) repacked to be stored in 8 MB pages and 2) chunked into 100,000 element chunks. v006 is not paged and has ~10,000 element chunks. Read more about the details of the data version changes at https://nsidc.github.io/cloud-optimized-icesat2/
The “optimized” data reads passed the following parameters to the fsspec open() method:
{
  "cache_type": "blockcache",
  "block_size": 8*1024*1024
}

The high-level effects of these parameters are:

cache_type (fsspec): How downloaded data is cached. With "blockcache", fixed-size blocks of the file are downloaded and cached (not, for example, the whole file)
block_size (fsspec): How much data to request at once for buffering
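A read using these parameters might look like the sketch below. It is an illustration under stated assumptions rather than the code behind the results: `read_beam` is a hypothetical helper, `fs` is assumed to be an already-authenticated fsspec filesystem (for example an S3 filesystem), and the `<beam>/heights/<variable>` dataset paths are an assumption about the ATL03 layout that the text does not spell out.

```python
# Parameters from the text: block cache with 8 MiB blocks.
FSSPEC_PARAMS = {
    "cache_type": "blockcache",
    "block_size": 8 * 1024 * 1024,
}

# Variables named in the Methods section.
VARIABLES = ("h_ph", "lat_ph", "lon_ph", "delta_time")


def read_beam(fs, path, beam="gt1l", optimized=True):
    """Read the four variables for one beam group into NumPy arrays.

    `fs` is any fsspec filesystem instance; `path` is the file's location
    on that filesystem. Both are supplied by the caller.
    """
    import h5py  # imported here so the sketch loads without h5py installed

    open_kwargs = FSSPEC_PARAMS if optimized else {}
    with fs.open(path, mode="rb", **open_kwargs) as f:
        with h5py.File(f, "r") as h5:
            # Assumed layout: photon-level variables under "<beam>/heights".
            return {name: h5[f"{beam}/heights/{name}"][:] for name in VARIABLES}
```

Passing `optimized=False` leaves the filesystem's default caching in place, which corresponds to the non-optimized reads in the comparison.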

Speedup calculations
Execution time speedup was calculated using:
$$ S_{latency} = \frac{L_{old}}{L_{new}} $$
where $S$ is speedup and $L$ is latency (here, the measured read time). For example:
$$ S_{latency} = \frac{L_{old}}{L_{new}} = \frac{3}{2} = 1.5 $$
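The same calculation as a small Python helper, reproducing the worked example above. The function name `speedup` and the guard against non-positive latencies are illustrative additions.

```python
def speedup(latency_old: float, latency_new: float) -> float:
    """Return S = L_old / L_new; values above 1 mean the new read is faster."""
    if latency_new <= 0:
        raise ValueError("latency_new must be positive")
    return latency_old / latency_new


print(speedup(3.0, 2.0))  # 1.5
```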
Results

Figure 1: Read times (below)

The left figure shows read times for v006 and v007 with the blocksize set to 8 MiB. The right figure shows read times without specifying the blocksize parameter.

* indicates that the file is the same file used in Lopez et al. (2025)
Table 1: Average Speedup by Filesize

The table below shows the average speedup for files in each size group: Small (~200 MB), Medium (~700 MB), MediumLarge (~4 GB), and Large (~8 GB). Note that these sizes are the total file size. Data is read from only 1 group, so the amount of data actually read is smaller.

"v006 vs. v006 opt" shows the change between reading v006 data without a blocksize param specified ($L_{old}$) and v006 with a blocksize param specified ($L_{new}$).
"v007 vs. v007 opt" shows the change between reading v007 data without a blocksize param specified ($L_{old}$) and v007 with a blocksize param specified ($L_{new}$).
"v006 opt vs. v007 opt" shows the change between reading v006 with a blocksize param specified ($L_{old}$) and reading v007 with a blocksize param specified ($L_{new}$).


| Size Group | v006 vs. v006 opt | v007 vs. v007 opt | v006 opt vs. v007 opt |
|---|---|---|---|
| Small | 9.84 | 10.14 | 1.31 |
| Medium | 3.85 | 4.53 | 1.07 |
| MediumLarge | 2.31 | 4.42 | 1.06 |
| Large | 1.86 | 3.92 | 1.05 |


All times in seconds; the speedups in Table 1 are unitless ratios.
Takeaways

Overall we see that:

Reading v007 data is faster than reading v006 data (between 1.05 and 1.31 times faster)
Using the blocksize parameter results in a substantial speedup (between about 1.9 and 10 times faster), independent of the data version (repacked and chunked vs. not)
Both read improvements (v007 repacking and including the blocksize param) have the biggest impact on the smaller files
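The ranges quoted in these takeaways can be checked directly against Table 1. The sketch below copies the table values and computes the minimum and maximum speedup for each comparison; the names `TABLE_1` and `value_range` are illustrative.

```python
# Average speedups copied from Table 1, per size group:
# (v006 vs. v006 opt, v007 vs. v007 opt, v006 opt vs. v007 opt)
TABLE_1 = {
    "Small": (9.84, 10.14, 1.31),
    "Medium": (3.85, 4.53, 1.07),
    "MediumLarge": (2.31, 4.42, 1.06),
    "Large": (1.86, 3.92, 1.05),
}


def value_range(columns):
    """Return (min, max) over the given column indices of TABLE_1."""
    values = [row[c] for row in TABLE_1.values() for c in columns]
    return min(values), max(values)


blocksize_range = value_range((0, 1))  # effect of the blocksize parameter
version_range = value_range((2,))      # effect of v006 -> v007
# Size group with the largest speedup in each column.
largest = [max(TABLE_1, key=lambda g: TABLE_1[g][c]) for c in range(3)]

print(blocksize_range)  # (1.86, 10.14)
print(version_range)    # (1.05, 1.31)
print(largest)          # ['Small', 'Small', 'Small']
```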