Why read chunks of 8192 Bytes?
- Buffer sizes should be a power of 2 because otherwise they can be misaligned with the filesystem blocks provided by the operating system (requiring more blocks to be read for each operation, making I/O slower)
- 8192 is larger than 4096, which is the default filesystem block (cluster) size on Windows (⇒ performance gain, fewer read calls)
- 8192 (8 KiB) provides a good balance between performance and memory usage
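For illustration, a minimal sketch of reading a file in 8 KiB chunks (class name, method, and path handling are made up, not the project's actual code):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChunkedReadSketch {
    static final int BUFFER_SIZE = 8192; // 8 KiB

    // Reads the whole file in 8 KiB chunks and returns the total byte count
    static long countBytes(Path path) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
        long total = 0;
        try (FileChannel in = FileChannel.open(path, StandardOpenOption.READ)) {
            while (in.read(buffer) != -1) {
                buffer.flip();                // write mode -> read mode
                total += buffer.remaining();  // "process" the chunk (here we only count bytes)
                buffer.clear();               // back to write mode for the next read
            }
        }
        return total;
    }
}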
What does this line private static final long MAX_FILE_SIZE = (1L << 32); // 4GB max do?
- It shifts 1L left by 32 bit positions: 1 becomes a 1 followed by 32 zero bits ⇒ 2^32 = 4,294,967,296 bytes ≈ 4.29 GB (4 GiB)
- 1 as long (1L) because 2^32 is greater than Integer.MAX_VALUE in Java, which is 2^31 - 1. If we shifted just 1 without the L, the shift would happen in 32-bit int arithmetic, where the shift distance is masked to 0–31, so 1 << 32 would silently evaluate to 1 instead of 2^32
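Tiny demo of the difference (standalone snippet, not project code):
class ShiftDemo {
    public static void main(String[] args) {
        System.out.println(1L << 32); // 4294967296 = 2^32, fits in a long
        System.out.println(1 << 32);  // 1 — for int, the shift distance is masked to 0..31, so this shifts by 0
        System.out.println(1 << 31);  // -2147483648 — the sign bit gets set, the int becomes negative
    }
}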
What does this line return (value + 63) & ~63L; do?
- Used in the align64 method
- value + 63: pushes non-64-aligned values into the next bucket (e.g. 0 stays in the first bucket, 1 to 64 land in the second bucket, and so on)
- ~63L ⇒ 63 is bitwise 00111111; ~ inverts it, so it becomes … 11111111 11000000 ⇒ all 1s except the lowest 6 bits
- We apply the mask with bitwise &, setting the lowest 6 bits to 0
- With the lowest 6 bits set to 0, the resulting number will always be a multiple of 64!
- Analogy for decimal: adding 99 to a number and turning the lowest 2 digits of the result to 0 always rounds up to the next multiple of 100: 392 + 99 ⇒ 491 ⇒ 400 ⇒ 4 * 100 (and an already-aligned 400 + 99 ⇒ 499 ⇒ 400 stays where it is)
⇒ Remember: To 64-align a number, add 63 to it and clear the lowest 6 bits
Result
0 + 63 & ~63 => 0
1 + 63 & ~63 => 64
64 + 63 & ~63 => 64
65 + 63 & ~63 => 128
Where is align64 needed in the code?
- It is only explicitly used when reading from and writing to the file data block
- But this is only because file entries and header are already implicitly 64-aligned (they both are 64 bytes each)
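A minimal sketch of what align64 presumably looks like (the exact signature in the project may differ):
// Rounds value up to the next multiple of 64; already-aligned values stay unchanged
private static long align64(long value) {
    return (value + 63) & ~63L;
}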
Difference between a ByteBuffer and a byte[] in Java?
- A ByteBuffer is an abstraction over the primitive type byte[]; it allows reading and writing of various primitive types in different byte orders (e.g. reading as LittleEndian)
- ByteBuffers operate with three variables all the time:
- Capacity: Fixed size of the buffer, cannot change over time
- Position: Where the next read / write happens
- Limit: The “stop here” boundary
- Practical implication for our code: in copyRange we check whether the remaining bytes are smaller than buffer.capacity; if so, we shrink its limit to remaining so that we don't read more than necessary
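Small sketch of that limit-shrinking idea (buffer size and the remaining value are made up for the example):
import java.nio.ByteBuffer;

class LimitDemo {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(8192); // capacity = 8192, position = 0, limit = 8192
        long remaining = 100;                          // pretend only 100 bytes are left to copy
        if (remaining < buffer.capacity()) {
            buffer.limit((int) remaining);             // reads into this buffer now stop after 100 bytes
        }
        System.out.println(buffer.capacity() + " " + buffer.position() + " " + buffer.limit()); // 8192 0 100
    }
}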
Explain why, in this code snippet, the file is opened with READ and WRITE:
public static void writeAtPosition(Path fsPath, long position, byte[] data) throws IOException {
try (FileChannel fc = FileChannel.open(fsPath, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
fc.position(position);
fc.write(ByteBuffer.wrap(data));
}
}
- It’s an oversight: here, only WRITE is necessary
What does buffer.flip() do?
- It switches the buffer from write mode to read mode: limit is set to the current position, position is reset to 0, and capacity stays unchanged (switching back to write mode is done with clear() or compact(), not with another flip())
Example: Switching from write mode to read mode
ByteBuffer buffer = ByteBuffer.allocate(totalSize);
buffer.order(ByteOrder.LITTLE_ENDIAN);
// Write to buffer
buffer.put(headerBytes);
// Switching from write mode to read mode
buffer.flip();
try (FileChannel channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
// Read from buffer
channel.write(buffer);
}
Explain why this code has to be wrapped in a loop:
while (buffer.hasRemaining()) {
    out.write(buffer);
}
// why can't we just do:
out.write(buffer);
- Because a WritableByteChannel is allowed to write only a partial amount of the data remaining in the buffer
Explain what this code snippet does:
ByteBuffer buffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
- It allocates a ByteBuffer whose backing memory lives off-heap (the ByteBuffer object itself lives, like every Java object, on the heap, but it points to off-heap memory). This is called a direct buffer.
- Direct buffers can be faster when reusing the buffer many times (as we do in copyRange) but are more expensive to instantiate (which is of no concern for us because we only instantiate once)
  - When reading the data from in, the data is read from the filesystem into user space (= total memory of the current Java program, heap + off-heap memory)
  - With a heap ByteBuffer, Java may need a temporary native buffer and copy data between native memory (= off-heap memory) and the heap buffer. With a direct ByteBuffer, the OS can often read/write directly into that off-heap memory, avoiding that extra copy before and after each invocation.
  - Sidenote: Because the memory of direct buffers lives outside the garbage-collected heap, the memory impact of direct buffers is hard to debug. That's why heap buffers are preferred when speed isn't critical
- BUT: this is not the fastest way to do it
- And a major oversight on our part
- We could've used in.transferTo, which would've let the OS handle the whole copying of bytes between two files without the need to load any data into our program
- This would've made a much bigger performance impact than using a direct buffer instead of a heap buffer!
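A sketch of how a transferTo-based copyRange could look (the method name and parameters mirror our copyRange; transferTo itself is the real FileChannel API):
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class TransferSketch {
    // Copies length bytes starting at start from in to out without pulling the data into our program
    static void copyRange(FileChannel in, long start, long length, WritableByteChannel out) throws IOException {
        long transferred = 0;
        while (transferred < length) {
            // transferTo may transfer fewer bytes than requested, so we loop
            long n = in.transferTo(start + transferred, length - transferred, out);
            if (n <= 0) {
                break; // nothing more could be transferred
            }
            transferred += n;
        }
    }
}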
Why can we do int bytesRead = in.read(buffer); in readHeader when we had to use a while loop when writing a buffer in copyRange?
- It's an oversight on our end: in.read can in theory return fewer bytes than the buffer has room for, even if the file has enough bytes ⇒ there could be a full header in the file, but we would error out because fewer bytes than the buffer size were read
- In practice, the JVM tries to read as many bytes as possible, so the case that enough bytes are present but fewer are read is rare
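A sketch of a read-fully loop that would make readHeader robust (helper name is made up):
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class ReadFullySketch {
    // Keeps reading until the buffer is full or the file really ends
    static void readFully(FileChannel in, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            if (in.read(buffer) == -1) {
                throw new EOFException("File ended before the buffer could be filled");
            }
        }
    }
}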
What does Math.round( (totalSizeBytes / (1024.0 * 1024.0)) * 1000.0 ) / 1000.0; do?
- 1024 * 1024 = 1'048'576
- This is the amount of bytes in a Mebibyte
- 1024 = 2^10, so 1024 * 1024 = 2^20
- Storage sizes in CS are often given as powers of two. The corresponding labels are “kibibyte”, “mebibyte”, “gibibyte” and so on
- They are slightly larger than their decimal counterparts (kilobyte, megabyte, gigabyte…)
- The correct abbreviation would be MiB (we write MB, which is technically wrong)
- The Math.round(x * 1000) / 1000 pattern rounds to three decimal places
  - In Java there is no Math.round(x, 3), so we have to shift by three decimal digits (multiply by 1000), round to the nearest whole number, and shift back (divide by 1000.0), which rounds the number at the third decimal digit
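Worked example (the byte count is made up):
class RoundDemo {
    public static void main(String[] args) {
        long totalSizeBytes = 1_234_567;                     // ~1.18 MiB
        double mib = totalSizeBytes / (1024.0 * 1024.0);     // 1.1773748...
        double rounded = Math.round(mib * 1000.0) / 1000.0;  // 1177.37... -> 1177 -> 1.177
        System.out.println(rounded);                         // 1.177
    }
}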
What does this code do?
String appendix = (deletedFiles > 0) ? ", " + deletedFiles + " files are deleted, defragment first" : "";
System.err.println("No more free slots " + appendix);
- It uses the ternary (conditional) operator
- appendix is ", x files are deleted, defragment first" when there are deleted files in the FS; appendix is empty if there are no deleted files
Why is the data start offset 2112?
Header (64 Bytes) + 32 * 64 (64 Bytes per FileEntry) = 2112
Why is entryOffset a long here: long entryOffset = header.getFileTableOffset() + (long) i * FileEntry.FILE_ENTRY_SIZE; ?
- It's unnecessary (an oversight): getFileTableOffset() always returns 64, i can at most be 31, and FILE_ENTRY_SIZE is 64, so entryOffset is 2,048 at most
- Technically, a short could be used here, which ranges roughly from -32k to +32k
What happens in rmfs if the file to be deleted is found?
- Two things: the flag of the FileEntry is set to 1 (deleted) and the header is updated: deleted files count incremented, file count decremented
- Neither the file entry nor the data chunk is deleted!
- That only happens in defrag
What’s the difference between getfs and catfs?
- Getfs (write data chunk of entry to file) and catfs (print data chunk to console) both call catfs
- Getfs passes a file channel as out, while catfs passes Channels.newChannel(System.out)
  - Both the file channel and the console channel implement WritableByteChannel
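Small sketch of why that works: the same write routine accepts both targets because both are WritableByteChannels (class, method, and output path are made up):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelTargets {
    // The same method works for a file and for the console
    static void writeAll(WritableByteChannel out, byte[] data) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        while (buffer.hasRemaining()) {
            out.write(buffer);
        }
    }

    public static void main(String[] args) throws IOException {
        writeAll(Channels.newChannel(System.out), "hello\n".getBytes()); // console target, as in catfs
        try (FileChannel file = FileChannel.open(Path.of("out.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            writeAll(file, "hello\n".getBytes());                        // file target, as in getfs
        }
    }
}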
In copyRange, why is in a FileChannel and out a WritableByteChannel when a ReadableByteChannel could be used for in? (Why the asymmetry?)
- In copyRange we want to read from in at an offset and write starting at position 0 in out
- ReadableByteChannel does not offer reading from an offset
- So handling of this offset would be delegated to outside the copyRange method and make the code more complicated everywhere copyRange is called
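For reference, a sketch of how copyRange presumably looks based on the behavior described in these notes (exact signature and buffer handling in the project may differ):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class CopyRangeSketch {
    private static final int BUFFER_SIZE = 8192;

    // Copies length bytes from the absolute offset start in `in` to the current position of `out`
    static void copyRange(FileChannel in, long start, long length, WritableByteChannel out) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
        long remaining = length;
        long readPos = start;
        while (remaining > 0) {
            buffer.clear();
            if (remaining < buffer.capacity()) {
                buffer.limit((int) remaining);        // don't read more than necessary
            }
            int bytesRead = in.read(buffer, readPos); // absolute read, channel position untouched
            if (bytesRead == -1) {
                break;                                // source ended early
            }
            buffer.flip();
            while (buffer.hasRemaining()) {
                out.write(buffer);                    // relative write, may write partially
            }
            readPos += bytesRead;
            remaining -= bytesRead;
        }
    }
}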
Explain the date formatting at these two points in code (lsfs)
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss:S");
...
String dateString = simpleDateFormat.format(entry.getCreated() * 1000L);
- The first line defines the format we want to print our date in
- The single S at the end stands for milliseconds, non-padded!
  - Non-padded: 7ms ⇒ “7”, 70ms ⇒ “70”
  - Padded (via SSS): 7ms ⇒ “007”, 70ms ⇒ “070”
    - This is the usual pattern, gives fixed-length formats
- The created field in FileEntry is stored as a timestamp in seconds (seconds passed since 1.1.1970); we even divide System.currentTimeMillis() by 1000 to get seconds when making a new FileEntry in addfs
  - This is why we have to multiply the timestamp by 1000 in the second line, which results in our timestamps always ending in 000
  - This is a weird pattern to have:
    - When making the FileEntry: get milliseconds, truncate to seconds
    - When printing the FileEntry: get the creation timestamp, multiply by 1000 to imply ms precision when this precision was explicitly dropped
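Small demo of the padding difference (the timestamp and millisecond value are made up):
import java.text.SimpleDateFormat;

class DateFormatDemo {
    public static void main(String[] args) {
        long createdSeconds = 1_700_000_000L;       // seconds since 1.1.1970, as stored in FileEntry
        long millis = createdSeconds * 1000L + 7;   // pretend 7 ms for the demo

        SimpleDateFormat single = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss:S");
        SimpleDateFormat padded = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss:SSS");

        System.out.println(single.format(millis)); // ends in :7   (non-padded milliseconds)
        System.out.println(padded.format(millis)); // ends in :007 (fixed three digits)
    }
}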
Explain this piece of code
List<FileEntry> alive = fileEntries.stream()
.filter(e -> e.getFlag() == 0)
.toList();
- It creates a new list containing the FileEntries which are alive. It uses the Java streams API for this
- Streams are chained together; intermediate operations (filter in this example) return an altered stream on which another map, filter, reduce (and more) operation can be performed
- Intermediate operations on streams do not alter the underlying data (fileEntries is the same before and after the call)
Explain how the FileChannel in defrag is used
- One FileChannel is instantiated for the whole defrag process
- This channel doesn’t get closed nor reopened during the rewriting of file data chunks!
- The channel is passed in as both input and output into copyRange
  - Effectively reading from the file at one position and writing the contents to another position
- The position field of the channel is updated for each entry
  - This works because copyRange reads from an absolute position and writes to a relative position
    - Read: int bytesRead = in.read(buffer, readPos); → absolute
    - Write: out.write(buffer); → relative, uses channel.position
  - There are two ways a channel can be read/written:
    - Relative: channel.read(buffer) → uses the channel.position field for reading/writing
    - Absolute: channel.read(buffer, position) → ignores channel.position
⇒ In essence, dfrgfs controls where to read from via the parameters start and length passed into copyRange and controls where to write to via channel.position
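Tiny illustration of the two read variants (the file name and offsets are made up):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class RelativeVsAbsolute {
    public static void main(String[] args) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(64);
        try (FileChannel channel = FileChannel.open(Path.of("container.fs"), StandardOpenOption.READ)) {
            channel.position(128);
            channel.read(buffer);       // relative: reads at 128 and advances channel.position
            buffer.clear();
            channel.read(buffer, 2112); // absolute: reads at 2112, channel.position is untouched
            System.out.println(channel.position()); // 192, if the relative read filled the 64-byte buffer
        }
    }
}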
Explain oldNext, newOffset, alignedOffset, padding, finalSize in defrag
- oldNext: This is the end of the data block in the filesystem before defragging. This value is used at the end (difference to header.getNextFreeOffset()) to compute the total cleaned bytes. Important: this value is not used anywhere else!
- newOffset: This is the start position for each entry not flagged for deletion. In the beginning, it is set to the start of the data block (the position after the last FileEntry). It grows larger as defrag progresses, incremented by the size of each block. Important: This position is potentially unaligned at the start of an iteration but gets aligned afterwards! (because it is used in conjunction with alignedOffset)
- alignedOffset: This is the 64-aligned version of newOffset. There is an aligned and an unaligned version of newOffset in order to compute the required padding.
- padding: The difference between newOffset (initially unaligned) and alignedOffset. padding zero bytes are written at the end of the last data chunk to make it aligned. newOffset is set to alignedOffset after the zero bytes have been written, making it aligned before writing the contents of the actual new block
All in all, the cycle looks like this:
- Check whether newOffset is 64-aligned
  - If not, write the required amount of zero bytes to get to the next multiple of 64 and update newOffset to that aligned position → This pads the data block of the last written file entry!
- Write the file contents of the current entry (unaligned), increment newOffset by the number of bytes written
- Repeat the cycle
For clarification: alignedOffset and padding could be dropped; the whole loop could be written like this:
// DataStartOffset is always 64 aligned (is 2112)
long newOffset = header.getDataStartOffset();
for (FileEntry e : aliveSorted) {
// Always aligned
long dest = newOffset;
// Set write position
channel.position(dest);
copyRange(channel, e.getStart(), e.getLength(), channel);
e.setStart(dest);
// newOffset is potentially unaligned at the moment
newOffset += e.getLength();
// Aligned version of newOffset
long nextAligned = align64(newOffset);
// Fill the data block with zero bytes at the end of the iteration
if (nextAligned > newOffset) {
channel.position(newOffset);
channel.write(ByteBuffer.allocate((int)(nextAligned - newOffset)));
}
// Align newOffset for next iteration after padding
newOffset = nextAligned;
}