There are many tutorials on the web that explain how to build a simple "Hello, World" in C without the use of libc on AMD64, but most of them stop there.
This guide hopes to provide a more complete explanation that will allow you to build yourself a small framework to write more complex programs. The code will support both AMD64 and i386.
We will compile with -g for debug information and with -O0 to disable optimization, so that as much as possible is visible in the debugger. The next steps show how to do this.
- Firstly, run the following.
$ cat > hello.c << "EOF"
#include <stdio.h>
int main(int argc, char* argv[])
{
printf("Hello, World\n");
return 0;
}
EOF
- To compile and run this program, we'll run the following commands.
$ gcc -O0 -g hello.c # After running, continue to the next line.
$ ./a.out
- This outputs a simple "Hello, World", followed by a line feed in our console.
- To debug this program, we'll use GNU's debugger, gdb, on the output file a.out.
$ gdb a.out
(gdb) break main
(gdb) run
(gdb) backtrace
- This will output
#0 main (argc=1, argv=0x7fffffffda08) at hello.c:5
Although this gives us some useful information, the frames that come before main are still hidden from us. We need to tell gdb to continue the backtrace past main and past the program's entry point, into libc's startup functions.
$ gdb a.out
(gdb) break main
(gdb) run
(gdb) backtrace
(gdb) set backtrace past-main on
(gdb) set backtrace past-entry on
(gdb) bt
- Our new output
#0 main (argc=1, argv=0x7fffffffda08) at hello.c:5
#1 0x00007ffff7df52ca in ?? () from /lib64/libc.so.6
#2 0x00007ffff7df5385 in __libc_start_main () from /lib64/libc.so.6
#3 0x0000555555555071 in _start ()
That is definitely much better. As we can see, the first function that's actually called is _start, which then calls __libc_start_main, which is clearly a standard library initialization function that invokes main.
You can take a look at _start and __libc_start_main in the glibc source if you're interested. They are not that interesting for us, as they set up the dynamic linker and other things that we will never use, since we want a static executable.
Let's try recompiling our "Hello, World" program with optimization flags this time (-O2), without debug information and with stripping (-s) to see how large it is.
$ gcc -s -O2 hello.c
$ wc -c a.out
6208 a.out
6 KiB for a simple Hello World? That's a lot.
Even if I add more size optimization flags, such as -Wl,--gc-sections -fno-unwind-tables -fno-asynchronous-unwind-tables -Os, it stays at 6 KiB.
We will now progressively strip this program down by first getting rid of the standard library, then learning how to invoke syscalls without the necessity of headers.
So how do we get rid of the standard library? Of course, if we try to compile our current code with -nostdlib we will run into linker errors. So first, let's troubleshoot those linker errors.
$ gcc -s -O2 -nostdlib hello.c
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001020
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccAZZZwG.o: in function `main':
hello.c:(.text.startup+0xc): undefined reference to `puts'
collect2: error: ld returned 1 exit status
The linker is complaining about _start missing, which is what we would expect from our previous debugging.
We also have a linker error on puts, which is to be expected since it is a function from libc (gcc replaced our printf call with puts, because the string has no format specifiers and ends with a line feed). But how do we print "Hello, World" without puts?
The Linux Kernel exposes a bunch of syscalls, which are functions that user-space programs can enter to interact with the Operating System. You can see a list of syscalls by running man syscalls, or you can visit man7's syscalls webpage.
So, how do we find out which syscall puts uses? We can either look through the syscall list, or simply install strace to trace syscalls and write a simple program that uses puts.
The strace method is extremely useful to us. If you don't know how to do something with syscalls, do it with libc, then strace it to find out which syscalls it uses on the target architecture.
Let's try this out.
- Our simple program, puts.c, which uses puts from stdio.h.
#include <stdio.h>
int main(int argc, char* argv[])
{
puts("Hello, World");
return 0;
}
- Using strace to decipher the syscall we want.
$ gcc puts.c
$ strace ./a.out > /dev/null
write(1, "Hello, World\n", 13) = 13
exit_group(0) = ?
+++ exited with 0 +++
Note that stdout is redirected to /dev/null here. That's because strace writes its output to stderr, and we don't want it mixed with a.out's own output.
So we can derive from this that puts uses the write syscall.
Let's check the manpage for write.
$ man 2 write
NAME
write - write to a file descriptor
SYNOPSIS
#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t count);
DESCRIPTION
write() writes up to count bytes from the buffer starting at buf
to the file referred to by the file descriptor fd.
In Linux, there are three standard file descriptors:
- stdin: used to pipe data into the program or read user input.
- stdout: used to output information.
- stderr: used as an alternate output for error messages.
If we read man stdout, we read that these are simply defined as 0, 1, and 2.
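As a small illustration of those numbers, the sketch below names the three descriptors and writes to stderr instead of stdout. It still uses libc's write() and assumes a POSIX system with <unistd.h>. The enum names and the helper write_to_stderr are made up for this example and are not part of any library.

```c
#include <unistd.h> /* write, STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO */

/* Hypothetical names for the three standard descriptors. */
enum { FD_STDIN = 0, FD_STDOUT = 1, FD_STDERR = 2 };

/* Hypothetical helper: send len bytes of msg to stderr.
 * Returns the number of bytes written, or -1 on error. */
static long write_to_stderr(const char *msg, unsigned long len)
{
    return (long) write(FD_STDERR, msg, len);
}
```

Calling write_to_stderr("oops\n", 5) should still show the message when stdout is redirected to /dev/null, because only descriptor 1 is redirected.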
So all we have to do is replace our puts() with a write() to stream 1, which is stdout.
So let's try that.
#include <unistd.h>
int main(int argc, char* argv[])
{
write(1, "Hello, World\n", 13);
return 0;
}
Let's try to compile that again.
$ gcc -s -O2 -nostdlib hello.c
hello.c: In function 'main':
hello.c:5:5: warning: ignoring return value of 'write' declared with attribute 'warn_unused_result' [-Wunused-result]
5 | write(1, "Hello, World\n", 13);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001020
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccqJWxSf.o: in function `main':
hello.c:(.text.startup+0x16): undefined reference to `write'
collect2: error: ld returned 1 exit status
It seems our write() function is also a part of the standard library. How do we invoke syscalls without having to link the standard library?
Let's take a look at section A.2.1 Calling Conventions in the AMD64 ABI Specification.
If you're completely clueless about assembly, you should still be able to understand once you see an example.
User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
The number of the syscall has to be passed in register %rax.
System-calls are limited to six arguments, no argument is passed directly on the stack.
Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
Only values of class INTEGER or class MEMORY are passed to the kernel.
System V Application Binary Interface, Appendix A § 2.1, Calling Conventions.
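The error convention in the quoted passage is easy to get wrong, so here is a minimal sketch of how a raw syscall result could be checked in C. The helper names syscall_failed and syscall_errno are hypothetical and only illustrate the -4095 to -1 rule. They are not part of libc or the kernel interface.

```c
/* Raw syscall results between -4095 and -1 are error codes (-errno).
 * Anything else, including other negative values, is a normal result. */
static int syscall_failed(long result)
{
    return result >= -4095 && result <= -1;
}

/* Returns the errno value encoded in a failed result, or 0 on success. */
static int syscall_errno(long result)
{
    return syscall_failed(result) ? (int) -result : 0;
}
```

For example, a write to an invalid file descriptor comes back as -9 on Linux, and syscall_errno(-9) gives 9, which is EBADF.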
In short, all we need to do is write an assembly wrapper that will:
- Take the syscall number, followed by either pointers or integers, as parameters.
- Set %rax to the syscall number.
- Set %rdi, %rsi, %rdx, %r10, %r8, and %r9 to the parameters. Calls that take fewer than six arguments will ignore the excess ones.
- Execute syscall.
- Return the content of %rax.
If we read section 3.4 of the specification or the quick cheatsheet on osdev.org, we will see that on AMD64, the registers used to pass parameters to regular functions are almost the same as the syscalls, except for %r10 which is replaced with %rcx. The return register is also the same (%rax).
This means that our syscall wrapper will only be able to accept and forward a maximum of five parameters, because the first of the six parameter registers is already used to pass the syscall number.
We could use the stack to take a sixth syscall argument, but let's not make our lives more complicated when we don't even need to call syscalls with more than five parameters yet.
The Application Binary Interface also states that:
Registers %rbp, %rbx, and %r12 through %r15 “belong” to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers’ values for its caller. Remaining registers “belong” to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.
This means that we don't have to worry about saving and restoring the values of %rdi, %rsi, %rdx, %r10, %r8, and %r9 inside of our syscall wrapper. It's up to the caller to save them, and gcc will take care of that because we are calling from C code. The same goes for %rcx and %r11, which the kernel destroys.
Putting this all together gives us our syscall wrapper.
mov %rdi, %rax /* %rax (syscall number) = func param 1 (%rdi) */
mov %rsi, %rdi /* %rdi (syscall param 1) = func param 2 (%rsi) */
mov %rdx, %rsi /* %rsi (syscall param 2) = func param 3 (%rdx) */
mov %rcx, %rdx /* %rdx (syscall param 3) = func param 4 (%rcx) */
mov %r8, %r10 /* %r10 (syscall param 4) = func param 5 (%r8) */
mov %r9, %r8 /* %r8 (syscall param 5) = func param 6 (%r9) */
syscall /* Enter a syscall (return value in %rax) */
ret /* Return value is already in %rax, we can return. */
How do we embed our arbitrary assembly into our program though? One way is via the gcc inline assembler. However, the syntax is ugly.
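For comparison, here is a minimal sketch of what the inline assembler route might look like for a three-argument syscall. It assumes x86-64 Linux and GCC or Clang extended asm. The name inline_syscall3 is made up for this example, and the rest of this guide does not use it.

```c
#if !defined(__x86_64__) || !defined(__linux__)
#error "This sketch only applies to x86-64 Linux."
#endif

/* Hypothetical helper: invoke a syscall with up to three arguments. */
static long inline_syscall3(long number, long arg1, long arg2, long arg3)
{
    long result;

    __asm__ volatile (
        "syscall"
        : "=a" (result)              /* result comes back in %rax */
        : "a" (number),              /* syscall number in %rax */
          "D" (arg1),                /* first argument in %rdi */
          "S" (arg2),                /* second argument in %rsi */
          "d" (arg3)                 /* third argument in %rdx */
        : "rcx", "r11", "memory"     /* the kernel destroys %rcx and %r11 */
    );
    return result;
}
```

A call such as inline_syscall3(1, 1, (long) "Hello, World\n", 13) would then perform the write syscall directly. The constraint letters are exactly the kind of syntax that makes a separate assembly file easier to read.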
We're going to write a .s file for the GNU Assembler, named hello.s, and compile and link it with our hello.c program using gcc.
.global syscall5 /* Export syscall5 to other compilation units. */
.text /* Place the following code in the .text section, which holds executable code. */
syscall5:
mov %rdi, %rax
mov %rsi, %rdi
mov %rdx, %rsi
mov %rcx, %rdx
mov %r8, %r10
mov %r9, %r8
syscall
ret
To find any syscall numbers, refer to filippo.io/linux-syscall-table/.
Additionally, you can simply have the C preprocessor print it for you.
$ printf "#include <sys/syscall.h>\n SYS_write" | gcc -E - | sed "/^#.*/d"
1
- -E runs the preprocessor on the file, expanding all macros and therefore replacing #define constants with their corresponding value.
- The lone - means that we use stdin as input, which we pipe here with printf.
- We simply use sed to remove lines we don't want. I would assume you know what sed is.
- Optionally, you can use the -m32 flag for 32-bit calls.
Syscall numbers are usually prefixed by SYS_.
Back to our prototype from earlier,
ssize_t write(int fd, const void *buf, size_t count);
ssize_t and size_t are types defined by unistd.h. A quick inspection of their definitions reveals that they are 64-bit integers, and that the extra s in ssize_t means it is a signed value.
$ printf "#include <unistd.h>" | gcc -E - | grep size_t
typedef long int __blksize_t;
typedef long int __ssize_t;
typedef __ssize_t ssize_t;
typedef long unsigned int size_t;
If we try the -m32 flag, we see that they become 32-bit. This means that ssize_t and size_t are the same size as the architecture's pointers.
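That claim can also be checked by the compiler itself. The sketch below assumes a C11 compiler and a Linux libc that provides <sys/types.h>. The two helper functions are hypothetical names used only for this illustration.

```c
#include <stddef.h>    /* size_t */
#include <sys/types.h> /* ssize_t */

/* Compilation fails if either type differs in size from a pointer. */
_Static_assert(sizeof(size_t) == sizeof(void *), "size_t is not pointer-sized");
_Static_assert(sizeof(ssize_t) == sizeof(void *), "ssize_t is not pointer-sized");

/* Converting -1 shows which of the two types is signed. */
static int ssize_t_is_signed(void)  { return (ssize_t) -1 < 0; }
static int size_t_is_unsigned(void) { return (size_t) -1 > 0; }
```

The same file should compile both with and without -m32, since the assertions compare against the pointer size rather than a fixed number of bits.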
We can now declare syscall5 from hello.s in our hello.c program and make a write function that calls it, as demonstrated below.
void* syscall5(
void* number,
void* arg1,
void* arg2,
void* arg3,
void* arg4,
void* arg5
);
typedef unsigned long int uintptr; /* size_t */
typedef long int intptr; /* ssize_t */
static intptr write(int fd, void const* data, uintptr nbytes)
{
return (intptr)
syscall5(
(void*) 1, /* SYS_write, call number 1 */
(void*) (intptr) fd,
(void*) data,
(void*) nbytes,
0, /* Ignored */
0 /* Ignored */
);
}
int main(int argc, char* argv[])
{
write(1, "Hello, World\n", 13);
return 0;
}
See that (void*)(intptr) double cast on fd? If int is 32-bit and void* is 64-bit, casting fd straight to a pointer would give us a warning about a cast to a pointer from an integer of a different size, so we first widen it explicitly with the intptr cast.
This should be done every time you cast to and from pointers when the other type is not guaranteed to be the same size as pointers, especially when targeting multiple architectures.
Note how we cast the const qualifier away from data to avoid a warning.
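The double cast works in both directions. Here is a small sketch, assuming Linux where long is the same size as a pointer. The helper names int_to_pointer and pointer_to_int are made up for this example.

```c
typedef long int intptr; /* pointer-sized signed integer on Linux (ssize_t) */

/* int -> pointer: widen to pointer size first, then convert. */
static void *int_to_pointer(int value)
{
    return (void *) (intptr) value;
}

/* pointer -> int: convert to a pointer-sized integer first, then narrow. */
static int pointer_to_int(void *pointer)
{
    return (int) (intptr) pointer;
}
```

Going through intptr in the middle keeps every conversion explicit, so the compiler has nothing to warn about on either 32-bit or 64-bit targets.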
Back to the AMD64 ABI documentation. In figure 3.11, we can see the initial state of the stack.
argc is a non-negative argument count
argv is an array of argument strings, with argv[argc] == 0
Figure 3.11: Initial Process Stack
Purpose                                    Start Address          Length
Unspecified                                High Address
Information block, including argument
strings, environment strings,
auxiliary information ...                                         varies
Unspecified
Null auxiliary vector entry                                       1 eightbyte
Auxiliary vector entries ...                                      2 eightbytes each
0                                                                 eightbyte
Environment pointers ...                                          1 eightbyte each
0                                          8 + 8 * argc + %rsp    eightbyte
Argument pointers                          8 + %rsp               argc eightbytes
Argument count                             %rsp                   eightbyte
Undefined                                  Low Address
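Read from the bottom up, Figure 3.11 says that argc, the argument pointers and the environment pointers can all be located from the initial stack pointer alone. The sketch below expresses that arithmetic in C. The struct and function names are hypothetical, and it assumes that long and pointers are both one eightbyte, as on AMD64 Linux.

```c
/* Hypothetical view of the bottom of the initial process stack. */
struct initial_stack {
    long   argc; /* argument count, found at %rsp */
    char **argv; /* argument pointers, starting at 8 + %rsp */
    char **envp; /* environment pointers, after argv's terminating 0 */
};

/* stack_pointer is assumed to hold the value %rsp had at process entry. */
static struct initial_stack read_initial_stack(void *stack_pointer)
{
    long *words = stack_pointer;
    struct initial_stack info;

    info.argc = words[0];
    info.argv = (char **) (words + 1);
    info.envp = info.argv + info.argc + 1; /* skip argc pointers and the 0 */
    return info;
}
```

Our assembly entry point only needs the first two fields, but the same layout is where a larger framework would find the environment.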
We only need the bottom of this layout, the argument count and the argument pointers. Right beneath the figure, we have the initial state of the registers, which is very important to us.
%rbp: The content of this register is unspecified at process initialization time, but the user code should mark the deepest stack frame by setting the frame pointer to zero.
%rsp: The stack pointer holds the address of the byte with lowest address which is part of the stack. It is guaranteed to be 16-byte aligned at process entry.
%rdx: A function pointer that the application should register with atexit (BA_OS).
So now we know that %rbp must be zeroed, and that %rsp points to the top of the stack. We don't need to worry about %rdx.
If you don't understand how the stack works, it's just a chunk of memory where data is added and removed at the same end. This is done through a push and a pop.
In AMD64's convention, we're actually prepending and removing data at the beginning of the memory sequence, since the stack is said to "grow downwards", which means that when we push something onto the stack, the stack pointer gets lower.
The ABI also expects the stack pointer to be 16-byte aligned whenever we call a function, so we must remember to always push data whose total size is a multiple of 16. For example, two 64-bit integers are 16 bytes. It's often necessary to either push useless data or simply align the stack pointer when the pushed values don't happen to be aligned.
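Aligning the stack pointer downwards amounts to clearing its four lowest bits. The sketch below shows that arithmetic in C with two equivalent masks. The function names are made up for this example, and uintptr is assumed to be pointer-sized, as unsigned long is on Linux.

```c
typedef unsigned long int uintptr; /* pointer-sized unsigned integer on Linux */

/* Round an address down to the nearest multiple of 16 by clearing
 * its four lowest bits. */
static uintptr align_down_16(uintptr address)
{
    return address & ~(uintptr) 15;
}

/* The same mask written as a negative number: in two's complement,
 * -16 has every bit set except the four lowest. */
static uintptr align_down_16_negative(uintptr address)
{
    return address & (uintptr) -16;
}
```

Because the stack grows downwards, rounding down is the safe direction: it can only skip over unused space and never moves the pointer back into data that was already pushed.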
To put it all together, our _start function needs to do the following.
- Zero %rbp.
- Put argc into %rdi (first parameter for main).
- Put the stack address of argv[0] into %rsi (second parameter for main), which will be interpreted as an array of char pointers.
- Align the stack to 16 bytes.
- Call main.
So, let's do that.
- Our new hello.s should look something like this.
.global _start, syscall5 /* Export _start and syscall5 to other compilation units. */
.text /* Place the following code in the .text section, which holds executable code. */
_start:
xor %rbp, %rbp /* XOR-ing a value with itself sets it to 0. */
pop %rdi /* %rdi = argc, adds 8 to %rsp as well. */
mov %rsp, %rsi /* %rsi = argv, the rest of the stack is an array of char pointers. */
/**
* Zero the last four bits of %rsp, aligning it down to 16 bytes. This is
* the same as masking with 0xFFFFFFFFFFFFFFF0, because in two's
* complement -16 is represented as 2^64 - 16.
*/
and $-16, %rsp
call main
ret
syscall5:
mov %rdi, %rax
mov %rsi, %rdi
mov %rdx, %rsi
mov %rcx, %rdx
mov %r8, %r10
mov %r9, %r8
syscall
ret
Unfortunately, this program throws a segmentation fault when it exits.
$ gcc -s -O2 -nostdlib hello.s hello.c
$ ./a.out
Hello, World
Segmentation fault
But why?
When we execute a call instruction, the return address (the address of the instruction to jump to after a function returns) is pushed onto the stack implicitly, and the ret instruction implicitly pops it and jumps to it.
The _start procedure is very special: nothing calls it, the kernel simply jumps to it, so there is no return address on the stack for it. This seems to be our issue. The ret instruction in _start pops whatever happens to be at the top of the stack and jumps to it as if it were _start's return address. That is a memory address that doesn't exist, or doesn't contain code relevant to our program, which triggers an access violation.
We need to tell the OS to kill our process so that we never reach the ret in _start. The _exit() syscall is just what we need:
- First, let's look at its man page.
$ man 2 _exit
NAME
_exit, _Exit - terminate the calling process
SYNOPSIS
#include <unistd.h>
noreturn void _exit(int status);
#include <stdlib.h>
noreturn void _Exit(int status);
- Now, let's use the preprocessor to locate the syscall number.
$ printf "#include <sys/syscall.h>\n SYS_exit" | gcc -E - | sed "/^#.*/d"
60
The status code will simply be the return value of main, which is stored in %rax as we know.
With this information, let's write a new hello.s.
.global _start, syscall5 /* Export _start and syscall5 to other compilation units. */
.text /* Place the following code in the .text section, which holds executable code. */
_start:
xor %rbp, %rbp /* XOR-ing a register with itself sets it to 0. */
pop %rdi /* %rdi = argc, adds 8 to %rsp as well. */
mov %rsp, %rsi /* %rsi = argv, the rest of the stack is an array of char pointers. */
/**
* Zero the last four bits of %rsp, aligning it down to 16 bytes. This is
* the same as masking with 0xFFFFFFFFFFFFFFF0, because in two's
* complement -16 is represented as 2^64 - 16.
*/
and $-16, %rsp /* Written in decimal to make the negative value obvious. */
call main
/**
* Our new syscall to SYS_exit.
*/
mov %rax, %rdi /* syscall param 1 = %rax (ret value of main) */
mov $0x3C, %rax /* 0x3C -> 60 in decimal, syscall for SYS_exit. */
syscall
ret /* This should now never be reached. */
syscall5:
mov %rdi, %rax
mov %rsi, %rdi
mov %rdx, %rsi
mov %rcx, %rdx
mov %r8, %r10
mov %r9, %r8
syscall
ret
Our program seems to finally terminate correctly!
$ gcc -s -O2 -nostdlib hello.s hello.c
$ ./a.out
Hello, World
We can shrink our executable further by removing the unneeded unwind tables, which we can do with the following flags.
$ gcc -s -O2 -nostdlib -fno-unwind-tables -fno-asynchronous-unwind-tables hello.s hello.c