WALDEMAR KOZACZUK edited this page Apr 11, 2023 · 14 revisions

Introduction

By design, most applications running on OSv do not execute system calls when they call libc functions. For example, a call to mmap() is a direct local function call, resolved by the OSv dynamic linker, that involves very few instructions and is therefore very fast. On Linux, the same call is much more expensive: it goes through a wrapper function in glibc, which then invokes the SYS_mmap system call, and that involves a CPU ring switch and a virtual address space switch, among other things. This OSv optimization may not be as relevant as one would hope, especially when applications make few mmap() calls, as is often the case, but that is a topic for another story.

Some applications, such as Golang programs or statically linked executables (see this for more details), bypass the libc layer and invoke system calls directly using the SYSCALL (x86_64) or SVC (aarch64) instruction. To support those, OSv implements the system call handler machinery in assembly for both x86_64 and aarch64.
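As an illustration of what "bypassing the libc layer" means, here is a minimal sketch in C of a program issuing getpid as a raw system call. It assumes GCC or Clang inline assembly and the Linux system call numbers from <sys/syscall.h>; the names raw_syscall0 and check_raw_getpid are hypothetical and do not come from OSv.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid(), syscall() */

/* Hypothetical helper: issue a zero-argument system call without going
 * through a libc wrapper, using the architecture's trap instruction. */
static long raw_syscall0(long nr)
{
    long ret;
#if defined(__x86_64__)
    /* Number in rax, result in rax; SYSCALL clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr)
                      : "rcx", "r11", "memory");
#elif defined(__aarch64__)
    /* Number in x8, result in x0. */
    register long x8 __asm__("x8") = nr;
    register long x0 __asm__("x0");
    __asm__ volatile ("svc #0" : "=r"(x0) : "r"(x8) : "memory");
    ret = x0;
#else
    /* Other architectures: fall back to the generic libc entry point. */
    ret = syscall(nr);
#endif
    return ret;
}

/* The raw path and the libc path should report the same process id. */
static void check_raw_getpid(void)
{
    assert(raw_syscall0(SYS_getpid) == (long)getpid());
}
```

Calling check_raw_getpid() from main() exercises both paths. On Linux the raw instruction traps into the kernel; on OSv it is expected to land in the assembly handler described in this section.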

Unlike Linux, where libc functions like mmap() delegate to the corresponding system calls (SYS_mmap in the example above), in OSv the opposite happens: the system call handler delegates to the corresponding libc function. Just like Linux, OSv handles the SYSCALL and SVC instructions for x86_64 and aarch64 respectively (see syscall_entry in arch/x64/entry.S and handle_system_call in arch/aarch64/entry.S). This tricky low-level assembly code switches to a dedicated system call stack, saves all necessary registers, and delegates to syscall_wrapper() and eventually to the syscall() function, both implemented in linux.cc. The syscall() function has a switch statement that invokes the relevant libc function.
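The dispatch idea described above can be sketched in C as follows. This is not the code in linux.cc: the function name sketch_syscall, its fixed three-argument signature, and the choice of the three handled calls are assumptions made for illustration only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_getpid, SYS_write, SYS_close */
#include <unistd.h>        /* getpid(), write(), close() */

/* Hypothetical dispatcher: map a system call number to the matching
 * libc function, the reverse of what a libc wrapper does on Linux. */
static long sketch_syscall(long number, long a0, long a1, long a2)
{
    switch (number) {
    case SYS_getpid:
        return (long)getpid();
    case SYS_write:
        return (long)write((int)a0, (const void *)a1, (size_t)a2);
    case SYS_close:
        return (long)close((int)a0);
    default:
        /* Unknown or unimplemented system call number. */
        errno = ENOSYS;
        return -1;
    }
}

/* Small self-check: a known number reaches libc, an unknown one fails. */
static void sketch_self_check(void)
{
    assert(sketch_syscall(SYS_getpid, 0, 0, 0) == (long)getpid());
    errno = 0;
    assert(sketch_syscall(-1, 0, 0, 0) == -1);
    assert(errno == ENOSYS);
}
```

In this sketch, supporting another system call amounts to adding one more case that forwards the arguments to an existing libc function.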

Implemented

  • accept4
  • bind
  • clock_getres
  • clock_gettime
  • close
  • connect
  • dup3
  • epoll_create1
  • epoll_ctl
  • epoll_pwait
  • epoll_wait
  • eventfd2
  • exit
  • exit_group
  • fcntl
  • fdatasync
  • flock
  • fstat
  • fstatat
  • fsync
  • ftruncate
  • futex
  • getcwd
  • getdents64
  • getgid
  • get_mempolicy
  • getpeername
  • getpid
  • getrandom
  • getsockname
  • getsockopt
  • gettid
  • getuid
  • ioctl
  • listen
  • lseek
  • madvise
  • mincore
  • mkdir
  • mkdirat
  • mmap
  • munmap
  • nanosleep
  • open
  • openat
  • pipe2
  • pread64
  • pselect6
  • pwrite64
  • read
  • readlinkat
  • recvfrom
  • recvmsg
  • renameat
  • rt_sigaction
  • rt_sigprocmask
  • sched_getaffinity
  • sched_setaffinity
  • sched_yield
  • select
  • sendmsg
  • sendto
  • set_mempolicy
  • setsockopt
  • sigaltstack
  • socket
  • stat
  • statfs
  • symlinkat
  • tgkill
  • uname
  • unlinkat
  • write

Trivial to Implement

  • accept
  • access
  • alarm
  • chdir
  • creat
  • dup
  • dup2
  • epoll_create
  • eventfd
  • fallocate
  • faccessat
  • fchdir
  • fstatfs
  • futimesat
  • getitimer
  • getpriority
  • getrlimit
  • getrusage
  • gettimeofday
  • kill
  • lstat
  • mprotect
  • msync
  • pause
  • pipe
  • poll
  • ppoll
  • prctl
  • readlink
  • readv
  • rename
  • rmdir
  • sched_get_priority_max
  • sched_get_priority_min
  • sendfile
  • sethostname
  • setitimer
  • setpriority
  • setrlimit
  • shmget
  • shmat
  • shmctl
  • shmdt
  • shutdown
  • socketpair
  • symlink
  • sync
  • sysinfo
  • time
  • timerfd_create
  • timerfd_gettime
  • timerfd_settime
  • times
  • truncate
  • umask
  • unlink
  • utime
  • utimensat
  • utimes
  • writev