Multicore & Locks
So far, we have been running the operating system and user applications on a single CPU core, even though multicore CPUs are now ubiquitous. The CPU in a laptop or mobile phone typically has 4 to 16 cores, and the CPU in a datacenter server can have 64 or more. Multicore improves software performance because multiple processes can run on different cores at the same time, so multiple software tasks can make progress in parallel.
Dealing with multicore involves an important concept: mutual exclusion. In general, mutual exclusion prevents multiple CPU cores from accessing a shared resource at the same time. For example, when kernel code is running on a core and updating some data structures (e.g., the PCB), it needs exclusive access to those data structures, preventing other cores from updating them simultaneously. Otherwise, simultaneous updates could corrupt these kernel data structures. We thus start by introducing mutual exclusion.
Mutual exclusion
When you run egos-2000 on QEMU, you may have noticed that the first line printed is one of the four possibilities listed below. In other words, a core is randomly selected from #1..#4 to run egos-2000.
[CRITICAL] --- Booting on QEMU with core #1 ---
[CRITICAL] --- Booting on QEMU with core #2 ---
[CRITICAL] --- Booting on QEMU with core #3 ---
[CRITICAL] --- Booting on QEMU with core #4 ---
This is a typical example of mutual exclusion. Specifically, the public egos-2000 code only supports a single core, so the booting code (a.k.a. the boot loader) prevents the other 3 cores from running egos-2000 after egos-2000 has already started to run on one core. Such mutual exclusion requires a new class of CPU instructions called atomic memory operations, which are an extension to the basic RISC-V instruction set architecture.
Atomic memory operation
The assembly code below is from earth/boot.s, showing the first instructions that all the CPU cores will execute when running egos-2000.
boot_loader:
    la t0, boot_lock            # t0 = address of boot_lock
    li t1, 1                    # t1 = 1
    amoswap.w.aq t1, t1, (t0)   # atomically swap t1 with the word at (t0)
    bnez t1, boot_loader        # old value was 1: lock is held, so retry
    li sp, 0x80400000           # lock acquired: set up the kernel stack
    call boot

.bss
boot_lock:       .word 0        # the lock variable, initialized to 0
booted_core_cnt: .word 0        # number of cores that have booted
The boot_lock: .word 0 line in the .bss section declares a 4-byte variable named boot_lock, initialized to 0 before any core executes its first instruction (i.e., the la instruction in boot_loader). The first two instructions load the address of boot_lock into register t0 and the value 1 into register t1. The magic happens at the amoswap.w.aq instruction, where amo stands for atomic memory operation. It atomically swaps the value of t1 with the 4 bytes at memory address t0. Atomicity here means that exactly one of the CPU cores will swap back the initial value 0 of boot_lock, after which all the other cores will swap back the value 1.

Therefore, the first core that completes amoswap.w.aq will proceed, while the other cores spin in a loop at the bnez instruction as long as boot_lock continues to hold the value 1.
TIP
The magic of amoswap.w.aq is enforced by the CPU hardware design. The CPU hardware decides which core should be the first when multiple cores try to execute this instruction at the same time. This is part of the memory coherence problem in computer architecture.
We say that the code above acquires the boot lock, and one can release this lock by writing the value 0 to the address of boot_lock. After releasing the boot lock, another core would be able to acquire it and proceed with the code logic after the bnez instruction.
In general, a lock is simply a 4-byte variable in memory holding either 0 or 1. Given a lock variable x in C, we have provided two macros for you in library/egos.h.
#define release(x) __sync_lock_release(&x);
#define acquire(x) while (__sync_lock_test_and_set(&x, 1) != 0);
/* The __sync_lock_* functions are defined within the C compiler. */
While you still need to write some assembly code in this project, you can use these macros whenever you need to acquire or release a lock in your C code.
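For example, here is a minimal sketch of how these macros can protect a shared counter; the names counter_lock, counter and increment are hypothetical, used only for illustration.

int counter_lock = 0;   /* a lock is just a 4-byte variable, initially 0 */
int counter = 0;        /* shared data protected by counter_lock */

void increment() {
    acquire(counter_lock);   /* spin until the swapped-out old value is 0 */
    counter++;               /* only one core can execute this at a time */
    release(counter_lock);   /* write 0 so another core can acquire the lock */
}

Let's now release the boot lock and see what will happen.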
Release the boot lock
After acquiring boot_lock, the first booted core sets the stack pointer to 0x80400000 and calls the boot function in earth/boot.c. Because the variable booted_core_cnt is initially 0, this core enters the if branch in earth/boot.c and calls the 4 initialization functions: tty_init, disk_init, mmu_init and intr_init. These functions initialize the TTY and disk devices as well as the CSRs related to virtual memory and interrupts. Finally, boot() calls the grass_entry function defined in grass/init.c.
The grass_entry function loads the binary executable of apps/system/sys_proc.c into memory as the first process, and starts running this process after the mret instruction. Note that after this mret, the first process sets the stack pointer to 0x80800000, as shown in apps/app.s. This means that the first booted core has finished using the kernel stack, so we can safely release the boot lock and let the next booted core use the kernel stack.
TIP
Think about why the stack memory would be corrupted if multiple CPU cores use the same stack at the same time, and what undesirable consequences could follow. For example, two cores pushing stack frames at the same stack pointer would overwrite each other's return addresses and local variables.
We thus ask you to release the boot lock at the start of apps/system/sys_proc.c. Because grass_entry() has passed the address of boot_lock to this first process as the second argument of its main function, you can access boot_lock through the argument boot.
> make qemu
...
[CRITICAL] --- Booting on QEMU with core #1 ---
...
[SUCCESS] Enter kernel process GPID_PROCESS
[SUCCESS] --- Core #4[INFO] Load kern el process #2: sys_tserminal
tarts running ---
[INFO] Load 0xde4 bytes to 0x80400000
[INFO] Load 0x12c bytes to 0x80408000
[SUCCESS] Enter kernel process GPID_TERMINAL
...
Essentially, the first process prints Enter kernel process GPID_PROCESS normally, but the printing of [INFO] Load kernel process #2: sys_terminal is mixed with the printing of [SUCCESS] --- Core #4 starts running --- defined in earth/boot.c. This is indeed the result of core #1 running the first process while core #4 runs boot() in parallel.
Complete the boot loader
Start with a fresh copy of egos-2000 and incorporate your solution to P4, virtual memory.
Your first task is to complete the code in apps/system/sys_proc.c and earth/boot.c so that all 4 cores can finish booting. Specifically, the 3 cores booting after the first one should set up their CSRs for interrupts and virtual memory, but they should not initialize the disk or TTY devices again. In particular, multicore requires page table translation, so you can simply call FATAL in your code if SOFT_TLB has been chosen as the translation mechanism.
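For reference, here is one possible shape for boot() in earth/boot.c. This is only a sketch under several assumptions: the core still holds boot_lock when boot() runs (so incrementing booted_core_cnt is safe), mmu_init and intr_init can be reused as-is to set up the later cores' CSRs, SUCCESS is the logging macro behind the [SUCCESS] lines, and the hart ID read from mhartid maps directly to the core numbers printed.

void boot() {
    int core_id;
    asm volatile("csrr %0, mhartid" : "=r"(core_id));  /* this core's hart ID */
    if (booted_core_cnt++ == 0) {
        /* First booted core: initialize devices and CSRs, enter the kernel. */
        tty_init(); disk_init(); mmu_init(); intr_init();
        grass_entry();
    } else {
        /* Later cores: per-core CSRs only; devices are already initialized. */
        mmu_init(); intr_init();
        SUCCESS("--- Core #%d starts running ---", core_id);
        /* ... continue into running user processes ... */
    }
}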
The goal is to see [SUCCESS] --- Core #? starts running --- from all of the other 3 cores, so we know that all 4 cores have booted. However, you will likely hit exceptions and FATAL errors in excp_entry() in the kernel, since we have not yet protected the kernel with locks.
Kernel lock
As mentioned above, if multiple CPU cores use the kernel stack or update the kernel data structures simultaneously, these memory regions would be corrupted. Therefore, we have defined kernel_lock in grass/kernel.s, which should ensure that, at any time, only one core can access the kernel stack or kernel data structures. Your next task is to acquire and release this kernel lock, protecting the kernel.
Protect the kernel
Recall the trap_entry defined in grass/kernel.s, first introduced in P2. This is the entry point of all interrupts and exceptions, which trap a CPU core into the kernel. In other words, if multiple cores get trapped at the same time, they will all execute the instructions in trap_entry and thus use the kernel stack at 0x80400000.

Intuitively, you will need to acquire the kernel lock before switching to the kernel stack, and release the kernel lock right before the mret in trap_entry. There is one key difference from how this is done in earth/boot.s. Recall that trap_entry saves all the registers on the kernel stack before calling kernel_entry, and restores all the registers before mret. This requires that your code for acquiring or releasing the lock does not modify the value of any register. This can be achieved by using the so-called scratch CSRs such as mscratch and sscratch. For example, to keep the value of the sp register, trap_entry has already been using the mscratch CSR to record the old value of sp.
After your modifications to grass/kernel.s, there is only one more thing to do, in grass/kernel.c. When we have only one core, that core always has some process to run, because the process GPID_TERMINAL should always be runnable. Given multiple cores, it is possible that a core has to be idle, i.e., the scheduler cannot find a runnable process for it. As a result, we need an idle core to do nothing until it receives the next timer interrupt. You will implement this in the CORE_IDLE part of proc_yield.
TIP
We use a single lock to protect the whole kernel because it is simple. Many operating systems instead give each CPU core its own dedicated kernel stack, so that kernel stacks need not be protected by locks, and use different locks to protect different kernel data structures.
Run processes in parallel
Now, let us run multiple processes on different CPU cores and see whether our kernel can indeed handle multicore. Specifically, you will implement the proc_coresinfo function in grass/kernel.c and add it to struct grass, so that the shell can call this function for the coresinfo built-in command (see coresinfo in apps/system/sys_shell.c). After you finish, run multiple processes in the background and try coresinfo as follows.
> make qemu
...
[CRITICAL] Welcome to the egos-2000 shell!
➜ /home/yunhao clock --silent &
[INFO] process 6 running in the background
➜ /home/yunhao clock --silent &
[INFO] process 7 running in the background
➜ /home/yunhao clock --silent &
[INFO] process 8 running in the background
➜ /home/yunhao clock --silent &
[INFO] process 9 running in the background
➜ /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=6
[INFO] Core #2 is running pid=4 (GPID_SHELL)
[INFO] Core #3 is running pid=7
[INFO] Core #4 is running pid=9
➜ /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=8
[INFO] Core #2 is running pid=9
[INFO] Core #3 is running pid=4 (GPID_SHELL)
[INFO] Core #4 is running pid=7
...
In this demo, we started 4 processes in the background and, if we run coresinfo multiple times, different cores would be running different processes. In the first coresinfo above, processes #6, #7 and #9 were running, while #8 was not running (but runnable). You could also try the killall built-in command, after which all 4 background processes will terminate. Here is a video with more details of this demo.
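For reference, here is a minimal sketch of what proc_coresinfo could look like. The array curr_pid and the constant NCORES are hypothetical names, since how you record which process runs on which core is up to your implementation; INFO is the logging macro behind the [INFO] lines above.

#define NCORES 4
int curr_pid[NCORES + 1];  /* curr_pid[i] is the pid running on core #i */

void proc_coresinfo() {
    INFO("==============Core ID / Process ID==============");
    for (int i = 1; i <= NCORES; i++)
        INFO("Core #%d is running pid=%d", i, curr_pid[i]);
}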
Find concurrency bugs
Being able to run a demo does not mean that your code is free of bugs. Concurrency bugs are bugs that surface only when multiple programs run concurrently. In this project, a concurrency bug could be programs running on different cores modifying the kernel data structures at the same time. Although we have protected the kernel with the kernel lock, concurrency bugs still exist. See whether you can find and fix them.
In general, concurrency bugs are difficult to find and debug because they typically happen neither deterministically nor often. Given that egos-2000 has a small code base and uses simple data structures, a good way to remove concurrency bugs is to reason carefully about the use of every data structure.
TIP
If you wish to learn more about parallel programming in operating systems, this book is a fun read: Is Parallel Programming Hard, And, If So, What Can You Do About It? Also, if you run this project on the Arty board, you will need to add a few nop instructions after every amoswap instruction. The reason is that the CPU design for the Arty board has a flaw: it cannot complete the amoswap instruction within one CPU cycle. By adding a few nop instructions, the CPU core waits for amoswap to complete before proceeding into the critical section.
Accomplishments
In this project, you have gained hands-on experience with atomic memory operations, an extension to the RISC-V instruction set for multicore. You have seen how multiple cores can execute instructions (e.g., printing) in parallel, and how to make the egos-2000 kernel support multicore using a kernel lock. Lastly, you have seen that concurrency bugs are not obvious because they happen neither often nor deterministically.