Multicore & Locks
So far, we have been running the operating system and user applications on a single CPU core, even though multicore CPUs are now ubiquitous. The CPU in a laptop or mobile phone typically has 4 to 16 cores, and the CPU in a datacenter server can have 64 or more. Multicore improves software performance because multiple processes can run on different cores at the same time, so multiple software tasks can make progress in parallel.
Dealing with multicore involves an important concept: mutual exclusion. In general, mutual exclusion prevents multiple CPU cores from accessing a shared resource at the same time. For example, when kernel code is running on a core and updating some data structures (e.g., the PCB), it needs exclusive access to those data structures, preventing other cores from updating them simultaneously. Otherwise, simultaneous updates could corrupt these kernel data structures. We thus start by introducing mutual exclusion.
Mutual exclusion
When you run egos-2000 on QEMU, you may have noticed that the first line printed is one of the four possibilities listed below. In other words, a core is randomly selected from #1..#4 to run egos-2000.
[CRITICAL] --- Booting on QEMU with core #1 ---
[CRITICAL] --- Booting on QEMU with core #2 ---
[CRITICAL] --- Booting on QEMU with core #3 ---
[CRITICAL] --- Booting on QEMU with core #4 ---
This is a typical example of mutual exclusion. Specifically, the public egos-2000 code only supports a single core, so the booting code (a.k.a. the boot loader) prevents the other 3 cores from running egos-2000 after egos-2000 has already started to run on one core. Such mutual exclusion requires a new class of CPU instructions called atomic memory operations, which are an extension to the basic RISC-V instruction set architecture.
Atomic memory operation
The assembly code below is from earth/boot.s, showing the first instructions that all the CPU cores will execute when running egos-2000.
boot_loader:
    la t0, boot_lock            # t0 = address of boot_lock
    li t1, 1                    # t1 = 1
    amoswap.w.aq t1, t1, (t0)   # atomically swap t1 with the word at (t0)
    bnez t1, boot_loader        # old value was 1: lock is held, so retry
    li sp, 0x80400000           # lock acquired: set up the kernel stack
    call boot

.bss
boot_lock:       .word 0        # the lock variable, initialized to 0
booted_core_cnt: .word 0        # number of cores that have booted
The boot_lock: .word 0 line in the .bss section declares a 4-byte variable named boot_lock, initialized to 0 before any core executes its first instruction (i.e., the la instruction in boot_loader). The first two instructions load the address of boot_lock into register t0 and the value 1 into register t1. The magic happens at the amoswap.w.aq instruction, where amo stands for atomic memory operation. It atomically swaps the value of t1 with the 4 bytes at memory address t0. Atomicity here means that exactly one of the CPU cores will swap back the initial value 0 of boot_lock, after which all the other cores will swap back the value 1.

Therefore, the first core that completes amoswap.w.aq will proceed, while the other cores spin in a loop at the bnez instruction as long as boot_lock continues to hold the value 1.
TIP
The magic of amoswap.w.aq is enforced by the CPU hardware design. The CPU hardware decides which core should be the first when multiple cores try to execute this instruction at the same time. This is part of the memory coherence problem in computer architecture.
We say that the code above acquires the boot lock, and one can release this lock by writing the value 0 to the address of boot_lock. After releasing the boot lock, another core would be able to acquire it and proceed with the code logic after the bnez instruction.
In general, a lock is simply a 4-byte variable in memory holding either 0 or 1. Given a lock variable x in C, we have provided two macros for you in library/egos.h.
#define release(x) __sync_lock_release(&x);
#define acquire(x) while (__sync_lock_test_and_set(&x, 1) != 0);
/* The __sync_lock_* functions are defined within the C compiler. */
While you still need to write some assembly code in this project, you can use these macros whenever you need to acquire or release a lock in your C code.
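For example, here is a minimal sketch of how these macros can protect a shared counter; the names counter_lock, counter and increment are hypothetical, used only for illustration.

int counter_lock = 0;   /* a lock is just a 4-byte variable, initially 0 */
int counter = 0;        /* shared data protected by counter_lock */

void increment() {
    acquire(counter_lock);   /* spin until the swapped-out old value is 0 */
    counter++;               /* only one core can execute this at a time */
    release(counter_lock);   /* write 0 so another core can acquire the lock */
}

Let's now release the boot lock and see what will happen.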
Release the boot lock
After acquiring boot_lock, the first booted core sets the stack pointer to 0x80400000 and calls the boot function in earth/boot.c. Because the variable booted_core_cnt is initially 0, this core enters the if branch in earth/boot.c and calls the 4 initialization functions: tty_init, disk_init, mmu_init and intr_init. These functions initialize the TTY and disk devices as well as the CSRs related to virtual memory and interrupts. Finally, boot() calls the grass_entry function defined in grass/init.c.
The grass_entry function loads the binary executable of apps/system/sys_proc.c into memory as the first process, and starts running this process after the mret instruction. Note that after this mret, the first process sets the stack pointer to 0x80800000, as shown in apps/app.s. This means that the first booted core has finished using the kernel stack, so we can safely release the boot lock and let the next booted core use the kernel stack.
TIP
Think about why the stack memory would be corrupted if multiple CPU cores use the same stack at the same time, and what undesirable consequences could follow. For example, two cores pushing stack frames at the same stack pointer would overwrite each other's return addresses and local variables.
We thus ask you to release the boot lock at the start of apps/system/sys_proc.c. Because grass_entry() has passed the address of boot_lock to this first process as the second argument of its main function, you can access boot_lock through the argument boot.
> make qemu
...
[CRITICAL] --- Booting on QEMU with core #1 ---
...
[SUCCESS] Enter kernel process GPID_PROCESS
[SUCCESS] --- Core #4[INFO] Load kern el process #2: sys_tserminal
tarts running ---
[INFO] Load 0xde4 bytes to 0x80400000
[INFO] Load 0x12c bytes to 0x80408000
[SUCCESS] Enter kernel process GPID_TERMINAL
...
Essentially, the first process prints Enter kernel process GPID_PROCESS normally, but the printing of [INFO] Load kernel process #2: sys_terminal is mixed with the printing of [SUCCESS] --- Core #4 starts running --- defined in earth/boot.c. This is indeed the result of core #1 running the first process while core #4 runs boot() in parallel.
Complete the boot loader
Start with a fresh copy of egos-2000 and incorporate your solution to P4, virtual memory.
Your first task is to complete the code in apps/system/sys_proc.c and earth/boot.c so that all 4 cores can finish booting. Specifically, the 3 cores booting after the first one should set up their CSRs for interrupts and virtual memory, but they should not initialize the disk or TTY devices again. In particular, multicore requires page table translation, so you can simply call FATAL in your code if SOFT_TLB has been chosen as the translation mechanism.
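For reference, here is one possible shape for boot() in earth/boot.c. This is only a sketch under several assumptions: the core still holds boot_lock when boot() runs (so incrementing booted_core_cnt is safe), mmu_init and intr_init can be reused as-is to set up the later cores' CSRs, SUCCESS is the logging macro behind the [SUCCESS] lines, and the hart ID read from mhartid maps directly to the core numbers printed.

void boot() {
    int core_id;
    asm volatile("csrr %0, mhartid" : "=r"(core_id));  /* this core's hart ID */
    if (booted_core_cnt++ == 0) {
        /* First booted core: initialize devices and CSRs, enter the kernel. */
        tty_init(); disk_init(); mmu_init(); intr_init();
        grass_entry();
    } else {
        /* Later cores: per-core CSRs only; devices are already initialized. */
        mmu_init(); intr_init();
        SUCCESS("--- Core #%d starts running ---", core_id);
        /* ... continue into running user processes ... */
    }
}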
The goal is to see [SUCCESS] --- Core #? starts running --- from all of the other 3 cores, so we know that all 4 cores have booted. However, you will likely hit exceptions and FATAL errors in excp_entry() in the kernel, since we have not yet protected the kernel with locks.
Kernel lock
As mentioned above, if multiple CPU cores use the kernel stack or update the kernel data structures simultaneously, these memory regions would be corrupted. Therefore, we have defined kernel_lock in grass/kernel.s, which should ensure that, at any time, only one core can access the kernel stack or kernel data structures. Your next task is to acquire and release this kernel lock, protecting the kernel.
Protect the kernel
Recall the trap_entry defined in grass/kernel.s, first introduced in P2. This is the entry point of all interrupts and exceptions, which trap a CPU core into the kernel. In other words, if multiple cores get trapped at the same time, they will all execute the instructions in trap_entry and thus use the kernel stack at 0x80400000.

Intuitively, you will need to acquire the kernel lock before switching to the kernel stack, and release the kernel lock right before the mret in trap_entry. There is one key difference from how this is done in earth/boot.s. Recall that trap_entry saves all the registers on the kernel stack before calling kernel_entry, and restores all the registers before mret. This requires that your code for acquiring or releasing the lock does not modify the value of any register. This can be achieved by using the so-called scratch CSRs such as mscratch and sscratch. For example, to keep the value of the sp register, trap_entry has already been using the mscratch CSR to record the old value of sp.
After your modifications to grass/kernel.s, there is only one more thing to do, in grass/kernel.c. When we have only one core, that core always has some process to run, because the process GPID_TERMINAL should always be runnable. Given multiple cores, it is possible that a core has to be idle, i.e., the scheduler cannot find a runnable process for it. As a result, we need an idle core to do nothing until it receives the next timer interrupt. You will implement this in the CORE_IDLE part of proc_yield.
TIP
We use a single lock to protect the whole kernel because it is simple. Many operating systems instead give each CPU core its own dedicated kernel stack, so that kernel stacks need not be protected by locks, and use different locks to protect different kernel data structures.
Run processes in parallel
Now, let us run multiple processes on different CPU cores and see whether our kernel can indeed handle multicore. Specifically, you will implement the proc_coresinfo function in grass/kernel.c and add it to struct grass, so that the shell can call this function for the coresinfo built-in command (see coresinfo in apps/system/sys_shell.c). After you finish, run multiple processes in the background and try coresinfo as follows.
> make qemu
...
[CRITICAL] Welcome to the egos-2000 shell!
➜ /home/yunhao clock --silent &
[INFO] process 6 running in the background
➜ /home/yunhao clock --silent &
[INFO] process 7 running in the background
➜ /home/yunhao clock --silent &
[INFO] process 8 running in the background
➜ /home/yunhao clock --silent &
[INFO] process 9 running in the background
➜ /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=6
[INFO] Core #2 is running pid=4 (GPID_SHELL)
[INFO] Core #3 is running pid=7
[INFO] Core #4 is running pid=9
➜ /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=8
[INFO] Core #2 is running pid=9
[INFO] Core #3 is running pid=4 (GPID_SHELL)
[INFO] Core #4 is running pid=7
...
In this demo, we started 4 processes in the background and, if we run coresinfo multiple times, different cores would be running different processes. In the first coresinfo above, processes #6, #7 and #9 were running, while #8 was not running (but runnable). You could also try the killall built-in command, after which all 4 background processes will terminate. Here is a video with more details of this demo.
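For reference, here is a minimal sketch of what proc_coresinfo could look like. The array curr_pid and the constant NCORES are hypothetical names, since how you record which process runs on which core is up to your implementation; INFO is the logging macro behind the [INFO] lines above.

#define NCORES 4
int curr_pid[NCORES + 1];  /* curr_pid[i] is the pid running on core #i */

void proc_coresinfo() {
    INFO("==============Core ID / Process ID==============");
    for (int i = 1; i <= NCORES; i++)
        INFO("Core #%d is running pid=%d", i, curr_pid[i]);
}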
Find concurrency bugs
Being able to run a demo does not mean that your code is free of bugs. Concurrency bugs are bugs that surface only when multiple programs run concurrently. In this project, a concurrency bug could be programs running on different cores modifying the kernel data structures at the same time. Although we have protected the kernel with the kernel lock, concurrency bugs still exist. See whether you can find and fix them.
In general, concurrency bugs are difficult to find and debug because they typically happen neither deterministically nor often. Given that egos-2000 has a small code base and uses simple data structures, a good way to remove concurrency bugs is to reason carefully about the use of every data structure.
TIP
If you wish to learn more about parallel programming in operating systems, this book is a fun read: Is Parallel Programming Hard, And, If So, What Can You Do About It? Also, if you run this project on the Arty board, you will need to add a few nop instructions after every amoswap instruction. The reason is that the CPU design for the Arty board has a flaw: it cannot complete the amoswap instruction within one CPU cycle. By adding a few nop instructions, the CPU core waits for amoswap to complete before proceeding into the critical section.
Accomplishments
In this project, you have gained hands-on experience with atomic memory operations, an extension to the RISC-V instruction set for multicore. You have seen how multiple cores can execute instructions (e.g., printing) in parallel, and how to make the egos-2000 kernel support multicore using a kernel lock. Lastly, you have seen that concurrency bugs are not obvious because they happen neither often nor deterministically.