
Multicore & Locks

So far, we have been running the operating system and user applications on a single CPU core, even though multicore CPUs are now ubiquitous. The CPU in a laptop or a mobile phone typically has 4 to 16 cores, and a server CPU in a datacenter can have 64 or more. Multicore improves software performance because multiple processes can run on different cores at the same time, so multiple software tasks can make progress in parallel.

Dealing with multicore involves an important concept: mutual exclusion. In general, mutual exclusion prevents multiple CPU cores from accessing a shared resource at the same time. For example, when kernel code is running on a core and updating some data structure (e.g., the PCB), it may need exclusive access to that data structure, preventing other cores from updating it simultaneously. Otherwise, these kernel data structures could be corrupted by concurrent updates. We thus start by introducing mutual exclusion.

Mutual exclusion

When you run egos-2000 on QEMU, you may have noticed that the first line printed is one of the four possibilities listed below. In other words, each time you run egos-2000 on QEMU, a core is randomly selected from #1..#4 to boot it.

[CRITICAL] --- Booting on QEMU with core #1 ---
[CRITICAL] --- Booting on QEMU with core #2 ---
[CRITICAL] --- Booting on QEMU with core #3 ---
[CRITICAL] --- Booting on QEMU with core #4 ---

This is a typical example of mutual exclusion. Specifically, the public egos-2000 code only supports a single core, so the booting code (i.e., the boot loader) prevents the other 3 cores from running egos-2000 after it has already started to run on one core. Such mutual exclusion requires a new class of CPU instructions called atomic memory operations, which are an extension to the basic RISC-V instruction set architecture.

Atomic memory operation

The assembly code below is from earth/boot.s; it shows the first instructions that every CPU core executes when running egos-2000.

asm
boot_loader:
    la t0, boot_lock
    li t1, 1
    amoswap.w.aq t1, t1, (t0)
    bnez t1, boot_loader
    li sp, 0x80400000
    call boot
.bss
    boot_lock:       .word 0
    booted_core_cnt: .word 0

The boot_lock: .word 0 line in the .bss section declares a 4-byte variable named boot_lock that is initialized to 0 before any core executes its first instruction (i.e., the la instruction in boot_loader). The first two instructions load the address of boot_lock into register t0 and the value 1 into register t1. The magic happens at the amoswap.w.aq instruction, where amo stands for atomic memory operation. It atomically swaps the value of t1 with the 4 bytes at the memory address held in t0. Atomicity here means that exactly one CPU core will swap back the initial value 0 of boot_lock; every other core will swap back the value 1.
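
Conceptually, the instruction behaves like the C function below executed as one indivisible step. This is only an illustration of the semantics, not code you would write in egos-2000:

c
/* What "amoswap.w.aq t1, t1, (t0)" does, written as C for illustration only:
 * the load and the store below happen as a single indivisible step in hardware. */
int atomic_swap(int* addr, int new_value) {
    int old_value = *addr;    /* read the current value of boot_lock      */
    *addr = new_value;        /* store the new value (1 in boot_loader)   */
    return old_value;         /* the old value ends up back in t1         */
}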

Therefore, the first core that completes amoswap.w.aq proceeds, while the other cores spin in a loop: the bnez instruction branches back to boot_loader as long as boot_lock continues to hold the value 1.

TIP

The magic of amoswap.w.aq is enforced by the CPU hardware design. The CPU hardware decides which core should be the first when multiple cores try to execute this instruction at the same time. This is part of the memory coherence problem in computer architecture.

We say that the code above acquires the boot lock; one can release this lock by writing the value 0 to the address of boot_lock. After the lock is released, another core can acquire it and proceed with the code after the bnez instruction.

In general, a lock is simply a 4-byte variable in memory holding either 0 or 1. Given a lock variable x in C, we have provided two macros for you in library/egos.h.

c
#define release(x) __sync_lock_release(&x);
#define acquire(x) while (__sync_lock_test_and_set(&x, 1) != 0);
/* The __sync_lock_* functions are built-in functions provided by the C compiler. */
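
For example, these macros can protect a shared variable as follows (a minimal sketch; shared_counter, counter_lock, and increment are made-up names for illustration):

c
#include "egos.h"          /* for the acquire() and release() macros      */

int counter_lock;          /* the lock: 0 means free, 1 means held        */
int shared_counter;        /* a hypothetical variable shared across cores */

void increment() {
    acquire(counter_lock); /* spin until this core owns the lock          */
    shared_counter++;      /* only one core executes this at a time       */
    release(counter_lock); /* let another core acquire the lock           */
}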

While you still need to write some assembly code in this project, you can use these macros whenever you need to acquire or release a lock in your C code. Let's now release the boot lock and see what will happen.

Release the boot lock

After acquiring boot_lock, the first booted core sets the stack pointer to 0x80400000 and calls the boot function in earth/boot.c. Because the variable booted_core_cnt is initially 0, this core enters the if branch in earth/boot.c and calls the 4 initialization functions: tty_init, disk_init, mmu_init and intr_init. These functions initialize the TTY and disk devices as well as the CSRs related to virtual memory and interrupts. Finally, boot() calls the grass_entry function defined in grass/init.c.

The grass_entry function loads the binary executable of apps/system/sys_proc.c into memory as the first process and starts running this process after the mret instruction. Note that after this mret, the first process sets its stack pointer to 0x80800000, as shown in apps/app.s. This means that the first booted core has finished using the kernel stack, so we can safely release the boot lock and let the next booted core use the kernel stack.

TIP

Think about why the stack memory would be corrupted if multiple CPU cores use the same stack at the same time, and what undesirable consequences could happen.

We thus ask you to release the boot lock at the start of apps/system/sys_proc.c. Because grass_entry() has passed the address of boot_lock to this first process as the second argument of its main function, you can access boot_lock through the argument boot. Here is one possible output after releasing the boot lock.

shell
> make qemu
...
[CRITICAL] --- Booting on QEMU with core #1 ---
...
[SUCCESS] Enter kernel process GPID_PROCESS
[SUCCESS] --- Core #4[INFO] Load kern el process #2: sys_tserminal
tarts running ---
[INFO] Load 0xde4 bytes to 0x80400000
[INFO] Load 0x12c bytes to 0x80408000
[SUCCESS] Enter kernel process GPID_TERMINAL
...

Essentially, the first process prints Enter kernel process GPID_PROCESS normally, but the printing of [INFO] Load kernel process #2: sys_terminal is interleaved with the printing of [SUCCESS] --- Core #4 starts running --- from earth/boot.c. This is indeed the result of core #1 running the first process while core #4 was running boot() in parallel.
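
For reference, the release itself can be a one-liner at the top of main in sys_proc.c. The sketch below assumes main receives the boot_lock address as its second argument, as described above; the parameter names and types are placeholders:

c
/* apps/system/sys_proc.c -- a sketch; parameter names/types are placeholders,
 * but the second argument is the address of boot_lock as described above.    */
#include "egos.h"                      /* for the release() macro              */

int main(unsigned long unused, unsigned int* boot) {
    release(*boot);   /* write 0 to boot_lock so the next core can boot       */

    /* ... the rest of GPID_PROCESS continues unchanged ...                   */
}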

Complete the boot loader

Start with a fresh copy of egos-2000 and incorporate your solution to P4, virtual memory.

Your first task is to complete the code in apps/system/sys_proc.c and earth/boot.c so that all 4 cores can finish booting. Specifically, the 3 cores that boot after the first one should set up their CSRs for interrupts and virtual memory, but they should not initialize the disk or TTY devices again. In particular, multicore requires page table translation, so you can simply call FATAL in your code if SOFT_TLB has been chosen as the translation mechanism.

The goal is to see [SUCCESS] --- Core #? starts running --- for each of the 3 later cores, so we know that all 4 cores have booted. However, you will likely hit exceptions and FATAL in excp_entry() in the kernel, since we have not yet protected the kernel with locks.
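
One possible shape for boot() in earth/boot.c is sketched below. It assumes that mmu_init and intr_init can be reused for per-core CSR setup, that the init functions and the SUCCESS logging macro are visible via library/egos.h, and that the core id can be read from the mhartid CSR; the actual function signatures and what the later cores do after booting are up to your design:

c
#include "egos.h"                 /* device/CSR init functions and SUCCESS()    */

extern int booted_core_cnt;       /* defined in the .bss section of boot.s      */

void boot() {
    unsigned int core_id;
    asm volatile("csrr %0, mhartid" : "=r"(core_id)); /* which core runs boot()? */

    if (booted_core_cnt == 0) {
        /* First booted core: initialize the devices and this core's CSRs. */
        tty_init();
        disk_init();
        mmu_init();
        intr_init();
        booted_core_cnt++;
        grass_entry();
    } else {
        /* Cores booting later: devices are already up, so only set up this
         * core's CSRs for virtual memory and interrupts.                   */
        mmu_init();
        intr_init();
        booted_core_cnt++;
        SUCCESS("--- Core #%d starts running ---", core_id);
        /* Then let this core start running processes (your design choice). */
    }
}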

Kernel lock

As mentioned above, if multiple CPU cores use the kernel stack or update the kernel data structures simultaneously, these memory regions would be corrupted. Therefore, we have defined kernel_lock in grass/kernel.s which should ensure that, at any time, only one core can access the kernel stack or data structures. Your next task is to acquire and release this kernel lock, protecting the kernel.

Protect the kernel

Recall trap_entry defined in grass/kernel.s, which was first introduced in P2. This is the entry point for all interrupts and exceptions that trap a CPU core into the kernel. In other words, if multiple cores trap at the same time, they will all execute the instructions in trap_entry and thus use the kernel stack at 0x80400000.

Intuitively, you will need to acquire the kernel lock before switching to the kernel stack, and release the kernel lock right before the mret in trap_entry. There is one key difference from how this is done in earth/boot.s. Recall that trap_entry saves all the registers on the kernel stack before calling kernel_entry, and restores all the registers before mret. This requires that your code for acquiring or releasing the lock does not modify the value of any register. This can be achieved with the so-called scratch CSRs such as mscratch and sscratch. For example, to preserve the value of the sp register, trap_entry already uses the mscratch CSR to record the old value of sp.
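
As one possible sketch (not the required solution), the acquire below stashes t0 in sscratch and reuses sp as a second temporary, since the old sp has just been saved in mscratch; the release does the mirror image right before mret. The label acquire_kernel_lock and the exact placement inside your trap_entry are up to you:

asm
trap_entry:
    csrw mscratch, sp             # existing behavior: preserve the old sp
    csrw sscratch, t0             # stash t0 so we can restore it after acquire
    li   t0, 1
acquire_kernel_lock:
    la   sp, kernel_lock          # sp is free to use as a temporary here
    amoswap.w.aq t0, t0, (sp)
    bnez t0, acquire_kernel_lock  # spin until the old value was 0
    csrr t0, sscratch             # restore t0: no register is modified overall
    li   sp, 0x80400000           # only now is it safe to use the kernel stack
    # ... save all registers and call kernel_entry as before ...

    # ... restore all registers from the kernel stack as before ...
    csrw sscratch, t0             # stash t0 again for the release
    la   t0, kernel_lock
    amoswap.w.rl zero, zero, (t0) # write 0: release the kernel lock
    csrr t0, sscratch             # restore t0
    mret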

After your modifications to grass/kernel.s, there is only one more thing to do in grass/kernel.c. With only one core, that core always has some process to run because process GPID_TERMINAL should always be runnable. With multiple cores, however, a core may have to be idle, i.e., the scheduler cannot find a runnable process for it. As a result, we need an idle core to do nothing until it receives the next timer interrupt. You will implement this in the CORE_IDLE part of proc_yield.

TIP

We use a single lock to protect the whole kernel because it is simple. In many operating systems, there are separate kernel stacks dedicated to different CPU cores so that kernel stacks don't need to be protected by locks. And there are different locks protecting different kernel data structures.

Run processes in parallel

Now, let us run multiple processes on different CPU cores and see whether our kernel can indeed handle multicore. Specifically, you will implement the proc_coresinfo function in grass/kernel.c and add it to struct grass, so that the shell can call this function for the coresinfo built-in command (see coresinfo in apps/system/sys_shell.c). After you finish, run multiple processes in the background and try coresinfo as follows.

shell
> make qemu
...
[CRITICAL] Welcome to the egos-2000 shell!
 /home/yunhao clock --silent &
[INFO] process 6 running in the background
 /home/yunhao clock --silent &
[INFO] process 7 running in the background
 /home/yunhao clock --silent &
[INFO] process 8 running in the background
 /home/yunhao clock --silent &
[INFO] process 9 running in the background
 /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=6
[INFO] Core #2 is running pid=4 (GPID_SHELL)
[INFO] Core #3 is running pid=7
[INFO] Core #4 is running pid=9
 /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=8
[INFO] Core #2 is running pid=9
[INFO] Core #3 is running pid=4 (GPID_SHELL)
[INFO] Core #4 is running pid=7
...

In this demo, we started 4 processes in the background and, each time we ran coresinfo, different cores were running different processes. In the first coresinfo output above, processes #6, #7 and #9 were running, while #8 was runnable but not running on any core. You can also try the killall built-in command, after which all 4 background processes terminate. Here is a video with more details of this demo.
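
For reference, proc_coresinfo can be quite small. The sketch below assumes you maintain an array recording which pid each core is currently running (core_to_pid and NCORES are made-up names) and uses the usual egos-2000 INFO logging macro; the shell would reach this function through the pointer you add to struct grass:

c
#include "egos.h"                  /* for the INFO() logging macro               */

#define NCORES 4                   /* QEMU boots cores #1..#4                    */
int core_to_pid[NCORES + 1];       /* hypothetical: the scheduler records, for
                                      each core id, the pid it is running        */

void proc_coresinfo() {
    INFO("==============Core ID / Process ID==============");
    for (int i = 1; i <= NCORES; i++)
        INFO("Core #%d is running pid=%d", i, core_to_pid[i]);
}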

Find concurrency bugs

Being able to run a demo does not mean that your code is free of bugs. Concurrency bugs refer to bugs that happen when there are multiple programs running concurrently. For this project, a concurrency bug could be programs running on different cores trying to modify the kernel data structures at the same time. Although we have protected the kernel with the kernel lock, concurrency bugs still exist. See whether you can find and fix them.

In general, concurrency bugs are difficult to find and debug because they typically happen neither deterministically nor often. Given that egos-2000 has a small code base and uses simple data structures, a good way to remove concurrency bugs is to reason very carefully about every use of every shared data structure.

TIP

If you wish to learn more about parallel programming in operating systems, this book is a fun read: Is Parallel Programming Hard, And, If So, What Can You Do About It? Also, if you run this project on the Arty board, you will need to add a few nop instructions after every amoswap instruction. The reason is that the CPU design for the Arty board has a flaw: it cannot complete the amoswap instruction within one CPU cycle. By adding a few nop instructions, the CPU core waits for amoswap to complete before proceeding into a critical section.
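
For example, the acquire in boot_loader might become something like the following on the Arty board (the exact number of nop instructions is a guess and may need experimentation):

asm
    amoswap.w.aq t1, t1, (t0)
    nop
    nop
    nop
    bnez t1, boot_loader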

Accomplishments

In this project, you have gained hands-on experience with atomic memory operations, an extension to the RISC-V instruction set for multicore. You have seen how multiple cores can execute instructions (e.g., do printing) in parallel, and how to make the egos-2000 kernel support multicore using a kernel lock. Lastly, you have seen that concurrency bugs are not obvious because they happen neither often nor deterministically.

"... any person ... any study."