
Multicore & Locks

So far, we have been running the operating system and user applications on a single CPU core, but multicore CPUs are now ubiquitous. A laptop typically has a CPU with 4 to 16 cores, while a server CPU in a data center may have 64 or more. Multicore improves software performance because multiple processes run on different cores at the same time, so multiple software tasks can make progress in parallel.

Handling multiple cores involves an important concept called mutual exclusion. In general, mutual exclusion prevents multiple CPU cores from using a shared resource simultaneously. For example, when the kernel code is running on a core and updating some data structure (e.g., the PCB), it may need exclusive access to that data structure. Otherwise, kernel data structures can be corrupted when multiple cores update them simultaneously. We thus start by introducing mutual exclusion.

Mutual exclusion

When you run egos-2000 on QEMU, you may have noticed that the first line printed is one of the four possibilities listed below. In other words, a core is randomly selected from #1 .. #4 to boot egos-2000 when you run it on QEMU.

[CRITICAL] --- Booting on QEMU with core #1 ---
[CRITICAL] --- Booting on QEMU with core #2 ---
[CRITICAL] --- Booting on QEMU with core #3 ---
[CRITICAL] --- Booting on QEMU with core #4 ---

This is a typical example of mutual exclusion. Specifically, the public egos-2000 code only runs on a single core, so the booting code (a.k.a. the boot loader) prevents the other 3 cores from running egos-2000 once it has started to run on one core. Such mutual exclusion requires a new class of CPU instructions called atomic memory operations, which are an extension to the basic RISC-V instruction set architecture.

Atomic memory operation

The assembly code below is from earth/boot.s, showing the first instructions that every CPU core executes when running egos-2000.

```asm
boot_loader:
    la t0, boot_lock
    li t1, 1
    amoswap.w.aq t1, t1, (t0)
    bnez t1, boot_loader
    li sp, 0x80200000
    call boot
.bss
    boot_lock:       .word 0
    booted_core_cnt: .word 0
```

The `.bss` section declares a 4-byte variable named boot_lock that is initialized to 0 before any core executes its first instruction (i.e., the la instruction in boot_loader). The first two instructions load the address of boot_lock into register t0 and the value 1 into register t1. The magic happens at the amoswap.w.aq instruction, where amo stands for atomic memory operation. It atomically swaps the value of t1 with the 4 bytes at memory address t0. Atomicity means that exactly one of the 4 cores swaps out the initial 0 value of boot_lock; all the other cores swap out the value 1 written by an earlier core.

The first core that completes amoswap.w.aq will call boot() and, as long as boot_lock continues to hold the value 1, the other cores will keep spinning in the loop created by the bnez instruction.

TIP

The magic of amoswap.w.aq is enforced by the CPU hardware design. The CPU hardware decides which core should be the first when multiple cores try to execute this instruction at the same time. This is closely related to the memory coherence problem in computer architecture.

We say that the code above acquires the boot lock, and one releases this lock by writing the value 0 to the address of boot_lock. After the boot lock is released, another core can acquire it and proceed to call boot().

In general, a lock is simply a 4-byte variable in memory holding either 0 or 1. Given a lock variable x in C, we have provided two macros for you in library/egos.h.

```c
/* The __sync_lock_* functions are defined within the C compiler. */
#define release(x) __sync_lock_release(&x);
#define acquire(x) while (__sync_lock_test_and_set(&x, 1) != 0);
```

While you still need to write some assembly code in this project, you can use these macros whenever you need to acquire or release a lock in your C code. Now, release the boot lock, and see what will happen.

Release the boot lock

Since booted_core_cnt is initially 0 just like boot_lock, the first booted core will enter the if branch in boot() and call 4 initialization functions: tty_init, disk_init, mmu_init, and intr_init. These initialize the TTY and disk devices, and the CSRs for virtual memory and interrupts. Lastly, boot() calls the grass_entry function defined in grass/init.c.

The grass_entry function loads the binary executable of apps/system/sys_proc.c into memory as the first process, and starts running this process after the mret instruction. Note that after this mret, the first process sets the stack pointer to 0x80400000, as shown in apps/app.s. This means the first booted core has finished using the kernel stack, so it can now safely release the boot lock and allow the next core to call boot().

We thus ask you to release the boot lock at the start of apps/system/sys_proc.c. Because grass_entry() passes the address of boot_lock to this process as an argument to its main function, you can access boot_lock through the struct multicore* argument. Here is one possible output after you release the boot lock.

> make qemu
...
[CRITICAL] --- Booting on QEMU with core #1 ---
...
[SUCCESS] Enter kernel process GPID_PROCESS
[SUCCESS] --- Core #4[INFO] Load kern el process #2: sys_tserminal
tarts running ---
[INFO] Load 0xde8 bytes to 0x80200000
[INFO] Load 0x11c bytes to 0x80208000
[SUCCESS] Enter kernel process GPID_TERMINAL
...

Essentially, the first process prints out Enter kernel process GPID_PROCESS normally, but the printing of [INFO] Load kernel process #2: sys_terminal is interleaved with the printing of [SUCCESS] --- Core #4 starts running --- defined in earth/boot.c. This is exactly the result of core #1 running the first process while core #4 runs boot() in parallel.

Complete the boot loader

Start with a fresh copy of egos-2000, and incorporate your code for virtual memory in P4.

Your first task is to complete the code in apps/system/sys_proc.c and earth/boot.c, so that all the cores finish booting. Specifically, the 3 cores that boot after the first one should set up their own CSRs for interrupts and virtual memory, but they should not initialize the devices again. Note that each core has its own set of CSRs. In particular, multicore requires page-table translation, so simply FATAL in your code if SOFT_TLB has been chosen.

The goal is to see [SUCCESS] --- Core #? starts running --- for all 3 of these cores, so we know that all 4 cores have booted. However, you will likely encounter exceptions and a FATAL in excp_entry() in the kernel, since we have not yet protected the kernel code with locks.

Kernel lock

As we have mentioned, if multiple CPU cores use the kernel stack or update kernel data structures at the same time, these memory regions can be corrupted. Therefore, we define kernel_lock in grass/kernel.s, which should ensure that, at any time, only one core can access the kernel stack or data structures. Your job is to protect the kernel by acquiring and releasing kernel_lock.

Protect the kernel

Recall trap_entry, defined in grass/kernel.s and first introduced in P2. It is the entry point for all interrupts and exceptions that trap a CPU core into the kernel. If multiple cores get trapped at the same time, they will all execute the instructions in trap_entry and thus use the kernel stack at 0x80200000.

Intuitively, you will need to acquire the kernel lock before switching to the kernel stack, and release it right before the mret in trap_entry. There is one key difference from how this is done in earth/boot.s. Recall that trap_entry saves all the registers on the kernel stack before calling kernel_entry, and restores all the registers before mret. This requires that your code for acquiring or releasing the lock does not modify the value of any register. This can be achieved with the so-called scratch CSRs such as mscratch and sscratch. For example, to keep the value of the sp register, trap_entry already uses the mscratch CSR to record the old value of sp.

After your modifications to grass/kernel.s, there is only one more thing to do in grass/kernel.c. With multiple CPU cores, it is possible that a core has to be idle, i.e., the scheduler cannot find a RUNNABLE process to run on this core. In that case, the idle core should do nothing until it receives the next timer interrupt. You have seen this situation when implementing sleep in P3, and you need to handle it again in proc_yield.

TIP

We use a single lock to protect the whole kernel because it is simple. Many operating systems instead dedicate a separate kernel stack to each CPU core, so that kernel stacks don't need to be protected by locks, and use different locks to protect the other kernel data structures.

Run processes in parallel

Now, let us run multiple processes on different CPU cores and see whether our kernel can indeed handle multicore. Specifically, you will implement the proc_coresinfo function at the end of grass/kernel.c and add this function to struct grass, so the shell can call it for its built-in command coresinfo (see hints in apps/system/sys_shell.c). After you finish, run multiple processes in the background and try coresinfo.

> make qemu
...
[CRITICAL] Welcome to the egos-2000 shell!
➜ /home/yunhao loop &
[INFO] process 6 running in the background
➜ /home/yunhao loop &
[INFO] process 7 running in the background
➜ /home/yunhao loop &
[INFO] process 8 running in the background
➜ /home/yunhao loop &
[INFO] process 9 running in the background
➜ /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=6
[INFO] Core #2 is running pid=4 (GPID_SHELL)
[INFO] Core #3 is running pid=7
[INFO] Core #4 is running pid=9
➜ /home/yunhao coresinfo
[INFO] ==============Core ID / Process ID==============
[INFO] Core #1 is running pid=8
[INFO] Core #2 is running pid=9
[INFO] Core #3 is running pid=4 (GPID_SHELL)
[INFO] Core #4 is running pid=7
...

In this demo, we started 4 processes in the background and, when we ran coresinfo multiple times, different cores were running different processes. For the first coresinfo above, processes #6, #7, and #9 were running while #8 was not (i.e., it was RUNNABLE). You can also try the killall built-in command, after which all the background loop processes will terminate. Here is a video with more details of this demo.

Find concurrency bugs

Being able to run a demo does not mean that your code is free of bugs. Concurrency bugs are bugs that can occur when multiple programs run concurrently. In this project, a typical concurrency bug would be programs running on different cores modifying the same kernel data structure at the same time. While we have protected the kernel with the kernel lock, such concurrency bugs still exist. See whether you can find and fix them.

In general, concurrency bugs are difficult to find and debug because they typically happen neither deterministically nor often. Given that egos-2000 has a small code base and it uses simple data structures, a good way to remove concurrency bugs is to reason about the use of all the data structures very carefully.

TIP

If you wish to learn more about parallel programming in operating systems, this book is a fun read: Is Parallel Programming Hard, And, If So, What Can You Do About It? Also, if you run your code on the Arty board, you need to add a few nop instructions after each amoswap instruction. The reason is that the CPU design for the Arty board has a flaw: it cannot complete the amoswap instruction in one CPU cycle. Adding a few nop instructions makes the CPU core wait for amoswap to complete before proceeding into a critical section. The Tang Nano 20K board does not support multicore right now.

Accomplishments

You have gained some experience with atomic memory operations, an extension to the RISC-V instruction set for multicore. You have seen how multiple cores running code in parallel produce interleaved printing, and how to protect the kernel with a lock. Lastly, you have tried to find and fix concurrency bugs, which may happen neither often nor deterministically.

"... any person ... any study."