--------------------------[ Page One ]-------------------------- Kernel support for garbage collection is one potential technical advantage of a LispOS. GC flip can be done by stopping CPUs, not threads. Incremental scav might be plausible. There might be things we can do in terms of page protection for lazy determination of pointer/non-pointer values in registers during GC. Kernel design is fairly straightforward. Leave it for a bit. The real question is compiler and build tool support. If we start from SBCL, we would have to start by hosing the build process. The obvious runtime environments to support are bare-metal, a coLinux-style setup, and xen. The SBCL toplevel build control looks fairly hackable. Horrific thought: the first form in any source file is an in-package. We could set up a custom in-package function which mangles its argument in either host or target compile mode. In fact, it should "only" take a few hours to make a .asd file to do most of the work. We should be able to get genesis to produce a tiny core with no foreign symbol dependencies. If we arrange for the bootloader to read the entire core in at th e1-meg mark and then map pages for the core spaces, we should be able to put hte "core" code in Lisp. Put core header at 1-met, page tables at 2-gig, identity map the low meg, and be clever with the initial lisp function. First init function should just stomp the screen and lock up. Future boot loader should map a page or so for FS before starting Lisp, and the init function can populate the minimum required fields for operation, get itself an alloc region, and use the normal thread creation functions to build a proper thread to start the system. We don't need to asdf-ize the system to start, we could just hack a new build script. --------------------------[ Page Two ]-------------------------- Important components for a LispOS: Lisp Implementation (based on SBCL) Kernel (memory management, thread scheduling, etc.) File System (Movitz has ext2fs read under unknown license, might use that for now.) Networking (TCP/IP) Graphic Interface (X, use CLXS) User components: Filesystem Browser Text editor (climacs?) Terminal windows (for REPLs, or for telnet/ssh) X window manager Hard drive partitioning/formatting tools Web Browser Debugger IRC client (beirc?) -------------------------[ Page Three ]------------------------- The plan is to run the project mostly dark until it is visually impressive (X, a wm, and a couple apps). Any given subcomponent can be visible (clxs is a good example here), but the main goal should remain secret. Assume that we can beg, borrow, or steal anything anyone else is working on. The minimum we need is X, a wm, a filesystem, networking, and a terminal app. We can run X, the wm, and the terminal as open projects from the start. The terminal doesn't even have to be very complete. Let's leave it for later. The design for CLXS is approaching feature-complete. Once the infrastructure is built out a bit and the next stage is implemented, it can back-burner for a while. That's a day or so of work on CLXS, and then I can leave it again. The dark-running projects are the kernel, the filesystem, and networking. We can leave networking alone until the kernel and filesystem work. And then we might consider local-only networking for a while, as that should be enough to run X. The only bit of the kernel I'm worried about is the GC. And that's mostly because I don't understand SBCL's current GC. Some study is probably in order. The other major uncertainty is the filesystem interface. On the one side, it depends on the block device interface. And on the other side, it depends on the pathname and stream interfaces. And while we can use COS for network filesystems, PCL isn't loaded yet by the time the bootstrap file system has to go live. We can use my Forth system to do the initial bootstrap for a while. Saves having to do any 16-bit ASM work. SLAD will be trickier under this setup. Maybe have it fork a new address space and run there? --------------------------[ Page Four ]------------------------- Ext2fs is the obvious choice for an initial filesystem. We have sample code, and it is easily accessible from a Linux host. This leaves the questions of interfacing to pathnames and streams on the Lisp side,, the block device interface on the kernel side, and paged operation if we put the kernel and/or swap in the filesystem. An initial test shows that genesis requires wo foreign symbols directly. [undefined_tramp] and [closure_tramp]. If we have them in an assemfile, we should be able to load it before initializing the static symbols. If this does end up working, it would be a possible lead-in to that win32-native-executable compiler. Turns out that the tramp values are offset thanks to their fun headers. I may have to do a bit more work to get them working right. Qemu apparently supports floppy disks. Thus, we should be able to use it for initial testing. Our strategy for now should be to read all of fda to linear #x400000, which is unused space, build initial page tables based on the core header (or just hardcode them), and then use a stub in low memory to disable interrupts, set up the page tables and a stack frame, and call the initial function. : loadtrk 24 * DUP 200 * 400000 + SWAP 24 fdread ; : load-disk 50 BEGIN DUP WHILE 1- DUP loadtrk REPEAT DROP ; works to load a disk image. Even once the test core boots we won't be able to use the boot environment long before running into its limits. Thus, once the system is booting, a priority must be placed on building out the kernel functionality and the new cold-core memory layouts. It would be a good idea to arrange things such that we can recompile and reload source files without having to rebuild the whole compiler. Cut down the edit-compile-test time. --------------------------[ Page Five ]------------------------- Loading a core file would be easier if the header were simpler. We don't need build ids and stuff like that. We really just need to know where to map the pages and where the initial function is. We also don't need an awful lot of the bootstrap/cold-init data that we're getting. These changes to genesis would make it easier to set up the page tables and give us plenty of room for the os kernel. VARIABLE next-page VARIABLE page-directory 3 CONSTANT page-protection : new-page next-page @ 1000 next-page +! ; : clear-page 0 OVER C! DUP 1+ 0FFF CMOVE ; : make-page-table ( addr ) new-page DUP clear-page page-protection OR SWAP page-directory-entry ! ; : page-directory-entry ( addr ) 14 SHR -4 AND page-directory @ + ; : find-page-table ( addr ) DUP page-directory-entry @ DUP 1 AND IF NIP -1000 AND EXIT THEN DROP next-page @ SWAP make-page-table ; : map-page ( phys vert) SWAP page-protection OR SWAP DUP find-page-table SWAP 0A SHR 0FFC AND + ! ; We probably should make provisions to map the page tables into virtual memory as well. If we map the tables linearly at 80400000 - 807FFFFF, that's one extra page, and leaves 4 megs at the two gig line for the page directory and any other data we want. We should have an identity-map page table for the low four megs when we first enable paging. That way, we can still use Forth to finish setting up and switch the table back at the last minute. : id-map ( top-addr ) BEGIN DUP WHILE 1000 - DUP DUP map-page REPEAT DROP ; --------------------------[ Page Six ]-------------------------- CODE cr0@ EBX PUSH, 200F W, 0C3 C, NEXT, CODE cr0! 220F W, 0C3 C, EBX POP, NEXT, CODE cr3@ EBX PUSH, 220F W, 0DB C, NEXT, CODE cr3! 220F W, 0DB C, EBX POP, NEXT, : LGDT, 010F W, as-modrm 10 OR C, displace ; CODE LGDT 10 #, EBX SHL, EBX PUSH, 2 [ESP] LGDT, 8 #, ESP ADD, EBX POP, NEXT, The basic plan is to load the core file, set up the page tables for Lisp, set up and enable the page tables for forth, set up the Lisp GDT, assemble a startup stub in low memory, and jump to the startup stub. The startup stub must: Kill interrupts, reload cr3 with the Lisp page tables, establish a Lisp stack frame, and call the initial function. Actually, we can set the stack fram up and kill interrupts from Forth. CODE fs! EBX PUSH, 0A10F W, EBX POP, NEXT, MOV cr3, ebx / xor ebx, ebx / xor ecx, ecx / xor esi, esi / xor edi, edi / xor edx, edx / pop eax / pop ebp / call [eax+1] ? : start-lisp SP@ >R 0 0 0 R> 400000 @ page-directory @ lisp-boot-trampoline ; 400030 @ 400028 @ + 401000 + next-page ! : map-pages ( phys virt num ) BEGIN DUP WHILE 1- >R 2DUP map-page 1+ SWAP 1+ SWAP R> REPEAT DROP 2DROP ; : map-space 4 + DUP @ 1000 * 401000 + SWAP 4 + DUP @ 1000 * SWAP 4 + @ map-pages ; : map-core-file 400004 map-space 400014 map-space 400024 map-space We'll map the DT at 12000 and FS at 13000. -------------------------[ Page Seven ]------------------------- The two critical files on the disk are the core and the swap. It might be easiest to maintain their on-disk locations as extent lists. We then have a method of finding a disk sector for any given file offset. We need to maintain, in parallel with the page tables, information about what is mapped where, what the on-fault actions are, and which pages can be discarded if we need new memory pages. The GDT has a maximum of 8192 descriptors at 8 bytes each. That's 64k of space. We should reserve the full address space but only map the first page (and set the limit to just that page.) The IDT has a usable limit of 256 descriptors, most of which will be unused. That's half a page (2048 bytes). The boot loader should be static-allocating these at known locations. There are a number of functions that we obviously will need in the kernel. Page table manipulation, page allocation, selector manipulation, etc. We can write these even before we figure out what to do about memory management. The filesystem drivers don't have to be in the kernel. It is sufficient if the kernel has the disk drivers and extent lists for the core file and possibly swap space. That should simplify the kernel a bit at the expense of adding code to set up the extents to the booltloader. And that isn't a big expense because the bootloader would have to do most of that work to load the kernel from the core file in the first place. The kernel basically needs disk, keyboard, and timer drivers; a task switcher; memory managemnt; and garbage collection. fd-streams use a non-lisp mmap()ed memory buffer. That simplifies things quite a bit. The intial (pre-memory-manager) versions don't need any of the device drivers. -------------------------[ Page Eight ]------------------------- The next things to do, codewise, are to build out the memory management, interrupt, and task-switch primitives, the task switcher, and the bootloader ext2 reader. And I should study gencgc while I do so. The kernel functions are all basically statically linked. If they, their literals, and their fdefinitions were in a special segment, but not their symbols or top-level forms, that would probably suffice for a minimal bootstrap. It'll probably take a new fop to separate them out, though. It might be easier to have genesis dump everything from those files into the kernel segment and use the GC mechanisms to sort everything out. Transporting the functions and fdefinitions to static space might do the trick. Any tenured space might do. The Forth system could be run in a very primitive penalty box. Maintaining it and the boot code from within Lisp is a very real possibility. For now we can put the TSS in the other half of the page with the IDT. 0.) Create os-kernel pakage. 1.) Create interim page allocator. a.) Export page allocation pointer from Forth. 2.) Allocate IDT page and load IDTR. a.) Create LIDT VOP. 3.) Create TSS and load TR. a.) Create LTR VOP. b.) Create GDT selector allocator. c.) Create GDT selector configurator. 4.) Create a wait-for-interrupt function (HLT). --------------------------[ Page Nine ]------------------------- Once we have interrupts working in ring 0, and before the base system works, we need to spend some time on thread synchronization and whatnot. Right now we have no user I/O. And cold-init only uses the one thread. This should simplify the initial page-fault handler somewhat. We need an in-kernel display driver. For diagnostic output at this point, but for user-mode use eventually. Should probably have threadsafe entry points for user code and unsafe entry points for the kernel (which, presumably, is going to go tits-up if it needs to say anything). Should probably build this before setting up interrupt handlers. Each thread needs its own kernel stack. The general prologue for interrupts and exceptions will push a bogus error code as needed to normalize the stack depth followed by the integer registers. This lets us use the kernel stack as the CPU context structure for tasks not running on a CPU. The kernel needs a yield-to- thread operation whcih changes the ESP stored in the TSS, FS, and the current stack and frame pointers. There's something to be said about the scheduler, the debugger, and single-stepping instructions, but I'm not sure what yet. 1.) Add in and out instruction definitions. 2.) Create I/O port access VOPs. 3.) Move the hardware cursor. 4.) Transliterate the Forth display driver into Lisp. a.) Create memmove and memset VOPs. 5.) Lay out initial interrupt/exception stubs. 6.) Make compiler type propagation work. --------------------------[ Page Ten ]-------------------------- It's been just over a week. My standalone SBCL project has hit a snag with the cross-compiler and its pathological inability to manage type information. Demonstrating the issue is not hard. Fixing it might be. Thus far, I have an interesting proof-of- concept loader, and some good design work, but nothing truly compelling. The critical point would be working sb-show and running part of cold init. That would be interrupts, display drivers, page faults, and disk driver at the least. Also means further-hacked genesis and a filesystem driver in the Forth system. A good couple weeks work, at least. No, we don't need the filesystem driver. We can treat the HD as we do the floppy, and fake up a single extent for the core file and another one for swap. Still need the rest of it. Another possibility would be to back and fill. We have a Forth system, why not use it? We lose ideological purity, but wouldn't need to hack genesis, a disk driver, a display driver, or other kernel stuff. Somehow, I'm not convinced. This past week has brought a number of project ideas to mind, with varying levels of ambition. From blogging to webstores, to multiplayer web games, to stripping out and rebuilding the Python compiler to an auto-newbie-helper bot for IRC. It may be time for me to work on something else. The path ahead for OS work is fairly clear, and I don't want to burn out on it. Having thought about it, this compiler lossage means that one of my preconditions for SBCL/standalone doesn't hold. We can't really continue until it's sorted out. -------------------------[ Page Eleven ]------------------------ The compiler thing turned out to be partially me not declaring types as well as I should have and partially cross-float- infinity-kludge. Both problems were simple enough to solve. I managed to screw up loading the IDTR as well. An IDT limit of 0 doesn't make for happy page faults. To make things easier, I intend to have an idle thread which is always runnable and does (loop (wait-for-interrupt)), and has its function in kernel space. We'll need one per CPU. If we run the GC scavenger as another scheduled task, we might be able to arrange things such that we never take a page fault in ring 0. Our guiding philosophy should be to keep things as unprivelidged as possible, with some exceptions. Anything that can run in ring 3 does so. Anything that can be paged out is subjec to demand- paging. Read-only space, static space, and kernel space use wired memory pages. We run as little as possible in ring 0, and as little as possible in wired memory. Taken to an extreme, the ring-0 code would just handle interrupts, and would do so by switching threads to a per-cpu manager thread. We may get there someday. This effectively gets us PCLSRing "for free" if we implement it. And we should. Positing the existence of a REGISTER-KERNEL-FUNCTION function which moves a given function into wired space, treating the initial core as oldspace, linking the kernel functions first (so that they are towards the beginning of dynamic space), and just having the loader read up the first few megs of the cold-core might be a good way to start the system. With the minimal ring 0 proposed above, we could have the loader start us off in ring 3, and do all the setup from there. -------------------------[ Page Twelve ]------------------------ Our interrupt stack frames look like this: SS \ <- [ TSS ESP0 - 4 ] ESP | EFLAGS >- from priv transition CS | EIP / Error code - from exception or handler EAX -----\ ECX \ EDX | EBX >- from pushad TSS ESP0 - 24 | EBP | ESI / EDI -----/ Below EDI we can push whatever we want (interrupt/exception number, for example). It's a fixed format, and we could put an array or structure header on it for easy access from Lisp. ------------------------[ Page Thirteen ]----------------------- Having looked at the gencgc source, I believe that memory management will turn out to be fairly straightforward. Most of the scav work can be done from a dedicated thread, giving us an incremental system. The "page" structure of gencgc is similar to the region-bits array on a classical LispM. The "alloc_region" structure is most of the rest of the region setup on a LispM. With direct access to the hardware page tables, we don't need all of the same information as gencgc does. Because a booted core includes a page table and whatnot, it will have a substantially different structure than a cold-core. We can take advantage of this in a couple ways. First, a booted core can contain multiple threads. Second, because we have the page tables, we no longer need to keep track of the bounds of dynamic space in the core header. Third, abusing GC primitives as part of system bootstrap makes even more sense. We hack genesis to include a "kernel space" containing some set of critical files. On startup we scavenge the kernel functions and related constants to their resting place (maybe in static space?), and scav !cold-init to a fresh allocation region. With the rest of the cold-core as oldspace, we can continue on our way, and the incremental GC will sort out the mess. One problem with incremental GC is that we don't have a traditional read barrier. In practical terms, this means that any page which can be read by the mutator must contain only pointers to newspace or copyspace. We can arrange for the GC thread to be able to read otherwise-restricted pages either by running it in ring 1 or 2, or by having it map the pages into memory elsewhere for access (useful for writing to read-only space as well.) ------------------------[ Page Fourteen ]----------------------- Kind of random, but the missing piece for my frame-return hack was the binding pointer. If we save it in a dedicated slot in the stack frame, as part of XEP processing, it'd be fairly heap to have. rtoy points out that this doesn't help known-fun calls within a component. If we can non-conservatively scavenge the control stack, by means of code annotations or whatever, we can turn around and incrementally scavenge the stack (either by page protection or return address rewriting). Or both techniques if stacks are subject to swap-out. If, through code annotation, we don't have to conservatively cavenge the register set, then we can lose the dont_move page flag and all its implications. For GC, we need to know what the system looks like before and after flipping, what is done while flipping, how scavenging occurs, what happens when we page fault, what our invariants are, and so on. One of the invariants of a Baker-style incremental GC is that the mutator must never see a pointer to oldspace. On the old LispMs, this was accomplished by means of hardware. The page tables contained an oldspace bit, and it was easy for the microcode to check this bit for a given address. We don't have that luxury. We need to make sure that any live data on a page that is readable by a nrmal user task doesn't contain any oldspace pointers. We should probably have an operation mode where we only do stop-and-copy (non-incremental) GC. It would in many ways be easier to implement, and would provide a different performance profile. ------------------------[ Page Fifteen ]------------------------ A conservative root points to an object that may be live, and that we don't dare move. We can't move the object because it might be an unboxed integer, and modifying that would be obviously wrong. We can't discard it, since it might be a live poointer. But we are under no such constraints for any other object on the page. We can count the one object as unscavenged newspace, and the rest of the page as oldspace. We don't even need to worry about stray references to the rest of the page because the mutator can't see pointers to oldspace. After we flip, if the object is still a conservative root, we can put alloc regions around it, as the oldspace regions of the page have been evacuated. Let's call "unscavenged newspace" copyspace. We then have three spaces. Newspace is all live data and safe-for-the-mutator pointers. Oldspace is all untransported objects and forwarding pointers. Copyspace is all live data, but contains pointers to oldspace. Because we are using page protection for our read barrier, we must scavenge copyspace a page at a time. In order to do a GC flip, we must stop all of the CPUs, scavenge the conservative roots, scavenge the non-conservative roots, create new alloc regions for each threads (closing the old ones), and resume where we left off. We also have multiple generations to handle. We write-protect newspace for the older generations, as they effectively provide roots for younger generations, and we don't want to have to scav the entire memory space. ------------------------[ Page Sixteen ]------------------------ The different page types for GC are: 1.) Oldspace 2.) Newspace 3.) Copyspace 4.) Static Space 5.) Old-generation newspace Note that, with conservative roots we also can have a page that is both oldspace and copyspace. Such a page is protected as copyspace, and hang any transport complexity. Space types 1 and 2 are r/w. For oldspace, this is to make transport easier. Space type 3 is no-access in order to implement our read barrier. Space types 4 and 5 have r/o or r/w access depending on volatility. The GC needs to be concerned with fault activity on space types 3, 4, and 5, as they are the non-r/w spaces. For a fault in space 4 or 5, we either emulate the write (and optionally record it as a young pointer), or mark the page as volatile (to be scanned for roots at flip). For a fault in space 3 (copyspace), we have to scavenge the entire page, transporting any oldspace references, and then promove the page to newspace. -----------------------[ Page Seventeen ]----------------------- All physical memory pages are either wired, in-use, or free. The free pages should be held on a list for the page allocator. The in-use pages should be stored on an LRU queue (more likely a not-recently-used queue), and need to have some way to quickly find the associated virtual memory address when they need to be swapped out. Let's take a "modern" 32-bit x86 for example. Give it two gigs of RAM (not too unusual these days). That's #x80000 pages (512 thousand or so). At 8 bytes per page of tracking information (LRU or free chain and VMA, with a few bits for flags), that's #x400000 bytes (4 megs) to cover. We can afford that. As a practical matter, we'll run out of address space before we use up that much RAM outside of machine virtualization. Memory requirements for tracking virtual address space aren't so easily calculable without a more complete design for the garbage collector. SLAD essentially becomes a variant on syspend-to-disk. Unlike SBCL proper, we keep all of our threads around. The main difference between SLAD and suspend-to-disk is that SLAD copies everything over to a new core, and suspend-to-disk can "just" flush everything out to swap space. The underying mechnism for both will probably take an extent list for the target disk areas, rather than deal with the filesystem directly. SLAD may, in fact, not kill the system. It could save off to the new core and then continue running with the new core and old swap space. ------------------------[ Page Eighteen ]----------------------- The first thing the system manager thread does is dispatch on exception/interrupt type. That can be taken care of with a 48-entry function array. Interrupts get shunted to device drivers; most exceptions get shunted to the debugger task; breakpoints, page faults, and the floating-point task-switched exception get special care. The fp trap is fairly straightforward to handle, we can defer the details. Breakpoints are used as a shorthand register-preserving function call for non-speed-critical parts of code (exceptional situations, really). They have meanings for error handling, debugging, and we'll probably use them for GC and thread synchronization. Page faults have the fun possibilities of needing to request a disk read (with a pagetable update and thread restart upon completion), to request a page scav from the GC thread, or to request a disk read with a pagetable update and page scav request on completion. We need to use proper synchronization primitives and track pending in-page operations to cover the cases of multiple threads blocking on the same page. The scaenger should also be able to request a page-in without blocking, and a user thread likewise for a scav or page-in. Because of the needs of the scavenger (should remain ready for scav requests while waiting for a page-in of oldspace), serve- event-based networking, and any other fancy synchronization needs, we need something like WaitForMultipleObjects fom Win32. ------------------------[ Page Nineteen ]----------------------- A perusal of the wineserver source shows synchronization objects having waitqueues of interested threads, threads having "waits" of waitqueue entries they are interested in, and timeout structures. The wineserver makes for an interesting guide to the concepts, but is a single-threaded application, and so doesn't include the spinlocks required for multi-CPU operation. And waiting on a spinlock and modifying N waitqueues can become a long time to lock out interrupts. Or perhaps not, given that we make no realtime interrupt response guarantees as of yet. A "user" synchronization request, such as serve-event or WaitForMultipleObjects can allocate the needed "wait" on a wired stack page or two. System requests such as an in-page or scavenge request should use statically allocated waits in the thread structure. I/O completion notification should obviously be done using the thread synchronization system. Disk I/O requests are built around an event synchronization object, used to signal completion (either success or failure). To simplify synchronization, when a page fault is received on a paged-out copyspace page, it is up to the scavenger to request a page-in. -------------------------[ Page Twenty ]------------------------ We shouldn't need to put the compiler in the cold-core. We need to build it with the cross compiler, but it should be loadable by the fasloader. An awful lot of the rest of the system should be similar. Many toplevel forms are simple funcalls. This is the observation behind the fopcompiler. It should be possible to have genesis handle such calls via a modified cold-toplevels mechanism. This should reduce the number of compiled functions used only for toplevels, which was one of my concerns with a separate heap space for the os kernel. Another genesis-related change to consider is dumping cold-init related data in a custom space and use a set of known memory locations as pointers instead of symbols which are just going to get discarded anyway. -----------------------[ Page Twenty-one ]---------------------- The prototype task creation routine works quite well. It initializes the new task as if the initial function had just been called. No need for a stub or anything. One stub we might want, though, is for when a thread returns from its initial function. I also implemented a polling-mode ata hard disk driver. It works for reading from the disk, and is untested for writing. It is the basic mechanism required to page data in from the disk. It turns out that the HLT instruction isn't valid for ring 3, even at IOPL 3. The idle task will have to be set up to not do an interllevel return (quite possible), and just sit in a loop waiting for an interrupt. Because it still does an interrupt return, the process state will still be valid when it receives an interrupt. Were it not for the idle task having to run at ring 0, I'd be tempted to have the scheduler run in ring 1 and abuse the nature of the interlevel returns to do the context switching. Looks like the only way to get to ring 0 with the stack set up correctly is to take an interrupt. Scratch that. A 1-arg call gate and a pushfd/call/pop sequence should do the trick, and since the manager thread runs without interrupts we don't have to worry about taking one in ring 0. It turns out that there are a number of instructions that may only be run in ring 0. Looks like we will need a syscall interface after all. -----------------------------[ End ]----------------------------