--------------------------[ Page One ]--------------------------
Kernel support for garbage collection is one potential technical
advantage of a LispOS. GC flip can be done by stopping CPUs, not
threads. Incremental scav might be plausible. There might be
things we can do in terms of page protection for lazy
determination of pointer/non-pointer values in registers during
GC.

Kernel design is fairly straightforward. Leave it for a bit.

The real question is compiler and build tool support. If we
start from SBCL, we would have to start by hosing the build
process.

The obvious runtime environments to support are bare-metal, a
coLinux-style setup, and xen.

The SBCL toplevel build control looks fairly hackable. Horrific
thought: the first form in any source file is an in-package. We
could set up a custom in-package function which mangles its
argument in either host or target compile mode. In fact, it
should "only" take a few hours to make a .asd file to do most of
the work.

We should be able to get genesis to produce a tiny core with no
foreign symbol dependencies. If we arrange for the bootloader to
read the entire core in at th e1-meg mark and then map pages for
the core spaces, we should be able to put hte "core" code in
Lisp. Put core header at 1-met, page tables at 2-gig, identity
map the low meg, and be clever with the initial lisp function.
First init function should just stomp the screen and lock up.
Future boot loader should map a page or so for FS before
starting Lisp, and the init function can populate the minimum
required fields for operation, get itself an alloc region, and
use the normal thread creation functions to build a proper
thread to start the system.

We don't need to asdf-ize the system to start, we could just
hack a new build script.
--------------------------[ Page Two ]--------------------------
Important components for a LispOS:
  Lisp Implementation (based on SBCL)
  Kernel (memory management, thread scheduling, etc.)
  File System (Movitz has ext2fs read under unknown license,
might use that for now.)
  Networking (TCP/IP)
  Graphic Interface (X, use CLXS)


User components:
  Filesystem Browser
  Text editor (climacs?)
  Terminal windows (for REPLs, or for telnet/ssh)
  X window manager
  Hard drive partitioning/formatting tools
  Web Browser
  Debugger
  IRC client (beirc?)
-------------------------[ Page Three ]-------------------------
The plan is to run the project mostly dark until it is visually
impressive (X, a wm, and a couple apps). Any given subcomponent
can be visible (clxs is a good example here), but the main goal
should remain secret. Assume that we can beg, borrow, or steal
anything anyone else is working on.

The minimum we need is X, a wm, a filesystem, networking, and a
terminal app. We can run X, the wm, and the terminal as open
projects from the start. The terminal doesn't even have to be
very complete. Let's leave it for later. The design for CLXS is
approaching feature-complete. Once the infrastructure is built
out a bit and the next stage is implemented, it can back-burner
for a while. That's a day or so of work on CLXS, and then I can
leave it again.

The dark-running projects are the kernel, the filesystem, and
networking. We can leave networking alone until the kernel and
filesystem work. And then we might consider local-only
networking for a while, as that should be enough to run X.

The only bit of the kernel I'm worried about is the GC. And
that's mostly because I don't understand SBCL's current GC. Some
study is probably in order.

The other major uncertainty is the filesystem interface. On the
one side, it depends on the block device interface. And on the
other side, it depends on the pathname and stream interfaces.
And while we can use COS for network filesystems, PCL isn't
loaded yet by the time the bootstrap file system has to go live.

We can use my Forth system to do the initial bootstrap for a
while. Saves having to do any 16-bit ASM work. SLAD will be
trickier under this setup. Maybe have it fork a new address
space and run there?
--------------------------[ Page Four ]-------------------------
Ext2fs is the obvious choice for an initial filesystem. We have
sample code, and it is easily accessible from a Linux host. This
leaves the questions of interfacing to pathnames and streams on
the Lisp side,, the block device interface on the kernel side,
and paged operation if we put the kernel and/or swap in the
filesystem.

An initial test shows that genesis requires wo foreign symbols
directly. [undefined_tramp] and [closure_tramp]. If we have them
in an assemfile, we should be able to load it before
initializing the static symbols. If this does end up working, it
would be a possible lead-in to that win32-native-executable
compiler. Turns out that the tramp values are offset thanks to
their fun headers. I may have to do a bit more work to get them
working right.

Qemu apparently supports floppy disks. Thus, we should be able
to use it for initial testing. Our strategy for now should be to
read all of fda to linear #x400000, which is unused space, build
initial page tables based on the core header (or just hardcode
them), and then use a stub in low memory to disable interrupts,
set up the page tables and a stack frame, and call the initial
function. : loadtrk 24 * DUP 200 * 400000 + SWAP 24 fdread ;
: load-disk 50 BEGIN DUP WHILE 1- DUP loadtrk REPEAT DROP ;
works to load a disk image.

Even once the test core boots we won't be able to use the boot
environment long before running into its limits. Thus, once the
system is booting, a priority must be placed on building out the
kernel functionality and the new cold-core memory layouts.

It would be a good idea to arrange things such that we can
recompile and reload source files without having to rebuild the
whole compiler. Cut down the edit-compile-test time.
--------------------------[ Page Five ]-------------------------
Loading a core file would be easier if the header were simpler.
We don't need build ids and stuff like that. We really just need
to know where to map the pages and where the initial function
is. We also don't need an awful lot of the bootstrap/cold-init
data that we're getting. These changes to genesis would make it
easier to set up the page tables and give us plenty of room for
the os kernel.

VARIABLE next-page     VARIABLE page-directory    3 CONSTANT page-protection
: new-page next-page @ 1000 next-page +! ;
: clear-page 0 OVER C! DUP 1+ 0FFF CMOVE ;
: make-page-table ( addr ) new-page DUP clear-page page-protection OR SWAP page-directory-entry ! ;
: page-directory-entry ( addr ) 14 SHR -4 AND page-directory @ + ;
: find-page-table ( addr ) DUP page-directory-entry @ DUP 1 AND IF NIP -1000 AND EXIT THEN DROP
    next-page @ SWAP make-page-table ;
: map-page ( phys vert) SWAP page-protection OR SWAP DUP find-page-table SWAP
    0A SHR 0FFC AND + ! ;

We probably should make provisions to map the page tables into
virtual memory as well. If we map the tables linearly at
80400000 - 807FFFFF, that's one extra page, and leaves 4 megs at
the two gig line for the page directory and any other data we
want.

We should have an identity-map page table for the low four megs
when we first enable paging. That way, we can still use Forth to
finish setting up and switch the table back at the last  minute.

: id-map ( top-addr ) BEGIN DUP WHILE 1000 - DUP DUP map-page REPEAT DROP ;
--------------------------[ Page Six ]--------------------------
CODE cr0@ EBX PUSH, 200F W, 0C3 C, NEXT,
CODE cr0! 220F W, 0C3 C, EBX POP, NEXT,
CODE cr3@ EBX PUSH, 220F W, 0DB C, NEXT,
CODE cr3! 220F W, 0DB C, EBX POP, NEXT,
: LGDT, 010F W, as-modrm 10 OR C, displace ;
CODE LGDT 10 #, EBX SHL, EBX PUSH, 2 [ESP] LGDT,
    8 #, ESP ADD, EBX POP, NEXT,

The basic plan is to load the core file, set up the page tables
for Lisp, set up and enable the page tables for forth, set up
the Lisp GDT, assemble a startup stub in low memory, and jump to
the startup stub.

The startup stub must: Kill interrupts, reload cr3 with the Lisp
page tables, establish a Lisp stack frame, and call the initial
function. Actually, we can set the stack fram up and kill
interrupts from Forth.

CODE fs! EBX PUSH, 0A10F W, EBX POP, NEXT,

MOV cr3, ebx / xor ebx, ebx / xor ecx, ecx / xor esi, esi /
xor edi, edi / xor edx, edx / pop eax / pop ebp / call [eax+1] ?
: start-lisp SP@ >R 0 0 0 R> 400000 @ page-directory @ lisp-boot-trampoline ;
400030 @ 400028 @ + 401000 + next-page !
: map-pages ( phys virt num ) BEGIN DUP WHILE 1- >R 2DUP map-page
    1+ SWAP 1+ SWAP R> REPEAT DROP 2DROP ;
: map-space 4 + DUP @ 1000 * 401000 + SWAP 4 + DUP @ 1000 * SWAP 4 + @ map-pages ;
: map-core-file 400004 map-space 400014 map-space 400024 map-space

We'll map the DT at 12000 and FS at 13000.
-------------------------[ Page Seven ]-------------------------
The two critical files on the disk are the core and the swap. It
might be easiest to maintain their on-disk locations as extent
lists. We then have a method of finding a disk sector for any
given file offset.

We need to maintain, in parallel with the page tables,
information about what is mapped where, what the on-fault
actions are, and which pages can be discarded if we need new
memory pages.

The GDT has a maximum of 8192 descriptors at 8 bytes each.
That's 64k of space. We should reserve the full address space
but only map the first page (and set the limit to just that
page.) The IDT has a usable limit of 256 descriptors, most of
which will be unused. That's half a page (2048 bytes). The boot
loader should be static-allocating these at known locations.

There are a number of functions that we obviously will need in
the kernel. Page table manipulation, page allocation, selector
manipulation, etc. We can write these even before we figure out
what to do about memory management.

The filesystem drivers don't have to be in the kernel. It is
sufficient if the kernel has the disk drivers and extent lists
for the core file and possibly swap space. That should simplify
the kernel a bit at the expense of adding code to set up the
extents to the booltloader. And that isn't a big expense because
the bootloader would have to do most of that work to load the
kernel from the core file in the first place.

The kernel basically needs disk, keyboard, and timer drivers; a
task switcher; memory managemnt; and garbage collection.
fd-streams use a non-lisp mmap()ed memory buffer. That
simplifies things quite a bit. The intial (pre-memory-manager)
versions don't need any of the device drivers.
-------------------------[ Page Eight ]-------------------------
The next things to do, codewise, are to build out the memory
management, interrupt, and task-switch primitives, the task
switcher, and the bootloader ext2 reader. And I should study
gencgc while I do so.

The kernel functions are all basically statically linked. If
they, their literals, and their fdefinitions were in a special
segment, but not their symbols or top-level forms, that would
probably suffice for a minimal bootstrap. It'll probably take a
new fop to separate them out, though. It might be easier to have
genesis dump everything from those files into the kernel segment
and use the GC mechanisms to sort everything out. Transporting
the functions and fdefinitions to static space might do the
trick. Any tenured space might do.

The Forth system could be run in a very primitive penalty box.
Maintaining it and the boot code from within Lisp is a very real
possibility.

For now we can put the TSS in the other half of the page with
the IDT.

  0.) Create os-kernel pakage.
  1.) Create interim page allocator.
    a.) Export page allocation pointer from Forth.
  2.) Allocate IDT page and load IDTR.
    a.) Create LIDT VOP.
  3.) Create TSS and load TR.
    a.) Create LTR VOP.
    b.) Create GDT selector allocator.
    c.) Create GDT selector configurator.
  4.) Create a wait-for-interrupt function (HLT).
--------------------------[ Page Nine ]-------------------------
Once we have interrupts working in ring 0, and before the base
system works, we need to spend some time on thread
synchronization and whatnot.

Right now we have no user I/O. And cold-init only uses the one
thread. This should simplify the initial page-fault handler
somewhat.

We need an in-kernel display driver. For diagnostic output at
this point, but for user-mode use eventually. Should probably
have threadsafe entry points for user code and unsafe entry
points for the kernel (which, presumably, is going to go tits-up
if it needs to say anything). Should probably build this before
setting up interrupt handlers.

Each thread needs its own kernel stack. The general prologue for
interrupts and exceptions will push a bogus error code as needed
to normalize the stack depth followed by the integer registers.
This lets us use the kernel stack as the CPU context structure
for tasks not running on a CPU. The kernel needs a yield-to-
thread operation whcih changes the ESP stored in the TSS, FS,
and the current stack and frame pointers. There's something to
be said about the scheduler, the debugger, and single-stepping
instructions, but I'm not sure what yet.

  1.) Add in and out instruction definitions.
  2.) Create I/O port access VOPs.
  3.) Move the hardware cursor.
  4.) Transliterate the Forth display driver into Lisp.
    a.) Create memmove and memset VOPs.
  5.) Lay out initial interrupt/exception stubs.
  6.) Make compiler type propagation work.
--------------------------[ Page Ten ]--------------------------
It's been just over a week. My standalone SBCL project has hit a
snag with the cross-compiler and its pathological inability to
manage type information. Demonstrating the issue is not hard.
Fixing it might be. Thus far, I have an interesting proof-of-
concept loader, and some good design work, but nothing truly
compelling. The critical point would be working sb-show and
running part of cold init. That would be interrupts, display
drivers, page faults, and disk driver at the least. Also means
further-hacked genesis and a filesystem driver in the Forth
system. A good couple weeks work, at least. No, we don't need
the filesystem driver. We can treat the HD as we do the floppy,
and fake up a single extent for the core file and another one
for swap. Still need the rest of it.

Another possibility would be to back and fill. We have a Forth
system, why not use it? We lose ideological purity, but wouldn't
need to hack genesis, a disk driver, a display driver, or other
kernel stuff. Somehow, I'm not convinced.

This past week has brought a number of project ideas to mind,
with varying levels of ambition. From blogging to webstores, to
multiplayer web games, to stripping out and rebuilding the
Python compiler to an auto-newbie-helper bot for IRC.

It may be time for me to work on something else. The path ahead
for OS work is fairly clear, and I don't want to burn out on it.

Having thought about it, this compiler lossage means that one of
my preconditions for SBCL/standalone doesn't hold. We can't
really continue until it's sorted out.
-------------------------[ Page Eleven ]------------------------
The compiler thing turned out to be partially me not declaring
types as well as I should have and partially cross-float-
infinity-kludge. Both problems were simple enough to solve. I
managed to screw up loading the IDTR as well. An IDT limit of 0
doesn't make for happy page faults.

To make things easier, I intend to have an idle thread which is
always runnable and does (loop (wait-for-interrupt)), and has
its function in kernel space. We'll need one per CPU. If we run
the GC scavenger as another scheduled task, we might be able to
arrange things such that we never take a page fault in ring 0.

Our guiding philosophy should be to keep things as unprivelidged
as possible, with some exceptions. Anything that can run in ring
3 does so. Anything that can be paged out is subjec to demand-
paging. Read-only space, static space, and kernel space use
wired memory pages. We run as little as possible in ring 0, and
as little as possible in wired memory. Taken to an extreme, the
ring-0 code would just handle interrupts, and would do so by
switching threads to a per-cpu manager thread. We may get there
someday. This effectively gets us PCLSRing "for free" if we
implement it. And we should.

Positing the existence of a REGISTER-KERNEL-FUNCTION function
which moves a given function into wired space, treating the
initial core as oldspace, linking the kernel functions first (so
that they are towards the beginning of dynamic space), and just
having the loader read up the first few megs of the cold-core
might be a good way to start the system.

With the minimal ring 0 proposed above, we could have the loader
start us off in ring 3, and do all the setup from there.
-------------------------[ Page Twelve ]------------------------
Our interrupt stack frames look like this:
  SS  \  <- [ TSS ESP0 - 4 ]
 ESP   |
EFLAGS >- from priv transition
  CS   |
 EIP  /
Error code  - from exception or handler
 EAX   -----\
 ECX         \
 EDX          |
 EBX          >- from pushad
TSS ESP0 - 24 |
 EBP          |
 ESI         /
 EDI   -----/

Below EDI we can push whatever we want (interrupt/exception
number, for example). It's a fixed format, and we could put an
array or structure header on it for easy access from Lisp.
------------------------[ Page Thirteen ]-----------------------
Having looked at the gencgc source, I believe that memory
management will turn out to be fairly straightforward. Most of
the scav work can be done from a dedicated thread, giving us an
incremental system.

The "page" structure of gencgc is similar to the region-bits
array on a classical LispM. The "alloc_region" structure is most
of the rest of the region setup on a LispM. With direct access
to the hardware page tables, we don't need all of the same
information as gencgc does.

Because a booted core includes a page table and whatnot, it will
have a substantially different structure than a cold-core. We
can take advantage of this in a couple ways. First, a booted
core can contain multiple threads. Second, because we have the
page tables, we no longer need to keep track of the bounds of
dynamic space in the core header. Third, abusing GC primitives
as part of system bootstrap makes even more sense. We hack
genesis to include a "kernel space" containing some set of
critical files. On startup we scavenge the kernel functions and
related constants to their resting place (maybe in static
space?), and scav !cold-init to a fresh allocation region. With
the rest of the cold-core as oldspace, we can continue on our
way, and the incremental GC will sort out the mess.

One problem with incremental GC is that we don't have a
traditional read barrier. In practical terms, this means that
any page which can be read by the mutator must contain only
pointers to newspace or copyspace. We can arrange for the GC
thread to be able to read otherwise-restricted pages either by
running it in ring 1 or 2, or by having it map the pages into
memory elsewhere for access (useful for writing to read-only
space as well.)
------------------------[ Page Fourteen ]-----------------------
Kind of random, but the missing piece for my frame-return hack
was the binding pointer. If we save it in a dedicated slot in
the stack frame, as part of XEP processing, it'd be fairly heap
to have. rtoy points out that this doesn't help known-fun calls
within a component.

If we can non-conservatively scavenge the control stack, by
means of code annotations or whatever, we can turn around and
incrementally scavenge the stack (either by page protection or
return address rewriting). Or both techniques if stacks are
subject to swap-out. If, through code annotation, we don't have
to conservatively cavenge the register set, then we can lose the
dont_move page flag and all its implications.

For GC, we need to know what the system looks like before and
after flipping, what is done while flipping, how scavenging
occurs, what happens when we page fault, what our invariants
are, and so on.

One of the invariants of a Baker-style incremental GC is that
the mutator must never see a pointer to oldspace. On the old
LispMs, this was accomplished by means of hardware. The page
tables contained an oldspace bit, and it was easy for the
microcode to check this bit for a given address. We don't have
that luxury. We need to make sure that any live data on a page
that is readable by a nrmal user task doesn't contain any
oldspace pointers.

We should probably have an operation mode where we only do
stop-and-copy (non-incremental) GC. It would in many ways be
easier to implement, and would provide a different performance
profile.
------------------------[ Page Fifteen ]------------------------
A conservative root points to an object that may be live, and
that we don't dare move. We can't move the object because it
might be an unboxed integer, and modifying that would be
obviously wrong. We can't discard it, since it might be a live
poointer. But we are under no such constraints for any other
object on the page. We can count the one object as unscavenged
newspace, and the rest of the page as oldspace. We don't even
need to worry about stray references to the rest of the page
because the mutator can't see pointers to oldspace. After we
flip, if the object is still a conservative root, we can put
alloc regions around it, as the oldspace regions of the page
have been evacuated.

Let's call "unscavenged newspace" copyspace. We then have three
spaces. Newspace is all live data and safe-for-the-mutator
pointers. Oldspace is all untransported objects and forwarding
pointers. Copyspace is all live data, but contains pointers to
oldspace. Because we are using page protection for our read
barrier, we must scavenge copyspace a page at a time.

In order to do a GC flip, we must stop all of the CPUs, scavenge
the conservative roots, scavenge the non-conservative roots,
create new alloc regions for each threads (closing the old
ones), and resume where we left off.

We also have multiple generations to handle. We write-protect
newspace for the older generations, as they effectively provide
roots for younger generations, and we don't want to have to scav
the entire memory space.
------------------------[ Page Sixteen ]------------------------
The different page types for GC are:
  1.) Oldspace
  2.) Newspace
  3.) Copyspace
  4.) Static Space
  5.) Old-generation newspace

Note that, with conservative roots we also can have a page that
is both oldspace and copyspace. Such a page is protected as
copyspace, and hang any transport complexity.

Space types 1 and 2 are r/w. For oldspace, this is to make
transport easier. Space type 3 is no-access in order to
implement our read barrier. Space types 4 and 5 have r/o or r/w
access depending on volatility.

The GC needs to be concerned with fault activity on space types
3, 4, and 5, as they are the non-r/w spaces. For a fault in
space 4 or 5, we either emulate the write (and optionally record
it as a young pointer), or mark the page as volatile (to be
scanned for roots at flip). For a fault in space 3 (copyspace),
we have to scavenge the entire page, transporting any oldspace
references, and then promove the page to newspace.
-----------------------[ Page Seventeen ]-----------------------
All physical memory pages are either wired, in-use, or free. The
free pages should be held on a list for the page allocator. The
in-use pages should be stored on an LRU queue (more likely a
not-recently-used queue), and need to have some way to quickly
find the associated virtual memory address when they need to be
swapped out.

Let's take a "modern" 32-bit x86 for example. Give it two gigs
of RAM (not too unusual these days). That's #x80000 pages (512
thousand or so). At 8 bytes per page of tracking information
(LRU or free chain and VMA, with a few bits for flags), that's
#x400000 bytes (4 megs) to cover. We can afford that. As a
practical matter, we'll run out of address space before we use
up that much RAM outside of machine virtualization.

Memory requirements for tracking virtual address space aren't so
easily calculable without a more complete design for the garbage
collector.

SLAD essentially becomes a variant on syspend-to-disk. Unlike
SBCL proper, we keep all of our threads around. The main
difference between SLAD and suspend-to-disk is that SLAD copies
everything over to a new core, and suspend-to-disk can "just"
flush everything out to swap space. The underying mechnism for
both will probably take an extent list for the target disk
areas, rather than deal with the filesystem directly. SLAD may,
in fact, not kill the system. It could save off to the new core
and then continue running with the new core and old swap space.
------------------------[ Page Eighteen ]-----------------------
The first thing the system manager thread does is dispatch on
exception/interrupt type. That can be taken care of with a
48-entry function array. Interrupts get shunted to device
drivers; most exceptions get shunted to the debugger task;
breakpoints, page faults, and the floating-point task-switched
exception get special care. The fp trap is fairly
straightforward to handle, we can defer the details.

Breakpoints are used as a shorthand register-preserving function
call for non-speed-critical parts of code (exceptional
situations, really). They have meanings for error handling,
debugging, and we'll probably use them for GC and thread
synchronization.

Page faults have the fun possibilities of needing to request a
disk read (with a pagetable update and thread restart upon
completion), to request a page scav from the GC thread, or to
request a disk read with a pagetable update and page scav
request on completion. We need to use proper synchronization
primitives and track pending in-page operations to cover the
cases of multiple threads blocking on the same page. The
scaenger should also be able to request a page-in without
blocking, and a user thread likewise for a scav or page-in.

Because of the needs of the scavenger (should remain ready for
scav requests while waiting for a page-in of oldspace), serve-
event-based networking, and any other fancy synchronization
needs, we need something like WaitForMultipleObjects fom Win32.
------------------------[ Page Nineteen ]-----------------------
A perusal of the wineserver source shows synchronization objects
having waitqueues of interested threads, threads having "waits"
of waitqueue entries they are interested in, and timeout
structures. The wineserver makes for an interesting guide to the
concepts, but is a single-threaded application, and so doesn't
include the spinlocks required for multi-CPU operation. And
waiting on a spinlock and modifying N waitqueues can become a
long time to lock out interrupts. Or perhaps not, given that we
make no realtime interrupt response guarantees as of yet.

A "user" synchronization request, such as serve-event or
WaitForMultipleObjects can allocate the needed "wait" on a wired
stack page or two. System requests such as an in-page or
scavenge request should use statically allocated waits in the
thread structure.

I/O completion notification should obviously be done using the
thread synchronization system.

Disk I/O requests are built around an event synchronization
object, used to signal completion (either success or failure).
To simplify synchronization, when a page fault is received on a
paged-out copyspace page, it is up to the scavenger to request a
page-in.
-------------------------[ Page Twenty ]------------------------
We shouldn't need to put the compiler in the cold-core. We need
to build it with the cross compiler, but it should be loadable
by the fasloader. An awful lot of the rest of the system should
be similar.

Many toplevel forms are simple funcalls. This is the observation
behind the fopcompiler. It should be possible to have genesis
handle such calls via a modified cold-toplevels mechanism. This
should reduce the number of compiled functions used only for
toplevels, which was one of my concerns with a separate heap
space for the os kernel.

Another genesis-related change to consider is dumping cold-init
related data in a custom space and use a set of known memory
locations as pointers instead of symbols which are just going to
get discarded anyway.
-----------------------[ Page Twenty-one ]----------------------
The prototype task creation routine works quite well. It
initializes the new task as if the initial function had just
been called. No need for a stub or anything. One stub we might
want, though, is for when a thread returns from its initial
function.

I also implemented a polling-mode ata hard disk driver. It works
for reading from the disk, and is untested for writing. It is
the basic mechanism required to page data in from the disk.

It turns out that the HLT instruction isn't valid for ring 3,
even at IOPL 3. The idle task will have to be set up to not do
an interllevel return (quite possible), and just sit in a loop
waiting for an interrupt. Because it still does an interrupt
return, the process state will still be valid when it receives
an interrupt. Were it not for the idle task having to run at
ring 0, I'd be tempted to have the scheduler run in ring 1 and
abuse the nature of the interlevel returns to do the context
switching.

Looks like the only way to get to ring 0 with the stack set up
correctly is to take an interrupt. Scratch that. A 1-arg call
gate and a pushfd/call/pop sequence should do the trick, and
since the manager thread runs without interrupts we don't have
to worry about taking one in ring 0.

It turns out that there are a number of instructions that may
only be run in ring 0. Looks like we will need a syscall
interface after all.
-----------------------------[ End ]----------------------------