Copy-on-write trees introduced as "side-branching".
second-tier cache: placing blocks into the second-tier cache not when they are _referenced_, but when they are _evicted_ by the first tier.
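A minimal sketch of the policy (hypothetical cache/block types, not any particular system's API): the second tier is fed by first-tier victims, so the tiers stay mostly exclusive instead of duplicating each other's contents.

    #include <stddef.h>

    struct block;

    struct cache {
        struct cache *next_tier;    /* NULL for the last tier */
        void (*insert)(struct cache *c, struct block *b);
    };

    /* First tier: a block is inserted here when it is *referenced*. */
    void tier1_reference(struct cache *t1, struct block *b)
    {
        t1->insert(t1, b);
    }

    /* First-tier eviction hook: the victim enters the second tier *now*,
     * on eviction, not back when it was first referenced.  The second
     * tier therefore sees exactly the first tier's reuse misses. */
    void tier1_evict(struct cache *t1, struct block *victim)
    {
        if (t1->next_tier != NULL)
            t1->next_tier->insert(t1->next_tier, victim);
    }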
To avoid "cascading aborts" it's enough to delay read operations: indeed, when transaction is aborted, and values it has overwritten are reverted back to their original state, there is not need to abort operations (and hence transactions) that also wrote into these values.
Can we apply the notion of epochs/windows to an individual local file system? E.g., introduce multiple independent journals, written back concurrently and synchronized at the start. Per-object journals (like in the Cambridge File Server)?
Good bibliography
- no explicit file layouts on the MDS. Instead, layouts are computed from the file identifier and stripe pattern. Quite like our storage pools, it seems. (See the sketch after this list.)
- (clustered) MDS collects exponentially decaying load averages for subtrees (accumulated all the way up to the root), and uses them to dynamically re-balance the namespace.
- has something like IBIT locks:
  * immutable attributes (like layout);
  * security attributes (owner, mode);
  * file attributes (size, mtime).
- asynchronous commit on the OSD, including keeping uncommitted data on the client.
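The attraction of computed layouts is that any client can locate any stripe of any file without asking the MDS. A toy version of such a mapping (purely illustrative; a real system derives placement from a cluster map so that it survives reconfiguration, which this does not):

    struct stripe_pattern {
        unsigned stripe_size;   /* bytes per stripe unit */
        unsigned stripe_count;  /* stripe units per row  */
    };

    /* Which storage target holds the byte at 'offset' of file 'ino'? */
    unsigned target_for_offset(unsigned long long ino,
                               const struct stripe_pattern *sp,
                               unsigned long long offset,
                               unsigned num_targets)
    {
        unsigned long long unit = offset / sp->stripe_size;
        unsigned long long slot = unit % sp->stripe_count;

        /* mix the file identifier in so files start on different targets */
        return (unsigned)((ino * 2654435761ULL + slot) % num_targets);
    }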
- when joining leaves: always keep the lower-numbered block when joining any two blocks. This will shrink the directory over time (see the sketch below).
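A sketch of the join rule (hypothetical dir_block type and helpers): by always surviving in the lower-numbered block, merges drain entries toward the front of the directory file, so a trailing run of free blocks accumulates and can be truncated.

    struct dir_block { unsigned blkno; /* ... entries ... */ };

    void move_entries(struct dir_block *dst, struct dir_block *src);  /* assumed */
    void free_block(struct dir_block *blk);                           /* assumed */
    void truncate_trailing_free_blocks(void);                         /* assumed */

    /* Merge two half-empty sibling leaves; the lower block survives. */
    struct dir_block *dir_join(struct dir_block *a, struct dir_block *b)
    {
        struct dir_block *keep = (a->blkno < b->blkno) ? a : b;
        struct dir_block *drop = (keep == a) ? b : a;

        move_entries(keep, drop);
        free_block(drop);                   /* always the higher block... */
        truncate_trailing_free_blocks();    /* ...so the tail shrinks over time */
        return keep;
    }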
- inode "file": b-tree where inodes are keyed by ino. This can be used to scatter inodes through disk. E.g., store inodes close to directory entries.
Contains extensive bibliography on software correctness and language design
Compiler Technologies
- 1968: Berkeley CDC 6400.
- 2-level swapping and 2-level scheduling: disk -> Extended-Core-Store -> central memory.
- sub-process (protection domain): may execute code in other sub-processes of the same process. Sub-processes are organized into a tree. A parent has full control over its descendants (debuggers, emulators). Errors and interrupts are propagated up the tree until a handler is found. "We saw this as a generalization of the usual monitor-user mode facility."
- sub-systems, like the disk/directory system, create sub-processes in user processes. Something like the re-parenting done by window managers.
- capabilities, including user-definable types, controlled by user servers.
- radix-trees to represent (possibly sparse) files in-core.
- event channels.
- memory mapping without hardware assistance.
- compacting allocator. Its users store absolute addresses together with a compaction counter. If the counter doesn't match the one in the block, the absolute address has to be recalculated. (See the sketch after this list.)
- interaction between sub-systems (disk, io, even the memory allocator) is mediated by event channels. Presumably this is very helpful when memory is extremely tight (sub-systems can be swapped out).
- "locks" in directory entries. A process had to supply a matching "key" to access the entry's content. DR-DOS.
- hard-links and soft-links. Pseudo file system for system objects (a la /sys or /proc): "name-tags".
- a way to construct a "trusted" sub-process in a given process. A trusted sub-process has capabilities that the user doesn't have. Special directories provide "sub-process descriptor" capabilities, which the system uses to look up a file with a special name within said directory and extract from it the description of the trusted sub-process to be constructed. Something in the middle between suid executables and "agent threads" in MVS.
- file system using phase-trees.
- intent logging. Per-file.
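The compaction-counter trick from the list above, rendered in C (a sketch; all names are hypothetical, and arena_lookup() is an assumed helper that finds an object by its stable id):

    #include <stddef.h>

    struct arena {
        unsigned compaction_count;      /* bumped by every compaction pass */
    };

    struct handle {
        struct arena *arena;
        unsigned      id;               /* stable object identifier */
        void         *cached_addr;      /* absolute address, possibly stale */
        unsigned      cached_count;     /* compaction epoch of cached_addr */
    };

    void *arena_lookup(struct arena *a, unsigned id);   /* assumed */

    /* Dereference a handle, revalidating the cached absolute address if
     * the allocator has compacted since the address was computed. */
    void *handle_deref(struct handle *h)
    {
        if (h->cached_addr == NULL ||
            h->cached_count != h->arena->compaction_count) {
            h->cached_addr  = arena_lookup(h->arena, h->id);
            h->cached_count = h->arena->compaction_count;
        }
        return h->cached_addr;
    }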
From: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: lkwt in DragonFly
Newsgroups: gmane.os.dragonfly-bsd.kernel
Date: Fri, 6 Feb 2004 18:33:47 -0800 (PST)

First, Bosko, thanks for asking your question! You really got me thinking about our token implementation. I've been whining about it being broken on our mailing lists for quite a while and as I was formulating a reply to your email I had an epiphany. So I hope you don't mind me having CC'd your email and my reply to the kernel@ list!

:Matt,
:
:    Reading this:
:
:    Token Passing Primitives Not Mutexes
:    ...
:
:    Do you intend to have a single pcpu token on a given cpu protecting
:    all data structures (or a class of data structures)? The tradeoff of
:    the lesser granularity vs the admittedly much lower cost for a common
:    token acquisition may or may not be worth it in the end. (The other
:    part of the tradeoff is that you can never go with full preemption.)
:    I did understand this correctly, right?
:
:Cheers,
:--
:Bosko Milekic * bmilekic@technokratis.com * bmilekic@FreeBSD.org

Not entirely. First, please note that the current token implementation does not work very well... it's a mess to use because cpu #B can steal away cpu #A's token by sending cpu #A an IPI message requesting the token. This means cpu #A has to be in a critical section to 'protect' the token, and must re-acquire the token (in case it was lost) after it blocks.

After thinking about this quite a bit I have come up with a solution that, in a moment of enlightenment, would make our tokens operate like self-contained little RCU implementations (superior to Linux's implementation, in fact!).

I intend to fix our tokens by not allowing an IPI to steal a token until the owning cpu has gone through a thread switch, or by explicit release. This means that, once fixed, our token implementation will allow us to abstract RCU-like operations, which I think is very important. It is not what I originally intended tokens to be but it is turning out to be what they should be.

The nice thing about the revamp I intend to do on the token code is that the RCU abstraction will wind up being a lot better than Linux's RCU abstraction, because it will be compartmentalized. Only cpus competing for the same token are affected rather than all cpus in the system. We would not have the serialization problem that Linux has.

Tokens will be used in a manner similar to how mutexes are used in FreeBSD-5, except that in our case (with the revamped token code) DFly will be able to hold any number of tokens across a blocking situation without creating deadlocks (but also not guaranteeing atomicity across the blocking situation). I think this is exactly the sort of abstraction we need. Mutexes in FBSD-5 and tokens in DFly are primarily used as interlocks on structures, for which no blocking occurs, or in situations where the blocking condition is well managed (e.g. lockmgr). I feel that my new token (RCU-like) implementation will wind up being far superior to FreeBSD-5's mutex implementation for 90% of these situations, and it will not suffer from the deadlock potential that mutexes have because it will not guarantee atomicity across a blocking call.
Also note that several major DragonFly subsystems (LWKT scheduler and SLAB allocator) are totally mutex & token free, and things like the network stack will also be mostly mutex & token free because it will assign PCBs to particular threads (so operations on those PCBs are only done by a particular thread, which in turn means no locking of the PCB is required at all).

The performance trade-off is basically the ability for the cpu owning a token to operate on the token without having to use bus-locked instructions, versus having to send an IPI message to the owning cpu when the requesting cpu does not own the token. This is not the major reason for using tokens, though. Still, judging by the benchmark comparisons we've done between 4.x and DFly, the overhead of a 'miss' appears to be very small. It turns out that most locks (like vnode locks) hit the same-cpu case a *LOT* more than they hit the foreign-cpu case, so the overall benefit is at least equivalent to FreeBSD-5's mutex overhead. It certainly isn't worse. We do lose when it comes to IPC operations between two processes running on different cpus, but if this is the only thing that winds up being a problem it is a simple enough case to fix.

However, the major reason for using tokens is that it creates an abstraction that allows us to execute many types of operations as if the system had only one cpu. Since a token is owned by a cpu and not a thread, if thread #1 obtains a token and blocks, thread #2 running on the same cpu can obtain that same token optimally. Likewise, thread #3 running on a different cpu can 'steal' the token (but it only gets transferred on a scheduler boundary of the first cpu once I fix the current token code). We also have the advantage of being able to use this abstraction heavily without compromising UP performance, because on a UP box a blocking situation implies a scheduler switch, and since tokens are per-cpu... well, the whole thing devolves into a NOP.

I have some work to do to move my current token implementation into this new paradigm, but I do not expect it to be difficult. For example, if cpu #B steals a token from cpu #A then the threads that own the token on cpu #A can't be scheduled to run again until cpu #B is through with the token. This means actively accounting for and structurally linking the dependency information.

Whew! That may seem complex, and in a mutexed environment it would be very complex. But remember in DFly the whole basis of using this abstraction is to be able to operate almost as if one were on a single cpu even when one is on a 16-cpu system, so the actual structural accounting does not really have to worry about interrupts or stale data or other cpus. A critical section is sufficient to operate on token meta data on the owning cpu, and we can actually defer structural linkages within tokens we own until a thread switch occurs (which won't happen in most cases, since like a mutex a token is obtained and released without blocking in most cases).

I think I rambled on there. Please feel free to ask additional questions!

-Matt
From: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: lkwt in DragonFly
Newsgroups: gmane.os.dragonfly-bsd.kernel
Date: Sat, 7 Feb 2004 13:13:19 -0800 (PST)

:> RCU = Remote Copy Update, right? Just asking, so I can find the right
:> paper to read more about it.
:
:Read-Copy-Update. It's a read-mostly algorithm that's used to cluster data
:structure writes/changes, so that it minimizes bus traffic across processors.
:
:http://lse.sourceforge.net/locking/rcupdate.html
:
:What Matt is talking about is merging a kind of quiescence operation into dfBSD
:tokens, if I understand correctly.
:
:bill

Right. My new idea for the token API gives us a scheduler-level serialization abstraction that works across the whole system. It would kinda be like having multiple 'MP' locks, in that it would have the same sort of side effects as an MP lock.

The traditional serialization abstraction is the concept of a 'mutex', which is what FreeBSD-5 uses. Linux tries to avoid mutex overheads by using RCU (anyone can read data without having to worry about it being changed out from under them, but modifications must be synchronized across all cpus at a scheduler boundary). Right now we try to avoid mutex-like overheads through our per-cpu model (e.g. LWKT scheduler and SLAB allocator) and through our per-cpu tokens.

The problem with our current token abstraction is very similar to the problem FreeBSD-5 has with its mutex abstraction. In our case, a token can be 'lost' when a thread blocks or gets interrupted outside of a critical section, and the thread must carefully code re-acquisition of potentially lost tokens because of this. In FreeBSD-5's case a mutex is not allowed to be held across a blocking operation at all, and so the use of mutexes must be carefully coded... adding significant complexity to deeply nested APIs. Mutexes also have lock ordering issues (deadlock potential).

My new token idea takes the original per-cpu idea and abstracts it into a serialization model. We could, in fact, even support 'reader' and 'writer' refs to tokens in order to allow multiple readers to hold the same token at the same time.

The key difference between the new token idea and, say, a standard lock, is that the token is only in effect while the thread that acquired it is running. When the thread blocks or switches away, the tokens that thread has acquired can be borrowed by other threads (on the same or on other cpus) and must be returned before the original thread can resume. So, unlike mutexes or locks, a token only guarantees run-time serialization. A token also has no deadlock issues and no lock ordering issues at all.

Like mutexes, code that uses tokens must be aware of potential blocking conditions (because atomicity is lost across a blocking operation), but unlike mutexes the code that uses tokens does not have to release the tokens across the blocking operation. I believe this to be a VERY IMPORTANT feature of the new token design because it means we can (A) hold multiple tokens across a blocking situation, (B) higher level procedures do not need to pass token references to lower level procedures that 'might block', they need only be aware that the lower level procedure might block, and (C) the new token abstraction leaves open the possibility of major implementation optimizations because it devolves nearly into a NOP on UP systems. E.g. we can support per-cpu 'caching' of the token after it has been released, allowing us to avoid locked bus cycles as well as memory barriers in the cache case.
So how does this tie into the RCU concept? Simple, really... RCU turns out to be a special case of the new token abstraction, that's all. In the new abstraction, if you had a token T and every thread in the system obtained 'read access' on T, then any thread which then obtains 'write access' on T would be serialized against all the threads with read access at the scheduler level. If we wanted to implement Linux's 'queued' RCU API then we would queue structural changes directly to a thread that obtains 'write access' on T, and this translates directly to updating the data during quiescent-scheduler periods.

I actually don't like Linux's queued RCU API, I am just saying that it could be implemented natively on top of the new token abstraction. We would use our tokens in a more active manner. E.g. instead of holding a read token and queueing write updates to another entity that holds the write token, we would obtain a write token and do the update in real time, then release the write token. Nothing prevents us from doing it both ways, though :-)

So what does this mean? This means that the new token abstraction will not be a replacement for standard reader/writer (lockmgr) locks, but it certainly would be an excellent replacement for mutexes, because it has far fewer side effects, requires less knowledge of the lock state to be passed down the subroutine chain, and is far more flexible.

-Matt
Matthew Dillon <dillon@backplane.com>
From: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: lkwt in DragonFly
Newsgroups: gmane.os.dragonfly-bsd.kernel
Date: Sat, 7 Feb 2004 18:01:55 -0800 (PST)

: How do you determine whether another CPU holds the token when your
: thread on the current cpu wants/needs it? Or do you always
: synchronously IPI the other cpu when you want a token for writing?

The cpu owning the token is stored in the token structure. No bus-locked instruction is required to test (tok->gd == mycpu) because only we can give a token away to another cpu, which synchronously updates our cpu's cache to set tok->gd != mycpu, and only the cpu owning the token can hand it off to us, which means that the contents of tok->gd will only be read back as == mycpu if another cpu has handed it off to us, and once another cpu has done that only our cpu can give it up.

    struct lwkt_token {
        globaldata_t gd;        /* cpu owning token */
        ...
    }

So, obtaining a token in the cache case could be as simple as this:

    lwkt_gettoken(lwkt_token_t tok)
    {
        globaldata_t gd = mycpu;

        if (gd->gd_cachetok)    /* deal with previously fast-cached tokens */
            ...
        gd->gd_cachetok = tok;  /* interlock against IPI */
        if (tok->gd == gd)      /* optimal case, done */
            return;
        ... fall through to less optimal code ...
        ... leave gd_cachetok set to tok indicating that we want the token ...
        ... send_ipi to tok->gd requesting 'tok' ...
    }

    lwkt_reltoken(lwkt_token_t tok)
    {
        globaldata_t gd = mycpu;

        if (gd->gd_cachetok == tok) {
            gd->gd_cachetok = NULL;
            return;
        }
        ... less optimal code ...
    }

Now, I am not entirely sure what the %fs:gd overhead is to load 'mycpu', but ignoring that issue for the moment, the rest of the code will execute optimally within the processor's pipeline. (Also, in DFly, since threads cannot be preemptively switched to another cpu, the contents of 'mycpu' is effectively fixed and can be cached by a procedure.)

The key issue here is that the optimal case becomes nothing more complex than a few lines of standard C code, easily inlined and easily pipelined by the cpu.

Another key performance issue is the fact that since the tokens do not actually create deadlockable situations, a great deal of code where you might be obtaining and releasing several mutexes in FBsd-5 should devolve into just a single token in DFly. In particular I am thinking of cases within the lockmgr, cases where a mutex must be passed as an argument in FBsd-5 so it can be released and reacquired in a deeper procedure that blocks, and cases where you have to pussyfoot around with mutexes in FBsd-5 to deal with lock ordering issues (e.g. unlock B, lock A, relock B). I think that if you were to compare the overhead of a single mutex to a single token you would not see much of a difference... but the idea in DFly is to greatly reduce the number of operations that would otherwise have to occur anyway, so it would be a more fair comparison to compare two or three mutexes against a single token. (Ultimately, of course, only a real live test can tell whether there is an actual performance benefit.)

: If you always IPI for a write, then all accesses to data structures
: protected by tokens need to be classified into either reads or
: writes.
:
: What about if you have multiple CPUs holding a token for reading only
: and then one comes along and wants to have the token for writing?
: What exactly happens? Do you have a way of figuring out whether
: another CPU holds the token for reading and an IPI is required, or do
: you always IPI everyone?
Ok, so the question is how do we detect when a token can or cannot be given away to another cpu? It turns out to be fairly simple for the exclusive case: since only one cpu can 'own' the token, if you are on that cpu you basically do not have to do anything because you already own the token (remember, the token is not a lock, only a serializing entity). If you are on a different cpu you send an IPI request to the cpu owning the token (tok->cpu). The IPI code on the target cpu simply checks to see if the token is being held by the currently running thread. There are two ways of doing this: first, if we take the simple case where a thread only owns one token, the token may be held in gd->gd_cachetok. Otherwise the token would be held in the current thread's token array:

    /*
     * IPI callback (basically like a FAST int, entered with us in a
     * critical section):
     */
    get_token_remote(tokreq_t req)
    {
        globaldata_t gd = mycpu;
        token_t tok = req->tok;

        if (tok->cpu == gd) {
            if (gd->gd_cachetok != tok)
                tok->cpu = req->cpu;
        } else {
            send_ipi(tok->cpu, get_token_remote, req); /* message chasing */
        }
    }

Now, that's a simplification. If a thread can own multiple tokens then we would store them in an array in the thread and we would have to iterate through the array... but since a thread will either own just zero or one token 'most of the time', the performance case is the zero or one token case, not the multiple token case.

MULTIPLE READER TOKEN CASE

Ok, now multiple-reader tokens (RCU style serialization). I think this is a situation where we would need to have a cpu mask embedded in the token structure and then use locked bus instructions to set and clear bits in the mask. However, there is no requirement that we actually clear the bit in the mask when we release the token, so the performance case for RCU would be able to run without any locked bus cycles. The trade-off is that if another cpu wants to get the token *exclusively* (exclusive serialization) it would have to send an IPI to all the cpus in the mask, which might include cpus that do not actually own the token.

In fact, we definitely do not want to clear the cpu bitmask bit in the token when releasing it, at least not immediately, because that would greatly reduce thread switching performance (the LWKT scheduler would have to run through the tokens owned by the old thread and release their cpu bits, then set the cpu bits on the tokens owned by the new thread). So:

    lwkt_gettoken_shared(lwkt_token_t tok)
    {
        globaldata_t gd = mycpu;

        gd->gd_cachetok = tok;  /* interlock against IPI on current cpu */
        if ((tok->cpumask & (1 << gd->gd_cpuid)) == 0)
            atomic_bus_locked_bit_set(&tok->cpumask, ...);
        /* owned by current cpu (exclusively or shared), we are done */
        if (tok->cpu == gd)
            return;
        /*
         * owned by another cpu, do more work to see if it is exclusively
         * owned by the target cpu, perhaps by sending an IPI
         */
        ...
    }

Remember that for RCU operation obtaining 'read' access is either inherent or very cheap, and obtaining write access is expensive.

: So this means you will have locked bus cycles for token acquisitions
: UNLESS the token was last held by the current CPU? How do you
: determine whether the token is cached locally? A critical section
: protecting the check of a flag so as to ensure an incoming IPI does
: not preempt your check?

This can be done by interlocking with a globaldata field, without using a locked bus cycle or a critical section.
    globaldata_t gd = mycpu;

    gd->gd_cachetok = tok;  /* prevent IPI from giving token away */
    if (tok->cpu == gd)     /* THEN check if our cpu owns the token */
        return;             /* yes, done */
    /* no, request token via (expensive) IPI */

The IPI code runs on the target cpu, so if we own the token already then we need only set gd_cachetok, and any IPI sent to us will check it and refuse to give the token away to another cpu. For the multiple token case the IPI code can do:

    for (i = 0; i < td->td_numtoks; ++i)
        if (td->td_tokary[i] == tok)
            return;         /* cannot give token away */

(td == current thread)

This sort of interlock works because all the potentially conflicting operations are running on the cpu owning the token and ONLY the cpu owning the token, so no locked bus cycles are required and only a simple interlock is needed to deal with IPI interrupts interrupting our code just as we are getting or releasing a token.

If the IPI request cannot be satisfied, the IPI code can queue the request to the token:

    req->next = tok->reqbase;
    tok->reqbase = req;

And then the release code can check for tok->reqbase != NULL and satisfy the pending request(s) (or the next pending request). Pulling a request off the list would have to be done inside a critical section, since an IPI can add more requests to the list, but this is not a performance critical case.

: costs for acquiring a token for reading and writing will be? In most
: cases, from what you said, I'm expecting a typical acquisition to
: consist of a critical section at best (for a read or if the token is
: cached on the local cpu), sometimes a bus-locked instruction, and
: sometimes IPIs? Pardon the ignorance, but not yet having looked at
: your implementation, it is still unclear to me as to what a typical
: token acquisition/deacquisition will look like in DragonFly.
:
:Cheers,
:--
:Bosko Milekic * bmilekic@technokratis.com * bmilekic@FreeBSD.org

Best case is as I outlined above... no critical section or bus-locked operation is required at all, just an interlock against an IPI on the current cpu, and that can be done by setting a field in the globaldata structure.

In FreeBSD-5 this sort of interlock could very well be a bit more expensive, because in FreeBSD-5 you have no choice but to use a critical section to prevent your thread from being preemptively switched to another cpu, and even if you could get away without a critical section you would still have no choice but to use integrated %fs:gd_XXX accesses to access fields in your globaldata (again because outside of a critical section your thread can be preemptively switched to another cpu).

The preemptive switching in FreeBSD-5 is the single biggest problem it has. It makes it impossible to implement optimal algorithms which take advantage of the per-cpu globaldata structure, because you are forced to obtain one or more mutexes, OR use a critical section (which is a lot more expensive in FreeBSD-5 than in DFly), no matter what you do. It's a cascade effect... and circular reasoning too. We want to use mutexes; oh no, mutexes obtained by interrupts may conflict with mutexes obtained by mainline code; oh gee, that means we should do priority borrowing and, wow, that means that one mainline kernel thread can preempt another mainline kernel thread indirectly; oh fubar, that means that all of our globaldata accesses have to be atomic (no caching of 'mycpu'), so we can't easily access per-cpu globaldata based caches without using a mutex. See? Circular reasoning.

-Matt
Matthew Dillon <dillon@backplane.com>
From: Matthew Dillon <dillon@apollo.backplane.com>
Subject: lwkt_token progress
Newsgroups: gmane.os.dragonfly-bsd.kernel
Date: Sat, 21 Feb 2004 12:30:09 -0800 (PST)

Well, after much hair pulling trying to fit the new token serialization code into the same API I had before, I finally gave up and decided to change the API. The issue comes down to how to record the tokens that a thread has acquired. Since tokens are simply serializing entities, multiple threads can own the same token (they just can't be running at the same time when they do), so we can't just use a linked list embedded in the token's structure. I tried creating a fixed array of pointers in the thread structure. I tried creating a cache of token-referencing structures in the globaldata structure. I was not happy with either method. What I finally came up with is to have the caller pass a token reference structure that must remain valid for the duration of the hold. This allows the caller to declare the reference structure on the stack and allows the token API code to avoid using a critical section, making the code very fast.

This means I have to make a bunch of changes to all the existing code that calls into the API, since the argument to the API functions is now a reference structure pointer rather than a token pointer.

On the bright side, the new serializing token API is shaping up to be a very powerful tool. So powerful, in fact, that I believe it will help us get out from under the BGL much more quickly than I originally anticipated. While the old token code was only 'similar' to mutexes, the new serializing tokens are going to be much, much more powerful than mutexes. The new serializing tokens are going to make mutexes look like a poor country cousin!

I'm hoping to have a patch set ready for testing tonight. Here is my current notion of the rules:

---- Serializing Tokens

A serializing token may be held by any number of threads simultaneously. A thread holding a token is guaranteed that no other thread also holding that same token will be running at the same time.

A thread may hold any number of serializing tokens.

A thread may hold serializing tokens through a thread yield or blocking condition, but must understand that another thread holding those tokens may be allowed to run while the first thread is not running (blocked or yielded away).

There are theoretically no unresolvable deadlock situations that can arise with the serializing token mechanism. However, the initial implementation can potentially get into livelock issues with multiply held tokens.

Serializing tokens may also be used to protect threads from preempting interrupts that attempt to obtain the same token. This is a slightly different effect from the Big Giant Lock (also known as the MP lock), which does not interlock against interrupts on the same cpu. Holding a serializing token does NOT prevent preemptive interrupts from occurring, though it might cause some of those interrupts to block-reschedule.

-Matt
From: Matthew Dillon <dillon@apollo.backplane.com>
Subject: More thoughts on tokens
Newsgroups: gmane.os.dragonfly-bsd.kernel
Date: Mon, 23 Feb 2004 11:49:17 -0800 (PST)

I've been doing a comprehensive review of how LK_NOWAIT is used with locks in DragonFly. It turns out that it is only used in about two dozen files, and most of the uses are related to buffer cache operation (softupdates being the most egregious misuser).

The interlock/race issue with most of the uses of LK_NOWAIT is that code which uses LK_NOWAIT does not expect to be switched out to another thread during, e.g., a lockmgr() operation, and so all the code in question seems to make assumptions in regards to the stability of the structure it is working on. Here's an example from the buffer cache code:

    if ((bp = gbincore(vp, blkno))) {
        /*
         * Buffer is in-core.  If the buffer is not busy, it must
         * be on a queue.
         */
        if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT)) {
            if (BUF_TIMELOCK(bp, LK_EXCLUSIVE | LK_SLEEPFAIL,
                "getblk", slpflag, slptimeo) == ENOLCK)
                goto loop;
            splx(s);
            return (struct buf *) NULL;
        }
        /*
         * The buffer is locked.  B_CACHE is cleared if the buffer is
         *...
        ...

gbincore() is not locking the returned buffer. The code above depends on code atomicity (i.e. the Big Giant Lock being held and no thread switches occurring) between the call to gbincore() and the call to BUF_LOCK(..., LK_NOWAIT). This is a good example of why I have to have the token code spin for the moment instead of yield.

The solution is equally obvious... and that is to pass a token reference structure to [gb]incore() and for [gb]incore() to return the buffer with its token held. E.g.:

    lwkt_tokref ilock;

    if ((bp = gbincore(vp, &ilock, blkno))) {
        ...
        if (BUF_LOCK(bp, &ilock, LK_EXCLUSIVE | LK_NOWAIT)) {
            ...
        }
    }

This way the token code would be able to yield without breaking the atomicity between the gbincore() call and the BUF_LOCK() call above.

[ If you love FreeBSD-5's mutexes, stop reading here ]

In contrast, FreeBSD-5's mutex solution is as follows:

    VI_LOCK(vp);
    if ((bp = gbincore(vp, blkno))) {
        int lockflags;
        /*
         * Buffer is in-core.  If the buffer is not busy, it must
         * be on a queue.
         */
        lockflags = LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK;
        if (flags & GB_LOCK_NOWAIT)
            lockflags |= LK_NOWAIT;
        error = BUF_TIMELOCK(bp, lockflags, VI_MTX(vp),
            "getblk", slpflag, slptimeo);
        ...
    }

This is a pretty good example of how NOT to implement an interlock API, so I'll rail on FreeBSD for a moment here:

* All users of [gb]incore() have to explicitly obtain the vnode's mutex as an interlock. In my version, the interlock users of [gb]incore() obtain is under the control of [gb]incore(), which gives us the flexibility to implement it either as a vnode-embedded token or as a buffer-embedded token, without having to pollute the higher level code with the knowledge of which one.

* The call obtaining the interlock has been separated out from gbincore(), and the call passing the interlock to BUF_TIMELOCK() is not directly linked to the VI_LOCK() call that obtained it. In my version gbincore() loads the supplied interlock reference structure, and this same structure is required to be passed to BUF_*(), so there is no knowledge pollution in the higher level code at all as to what the interlock actually is.

In any case, I know why FreeBSD has done things this way... it's because of the mutex model they are using, which I hate so much.
Since lock ordering is an issue with mutexes, most of FreeBSD's codebase needs to have very specific knowledge of the exact mutex(es) being manipulated throughout the entire call stack in order to avoid lock order reversals. Also, FreeBSD mutexes have to be released across blocking situations, and FreeBSD mutexes cannot be used recursively (at least not safely). This means that lower level procedures either must have complete knowledge about the mutex(es) held by higher level procedures that call into them, or must be guaranteed to be non-blocking. And, worse, when you *DO* obtain multiple mutexes in the FreeBSD-5 model, a deep yield makes it impossible for any other thread to obtain the higher level mutexes. In fact, for all intents and purposes a mutex in FreeBSD is almost exactly the same as a normal lock in its effect.

We do not have that problem. Our tokens do not have any (theoretical) lock order reversal issues (theoretical because my initial implementation can't guarantee that a livelock won't happen), our tokens can be held across blocking conditions, multiple procedural levels can obtain the same token independently (recursively) without specific knowledge of who is holding what, and a deep blocking or yield situation in DFly does not prevent other threads from obtaining the same serializing tokens while the first one is blocked, because our tokens are run-time serializing entities only, not lockmgr()-style (run-time and block-time serializing) locks.

So, I really do believe that our serializing token model is the better way to go. At the very least, the use of a reference structure is a far superior abstraction to the direct use of tokens or mutexes (what I had before and what FBsd has now).

It's a bit sad that FreeBSD has tied itself into such knots. With some work they could change their mutex model into something similar to our serializing token model, simply by recording the held mutexes and then allowing them to be held through a blocking condition (releasing them on switch-out and re-acquiring them on switch-in), along with getting rid of that utterly insane preemptive kernel thread switching model and preemptive cpu switching model. Just changing the internal mechanism is not enough, though, because they've already made major API changes that pretty much lock them into the current model. They would have to first admit that they have put themselves into a corner, then fix the API abstractions, before they can actually fix the core mutex abstraction.

-Matt
From: Matthew Dillon <dillon@apollo.backplane.com>
Subject: Re: scheduler
Newsgroups: gmane.os.dragonfly-bsd.kernel
Date: Sat, 15 Nov 2003 11:13:33 -0800 (PST)

:Hi!
:
:Since I'm very fond of the responsiveness effects
:of ULE in fbsd-current, I'd like to know how this
:very responsiveness is supposed to be achieved
:in dragonfly via userland schedulers.
:As I understand it, the kernel-level thread-scheduling
:is not to be touched at all, including time-slices.
:Also I understand that preemptive cpu-moves are against
:essential dragonfly design because of locking issues.
:(Which makes perfect sense to me, though I'm just an
: interested bystander/sysadmin.)
:I don't expect any complete explanations, just some pointers
:to relevant email-threads/documentation/ideas/whatever.
:
:I hope I don't sound too demanding...
:Just asking.
:
:Cheers
:Peter

Well, there are two things going on here... there's the userland scheduler and there is the LWKT scheduler. The LWKT scheduler is responsible for actually running a thread. It uses a fixed priority scheme, but you have to keep in mind that the fixed priorities are differentiating major subsystems, not user processes. For example, hardware interrupt threads have the highest priority, followed by software interrupts, kernel-only threads, then finally user threads. A user thread either runs at user-kernel priority (when it is actually running in the kernel, e.g. running a syscall on behalf of userland), or a user thread runs at user priority.

The user scheduler uses the dynamic priority mechanism from FreeBSD-4.x. It schedules *exactly* *one* user-runnable thread on each cpu in the system at a time, and it is responsible for dealing with the timeslice and forcing a reschedule to reschedule the next user-runnable thread. What this means is that from the point of view of a user process, preemption and dynamic priorities work as they always have. What has changed is that preemption and dynamic priorities *ONLY* apply to threads related to user processes now, and are not used to schedule kernel-specific activities such as interrupts. In FreeBSD-5 the same ULE mechanism is used to schedule both user and kernel activities, which imposes a great deal of unnecessary overhead IMHO.

This also means that we do not have priority inversion based deadlocks, because any thread that is in the kernel (whether working on behalf of a user process, e.g. running a system call, or not) runs at a fixed kernel priority, which is higher than the priority we use when running a process's thread that is currently in userland.

Be sure not to confuse user priority based time slicing with preemption. A program running in user mode can be 'preempted' just fine. It's when a thread is running in the kernel that preemption is either not allowed or follows careful rules. DragonFly does preempt... it just does it very carefully and only under particular circumstances. An LWKT interrupt thread can preempt most other threads, for example. This mimics what FreeBSD-4.x already did with its spl/run-interrupt-in-context-of-current-process mechanism. What DragonFly does *NOT* do is allow a non-interrupt thread to preempt another non-interrupt thread. FreeBSD-5 uses priority borrowing to allow virtually any thread to preempt virtually any other thread, either directly or indirectly. This is a huge mistake IMHO, one which allows for very sloppy interrupt thread programming. FreeBSD-5 will also preemptively switch a thread to another cpu.
DragonFly never preemptively switches a thread to another cpu if the thread is running in the kernel.

-Matt
(also posted to freebsd-stable)

I am starting work on cache_lookup() and related functions now in DragonFly. I expect it will take at least the week to get a prototype working, and I believe the resulting patch set will be fairly easy to port to FreeBSD.

The first step I am taking is to make the namecache more persistent. *ALL* active vnodes accessed via lookup() will be guaranteed to have a namecache entry. I have successfully booted a test system with this change, though I am sure there are bugs :-). I am not entirely sure what to do about the filehandle functions, but I am not going to worry about it for the moment.

What this basically means is that vnodes cannot be recycled in the 'middle' of the topology, only at the leaves. This does not present a big problem since files are always leaves. The namecache topology will be guaranteed to remain unbroken except in cases where the namespace is deleted (e.g. removing a file with active descriptors).

The second step will be to use a namecache structural pointer for our 'directory handle' in all places in the system where a directory handle is expected (e.g. 'dvp'). This will also involve getting rid of the vnode-based parent directory support (v_dd). Since the namecache structure has a pointer to the vnode, this is a pretty easy step. A little harder will be to fix all the directory scanning functions to use the namecache topology instead of the vnode topology.

The third step will be to use the namecache for all name-based locking operations instead of the underlying vnodes. For example, if you are renaming a/b to c/d you only need to hold locks on the namecache entry representing "b" and the one representing "d" prior to executing the rename operation. The one representing "d" will be a negative cache entry, placemarking the operation. This will not only completely solve the locking issues with rename(), remove(), and create(), it also completely solves directory recursion stalls in both directions, completely solves the race-to-root issue, solves most of the directory lookup stalls (cache-case lookups will not need a vnode lock and can run in parallel with directory operations that do need the vnode lock), gets rid of all name-related deadlock situations, and potentially allows modifying directory operations to become reentrant down the line.

The fourth step will be a *BIG* carrot... the namecache topology does not have a problem with vnodes appearing in multiple places in the filesystem. This means that (A) it will be possible to hardlink directories, (B) it will be possible to implement semi-hard links, basically softlinks that *look* like hardlinks, and (C) it will be possible to CD forwards and backwards without the system getting confused about paths. In other words, some way cool shit.

Additional optimizations are possible for the future. For example, it will be possible to cache ucred pointers in the namecache structure and thus allow namei() to *COMPLETELY* resolve a path without making any VOP calls at all, which will at least double and probably quadruple best-case path lookup performance.

I'll post an update after step 3, probably near the end of the week or on the weekend. I expect people will start screaming for step 4 now that they know it is possible :-)

-Matt
Matthew Dillon <dillon@backplane.com>
Matthew Dillon <dillon@apollo.backplane.com> writes:

[...]

> > So why hasn't it been done or, at least, why isn't it universal after all
> > these years?
>
> It's a good question. I think it comes down to how most programmers
> have been educated over the years. It's funny, but whenever I build
> something new the first question I usually get is "what paper is your
> work based on?". I get it every time, without fail. And every time,
> without fail, I find myself trying to explain to the questioner that
> I generally do not bother to *READ* research papers... that I build
> systems from scratch based on one or two sentences' worth of concept.

I cannot say for "most programmers", but as far as the few (mostly Linux) journalled file systems that I have experience with are concerned, I am sure the situation is slightly different.

The traditional journalling design that you are describing writes journalled information twice: into the log and "in place". As modern file systems tend to journal both data and meta-data, this doubles the amount of storage traffic. Besides, even with all the transaction batching, writes to the log require additional seeks. The latter may not be important for the situation you seem to have in mind, viz. keeping an off-site log accessible over the network, but this _is_ important for a normal local file system that, in the simplest case, keeps the log on the same device as the rest of the file system.

Attempts to improve the performance of journalling file systems in this regard mainly rotate around a cluster of (arguably very old) ideas called variously "shadows" (in the data-base world), "phase trees" (Tux2), and "wandering logs" (in reiser4). From the little technical information available on Solaris ZFS, it seems it also uses something similar. These solutions, by their very nature, require tight cooperation between journalling and the intimate innards of the file system, which mostly rules out any kind of "universal journalling layer". Add to this other optimizations, like delayed block allocation (which is almost a must for any modern file system), that also interfere with journalling in non-trivial ways, and you will see why it's hard to devise a common API for efficient journalling.

Basically, ext3 is the only Linux file system that uses classical WAL redo-only logging straight from Gray & Reuter, and lo --- there even is a standalone journalling module (fs/jbd) for it, or for any other file system that uses plain WAL.

So I see the situation as, in some sense, opposite to what you are saying: there is no common journalling API (in Linux) precisely because modern file system developers are _not_ following the classical papers on how to do journalling. Instead they explore new mechanisms to guarantee data consistency, and in this situation the straight-jacket of a predefined universal API would be an obstacle.

Nikita.
MACH VM: osfmk/vm/vm_page.h:

    /*
     * Each page entered on the inactive queue obtains a ticket from a
     * particular ticket roll.  Pages granted tickets from a particular
     * roll generally flow through the queue as a group.  In this way when
     * a page with a ticket from a particular roll is pulled from the top
     * of the queue it is extremely likely that the pages near the top
     * will have tickets from the same or adjacent rolls.  In this way the
     * proximity to the top of the queue can be loosely ascertained by
     * determining the identity of the roll the page's ticket came from.
     */

A promising idea.
Added xll_sync.[ch] with simple implementations of mutex, semaphore, and condition variable. XNU/MACH synchronization primitives are... brain-damaged, to say the least. For example, a semaphore may only be destroyed by the same thread that created it. I wonder whether the authors of this code ever tried to actually use it. How on earth is this supposed to work with semaphores embedded into dynamic data structures? xll_sync.[ch] is done directly on top of the MACH wait_queue. Unfortunately, wait_queue has no static initializer macro.
A Synthesis quaject's interface consists of the following:

1. callentries (methods),
2. callouts (references to callentries of external quaobjects), and
3. callbacks.

A callback is something along which control is returned when an unusual situation occurs. For example, when the queue is empty and we are trying to call queue.get(), the queue.empty callback is invoked (a quaject is tailored to a particular user through run-time code generation, so its .empty callback is specific to the caller, as far as I understand). So far nothing unusual, _but_ the queue quaject also has a queue.hasone callback that is invoked when a previous call to .get() failed through .empty and a new element has just been added. So, callbacks are both an exception handling and a notification/synchronization mechanism. (Sketch below.)
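A rough C rendering of a queue quaject's interface (hypothetical; real Synthesis specializes these slots per caller with run-time code generation rather than using function pointers):

    struct queue_quaject {
        /* callentries: the methods */
        int  (*put)(struct queue_quaject *q, void *item);
        int  (*get)(struct queue_quaject *q, void **item);

        /* callbacks: per-caller exits taken in unusual situations */
        void (*empty)(void *caller);    /* .get() found nothing         */
        void (*full)(void *caller);     /* .put() found no room         */
        void (*hasone)(void *caller);   /* element arrived after .empty */

        void *caller;                   /* the client these were built for */
    };

The .hasone slot is what makes callbacks double as a notification mechanism: a caller that bounced out through .empty is later re-entered through .hasone instead of polling.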
http://www.unet.univie.ac.at/aix/aixbman/prftungd/vmmov.htm

"Repaging", mentioned by Riel a long time ago. How can this be implemented in Linux? E.g., add a ->repage_tree radix-tree to struct address_space. (We get 32 bits per page this way, while we need only one!) Insert an entry into this tree on each page fault. Do _not_ remove the entry when the page is reclaimed. As a result, ->repage_tree keeps a history of page faults. Check it on page fault: if there is an entry for the page being faulted in, we have a "repage". Maintain a global counter of elements in all ->repage_trees. When it is higher than some threshold (AIX uses the total number of frames in primary storage), start removing ->repage_tree elements on reclaim. Disadvantage: the history is lost when the struct address_space is destroyed. (Sketch below.)
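A sketch of the fault-side check against the old Linux radix-tree API (radix_tree_lookup()/radix_tree_insert() are real; the ->repage_tree field, the global counter, and the trimming policy are the hypothetical parts; tree locking elided):

    #include <linux/radix-tree.h>
    #include <linux/fs.h>
    #include <asm/atomic.h>

    extern atomic_t repage_entries;     /* global size of the fault history */

    /* Called on page fault; returns non-zero if this is a "repage", i.e.
     * the page was faulted in before and has since been reclaimed. */
    int note_fault(struct address_space *mapping, unsigned long index)
    {
        int repage = radix_tree_lookup(&mapping->repage_tree, index) != NULL;

        if (!repage &&
            radix_tree_insert(&mapping->repage_tree, index, mapping) == 0)
            atomic_inc(&repage_entries);
        /* Entries are deliberately *not* deleted at reclaim time; they are
         * trimmed on reclaim only once repage_entries exceeds the threshold
         * (AIX: total number of frames in primary storage). */
        return repage;
    }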
It is immediately apparent from these figures that moving-arm disks should never be used, neither for paging applications nor for any other heavy-traffic auxiliary memory applications [D3]. Peter J. Denning -- Virtual Memory.
The dynamic reordering of all blocks [i.e., LRU --nikita] in memory may be a costly procedure; moreover, we are not particularly interested in the entire order. Of interest, rather, is a split of blocks into two subsets: the one to which recent references have occurred and the other to which no recent reference has been made. There is a relatively easy way to facilitate the split: any time a reference is made to a block, a status bit (which we call the "P bit") is set to 1 for the particular block. For pushes, unmarked blocks (for which P is 0) are preferred. However, it may happen that the set of unmarked blocks vanishes; at this instant, all P bits are reset to 0, except the just-marked block, and the procedure starts afresh. Dueling queues for VM. Interesting... (Sketch below.)
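The quoted scheme in C (a sketch; assumes at least two frames, so the restart after a reset always finds an unmarked frame):

    struct frame { int p; /* 1 = referenced since the last reset */ };

    void on_reference(struct frame *f) { f->p = 1; }

    /* Choose a push (eviction) victim, preferring unmarked frames. */
    struct frame *pick_victim(struct frame *tab, int n, struct frame *just_marked)
    {
        int i;

        for (i = 0; i < n; i++)
            if (!tab[i].p)
                return &tab[i];
        /* The unmarked set vanished: reset all P bits except on the
         * just-marked frame, and start afresh. */
        for (i = 0; i < n; i++)
            if (&tab[i] != just_marked)
                tab[i].p = 0;
        return pick_victim(tab, n, just_marked);
    }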
Smaller page sizes reduce the replacement rate considerably, mostly because "the predominant factor is that more useful information can be stored in memory, thus is directly accessible, resulting in a new, extended locality of the program...." Maybe this is what makes look-aside caches such as the inode and dentry caches so efficient.
For data-base workloads the A_0 replacement algorithm --- the optimal algorithm based on the IRM (Independent Reference Model) --- is _much_ worse than LRU.
With WAL, destaging policies in the database buffer pool can be more elaborate than those in filesystem caches, where the age of "dirty" or modified pages must be bounded in order to restrict the loss of data in case of a system crash.
.... Hence, as a general rule, the dirty pages should be kept in memory only to the point where write coalescing becomes insignificant....
With the hardware support for WS on the Maniac II (see [MORR72]), each page frame has an associated counter that approximates the virtual-time-since-last-reference VT-LR(p) directly. The counter is cleared whenever the page is referenced, and the processor automatically increments the counter of each page of the task every 0.25 msec that the task executes.
Scheduler objects: thread, mutex, event, timer. Interface:

    /* waits for objects. Typical use for a mutex. */
    object_wait(obj);

    /*
     * register a call-back to be executed when the object "happens".
     * Typical use for a timer.
     */
    object_register(obj, callback);

    /*
     * register a callback that decides when the wait has to be terminated.
     * Used to avoid the thundering herd problem.
     */
    object_wait_filter(obj, callback);
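For instance, the filter callback could admit exactly one waiter per event post, which is precisely the thundering-herd case mentioned above (sketch; 'struct object', its 'claimed' field, and atomic_xchg() are assumed):

    /* Terminate the wait only for the single waiter that wins the claim. */
    static int claim_one(struct object *ev)
    {
        return atomic_xchg(&ev->claimed, 1) == 0;
    }

    /* N threads may wait; each post of 'ev' wakes exactly one of them: */
    object_wait_filter(ev, claim_one);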
I can still recommend single-pass top-down recursive descent both as an implementation method and as a design principle for a programming language. C.A.R. Hoare, The 1980 ACM Turing Award Lecture
several "binding time" levels (sale, compile, run-time) as a way to implement re-usable software components.
EWD450: "controlled variables" (the bound variable of a loop, it seems) and recursion should be abolished. Controlled variables are bad (my understanding) because a repetition's goal is to establish some relation while keeping an invariant (as a result, at the end of the loop, the invariant is true and the guard conditions are false). So, a loop should be structured in terms of what relation is achieved, rather than of what data structure we iterate upon.
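A trivial example of the shift in emphasis (plain C): the loop is designed around the relation it establishes, not around the variable it "controls".

    /* Establishes R: s == a[0] + ... + a[n-1].
     * Invariant P: s == a[0] + ... + a[i-1];  guard: i != n.
     * At exit, P && !guard imply R; 'i' is incidental bookkeeping. */
    long sum(const long a[], int n)
    {
        long s = 0;
        int  i = 0;             /* P holds: the empty sum is 0 */

        while (i != n) {
            s += a[i];          /* re-establish P for i+1 */
            i += 1;
        }
        return s;               /* P && i == n  =>  R */
    }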
EWD788: Symmetry is the fundamental property; commutativity is a notational artefact. Multiplication is defined on the _unordered_ pair (or even finite bag) of numbers. The set-theoretic framework, where multiplication is an operation encoded as mult: Z x Z -> Z, forces asymmetry. Avoiding over-specification helps to simplify an argument. mult: FBag(Z) -> Z (including the empty bag). Commutativity is here already. Unit: 1 = mult.{}. Associativity: mult.(B + C) = mult.{mult.B, mult.C}.
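In symbols (my rendering of the note; the unit and associativity laws are my completion of the truncated original):

    \[
      \mathit{mult} : \mathrm{FBag}(\mathbb{Z}) \to \mathbb{Z}, \qquad
      \mathit{mult}.\{\} = 1, \qquad
      \mathit{mult}.(B + C) = \mathit{mult}.\{\mathit{mult}.B,\; \mathit{mult}.C\}
    \]

Here $+$ is bag union; commutativity is built in because bags are unordered, and the last equation plays the role of associativity (on singletons it reduces to $\mathit{mult}.\{a, b\} = a \cdot b$).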
EWD408 and EWD462: Here a "local replacement policy" is discussed. Its advantage is that by keeping the C/T ratio (the ratio of the virtual CPU time a given process got to the time it spent waiting for page faults) constant, one can guarantee that system behavior is (more or less) independent of the workload. Predictable. Why is it so? Not clear. Anyway, the basic assumption of this policy is that there is little or no page sharing going on. This no longer holds: for multiple threads executing within the same address space such a scheme wouldn't work. A useful idea can be extracted: for a "task" we keep:

1. its page fault rate;
2. the page fault rate it would have if its working set were K1 pages smaller;
3. the page fault rate it would have if its working set were K2 pages larger.

The latter is easily achieved by keeping track of the K2 most recently evicted pages. (2) can be achieved by:

1. actually unmapping (but not evicting) the K1 coldest pages in the working set, or
2. on each page fault scanning the K1 coldest pages in the working set and analyzing their referenced bits.

The goal of this is to find the "knee" position in the page-fault-rate/resident-set-size graph, which corresponds to the working set size. It is, however, unclear what to do with shared pages (including the file system page cache). (Sketch below.)
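Points (2) and (3) as a sketch (hypothetical names throughout; the list of recently evicted pages is a "ghost list" in today's terminology):

    enum { K1 = 16, K2 = 64 };

    struct page;

    struct task_mem {
        unsigned long faults;       /* (1) total page faults                */
        unsigned long ghost_hits;   /* faults that hit the ghost list       */
        unsigned long cold_refs;    /* references to the K1 coldest,
                                       deliberately unmapped resident pages */
        struct page  *ghost[K2];    /* K2 most recently evicted pages       */
    };

    int ghost_contains(struct task_mem *tm, struct page *pg);  /* assumed */

    void on_fault(struct task_mem *tm, struct page *pg)
    {
        tm->faults++;
        if (ghost_contains(tm, pg))
            tm->ghost_hits++;   /* would have been resident with a working
                                   set K2 pages larger: counts toward (3) */
    }

    /* Per interval dt:
     *   rate(now)        = faults / dt
     *   rate(K2 larger)  = (faults - ghost_hits) / dt   -- point (3)
     *   rate(K1 smaller) = (faults + cold_refs) / dt    -- point (2)
     * Grow the resident set while rate(now) - rate(larger) is steep;
     * shrink it while rate(smaller) - rate(now) stays flat: the knee. */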
EWD851: Reducing control traffic in a distributed implementation of mutual exclusion. Dijkstra uses regular expressions to describe a distributed algorithm. Nodes and links are denoted by characters in a string. Nodes are arranged into a _ring_. It looks intriguing to consider a theory of "cyclic strings". And what algebraic picture does it give us when we recall that strings over a given alphabet form a free monoid?
EWD910: "bottom" (the diverging computation of FP) is the price paid for a "functional" description of programming language semantics: when the final state is considered a function of the program and the initial state, "bottom" is used as the value of this function for (program, initial-state) pairs whose computation fails to terminate.
Insertion into Chomsky hierarchy
A Course In Algebraic Number Theory
N. K. Vereshchagin, A. Shen.
Following F. William Lawvere, we show that many self-referential paradoxes, incompleteness theorems and fixed point theorems fall out of the same simple scheme. We demonstrate these similarities by showing how this simple scheme encompasses the semantic paradoxes, and how they arise as diagonal arguments and fixed point theorems in logic, computability theory, complexity theory and formal language theory.
Anders Kock, Gonzalo E. Reyes, Aspects of fractional exponent functors. Proposition 2.1: Given functors F, G: A -> B and natural transformations g, h: F -> G, let E(g, h) denote the full sub-category of A on which g and h agree. If A has, and F preserves, a certain class of colimits, then E(g, h) is closed under this class of colimits. If A has, and G preserves, a certain class of limits, then E(g, h) is closed under this class of limits. And if G preserves monics, then E(g, h) is closed under sub-objects. This proposition can be used to prove that certain categories have certain types of (co-)limits.
A Tutorial on (Co)Algebras and (Co)Induction: an algebra describes a class (construction), and a co-algebra describes a type (observation with unknown state).
Abstract: This paper grew out of a question of Lawvere, who asked whether categories of ordinary (1st or 2nd order) differential equations may be seen as toposes. He also gave some indications of how "fractional exponents" may be used to answer this question [25]. It is more or less classical how first order differential equations (1ODEs) are organized into a category. A 1ODE on a manifold M is the same thing as a vector field X: M -> T(M), and a morphism of vector fields (M_1, X_1) -> ...
p.31: sets (discrete categories) are idempotents in the category of categories: S is discrete iff S^2 = S, where S^2 = Cat(2, S) = Arr(S), and 2 is the category with two objects and a single non-identity arrow.
Higher-dimensional category theory is the study of n-categories, operads, braided monoidal categories, and other such exotic structures. It draws its inspiration from areas as diverse as topology, quantum algebra, mathematical physics, logic, and theoretical computer science. This is the first book on the subject and lays its foundations.
If K is any convex set, we may define a metric on it by

    K(A, B) = inf_{F : A -> B} (- log \alpha_F),

where "F : A -> B" means that F \in K with A on the open segment from F to B, and \alpha_F denotes the 0 < \alpha < 1 with A = (1 - \alpha) F + \alpha B.
In the category of smooth manifolds, an epimorphism is a map with everywhere-dense image. A surjective submersion has the universal property characterizing the Bourbaki "quotient object".
Total of 1478 links
Last generated: 2008-10-12 21:03
Generated by: generate-bookmarks -S -s index -o exp --