
This article refers to assembly code listings which, at the time of its writing, were supposed to demonstrate the statements in the text.
However, since those listings are fragments of Windows Me code (as seen from the image loaded into memory), we prefer not to publish them, to be sure we comply with all existing copyright regulations; hence the article comes without the listings. We did not edit the text to remove the references to the listings, because this is an earlier work and it is only provided "as is". The text itself is still readable, because the listings were only meant as proof of our conclusions.


The Ticking of the Clock

by Enrico Martignetti

Copyright © 2003



Introduction

 

In this article I will analyze how Windows Me handles the interrupt from the programmable interval timer (PIT), used to update the internal clock.

 

In doing this, I will describe the ring 3 – ring 0 transition. The way Windows programs the programmable interrupt controller (PIC) will give us the opportunity to explore the concept of Interrupt Request Level (IRQL) introduced with the Windows Driver Model (WDM).

 

We will see how an interrupt triggers the servicing of events and timeouts, which will be discussed as well.

 

I tried to complete this text with basic concepts on the processor architecture, whenever I felt this was necessary. In order not to impose them on anyone who already has familiarity with this subject, I collected them into an appendix, placing references to it in the other parts of the text. I decided to use the same approach with Windows ring 0 core (VMM) basic concepts, which are collected in yet another appendix.

 

This document refers to several code listings which contain disassembled portions of the Windows Me kernel. It is not necessary to read the listings in order to understand the concepts explained in the text; they are included primarily to support the statements I make in the article. Nevertheless, the listings contain many additional details explained in comments, so they may be worth looking at. I suggest reading each article section first and then, if desired, the related listing for further information.

 

Before delving into the intricacies of Windows interrupt handlers, it may be useful to explain a few things about the conventions I adopted for the code samples.

Document Conventions

Symbolic Names

 

My Own Names

This document contains several code samples, which include symbolic names for code labels and variables. Some of these names are defined in the symbol files supplied with Windows, while others I added myself to clarify the meaning of code and variables for which symbols weren't available.

 

All the names that I invented begin with the characters @#@, in order to keep them distinct from the original Windows ones. The distinction may be important because I assigned these names based on my interpretation of the code; therefore it is possible, though I hope not, that they may be misleading if my interpretation is wrong.

 

There is an exception to this rule: I defined structures to represent some system variables. The structure names follow the convention I stated above, while the structure members don’t, in order to avoid too many decorations.

 

Structures' Undocumented Members

 

The Windows SDKs and DDKs document a number of structures used by the system. By analyzing the Windows code it appears that some of these contain other members, which are not documented. Since the documented ones always begin with a prefix which resembles the parent structure name, I chose to name the undocumented ones of which I became aware using the same prefix with a trailing "X" character. For instance, some documented members of the tcb_s structure (from the Windows Me DDK) are:

 

TCB_ClientPtr

TCB_VMHandle

 

I found some extra members in the tcb_s, which are all named TCBX_...

 

 

Addresses in the Code Samples

 

This document includes many code samples which were produced by disassembling portions of code and which, therefore, include the addresses of the listed instructions. I began this article while Windows 98 SE was the current release, then I revised all the materials for Windows Me, updating the listings only when I found that there were differences. Unfortunately, we could hardly expect the Windows Me modules to load at the same addresses as their Windows 98 counterparts. For this reason, you'll sometimes find a mix of different and inconsistent addresses in the code, but this should not be a problem: simply ignore the instruction addresses. I guarantee that the code listings show exactly the stream of instructions you find when you look at Windows Me code. Whenever an address is used as an instruction argument (e.g. in a jump), I guarantee that its position is shown correctly in the listing. In addition, I introduced symbolic labels for most of the relevant addresses which, of course, are placed correctly in the code. In short, the numerical addresses should be disregarded, as is the case when you write assembly source code: you don't have numerical addresses at all, only the labels you choose to define.

 

Organization of Code in Procedures

 

All the code listings are organized in procedures declared with the proc and endp assembler directives. These are just declarations I added in order to separate the code into functional blocks. Whenever I found an address in code referenced by calls or interrupts, I defined a procedure beginning at that address. The name of the procedure is the symbol associated with that address in the Windows symbol files (if any), or a descriptive name I invented (decorated with @#@). I then enclosed in the procedure body all the code up to the farthest instruction reached by the flow from that address. In other words, if one way of a branch leads to a return instruction and another leads to other code below it, that code is considered part of the procedure as well. The procedure ends immediately before the first instruction which is never reached by any flow of execution inside the procedure itself.

 

 

Structures' Unknown Fields

 

By analyzing the code, I have been able to find many data structures which are not documented. Often, I could not identify every single member of these structures, so, when declaring them in Struct.Asm (Listing 1), I inserted three dots to denote missing members between known ones. For instance:

 

@#@IRQStr   struct

    IrqNumber       dword   ?   ;(+00h)
                                ;IRQ number. Also used to command the PIC.
    MaskBitPt       word    ?   ;(+04h)
                                ;Bit pattern for IRQ unmasking.
    ...

    IrqLevel        byte    ?   ;(+14h)

 

The offsets between parentheses give the correct position of the member inside the structure.

 

1 - The PIT Interrupt from Ring 3 to Ring 0 And Return

 

1.1 - Transition from Ring 3 to Ring 0

 

Note: if you are not familiar with the concept of ring x code, you will find additional information in section A.1.

 

1.1.1 - Windows IDT

 

Since we want to start our exploration from the PIT interrupt, the first thing to do is look at the processor IDT, for information about its handler. Before doing this, though, we have to choose which IDT, because Windows has two of them for each VM (you can find additional information about VMs in section B.1).

 

I will analyze the system VM IDT for protected mode, because this is the one involved whenever a Windows application is executing at the time of the interrupt.

 

 

1.1.2 - PIT 1st Level Handler

 

The PIT interrupts on IRQ 0, which corresponds to interrupt vector 50h, because of the way Windows initializes the programmable interrupt controller (PIC). By looking at the gate for this vector we obtain the address of the 1st level handler, which is listed in Figure 1.

 

C0001100  83EC04           sub         esp,4
C0001103  60               pushad                                 
C0001104  BE40010000       mov         esi,140h                           
C0001109  E90E030000       jmp         Specific_Fault_Dispatch

Figure 1

 

The 1st sub is used to compensate for the Intel error code. Windows always builds the same stack layout, whether the current program is interrupted by a hardware interrupt, a software interrupt or a processor exception. Since some exceptions push the additional error code after eip, Windows makes room for one doubleword below eip for those interruptions which don't have an error code. In this way, all the stack frames are always 4 doublewords long. If you are not familiar with the stack frame built by an interruption, you will find additional information in section A.2.

 

Immediately after the sub, all the processor general purpose registers are saved with a pushad, then esi is loaded with a value which is equal to the interrupt vector multiplied by 4:

 

50h * 4 = 140h.

 

This value is used to remember which interrupt occurred, because from here we will jump to Specific_Fault_Dispatch which is common to many interrupts.
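The stub's arithmetic can be rendered as a minimal C sketch (the function name is my own; only the numbers come from the listing): the vector is turned into a byte offset into a table of doubleword handler pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: the 1st level stub records vector * 4, which later
   serves as a byte offset into a table of 4-byte handler addresses
   (PM_Fault_Table in the text). */
static uint32_t vector_to_table_offset(uint32_t vector)
{
    return vector * 4;  /* the value the stub loads into esi */
}
```

For vector 50h this yields 140h, the value the listing in Figure 1 loads into esi.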

 

 

1.1.3 - Specific_Fault_Dispatch

 

The code for specific fault dispatch is listed in SpcFltDs.Asm (Listing 2).

 

It’s interesting to note how the value of esp after the pushad is the pointer to the thread client register structure. This is a ubiquitous structure, documented in the Windows Me DDK as Client_Reg_Struc, which contains the processor state for the interrupted thread. Each thread in the system has a client register structure which, when the thread is not running, contains its saved state. I will add some information about this structure in a while, but first, let’s finish our examination of Specific_Fault_Dispatch.

 

The esp value pointing to the client register structure is stored into ebp, where it will stay for most of the time. Ebp is the conventional place for this pointer, although it can also be retrieved from some control structures. Afterwards, a test on the saved flags image is performed to determine whether the executing program was in V86 mode. I will concentrate on the non-V86 case, because it applies to Windows application code. The analysis will therefore go on at @#@SetPmAppRing0Env, but first I want to keep my promise and spend some more words on the client register structure.

 

 

1.1.4 - The Client Register Structure And the Status of Threads

 

I think it’s useful to further clarify when a thread is "frozen" with its state saved in the client register structure.

 

I said before that this is true for every thread which is not running. Okay, you could say, but our thread is running; it just received an interrupt, but it's still the same good old thread. This is true, but the fact is that, as soon as a thread enters ring 0, the 1st thing Windows does is save its ring 3 state in the client register structure. Therefore, to be more accurate, I should say that the structure contains the state of the non-running threads and of the running one, if it is executing ring 0 code.

 

What is interesting to point out is that these two conditions are actually the result of the same interrupt handling scheme, applied to all threads. If a thread is preempted by the scheduler or blocks on a synchronization primitive, it does so at ring 0. So this is what happened to it: it entered ring 0, had its client register structure built, then never went on. It just stayed at ring 0, either because Windows decided to preempt it or because it called a synchronization function which stopped it there until the conditions to unblock it are met.

 

In all these cases, Windows maintains information about the point at which the thread was at ring 0 and its state, then resumes execution of some other thread which had the same fate at some earlier time. Therefore all the non-running threads were stopped at ring 0 and this, ultimately, is the reason why they still have their ring 3 state in the client register structure.

 

I’ve been careful not to say that a thread running at ring 3 doesn’t have a client register structure because, as we will see, the memory area for the structure always exists, and at the same address, even in this case. It’s just that, as long as the ring 3 application is running happily, the structure contents are not valid.

 

 

1.1.5 - Detecting Interrupts from Ring 0

 

We can go on with our analysis by looking at @#@SetPmAppRing0Env, which is a label inside Return_To_VM, listed in RetToVM.Asm (Listing 3).

 

The 1st thing this code block does is to load the current VM handle into ebx from a static variable named Cur_Vm_Handle (you can find additional information on VM handles in section B.1).

 

Then the value of the CB_Client_Pointer member of cb_s is compared against the pointer to the just-built client register structure. This compare is the criterion the VMM uses to determine whether the interrupt came from ring 3 or ring 0 (you can find additional information about these two kinds of interrupts in section B.2).

 

To explain my statement, let’s note that the interrupt frame and the registers saved by the initial pushad are written at an address which is determined by the TSS, because the initial ring 0 stack pointer is taken from there (see section A.1 for details on the ring 0 stack and the TSS). That is, since a ring 3 – 0 transition occurred, esp is initialized to the fixed value in the TSS; therefore the frame and the data saved by the pushad end up at a fixed address.

 

On the other hand, if an interrupt occurs while we are already at ring 0, no stack switch occurs, because the privilege level stays the same, and an ordinary interrupt frame (eflags, cs, eip) is built wherever esp is pointing at the moment of the interrupt.

 

So this is the reason why the cmp in sample 1-3 detects the ring from which the interrupt came: if it was ring 3, the client register structure to which ebp is pointing must be at a predefined address, which is kept inside the cb_s. If this is not the case, then it is ring 0 code which has been interrupted.
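The test can be sketched in a couple of lines of C (a rendering of mine, not actual VMM code; the address values in the example are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the ring detection: an interrupt came from ring 3 exactly when
   the frame pointer (ebp, pointing at the just built client register
   structure) matches the fixed address recorded in the cb_s. */
static int interrupted_ring3(uint32_t cb_client_pointer, uint32_t ebp)
{
    return cb_client_pointer == ebp;   /* the cmp described above */
}
```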

 

By the way, we’ll see in a while how, after having set up the ring 0 environment, an interrupt from ring 3 is handled by jumping to an address determined using the interrupt vector as an index into a jump table named PM_Fault_Table. When an interrupt arrives from ring 0, the environment is set up in a different way and then another jump table is used, namely VMM_Fault_Table, which isn’t surprising, since Windows has different handlers for ring 0 and ring 3 interrupts.

 

 

 

1.1.6 - Mind Your Own Stack

 

Before going on with the analysis of @#@SetPmAppRing0Env, I would like to point out some implications of the stack management I just described.

 

I explained before that when a thread is suspended it remains at ring 0, with the ring 3 state of execution saved in the client register structure. I also explained that whenever a thread reaches ring 0 from ring 3, the stack is initialized with the pointer in the TSS. It is therefore necessary that every running thread have a different value for this pointer, because if they all used the same one they would overwrite each other's client register structure. Actually, when the VMM decides to switch to another thread, it writes into the TSS member containing the pointer a value specific to the resuming thread. This value is kept inside the tcb_s (in an undocumented member) and is presumably determined when the thread is created. This can be verified by looking at the code of a procedure named Directed_Yield, which is part of the chain which switches execution to another thread. I am not going to analyze this procedure here, because it is beyond the scope of this article.

 

We will also see that the VMM doesn’t use this initial stack throughout its execution. As part of setting up the ring 0 environment, esp is loaded with yet another value kept inside the tcb_s. Therefore the initial stack pointed to by the TSS is only used to store the saved state of execution.

 

 

1.1.7 - Completing the Ring 0 Initialization

 

We are now ready to go on examining the code at @#@SetPmAppRing0Env. For now, let's concentrate on the case of interrupts coming from ring 3. At address 0xC00014D1 we see that the VMM goes on saving registers inside the client register structure. There are two interesting things to note about this.

 

The first one is the fact that the registers are saved above the interrupt frame, hence above the initial stack pointer loaded from the TSS. We can see this because the interrupt frame itself is 6 doublewords long (remember it includes the ring 3 stack pointer and the error code) and the pushad took 20h more bytes. Therefore we have a total of

 

6 * 4 + 20h = 38h

 

bytes pushed on the stack, with ebp pointing at the lowest of them. Any access at an offset greater than or equal to 38h from ebp will therefore be above the pushed data. This would normally be a strange thing to do, because a program can safely use only addresses below the initial stack value it finds. In our case the VMM can do such a thing because it knows it has set aside a certain memory area for the thread stack and has calculated the initial esp (stored in the TSS) in such a way that there is available space above.
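The arithmetic above can be captured in a tiny C sketch (names mine, numbers from the text):

```c
#include <assert.h>

/* Saved state below the TSS esp0: a 6-doubleword interrupt frame
   (SS, esp, eflags, cs, eip, error code) plus the 8 registers pushed
   by pushad. Accesses at ebp + 38h or above land above the pushed data. */
enum { FRAME_DWORDS = 6, PUSHAD_BYTES = 0x20 };

static unsigned saved_state_size(void)
{
    return FRAME_DWORDS * 4 + PUSHAD_BYTES;
}
```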

 

The second thing worth mentioning is that the segment registers are saved with exactly the same layout they would have, had the processor been interrupted in V86 mode. In this case, the processor would have saved the segment registers automatically, as part of the interrupt processing, i.e., when the processor is in V86 mode it doesn't push only SS, esp, eflags, CS, eip, but the segment registers also. This doesn't happen when it is in PM, so Windows does it by hand, at the same offsets, in order to have a uniform layout of the saved application state.

 

After having done this, the segment registers are loaded with the selector appropriate for ring 0 code, which had been previously loaded into di.

 

Then the current thread handle is loaded into edi. Note how the global variable _pthcbCur is where Windows keeps the handle of the currently executing thread. If you need some background information about thread handles, refer to section B.3.

 

Afterwards, as I anticipated, the stack pointer is loaded with a new value stored in an undocumented member of the tcb_s which I named TCBX_ESP0_VMM. See Struct.Asm (Listing 1) for the definitions of the structures mentioned in this document. The current stack value, which still points to the client register structure, is stored in the same member, because the instruction is an xchg.

 

Then a call is done, with the address taken from PM_Fault_Table. Esi, which contains 4 * interrupt vector, is used as an offset into the table, to reference the 2nd level handler specific to our interrupt. The address stored in the table corresponds to the symbol Hw_IRQ_0. This finishes the initialization of the ring 0 execution environment. When the 2nd level handler returns, execution jumps to Return_To_VM, where the epilogue code which returns to the ring 3 program is executed.

 

 

1.2 - The PIT Handler

 

In this section I will examine the PIT handler to which Return_To_VM jumps. The 1st part of the handler is actually made of pieces coming from VPICD (the virtual programmable interrupt controller driver), therefore we’ll take a look at this component too.

 

1.2.1 - IRQ Masking and IRQL

 

The code explained here can be found in HwIrq0.Asm (Listing 4)

 

This procedure is actually inside the VPICD. Its first task is to program the PIC to mask the IRQ line of our interrupt. This will prevent another interrupt on the same IRQ from being serviced, even with interrupts enabled on the processor. We see that the VPICD uses three bit masks to keep track of which IRQs must be masked, performs an or on them and outs the command to the PIC to mask all the resulting bits. In order to improve performance, it checks whether the command it is about to issue is the same as the last issued one and, if so, avoids the unnecessary out instruction.

 

The reason why a bit mask is needed is that every time the masking command is issued to the PIC, it must contain a set bit for every IRQ we want to mask. Every 0 bit in the command unmasks the corresponding line. This means Windows must record into these patterns which lines are currently masked and, each time it needs to mask another one, issue to the PIC a command byte with all the corresponding bits set.
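The masking logic just described can be sketched in C as follows; the three mask names come from the text, while the functions and the cached-command variable are my own rendering, not VPICD code.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the VPICD masking scheme: the three masks are ORed into one
   command byte (1 = line masked, 0 = line unmasked) and the out to the
   PIC is skipped when the byte equals the last one issued. */
static uint8_t last_command;   /* last byte written to the PIC */
static int     out_count;      /* counts the simulated out instructions */

static void write_pic_imr(uint8_t cmd)   /* stands in for the real out */
{
    last_command = cmd;
    out_count++;
}

static void update_irq_masking(uint8_t phys_isr, uint8_t phys_imr,
                               uint8_t irql_imr)
{
    uint8_t cmd = phys_isr | phys_imr | irql_imr;
    if (cmd != last_command)   /* avoid the unnecessary out */
        write_pic_imr(cmd);
}
```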

 

The reason why there are three different masks is that they keep track of IRQ masked for different reasons.

 

The one named VPICD_Phys_ISR is for IRQs which are masked because an interrupt from them is being serviced. We see that our handler sets the bit for its line inside this mask. These bits are cleared when the interrupt handler calls the service _VPICD_Phys_eoi, to inform Windows that the IRQ can interrupt again.

 

The one named VPICD_Phys_IMR probably accounts for IRQs masked by ring 3 code. It is a known fact that Windows traps attempts from ring 3 code to issue commands to the PIC (see, for instance, [2], p. 417). It probably keeps track of which IRQs have been masked for this reason inside this variable.

 

The most interesting mask is VPICD_Irql_Imr, which has to do with the IRQL.

 

IRQL is a concept inherited from Windows NT as part of the Windows Driver Model specifications, which were introduced to allow the design of device drivers portable between the Windows 9x family and Windows NT. Beginning with the Windows 98 DDK, the kit documentation includes the WDM specifications, where references to the IRQL can be found. I haven't been able to find an explanation of the underlying concept there, though, while information can be found in the Windows NT DDK, so, sometimes, I will quote from there.

 

The current IRQL is a state of the whole system, in the sense that, at any given time, the system is at a given IRQL. I have not been able to find information supporting this statement in the Windows Me DDK, but I will demonstrate it by showing how the IRQL is used and managed. The current IRQL determines which IRQs can be serviced and whether or not the system processes certain categories of enqueued callbacks. Generally speaking, the higher the current IRQL, the less interruptible the system is, meaning that more IRQs can be masked, certain categories of enqueued callbacks are not called, and so on. When the current IRQL reaches its maximum (numerically equal to 1fh), interrupts are completely disabled.

 

While examining the code executed to handle our interrupt, we’ll encounter various "effects" of the current IRQL, which I will describe and which will support the overview I outlined just now.

 

The first IRQL effect we are seeing is the presence of the VPICD_Irql_Imr bit mask: the bits set to 1 in this mask cause the corresponding IRQ lines to be masked off by the code in HwIrq0.Asm (Listing 4). We will see later how the bits inside this mask are set or cleared, when the current IRQL is changed.

 

Having completed the masking, this procedure jumps to another one which is common to different interrupt handlers, named VPICD_Common_Master_Int. It does so with the register al loaded with the PIC eoi command specific for this IRQ and with edx pointing to a structure that I call @#@IRQStr, which stores attributes pertaining to this IRQ and will be used to dispatch the appropriate handler.

 

 

1.2.2 - Dispatching the Interrupt to the Installed Handler

 

The code explained here is enclosed in VpCmnInt.Asm (Listing 5)

 

VPICD doesn’t include the functions to completely handle hardware interrupts. Instead, it implements interfaces and provides APIs to allow the real handlers to install themselves on the path of an interrupt and become part of its interrupt management scheme. Therefore, VPICD_Common_Master_Int will use the content of the @#@IRQStr to call the real handler.

 

The first thing this function does is issue to the master PIC an eoi command for the IRQ which interrupted. The command has been previously loaded into al by the code specific for the IRQ. This is necessary because the PIC automatically masks the just-occurred interrupt and all the ones with a higher IRQ number until it receives this command. Windows doesn't use this PIC feature and relies on its mask variables to disable IRQs; therefore the command is issued here, before actually handling the interrupt.

 

Having done this, it compares the current IRQL with the IRQL value stored in the @#@IrqStr. If the current value is below the requested one, @KfRaiseIrql@4 is called to raise it. It’s interesting to note that if the current IRQL is above the requested one, nothing is done. This is consistent with the idea that the higher the IRQL is the less interruptible the system is. The code for a given IRQ must be executed when the IRQL is at least at the value specified in the @#@IrqStr, but if the system is even less interruptible, i.e. less "disturbed", everything is fine.
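In C, the raise-only-if-below rule reduces to a one-liner (a sketch of mine, not the actual @KfRaiseIrql@4 code):

```c
#include <assert.h>

/* Sketch of the IRQL adjustment done before calling the handler: raise to
   the level requested in the @#@IrqStr only if we are currently below it;
   a higher current IRQL already satisfies the requirement. */
static unsigned irql_for_handler(unsigned current, unsigned requested)
{
    return (current < requested) ? requested : current;
}
```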

 

Before further describing this function, I need to spend some words on the structure @#@IRQStr and its companion @#@SIRQStr, as they appear to be used by the program. The first one, which stores information about a hardware interrupt, contains a member which I named PSIRQList, which is a pointer to a linked list. Each node of this list is an instance of @#@SIRQStr and stores information about an interrupt handler for this IRQ. The reason why this information is arranged in a list is to allow multiple handlers to be installed: when the interrupt occurs, the first handler in the chain is called; if it chooses not to handle it, the next handler is called, and so on. The handlers installed through this list are, presumably, the callbacks documented as VID_Hw_Int_Proc in the Windows Me DDK. These are callbacks which can be installed through documented services of the VPICD, to handle interrupts. The code in VPICD_Common_Master_Int implements this functionality. These callbacks signal to the VPICD whether the next handler in the chain is to be called or not by means of the carry flag. This is documented in the Windows Me DDK and we also see it in the code of this function.
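A C rendering of the list walk might look like this; the struct is a hypothetical reduction of @#@SIRQStr, and the carry flag becomes a return value:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the VID_Hw_Int_Proc chain: handlers are tried in list order
   until one consumes the interrupt (in the real code the callback reports
   this through the carry flag). */
typedef struct sirq_node {
    struct sirq_node *next;
    int (*hw_int_proc)(void);   /* returns 1 = consumed, 0 = pass on */
} sirq_node;

static int dispatch_irq(sirq_node *head)
{
    for (sirq_node *n = head; n != NULL; n = n->next)
        if (n->hw_int_proc())
            return 1;           /* handled: stop walking the chain */
    return 0;                   /* no installed handler consumed it */
}

/* Two toy handlers for demonstration. */
static int decline(void) { return 0; }
static int consume(void) { return 1; }

static sirq_node second = { NULL,    consume };
static sirq_node first  = { &second, decline };
static sirq_node lonely = { NULL,    decline };
```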

 

Having adjusted the IRQL, this procedure calls the first handler in the chain, which is the main thing it has to do. Our analysis of the timer interrupt will therefore go on with the installed handler, which is part of vtd.vxd, the virtual timer device. Before moving to it, though, there are some extra details about this procedure of the VPICD which are worth mentioning.

 

First of all, at the moment of the call, the register eax still points to the list node (it is actually used to reference the handler address). This can be matched with the DDK documentation of the VID_Hw_Int_Proc, which states that it is called with eax containing the IRQHandle, which, therefore, is nothing but the pointer to our list node.

 

Another point of interest is that there are two possible kinds of handlers. The first kind are procedures which execute with their IRQ unmasked and can therefore be reentered. The VPICD provides additional checks in order to prevent an excess of reentrancy – due, for instance, to slow code – which can cause a stack overflow. The second kind are procedures which don’t need these additional checks and just get called. The means by which the VPICD discriminates the kind of handler is the flag 8000h in the Flags member of @#@SIRQStr, which, if set, denotes a reentrant handler. For handlers of the first kind, the VPICD keeps track of how many calls to the handler are in progress and, if they reach the number of 10h, the IRQ is masked, in order to prevent the interrupt from further occurring.
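The reentrancy guard can be sketched as follows; the flag value 8000h and the threshold 10h come from the text, while the function is my own reading of the logic:

```c
#include <assert.h>

/* Sketch of the reentrancy check: only handlers with flag 8000h set are
   tracked; when 10h calls are already in progress, the IRQ gets masked so
   the interrupt can't pile up further and overflow the stack. */
enum { SIRQ_REENTRANT = 0x8000, MAX_NESTED_CALLS = 0x10 };

static int must_mask_irq(unsigned flags, unsigned calls_in_progress)
{
    return (flags & SIRQ_REENTRANT) != 0
        && calls_in_progress >= MAX_NESTED_CALLS;
}
```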

We saw earlier that the initial interrupt handling code takes care to check whether the interrupt came from ring 0 or ring 3, taking different paths accordingly. You could then say that what I stated above is incorrect, because the int-from-ring-3 handler we are seeing can't be reentered, because of this distinction. Remember, though, that we reach this piece of code because the address of Hw_IRQ_0 was stored inside PM_Fault_Table. If you take a look at the initial code for ints from ring 0, you will find that it jumps to the second level handler by using another branch table, which is named VMM_Fault_Table. It turns out that, for some interrupts, the two tables contain the same address, meaning that the same procedure will be called both for interrupts from ring 0 and ring 3. This is actually the case for interrupt 50h: both tables contain Hw_IRQ_0, i.e. the PIC initial handler which jumps to VPICD_Common_Master_Int. For this particular handler it turns out that the recursion flag (8000h) is clear, therefore no recursion checking is required, but this shows how VPICD_Common_Master_Int can be reentered by the same interrupt.

 

From the code, we also see that the flag bit 200h in @#@SIRQStr.Flags specifies whether the list node is to be moved on top of the list, if its handler is executed. That is, if the previous handlers didn’t consume the interrupt and the one with bit 200h set is reached and consumes it, its node becomes the first one of the list and this handler will be the first one called the next time this interrupt will occur.

Finally, having called the installed VID_Hw_Int_Proc and before returning, this procedure calls the address stored at VPICD_Hw_Int_Filter but, normally, this points to a do-nothing address from which we just return.

Let’s briefly recapitulate what we have seen so far. When the interrupt occurs, the ring 0 environment is initialized, then a call is made to the 2nd level handler. This, in turn, is made of VPICD code, which calls one or more installed VID_Hw_Int_Procs. In the case of the PIT, there’s only one procedure in the chain. After having executed it, we return from the call through PM_Fault_Table and, from there, execution jumps to Return_To_VM, which I will describe later.

We saw that VPICD_Common_Master_Int calls two procedures which manage the current IRQL: @KfRaiseIrql@4 and @KfLowerIrql@4. I will analyze them in detail later, when I will be able to explain some side effects of the IRQL change. For now, let’s just remember that the VPICD allows every IRQ line to define the minimum IRQ level at which it wants its interrupt to be serviced, and adjusts the current IRQL accordingly.

 

 

1.2.3 – The PIT VID_Hw_Int_Proc

 

The handler for IRQ0 which the VPICD finds installed in the @#@SIRQStr instance is inside a vxd named VTD (virtual timer device). Its first task is to call the VPICD service VPICD_Phys_Eoi, which unmasks the IRQ. The VTD handler is apparently sure to be fast enough to take the risk of running with its interrupt enabled.

 

When the VPICD calls the installed VID_Hw_Int_Proc, it loads edx with reference data specified at the time the handler was installed. For the PIT handler, this data is a pointer to a data structure where the timing counters are stored. I named its structure definition @#@VtdStr1 and this particular instance @#@VtdData.

 

The first two dwords of @#@VtdData store the fundamental counter which Windows uses to keep track of the time. I named this value UpTime and the two dword members which store it UpTimeLow and UpTimeHigh. Upon every call to @#@PitHwIntProc this value is incremented by the amount stored in the BaseTInc (base time increment) member of @#@VtdData, which appears to be set to 174dh = 5965. The base time increment is related to the frequency at which the PIT is programmed to generate the interrupt and it is worth an in depth analysis.

 

The first thing to say is that, by examining a code section I named @#@UpdtWinClock, it can be concluded that the PIT is programmed to interrupt every 5 msec, i.e. at a frequency of 200 Hz. @#@UpdtWinClock is reached by a jmp at the end of @#@PitHwIntProc. This procedure calculates the increment in the value of UpTime between two subsequent calls, divides it by the constant 4a9h, then passes the result to Update_System_Clock. This service is documented and requires the time elapsed since it was last called, measured in milliseconds. So we know that, by dividing an increment of UpTime by 4a9h, we convert it into milliseconds. We can determine the period of the PIT interrupt by dividing BaseTInc by 4a9h:

 

Ti = BaseTInc / 4a9h = 5965 / 1193 = 5 msec. (1)

 

It is interesting to note that, even though BaseTInc is an integer multiple of 4a9h, @#@UpdtWinClock takes into account a possible remainder from the division. Of course, if there is a remainder, it can’t simply be neglected, because this would lead to an accumulation of errors which would make the time measurement drift. I described in detail how the remainder is accounted for in the comments in the code listing, which you can find in UpdWClk.Asm.
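A sketch of such remainder handling (my own minimal model, not the actual VMM code): the remainder of each division is carried into the next conversion, so no fraction of a tick is ever lost:

```python
TICKS_PER_MS = 0x4A9  # 1193 UpTime units per millisecond

_leftover = 0  # ticks not yet converted into whole milliseconds

def ticks_to_ms(delta_ticks: int) -> int:
    """Convert an UpTime increment to ms, carrying the remainder forward."""
    global _leftover
    total = _leftover + delta_ticks
    ms, _leftover = divmod(total, TICKS_PER_MS)
    return ms
```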

 

We also see that UpTime is measured in units of 1 / 1193 msec, which corresponds to 0.8382 microseconds, or a frequency of 1193 kHz. This is the closest integer representation of the frequency at which the PIT is driven, i.e. 1193.18 kHz.

 

The PIT is programmed to act as a frequency divisor of its input frequency, and it requires the division factor to be loaded into one of its registers through an I/O port. That is, if you load the value x, it will generate an interrupt with a frequency of 1193.18/x kHz, i.e. with a period Ti given by:

 

Ti = x / 1193.18 msec (2)

 

By comparing (1) and (2), we can conclude that BaseTInc = x, i.e. that Windows stores in this structure member the very same value it loads into the PIT register. We also see that Ti is not exactly 5 msec but, rather:

 

Ti = 5965/1193.18 = 4.999 msec.
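These numbers can be checked directly (values taken from the text):

```python
PIT_INPUT_KHZ = 1193.18  # PIT input clock frequency, kHz
DIVISOR = 5965           # the value Windows loads into the PIT (= BaseTInc)

Ti_ms = DIVISOR / PIT_INPUT_KHZ  # interrupt period in milliseconds
```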

 

Another thing we can observe is that there is a documented VTD service named VTD_Get_Real_Time which returns two doublewords storing the elapsed time in units of 0.8 microseconds. It is very likely that this is simply the value of UpTime.

 

1.2.4 - Updating the DOS clock

 

@#@PitHwIntProc actually has two timers to keep updated: the Windows internal clock and the DOS clock, which still exists. In order to update the latter, it calls an interrupt simulation service which manipulates the client state of some virtual machine, to simulate the occurrence of an interrupt. When the VM is resumed, the DOS interrupt handler is executed as if the PIT had actually interrupted a DOS program. The interesting point is that, in Windows, the PIT is usually programmed at a much higher frequency (200 Hz) than it is in DOS (18.206 Hz), so Windows must not call this simulation service on every interrupt. The technique Windows uses to accomplish this is to add BaseTInc to the DosTimeDivisor member of @#@VtdData, which is 16 bits wide. The DOS interrupt is simulated only when the ADD gives a carry, i.e. when DosTimeDivisor rolls over. This effectively divides the number of times the interrupt is simulated, and hence the simulation frequency, by a constant factor, thereby bringing it to 18.206 Hz.
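The carry trick is easy to simulate (a sketch of the mechanism described above, not the kernel code itself): we add BaseTInc to a 16-bit accumulator on every Windows tick and count how often it rolls over.

```python
BASE_T_INC = 0x174D  # 5965, added to the 16-bit DosTimeDivisor on every tick

def simulate_dos_ints(windows_ticks: int) -> int:
    """Count how many DOS interrupts get simulated over a run of PIT ticks."""
    acc = 0        # models the 16-bit DosTimeDivisor member
    carries = 0
    for _ in range(windows_ticks):
        acc += BASE_T_INC
        if acc > 0xFFFF:   # the ADD set the carry flag
            acc &= 0xFFFF  # the 16-bit member rolls over
            carries += 1
    return carries
```

Over one simulated minute at 200 Hz (12000 ticks) the carry fires 1092 times, i.e. at roughly 18.2 Hz, in line with the 18.206 Hz figure.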

 

To better understand the mechanism, let’s analyze the relationship between the number of interrupts since the adding started and the number of times the ADD set the carry flag. If we suppose that the interrupt occurred n times and we call I the base time increment, we can express the quantity nI, resulting from the repeated ADDs, as follows:

 

nI = kT + R (3)

 

Where:

 

T = threshold at which the carry is set (65536 or 10000h)

R = remainder, < T.

 

That is, we can express nI as an integer multiple of T plus a quantity < T. k gives the number of times the ADD set the carry, i.e. the number of times the sum I + I + I + … rolled over, reaching the threshold value T. Note that the code need not (and actually does not) store the complete nI value. Since we are only interested in the carry flag, it is enough to store the low-order 16 bits. I considered the abstract quantity nI only to show the relationship between n (number of interrupts) and k (number of times the carry was set).

 


We can solve (3) to find k as a function of the time since the adding started:

 

k(t) = n(t)I/T – R(t)/T (4)

 

In (4) I stressed the fact that R is a function of time, because the "portion" of nI less than the threshold varies with every add operation. Since we are interested in dividing the interrupt frequency, let’s manipulate (4) in order to show the frequencies:

 

k(t)/t = n(t)I/(tT) – R(t)/(tT)

 

Since n(t) is simply the PIT frequency f times the elapsed time t we have:

 

k(t)/t = fI/T – R(t)/(tT)

(k(t)/t)/f = I/T – R(t)/(ftT)

 

That is, the ratio between the average frequency of the carry (k(t)/t) and the frequency of the PIT is given by a constant (I/T) plus a variable term.

 

The variable term can be neglected, because R(t) is limited, by definition, to be less than T, while ftT increases continuously in time. Therefore, the more time passes, the smaller the error we have if we neglect the variable term.

 

When an error is introduced, it’s usually a good idea to give an estimate of its importance, so let’s calculate how much time needs to pass before the variable term becomes less than 1% of the constant term. I’ll name e the variable term, because it’s the "error" in the calculation:

 

R(t) < T => e = R(t)/(ftT) < T/(ftT) = 1 /(ft)

 

The time required for e to drop below a given value emax is therefore given by:

 

e < 1/(ft) < emax

t > 1/(f emax)

 

With our numbers:

 

I/T = constant term = 5965/65536 = 0.091, hence emax = 1% of I/T = 9.1E-4.

 

f = 0.2 kHz

 

t > 1/(0.2 * 9.1E-4) = 5495 ms

 

I.e., after 5.495 seconds, the error is less than 1%.
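The arithmetic above can be verified directly (values from the text):

```python
I, T = 5965, 65536
f_khz = 0.2               # interrupts per millisecond (200 Hz)
const_term = I / T        # the constant term I/T, about 0.091
e_max = 0.01 * const_term # 1% of the constant term
t_ms = 1 / (f_khz * e_max)  # time for the error to drop below 1%
```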

 

Within this approximation we can say that the ratio between k/t and f is constant and it’s equal to:

 

(k(t)/t)/f = I/T (5)

 

I will refer to k/t as fd (DOS frequency) and to f as fw (Windows frequency) in the following, hence (5) becomes:

 

fd/fw = I/T

 

Note however that the error estimate above is valid for the average DOS frequency, that is, the ratio between the total number of times the carry was set since the counting began and the elapsed time. The more time elapses, the closer this ratio gets to I/T * fw. On the other hand, if we observe fd during a generic interval, it still varies a little, unless T is exactly an integer multiple of I, which is not our case. It is easy to demonstrate that, in this case too, the longer the interval, the closer the average fd is to I/T * fw.

 

Knowing that T = 65536 and that I is the PIT division factor x, fd becomes:

 

fd = fw * I/T = fw * x / 65536.

 

On the other hand, we know that fw is the PIT frequency (fp) divided by the factor x, hence:

 

fd = (fp / x) * (x / 65536) = fp / 65536 = 18.206 Hz (6)

 

This is exactly the same frequency at which the PIT interrupts in DOS, where it is actually programmed to divide by 65536. So we see how the interrupt is simulated in the DOS environment at its original frequency.

 

Note how this technique works for any programmed PIT frequency: by simply keeping the current PIT register value in BaseTInc, Windows is always able to calculate the DOS frequency. This is important because, as we will see, the PIT can be programmed for different frequencies while the system is running.

 

Finally, @#@PitHwIntProc jumps into @#@UpdtWinClock which, as we already saw, calculates the elapsed time in msec and passes it to Update_System_Clock, whose job is to update other time counters.

 

 

1.2.5 - Update_System_Clock

 

Update_System_Clock is listed in section UpdSClk.Asm (Listing 8).

 

The first thing this procedure does is check a status variable named System_Init_State and, if it is equal to 0, return without doing anything. Afterwards, it adds the content of ecx to the variable System_Time. We see here how this fundamental counter represents a relative time since Windows started.

 

Following that, there are a number of sections, each of which implements a category of timeouts.

 

The VMM offers a service to request that a given procedure be called within a certain amount of time, that is, to schedule a timeout. You call this service specifying your procedure address and the time value in milliseconds; later, the system will call your code after at least the specified time has elapsed. Ecx will contain the amount of time (in msec) by which the call is late with respect to the specified timeout value.

 

There are two data structures involved in timeouts management: @#@ToutListPtr and @#@ToutNode, which you can find in Struct.Asm (Listing 1). I interpreted the content of their members both by looking at the code and by dumping them with the kernel debugger.

@#@TOutListPtr.Node1 is a pointer to a linked list, of which each node is an instance of @#@ToutNode. Each node in the list represents a scheduled timeout and contains the pointer to the procedure to be called, usually named "callback procedure". If there are no timeouts in the list, @#@ToutListPtr.Node1 points to @#@ToutListPtr itself.

The member @#@ToutNode.Value stores the amount of time specified for the timeout and the list is ordered by this member’s value. I will show later, when discussing the procedure which calls the callback, that for nodes other than the first one this member contains the time initially specified for the timeout minus the time of the previous timeout in the list, that is, the time relative to the previous timeout.
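The relative-time arrangement just described can be modelled with a small sketch (the list here is a plain Python list of deltas; names and representation are mine, not the kernel's). Inserting a new timeout means walking the list while subtracting the deltas already accounted for, then re-basing the node that follows:

```python
def insert_timeout(deltas: list[int], timeout: int) -> None:
    """Insert `timeout` (absolute ms) into a delta-encoded timeout list."""
    i = 0
    while i < len(deltas) and timeout >= deltas[i]:
        timeout -= deltas[i]      # convert to time relative to node i
        i += 1
    deltas.insert(i, timeout)
    if i + 1 < len(deltas):
        deltas[i + 1] -= timeout  # re-base the following node
```

The benefit of the delta encoding is that, on every tick, only the head node's value needs decrementing, regardless of how many timeouts are queued.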

 

Since the code blocks in UpdSClk.Asm (Listing 8) are very similar, I will concentrate on one of them, namely the one which begins at the label @#@CkGlobTouts.

 

First, it checks whether the timeout list is empty, by comparing @#@ToutListPtr.Node1 with the starting address of @#@ToutListPtr. If this is the case, it skips to the next timeout block; otherwise it subtracts the elapsed time in ecx from @#@ToutNode.Value. Thus, we see that the initial value is not left unmodified in this member: rather, it is continuously decremented and represents the remaining time before the callback must be called. If the result of the subtraction is positive, it is not yet time to call it and the remaining code is skipped. Otherwise, the previous value is restored by adding the elapsed time (ecx) back to it; the elapsed time is also added to @#@ToutListPtr.ElapsTime, which stores the amount of time passed since the last update of the Value member. So the VMM keeps track of how much time has passed for the timeout in two possible ways. First, the remaining time is decremented until the moment to call the callback arrives. Then, as time passes, @#@ToutNode.Value remains constant, but @#@ToutListPtr.ElapsTime is incremented. This means that the time elapsed since the timeout expired is actually given by the difference between Value and ElapsTime. We will see later how this difference is passed to the callback, when it is finally invoked, and also how, at that moment, ElapsTime is zeroed.

 

Having updated ElapsTime, the code compares it with the current elapsed time (ecx). The purpose of this test is to determine whether we have already found that the timeout expired. In other words, prior to the timeout expiration, ElapsTime is 0; therefore, the first time we add ecx to it, it becomes equal to ecx. The compare detects this situation and the code to schedule the call to the callback is executed (we will see it in more detail in a while). Afterwards, ElapsTime continues to be increased until the callback is finally called and, while this happens, the compare prevents the call from being rescheduled. We will see later how, when ElapsTime is zeroed, the elapsed time is subtracted from the remaining time of the next timeout (if any), which will by then have become the first in the list.
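A minimal model of this per-tick bookkeeping for the head timeout (field names and structure are my own; schedule_event stands in for scheduling the callback's event):

```python
def tick(node: dict, elapsed_ms: int, schedule_event) -> None:
    """One clock update against the head timeout node."""
    node["value"] -= elapsed_ms
    if node["value"] > 0:
        return                          # timeout not expired yet
    node["value"] += elapsed_ms         # restore the previous value
    node["elapsed"] += elapsed_ms       # ElapsTime keeps growing from now on
    if node["elapsed"] == elapsed_ms:   # first time: ElapsTime was 0 before
        schedule_event(node)            # schedule the callback exactly once
```

Running a 12 ms timeout against 5 ms ticks, the event is scheduled once on the third tick, and afterwards only the elapsed counter grows, recording how late the callback is.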

 

At any rate, the first time the timeout is found expired, the execution of a procedure named Time_Out is scheduled as an event (I provided some background information on events in section B.4). This is a very interesting point, in that it shows that timeouts are ultimately implemented through events. Time_Out is an interface called by the event handling mechanism, which maintains the timing counters and, in the end, calls the timeout callback. On the other hand, from the viewpoint of the event handling code, Time_Out is nothing but an event callback.

 

We see in the call to Call_Restricted_Event that the flag PEF_WAIT_PREEMPTABLE is specified. This flag is not documented in the DDK, although its name can be found in vmm.inc. I will return to this flag later, when examining ring 0 interrupts.

 

The other blocks in UpdSClk.Asm (Listing 8) implement timeouts specific to a particular thread or VM.

 

For VMs, we see that every cb_s encapsulates an instance of @#@ToutListPtr which points to the list of timeouts specific to it. The list is managed in the same way as the one we just examined, with the exception that the event is scheduled as a VM event, which ensures it will run in the context of the specified VM.

 

It is interesting to note that the code in Update_System_Clock shows that timeout management is only carried out for the current VM, while nothing is done for the other ones; it is almost as if, as far as they are concerned, no time has passed. This is perfectly consistent with what the Windows Me DDK has to say about VM timeouts:

 

"Schedules a time-out that occurs after the specified virtual machine has run for the specified length of time. The system calls the time-out callback procedure only after the virtual machine has run for Time milliseconds. Time that elapses while other virtual machines run is not counted."

 

This section of code also adds the elapsed time in ecx to a dword member at offset +14h inside cb_s (I named it CBX_ELAPSED_TIME). As for the VM timeouts, this member is updated only when the virtual machine happens to be current, so its value should represent the time the VM has spent running. The value of this counter is probably the one returned by the (documented) VMM service Get_Last_Updated_VM_Exec_Time. This is what we find in the Windows Me DDK:

 

"When the system creates a virtual machine, it sets the execution time for the virtual machine to zero. The system increases the execution time only when the virtual machine actually runs. Therefore the execution does not reflect the length of time the virtual machine has existed, but indicates the amount of time the current virtual machine has run. Note however that any code executed in the indicated virtual machine contributes to the tally; it is not the case that one second of virtual machine execution time translates into one second of actual CPU time given to the application."

 

This is consistent with our findings.

 

Another Update_System_Clock code block performs analogous operations on a timeout queue specific to the current thread. The concepts are the same and lead to similar guesses for the implementation of the service _GetLastUpdatedThreadExecTime.

Finally, the very first of these code blocks, immediately following the update of _System_Time, performs the same management for yet another timeout queue, pointed to by a static variable which I named @#@TOutDpcListParent. What is peculiar about these timeouts is that they are not enqueued as events; rather, a callback is enqueued through a call to _KeInsertQueueDpc@12. I will return to DPCs later.

 

The ret instruction which closes Update_System_Clock is the end of the VID_Hw_Int_Proc, and execution is transferred to VPICD_Common_Master_Int, where it is determined that no other handler must be called and another ret is executed. This closes the 2nd level handler and execution returns after the call through PM_Fault_Table. We have already seen that what follows is a jump to Return_To_VM, which will finally resume the interrupted code.

 

1.2.6 - The "Last Updated" Timing Services vs the Standard Ones.

 

Together with the Get_Last_Updated_VM_Exec_Time, the VMM offers the companion service Get_VM_Exec_Time. The difference between the two is documented as follows:

 

For each query service there are two variants, the standard form and the last updated form (for example, _GetThreadExecTime and _GetLastUpdatedThreadExecTime). The standard form returns the time to millisecond accuracy, whereas the last updated form returns the time only to an accuracy of approximately 50 milliseconds. The difference is that the standard form will ask the timer device to give the time to millisecond accuracy, and use the result to compute the value to return, whereas the last updated form returns the value most recently obtained by a standard form call, or by the timer device explicitly updating the system clock (which happens on every timer tick).

 

This has two implications.

 

First, the statement above implies that there is a timer tick every 50 ms, while we saw that, actually, there is one every 5 ms. I noticed that the PIT frequency is different between Windows 95 and Windows Me. I didn’t check the numbers, but I suspect that 50 ms was the Windows 95 period and that the documentation has not been updated. It seems that, from Windows 98 on, all the documentation about VxDs has simply been carried over from the previous versions, while more attention has been given to WDM topics. The Windows Me DDK has a section about new VxD functions in Windows 98, but it does not mention this change of frequency (Windows 98 shows the same factors and hence the same frequency as Me). By the way, this increase in frequency is certainly one of the factors which make Windows 98 slower than Windows 95.

 

The second implication comes from this question: how can the standard timing services return the time with an accuracy smaller than the time between two PIT interrupts? Windows has no way of knowing how many of the 5 (or 50) milliseconds have elapsed since the last interrupt occurred. There should be no other counter running faster than the one incremented by the PIT, which is the basis for all the timing services. We must conclude that there are only two ways in which the 1 msec accuracy can be reached.

 

The first way would be to have another device (perhaps another of the PIT counters) interrupt with a frequency of at least 1 kHz. This is hard to believe, because it would slow down the system.

 

The second way, which I suppose is the one Windows uses, is simply to wait for the next timer tick. This could be the meaning of asking the timer device the time to millisecond accuracy: the timer will return the result only when a new PIT interrupt occurs. Since 5 msec is an enormous time compared with the CPU speed, the timer cannot simply spin in a loop waiting for the interrupt. It will have to suspend the calling thread until the interrupt comes, perhaps by means of a timeout.

 

 

1.3 - Return_To_VM

 

This procedure is enclosed in RetToVM.Asm (Listing 3).

 

1.3.1 - The Main Event Loop

 

Return_To_VM is a rather complex procedure which also contains the code we already dissected, which sets up the ring 0 environment. It seems that the authors chose to group in a single place all the CPL switching code in either direction. I will start by describing the execution path taken by our interrupt coming from ring 3. Later, I will add considerations about interrupts at ring 0 which reenter ring 0.

 

It’s interesting to note that the first instruction in Return_To_VM is a cmp eax,259D59h which doesn’t seem to do anything useful. The flags set from this instruction are changed by subsequent ones, without having been tested. There are other useless instructions here and there in the code. I also noticed that the constant operand (259d59h) is different between Windows 98 SE and Windows Me. It is possible that these instructions are a sort of signature of the build of the kernel.

 

Apart from this, between the procedure beginning and the label @#@GetServEvent, Return_To_VM performs a loop in which the event queues are scanned and all the serviceable ones are processed (see section B.4 for more information about events). This loop is the one which services the pending events. It calls Get_Serviceable_Events, whose main job is to find the first event which can be serviced and return its handle in eax. The handle, in turn, is a pointer to an instance of an @#@EvStr structure - see Struct.Asm (Listing 1) - although it doesn’t point to the first byte, but rather at the CallBack member. If an event is found, it is serviced by calling Process_Event and the loop is repeated. When Get_Serviceable_Events returns 0, there are no more events which can be processed now.

 

We will analyze Get_Serviceable_Event and Process_Event in a while, but, for now, let’s concentrate on Return_To_VM.

 

1.3.2 - Back to Ring 3 at Last

 

When Get_Serviceable_Event returns 0, which means that there aren’t any more of them, this procedure does some manipulation of a flag in the tcb_s (the meaning of which is, honestly, rather obscure), then tests the VMSTAT_PM_EXEC bit in the VM flags to determine whether the ring 3 application is a V86 one or a protected mode one. I will concentrate on the protected mode case (which goes on at my @#@RetToVmInPm label).

 

The stack pointer is then exchanged with the content of the tcb_s member named TCBX_ESP0_VMM. From section 1.1.7, we know that this member contains the address of the client register structure built by the status saving code at the beginning of the interrupt. After the xchg, it contains the ring 0 stack pointer, which will be used again the next time this thread enters ring 0. So this xchg puts the ring 0 stack pointer to rest and sets esp to point to the ring 3 status saved immediately after the interrupt.

 

The rest of the code reloads all the saved ring 3 registers, then performs an iret instruction, which will resume the interrupted program. This is the end of our round trip from ring 3 to ring 0.

 

I skipped over some details about the stack management, for which you can find comments in the code.

 

In the next sections we will analyze in more depth the event servicing mechanism.

 

 

2 - Event Servicing

 

The main event servicing loop inside Return_To_VM is based on Get_Serviceable_Event and Process_Event. The first procedure retrieves the handle of the first pending event which can be serviced, while the second one actually services it. Get_Serviceable_Event is actually preceded by the symbol Check_Serviceable_Event, placed a few instructions above it, which is also targeted by "call" instructions in other parts of the code. I therefore chose to list the code from Check_Serviceable_Event, placing it in a procedure with the same name and containing the label Get_Serviceable_Event. In the next section, I will describe the whole procedure, highlighting the differences for the case when it is entered at Get_Serviceable_Event.

 

2.1 – Check_Serviceable_Event

 

The procedure is listed in CkSvrEv.Asm (Listing 9).

 

2.1.1 - Preliminary Checks

 

Check_Serviceable_Event shows a curious behaviour. The first thing it does is check whether the variable Global_AE_Count is 0. If this is the case, it terminates, returning 0 as the event handle. The reason why I say this behaviour is curious is that Global_AE_Count only accounts for global events, while the count of VM and thread events (together) is kept in Local_AE_Count. This can be observed in Process_Event where, if the event being serviced is a VM or thread one, Local_AE_Count is decremented and Global_AE_Count is left unchanged. It is therefore strange that Check_Serviceable_Event doesn’t check whether there are pending VM or thread events, only because there aren’t global ones.

 

Incidentally, the same behaviour can be observed inside Return_To_VM: the main event loop can be exited not only when Get_Serviceable_Event returns 0, but also when Global_AE_Count drops to 0, even though Get_Serviceable_Event returned a valid handle.

 

The check on Global_AE_Count is not performed if the procedure is entered at Get_Serviceable_Event. The only other difference is that, when calling Check_Serviceable_Event, eax specifies flags related to conditions the event must satisfy, while, when calling Get_Serviceable_Event, it is ignored.

 

The next thing we see is how NMI event callbacks are implemented. These callbacks are described in the Windows Me DDK as procedures which are all called after the chain of handlers for the NMI has been executed. We see that Check_Serviceable_Event checks whether a static variable named NMI_Event_Count is 0. If this is not the case, the procedure simply terminates, returning 0ffffffffh as the event handle. If you look at Process_Event (ProcEvnt.Asm (Listing 10)) you will see that, upon receiving such a "pseudohandle", it calls another procedure which services (i.e. calls) all the installed NMI callbacks, which are kept in a separate list. So, if there are NMI events, they are serviced before the other ones.

 

2.1.2 - Event Selection

 

If there are no NMI events pending, execution goes on at the point I labelled @#@InitUnmFlags, from where the procedure determines which events can be serviced by the current thread and VM.

 

Whether an event is serviceable or not depends upon the restrictions which were specified for it when it was scheduled. The VMM services for event scheduling allow the caller to place a number of conditions which must be met for the event to be serviced, e.g. (from the Windows Me DDK documentation):

 

Callback function is not called until the virtual machine enables interrupts in all threads.

 

Callback function is not called while the VPICD is simulating a hardware interrupt.

 

Callback function is not called until the virtual machine or thread is executing in protected mode.

 

Incidentally, these conditions are defined by means of flag bits and for each of them a bit mask whose name always begins with "PEF_…" is declared. If you look at vmm.inc or vmm.h (part of the Windows Me DDK) you will find, among the usual ones, some flag bits which are not documented in the DDK manuals.

To perform its task, the program sets in ecx all the flag bits corresponding to restrictions which are not met. Later, it simply performs an AND between ecx and the flag field of every enqueued event. If the result is different from 0, at least one of the unsatisfied conditions has been specified for the event, which is discarded.
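In sketch form (the flag values below are illustrative placeholders, not the real vmm.inc definitions), the test reduces to a single AND:

```python
# Illustrative flag bits; the real PEF_* values are declared in vmm.inc
PEF_WAIT_FOR_STI     = 0x01
PEF_WAIT_NOT_CRIT    = 0x02
PEF_WAIT_FOR_PASSIVE = 0x04

def is_serviceable(event_flags: int, unmet_mask: int) -> bool:
    """unmet_mask collects the bits of all currently unmet restrictions."""
    return (event_flags & unmet_mask) == 0
```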

 

By analyzing the code which selects serviceable events, we can discover some interesting information.

 

2.1.3 – PEF_WAIT_FOR_PASSIVE

 

The instructions at address 0c0003649h show how an event having the PEF_WAIT_FOR_PASSIVE bit set can be serviced only if the current IRQL is 0. This is interesting, because that bit is undocumented. The DDK header wdm.h defines the manifest constants for the various IRQL levels and, in particular, 0 corresponds to PASSIVE_LEVEL. So we see that, by specifying that flag, we may restrict the event execution to a time when the IRQL is at PASSIVE_LEVEL. Note that, this being the lowest possible value, it means that the processor is most interruptible. Quoting from the Windows NT 4.0 DDK:

 

In general, threads run at PASSIVE_LEVEL IRQL: no interrupt vectors are masked.

 

Should we need this kind of event restriction, we could presumably use that flag in a call to Call_Restricted_Event, even if it is not listed in the documentation.

 

 

2.1.4 - Interrupt Status of the Threads of the VM

 

The code at 0c0003661 shows how an event for which the flag PEF_WAIT_FOR_STI is specified can be serviced only if the byte at –6ch from the current VM cb_s is 0. Since the DDK states that this flag imposes that the event can be serviced only when the VM enables interrupts in all threads, we can conclude that, if one or more threads have interrupts disabled, that byte contains a non-zero value. Possibly, this byte stores a count of the threads with interrupts disabled.

 

 

2.1.5 - V86 Mutex Undocumented Effect

 

The code at c000366b implements the restriction flags on the status of the critical section.

 

The interesting point is that, if the V86 mutex is owned, Check_Serviceable_Event applies the same rules it uses when the critical section is owned.

 

The V86 mutex is a documented synchronization object used to coordinate execution of V86 code (most frequently DOS and BIOS) between the multiple threads of the system VM. What is not documented is the fact that, for instance, an event scheduled to run only when the critical section is not owned will not be executed if the V86 mutex is owned either.

 

Every flag which places a restriction upon the status of the critical section places the same restriction upon the status of the V86 mutex, because the event selection code treats them the same way, and this can be concluded by observing the code at 0c000366b.

 

That code block accesses a static structure variable named _mtxCritSec and, from the way it is used in conjunction with the critical section flags, we can conclude that it implements the critical section and that the member at +10h contains the handle of the owning thread. This statement is also supported by the fact that, in Windows 98, the member at +10h was named gc_owner. Unfortunately, in Windows Me the symbol gc_owner corresponds to the same address as _mtxCritSec, but I suppose this is an error.

 

The same code accesses a member at +10h from a variable named _mtxV86, which I believe is the implementation of the V86 mutex. I can say this because, if you look at the documented service Begin_V86_Serialization, which is the one to be used to acquire the mutex, you see that it is based on a call to a procedure named _EnterMutex, which receives the address of _mtxV86. Again, the member at +10h was named mV86_owner in Windows 98, while in Me this symbol is equal to _mtxV86 (mistakenly, I believe).

 

From the code, we see that Check_Serviceable_Event behaves identically if it finds an owner for either the critical section or the V86 mutex.

 

By the way, another proof of the fact that the member at +10h of the mutex structure is the handle of the owner is given by the output of the .pmtx debugger command: the owner handle it returns is equal to the member content.

 

 

2.1.6 - Thread Priority

 

Going on with our analysis of the event conditions, we discover where the priority of a thread is stored: not surprisingly, in the tcb_s. We see (code address c0003686h) that, if the critical section is not owned, the condition specified by the event flag PEF_WAIT_NOT_CRIT may be satisfied. Whether this is the case or not depends on the content of the member at +3ch inside the tcb_s: if it is less than 100000h, the restriction is met.

 

Now, the effect of the flag is documented as follows in the DDK (from the documentation of Call_Restricted_Event):

 

"Callback function is not called until the virtual machine is not in a critical section or time-critical operation."

 

So the value at +3ch is related to being in a time-critical operation, because at that point in the code we have already determined that the critical section is not owned. In vmm.inc we find that, among the priority boost values, CRITICAL_SECTION_BOOST is equal to our threshold value 100000h, therefore the comparison we saw can be interpreted as a check that the thread priority is lower than the one associated with the critical section. In other words, we are checking that the thread is not in a "time-critical operation".
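As a sketch (the constant value matches the text's reading of vmm.inc; the helper itself is mine):

```python
CRITICAL_SECTION_BOOST = 0x100000  # boost value from vmm.inc, per the text

def wait_not_crit_satisfied(thread_priority: int) -> bool:
    """The comparison at c0003686h: priority at tcb_s+3ch below the boost."""
    return thread_priority < CRITICAL_SECTION_BOOST
```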

 

From this we learn that the thread priority is stored at tcb_s + 3ch.

 

 

2.1.7 - Thread List Walking

 

After the critical section ones, other event flags are set in ecx, but I will skip over these details, for which you can find additional information in CkSvrEv.Asm (Listing 9). At any rate, we arrive at the label @#@BeginEventSearch with our bit pattern in ecx.

 

Here the code shows how each event is represented by a node in a linked list, with each node being an instance of @#@EvStr - see Struct.Asm (Listing 1). The global event list is pointed to by the static Global_Event_List and the VM and thread lists are pointed to by members inside the cb_s (+20h) and the tcb_s (+6ch), respectively.

 

The three lists are searched for a matching event. The first one found is returned, so, for instance, if it is a global one, VM events are not considered and so on.

 

By the way, this code also shows that Global_AE_Count only concerns global events (see section 2.1.1), because if it is found to be 0, the lists of VM and, if necessary, thread events are still scanned. It is therefore rather strange that, in the initial check, Check_Serviceable_Event gives up its search if Global_AE_Count is 0. I’m sorry for not being able to offer an explanation for this fact.

 

The walking of the lists terminates either when an event is found or when every list has been unsuccessfully searched. If an event is found, the address of the Callback member inside the corresponding @#@EvStr is returned. This is what the DDK documents as the handle, returned by the event scheduling services. This can be verified by scheduling an event and comparing the handle with the addresses of the list nodes.
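
The search order just described can be modelled with a short C sketch. It is a hypothetical reduction: @#@EvStr is cut down to the fields the search needs, and the matching test is a stand-in for whatever Check_Serviceable_Event really computes; only the global-then-VM-then-thread precedence and the "handle is the address of the Callback member" conclusion come from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of @#@EvStr: just the fields the search uses. */
typedef struct EvStr {
    struct EvStr *next;
    uint32_t      flags;            /* PEF_* restriction bits            */
    void        (*Callback)(void);  /* its address doubles as the handle */
} EvStr;

/* Stand-in for the real serviceability test: the event is serviceable
   when all of its restriction bits are present in 'wanted'. */
static int matches(const EvStr *e, uint32_t wanted)
{
    return (e->flags & wanted) == e->flags;
}

static void *search_one_list(EvStr *head, uint32_t wanted)
{
    for (EvStr *e = head; e; e = e->next)
        if (matches(e, wanted))
            return &e->Callback;    /* the documented "event handle" */
    return NULL;
}

/* Global events shadow VM events, which in turn shadow thread events. */
void *find_serviceable_event(EvStr *global_list, EvStr *vm_list,
                             EvStr *thread_list, uint32_t wanted)
{
    void *h;
    if ((h = search_one_list(global_list, wanted))) return h;
    if ((h = search_one_list(vm_list, wanted)))     return h;
    return search_one_list(thread_list, wanted);
}
```

The returned pointer matches what the DDK calls the event handle, which is why comparing it with the addresses of list nodes, as suggested above, works as a verification.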

 

 

2.2 - Process_Event

 

This procedure is listed in ProcEvnt.Asm (Listing 10).

 

We are now going to analyze the second fundamental procedure related to events.

 

 

2.2.1 - Preliminary Operations

 

The first few instructions of Process_Event manipulate the properties of the descriptor for the code segment. I explained, at least in part, the reason for doing this in the comments in RetToVM.Asm (Listing 3), at the label @#@SetEventGPF, so I will skip over this detail here. You can find more details in section 4.4.3.2 also.

 

As I mentioned in section 2.1.1, Process_Event starts its job by checking whether the handle it receives in eax is 0ffffffffh and, if this is the case, processes NMI events by jumping into Process_NMI_Events. From there, execution will return to the caller of Process_Event.

 

If eax contains a specific event handle, Process_Event unlinks it from the event chain (address 0c0003779h), then tests @#@EvStr.LocGlob to determine if it is a global or local event.

 

2.2.2 - Global Events

 

Let’s see what happens for global events, first.

 

First of all, Global_AE_Count is decremented. We thus see that this counter contains the number of pending global events.

 

If the event specified a priority boost, the boost value is loaded from @#@EvStr.PriBoost and applied by calling Adjust_Thread_Exec_Priority.

 

Then the (optional) timeout associated with the event is cancelled (@#@CancTimeout). This timeout comes from the fact that it is possible to schedule an event specifying that it must be serviced within a certain amount of time (see, for instance, Windows Me DDK, Call_Restricted_Event). The system doesn’t ensure that the callback will be called within the specified time, but calls it with the carry flag set if it is late in doing so. The callback routine can therefore tell whether it has been called within the required time limit.

 

Process_Event doesn’t set the carry: it just cancels the timeout to keep things clean. Since we are not worrying about the carry here, the fact that we found the event still pending in the list must be enough to ensure that the timeout didn’t expire. This suggests that, had the timeout expired, the event would have been removed from the list and called by means of a different mechanism, which would also have set the carry. We will see later, when analyzing Time_Out, that this procedure seems to implement exactly this logic.

 

After the timeout canceling, the event callback is called and the priority boost is removed, unless the event had the PEF_DONT_UNBOOST flag set.

 

2.2.3 - Local Events

 

Now let’s look at what the differences are for local events (i.e. VM or thread events).

 

Local_AE_Count is decremented instead of Global_AE_Count.

 

For VM events, the specified boost is removed from the priority of the VM and applied to the priority of the thread (unless PEF_DONT_UNBOOST is set). This is exactly the behaviour documented in the Call_Restricted_Event section of the DDK. This fact has, however, an interesting implication: the boost is removed from the VM, so it must have been applied before. This is confirmed by another fact: for thread events, no boost is applied at all before calling the callback, but the boost, if present, is still removed after the call, if PEF_DONT_UNBOOST is clear.

 

So we must conclude that the boost has already been applied before arriving at Process_Event. And indeed this makes a lot of sense: the reason for scheduling an event with a priority boost is usually the need to have it serviced quickly. A local event is bound to run only in the specified VM or thread, but it wouldn’t do to wait for the VM or thread to become the current one before applying the boost. On the contrary, the boost must be applied shortly after the scheduling, to raise the chances of the VM/thread gaining control and quickly servicing the event.

 

Does this mean, for instance, that Call_Restricted_Event calls Adjust_Exec_Priority to raise the VM priority when the event is scheduled? Probably not. We must recall that event scheduling services are used inside hardware interrupt handlers, to defer non-asynchronous operations. Adjust_Exec_Priority is not an asynchronous service: it can cause a thread switch and therefore can’t be called at hardware interrupt time.

 

Instead, Call_Restricted_Event and the like must use some other mechanism to "preset" a priority raise, to be applied as soon as possible. Maybe they directly write into the cb_s or tcb_s member where the priority is stored.

 

The only other difference between local and global events is a manipulation, whose meaning I haven’t been able to discern, of two cb_s members when the flag PEF_WAIT_FOR_STI is set (see @#@TestWaitForSti).

 

3 - Timeouts

 

3.1 - Timeouts and Events

 

Let’s start by summarizing what we know about timeouts. We have seen in section 1.2.5 that the status of the timeout queues is checked by Update_System_Clock which, in turn, is periodically called because of the timer interrupt. If Update_System_Clock determines that the first timeout in a list (there are 4 of them) has expired, it schedules an event having Time_Out as callback. Time_Out will therefore be called to service timeouts by means of the event servicing mechanism.

 

When the event is scheduled, the pointer to the @#@ToutListPtr structure (which stores the pointer to the list and other data) becomes the event reference data. The event servicing mechanism loads the reference data into edx before calling the callback, so Time_Out will be called with this pointer in edx and will access the timeout list data through this register.

 

 

3.2 - Timeouts Processing

 

3.2.1 - Basic Timeout Servicing

 

Now let’s start with the analysis of Time_Out, for which you can find the code in TimeOut.Asm (Listing 15).

 

From Update_System_Clock it seems that Time_Out is called only when it has already been determined that the first timeout has expired; nevertheless, it is conservative enough to check that this is the case. Should the timeout still have time left, the procedure would return without calling the callback.

 

If the timeout has expired, it is removed from the list by calling Time_Out_Unlink, listed in ToutUlnk.Asm (Listing 16). This procedure removes the node from the chain, but it also adds the timeout remaining time (@#@ToutNode.Value, see section 1.2.5) to the corresponding member of the next node.

 

I think we must conclude from this that the member @#@ToutNode.Value stores the remaining time for the timeout, relative to the preceding node. The absolute remaining time for a timeout is thus given by the sum of all the Value members of the preceding nodes plus its Value member. The only node for which Value is equal to the absolute time is the first one. Time_Out_Unlink removes the first node in the list and updates the value of the new first node, transforming it into an absolute time.
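
The delta-encoding argument can be made concrete with a small C model. It is only a sketch of the inferred layout: @#@ToutNode is reduced to the two fields the reasoning needs, and timeout_unlink / absolute_remaining are hypothetical names for the behaviour attributed above to Time_Out_Unlink.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of @#@ToutNode: remaining time is stored as a
   delta relative to the preceding node, as argued in the text. */
typedef struct ToutNode {
    struct ToutNode *next;
    uint32_t         value;   /* delta ticks; absolute only for the head */
} ToutNode;

/* Model of Time_Out_Unlink: pop the head and fold its remaining time
   into the new head, turning that node's delta back into an absolute
   time. */
ToutNode *timeout_unlink(ToutNode **head)
{
    ToutNode *first = *head;
    if (!first) return NULL;
    *head = first->next;
    if (*head)
        (*head)->value += first->value;
    return first;
}

/* Absolute remaining time of a node = sum of the deltas of all the
   preceding nodes plus its own Value, exactly as derived in the text. */
uint32_t absolute_remaining(const ToutNode *head, const ToutNode *target)
{
    uint32_t sum = 0;
    for (const ToutNode *n = head; n; n = n->next) {
        sum += n->value;
        if (n == target) return sum;
    }
    return 0; /* not on the list */
}
```

Note how unlinking the head leaves every other node's absolute remaining time unchanged, which is what makes the delta representation cheap to maintain.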

 

Afterwards, Time_Out zeroes the ElapsTime member of the @#@ToutListPtr parent structure and calls the callback. At the moment of the call, ecx stores the difference between ElapsTime and Value. This represents the time elapsed since Value was last updated, minus the remaining timeout time, i.e. the time elapsed since the timeout expired. This is the tardiness, which the DDK documents as being passed to the callback in ecx at the moment of the call. See also section 1.2.5.

 

Having serviced the callback, the @#@ToutNode instance is enqueued into what looks like a list of recyclable nodes.

 

 

3.2.2 – Who’s Next?

 

While Time_Out is servicing the first callback in the line, it also checks whether there is someone else waiting after it. If this is the case, there are two possibilities.

 

The second timeout may not have expired yet. In this case, its remaining time is simply updated by subtracting ElapsTime from it. This is necessary because ElapsTime is about to be zeroed, so the elapsed time must be accounted for directly in the Value member. This is the normal behaviour when the timeout hasn’t expired yet.

 

On the other hand, if the second timeout has expired as well, Time_Out schedules another event having itself as callback. This way, when Time_Out returns and the VMM goes on servicing events, it will call Time_Out again, which will service the timeout and, if need be, schedule itself yet again should it find that the next in line has expired. In this case, @#@ToutListPtr.ElapsTime is not zeroed and the new timeout’s remaining time is left unchanged. This is consistent with what Update_System_Clock does when it finds that the first timeout in the line has expired. We are, therefore, setting up the same situation: the timeout has expired, its tardiness is given by ElapsTime – Value, and Time_Out is enqueued as an event.

 

The event is scheduled as a global, VM or thread event, depending on which timeout line it was enqueued in.

 

 

3.3 - Timeouts for Events

 

We saw how timeouts are implemented by means of events. Now we are going to see that they are used to service events also.

 

Some of the event scheduling services allow the caller to specify a timeout for the event. If the VMM succeeds in calling the callback only after the specified time has passed, the callback is called with the carry set.

 

Call_Priority_VM_Event is one of the services which allow specifying a timeout. The service is intended to schedule a VM-specific event (one that can be called only in the context of a specific VM), but the DDK states that, if the timeout expires, the callback is called as soon as possible, regardless of which virtual machine is running. Thus, the callback has the responsibility of checking the carry to determine whether it’s running in the expected VM.

 

The service Call_Restricted_Event allows scheduling similar events for a specific thread. For it, the Windows Me DDK doesn’t state that the callback can be called in any thread if the timeout expires but, given the similarities between the two services, I suppose this is what happens.

 

We are now going to look at a code block inside Time_Out which, I believe, implements this functionality.

 

I think that, whenever an event with a timeout is enqueued, the VMM schedules a timeout to be able to detect when the specified amount of time has passed. We already saw in section 2.2.2 how, if an event is processed before its timeout expires, the latter is cancelled. Inside Time_Out we find the other half of the story: what happens if the timeout fires before the event has been processed.

 

The code we are going to look at is at @#@RestrictedTimeout inside TimeOut.Asm (Listing 15). This label is reached if the member @#@ToutNode.Flags is not 0; here, presumably, are stored the flags specified for the associated event. If execution arrives here, instead of calling the timeout callback, a global event is scheduled, having as callback a VMM procedure named Time_Out_Worker. The event is scheduled using the content of the @#@ToutNode.Flags member as the event flags. In this way, Time_Out_Worker will be called with the same restrictions as the original event, except that this event is global, i.e. the function is called "immediately regardless of which virtual machine is currently running" (from the documentation of Call_Priority_VM_Event).

 

Time_Out_Worker will take care of setting the carry flag before calling the callback. The pointer to the @#@ToutNode is passed as the event reference data, so Time_Out_Worker will find it in edx when it is called.

 

In this case, the timeout node is not appended to the list of recyclable ones (although it has already been unchained from the list of pending ones), because Time_Out_Worker will need it. It will thus be this procedure which recycles the node later.

 

As a final note, the event is scheduled with the Call_Restricted_Event service, which is actually one of those that allow specifying a timeout. It is very unlikely, though, that this functionality is used here, because we are already responding to an expired timeout. Whether a timeout is requested or not depends on the event flags, which we don’t know, because they are copied from the @#@ToutNode.Flags member, i.e. they are not explicitly coded in the program. Nevertheless, the DDK states that, if a timeout is requested, its value must be loaded into edi, which is otherwise ignored. It can be observed from Time_Out that, at the moment of the call to Call_Restricted_Event, edi holds the current thread handle and not a timeout value. This allows us to say that the event being scheduled has no timeout, which fits with our analysis.

 

So we have seen how, in this case, a timeout is transformed into an event, probably because it had been scheduled to service yet another event.

 

 

 

4 - When the VMM Is Reentered.

 

In this section we will cover the case of an interrupt occurring in a thread which is already at ring 0. The most interesting facts we will see are that some kinds of events can be serviced and that a thread switch may even occur, although this is not documented in the DDK.

 

The code mainline for a reentered interrupt is inside Return_To_Vm - RetToVM.Asm (Listing 3). We saw in section 1.1.5 how, at @#@SetPmAppRing0Env, reentrancy is detected by looking at esp. Let’s see what happens at @#@ReenterVmm.

 

4.1 - Handling of Exceptions on Segment Reloading

 

The first interesting thing we see is that the VMM checks whether the interrupt number is between 0bh and 0dh. This range includes two fundamental processor exceptions: the stack fault (0ch) and the general protection fault (0dh).

 

The VMM wants to know if it has stepped on such an exception while reloading the ring 3 segment registers, immediately before returning to the ring 3 program. In other words, the VMM is worrying about an exception caused by itself, due to an invalid segment register value that was being used by the client program. To determine if this is the case, the VMM compares the address of the faulting instruction with the addresses of the code block that reloads the segment registers inside Return_To_VM.

 

If the VMM finds that a GP fault occurred reloading a ring 3 segment register, it simply zeroes the saved ring 3 value and then resumes the reloading. The processor allows loading 0 into a segment register without raising exceptions. In this case, execution goes on without adverse effects on the program about to be resumed, other than the zeroing of a segment selector that was invalid anyway.

 

Exceptions other than a GP fault are "reflected" to the ring 3 program, meaning that the stack and the ring 3 status are adjusted as if the exception had occurred on the ring 3 instruction whose address is saved on the stack (i.e. the ring 3 instruction at which execution should have resumed). The VMM redirects itself to execute the handler for an exception coming from ring 3 (and not from ring 0), as if the ring 3 program had been interrupted by the exception.

 

4.2 - Implementation of the VMM Reentrancy Count

 

The VMM reentrancy count is a documented counter which stores the number of times the VMM has been reentered (i.e. the number of nested interrupts at ring 0). It is worth noting that it is implemented inside Return_To_VM, near @#@NormalHandling: it is simply incremented along the path of a reentering interrupt and decremented before returning.

 

4.3 - The Call to the Second Level Handler

 

As in the ring 3 interrupt case, Return_To_VM provides a general framework and then hands control to a second level handler whose address is stored in a table. The table for reentering interrupts is different from the one for ring 3 interrupts: the former is named VMM_Fault_Table and the latter PM_Fault_Table. It is interesting to note that, at the moment of the call, ebp does not point to the client register structure (i.e. to the ring 3 execution status). Instead, it points to the last dword pushed on the stack by the pushad following the interrupt (which is the copy of EDI).

 

This is consistent with what the DDK says about second level handlers for interrupts from ring 0 (see Hook_VMM_Fault):

 

"The system disables interrupts and calls the fault handler as follows:

mov     ebx, VM                 ; current VM handle
mov     ebp, OFFSET32 stkfrm    ; points to VMM re-entrant stack frame
call    [FaultProc]
 

The VM parameter is a handle identifying the current virtual machine, and the stkfrm parameter points to the VMM re-entrant fault stack frame.

The EBP register does not point to a client register structure. "

 

So we see what the reentrant stack frame is.

 

Passing it to the installed handler makes sense, because any VxD can always retrieve the address of the client register structure from the thread or VM handle, while the reentrant frame is only known at this time. The second level handler thus has the chance of saving it somewhere before, possibly, modifying ebp.

 

Finally, it is useful to remember that having two different tables doesn’t necessarily mean having two different sets of handlers. For instance, both tables store the same address for the handler of the PIT interrupt: Hw_IRQ_0.

 

 

4.4 - Deferred Procedure Calls (DPCs) and Preemptable Events.

 

This section covers what I think is the most interesting part of a nested interrupt.

 

The DDK states that the VMM processes events only when it is about to return to ring 3. In this section we will see how the VMM actually processes a special kind of event in a nested interrupt as well.

 

Deferred procedure calls are a concept for which I have not been able to find background documentation in the DDK. Here we are going to see what they are and how they work.

 

Our analysis starts after the call to the second level handler, which has done whatever was necessary to service the interrupt.

 

4.4.1 - Effect of the IRQL

 

By examining the code (below @#@ReenterEpilog) we see that, if the IRQL is at or above 2, we just restore the processor registers and return to the interrupted code (without forgetting to decrement VMM_Reenter_Count). In this case, all the operations we are going to analyze in the next sections are bypassed. Level 2 is defined in wdm.h as DISPATCH_LEVEL.

 

This is another situation in which the system is less willing to be disturbed if the IRQL is above a certain threshold. Events and DPCs are callbacks waiting to be called, whose effect is unknown to Return_To_VM. If the IRQL is such that execution had better be straightforward, they are left waiting in the line.

 

Return_To_VM chooses the same behaviour if it finds that the current stack segment is not the standard one. This is probably a conservative measure: if the interrupted code was toying with the stack, let’s not put too much on the plate.

 

4.4.2 - DPCs

 

If everything is fine with the stack and the IRQL, we arrive at the DPC processing (0c00015bbh). Before going on with the code, let’s introduce the concepts.

 

4.4.2.1 - DPCs Concepts

 

DPCs are similar to events, in that they are callback functions waiting to be executed.

 

The Windows 98 DDK, at least, documents a number of routines, part of WDM, where DPCs pop up. This documentation has, unfortunately, disappeared from the Windows Me DDK, at least in the first version available through the Internet.

 

In particular, in the Windows 98 DDK we find two interesting routines: KeInitializeDpc and KeInsertQueueDpc.

 

KeInitializeDpc is used to initialize a structure of type KDPC which implements the DPC object. The caller provides the storage for the structure, then calls this routine to initialize it with the address of the callback and other parameters. The layout of KDPC (see wdm.h and ntdef.h for the LIST_ENTRY type) is consistent with that of the structure we see used in the code, which I called @#@DpcListNodeS. It is very likely that they are the same structure.

 

KeInsertQueueDpc is used to enqueue a call for an existing and initialized DPC object. It is interesting to note that the documentation says:

 

"KeInsertQueueDpc queues a DPC for execution when the IRQL of a processor drops below DISPATCH_LEVEL."

 

We have already seen that DPCs are not processed if the IRQL is at or above DISPATCH_LEVEL.

 

 

 

4.4.2.2 - DPCs Processing

 

Returning to the code in RetToVM.Asm (Listing 3), a static variable named _DPC_Count is checked: if it is 0, execution goes on to preemptable events.

 

Otherwise, _ProcessDpcs@0 – see ProcDpcs.Asm (Listing 11) – is called. From this procedure we see that each DPC is represented by a node of a linked list, an instance of @#@DpcListNodeS – see Struct.Asm (Listing 1). Each node stores the address of a callback and other parameters. _ProcessDpcs@0 walks the list, unchains each node and calls its callback.

 

DPCs are not very different from events, but they are called only when returning from a nested interrupt; when returning to ring 3, they are ignored. They are probably used to schedule work that can’t wait for the return to ring 3. We will also see that they are called immediately when the IRQL drops below dispatch level.

 

Before calling _ProcessDpcs@0, Return_To_VM temporarily raises the IRQL to 2. We know that this inhibits DPC processing for nested interrupts. This is done here to prevent a subsequent interrupt, occurring while DPC processing is in progress, from messing with the DPC list.

 

It is interesting to note that every list node is unchained and serviced, but nothing is done to recycle it or free the storage it occupies. This is consistent with the documentation we saw in the previous section: the storage for the DPC object is provided by the caller. The calling code thus has the responsibility of freeing the storage when it no longer needs the DPC object. For events, on the other hand, a different policy is adopted: the list nodes are generated and managed internally by the VMM, which enqueues them in a recycle list once it has processed them.
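
The caller-provided-storage policy just described can be sketched in C. This is a model, not the real code: the structure is a guess at the spirit of @#@DpcListNodeS / KDPC, the globals stand in for _DPC_Count and the list head, and the function names (other than the commented references) are invented for the illustration.

```c
#include <stddef.h>

/* Hypothetical reduction of the KDPC / @#@DpcListNodeS layout: the caller
   provides the storage, the VMM only links and unlinks it. */
typedef struct Dpc {
    struct Dpc *next;
    void      (*routine)(void *ctx);
    void       *ctx;
} Dpc;

static Dpc *dpc_queue;   /* stands in for the list behind _DPC_Count */
static int  dpc_count;   /* stands in for _DPC_Count                 */

/* In the spirit of KeInitializeDpc: fill in caller-owned storage. */
void dpc_initialize(Dpc *d, void (*routine)(void *), void *ctx)
{
    d->next = NULL;
    d->routine = routine;
    d->ctx = ctx;
}

/* In the spirit of KeInsertQueueDpc: enqueue a call for the object. */
void dpc_insert_queue(Dpc *d)
{
    d->next = dpc_queue;
    dpc_queue = d;
    dpc_count++;
}

/* Model of _ProcessDpcs@0: unchain every node and call its routine.
   Note the storage is NOT freed or recycled - that is the caller's job. */
void process_dpcs(void)
{
    while (dpc_queue) {
        Dpc *d = dpc_queue;
        dpc_queue = d->next;   /* unchain first, then call */
        dpc_count--;
        d->routine(d->ctx);
    }
}

/* Example routine: bumps the int its context points to. */
static void count_hit(void *ctx) { ++*(int *)ctx; }
```

The contrast with events is visible in process_dpcs: the node simply drops out of the queue, while the event code we saw earlier moves its internally-allocated nodes to a recycle list.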

 

4.4.3 - Preemptable Events

 

Going on with the code in RetToVM.Asm (Listing 3), we arrive at @#@CheckPreemptable. As in the previous section, I would like to introduce the subject before delving into the code.

 

 

4.4.3.1 – Preemptable Events Concepts

 

The VMM allows a thread to specify that it is in a preemptable state, which means that it can tolerate the processing of events scheduled with the (undocumented) PEF_WAIT_PREEMPTABLE flag. We can see this by looking at an undocumented procedure named _Begin_Preemptable_Code, which you can find in BegPreCd.Asm (Listing 12). I will refer to events with that flag set as "preemptable events".

 

From BegPreCd.Asm (Listing 12) we learn that every time this procedure is called, it increments a counter stored in a tcb_s member (undocumented as well), which I named TCBX_PREEMP_CNT. The VMM provides a companion procedure named _End_Preemptable_Code, whose only action is to decrement this counter for the current thread; you can find it in EndPreCd.Asm (Listing 13). It is thus apparent that the VMM maintains a count of how many times a thread has declared itself preemptable. We will see later how, when returning from a ring 0 interrupt, the VMM checks whether this counter is different from 0 and, if so, processes preemptable events. Thus, a thread that called _Begin_Preemptable_Code a number of times must call _End_Preemptable_Code the same number of times before the counter drops to 0 and the VMM stops using the thread to process preemptable events.

 

Having incremented the counter, _Begin_Preemptable_Code checks whether there are pending preemptable events, by inspecting wait_preemptable_count. If this is the case, it checks whether the IRQL is below DISPATCH_LEVEL and, if so, services the pending events. This means that as soon as the thread enters the preemptable state, it is used to service preemptable events, though this too is subject to the effect of the IRQL. It is interesting that an IRQL at or above DISPATCH_LEVEL inhibits both DPCs and preemptable events.

 

The waiting events are serviced by a loop similar to the one we find in Return_To_VM, which calls Check_Serviceable_Events and Process_Event. By calling the former with the flag PEF_WAIT_PREEMPTABLE set in eax, the loop selects only the events having that flag set and feeds them to the latter.

 

Before doing this, TCBX_PREEMP_CNT is temporarily zeroed, in order to prevent interrupts from servicing preemptable events. This is necessary because, as we are going to see, the VMM processes preemptable events when returning from an interrupt that occurred at ring 0. Before returning, TCBX_PREEMP_CNT is restored, since we actually came here to raise it.

 

So now we know that a thread can inform the VMM that it is in preemptable code, which has the effect of incrementing a counter and triggering preemptable events. Now let’s see the implications of these facts for ring 0 interrupts.
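
The bookkeeping described in this section can be condensed into a small C model. It is a sketch under the assumptions stated above: TCBX_PREEMP_CNT is really a per-thread tcb_s member (a global here), the names of the C functions mirror but do not reproduce the real procedures, and servicing an event is reduced to bumping a counter.

```c
#include <stdint.h>

#define DISPATCH_LEVEL 2

/* Model state: per-thread in the real VMM, globals in this sketch. */
static uint32_t tcbx_preemp_cnt;        /* nesting count of preemptability */
static uint32_t wait_preemptable_count; /* pending PEF_WAIT_PREEMPTABLE    */
static uint32_t current_irql;
static uint32_t events_serviced;        /* stand-in for calling callbacks  */

static void service_preemptable_events(void)
{
    /* Zero the counter while servicing, so a nested interrupt checking it
       will not re-enter this path; restore it afterwards. */
    uint32_t saved = tcbx_preemp_cnt;
    tcbx_preemp_cnt = 0;
    events_serviced += wait_preemptable_count;
    wait_preemptable_count = 0;
    tcbx_preemp_cnt = saved;
}

/* Model of _Begin_Preemptable_Code: raise the count and, if events are
   pending and the IRQL allows it, drain them immediately. */
void begin_preemptable_code(void)
{
    tcbx_preemp_cnt++;
    if (wait_preemptable_count && current_irql < DISPATCH_LEVEL)
        service_preemptable_events();
}

/* Model of _End_Preemptable_Code: just undo one increment. */
void end_preemptable_code(void)
{
    tcbx_preemp_cnt--;
}
```

Note how an IRQL at or above DISPATCH_LEVEL leaves the events pending, matching the inhibition we saw for DPCs.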

 

4.4.3.2 - Preemptable Events Processing

 

The code at @#@CheckPreemptable checks wait_preemptable_count and, if it is equal to 0, returns control to the interrupted code. If there are pending events, the TCBX_PREEMP_CNT of the current tcb_s is compared with 0 and, if it is found to be different, they are serviced. Again, before servicing them, TCBX_PREEMP_CNT is temporarily zeroed to prevent nested interrupts from reentering this path of execution.

 

So we see that a nonzero TCBX_PREEMP_CNT has the effect of enabling the processing of a special kind of events, even if the VMM has been reentered.

 

There is also another peculiar detail: the VMM reentrancy count is decremented, as if the VMM wanted the event callbacks to believe they are called in the outermost instance of execution.

 

Preemptable events are processed by calling _Process_Preemptable_Events, which you can find in PrcPreEv.Asm (Listing 14) and which actually jumps to the event processing loop inside _Begin_Preemptable_Code - BegPreCd.Asm (Listing 12).

 

From Return_To_VM we see that preemptable events can be processed even if TCBX_PREEMP_CNT is 0 (see @#@ProcPreemptable). The eip saved by the interrupt, i.e. the address at which execution should resume, is compared with a static named _pPreemptBase. If eip is above it, preemptable events are processed anyway. It is as if the VMM reserved an address range for code which is always preemptable. I have always found _pPreemptBase equal to 0ff000000h.

 

If Return_To_VM finds that there are pending preemptable events but can’t process them, it adopts a peculiar behaviour (see also @#@SetEventGPF). It lowers the limit of the code segment descriptors to a value which depends on _pPreemptBase: with _pPreemptBase being 0ff000000h, the new limit is 0ff000fffh. Then it sets a flag in a static named _fPreempt, which modifies the handling of the next GP fault.

 

When a GP fault occurs, Return_To_VM detects that the flag is set, clears it and restores the original code segment limit. Then it jumps to the fault epilogue without executing the second level handler, thus "masking" the fault from the rest of the system. The epilogue detects this situation (by means of esi set to 0ffffffffh) and services preemptable events, even though TCBX_PREEMP_CNT is zero. Then the faulting program is resumed without further actions.

 

It seems, therefore, that the VMM can cause a dummy GP fault by fetching an instruction at an address above the lowered limit, in order to trigger the event processing.

 

Return_To_VM can’t be sure that the first GP fault to happen will be the dummy one. Should a real GP fault occur, it would repeat itself, because the faulting instruction is resumed after the event processing. On this second occurrence, _fPreempt will be found clear and the normal GP fault handling will take place.
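
The arm / fault / disarm sequence of the last few paragraphs amounts to a small state machine, which can be sketched in C. Everything except the two constants (0ff000000h and the derived 0ff000fffh limit, both quoted from the text) is a modelling assumption: the descriptor limit and _fPreempt become plain variables, and servicing preemptable events becomes a counter bump.

```c
#include <stdbool.h>
#include <stdint.h>

#define PPREEMPT_BASE 0xFF000000u   /* the observed _pPreemptBase value */

static uint32_t cs_limit = 0xFFFFFFFFu; /* flat code segment limit       */
static bool     f_preempt;              /* stands in for _fPreempt       */
static int      preemptables_run;       /* stand-in for event servicing  */

/* Model of @#@SetEventGPF: lower the limit and arm the flag, so that the
   next instruction fetch above the limit raises a GP fault on purpose. */
void arm_dummy_gp_fault(void)
{
    cs_limit = PPREEMPT_BASE + 0xFFFu;  /* 0ff000fffh, as observed */
    f_preempt = true;
}

/* Model of the GP-fault path: returns true if the fault was the dummy
   one and was absorbed, false if normal fault handling must proceed. */
bool on_gp_fault(void)
{
    if (!f_preempt)
        return false;                   /* a real fault: handle normally */
    f_preempt = false;
    cs_limit = 0xFFFFFFFFu;             /* restore the original limit    */
    preemptables_run++;                 /* service preemptable events    */
    return true;                        /* resume the faulting code      */
}
```

A real fault that slips in while the trap is armed is absorbed once and then, as the text notes, repeats with _fPreempt clear and goes down the normal path.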

 

One of the most interesting uses of preemptable events is timeouts. We have already seen in section 1.2.5 that Update_System_Clock, which is periodically called by the PIT interrupt, checks whether one or more timeouts have expired and, if so, schedules an event for a procedure named Time_Out. This event is a preemptable event, and this is not surprising, since it is a good idea to service it as soon as possible, without having to wait for the VMM to return to ring 3.

 

By analyzing how Windows Me manages multitasking, it can be found that thread switching actually occurs because of a timeout, which calls a procedure that determines whether the current thread has used up its time slice and, if so, resumes another one. Since this chain of events can happen when timeouts, and hence preemptable events, are serviced, we can conclude that, when a thread calls _Begin_Preemptable_Code, it can, actually, be... preempted.

 

Finally, it is important to recall that preemptable events are not serviced only when returning from a ring 0 interrupt. They can also be serviced during the general event servicing phase which precedes the return to ring 3. It can be observed that, every time it is called, Process_Event checks whether PEF_WAIT_PREEMPTABLE is set for the current event and, if so, decrements wait_preemptable_count. In this, preemptable events differ from DPCs, which are serviced only at the end of reentered interrupts.

 

5 - IRQL Management

 

The time has come to keep the promise I made in section 1.2.2 to analyze two procedures named @KfRaiseIrql@4 and @KfLowerIrql@4. Now that we know about DPCs, preemptable events and their relationship with the DISPATCH_LEVEL IRQL, we can better understand what these two procedures do.

 

5.1 - @KfRaiseIrql@4

 

The code for this procedure is enclosed in RaisIrql.Asm (Listing 17).

 

This procedure is a thin wrapper around KpSetIrql, which is the one used to both raise and lower the IRQL.

 

It checks whether the new IRQL value it receives in cl is actually higher than the current one. If this is not the case, it just returns. Otherwise, it calls KpSetIrql to do the job. The latter must be called with a flags image pushed on the stack, cl containing the new IRQL and ch containing the current one.

 

5.2 - KpSetIrql

 

The code for this procedure is enclosed in SetIrql.Asm (Listing 18).

 

The procedure immediately sets the new IRQL by writing it into the static CurrentIrql. Afterwards, it updates the status of the interrupt lines according to the newly set IRQL.

 

In order to do this, it uses two tables, which I call the "index table" and the "jump table".

 

The first one is used to calculate a value which is a function of both the former and the new IRQL, i.e. of the transition. The code uses the former IRQL as an index into the table, from which it loads a byte value. This value is then shifted left by two positions. The new IRQL is then used as an index as well, to fetch another byte, which is ORed with the previous value.

 

The byte values are actually between 0 and 2, so they only occupy two bits. The value corresponding to the former IRQL ends up in bits 2-3 of the result, while the one relative to the new IRQL occupies bits 0-1. This leads to a result between 0 and 0ah.

 

This value is then used as an index into the jump table, which contains code addresses, to determine the address to which control is transferred with a jmp instruction.

 

So, in other words, control is transferred to another point in the code, depending on the former and new IRQL. There are 4 possible code blocks whose addresses are stored in the jump table, which I named @#@DoNothing, @#@CallVPICD, @#@ClearIF and @#@RestoreIF.
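
The two-table dispatch can be reconstructed as a C sketch. Be warned that the actual byte values of the index table are not given in this text, so the classification below (0 for IRQLs up to DISPATCH_LEVEL, 2 for 1fh, 1 for everything in between) and the jump-table entries are a plausible reconstruction from the transitions described, not a dump of the real tables.

```c
#include <stdint.h>

#define DISPATCH_LEVEL 2
#define HIGH_LEVEL     0x1F

typedef enum { DO_NOTHING, CALL_VPICD, CLEAR_IF, RESTORE_IF } IrqlAction;

/* Reconstructed index table, expressed as a function: each IRQL maps to
   a byte value between 0 and 2, as stated in the text. */
static uint8_t classify(uint8_t irql)
{
    if (irql <= DISPATCH_LEVEL) return 0;
    if (irql == HIGH_LEVEL)     return 2;
    return 1;
}

/* Reconstructed jump table, indexed by (class(old) << 2) | class(new),
   a value between 0 and 0ah as the text computes. */
static const IrqlAction jump_table[11] = {
    DO_NOTHING,  /* 0: low -> low                 */
    CALL_VPICD,  /* 1: low -> mid                 */
    CLEAR_IF,    /* 2: low -> 1fh                 */
    DO_NOTHING,  /* 3: (unreachable combination)  */
    CALL_VPICD,  /* 4: mid -> low                 */
    CALL_VPICD,  /* 5: mid -> mid                 */
    CLEAR_IF,    /* 6: mid -> 1fh                 */
    DO_NOTHING,  /* 7: (unreachable combination)  */
    RESTORE_IF,  /* 8: 1fh -> low                 */
    RESTORE_IF,  /* 9: 1fh -> mid                 */
    CLEAR_IF,    /* a: 1fh -> 1fh                 */
};

/* Model of KpSetIrql's dispatch: combine the two classifications and
   pick the code block, as the jmp through the jump table does. */
IrqlAction kp_set_irql_action(uint8_t old_irql, uint8_t new_irql)
{
    unsigned idx = (unsigned)(classify(old_irql) << 2) | classify(new_irql);
    return jump_table[idx];
}
```

The point of the two-step encoding is that an 0x20-by-0x20 transition matrix collapses into one 0x20-byte table plus an 11-entry jump table.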

 

@#@DoNothing is reached when the IRQL is moved from a value between 0 and 2 to another value inside this range. This code block does nothing and just returns. Note how 2, i.e. DISPATCH_LEVEL, is a borderline between different IRQL effects.

 

@#@CallVPICD is reached for every transition except:

 

 

This code block calls a VPICD procedure named _VPICD_Set_Irql_Mask, which masks certain IRQ lines depending on the new IRQL value. I will analyze this procedure in the next section.

 

@#@ClearIF is reached for every transition to 1fh and simply clears the IF bit.

 

@#@RestoreIF is reached for every transition from 1fh to a lower level. It restores IF to its previous status. It’s interesting to note that the previous status of IF is kept in a single static variable, which I labelled @#@CallerIFStat. Since a static is used, the VMM must be sure that the thread which raised the IRQL to 1fh is the same one now lowering it, otherwise it would set the caller’s IF according to the status it had in another thread. But of course, the VMM can be sure of this, because hardware interrupts were disabled, therefore nothing could have triggered a thread switch. You may object that a page fault can still cause a thread switch, but a page fault occurring while the IRQL is above a certain threshold is probably treated by the VMM as a fatal error. For one thing, this is explicitly documented in Windows NT, which is where the concept of IRQL was imported from.

 

So, recapitulating:

 

 

 

5.3 - _VPICD_Set_Irql_Mask

 

You can find this procedure in SIrqlMsk.Asm (Listing 19).

 

This procedure updates the static named VPICD_Irql_IMR by writing into it a word value, taken from VPICD_Irql_Mask_Table, using the new IRQL as an index. This value is then used to calculate a bit mask used to command the two PICs, in order to mask certain IRQ lines. By inspecting the actual table values it turns out that:

 

 

So we see again how, the higher the IRQL, the more insensitive the system becomes to interrupt sources.

 

The reason why this procedure stores the current masking bit pattern into VPICD_Irql_IMR is that other procedures need to manipulate the PIC mask register and, in this way, they can OR their own bit pattern with this one, in order not to unmask IRQs which need to stay disabled because of the IRQL. An example of such a procedure is Hw_Irq_0, which we met in section 1.2.1.

 

_VPICD_Set_Irql_Mask is also aware of other reasons to mask IRQs: it ORs VPICD_Irql_IMR with other masks named VPICD_Phys_ISR and VPICD_Phys_IMR, before writing the bit pattern out to the PICs.
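The masking logic can be condensed into a few lines of C. Table contents are left empty and the port output is omitted; only the lookup-then-OR composition reflects the disassembly, and the variable names echo the real statics without claiming to reproduce them:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of _VPICD_Set_Irql_Mask's mask composition.
   Bit n set = IRQ n masked; bits 0-7 master PIC, 8-15 slave PIC. */

static uint16_t irql_mask_table[0x20];  /* VPICD_Irql_Mask_Table   */
static uint16_t phys_isr, phys_imr;     /* other reasons to mask   */
static uint16_t vpicd_irql_imr;         /* the published static    */

/* Compose the final IMR pattern and return it; the real code then
   outs the low byte to port 21h and the high byte to port 0a1h. */
uint16_t set_irql_mask(unsigned char new_irql)
{
    vpicd_irql_imr = irql_mask_table[new_irql];
    return (uint16_t)(vpicd_irql_imr | phys_isr | phys_imr);
}
```

Publishing the IRQL-derived pattern in a static is what lets procedures like Hw_Irq_0 OR their own bits into it without accidentally unmasking an IRQ that the current IRQL requires to stay masked.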

 

 

5.4 - @KfLowerIrql@4

 

The code for @KfLowerIrql@4 can be found in LoweIrql.Asm (Listing 20).

 

Having dissected KpSetIrql we may expect to have little left to do, but this procedure holds some interesting new details.

 

The first part of the procedure is equivalent to @KfRaiseIrql@4: it checks whether the current IRQL is already equal to or lower than the requested value and, if so, just returns. Otherwise, it calls KpSetIrql to set the new value.

 

Then, instead of simply terminating, it checks whether the new IRQL value is below DISPATCH_LEVEL and, if this is the case, it processes DPCs and preemptable events. So we see that, as soon as the IRQL drops low enough to allow pending callbacks, the VMM tries to service them.

 

Before actually processing them, this procedure compares the current code selector with a value stored in a static named _RtCs. If the two are equal, DPCs and events are skipped. This means that the VMM reserves a special code segment for which these callbacks are inhibited. This could be a code segment reserved for special operations, which need to lower the IRQL and yet must not be interrupted (or preempted). The name of the static could stand for Real-time Code segment. By the way, this is a new functionality of Windows Me and wasn’t present in Windows 98 SE.

 

If @KfLowerIrql@4 decides to process DPCs it does so simply by calling _ProcessDPCs@0, which we already met in section 4.4.2.2.

 

As for preemptable events, this procedure doesn’t actually service them. Instead it behaves like Return_To_VM, lowering the code segment limit and setting the _fPreempt flag. We know from section 4.4.3.2 that this flag has the effect of masking the next GP fault to the system and of triggering the preemptable events servicing, as well as the restoring of the normal code segment limit.
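Putting the pieces together, the overall flow of @KfLowerIrql@4 (minus the _RtCs check) can be sketched like this; all names and the DPC/event stand-ins are mine, and the _fPreempt mechanism is reduced to a flag:

```c
#include <assert.h>

/* Hypothetical C reconstruction of the @KfLowerIrql@4 flow. */

#define DISPATCH_LEVEL 2

static unsigned char current_irql = 0x1f;
static int dpcs_processed;
static int f_preempt;       /* stand-in for the _fPreempt flag */

static void kp_set_irql(unsigned char n) { current_irql = n; }
static void process_dpcs(void)           { dpcs_processed++; }

void lower_irql(unsigned char new_irql)
{
    if (new_irql >= current_irql)   /* already low enough: done */
        return;
    kp_set_irql(new_irql);
    if (current_irql < DISPATCH_LEVEL) {
        process_dpcs();             /* pending DPCs run now...       */
        f_preempt = 1;              /* ...events at the next GP fault */
    }
}
```

Lowering from 1fh to 3, for instance, updates the IRQL but triggers neither DPCs nor events; only a drop below DISPATCH_LEVEL does.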

 

By the way, Windows 98 SE showed a simpler behaviour: it just checked that the current thread was in preemptable code by testing TCBX_PREEMP_CNT and, if so, it jumped to the event servicing loop inside _Begin_Preemptable_Code.

 

 

 

Summary

 

We followed the path taken by a timer interrupt occurring in a Win32 application. We examined the transition from ring 3 to ring 0, then the handling of the interrupt, which allowed us to understand how Windows Me updates its time counters.

 

We also saw how events are serviced along the path of an interrupt, and we analyzed how this is accomplished.

 

Having learned that the PIT interrupt is also used to trigger the servicing of timeouts, we examined them as well.

 

By looking at the code which handles an interrupt at ring 0, we learned about DPCs and preemptable events.

 

In many of the topics examined, we saw the influence of the IRQL and we also had the opportunity to look at the procedures which set its current value and at the side effects which may occur, particularly when it is lowered.

 

Appendix A – Intel Processors Concepts

 

A.1 – What Is A Ring Transition

 

In this section I will explain briefly the concept of privilege level defined in the Intel IA32 architecture.

 

Beginning with the 80286, Intel processors may be operated in what is called Protected Mode (PM), which has capabilities oriented to multitasking operating systems where the code and data of one task must be protected from illegal accesses from another task and the operating system must be protected from all application tasks. In this framework, the Intel architecture defines four levels of privilege, also called rings, at which the processor may be running, numbered from 0 (most privileged) to 3 (least privileged). At any time, the processor is executing at one of the 4 privilege levels and the current value is called the current privilege level (CPL). The privilege level at which a section of code executes is determined by the attributes of the memory segment which contains it, therefore it’s usual to speak of a privilege level of the code, i.e., we speak of ring 0 code and ring 3 code (Windows doesn’t use the intermediate levels 1 and 2. Everything runs either at ring 3 or ring 0).

 

This isn’t completely accurate, though, because it would be possible to change the attributes of a memory segment or even to define two different logical segments, with different attributes (and thus privileges) but with the same base address, i.e. defining the same memory region. Then the same code could be executed at different privilege levels, depending on which segment definition the processor is using. So you can see it’s not strictly accurate to speak of a privilege level of the code but, normally, Windows executes application code using a ring 3 segment definition, while the operating system core runs at ring 0 and doesn’t toy with segment definitions.

 

Another fact that makes it reasonable to speak of code privilege is that certain instructions can only be executed at ring 0, otherwise the processor generates an exception. It is therefore likely that a code block meant to be executed at ring 0 will cause an exception if executed using a ring 3 segment definition.

 

The segment definition that the processor is currently using for the code segment is determined by the value loaded into the CS register. This value is called a segment selector and it is used as an index into a table of 8-byte data structures which contain the segment attributes. The table entries are called descriptors.
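For illustration, decoding a selector’s fields is a matter of a few shifts: bits 3-15 are the index into the descriptor table, bit 2 (the TI flag) selects between the GDT and the LDT, and bits 0-1 hold the requested privilege level (RPL). A minimal sketch:

```c
#include <assert.h>

/* Decode the three fields of an IA32 segment selector. */
unsigned sel_index(unsigned sel) { return sel >> 3; }       /* bits 3-15 */
unsigned sel_ti(unsigned sel)    { return (sel >> 2) & 1; } /* bit 2     */
unsigned sel_rpl(unsigned sel)   { return sel & 3; }        /* bits 0-1  */
```

For instance, a selector value of 1bh selects descriptor 3 in the GDT with RPL 3, i.e. a typical ring 3 selector; the value here is just an example, not a specific Windows selector.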

 

So, in summary, the transition we are going to examine is the one from the application privilege level (ring 3) to the kernel privilege level (ring 0). Why does an interrupt from the PIT cause such a transition, you may ask. The answer is that interrupts are handled through a table (the interrupt descriptor table, or IDT), which, in Windows, is initialized so that such a transition occurs. The IDT also contains the address of the code which must handle the interrupt.

 

 

A.2 - Interrupt Stack Frames

 

A.2.1 - Basic Stack Frame

 

Every time an interruption (hardware interrupt, software interrupt or processor exception) occurs, the processor builds a stack frame which is described by the following structure:

 

StkFrm  struc
    SavEip  dword   ?   ;(+00h)
    SavCs   dword   ?   ;(+04h) CS is 16 bit long. Bits 16-31 undef.
    SavFlg  dword   ?   ;(+08h)
StkFrm  ends

 

Figure 2

 

When the first instruction after the interruption is executed, esp points to SavEip, i.e. the lowest address, thus the offsets above are also valid from esp.

 

 

Some (but not all) of the processor exceptions push an extra error code dword on the stack, immediately below eip. Since Windows already has a lot of things to worry about, it avoids the need to remember which stack layout the current interruption built by subtracting 4 from esp for all the interruptions which don’t push this extra error code, before pushing anything else. In this way, all the interruptions will have a 4 dword stack frame and the only piece of code which has to know about the specific stack layout is the 1st level code, which may or may not include the sub esp,4 instruction.
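The equalization trick can be mimicked in C with the two frame layouts; the structure and field names below are mine, with the error code sitting at the lowest address, where esp points:

```c
#include <assert.h>
#include <stddef.h>

/* The frame as pushed by the processor (Figure 2), and the frame
   for exceptions which also push an error code. */
struct frame_no_err { unsigned SavEip, SavCs, SavFlg; };
struct frame_err    { unsigned ErrCode, SavEip, SavCs, SavFlg; };

/* Simulate the normalization: the 1st level handler for vectors
   without an error code executes "sub esp,4", reserving one extra
   dword, so both cases end up with a 4-dword frame. */
size_t normalized_size(int pushes_error_code)
{
    if (pushes_error_code)
        return sizeof(struct frame_err);
    return sizeof(struct frame_no_err) + 4;   /* sub esp,4 */
}
```

Either way the frame occupies 16 bytes, so the common code further down the interrupt path can use fixed offsets from esp without knowing which vector fired.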

 

The stack frame built by the processor also depends on the type of gate found in the IDT for the interrupt. The one shown above is for a 32 bit gate. For more information, see [1], Volume 3.

 

 

A.2.2 - Stack Frame for Privilege Level Change

 

Depending on the attributes of the gate installed in the IDT for a given interrupt vector, the interruption can have the effect to lower the privilege level value (i.e. go into a more privileged ring). This is generally the case in Windows.

 

When such an event occurs, the processor changes the stack segment (SS) and stack pointer (esp) registers, in order to allow ring 0 code to have its own clean stack, separated from the application one. The initial values for SS and esp are taken from a special data structure named the task state segment (TSS). The contents of SS and esp at the moment of the interrupt are saved on the ring 0 stack, above the usual interrupt frame. Therefore, after an interrupt causing a ring 3 to 0 transition, the ring 3 stack will be completely unaffected, because everything is pushed on the ring 0 stack, and the ring 0 stack layout will be:

 

StkFrm2 struc
    SavEip  dword   ?   ;(+00h)
    SavCs   dword   ?   ;(+04h)
    SavFlg  dword   ?   ;(+08h)
    Rg3Esp  dword   ?   ;(+0ch)
    Rg3Ss   dword   ?   ;(+10h)
StkFrm2 ends

 

Figure 3

 

The ring 3 SS and esp saved on the ring 0 stack point to the bottom of the ring 3 stack. It is important to note that nothing is pushed on the ring 3 stack by the interrupt: everything is written on the new ring 0 stack. When the processor executes an iretd instruction with esp pointing at the frame above, it will detect that the saved CS refers to a ring 3 segment and therefore that a ring 0 – 3 transition is in progress. It will then load the two doublewords above the flags into esp and SS as part of the iretd.

 

Note how this feature allows an exception from ring 3 to successfully enter ring 0 even if the ring 3 (application) code has messed up the stack: a clean pointer for the ring 0 (system) software is loaded from the TSS.

 

Appendix B - Virtual Machine Manager (VMM) Concepts

 

A note on terminology: the ring 0 core of Windows is also known as the virtual machine manager (VMM), so you’ll find this term in this document as the name of the program which is made up of the analyzed code. A thread which is executing VMM code is said to be inside the VMM.

 

 

 

B.1 - VMs, IDTs, VM Handles

 

The system maintains an execution environment for win16 and win32 programs known as the system virtual machine (VM). There may be other VMs also, because one is created for every running DOS box. Each VM has two IDTs, one used when the currently executing program is in protected mode (PM), and the other one used when the program is in virtual 86 mode (V86). Although the system VM generally executes PM code, it may be switched to V86 to execute DOS or BIOS routines, therefore this VM also has two IDTs.

 

The VM handle is a pointer to a data structure holding VM control data, which corresponds to the structure cb_s documented in the Windows Me DDK. Every VM in the system has its cb_s and this structure is also referred to as the VM control block. Thus, the handle is simply the pointer to the control block.

 

A note on handles: if you are accustomed to win32 application handles, you probably know they are generally indices inside a per-process table, where additional information on the object represented by the handle is stored. In Windows Me ring 0 programming, instead, handles are generally linear addresses, which fall in the shared system region (0c0000000h – 0ffffffffh), where the system data they point to live.

 

Among the many useful things stored inside cb_s there is the pointer to the client register structure of the thread currently executing in the VM, which is saved in the CB_Client_Pointer member. Thus, if we are not sure whether ebp still contains this pointer, we can reload it from there.

 

 

B.2 - Ring 3 And Ring 0 Interrupts.

 

When a thread is inside the VMM, it keeps interrupts enabled most of the time, therefore the VMM execution can itself be interrupted by what I will call a ring 0 interrupt, as opposed to an interrupt received by application code, which I will call a ring 3 interrupt. There are some rules that Windows follows for ring 0 interrupts.

 

First of all, Windows has two sets of interrupt handlers: one for interrupts from ring 3 and another for interrupts from ring 0, thus, the handler which receives control depends upon the ring of the interrupted code. Keep in mind though that the very 1st code executed, i.e. the one analyzed in sections 1.1.2 through 1.1.5, is common to the two cases.

 

Next, Windows guarantees that the current thread is not preempted in the middle of an interrupt coming from ring 0. Recall that threads are normally preempted after they have entered ring 0, because it’s the VMM which decides whether to return to the interrupted code or to resume a different thread. Nevertheless, if an interrupt occurs at ring 0, the VMM guarantees that it will always return to the interrupted ring 0 code, without suspending the thread (see, for instance, [2], p.226). It is at the moment of returning to ring 3 that the VMM may decide to preempt the thread.

 

 

B.3 - Thread Handles

 

The thread handle is analogous to the VM handle: the system maintains a control block for each thread, defined by the tcb_s structure of the Windows Me DDK. The tcb_s is the counterpart of the cb_s (the VM control block) for the individual threads and it’s also called the thread control block. The thread handle is simply the pointer to a thread tcb_s.

 

 

B.4 - Events

 

Events are used to let an interrupt handler schedule an operation it cannot perform immediately.

 

The VMM architecture imposes that the code executed while handling a hardware interrupt must not cause a page fault, i.e. it cannot access pageable memory and cannot call a number of Windows services, primarily because they access pageable memory on their own. Procedures complying with this rule are called "asynchronous" because they can be safely executed to handle an interruption in the current program, asynchronously triggered by a hardware interrupt.

 

To handle an interrupt, though, it may be necessary to call non-asynchronous services, therefore what a handler does is schedule an event to perform the forbidden operation. It calls an event scheduling service, passing it the address of a procedure, which is called a callback. The effect of this is that the VMM will call the specified procedure later, at a time when it is safe to use all the Windows services. This procedure can therefore call non-asynchronous services to finish the job.
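The mechanism can be reduced to a toy model: a queue filled at interrupt time and drained later. Everything here (names, the fixed-size queue, the demo callback) is invented for the sketch; the real VMM event services are far richer:

```c
#include <assert.h>
#include <stddef.h>

/* Toy event queue: an interrupt handler may not touch pageable
   memory, so it only queues a callback; the VMM drains the queue
   later, when non-asynchronous services are safe again. */

typedef void (*event_cb)(void *refdata);

#define MAX_EVENTS 8
static struct { event_cb cb; void *refdata; } event_q[MAX_EVENTS];
static int event_count;

/* demo callback: just counts invocations */
static int fired;
static void count_cb(void *refdata) { (void)refdata; fired++; }

/* called from the (asynchronous) interrupt handler */
void schedule_event(event_cb cb, void *refdata)
{
    if (event_count < MAX_EVENTS) {
        event_q[event_count].cb = cb;
        event_q[event_count].refdata = refdata;
        event_count++;
    }
}

/* called by the VMM when it is safe to run callbacks */
void service_events(void)
{
    int i;
    for (i = 0; i < event_count; i++)
        event_q[i].cb(event_q[i].refdata);
    event_count = 0;
}
```

The essential property is the split in time: nothing runs at interrupt time except the enqueue, and the callback runs only when the VMM decides it is safe.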

 

There are various event scheduling services and some of them allow the caller to specify restrictions on the status of the system, which must be met for the callback to be called. In other words, the VMM will call the callback only when the restrictions are met. For instance, it is possible to specify that the callback must be called only when the client thread has interrupts enabled.

 

An event can be scheduled as global, VM specific or thread specific. If it is global, it will be executed in the context of the first thread which meets the restrictions, regardless of which VM it belongs to. If it is VM specific, it will be executed in any such thread of the specified VM, and, if it is thread specific, the callback will be called only by the specified thread, when it satisfies the other conditions, if any.

 

 

Acronyms List

 

CPL: current privilege level

DPC: deferred procedure call

IA32: Intel’s architecture implemented by the 386 and above processors.

IDT: interrupt descriptor table

NMI: non maskable interrupt

PIC: programmable interrupt controller

PIT: programmable interval timer

PM: protected mode

TSS: task state segment

V86: virtual 86 mode

VM: virtual machine

VMM: virtual machine manager

WDM: Windows Driver Model

 

Bibliography

 

[1], Intel Architecture Software Developer’s Manual.

Volume 1: Basic Architecture (24319001.pdf)

Volume 2: Instruction Set Reference (24319101.pdf)

Volume 3: System Programming Guide (24319201.pdf)

Available at www.intel.com.

 

[2], Walter Oney, Systems Programming for Windows 95, Microsoft Press, 1996.