Search This Blog

Thursday 30 October 2014

WinDbg : Debugging A Critical Section Based Dead Lock Scenario

WinDbg : Debugging A Critical Section Based Dead Lock Scenario


Dead locks can be very nasty and sometimes very hard to debug and root cause. This post is an attempt to walk you through one such scenario, where a critical section dead locks a thread. As with most of the previous posts, we will refrain from having source level help, that is, we will try and debug the faulty application without any source code and only symbols. 

Commands Used:

We start by attaching the debugger to he hung application, or we can also start the application inside the debugger if that is an alternative. In this example the name of the application is FaultyApp.Exe.

Once the application hangs, we start by taking a stack dump of all the threads currently running. Since this is an user land debugging scenario, the ~* command will give us the thread wise stack dump.

kd>~*kvn
0:004> ~*kvn

   0  Id: 6a0.5f8 Suspend: 1 Teb: 7efdd000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 0016f7b8 770315f7 00000003 0016f808 00000001 ntdll!NtWaitForMultipleObjects+0x15 (FPO: [5,0,0])
01 0016f854 76e719f8 0016f808 0016f87c 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x100 (FPO: [Non-Fpo])
02 0016f89c 76e74200 00000003 7efde000 00000000 kernel32!WaitForMultipleObjectsExImplementation+0xe0 (FPO: [Non-Fpo])
03 0016f8b8 00bd11b5 00000003 0016f8d4 00000000 kernel32!WaitForMultipleObjects+0x18 (FPO: [Non-Fpo])
04 0016f8e4 00bd143d 00000001 00699318 00699368 FaultyApp!main+0x95 (FPO: [Non-Fpo]) (CONV: cdecl) 
05 0016f92c 76e7338a 7efde000 0016f978 77d79f72 FaultyApp!__tmainCRTStartup+0xfe (FPO: [Non-Fpo]) (CONV: cdecl) 
06 0016f938 77d79f72 7efde000 735ac1c9 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
07 0016f978 77d79f45 00bd1500 7efde000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
08 0016f990 00000000 00bd1500 7efde000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

   1  Id: 6a0.bc4 Suspend: 1 Teb: 7efda000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 0089feb4 77d78e44 00000048 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])
01 0089ff18 77d78d28 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
02 0089ff40 00bd105c 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
03 0089ff54 76e7338a 00000000 0089ffa0 77d79f72 FaultyApp!FnThread1+0x2c (FPO: [Non-Fpo]) (CONV: cdecl) 
04 0089ff60 77d79f72 00000000 73c5c711 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
05 0089ffa0 77d79f45 00bd1005 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
06 0089ffb8 00000000 00bd1005 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

   2  Id: 6a0.12fc Suspend: 1 Teb: 7efaf000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 00dffa50 77d78e44 00000048 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])
01 00dffab4 77d78d28 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
02 00dffadc 00bd10fc 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
03 00dffaf0 76e7338a 00000000 00dffb3c 77d79f72 FaultyApp!FnThread3+0x2c (FPO: [Non-Fpo]) (CONV: cdecl) 
04 00dffafc 77d79f72 00000000 7393c38d 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
05 00dffb3c 77d79f45 00bd100f 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
06 00dffb54 00000000 00bd100f 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

   3  Id: 6a0.11f0 Suspend: 1 Teb: 7efd7000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 00a6fa34 77d78e44 00000048 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])
01 00a6fa98 77d78d28 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
02 00a6fac0 00bd10ac 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
03 00a6fad4 76e7338a 00000000 00a6fb20 77d79f72 FaultyApp!FnThread2+0x2c (FPO: [Non-Fpo]) (CONV: cdecl) 
04 00a6fae0 77d79f72 00000000 73eac391 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
05 00a6fb20 77d79f45 00bd100a 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
06 00a6fb38 00000000 00bd100a 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

#  4  Id: 6a0.11c8 Suspend: 1 Teb: 7efac000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 00f0fd3c 77ddf926 73bcc5dd 00000000 00000000 ntdll!DbgBreakPoint (FPO: [0,0,0])
01 00f0fd6c 76e7338a 00000000 00f0fdb8 77d79f72 ntdll!DbgUiRemoteBreakin+0x3c (FPO: [Non-Fpo])
02 00f0fd78 77d79f72 00000000 73bcc509 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
03 00f0fdb8 77d79f45 77ddf8ea 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
04 00f0fdd0 00000000 77ddf8ea 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

The clue here is to look for the Windows API call RtlEnterCriticalSection. There might be  more than one thread in the stack making this function call, we can use any such thread.

Looking up on MSDN we see that the API is defined as follows:

NTSTATUS RtlEnterCriticalSection
 (
  RTL_CRITICAL_SECTION* crit
 )

In the above stack dump, we have several such threads, we take any one and them from the parameters list take the address of the RTL_CRITICAL_SECTION object pointer and type cast it to see the contents.

02 0089ff40 00bd105c 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])

0:004> dt ntdll!_RTL_CRITICAL_SECTION 6ce392e0
   +0x000 DebugInfo        : 0x0069acf8 _RTL_CRITICAL_SECTION_DEBUG
   +0x004 LockCount        : 0n-14
   +0x008 RecursionCount   : 0n1
   +0x00c OwningThread     : 0x000005f8 Void
   +0x010 LockSemaphore    : 0x00000048 Void
   +0x014 SpinCount        : 0

We see that one of the members of this object gives us the Owning thread ID. 0x000005f8.

Lets try and find out more details about this thread. In user mode, the double tilde (~~) command will do it for us.

0:004> ~~[0x000005f8 ]
   0  Id: 6a0.5f8 Suspend: 1 Teb: 7efdd000 Unfrozen
      Start: FaultyApp!mainCRTStartup (00bd1500)
      Priority: 0  Priority class: 32  Affinity: ff

Ok so the thread offending is the main thread. This is good. Now we know where in the code of FaultyApp we need to start digging.

Lets start with disassembling the Main function. I will add the code in small sections explaining the relevant portions in between.

0:004> uF FaultyApp!main
FaultyApp!main:
00bd1120 55              push    ebp
00bd1121 8bec           mov    ebp,esp 

This is standard function prologue code, that we see in the above two lines.


00bd1123 83ec14        sub    esp,14h                                           

The above line is allocating space for local variables in the stack 

00bd1126 a11070bf00  mov   eax,dword ptr [FaultyApp!__security_cookie (00bf7010)]
00bd112b 33c5            xor     eax,ebp                                           


These two lines above is the code for stack guard.


00bd112d 8945fc          mov     dword ptr [ebp-4],eax
00bd1130 a100b0bf00   mov     eax,dword ptr [FaultyApp!_imp__g_CritSec (00bfb000)]
00bd1135 50                 push    eax
00bd1136 ff1534b0bf00 call    dword ptr [FaultyApp!_imp__InitializeCriticalSection (00bfb034)]    

the four lines above  is the code to initialize the critical section

00bd113c c745ec00000000  mov     dword ptr [ebp-14h],0
00bd1143 eb09                   jmp     FaultyApp!main+0x2e (00bd114e)

FaultyApp!main+0x25:
00bd1145 8b4dec          mov     ecx,dword ptr [ebp-14h]
00bd1148 83c101          add      ecx,1
00bd114b 894dec          mov     dword ptr [ebp-14h],ecx

FaultyApp!main+0x2e:
00bd114e 837dec03        cmp     dword ptr [ebp-14h],3                             

The line just above is comparing a local variable with a constant '3'

00bd1152 7333            jae     FaultyApp!main+0x67 (00bd1187)                  

If the comparison result is equal or greater than 3 then then jump out, else loop                                                    
FaultyApp!main+0x34:
00bd1154 6a00                    push    0
00bd1156 6a00                    push    0
00bd1158 6a00                    push    0
00bd115a 8b55ec                mov     edx,dword ptr [ebp-14h]
00bd115d 8b04950070bf00  mov     eax,dword ptr FaultyApp!pFnarry (00bf7000)[edx*4]   

This above sections seems to be initializing a function pointer array for thread functions.  We know this because of the call to CreateThread API a few lines below....
                                
00bd1164 50              push    eax
00bd1165 6a00            push    0
00bd1167 6a00            push    0
00bd1169 ff154cb0bf00    call    dword ptr [FaultyApp!_imp__CreateThread (00bfb04c)]

(the reader is encouraged to use MSDN to see the signature for CreateThread and all the parameters pushed into it inside this function to decrypt what all was passed)

00bd116f 8b4dec          mov     ecx,dword ptr [ebp-14h]
00bd1172 89448df0       mov     dword ptr [ebp+ecx*4-10h],eax
00bd1176 8b55ec          mov     edx,dword ptr [ebp-14h]
00bd1179 837c95f000    cmp     dword ptr [ebp+edx*4-10h],0
00bd117e 7505              jne     FaultyApp!main+0x65 (00bd1185)

FaultyApp!main+0x60:
00bd1180 83c8ff          or      eax,0FFFFFFFFh
00bd1183 eb32            jmp     FaultyApp!main+0x97 (00bd11b7)

FaultyApp!main+0x65:
00bd1185 ebbe            jmp     FaultyApp!main+0x25 (00bd1145)

FaultyApp!main+0x67:
00bd1187 a1b892bf00      mov     eax,dword ptr [FaultyApp!hStartThreads (00bf92b8)]
00bd118c 50                    push    eax
00bd118d ff1540b0bf00   call    dword ptr [FaultyApp!_imp__SetEvent (00bfb040)]
00bd1193 6a00                push    0
00bd1195 68f4010000      push    1F4h
00bd119a ff1544b0bf00   call    dword ptr [FaultyApp!_imp__SleepEx (00bfb044)]
00bd11a0 e885010000     call    FaultyApp!DllFunction (00bd132a)                      

After setting an event and sleeping for a while, it finally jumps to this function in the above line. DllFunction. We need to disassemble it to see where it leads us. I am not pasting the rest of the disassembly of the Main function as it is irrelevant to the rest of the context.

Lets start with the disassembly of DllFunction.


0:004> u FaultyApp!DllFunction
FaultyApp!DllFunction:
00bd132a ff2504b0bf00    jmp     dword ptr [FaultyApp!_imp__DllFunction (00bfb004)]
FaultyApp!__security_check_cookie:
00bd1330 3b0d1070bf00    cmp     ecx,dword ptr [FaultyApp!__security_cookie (00bf7010)]
00bd1336 7502                  jne     FaultyApp!__security_check_cookie+0xa (00bd133a)
00bd1338 f3c3                  rep ret
00bd133a e908020000       jmp     FaultyApp!__report_gsfailure (00bd1547)
FaultyApp!__tmainCRTStartup:
00bd133f 6a14                  push    14h
00bd1341 68d05dbf00       push    offset FaultyApp!__rtc_tzz+0x104 (00bf5dd0)
00bd1346 e8551a0000       call    FaultyApp!__SEH_prolog4 (00bd2da0)

Whats going on? This doesn't look like a function at all....so look closely at the first line copied again in the line below for your convenience...

00bd132a ff2504b0bf00    jmp     dword ptr [FaultyApp!_imp__DllFunction (00bfb004)]

This is actually a function pointer which is being de-referenced. So if it is a pointer, we need to see what it is pointing at:

0:004> dd 00bfb004
00bfb004  6ce11005 00000000 00000000 00000000
00bfb014  00000000 00000000 00000000 00000000
00bfb024  00000000 00000000 00000000 00000000
00bfb034  77d72c8a 77d622c0 77d62280 76e71691
00bfb044  76e71215 76e7324c 76e73495 76e741e8
00bfb054  76e75169 76e74a25 76e751fd 76e711c0
00bfb064  76e711a9 76e717b8 76e71420 77d8107b
00bfb074  77d79dd5 76e779d8 76e74a37 76e71222

Lets have a look at what is actually at this address.

0:004> u 6ce11005
FaultyDLL!ILT+0(_DllFunction):
6ce11005 e916000000      jmp     FaultyDLL!DllFunction (6ce11020)
FaultyDLL!ILT+5(_DllMain:
6ce1100a e931000000      jmp     FaultyDLL!DllMain (6ce11040)
6ce1100f cc              int     3
6ce11010 cc              int     3
6ce11011 cc              int     3
6ce11012 cc              int     3
6ce11013 cc              int     3
6ce11014 cc              int     3

It shows the name of a different module. FaultyDll. Looks like it is a DLL function being used, through the import address table (I know the name of the module has the word DLL in it, but that is not why i reached this conclusion).

Now lets disassemble the actual function the address of which we have been able to decipher.

0:004> u 6ce11020
FaultyDLL!DllFunction:
6ce11020 55                    push    ebp
6ce11021 8bec                mov     ebp,esp
6ce11023 68e092e36c      push    offset FaultyDLL!g_CritSec (6ce392e0)
6ce11028 ff1500b0e36c   call    dword ptr [FaultyDLL!_imp__EnterCriticalSection (6ce3b000)]
6ce1102e b801000000      mov     eax,1
6ce11033 5d                    pop     ebp
6ce11034 c3                    ret
6ce11035 cc                    int     3

It is a small function and what do we see in it? It enter's the critical section, but never leaves it. This is the bug!!

Main is shown as the owner of the critical section because it calls a faulty function which enters but never leaves the critical section.

Note: This is a very simple test application tailor made to produce the hang, real life scenarios might be much harder to debug and locate. But I do hope that this gives a starting point to folks wanting to chase such bugs!


Trivia : !cs command also gives info about all critical sections. !cs <address of crit> will dump info about that particular critical section. I deliberately avoided using it, as this exercise helps one understand such debugging scenarios in depth, rather than being dependent on a debugger extension.

No comments:

Post a Comment