WinDbg : Debugging A Critical Section Based Dead Lock Scenario
Dead locks can be very nasty and sometimes very hard to debug and root cause. This post is an attempt to walk you through one such scenario, where a critical section dead locks a thread. As with most of the previous posts, we will refrain from having source level help, that is, we will try and debug the faulty application without any source code and only symbols.
Commands Used:
We start by attaching the debugger to he hung application, or we can also start the application inside the debugger if that is an alternative. In this example the name of the application is FaultyApp.Exe.
Once the application hangs, we start by taking a stack dump of all the threads currently running. Since this is an user land debugging scenario, the ~* command will give us the thread wise stack dump.
kd>~*kvn
0:004> ~*kvn
0 Id: 6a0.5f8 Suspend: 1 Teb: 7efdd000 Unfrozen
# ChildEBP RetAddr Args to Child
00 0016f7b8 770315f7 00000003 0016f808 00000001 ntdll!NtWaitForMultipleObjects+0x15 (FPO: [5,0,0])
01 0016f854 76e719f8 0016f808 0016f87c 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x100 (FPO: [Non-Fpo])
02 0016f89c 76e74200 00000003 7efde000 00000000 kernel32!WaitForMultipleObjectsExImplementation+0xe0 (FPO: [Non-Fpo])
03 0016f8b8 00bd11b5 00000003 0016f8d4 00000000 kernel32!WaitForMultipleObjects+0x18 (FPO: [Non-Fpo])
04 0016f8e4 00bd143d 00000001 00699318 00699368 FaultyApp!main+0x95 (FPO: [Non-Fpo]) (CONV: cdecl)
05 0016f92c 76e7338a 7efde000 0016f978 77d79f72 FaultyApp!__tmainCRTStartup+0xfe (FPO: [Non-Fpo]) (CONV: cdecl)
06 0016f938 77d79f72 7efde000 735ac1c9 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
07 0016f978 77d79f45 00bd1500 7efde000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
08 0016f990 00000000 00bd1500 7efde000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])
1 Id: 6a0.bc4 Suspend: 1 Teb: 7efda000 Unfrozen
# ChildEBP RetAddr Args to Child
00 0089feb4 77d78e44 00000048 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])
01 0089ff18 77d78d28 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
02 0089ff40 00bd105c 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
03 0089ff54 76e7338a 00000000 0089ffa0 77d79f72 FaultyApp!FnThread1+0x2c (FPO: [Non-Fpo]) (CONV: cdecl)
04 0089ff60 77d79f72 00000000 73c5c711 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
05 0089ffa0 77d79f45 00bd1005 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
06 0089ffb8 00000000 00bd1005 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])
2 Id: 6a0.12fc Suspend: 1 Teb: 7efaf000 Unfrozen
# ChildEBP RetAddr Args to Child
00 00dffa50 77d78e44 00000048 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])
01 00dffab4 77d78d28 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
02 00dffadc 00bd10fc 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
03 00dffaf0 76e7338a 00000000 00dffb3c 77d79f72 FaultyApp!FnThread3+0x2c (FPO: [Non-Fpo]) (CONV: cdecl)
04 00dffafc 77d79f72 00000000 7393c38d 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
05 00dffb3c 77d79f45 00bd100f 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
06 00dffb54 00000000 00bd100f 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])
3 Id: 6a0.11f0 Suspend: 1 Teb: 7efd7000 Unfrozen
# ChildEBP RetAddr Args to Child
00 00a6fa34 77d78e44 00000048 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])
01 00a6fa98 77d78d28 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
02 00a6fac0 00bd10ac 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
03 00a6fad4 76e7338a 00000000 00a6fb20 77d79f72 FaultyApp!FnThread2+0x2c (FPO: [Non-Fpo]) (CONV: cdecl)
04 00a6fae0 77d79f72 00000000 73eac391 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
05 00a6fb20 77d79f45 00bd100a 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
06 00a6fb38 00000000 00bd100a 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])
# 4 Id: 6a0.11c8 Suspend: 1 Teb: 7efac000 Unfrozen
# ChildEBP RetAddr Args to Child
00 00f0fd3c 77ddf926 73bcc5dd 00000000 00000000 ntdll!DbgBreakPoint (FPO: [0,0,0])
01 00f0fd6c 76e7338a 00000000 00f0fdb8 77d79f72 ntdll!DbgUiRemoteBreakin+0x3c (FPO: [Non-Fpo])
02 00f0fd78 77d79f72 00000000 73bcc509 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
03 00f0fdb8 77d79f45 77ddf8ea 00000000 ffffffff ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
04 00f0fdd0 00000000 77ddf8ea 00000000 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])
The clue here is to look for the Windows API call RtlEnterCriticalSection. There might be more than one thread in the stack making this function call, we can use any such thread.
Looking up on MSDN we see that the API is defined as follows:
NTSTATUS RtlEnterCriticalSection
(
RTL_CRITICAL_SECTION* crit
)
In the above stack dump, we have several such threads, we take any one and them from the parameters list take the address of the RTL_CRITICAL_SECTION object pointer and type cast it to see the contents.
02 0089ff40 00bd105c 6ce392e0 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
0:004> dt ntdll!_RTL_CRITICAL_SECTION 6ce392e0
+0x000 DebugInfo : 0x0069acf8 _RTL_CRITICAL_SECTION_DEBUG
+0x004 LockCount : 0n-14
+0x008 RecursionCount : 0n1
+0x00c OwningThread : 0x000005f8 Void
+0x010 LockSemaphore : 0x00000048 Void
+0x014 SpinCount : 0
We see that one of the members of this object gives us the Owning thread ID. 0x000005f8.
Lets try and find out more details about this thread. In user mode, the double tilde (~~) command will do it for us.
0:004> ~~[0x000005f8 ]
0 Id: 6a0.5f8 Suspend: 1 Teb: 7efdd000 Unfrozen
Start: FaultyApp!mainCRTStartup (00bd1500)
Priority: 0 Priority class: 32 Affinity: ff
Ok so the thread offending is the main thread. This is good. Now we know where in the code of FaultyApp we need to start digging.
Lets start with disassembling the Main function. I will add the code in small sections explaining the relevant portions in between.
0:004> uF FaultyApp!main
FaultyApp!main:
00bd1120 55 push ebp
00bd1121 8bec mov ebp,esp
This is standard function prologue code, that we see in the above two lines.
00bd1123 83ec14 sub esp,14h
The above line is allocating space for local variables in the stack
00bd1126 a11070bf00 mov eax,dword ptr [FaultyApp!__security_cookie (00bf7010)]
00bd112b 33c5 xor eax,ebp
These two lines above is the code for stack guard.
00bd112d 8945fc mov dword ptr [ebp-4],eax
00bd1130 a100b0bf00 mov eax,dword ptr [FaultyApp!_imp__g_CritSec (00bfb000)]
00bd1135 50 push eax
00bd1136 ff1534b0bf00 call dword ptr [FaultyApp!_imp__InitializeCriticalSection (00bfb034)]
the four lines above is the code to initialize the critical section
00bd113c c745ec00000000 mov dword ptr [ebp-14h],0
00bd1143 eb09 jmp FaultyApp!main+0x2e (00bd114e)
FaultyApp!main+0x25:
00bd1145 8b4dec mov ecx,dword ptr [ebp-14h]
00bd1148 83c101 add ecx,1
00bd114b 894dec mov dword ptr [ebp-14h],ecx
FaultyApp!main+0x2e:
00bd114e 837dec03 cmp dword ptr [ebp-14h],3
The line just above is comparing a local variable with a constant '3'
00bd1152 7333 jae FaultyApp!main+0x67 (00bd1187)
If the comparison result is equal or greater than 3 then then jump out, else loop
FaultyApp!main+0x34:
00bd1154 6a00 push 0
00bd1156 6a00 push 0
00bd1158 6a00 push 0
00bd115a 8b55ec mov edx,dword ptr [ebp-14h]
00bd115d 8b04950070bf00 mov eax,dword ptr FaultyApp!pFnarry (00bf7000)[edx*4]
This above sections seems to be initializing a function pointer array for thread functions. We know this because of the call to CreateThread API a few lines below....
00bd1164 50 push eax
00bd1165 6a00 push 0
00bd1167 6a00 push 0
00bd1169 ff154cb0bf00 call dword ptr [FaultyApp!_imp__CreateThread (00bfb04c)]
(the reader is encouraged to use MSDN to see the signature for CreateThread and all the parameters pushed into it inside this function to decrypt what all was passed)
00bd116f 8b4dec mov ecx,dword ptr [ebp-14h]
00bd1172 89448df0 mov dword ptr [ebp+ecx*4-10h],eax
00bd1176 8b55ec mov edx,dword ptr [ebp-14h]
00bd1179 837c95f000 cmp dword ptr [ebp+edx*4-10h],0
00bd117e 7505 jne FaultyApp!main+0x65 (00bd1185)
FaultyApp!main+0x60:
00bd1180 83c8ff or eax,0FFFFFFFFh
00bd1183 eb32 jmp FaultyApp!main+0x97 (00bd11b7)
FaultyApp!main+0x65:
00bd1185 ebbe jmp FaultyApp!main+0x25 (00bd1145)
FaultyApp!main+0x67:
00bd1187 a1b892bf00 mov eax,dword ptr [FaultyApp!hStartThreads (00bf92b8)]
00bd118c 50 push eax
00bd118d ff1540b0bf00 call dword ptr [FaultyApp!_imp__SetEvent (00bfb040)]
00bd1193 6a00 push 0
00bd1195 68f4010000 push 1F4h
00bd119a ff1544b0bf00 call dword ptr [FaultyApp!_imp__SleepEx (00bfb044)]
00bd11a0 e885010000 call FaultyApp!DllFunction (00bd132a)
After setting an event and sleeping for a while, it finally jumps to this function in the above line. DllFunction. We need to disassemble it to see where it leads us. I am not pasting the rest of the disassembly of the Main function as it is irrelevant to the rest of the context.
Lets start with the disassembly of DllFunction.
0:004> u FaultyApp!DllFunction
FaultyApp!DllFunction:
00bd132a ff2504b0bf00 jmp dword ptr [FaultyApp!_imp__DllFunction (00bfb004)]
FaultyApp!__security_check_cookie:
00bd1330 3b0d1070bf00 cmp ecx,dword ptr [FaultyApp!__security_cookie (00bf7010)]
00bd1336 7502 jne FaultyApp!__security_check_cookie+0xa (00bd133a)
00bd1338 f3c3 rep ret
00bd133a e908020000 jmp FaultyApp!__report_gsfailure (00bd1547)
FaultyApp!__tmainCRTStartup:
00bd133f 6a14 push 14h
00bd1341 68d05dbf00 push offset FaultyApp!__rtc_tzz+0x104 (00bf5dd0)
00bd1346 e8551a0000 call FaultyApp!__SEH_prolog4 (00bd2da0)
Whats going on? This doesn't look like a function at all....so look closely at the first line copied again in the line below for your convenience...
00bd132a ff2504b0bf00 jmp dword ptr [FaultyApp!_imp__DllFunction (00bfb004)]
This is actually a function pointer which is being de-referenced. So if it is a pointer, we need to see what it is pointing at:
0:004> dd 00bfb004
00bfb004 6ce11005 00000000 00000000 00000000
00bfb014 00000000 00000000 00000000 00000000
00bfb024 00000000 00000000 00000000 00000000
00bfb034 77d72c8a 77d622c0 77d62280 76e71691
00bfb044 76e71215 76e7324c 76e73495 76e741e8
00bfb054 76e75169 76e74a25 76e751fd 76e711c0
00bfb064 76e711a9 76e717b8 76e71420 77d8107b
00bfb074 77d79dd5 76e779d8 76e74a37 76e71222
Lets have a look at what is actually at this address.
0:004> u 6ce11005
FaultyDLL!ILT+0(_DllFunction):
6ce11005 e916000000 jmp FaultyDLL!DllFunction (6ce11020)
FaultyDLL!ILT+5(_DllMain:
6ce1100a e931000000 jmp FaultyDLL!DllMain (6ce11040)
6ce1100f cc int 3
6ce11010 cc int 3
6ce11011 cc int 3
6ce11012 cc int 3
6ce11013 cc int 3
6ce11014 cc int 3
It shows the name of a different module. FaultyDll. Looks like it is a DLL function being used, through the import address table (I know the name of the module has the word DLL in it, but that is not why i reached this conclusion).
Now lets disassemble the actual function the address of which we have been able to decipher.
0:004> u 6ce11020
FaultyDLL!DllFunction:
6ce11020 55 push ebp
6ce11021 8bec mov ebp,esp
6ce11023 68e092e36c push offset FaultyDLL!g_CritSec (6ce392e0)
6ce11028 ff1500b0e36c call dword ptr [FaultyDLL!_imp__EnterCriticalSection (6ce3b000)]
6ce1102e b801000000 mov eax,1
6ce11033 5d pop ebp
6ce11034 c3 ret
6ce11035 cc int 3
It is a small function and what do we see in it? It enter's the critical section, but never leaves it. This is the bug!!
Main is shown as the owner of the critical section because it calls a faulty function which enters but never leaves the critical section.
Note: This is a very simple test application tailor made to produce the hang, real life scenarios might be much harder to debug and locate. But I do hope that this gives a starting point to folks wanting to chase such bugs!
Trivia : !cs command also gives info about all critical sections. !cs <address of crit> will dump info about that particular critical section. I deliberately avoided using it, as this exercise helps one understand such debugging scenarios in depth, rather than being dependent on a debugger extension.