-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crashtracking for Windows #892
base: main
Are you sure you want to change the base?
Conversation
BenchmarksComparisonBenchmark execution time: 2025-03-13 16:28:08 Comparing candidate commit 17fe93c in PR branch Found 1 performance improvements and 0 performance regressions! Performance is the same for 51 metrics, 2 unstable metrics. scenario:tags/replace_trace_tags
CandidateCandidate benchmark detailsGroup 1
Group 2
Group 3
Group 4
Group 5
Group 6
Group 7
Group 8
Group 9
Group 10
Group 11
Group 12
Group 13
BaselineOmitted due to size. |
aa6bc16
to
14136df
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #892 +/- ##
==========================================
+ Coverage 72.07% 72.48% +0.41%
==========================================
Files 328 333 +5
Lines 48891 50097 +1206
==========================================
+ Hits 35237 36312 +1075
- Misses 13654 13785 +131
🚀 New features to boost your workflow:
|
let mut path = env::temp_dir().join(process_name); | ||
path.set_extension("dll"); | ||
|
||
// Attempt to move it just in case it already exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would this happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we log something here? Is this unexpected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reused the logic from the trampoline (since the need is the same): https://github.com/DataDog/libdatadog/blob/main/spawn_worker/src/win32.rs#L48
The filename is made of the user SID and the version number, so if multiple instances of PHP are running they will share the same file. I think this is a good thing for crashtracking because we need to add the path to the registry, and I'm afraid we would add a lot of garbage if the path was random.
.file("src/crashtracking_trampoline.cpp") // Path to your C++ file | ||
.warnings(true) | ||
.warnings_into_errors(true) | ||
.flag("/std:c++17") // Set the C++ standard (adjust as needed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does having a C++
binary increase the size of libdatadog vs C
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably but the size is still reasonable. The only reason I used C++ is because it has regex support in the stdlib. The DLL size is 160 KB, I believe it's acceptable (it was ~60 KB in C with manual parsing).
|
||
if (!EnumProcessModules(process, nullptr, 0, &cbNeeded)) | ||
{ | ||
OutputDebugStringW(L"Failed to enumerate process modules (1st)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does 1st vs 2nd mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call EnumProcessModules
twice (first to get the number of modules, then to populate them). It's simply to know if we failed in the first or the second call.
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
write!( | ||
f, | ||
"{:08x}{:04x}{:04x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this documented somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is standard guid formatting on Windows, but without the dashes: https://devblogs.microsoft.com/oldnewthing/20220928-00/?p=107221
let debug_data_dir: IMAGE_DATA_DIRECTORY = if is_pe32 { | ||
let nt_headers32: IMAGE_NT_HEADERS32 = read_memory(process_handle, nt_headers_address)?; | ||
nt_headers32.OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_DEBUG.0 as usize] | ||
} else { | ||
let nt_headers64: IMAGE_NT_HEADERS64 = read_memory(process_handle, nt_headers_address)?; | ||
nt_headers64.OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_DEBUG.0 as usize] | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there documentation for why this is the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a good official documentation about the PE format. It's mostly the definitions in the official headers (https://learn.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-image_nt_headers64) and then a bunch of third-party articles: https://learn.microsoft.com/en-us/archive/msdn-magazine/2002/february/inside-windows-win32-portable-executable-file-format-in-detail https://wiki.osdev.org/PE
For the Rust implementation, I simply converted the C++ code we wrote for crashtracking in .net: https://github.com/DataDog/dd-trace-dotnet/blob/master/profiler/src/ProfilerEngine/Datadog.Profiler.Native.Windows/CrashReportingWindows.cpp#L271
which has proper testing: https://github.com/DataDog/dd-trace-dotnet/blob/master/profiler/test/Datadog.Profiler.Native.Tests/CrashReportingTest.cpp
We probably want to add similar testing in the libdatadog repository.
if thread_entry.th32OwnerProcessID == pid { | ||
thread_ids.push(thread_entry.th32ThreadID); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We loop over every thread on the machine? Could that be expensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and yes. But as crazy as it sounds, this is the normal way.
In "recent" versions of windows (since 2012 R2) there is an alternative way that doesn't require to enumerate all threads (using something called "process snapshotting", which can be thought as Windows' sane version of vfork
). However we would need to confirm that it works correctly in the context of WER, so that requires additional research. Our implementation of crashtracking in .NET uses CreateToolhelp32Snapshot
, so I'd rather rely on this battle-tested solution for now.
Cargo.toml
Outdated
"crashtracker-ffi/tests/test_app", | ||
"crashtracker-ffi/tests/test_app_lib", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need a test app to crash. The lib is because I need a DLL for WER. Maybe it's possible to reference datadog-crashtracker-ffi as a DLL but I didn't find how (when I reference the crate it gets statically linked)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reorganized the tests and remove test_app_lib (but for that I had to add a cdylib
crate-type to crashtracker-ffi)
Ok(()) | ||
} | ||
|
||
unsafe fn output_debug_string(message: &str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best practice is to put unsafe
on the function if there is a contract the caller must follow, and have an unsafe
block inside a "safe" function otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good!
crashtracker-ffi/Cargo.toml
Outdated
@@ -12,10 +12,11 @@ license.workspace = true | |||
bench = false | |||
|
|||
[features] | |||
default = ["cbindgen", "collector", "demangler", "receiver"] | |||
default = ["cbindgen", "collector", "collector_windows", "demangler", "receiver"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be cleaner if its possible
|
||
[[bin]] | ||
name = "test_app" | ||
path = "tests/test_app/src/main.rs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with this on Linux?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test application will compile but be empty.
} | ||
|
||
fn output_debug_string(message: &str) { | ||
unsafe { OutputDebugStringW(&HSTRING::from(message)) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safety comment
"exception_information is null" | ||
); | ||
|
||
let process_handle = unsafe { (*exception_information).hProcess }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safety comments
unsafe { | ||
*ptr = 42; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to wrap this in blackbox
like some of the other tests do to avoid the compiler doing something clever
|
||
if open_result == ERROR_SUCCESS { | ||
// Check if the value exists | ||
let query_result = unsafe { RegQueryValueExW(hkey, &name, None, None, None, None) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
safety comments
} { | ||
let mut frame = StackFrame::new(); | ||
|
||
frame.ip = Some(format!("{:x}", native_frame.AddrPC.Offset)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to specify a width here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, why would I? I'm just converting the ip to hex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he means to have consistent padding, like always 16 hex chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh 🤔 Those values are not meant to be interpreted by a human, so I don't think the padding makes sense
|
||
// Force a segfault to crash | ||
let ptr = std::ptr::null_mut::<i32>(); | ||
// SAFETY: Don't worry, we are crashing on purpose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤣
What does this PR do?
Add Windows support for crashtracking.
Motivation
PHP (and probably other languages) have some unique crashes on Windows, so the Linux support isn't enough.
Additional Notes
Our implementation of crashtracking for Windows relies on WER (Windows Error Reporting).
The way it works is: the process calls
WerRegisterRuntimeExceptionModule
to register the DLL that contains the crash handling code. The DLL must expose three methods:OutOfProcessExceptionEventCallback
: gives a chance to analyze the crashing process. Optionally, the crash handler can setownershipClaimed
to true to "claim" the crash. If it does, then the crash handler provide some metadata that will be attached to the crash report in the Windows event log. Since we are only implementing crash tracking for telemetry, we never claim the crash.OutOfProcessExceptionEventSignatureCallback
: this method is only called if the crash handler claimed the crash. It's used to provide metadata to help triaging the crash.OutOfProcessExceptionEventDebuggerLaunchCallback
: this method is only called if the crash handler claimed the crash. It's used to provide the path to a custom debugger that will be used to debug the crashing process.In addition, the path to the crash handler must be added to the registry, in
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\RuntimeExceptionHelperModules
(requires administrator permissions). Since Windows 10 20H1, it's possible to useHKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\Windows Error Reporting\RuntimeExceptionHelperModules
instead (doesn't require additional permissions in most environments). The only limitation of using HKCU instead of HKLM is that the crash handler can't claim the crash (which we don't really care about).Ideally, we should add the key to
HKEY_LOCAL_MACHINE
during installation. To handle scenarios where it's not possible, theinit_crashtracking_windows
method looks for the key, and creates one inHKEY_CURRENT_USER
if it can't be found.When the process crashes, Windows suspends it then spawns
WerFault.exe
. WerFault loads the registered crash handlers in order, and callsOutOfProcessExceptionEventCallback
for each of them until the crash is claimed (or none are left). The crashing process is resumed when WerFault terminates (gracefully or abruptly). In theory, crashing in WerFault has no visible consequence on the system, so this is a very safe way to inspect the crashes.To inspect the crashing process during
OutOfProcessExceptionEventCallback
, we use the APIs exposed bydbghelp.dll
.dbghelp.dll
is present on all Windows systems, however the version of the library can change. Because of this, we must be careful about what APIs we call, and favor the oldest ones. Currently we limit ourselves toSymInitializeW
(it's unclear what it does exactly, but it's required to be able to walk the callstacks), andStackWalkEx
(walks the stack of the given thread). The .NET tracer already relies on those methods for its crashtracking implementation, so we know they're safe to call on the supported versions of Windows (2012 R2 and ulterior).It's very hard to monitor what happens inside of the crash handler loaded in WerFault.exe, because it's a background process with no window or console. To provide a bit of visibility, we call the Win32 API
OutputDebugString
, which logs message that can be listened to using a special system-wide debugger (for instance, DebugView).How to test the change?
test_crash
incollector_windows_tests
launches a test application and checks that a crash report is generated. The same test application can be used to test manually if needed.