Tensorflow is taking over my openssl and causing segfaults #417

Open
msdrigg opened this issue Sep 27, 2023 · 5 comments

Comments

@msdrigg

msdrigg commented Sep 27, 2023

So I recently added TensorFlow to a Rust project that had external OpenSSL dependencies (reqwest and paho-mqtt), and I immediately started seeing segfaults. The strange thing is that these segfaults come from crypto functions that paho-mqtt calls but that end up running inside libtensorflow_framework.so.2 (SSLSocket_initialize in the core dump shown below). If I remove paho-mqtt's dependency on SSL, I see the same kind of crash with reqwest.

Relevant Logs

This backtrace occurs reliably every time I run my program.

(gdb) bt
#0  __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x0)
    at ./nptl/pthread_rwlock_common.c:603
#1  ___pthread_rwlock_wrlock (rwlock=0x0) at ./nptl/pthread_rwlock_wrlock.c:26
#2  0x00007f8ec0e6db69 in CRYPTO_STATIC_MUTEX_lock_write ()
   from /home/myuser/workspace/target/debug/build/tensorflow-sys-b3a831e1f8b18f5e/out/libtensorflow_framework.so.2
#3  0x00007f8ec0df6263 in CRYPTO_get_ex_new_index ()
   from /home/myuser/workspace/target/debug/build/tensorflow-sys-b3a831e1f8b18f5e/out/libtensorflow_framework.so.2
#4  0x0000564ee8a50b43 in SSLSocket_initialize ()
    at /home/myuser/.cargo/registry/src/index.crates.io-6f17d22bba15001f/paho-mqtt-sys-0.8.1/paho.mqtt.c/src/SSLSocket.c:492
#5  0x0000564ee8a440ff in MQTTAsync_createWithOptions (handle=0x7f8ea4bdfe00, 
    serverURI=0x7f8df4004fc0 "tcp://localhost:1883", 
    clientId=0x7f8df4004fe0 "program", persistence_type=1, 
    persistence_context=0x0, options=0x7f8ea4bdfcc8)
    at /home/myuser/.cargo/registry/src/index.crates.io-6f17d22bba15001f/paho-mqtt-sys-0.8.1/paho.mqtt.c/src/MQTTAsync.c:372
#6  0x0000564ee8a22c37 in paho_mqtt::async_client::AsyncClient::new<paho_mqtt::create_options::CreateOptions> (opts=...) at src/async_client.rs:201
#7  0x0000564ee8a2127a in paho_mqtt::create_options::CreateOptionsBuilder::create_client (self=...)
    at src/create_options.rs:444

Interestingly, here's what I see from ldd. Note that libssl.so.3 correctly points to the system OpenSSL, so I don't know why, at runtime, these calls get bound to libtensorflow_framework.so.2:

$ ldd target/debug/program
        linux-vdso.so.1 (0x00007ffc46ffe000)
        libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2 (0x00007fb1b0000000)
        libtensorflow.so.2 => /usr/local/lib/libtensorflow.so.2 (0x00007fb19f000000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007fb1b767e000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007fb19ea00000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb1b765e000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb19ef19000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb19e600000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb1b773c000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb19e200000)
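Whether the build.rs workaround discussed later in this thread has taken effect comes down to where libtensorflow_framework.so.2 sits in the ldd listing relative to libssl.so.3 and libcrypto.so.3. A minimal sketch of that check follows. It assumes ldd prints libraries in the order the dynamic linker loads them, which is also the order it searches for global symbols (ignoring symbol versioning). The helper names `position` and `ssl_loads_first` are hypothetical, not part of any crate:

```rust
/// Index of the first non-empty `ldd` line whose library name starts with `prefix`.
fn position(ldd_output: &str, prefix: &str) -> Option<usize> {
    ldd_output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .position(|line| line.starts_with(prefix))
}

/// Hypothetical helper: `Some(true)` if libssl and libcrypto are both listed
/// before libtensorflow_framework, `Some(false)` if either is listed after it,
/// and `None` if any of the three libraries is missing from the output.
fn ssl_loads_first(ldd_output: &str) -> Option<bool> {
    let tf = position(ldd_output, "libtensorflow_framework.so")?;
    let ssl = position(ldd_output, "libssl.so")?;
    let crypto = position(ldd_output, "libcrypto.so")?;
    Some(ssl < tf && crypto < tf)
}

fn main() {
    // Trimmed excerpt of the ldd listing in the original report.
    let original = [
        "libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2",
        "libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3",
        "libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3",
    ]
    .join("\n");
    // The same libraries in the order reported after the build.rs workaround.
    let with_build_rs = [
        "libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3",
        "libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3",
        "libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2",
    ]
    .join("\n");
    println!("original: {:?}", ssl_loads_first(&original)); // Some(false)
    println!("with build.rs: {:?}", ssl_loads_first(&with_build_rs)); // Some(true)
}
```

To check a real binary, the string could come from running ldd through std::process::Command instead of the hard-coded excerpts.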

Note: I am using the latest Rust version and the latest versions of all packages mentioned here. Here is my uname -a output:

Linux pop-os 6.4.6-76060406-generic #202307241739~1690928105~22.04~d567a38 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux

Prior Art

The only other mention of this issue I could find was tensorflow/tensorflow#34742, and I am currently trying to resolve my problem using the steps outlined there.

Goals

A perfect fix would let me use TensorFlow and OpenSSL together in a project without any tweaks, but I would consider this issue closed for me if we could find a workaround (environment variables, a build script, or something similar) that keeps my project from segfaulting.

@msdrigg
Author

msdrigg commented Sep 27, 2023

I tried all the solutions mentioned in tensorflow/tensorflow#34742, and none of them worked. My final attempt was the following build, and it still did not solve the problem:

bazel build --compilation_mode=opt --jobs=25 --config=noaws --config=nogcp --config=nohdfs --config=nonccl --config=monolithic tensorflow

@adamcrume
Contributor

Are you pointing Rust to the TensorFlow library you built? There are instructions on how to do that at https://github.com/tensorflow/rust/blob/master/tensorflow-sys/README.md#manual-tensorflow-compilation.

@msdrigg
Author

msdrigg commented Oct 5, 2023

Yes, I moved the compiled objects into /usr/local/lib and ran ldconfig on the directory.

@treehaqr

treehaqr commented Jun 4, 2024

+1 for me. All I do is open an HTTP connection with the reqwest crate, and it crashes. My code is totally unrelated to TensorFlow, but somehow TensorFlow now takes over the OpenSSL library.

  * frame #0: 0x00007fffcc6969fc libc.so.6`__GI___pthread_kill at pthread_kill.c:44:76
    frame #1: 0x00007fffcc6969b0 libc.so.6`__GI___pthread_kill [inlined] __pthread_kill_internal(signo=6, threadid=140737314203328) at pthread_kill.c:78:10
    frame #2: 0x00007fffcc6969b0 libc.so.6`__GI___pthread_kill(threadid=140737314203328, signo=6) at pthread_kill.c:89:10
    frame #3: 0x00007fffcc642476 libc.so.6`__GI_raise(sig=6) at raise.c:26:13
    frame #4: 0x00007fffcc6287f3 libc.so.6`__GI_abort at abort.c:79:7
    frame #5: 0x00007fffcc689676 libc.so.6`__libc_message(action=do_abort, fmt="\U00000010") at libc_fatal.c:155:5
    frame #6: 0x00007fffcc6a0cfc libc.so.6`malloc_printerr(str=<unavailable>) at malloc.c:5664:3
    frame #7: 0x00007fffcc6a2a44 libc.so.6`_int_free(av=<unavailable>, p=<unavailable>, have_lock=0) at malloc.c:4439:5
    frame #8: 0x00007fffcc6a5453 libc.so.6`__GI___libc_free(mem=<unavailable>) at malloc.c:3391:7
    frame #9: 0x00007ffff70a1c9a libtensorflow_framework.so.2`bssl::ssl_crypto_x509_ssl_ctx_free(ssl_ctx_st*) + 58
    frame #10: 0x00007ffff7094f86 libtensorflow_framework.so.2`ssl_ctx_st::~ssl_ctx_st() + 70
    frame #11: 0x00007ffff7095456 libtensorflow_framework.so.2`SSL_CTX_free + 38
    frame #12: 0x00005555565c990e program`_$LT$openssl..ssl..SslContext$u20$as$u20$core..ops..drop..Drop$GT$::drop::he1e1bafd7778b929(self=0x00007fffcbe22000) at lib.rs:241:26
    frame #13: 0x00005555565d58da program`core::ptr::drop_in_place$LT$openssl..ssl..SslContext$GT$::h8483f3eb796b6aee((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #14: 0x000055555600318b program`core::ptr::drop_in_place$LT$openssl..ssl..connector..SslConnector$GT$::ha52c6b5831405ca0((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #15: 0x000055555600316b program`core::ptr::drop_in_place$LT$native_tls..imp..TlsConnector$GT$::h898a325e5e6a2390((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #16: 0x000055555600315b program`core::ptr::drop_in_place$LT$native_tls..TlsConnector$GT$::h05dcf2f2ec19f859((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #17: 0x0000555555f4215c program`core::ptr::drop_in_place$LT$reqwest..connect..Inner$GT$::h1ccf2f0fb635dba6((null)=0x00007fffcbe21fe8) at mod.rs:497:1
    frame #18: 0x0000555555f425cb program`core::ptr::drop_in_place$LT$reqwest..connect..Connector$GT$::hbfb676efb078b00f((null)=0x00007fffcbe21fd8) at mod.rs:497:1
    frame #19: 0x0000555555f3a42e program`core::ptr::drop_in_place$LT$hyper_util..client..legacy..client..Client$LT$reqwest..connect..Connector$C$reqwest..async_impl..body..Body$GT$$GT$::h6a79d474bb243160((null)=0x00007fffcbe21f10) at mod.rs:497:1
    frame #20: 0x0000555555f4346c program`core::ptr::drop_in_place$LT$reqwest..async_impl..client..ClientRef$GT$::h621ffac56c4ab15f((null)=0x00007fffcbe21f10) at mod.rs:497:1
    frame #21: 0x0000555555f0ee3f program`alloc::sync::Arc$LT$T$C$A$GT$::drop_slow::h6322ecbb95a2aa20(self=0x00007fffffff3540) at sync.rs:1751:18
    frame #22: 0x0000555555f147e5 program`_$LT$alloc..sync..Arc$LT$T$C$A$GT$$u20$as$u20$core..ops..drop..Drop$GT$::drop::h25fd65b8fc0fe2ed(self=0x00007fffffff3540) at sync.rs:2407:13
    frame #23: 0x0000555555f4485b program`core::ptr::drop_in_place$LT$alloc..sync..Arc$LT$reqwest..async_impl..client..ClientRef$GT$$GT$::h15126958e085d642((null)=0x00007fffffff3540) at mod.rs:497:1
    frame #24: 0x0000555555f431db program`core::ptr::drop_in_place$LT$reqwest..async_impl..client..Client$GT$::h9be9cca15fbe82e9((null)=0x00007fffffff3540) at mod.rs:497:1

$ ldd target/debug/program
        linux-vdso.so.1 (0x00007ffef968b000)
        libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2 (0x0000783696400000)
        libtensorflow.so.2 => /usr/local/lib/libtensorflow.so.2 (0x000078366de00000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000078369895c000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000078366d800000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000078366d400000)
        libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x000078369ba1f000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000078369b9ff000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000078366dd19000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000078366d000000)
        /lib64/ld-linux-x86-64.so.2 (0x000078369ba80000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000078369b9f8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000078369b9f3000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000078369b9ee000)

@treehaqr

treehaqr commented Jun 7, 2024

I worked around this by moving libssl and libcrypto ahead of TensorFlow in the library order shown by ldd above. Create a build.rs with this code:

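use std::error::Error;

// build.rs: link the system OpenSSL libraries directly into the binary so that
// libssl and libcrypto are loaded before libtensorflow (see the ldd output below).
fn main() -> Result<(), Box<dyn Error>> {
    // Tell Cargo to pass `-l dylib=ssl` and `-l dylib=crypto` to rustc.
    println!("cargo:rustc-link-lib=dylib=ssl");
    println!("cargo:rustc-link-lib=dylib=crypto");
    Ok(())
}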

Note that libssl and libcrypto now come before libtensorflow, so the program never tries to use TensorFlow's statically linked SSL:

        linux-vdso.so.1 (0x00007fffbb3a9000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x0000772aa0f5c000) <-- here
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x0000772aa0a00000) <-- here
        libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2 (0x0000772a9b400000)
        libtensorflow.so.2 => /usr/local/lib/libtensorflow.so.2 (0x0000772a76000000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000772a75c00000)
        libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x0000772aa401c000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000772aa3ffc000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000772aa0e75000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000772a75800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000772aa407d000)
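Why this order matters can be illustrated with a toy model of how the glibc dynamic linker binds a global symbol: it searches the loaded objects in load order and uses the first definition it finds, so a library loaded earlier interposes on same-named symbols in a later one. The sketch below is a simplification under that assumption (it ignores symbol versioning and visibility). `SharedObject` and `resolve` are hypothetical names, and the export lists are reduced to SSL_CTX_free and CRYPTO_get_ex_new_index, the two functions the backtraces above show libtensorflow_framework.so.2 defining:

```rust
/// A loaded shared object and a simplified list of the symbols it defines.
struct SharedObject {
    name: &'static str,
    exports: &'static [&'static str],
}

/// Toy model of global symbol lookup: walk the objects in load order and
/// return the name of the first one that defines `symbol`.
/// Ignores symbol versioning, symbol visibility and -Bsymbolic.
fn resolve(load_order: &[SharedObject], symbol: &str) -> Option<&'static str> {
    load_order
        .iter()
        .find(|obj| obj.exports.iter().any(|&export| export == symbol))
        .map(|obj| obj.name)
}

fn main() {
    // Order from the original ldd output: libtensorflow_framework first.
    let original_order = [
        SharedObject { name: "libtensorflow_framework.so.2", exports: &["SSL_CTX_free", "CRYPTO_get_ex_new_index"] },
        SharedObject { name: "libssl.so.3", exports: &["SSL_CTX_free"] },
        SharedObject { name: "libcrypto.so.3", exports: &["CRYPTO_get_ex_new_index"] },
    ];
    // Order after the build.rs workaround: libssl and libcrypto first.
    let build_rs_order = [
        SharedObject { name: "libssl.so.3", exports: &["SSL_CTX_free"] },
        SharedObject { name: "libcrypto.so.3", exports: &["CRYPTO_get_ex_new_index"] },
        SharedObject { name: "libtensorflow_framework.so.2", exports: &["SSL_CTX_free", "CRYPTO_get_ex_new_index"] },
    ];
    for symbol in ["SSL_CTX_free", "CRYPTO_get_ex_new_index"] {
        println!(
            "{}: original -> {:?}, with build.rs -> {:?}",
            symbol,
            resolve(&original_order, symbol),
            resolve(&build_rs_order, symbol)
        );
    }
}
```

In this model, both symbols resolve to libtensorflow_framework.so.2 in the original order, while in the build.rs order SSL_CTX_free resolves to libssl.so.3 and CRYPTO_get_ex_new_index to libcrypto.so.3, which is the effect the workaround relies on.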

It's still broken for unit tests because, to my knowledge, there's no way to enforce the link order in tests.

Ideally, libtensorflow should not statically link its own SSL library at all (the bssl:: frames in the backtrace suggest it bundles BoringSSL), so that the binary can choose its own libssl.
