-
Notifications
You must be signed in to change notification settings - Fork 69
MemoryChecker
Starting with 0.4.46 Ocelot includes a memory checker that will detect when either a CUDA runtime function or a thread in a PTX kernel accesses a memory region not associated with the currently selected device.
Memory checking comes on by default after 0.4.46. However, it does introduce a fairly significant performance overhead 1.5-3x slowdown depending on how many memory regions are mapped and how many memory operations your program does. To turn it off
- Edit ocelot/executive/implementation/CooperativeThreadArray.cpp
- Set CHECK_GLOBAL_ACCESSES to 0
The following example shows a program that makes 3 different kind of memory errors and the output from each when run through Ocelot.
/*!
*/
#include <string>
__global__ void init(int* data)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
data[tid] = tid;
}
void wrongMemcpyDirection()
{
int* hostA = new int[128];
memset(hostA, 0, sizeof(int)*128);
int* deviceA;
cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
cudaMemcpy( hostA, deviceA, sizeof(int)*128, cudaMemcpyHostToDevice );
cudaFree( deviceA );
delete[] hostA;
}
void hostPointerOnDevice()
{
int* hostA = new int[128];
memset(hostA, 0, sizeof(int)*128);
int* deviceA;
cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
cudaMemcpy( deviceA, hostA, sizeof(int)*128, cudaMemcpyHostToDevice );
init<<< 4, 32 >>>( hostA );
cudaFree( deviceA );
delete[] hostA;
}
void bufferOverrun()
{
int* hostA = new int[128];
memset(hostA, 0, sizeof(int)*128);
int* deviceA;
cudaMalloc( (void**) &deviceA, sizeof(int)*128 );
cudaMemcpy( deviceA, hostA, sizeof(int)*128, cudaMemcpyHostToDevice );
init<<< 4, 33 >>>( deviceA );
cudaFree( deviceA );
delete[] hostA;
}
int main( int argc, char** argv )
{
if( argc > 1 )
{
if( std::string(argv[1]) == "0" )
{
wrongMemcpyDirection();
}
else if( std::string(argv[1]) == "1" )
{
hostPointerOnDevice();
}
else if( std::string(argv[1]) == "2" )
{
bufferOverrun();
}
}
else
{
wrongMemcpyDirection();
}
return 0;
}
How I built the example on my machine.
nvcc --cuda memoryErrors.cu
g++ -o memoryErrors memoryErrors.cu.cpp `OcelotConfig -l`
Running the first test, tries to do a cudaMemcpyDeviceToHost with the pointers reversed.
./memoryErrors 0
terminate called after throwing an instance of 'hydrazine::Exception'
what(): Invalid destination 0x1d4da50 ( 512bytes) in host to device memcpy.
Device 0 : Ocelot PTX Emulator
Nearby Global Variable Allocations
No Allocations.
Nearby Device Memory Allocations
[0x1d4dd20] - [0x1d4df20] (512 bytes) (global)
****0x1d4da50****
Aborted
The second test passes a host pointer to a device function.
==Ocelot== Emulator failed to run kernel "_Z4initPi" with exception:
==Ocelot== [PC 9] [thread 0] [cta 0] st.global.s32 [%rd4 + 0], %r3 - Global memory address 0x1921a50 of size 4 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot== Nearby Global Variable Allocations
==Ocelot== No Allocations.
==Ocelot==
==Ocelot== Nearby Device Memory Allocations
==Ocelot== [0x1921d20] - [0x1921f20] (512 bytes) (global)
==Ocelot== ****0x1921a50****
==Ocelot==
The final test has device threads write off the end of an allocated region.
==Ocelot== Emulator failed to run kernel "_Z4initPi" with exception:
==Ocelot== [PC 9] [thread 29] [cta 3] st.global.s32 [%rd4 + 0], %r3 - Global memory address 0xf52f20 of size 4 is out of any allocated or mapped range.
==Ocelot== Memory Map:
==Ocelot== Device 0 : Ocelot PTX Emulator
==Ocelot== Nearby Global Variable Allocations
==Ocelot== No Allocations.
==Ocelot==
==Ocelot== Nearby Device Memory Allocations
==Ocelot== [0xf52d20] - [0xf52f20] (512 bytes) (global)
==Ocelot== ****0xf52f20****
==Ocelot==