Basic RocksDB follower implementation (facebook#12540)
Summary:
A basic implementation of RocksDB follower mode, which opens a remote database (referred to as the leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but differs in some key ways:
1. It has its own directory with links to the leader's database
2. Periodically refreshes itself
3. (Future) Snapshot support
4. (Future) Garbage collection of obsolete links
5. (Long term) Memtable replication

There are two main classes implementing this functionality: `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Like `DBImplSecondary`, it implements recovery and catch-up through MANIFEST tailing using the `ReactiveVersionSet`, but it does not consider logs. A future PR will implement memtable replication, which will eliminate the need to catch up using logs. In addition, recovery and catch-up try to avoid directory listings, since repeated metadata operations are expensive.
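To illustrate what "catching up by tailing" means in general, here is a minimal, self-contained sketch. It is a toy model and not the RocksDB implementation: `LogTailer`, the newline-delimited record format, and the file layout are all hypothetical. The real follower reads binary MANIFEST records through `ReactiveVersionSet`. The sketch only shows the idea of remembering a read offset and consuming complete records appended since the previous refresh.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy tailer for an append-only, newline-delimited log.
// Assumption: one record per line; the real MANIFEST format is different.
class LogTailer {
 public:
  explicit LogTailer(std::string path) : path_(std::move(path)) {}

  // Returns the complete records appended since the previous call.
  // A trailing record without a newline is treated as still being written
  // and is left for a later call.
  std::vector<std::string> CatchUp() {
    std::vector<std::string> records;
    std::ifstream in(path_, std::ios::binary);
    if (!in) {
      return records;  // File not visible yet; try again on the next refresh.
    }
    in.seekg(offset_);
    std::string line;
    while (std::getline(in, line)) {
      if (in.eof()) {
        break;  // Hit end of file before a newline: partial record.
      }
      offset_ += static_cast<std::streamoff>(line.size() + 1);
      records.push_back(line);
    }
    return records;
  }

 private:
  std::string path_;
  std::streamoff offset_ = 0;  // Bytes of complete records consumed so far.
};
```

A periodic refresh would call `CatchUp()` on a timer and apply each returned record. Because only the tail of one file is read, no directory listing is needed to discover new state.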

The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and makes the follower directory appear to be a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created, which prevents the underlying files from being deallocated even if the leader deletes them.

TODOs: deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability, etc.

Pull Request resolved: facebook#12540

Reviewed By: jowlyzhang

Differential Revision: D56315718

Pulled By: anand1976

fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b
anand1976 authored and facebook-github-bot committed Apr 20, 2024
1 parent f0864d3 commit d8fb849
Showing 17 changed files with 1,009 additions and 6 deletions.
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -1037,8 +1037,10 @@ endif()

else()
list(APPEND SOURCES
db/db_impl/db_impl_follower.cc
port/port_posix.cc
env/env_posix.cc
env/fs_on_demand.cc
env/fs_posix.cc
env/io_posix.cc)
endif()
@@ -1363,6 +1365,7 @@ if(WITH_TESTS)
db/file_indexer_test.cc
db/filename_test.cc
db/flush_job_test.cc
db/db_follower_test.cc
db/import_column_family_test.cc
db/listener_test.cc
db/log_test.cc
3 changes: 3 additions & 0 deletions Makefile
@@ -1922,6 +1922,9 @@ sst_file_reader_test: $(OBJ_DIR)/table/sst_file_reader_test.o $(TEST_LIBRARY) $(
db_secondary_test: $(OBJ_DIR)/db/db_secondary_test.o $(TEST_LIBRARY) $(LIBRARY)
	$(AM_LINK)

db_follower_test: $(OBJ_DIR)/db/db_follower_test.o $(TEST_LIBRARY) $(LIBRARY)
	$(AM_LINK)

block_cache_tracer_test: $(OBJ_DIR)/trace_replay/block_cache_tracer_test.o $(TEST_LIBRARY) $(LIBRARY)
	$(AM_LINK)

8 changes: 8 additions & 0 deletions TARGETS
@@ -60,6 +60,7 @@ cpp_library_wrapper(name="rocksdb_lib", srcs=[
"db/db_impl/db_impl_debug.cc",
"db/db_impl/db_impl_experimental.cc",
"db/db_impl/db_impl_files.cc",
"db/db_impl/db_impl_follower.cc",
"db/db_impl/db_impl_open.cc",
"db/db_impl/db_impl_readonly.cc",
"db/db_impl/db_impl_secondary.cc",
@@ -117,6 +118,7 @@ cpp_library_wrapper(name="rocksdb_lib", srcs=[
"env/env_posix.cc",
"env/file_system.cc",
"env/file_system_tracer.cc",
"env/fs_on_demand.cc",
"env/fs_posix.cc",
"env/fs_remap.cc",
"env/io_posix.cc",
@@ -4795,6 +4797,12 @@ cpp_unittest_wrapper(name="db_flush_test",
extra_compiler_flags=[])


cpp_unittest_wrapper(name="db_follower_test",
srcs=["db/db_follower_test.cc"],
deps=[":rocksdb_test_lib"],
extra_compiler_flags=[])


cpp_unittest_wrapper(name="db_inplace_update_test",
srcs=["db/db_inplace_update_test.cc"],
deps=[":rocksdb_test_lib"],
63 changes: 63 additions & 0 deletions db/db_follower_test.cc
@@ -0,0 +1,63 @@
// Copyright (c) 2024-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#include "db/db_test_util.h"
#include "port/stack_trace.h"
#include "test_util/sync_point.h"

namespace ROCKSDB_NAMESPACE {

#ifdef OS_LINUX

class DBFollowerTest : public DBTestBase {
 public:
  // Create directories for leader and follower
  // Create the leader DB object
  DBFollowerTest() : DBTestBase("/db_follower_test", /*env_do_fsync*/ false) {
    follower_name_ = dbname_ + "/follower";
    Close();
    Destroy(CurrentOptions());
    EXPECT_EQ(env_->CreateDirIfMissing(dbname_), Status::OK());
    dbname_ = dbname_ + "/leader";
    Reopen(CurrentOptions());
  }

  ~DBFollowerTest() {
    follower_.reset();
    EXPECT_EQ(DestroyDB(follower_name_, CurrentOptions()), Status::OK());
  }

 protected:
  Status OpenAsFollower() {
    return DB::OpenAsFollower(CurrentOptions(), follower_name_, dbname_,
                              &follower_);
  }
  DB* follower() { return follower_.get(); }

 private:
  std::string follower_name_;
  std::unique_ptr<DB> follower_;
};

TEST_F(DBFollowerTest, Basic) {
  ASSERT_OK(Put("k1", "v1"));
  ASSERT_OK(Flush());
  ASSERT_OK(Put("k2", "v2"));
  ASSERT_OK(Flush());

  ASSERT_OK(OpenAsFollower());
  std::string val;
  ASSERT_OK(follower()->Get(ReadOptions(), "k1", &val));
  ASSERT_EQ(val, "v1");
}

#endif
} // namespace ROCKSDB_NAMESPACE

int main(int argc, char** argv) {
  ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}
2 changes: 1 addition & 1 deletion db/db_impl/db_impl.h
@@ -1535,7 +1535,7 @@ class DBImpl : public DB {
Status WriteRecoverableState();

// Actual implementation of Close()
- Status CloseImpl();
+ virtual Status CloseImpl();

// Recover the descriptor from persistent storage. May do a significant
// amount of work to recover recently logged updates. Any changes to