
Can Nebula2Nebula support multi-thread sync ? #136

Open
awang12345 opened this issue Dec 28, 2023 · 13 comments
Labels
type/question Type: question about the product

Comments

@awang12345

awang12345 commented Dec 28, 2023

reason
The current logic reads all the partition data first and then composes INSERT statements to write it out to the new Nebula.
With a large amount of data this can run out of memory — Spark does spill to disk, but I still ran into OOM job interruptions. It also fails to exploit multitasking: the partitions could be split, one task per partition, so each partition is written as soon as it has been read, which would be much more efficient.

// Multi-thread sync: one task per partition of a tag or edge
for (partitionId <- 1 to partitions) {
  val task = new Runnable {
    def run(): Unit = {
      syncTagPartitionData(spark,
        ........
        partitionId
      )
    }
  }
  threadPool.execute(task)
}

// Set the specific partitionId to scan
val nebulaReadVertexConfig: ReadNebulaConfig = ReadNebulaConfig
  .builder()
  .with.....
  ......
  .withPartitionId(partitionId)
  .build()
val vertex = spark.read.nebula(sourceConfig, nebulaReadVertexConfig).loadVerticesToDF()
 
// Create a read task for the specified partition id
class SimpleScan(nebulaOptions: NebulaOptions, nebulaTotalPart: Int, schema: StructType)
  extends Scan
  with Batch {

  override def planInputPartitions(): Array[InputPartition] = {
    // Return only the specified partition id for this task
    if (nebulaOptions.readPartitionId != null && nebulaOptions.readPartitionId > 0) {
      LOG.info(s"planInputPartitions partitions: ${nebulaOptions.readPartitionId}")
      return Array(NebulaPartitionBatch(Array(nebulaOptions.readPartitionId)))
    }
    .....
  }
}
  
@QingZ11 QingZ11 added the type/question Type: question about the product label Jan 2, 2024
@Nicole00
Copy link
Contributor
Contributor

Nicole00 commented Jan 2, 2024

Thanks for proposing this — it's a great idea. Is your goal to increase concurrency and avoid the OOM problem?
I wonder why you use multiple threads rather than Spark's own partition parallelism. If we want high concurrency, we can assign more total executor cores so that all partitions of all tags and edges run together.
And if we want to avoid OOM, we should on the contrary decrease concurrency, to reduce the amount of data held in memory.
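For illustration, that kind of tuning happens at submit time; a minimal sketch, assuming a YARN deployment (the class name, jar name and numbers below are placeholders, not from this thread):

```shell
# Illustrative sketch only: class and jar names are placeholders.
# 8 executors * 4 cores = up to 32 Nebula partitions scanned in parallel.
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 4 \
  --class com.example.Nebula2NebulaApp \
  nebula2nebula-example.jar
```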

@awang12345
Copy link
Author

OOM is only one of the reasons; the main one is that the existing Nebula2Nebula example has to read all the data before writing, which is too slow. Some partitions can be read first, and the data that has already been read can be written while the rest is still being read. The goal is to overlap reading and writing as much as possible, instead of only reading or only writing at any given time, for faster synchronization.

@awang12345
Copy link
Author

If we simply increase the read concurrency we can improve efficiency, but it puts too much pressure on the source Nebula server.

@Nicole00
Copy link
Contributor

Nicole00 commented Jan 2, 2024

No matter which method is used, as long as the concurrency level at the upper layer is the same, the pressure on the server is the same.
Besides, the reader uses the client's ScanIterator, which cannot be split into sub-tasks.

@awang12345
Copy link
Author

But if I want to read a partition and then write it straight away, can I do that? I don't know much about Spark, and in my practice writing currently only starts once everything has been read. For example, with 256 partitions, I could allow at most 16 to be read concurrently and synchronize each partition to the target Nebula server as soon as it has been read.
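As a minimal JVM-level sketch of that bounding, not the connector's actual API: `syncPartition` below is a hypothetical stand-in for "read one partition from the source, then write it to the target", and the 256/16 numbers are the illustrative values from this comment.

```scala
import java.util.concurrent.{Executors, Semaphore, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object PartitionSyncSketch {
  // Illustrative values: 256 source partitions, at most 16 in flight at once.
  val totalPartitions    = 256
  val maxConcurrentReads = 16
  val completed          = new AtomicInteger(0)

  // Hypothetical placeholder for: read partition `partitionId` from the
  // source Nebula, then write it to the target as soon as the read finishes.
  def syncPartition(partitionId: Int): Unit = {
    completed.incrementAndGet()
  }

  def runAll(): Unit = {
    val permits = new Semaphore(maxConcurrentReads)
    val pool    = Executors.newFixedThreadPool(maxConcurrentReads)
    for (partitionId <- 1 to totalPartitions) {
      permits.acquire() // blocks submission while 16 partitions are in flight
      pool.execute(new Runnable {
        def run(): Unit =
          try syncPartition(partitionId)
          finally permits.release()
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)
  }
}
```

The semaphore gives back-pressure on the submitting loop, so no more than 16 partitions are ever being synced at the same time, and each finished partition frees a slot for the next one.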

@Nicole00
Copy link
Contributor

Nicole00 commented Jan 3, 2024

> But if I want to read a partition and then write it straight away, can I do that? I don't know much about Spark, and in my practice writing currently only starts once everything has been read. For example, with 256 partitions, I could allow at most 16 to be read concurrently and synchronize each partition to the target Nebula server as soon as it has been read.

First of all, for the Spark connector, reading means reading the data of a specified tag/edge. This does not involve specifying a part, so the partId is not exposed to the outside world.
Secondly, from the connector's perspective there is no business requirement to read only a certain part's data, because such data is incomplete and has no business meaning on its own.

If we want to scan a certain part, we should use the StorageClient in the Java client to scan the data of the specified part.

@awang12345
Copy link
Author

Sorry, maybe I didn't make that clear. What I meant is that I could start writing as soon as any one partition has been read, instead of waiting for all the partitions to be read.

@Nicole00
Copy link
Contributor

Nicole00 commented Jan 3, 2024

Yeah, I got your point.
You want to read and write at the same time, which requires finer read/write granularity. Currently the granularity is at the part level: once a part's data has been read, it is written out.

For now, the part is the smallest granularity. It is impossible to read concurrently within a single part, because you do not know where the scan cursor sits in each iteration.

@awang12345
Copy link
Author

Thank you very much for your response. My idea is to split a tag's or edge's partition list so that each partition becomes its own DataFrame to sync. As in my example above, I can read a partition and then write it, and this is already running in production.

@Nicole00
Copy link
Contributor

Nicole00 commented Jan 3, 2024

Wow, that's great. Can it still use the multiple machines of a Spark cluster?

@awang12345
Copy link
Author

awang12345 commented Jan 3, 2024

Yes, the online Spark environment is a YARN-based cluster. Synchronization sped up by 50%.

@Nicole00
Copy link
Contributor

Nicole00 commented Jan 8, 2024

> Yes, the online Spark environment is a YARN-based cluster. Synchronization sped up by 50%.

This is an amazing improvement. How fast can data be migrated between Nebula clusters? I'll test it and see whether we can use it in our environment.

@awang12345
Copy link
Author

awang12345 commented Jan 8, 2024

Migrating 3 billion records completed in 6 hours, whereas the original program took 14 hours. Of course, this has something to do with the read rate limit. But under the same rate limit, multi-threaded per-partition synchronization is still 50% better, because reads and writes run in parallel.
