Getting Started for StreamingBench
Note: this is for HiBench 5.0
-
Prerequisites
Finish the configuration described in Getting Started. For running Samza, a Hadoop YARN cluster is needed.
Download & set up ZooKeeper (3.3.3 is preferred).
Download & set up Apache Kafka (0.8.1 with Scala 2.10 is preferred).
Download & set up Apache Storm (0.9.3 is preferred).
-
ZooKeeper setup
Edit the config file in the ZooKeeper installation directory; please refer to conf/example/zookeeper for an example.
Go to the install directory of ZooKeeper and start ZooKeeper with that config file.
You may run
bin/zkCli.sh
to verify that ZooKeeper is working properly. Sometimes you may need to clean up the data inside ZooKeeper: first stop the server, then run rm -rf /path/to/zookeeper/datadir to clean the data dir. The directory is defined in your config file.
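For reference, a minimal start/verify/clean sequence might look like the following (a sketch run from the ZooKeeper install directory, assuming the config is picked up from conf/zoo.cfg and that /path/to/zookeeper/datadir stands for the dataDir in your config):

bin/zkServer.sh start                    # start the server with the configured zoo.cfg
bin/zkCli.sh -server localhost:2181      # open a client shell to verify the server responds
bin/zkServer.sh stop                     # stop the server before cleaning old data
rm -rf /path/to/zookeeper/datadir        # wipe the dataDir defined in your config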
-
Kafka setup
When configuring Kafka and the topic partition count, we need to ensure the disk won't become a bottleneck. It is suggested to start several brokers on each Kafka node and to configure several disks for each broker. Different brokers on the same node may share disks, but each should have its own directories on those disks. We use 16 partitions per Kafka node: if the Kafka cluster contains only 1 Kafka node, we create topics with 16 partitions; for an environment with 3 Kafka nodes, we create topics with 48 partitions.
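As a concrete sketch, a 16-partition topic on a single-node cluster could be created like this (the topic name and ZooKeeper address are placeholders; HiBench's prepare scripts normally create the topics for you):

bin/kafka-topics.sh --create --zookeeper zkhost:2181 --replication-factor 1 --partitions 16 --topic my_topic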
A typical set of Kafka config files is config/serv1.properties through serv4.properties under the Kafka installation directory.
Ensure the ZooKeeper configured in config/servX.properties is working properly.
To start 4 brokers on a node, go to the Kafka install directory and run the following commands:
env JMX_PORT=10000 bin/kafka-server-start.sh config/serv1.properties
env JMX_PORT=10001 bin/kafka-server-start.sh config/serv2.properties
env JMX_PORT=10002 bin/kafka-server-start.sh config/serv3.properties
env JMX_PORT=10003 bin/kafka-server-start.sh config/serv4.properties
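Each servX.properties differs mainly in its broker id, port, and log directories; a minimal sketch for one broker might look like this (all values below are illustrative assumptions, not HiBench defaults):

# config/serv1.properties (illustrative values)
broker.id=1
port=9092
log.dirs=/disk1/kafka-serv1,/disk2/kafka-serv1
zookeeper.connect=zkhost:2181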
To see if the Kafka brokers are registered in ZooKeeper, go to the ZooKeeper install directory and run
bin/zkCli.sh
to start a ZooKeeper client window, then run ls /brokers/ids
. As with ZooKeeper, you may need to clean old data located on the disks of the Kafka brokers. Just run
rm -rf <all_data_path>
on all your Kafka nodes and directories.
-
Spark setup
All Spark Streaming related parameters can be defined in conf/99-user_defined_properties.conf.

Param Name                                        Param Meaning
spark.executor.memory                             available memory for Spark worker machines
spark.serializer / spark.kryo.referenceTracking   relevant to the data encoding format
spark.streaming.receiver.writeAheadLog.enable     whether to enable the Write Ahead Log
spark.streaming.blockQueueSize                    size of the streaming block queue

Spark Streaming can be deployed in YARN mode or standalone mode. For YARN mode, just set hibench.spark.master to yarn-client. For standalone mode, set it to spark://spark_master_ip:port and run sbin/start-master.sh in your Spark home.
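For illustration, the standalone-mode entries in conf/99-user_defined_properties.conf might look like the following (the master host, port, and memory size are placeholder assumptions, not defaults):

hibench.spark.master     spark://spark-master-host:7077
spark.executor.memory    4g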
-
Storm setup
The conf file is conf/storm.yaml. Basically we configure the following params:

Param Name                   Param Meaning
supervisor.slots.ports       number of worker slots in one supervisor (we set 3 slots per supervisor)
nimbus.childopts             JVM size of nimbus
supervisor.childopts         JVM size of supervisor
worker.childopts             JVM size of worker
topology.max.spout.pending   maximum number of pending spout tuples that can be tolerated
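A sketch of how these might appear in conf/storm.yaml (the values are illustrative assumptions, not recommended settings):

supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
nimbus.childopts: "-Xmx1024m"
supervisor.childopts: "-Xmx1024m"
worker.childopts: "-Xmx768m"
topology.max.spout.pending: 5000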
Run bin/storm nimbus to start nimbus and bin/storm ui to start the Storm UI. Run bin/storm supervisor to start the Storm supervisors.
-
HiBench setup
Same as step 2 in Getting Started.
The streaming workload is defined in conf/99-user_defined_properties.conf, via hibench.streamingbench.benchname. You may set it to one of the following: identity, sample, project, grep, wordcount, distinctcount, or statistics.
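For example, the following lines would select the wordcount workload (the ZooKeeper and broker addresses are placeholders; the property names come from the table below):

hibench.streamingbench.benchname        wordcount
hibench.streamingbench.zookeeper.host   zkhost:2181
hibench.streamingbench.brokerList       kafkahost1:9092,kafkahost2:9092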
Other parameters can be adjusted in conf/01-default-streamingbench.conf.

Param Name                                                  Param Meaning
hibench.streamingbench.prepare.mode                         push / periodic mode
hibench.streamingbench.prepare.push.records                 records to send in push mode
hibench.streamingbench.prepare.periodic.recordPerInterval   records to send per interval in periodic mode
hibench.streamingbench.prepare.periodic.intervalSpan        interval length in periodic mode
hibench.streamingbench.prepare.periodic.totalRound          total rounds in periodic mode
hibench.streamingbench.zookeeper.host                       ZooKeeper host:port of the Kafka cluster
hibench.streamingbench.receiver_nodes                       number of nodes that will receive Kafka input
hibench.streamingbench.brokerList                           Kafka broker list
hibench.streamingbench.direct_mode                          direct mode selection (Spark Streaming only)
hibench.streamingbench.storm.home                           Storm home
hibench.streamingbench.kafka.home                           Kafka home
hibench.streamingbench.storm.nimbus                         host name of the Storm nimbus
hibench.streamingbench.storm.nimbusAPIPort                  port number of the Storm nimbus
hibench.streamingbench.storm.ackon                          ack mode on/off for Storm
-
Run. Usually you need to run the streaming data generation scripts to push data to Kafka while the streaming job is running. Please create the Kafka topics first, generate the seed file, and then generate the real data. You can run the following 3 scripts:
workloads/streamingbench/prepare/initTopic.sh
workloads/streamingbench/prepare/genSeedDataset.sh
workloads/streamingbench/prepare/gendata.sh
While the data are being sent to Kafka, start a streaming job such as Spark Streaming to process the data:
workloads/streamingbench/spark/bin/run.sh
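Putting it together, one possible end-to-end flow is sketched below (assuming the scripts are run from the HiBench root and that gendata.sh keeps running in the background while the job consumes):

workloads/streamingbench/prepare/initTopic.sh        # create the Kafka topics
workloads/streamingbench/prepare/genSeedDataset.sh   # generate the seed dataset
workloads/streamingbench/prepare/gendata.sh &        # keep pushing records to Kafka
workloads/streamingbench/spark/bin/run.sh            # start the Spark Streaming job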
-
View the report:
Same as step 4 in Getting Started.
However, StreamingBench is quite different from the non-streaming workloads: streaming workloads collect throughput and latency continuously, print them directly to the terminal, and log them to
report/<workload>/<language APIs>/bench.log
. -
Stop the streaming workloads:
For Spark Streaming, pressing ctrl+c will stop the job. For Storm & Trident, you'll need to execute storm/bin/stop.sh to stop the jobs. For Samza, currently you'll have to kill all applications in YARN manually, or restart the YARN cluster directly.
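For the Samza case, a manual cleanup might look like this (a sketch; <application_id> stands for the real id shown in the -list output):

yarn application -list                      # list running YARN applications and their ids
yarn application -kill <application_id>     # kill each remaining Samza application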