Releases: Intel-bigdata/HiBench
HiBench-7.1.1
This release updates HiBench to v7.1.1.
HiBench-7.1
We are happy to announce HiBench-7.1 which includes Spark 2.3, Spark 2.4 support, a repartition workload, an optimized K-means implementation based on DAL (Intel Data Analytics Library), and various bug fixes and improvements.
HiBench-7.0
We are happy to announce HiBench-7.0, a major release of HiBench. This release includes new features such as more machine learning workloads and support for Spark 2.1 and Spark 2.2. It also includes many bug fixes over the previous release.
Spark 2.1, 2.2 Support
Apache Spark 2.1 and 2.2 are major releases with a few API changes. One of the features of HiBench 7.0 is full support for Spark 1.6, 2.0, 2.1, and 2.2. You can choose the Spark version when building HiBench and benchmark against any of these versions.
New Workloads
Ten ML workloads for Spark are added: ALS (Alternating Least Squares), Bayes (Naive Bayes), GBT (Gradient Boosting Trees), LDA (Latent Dirichlet Allocation), LR (Logistic Regression), Linear (Linear Regression), PCA (Principal Component Analysis), RF (Random Forests), SVD (Singular Value Decomposition), and SVM (Support Vector Machine).
The alternating least squares (ALS) algorithm is a well-known collaborative filtering algorithm, implemented in spark.mllib. The input data set is generated by RatingDataGenerator for a product recommendation system.
Naive Bayes (Bayes) is a simple multiclass classification algorithm that assumes independence between every pair of features. The workload is implemented in spark.mllib and uses automatically generated documents whose words follow a Zipfian distribution. The dictionary used for text generation comes from the default Linux word file /usr/share/dict/linux.words.
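To illustrate what this workload trains, here is a minimal, pure-Python sketch of a multinomial Naive Bayes classifier with Laplace smoothing. This is not HiBench's spark.mllib implementation, and the tiny training set below is invented for illustration only:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, [word, ...]) pairs. Returns (log priors, log likelihoods, vocab)."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for label, words in docs:
        word_counts[label].update(words)
        vocab.update(words)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)  # Laplace smoothing denominator
        log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab

def classify(model, words):
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in words if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Invented toy corpus standing in for HiBench's generated Zipfian documents
docs = [("sports", ["ball", "goal", "team"]),
        ("sports", ["goal", "match"]),
        ("tech", ["cpu", "code", "cpu"]),
        ("tech", ["code", "compiler"])]
model = train_nb(docs)
print(classify(model, ["goal", "team"]))  # -> sports
```

The real workload does the same kind of per-class word counting, but distributed over generated documents.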
Gradient-boosted trees (GBT) is a popular regression method using ensembles of decision trees, implemented in spark.mllib. The input data set is generated by GradientBoostingTreeDataGenerator.
Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents, implemented in spark.mllib. The input data set is generated by LDADataGenerator.
Logistic Regression (LR) is a popular method for predicting a categorical response, implemented in spark.mllib with the LBFGS optimizer. The input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three kinds of data: categorical, continuous, and binary.
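HiBench's LR workload uses the LBFGS optimizer in spark.mllib; as a much-simplified sketch of logistic regression itself, here is plain batch gradient descent on an invented, linearly separable toy set (this substitutes gradient descent for LBFGS and is not the HiBench implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(points, lr=0.5, epochs=200):
    """points: list of (features, label) with label in {0, 1}. Plain batch gradient descent."""
    dim = len(points[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in points:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                gw[i] += err * x[i]
            gb += err
        w = [wi - lr * gi / len(points) for wi, gi in zip(w, gw)]
        b -= lr * gb / len(points)
    return w, b

# Invented toy data: label 1 roughly when x0 + x1 > 1
data = [([0.0, 0.0], 0), ([0.2, 0.3], 0), ([1.0, 1.0], 1), ([0.9, 0.8], 1)]
w, b = train_logreg(data)
predict = lambda x: 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5 else 0
print(predict([0.1, 0.1]), predict([1.0, 0.9]))  # -> 0 1
```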
Linear Regression (Linear) is a workload implemented in spark.mllib with the SGD optimizer. The input data set is generated by LinearRegressionDataGenerator.
Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. PCA is used widely in dimensionality reduction. This workload is implemented in spark.mllib. The input data set is generated by PCADataGenerator.
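The rotation described above can be sketched in a few lines of pure Python by running power iteration on the covariance matrix to recover the first principal axis. The data points are invented, and spark.mllib computes PCA differently and at scale; this only illustrates the idea:

```python
def mean_center(data):
    dim = len(data[0])
    means = [sum(row[i] for row in data) / len(data) for i in range(dim)]
    return [[row[i] - means[i] for i in range(dim)] for row in data]

def covariance(data):
    n, dim = len(data), len(data[0])
    return [[sum(row[i] * row[j] for row in data) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def power_iteration(matrix, steps=100):
    """Returns the dominant eigenvector, i.e. the first principal axis."""
    v = [1.0] * len(matrix)
    for _ in range(steps):
        v = [sum(matrix[i][j] * v[j] for j in range(len(v))) for i in range(len(matrix))]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

# Invented points spread mostly along the y = x direction
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
pc1 = power_iteration(covariance(mean_center(points)))
print(pc1)  # roughly [0.707, 0.707]: the direction of maximum variance
```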
Random forests (RF) are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. This workload is implemented in spark.mllib and the input data set is generated by RandomForestDataGenerator.
Singular value decomposition (SVD) factorizes a matrix into three matrices. This workload is implemented in spark.mllib and its input data set is generated by SVDDataGenerator.
Support Vector Machine (SVM) is a standard method for large-scale classification tasks, implemented in spark.mllib. The input data set is generated by SVMDataGenerator.
Contributors
The following developers contributed to this release:
Yinan Xiang(@ynXiang)
Teng Jiang(@jtengyp)
zhuoxiangchen(@zhuoxiangchen)
Shilei Qian(@qiansl127)
Vincent Xie(@VinceShieh)
Peng Meng(@mpjlu)
Carson Wang(@carsonwang)
Yu He(@heyu1)
Chenzhao Guo(@gczsjdy)
Naresh Gundla(@nareshgundla)
Rajarshi Biswas(@rajarshibiswas)
n3rV3(@n3rV3)
Ziyue Huang(@ZiyueHuang)
Chong Tang(@ChongTang)
Michael Mior(@michaelmior)
Yanbing Zhang(@zybing)
Huafeng Wang(@huafengw)
Sophia Sun(@sophia-sun)
Thanks to everyone who contributed! We are looking forward to more contributions from everyone for the next release.
HiBench 6.0
We are happy to announce HiBench-6.0, a major release of HiBench. This release includes new features such as a more flexible build, Spark 2.0 support, more workloads, better configuration, and a new streaming benchmark that supports Spark Streaming, Flink, Storm, and Gearpump. It also includes many bug fixes over the previous release.
Flexible Build
HiBench 6.0 supports building only the benchmarks for specific frameworks. Building all the benchmarks in HiBench can be time consuming because the Hadoop benchmark relies on third-party tools like Mahout and Nutch. You can now build for a single framework to speed up the build process. If you are interested in only a single workload, you can also build just that module.
Spark 2.0 Support
Apache Spark 2.0 is a new major release with a few API changes. One of the features of HiBench 6.0 is to fully support Spark 1.6 and Spark 2.0. You can choose the Spark version when building HiBench and test these two Spark versions.
New Streaming Benchmark
A new streaming benchmark is included in HiBench 6.0. It supports Spark Streaming, Flink, Storm/Trident, and Gearpump. There are four workloads: identity, repartition, stateful wordcount, and FixWindow. A common metrics framework was developed to assess all of these streaming frameworks.
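The stateful wordcount workload keeps counts that persist across micro-batches. As a rough single-process illustration of that pattern (the batch contents below are invented, and this merely stands in for what a streaming framework does at scale and with fault tolerance):

```python
from collections import Counter

def run_stateful_wordcount(batches):
    """Simulates streaming stateful wordcount: state persists across micro-batches."""
    state = Counter()                  # running counts, kept across batches
    snapshots = []
    for batch in batches:              # each batch is a list of incoming records
        for record in batch:
            state.update(record.split())
        snapshots.append(dict(state))  # emit the updated state after every batch
    return snapshots

# Invented micro-batches
batches = [["hello world", "hello"], ["world stream"], ["hello stream stream"]]
for i, snap in enumerate(run_stateful_wordcount(batches)):
    print(f"after batch {i}: {snap}")
```

The identity and repartition workloads are simpler still: they pass records through unchanged (or reshuffled), which is why they are useful for measuring raw framework overhead.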
New Workloads
A new workload named NWeight for Spark is added. NWeight is an iterative graph-parallel algorithm implemented by Spark GraphX and Pregel. The algorithm computes associations between two vertices that are n-hop away.
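As a loose illustration of the n-hop idea only: the sketch below finds vertices reachable in exactly n hops by repeated frontier expansion over an invented toy graph. The real NWeight workload runs on Spark GraphX/Pregel and computes weighted associations, not just reachability:

```python
from collections import defaultdict

def n_hop_neighbors(edges, source, n):
    """Vertices reachable from `source` in exactly n hops (paths may revisit vertices)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # treat the toy graph as undirected
    frontier = {source}
    for _ in range(n):
        # expand the frontier by one hop, Pregel-style
        frontier = {w for v in frontier for w in adj[v]}
    return frontier

edges = [(1, 2), (2, 3), (3, 4), (1, 5)]
print(sorted(n_hop_neighbors(edges, 1, 2)))  # -> [1, 3]
```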
Configurations
The configurations are split into separate files, one per framework. This simplifies configuration because you may be interested in only some of the frameworks. A few unnecessary configurations were also removed.
Continuous Integration
HiBench 6.0 uses Travis CI for continuous integration which builds the project and runs a set of workloads for testing.
Contributors
The following developers contributed to this release:
Adam Roberts (@a-roberts)
Andrew Audibert(@aaudiber)
Carlos Eduardo Moreira dos Santos(@cemsbr)
Carson Wang(@carsonwang)
Chenzhao Guo(@gczsjdy)
Daoyuan Wang(@adrian-wang)
Ge Chen(@princhenee)
Harschware(@harschware)
Huafeng Wang(@huafengw)
James Bogosian(@bogosj)
Jayanth(@prajay)
Ling Zhou(@lingzhouHZ)
Liye Zhang(@liyezhang556520)
Lun Gao(@gallenvara)
Mahmoud Ismail(@maismail)
Manu Zhang(@manuzhang)
Monta Yashi(@84monta)
Naresh Gundla(@nareshgundla)
Pengfei Xuan(@pfxuan)
Robert Schmidtke(@robert-schmidtke)
Shilei Qian(@qiansl127)
Tony Zhao(@touchdown)
Wei Mao(@mwws)
Xianyang Liu(@ConeyLiu)
Yanbing Zhang(@zybing)
Yi Cui(@Moonlit-Sailor)
ZheHan Wang(@han-wang)
Thanks to everyone who contributed! We are looking forward to more contributions from everyone for the next release.
HiBench 5.0
We are happy to announce HiBench-5.0, a major release with streaming feature support!
We are now introducing streaming features
The streaming benchmark is the headline feature of HiBench 5.0. It benchmarks multiple streaming frameworks and provides new streaming workload abstractions.
Multiple Streaming Frameworks
Spark Streaming
Spark Streaming is a micro-batch extension of the core Spark API for streaming data processing. HiBench 5.0 supports Spark Streaming from Spark 1.3 to 1.5, in both Kafka receiver and direct modes.
Storm & Trident
Apache Storm is an event-based distributed real-time computation system open-sourced by Twitter. Trident is a high-level abstraction for doing real-time computing on top of Storm. HiBench 5.0 supports them both.
Samza
Apache Samza is a distributed stream processing framework open-sourced by LinkedIn, which is also supported in HiBench 5.0.
Seven Streaming Workload Abstractions
We introduce identity, sample, projection, and grep as single-step workloads, and wordcount, distinctcount, and statistics as multi-step workloads. Text data is generated from Hive's uservisits test cases, and numeric data from K-means vectors.
Flexible Data Source
You can feed data to Kafka from data stored in HDFS, which greatly helps in sending distributed data concurrently.
You can push the data all at once, or feed it continuously and periodically.
You can adjust the data offset in Kafka to avoid reusing already-sent data.
Contributors
The following developers contributed to this release (ordered by Github ID):
Daoyuan Wang(@adrian-wang)
Earnest(@Earne)
Minho Kim (@eoriented)
Gayathri Mutrali(@GayathriMurali)
Jie Huang(@GraceH)
Joseph Lorenzini(@jaloren)
Jay Vyas(@jayunit100)
Jintao Guan(@jintaoguan)
Kai Wei(@kai-wei)
Zhihui Li(@li-zhihui)
Qi Lv(@lvsoft)
Nishkam Ravi (@nishkamravi2)
(@pipamc)
Kevin CHEN(@princhenee)
Neo Qian(@qiansl127)
Mingfei Shi(@shimingfei)
ShelleyRuirui(@ShelleyRuirui)
Imran Rashid(@squito)
(@Silent-Hill)
Viplav(@viplav)
(@XiaominZhang)
Dong Li(@zkld123)
Markus Z(@zyxar)
Thanks to everyone who contributed! We are looking forward to more contributions from everyone for the next release.
HiBench 4.1
We are happy to announce HiBench-4.1, a minor release with lots of bug fixes, new features, and new platform support!
We added Docker support
Thanks to @princhenee, you can now try HiBench with Docker by executing a single script. It will download and prepare the required environments, including Hadoop and Spark, and then run the benchmark. Please see the docker folder for more details.
We now support HDP 2.3
Thanks to @jerryshao, HiBench now supports HDP 2.3 (Hortonworks Data Platform). HiBench will auto-probe and apply the HDP configuration as long as your environment is HDP.
Thanks to everyone who contributed! We are looking forward to more contributions from everyone for the next release.
HiBench 4.0
We are happy to announce HiBench-4.0, a major release with a totally new design. There are lots of usability enhancements, development improvements, new features, and new platform support!
We now support multiple language APIs
Spark is a fast and general engine for large-scale data processing, with multiple language backends. HiBench now supports MapReduce, Spark/Scala, Spark/Java, and Spark/Python for all workloads (except nutchindexing and dfsioe).
Unified and auto-probe configurations
In this version, all you need to set is the Hadoop home, Spark home, HDFS master, and Spark master. HiBench will detect the Hadoop/Spark release and version and infer the configuration for you. There is no need to rebuild HiBench if you switch between MR1 and MR2, or between Spark 1.2 and Spark 1.3.
Explicit reports
The reported information is now greatly enriched. For each workload and each language API, an all-in-one configuration file is generated in the report folder, with hints about where and how the config values come from, along with the related logs.
During benchmarking, HiBench also monitors CPU, network, disk I/O, memory, and system load on all your slave nodes. A monitoring report is generated for each workload and each language API.
Flexible configurations
HiBench 4.0 greatly improves configuration. There are several built-in data-scale and compression profiles, and you can switch between them easily.
Sometimes a workload needs slightly different parameters. You can now configure every parameter (including Spark conf) at the global, workload, or language-API scope, which is great for fine tuning.
Others
- A 4-step quick start was added to the documentation.
- Colorful log output, with progress bars for MapReduce and Spark.
- Strict assertions that detect errors at an earlier stage.
We have verified HiBench 4.0 with the newest Hadoop distributions, including CDH 5.3.2 (MRv1, MRv2), Apache Hadoop 1.0.4, 1.2.1, 2.2.0, and 2.5.2, and with Apache Spark 1.2 and 1.3.
Contributors
The following developers contributed to this release (ordered by Github ID):
Daoyuan Wang(@adrian-wang)
Earnest(@Earne)
Minho Kim (@eoriented)
Jie Huang(@GraceH)
Joseph Lorenzini(@jaloren)
Jay Vyas(@jayunit100)
Jintao Guan(@jintaoguan)
Kai Wei(@kai-wei)
Qi Lv(@lvsoft)
Nishkam Ravi (@nishkamravi2)
(@pipamc)
Neo Qian(@qiansl127)
Mingfei Shi(@shimingfei)
Dong Li(@zkld123)
Thanks to everyone who contributed! We are looking forward to more contributions from everyone for the next release.
HiBench 3.0
Release date: Oct 31, 2014
We are happy to announce HiBench-3.0, a major release with lots of usability and development improvements, new features, and a number of bug fixes!
We are now supporting YARN!
Hadoop YARN is a framework for job scheduling and cluster resource management. HiBench now supports Hadoop YARN.
Unified MR1 and MR2 branches with different dependencies on-demand!
In this version, there are no longer two isolated working branches for MR1 and YARN. All dependencies are downloaded on demand according to the Maven configuration file. This saves the overhead of shifting back and forth between branches when executing your HiBench workloads.
We have a new workload in HiBench
We added a sleep workload to HiBench-3.0, which helps to test the capability of the scheduler.
Benchmarks can run concurrently
Users can now configure the number of benchmarks that run simultaneously. This is very helpful for benchmarking a cluster's throughput under heavy load.
Other Improvements
- Documentation improvements
- Various bug fixes
- Better log output
- Scripts improvements
We have verified HiBench 3.0 with several new Hadoop distribution versions, including CDH 5.1.0 (MRv1, MRv2), Apache Hadoop 1.0.4, and Apache Hadoop 2.2.0.
Contributors
The following developers contributed to this release (ordered by Github ID):
Daoyuan Wang(@adrian-wang)
Raymond Liu(@colorant)
Earnest(@Earne)
Minho Kim (@eoriented)
Jie Huang(@GraceH)
Joseph Lorenzini(@jaloren)
Jay Vyas(@jayunit100)
Jintao Guan(@jintaoguan)
Kai Wei(@kai-wei)
Qi Lv(@lvsoft)
Nishkam Ravi (@nishkamravi2)
(@pipamc)
Neo Qian(@qiansl127)
Mingfei Shi(@shimingfei)
Dong Li(@zkld123)
Thanks to everyone who contributed! We are looking forward to more contributions from everyone for the next release.