Repository for benchmarking Unicage against Hadoop, Spark, and Hive.
It contains the scripts for data generation and the implementations of the workloads.
Documentation is available here.
An automated deployment of a local testing environment with Vagrant and VirtualBox is available in this repository.
- The BigDataGeneratorSuite used to generate the datasets for the benchmarks was forked from this repository and slightly adjusted; it is available in this folder, together with the proper LICENSE. The altered files include the XML configuration files in the ./BigDataGeneratorSuite/Table_datagen/e-com/config/ directory.
- All cluster monitoring and collection of resource-usage metrics during the benchmarks was done with Netdata. The collected metrics, together with the benchmark logs, amount to over 8 GB and are kept in this separate repository.
- There are some stream-processing implementations in this folder. Tweets are fetched with Kafka and the Twitter API and fed into Spark, where they are processed with Spark Streaming and Spark Structured Streaming. These implementations were neither tested nor benchmarked in a cluster environment and are not part of the benchmark; they were purely exploratory (a minimal sketch of the Structured Streaming side is shown below).
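For orientation, here is a minimal sketch of the Structured Streaming side of that pipeline, assuming the tweets have already been published as plain text to a Kafka topic. The topic name (`tweets`), broker address, and console sink are placeholders, not the exact code of the exploratory implementations; running it also requires the Structured Streaming Kafka connector noted near the end of this README.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TweetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TweetWordCount")
      .getOrCreate()
    import spark.implicits._

    // Read tweets from Kafka; the topic name and broker address are placeholders.
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tweets")
      .load()
      .selectExpr("CAST(value AS STRING) AS text")

    // Split the tweet text into words and keep a running count per word.
    val counts = tweets
      .select(explode(split($"text", "\\s+")).as("word"))
      .groupBy("word")
      .count()

    // Emit the full counts table to the console on every trigger.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```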
The following table keeps track of what has been done for each workload, system, and dataset size:
Legend:
- ✓ – reliable output – output we are confident is correct, through verification
- ✓* – reliable but unvalidated output – reliable output but without statistical validity (single run)
- ✓? – unreliable output – output that we have no way of verifying
- ✗ – incorrect output – output we are confident is incorrect, through verification
| Operation | System | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
|---|---|---|---|---|---|---|---|---|---|
| Grep | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hadoop | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Sort | Generation & Loading | Done | Done | Done | Done | | | | |
| | Hadoop | ✓ | ✓ | ✓ | ✓* | | | | |
| | Spark | ✓ | ✓ | ✓ | ✓ | | | | |
| | Unicage | ✓ | ✓ | ✓ | ✗ | | | | |
| Wordcount | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hadoop | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓* |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Select | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Join | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✗ | | | | | |
| Aggregation | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
The complete benchmark results can be found here.
Possible improvements:
- The select Unicage workload needlessly aggregates results on the leader node, adding time to the workload. This aggregation should be moved to the verification script.
- The data generators for the aggregation and select workloads should be adjusted to generate a single table with the desired volume, instead of two tables of which one is not used by the workloads.
BigDataBench 5.0:
https://github.com/BenchCouncil/BigDataBench_V5.0_BigData_MicroBenchmark
https://github.com/yangqiang/BigDataBench-Spark
https://github.com/BenchCouncil/BigDataBench_V5.0_Streaming
HiBench:
https://github.com/Intel-bigdata/HiBench
Install Hadoop 3.3.1 (pseudo-distributed):
https://klasserom.azurewebsites.net/Lessons/Binder/2410#CourseStrand_3988
https://www.youtube.com/watch?v=QDpA3A0MXJY&t=156s&ab_channel=ChrisDyck (and Part 2)
Install Hive 3.1.2 (pseudo-distributed):
https://hadooptutorials.info/2020/10/11/part-3-install-hive-on-hadoop/
Install Spark 3.2.0 (pseudo-distributed):
https://msris108.medium.com/how-to-setup-a-pseudo-distributed-cluster-with-hadoop-3-2-1-and-apache-spark-3-0-34406a85130f
Streaming examples in Spark Streaming:
https://spark.apache.org/docs/latest/streaming-programming-guide.html
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming
https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
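For quick reference, the core of the classic DStream word count from the streaming programming guide linked above looks roughly like this; the socket source and the 1-second batch interval are just illustrative choices, not what the repository uses.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Local streaming context with a 1-second batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Word count over lines received on a local TCP socket (e.g. fed by `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```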
Kafka Streams wordcount example:
https://kafka.apache.org/documentation.html#quickstart
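A rough Scala sketch of that wordcount, mirroring the topic names used in the Kafka documentation's demo, might look as follows. The Scala DSL import paths vary slightly between Kafka versions, so treat this as an outline rather than a drop-in program.

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

object KafkaStreamsWordCount extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  // Topic names follow the Kafka wordcount demo; adjust to the actual topics.
  builder.stream[String, String]("streams-plaintext-input")
    .flatMapValues(_.toLowerCase.split("\\W+"))
    .groupBy((_, word) => word)
    .count()
    .toStream
    .to("streams-wordcount-output")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```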
Kafka generic non-JVM producer/consumer (may be useful for stream processing in Unicage BOA!):
https://github.com/edenhill/kcat
Kafka + TwitterAPI + Spark Streaming + Hive example:
https://github.com/dbusteed/kafka-spark-streaming-example
https://www.youtube.com/watch?v=9D7-BZnPiTY
WARNING: this example uses Spark Streaming from Spark 2.4.0. As of v2.4, Kafka no longer integrates with the DStream-based Spark Streaming API; it integrates with Spark Structured Streaming instead. See how to integrate Kafka with Structured Streaming in Spark 3.2.0.
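Assuming an sbt build against Spark 3.2.0 with Scala 2.12, wiring up the Kafka source for Structured Streaming should only require the separate connector artifact, roughly:

```scala
// build.sbt (sketch): the Kafka source/sink for Structured Streaming ships as a separate artifact.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "3.2.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.2.0"
)
```

With spark-submit, the equivalent is passing the same artifact through --packages instead of bundling it.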
https://github.com/duartegithub/Unicage-Benchmarks-2021-2022-Code