Repository for benchmarking Unicage against Hadoop, Spark, and Hive.
It contains the scripts for data generation and the implementations of the workloads.
Documentation is available here.
An automated deployment of a local testing environment with Vagrant and VirtualBox is available in this repository.
- The BigDataGeneratorSuite used to generate the datasets for the benchmarks was forked from this repository and slightly adjusted; it is available in this folder, together with the proper LICENSE. The altered files include the XML configuration files in the ./BigDataGeneratorSuite/Table_datagen/e-com/config/ directory.
- All cluster monitoring and collection of resource-usage metrics during the benchmarks was done with Netdata. The collected metrics, together with the benchmark logs, amount to over 8 GB and are kept in this separate repository.
- There are some stream-processing implementations in this folder. Tweets are fetched with Kafka and the Twitter API and fed into Spark, where they are processed with Spark Streaming and Spark Structured Streaming. These implementations were neither tested nor benchmarked in a cluster environment and are not part of the benchmark; they were purely exploratory (a minimal sketch of the Structured Streaming side is shown below).
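For orientation, here is a minimal sketch of the Structured Streaming side of that pipeline, assuming the tweets have already been published as plain text to a Kafka topic. The topic name (`tweets`), broker address, and console sink are placeholders, not the exact code of the exploratory implementations; running it also requires the Structured Streaming Kafka connector noted near the end of this README.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TweetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TweetWordCount")
      .getOrCreate()
    import spark.implicits._

    // Read tweets from Kafka; the topic name and broker address are placeholders.
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tweets")
      .load()
      .selectExpr("CAST(value AS STRING) AS text")

    // Split the tweet text into words and keep a running count per word.
    val counts = tweets
      .select(explode(split($"text", "\\s+")).as("word"))
      .groupBy("word")
      .count()

    // Emit the full counts table to the console on every trigger.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```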
The following table keeps track of what has been done for each workload, system, and dataset size:
Legend:
- ✓ – reliable output – output we are confident is correct, through verification
- ✓* – reliable but unvalidated output – reliable output but without statistical validity (single run)
- ✓? – unreliable output – output that we have no way of verifying
- ✗ – incorrect output – output we are confident is incorrect, through verification
| Operation | System | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
|---|---|---|---|---|---|---|---|---|---|
| Grep | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hadoop | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Sort | Generation & Loading | Done | Done | Done | Done | | | | |
| | Hadoop | ✓ | ✓ | ✓ | ✓* | | | | |
| | Spark | ✓ | ✓ | ✓ | ✓ | | | | |
| | Unicage | ✓ | ✓ | ✓ | ✗ | | | | |
| Wordcount | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hadoop | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓* |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Select | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Join | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✗ | | | | | |
| Aggregation | Generation & Loading | Done | Done | Done | Done | Done | Done | Done | Done |
| | Hive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| | Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Unicage | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
The complete benchmark results can be found here.
Possible improvements:
- The select Unicage workload needlessly aggregates results on the leader node, adding time to the workload. This aggregation should be moved to the verification script.
- The data generators for the aggregation and select workloads should be adjusted to generate a single table with the desired volume, instead of two tables of which one is not used by the workloads.
BigDataBench 5.0:
https://github.com/BenchCouncil/BigDataBench_V5.0_BigData_MicroBenchmark
https://github.com/yangqiang/BigDataBench-Spark
https://github.com/BenchCouncil/BigDataBench_V5.0_Streaming
HiBench:
https://github.com/Intel-bigdata/HiBench
Install Hadoop 3.3.1 (pseudo-distributed):
https://klasserom.azurewebsites.net/Lessons/Binder/2410#CourseStrand_3988
https://www.youtube.com/watch?v=QDpA3A0MXJY&t=156s&ab_channel=ChrisDyck (and Part 2)
Install Hive 3.1.2 (pseudo-distributed):
https://hadooptutorials.info/2020/10/11/part-3-install-hive-on-hadoop/
Install Spark 3.2.0 (pseudo-distributed):
https://msris108.medium.com/how-to-setup-a-pseudo-distributed-cluster-with-hadoop-3-2-1-and-apache-spark-3-0-34406a85130f
Streaming examples in Spark Streaming:
https://spark.apache.org/docs/latest/streaming-programming-guide.html
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming
https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
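For quick reference, the core of the classic DStream word count from the streaming programming guide linked above looks roughly like this; the socket source and the 1-second batch interval are just illustrative choices, not what the repository uses.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Local streaming context with a 1-second batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Word count over lines received on a local TCP socket (e.g. fed by `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```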
Kafka Streams wordcount example:
https://kafka.apache.org/documentation.html#quickstart
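A rough Scala sketch of that wordcount, mirroring the topic names used in the Kafka documentation's demo, might look as follows. The Scala DSL import paths vary slightly between Kafka versions, so treat this as an outline rather than a drop-in program.

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

object KafkaStreamsWordCount extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  // Topic names follow the Kafka wordcount demo; adjust to the actual topics.
  builder.stream[String, String]("streams-plaintext-input")
    .flatMapValues(_.toLowerCase.split("\\W+"))
    .groupBy((_, word) => word)
    .count()
    .toStream
    .to("streams-wordcount-output")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```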
Kafka generic non-JVM producer/consumer (may be useful for stream processing in Unicage BOA!):
https://github.com/edenhill/kcat
Kafka + TwitterAPI + Spark Streaming + Hive example:
https://github.com/dbusteed/kafka-spark-streaming-example
https://www.youtube.com/watch?v=9D7-BZnPiTY
WARNING: this example uses Spark Streaming from Spark 2.4.0. As of v2.4, Kafka no longer integrates with the DStream-based Spark Streaming API; it integrates with Spark Structured Streaming instead. See how to integrate Kafka with Structured Streaming in Spark 3.2.0.
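Assuming an sbt build against Spark 3.2.0 with Scala 2.12, wiring up the Kafka source for Structured Streaming should only require the separate connector artifact, roughly:

```scala
// build.sbt (sketch): the Kafka source/sink for Structured Streaming ships as a separate artifact.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "3.2.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.2.0"
)
```

With spark-submit, the equivalent is passing the same artifact through --packages instead of bundling it.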
https://github.com/duartegithub/Unicage-Benchmarks-2021-2022-Code