Unicage-Benchmarks-2021-2022-Code

Unicage-Benchmarks

Repository for benchmarks with Unicage.
This contains scripts for data generation and implementations of workloads.

Documentation is available here.

An automated deployment of a local testing environment with Vagrant and VirtualBox is available in this repository.


DISCLAIMERS & CREDITS:

Data generators:

Benchmarking and Monitoring:

  • All cluster monitoring and collection of resource usage metrics during the benchmarks was done with Netdata. The collected metrics, together with the benchmark logs, total more than 8 GB and are kept in this separate repository.

Stream processing (Kafka and Twitter API):

  • There are some stream processing implementations in this folder. Tweets are fetched with Kafka + Twitter API and fed into Spark, where they are processed with Spark Streaming and Spark Structured Streaming. These implementations were neither tested nor benchmarked in a cluster environment and are not part of the benchmark; they were purely exploratory.

Benchmark Progress:

The following table keeps track of what has been done:

Legend:

  • ✓ – reliable output – output we are confident is correct, through verification
  • ✓* – reliable but unvalidated output – reliable output with no statistical validity (single runs)
  • ✓? – unreliable output – output that we have no way of verifying if it is correct
  • ✗ – incorrect output – output we are confident is incorrect, through verification
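
As an illustration of what "through verification" can mean for the wordcount workload, here is a minimal sketch that compares a workload's output against a trusted single-process count. The function names and the assumed output format (one "word count" pair per line, separated by whitespace) are hypothetical and are not taken from the repository's verification scripts.

```python
from collections import Counter


def reference_wordcount(lines):
    """Count whitespace-separated words in a single trusted pass."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def parse_wordcount_output(lines):
    """Parse 'word count' lines (assumed format) into a dict."""
    parsed = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, count = line.rsplit(None, 1)
        parsed[word] = int(count)
    return parsed


def classify(input_lines, output_lines):
    """Return the legend marker for a workload output: ✓ or ✗."""
    expected = dict(reference_wordcount(input_lines))
    actual = parse_wordcount_output(output_lines)
    return "✓" if actual == expected else "✗"
```

For example, `classify(["a b a", "c b a"], ["a 3", "b 2", "c 1"])` yields ✓, while an output that reports a wrong count for any word yields ✗. A check like this only scales to inputs small enough for a single process, which is one reason larger sizes may end up as ✓?.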

Sizes: 64, 128, 256, 512, 1024, 2048, 4096, 8192

Operation     System                 Status per size
Grep          Generation & Loading   Done Done Done Done Done Done Done Done
              Hadoop
              Spark
              Unicage
Sort          Generation & Loading   Done Done Done Done
              Hadoop                 ✓*
              Spark
              Unicage
Wordcount     Generation & Loading   Done Done Done Done Done Done Done Done
              Hadoop                 ✓*
              Spark
              Unicage
Select        Generation & Loading   Done Done Done Done Done Done Done Done
              Hive
              Spark
              Unicage
Join          Generation & Loading   Done Done Done Done Done Done Done Done
              Hive
              Spark
              Unicage
Aggregation   Generation & Loading   Done Done Done Done Done Done Done Done
              Hive
              Spark
              Unicage

The complete benchmark results can be found here.


Known Bugs & TODOs:

  • The Unicage select workload unnecessarily aggregates its results on the leader node, which adds time to the workload. This aggregation should be moved to the verification script.

  • The data generators for aggregation and select should be adjusted to generate a single table with the desired volume. They currently generate two tables, one of which is not used by the workloads.


Links & References:

BigDataBench 5.0:

https://github.com/BenchCouncil/BigDataBench_V5.0_BigData_MicroBenchmark

https://github.com/yangqiang/BigDataBench-Spark

https://github.com/BenchCouncil/BigDataBench_V5.0_Streaming

HiBench:

https://github.com/Intel-bigdata/HiBench

Install Hadoop 3.3.1 (pseudo-distributed):

https://klasserom.azurewebsites.net/Lessons/Binder/2410#CourseStrand_3988

https://www.youtube.com/watch?v=QDpA3A0MXJY&t=156s&ab_channel=ChrisDyck (and Part 2)

Install Hive 3.1.2 (pseudo-distributed):

https://hadooptutorials.info/2020/10/11/part-3-install-hive-on-hadoop/

Install Spark 3.2.0 (pseudo-distributed):

https://msris108.medium.com/how-to-setup-a-pseudo-distributed-cluster-with-hadoop-3-2-1-and-apache-spark-3-0-34406a85130f

Streaming examples in Spark Streaming:

https://spark.apache.org/docs/latest/streaming-programming-guide.html

https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming

https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/

Kafka Streams wordcount example:

https://kafka.apache.org/documentation.html#quickstart

Kafka generic non-JVM producer/consumer (may be useful for stream processing in Unicage BOA!):

https://github.com/edenhill/kcat

Kafka + TwitterAPI + Spark Streaming + Hive example:

https://github.com/dbusteed/kafka-spark-streaming-example

https://www.youtube.com/watch?v=9D7-BZnPiTY

WARNING: This example uses Spark Streaming from Spark 2.4.0. As of v2.4, Kafka no longer integrates with Spark Streaming; it integrates with Spark Structured Streaming instead. How to integrate Kafka with Structured Streaming in Spark 3.2.0.
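
For orientation, here is a minimal PySpark sketch of reading a Kafka topic with Structured Streaming and computing a running word count. It is not the code from this repository: the function names, the broker address, and the topic name are placeholders. It assumes the Kafka source package (for Spark 3.2.0 built with Scala 2.12, `org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0`) is supplied to `spark-submit` via `--packages`.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's built-in "kafka" Structured Streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def build_wordcount_writer(spark, bootstrap_servers, topic):
    """Return a DataStreamWriter that prints running word counts."""
    # Imported lazily so kafka_source_options works without PySpark installed.
    from pyspark.sql.functions import col, explode, split

    messages = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as bytes in the "value" column.
    words = messages.select(
        explode(split(col("value").cast("string"), r"\s+")).alias("word")
    )
    counts = words.groupBy("word").count()
    # "complete" mode re-emits the full aggregate on every trigger.
    return counts.writeStream.outputMode("complete").format("console")
```

With an existing `SparkSession` named `spark`, calling `build_wordcount_writer(spark, "localhost:9092", "tweets").start().awaitTermination()` would start the query; both arguments here are illustrative values only.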

Visit original content creator repository
https://github.com/duartegithub/Unicage-Benchmarks-2021-2022-Code
