Solid Big Data
Data is the lifeblood of any organization. However, the volume of data is growing exponentially and is orders of magnitude larger than it was a few years ago. How can an organization get business value out of massive datasets? This is the problem that big data technologies aim to solve.
This course introduces the components of big data solutions and then dives into the technical and architectural details of Apache Spark. We'll explain how the Spark API fits in with Hadoop and how it offers an improved, more powerful alternative to MapReduce. We also explore related Spark features including Spark SQL, Spark Streaming, and Spark machine learning.
The course is intensely hands-on. We explore numerous examples and work through many common scenarios in the lab exercises.
Modules in Solid Big Data:
Creating Spark Applications
This module describes how to write a complete Spark application in Scala:
- Create a new Spark application
- Import Spark packages
- Write code to make use of the Spark libraries
We then show how to build the app and run it on Spark:
- Set the classpath to incorporate Spark JARs
- Compile the Scala code into JVM class files
- Package the class files into a JAR
- Submit the JAR to Spark, to be executed
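The steps above can be sketched as a minimal Spark application in Scala. The object name, file paths, and logic (a simple word count) are illustrative, not part of the course material:

```scala
// Minimal Spark application sketch (names and paths are illustrative).
// Assumes the Spark Core library is on the classpath.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // Configure and create the SparkContext
    val conf = new SparkConf().setAppName("WordCountApp")
    val sc = new SparkContext(conf)

    // Use the Spark libraries: read a text file and count word occurrences
    val counts = sc.textFile(args(0))
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}
```

Once compiled and packaged into a JAR, an application like this would be handed to the cluster with something along the lines of `spark-submit --class WordCountApp wordcount.jar input.txt output/`.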
Resilient Distributed Datasets (RDDs) are the key abstraction in the Spark Core API. This module takes a close look at how to perform transformations and actions on RDDs. We cover the following topics:
- RDD transformations
- RDD transformations on key-value pairs
- RDD actions
- Spark jobs - the big picture
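As a flavour of these topics, here is an illustrative sketch, assuming an existing SparkContext named `sc` (as provided by the Spark shell); the data values are made up:

```scala
// Illustrative RDD transformations and actions, assuming an existing
// SparkContext named sc (as provided by the Spark shell).
val nums = sc.parallelize(1 to 10)

// Transformations are lazy: they define a new RDD without running a job
val evens   = nums.filter(n => n % 2 == 0)
val squares = evens.map(n => n * n)

// Key-value transformations operate on pair RDDs
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)

// Actions trigger a Spark job and return results to the driver
val total = squares.reduce(_ + _)   // 4 + 16 + 36 + 64 + 100 = 220
val byKey = summed.collect()        // ("a", 4) and ("b", 2), in some order
```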
Using The Spark API
This chapter introduces the concept of Resilient Distributed Datasets (RDDs), the key abstraction in the Spark Core API. We show how to create RDDs from a range of data sources, including in-memory collections, files on the local file system, files on an HDFS file system, and so on. We also describe SparkContext and show how to configure it.
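To illustrate, a SparkContext might be configured and used to create RDDs from several sources like this; the settings, host names, and file paths are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure and create a SparkContext (settings are illustrative)
val conf = new SparkConf()
  .setAppName("RddSources")
  .setMaster("local[*]")   // run locally, using all available cores
val sc = new SparkContext(conf)

// From an in-memory collection
val fromCollection = sc.parallelize(List(1, 2, 3, 4))

// From a file on the local file system (path is a placeholder)
val fromLocalFile = sc.textFile("file:///tmp/data.txt")

// From a file on an HDFS file system (host and path are placeholders)
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/input.txt")
```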
Getting Started with Spark
Apache Spark is one of the most important open-source projects in the field of big data today. It offers the ability to process big data using commodity hardware, and is effectively the successor to Hadoop MapReduce. Many organizations today are replacing MapReduce with Spark.
This chapter describes Spark Core, which is the foundation stone of the Spark offering. We take a look at the Spark architecture, and explain terms such as Worker Node, Driver Program, Cluster Manager, Executor, and Task. We also discuss important Spark concepts such as shuffling, jobs, and stages.
The chapter also introduces the Spark Shell, and shows how to start writing simple scripts to manipulate data using the REPL interface.
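A first session in the Spark Shell might look something like the following sketch; the shell provides a ready-made SparkContext as `sc`, and the sample data is invented:

```scala
// A simple Spark shell session: the shell provides sc automatically
val lines = sc.parallelize(Seq("spark makes big data simple",
                               "big data needs big tools"))
val words  = lines.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)   // prints each (word, count) pair
```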
Additional Big Data Technologies
This module investigates several key aspects of big data solutions:
- Data serialization formats
- Columnar storage
- Messaging systems
We explain each of these topics and discuss popular products and solutions in each case.
Introduction To Big Data
In this module you'll learn about fundamental big data concepts and technologies. We describe the problems facing organizations today as they confront the need to process massive datasets with high volume and velocity. We explain why relational databases are insufficient, and explore the features required in big data solutions.
We then take a detailed look at Hadoop. We explain the components of Hadoop - HDFS, MapReduce, and YARN. We also explain important terminology and explore the design goals of Hadoop: parallelization via commodity hardware and open-source software; scalability; fault tolerance; load balancing; and so on.
Getting Started with Spark SQL
Spark SQL is a library that runs on top of Spark Core, providing a higher-level abstraction for processing structured data.
This module introduces the key concepts of Spark SQL, showing how to create DataFrames from a wide range of data sources including relational databases, NoSQL data sources, text formats, binary formats, data warehousing systems, and columnar storage systems. We then show how to formulate simple queries in Spark SQL using SQL, HiveQL, and language-integrated queries.
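A sketch of these ideas, with all file paths, connection URLs, and table names as placeholders:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the entry point for Spark SQL
val spark = SparkSession.builder()
  .appName("SparkSqlIntro")
  .master("local[*]")
  .getOrCreate()

// Create DataFrames from different data sources (paths/URLs are placeholders)
val fromJson = spark.read.json("people.json")
val fromJdbc = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/sales")
  .option("dbtable", "orders")
  .load()

// A SQL query against a registered temporary view
fromJson.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

// The same query expressed as a language-integrated query
val adults2 = fromJson.select("name", "age").where("age >= 18")
```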
Spark Streaming
Some applications require data to be processed and analyzed in real time as it's collected - e.g. fraud detection in an e-commerce system, device failure detection in a data center, etc.
Handling high-velocity data in real time is a challenge. An application running on a single machine probably can't handle it. A distributed stream processing framework addresses the issue. Spark Streaming is a distributed data stream processing framework that makes it easy to develop distributed applications for processing live data streams in near real time.
This module takes a detailed look at the Spark Streaming API. We describe how to create DStreams from various types of data sources, then explore how to process DStreams using a wide range of transformation operations and actions.
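As a taster, a DStream word count over a TCP text source might be sketched as follows; the host, port, and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Configure a streaming context with 10-second batches (values illustrative)
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

// Create a DStream from a TCP text source (host/port are placeholders)
val lines  = ssc.socketTextStream("localhost", 9999)

// Transformations on the DStream, applied to each batch of data
val counts = lines.flatMap(line => line.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()            // an output operation

ssc.start()               // begin receiving and processing data
ssc.awaitTermination()
```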
Working with DataFrames
This module takes a detailed look at Spark SQL DataFrame operations:
- Language-integrated query operations
- RDD operations
- Output operations
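The three kinds of operation above can be sketched as follows, assuming an existing SparkSession named `spark`; the data source and column names are invented for illustration:

```scala
// Assuming an existing SparkSession named spark (path is illustrative)
val orders = spark.read.json("orders.json")

// Language-integrated query operations
val bigOrders = orders
  .select("customer", "amount")
  .where("amount > 100")
  .groupBy("customer")
  .count()

// RDD operations: a DataFrame can be viewed as an RDD of Rows
val customerRdd = bigOrders.rdd.map(row => row.getString(0))

// Output operations: persist the results in a chosen format
bigOrders.write.parquet("big_orders.parquet")
```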
Machine Learning with Spark
Machine learning is an area of computer science rooted in pattern recognition and artificial intelligence. The goal is to create systems that learn and evolve based on the data that flows through them, rather than baking hard-coded logic into systems at the outset.
Spark has excellent support for machine learning. This module explains how it all fits together. We begin with a discussion of important concepts and terminology associated with machine learning, then take a detailed look at two Spark machine learning APIs - MLlib and Spark ML. We'll show worked examples of both APIs, and compare and contrast their features.
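As a flavour of the DataFrame-based Spark ML API, here is a sketch of a simple pipeline; the data files, column names, and model choice are all illustrative assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assumes an existing SparkSession named spark and a training DataFrame
// with numeric feature columns and a binary "label" column (all illustrative)
val training = spark.read.parquet("training.parquet")

// Combine raw feature columns into the single vector column Spark ML expects
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10)

// A Pipeline chains the feature transformer and the estimator
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model    = pipeline.fit(training)

// Apply the fitted model to new data
val predictions = model.transform(spark.read.parquet("test.parquet"))
```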