Big Data Hadoop & Spark Developer - eLearning - Certification Training
Big Data Hadoop and Spark Developer
eLearning
Includes FREE COURSES - Apache Kafka and Core Java
With this Big Data Hadoop course, you will learn the big data framework with Hadoop and Spark, including HDFS, YARN, and MapReduce. The course also covers Pig, Hive, and Impala for processing and analyzing large datasets stored in HDFS, and Sqoop and Flume for data ingestion.
You will learn real-time data processing with Spark, including functional programming in Spark, implementing Spark applications, understanding parallel processing in Spark, and using Spark RDD optimization techniques. You will also learn the various interactive algorithms in Spark and use Spark SQL to create, transform, and query data frames.
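For illustration, the following is a minimal sketch of the kind of Spark SQL work covered: creating a DataFrame, transforming it, and querying it. The JSON path and column names (status, quantity, unit_price, region) are assumptions made for this example, not course material.

    import org.apache.spark.sql.SparkSession

    object SalesReport {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SalesReport").getOrCreate()
        import spark.implicits._

        // Create: read a JSON file into a DataFrame (hypothetical path)
        val sales = spark.read.json("hdfs:///data/sales.json")

        // Transform: keep completed orders and add a derived revenue column
        val completed = sales
          .filter($"status" === "COMPLETED")
          .withColumn("revenue", $"quantity" * $"unit_price")

        // Query: expose the result as a temporary view and run SQL against it
        completed.createOrReplaceTempView("completed_sales")
        spark.sql("SELECT region, SUM(revenue) AS total_revenue FROM completed_sales GROUP BY region").show()

        spark.stop()
      }
    }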
Finally, you will be required to complete real-world, industry-based projects with CloudLab in the domains of banking, telecommunications, social media, insurance, and e-commerce.
WHAT IS INCLUDED?
- 74 hours of blended learning:
  - 22 hours of e-learning
  - 52 hours of instructor-led online training
- One year of access to the e-learning platform
- Four industry-based projects at the end of the course
- Interactive learning with integrated labs
- Curriculum aligned with the Cloudera CCA175 certification exam
- Training on key big data and Hadoop ecosystem tools, as well as Apache Spark
- Dedicated mentorship sessions with industry experts
- Free course included - Apache Kafka
- Free course included - Core Java
- Round-the-clock access
Details and criteria for certification:
- Complete at least 85 percent of the self-paced online training, or attend one full live virtual classroom session
- Score at least 75 percent in the end-of-course assessment
- Receive a successful evaluation in at least one project
Certification Alignment:
Our curriculum is aligned to the Cloudera CCA175 certification exam.
COURSE OBJECTIVES
By the end of the course, you will understand:
- The various components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
- Hadoop Distributed File System (HDFS) and YARN architecture
- MapReduce and its features, along with advanced MapReduce concepts
- Different types of file formats, Avro schemas, using Avro with Hive and Sqoop, and schema evolution
- Flume, its architecture, sources, sinks, channels, and Flume configurations
- HBase, its architecture and data storage, and the differences between HBase and an RDBMS
- Resilient Distributed Datasets (RDDs) in detail
- Common use cases for Spark and various interactive algorithms
You will also be able to:
- Ingest data with Sqoop and Flume
- Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
- Get a working knowledge of Pig and its components
- Do functional programming in Spark, and implement and build Spark applications (a minimal sketch follows this list)
- Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
- Create, transform and query data frames with Spark SQL
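As a taste of the functional style used in the Spark lessons, here is a minimal word-count sketch on a Spark RDD; the HDFS input and output paths are illustrative assumptions.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()
        val sc = spark.sparkContext

        val lines = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
        val counts = lines
          .flatMap(_.split("\\s+"))         // functional transformation: split lines into words
          .filter(_.nonEmpty)
          .map(word => (word.toLowerCase, 1))
          .reduceByKey(_ + _)               // parallel aggregation across partitions

        counts.saveAsTextFile("hdfs:///data/wordcount-output")  // hypothetical output path
        spark.stop()
      }
    }

Each transformation (flatMap, map, reduceByKey) is a pure function applied in parallel across the cluster, which is the core idea behind Spark's functional programming model.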
Who it is aimed at
Big data career opportunities are on the rise, and Hadoop is fast becoming a must-have technology in big data architectures. This Big Data training is suitable for IT, data management, and analytics professionals, including:
- Software developers and architects
- Analytics professionals
- Senior IT professionals
- Testing and mainframe professionals
- Data management professionals
- Business intelligence professionals
- Project managers
- Aspiring data scientists
- Candidates looking to build a career in big data analytics
Course content
The course covers the following topics:
- Course Introduction
- Lesson 1 - Introduction to big data and Hadoop ecosystem
- Lesson 2 - HDFS and YARN
- Lesson 3 - MapReduce and Sqoop
- Lesson 4 - Basics of Hive and Impala
- Lesson 5 - Working with Hive and Impala
- Lesson 6 - Types of data formats
- Lesson 7 - Advanced Hive Concepts and Data File Partitioning
- Lesson 8 - Apache Flume and HBase
- Lesson 9 - Apache Pig
- Lesson 10 - Basics of Apache Spark
- Lesson 11 - RDD in Spark
- Lesson 12 - Implementing Spark applications
- Lesson 13 - Spark parallel processing
- Lesson 14 - Spark RDD Optimization Techniques
- Lesson 15 - Spark Algorithm
- Lesson 16 - Spark SQL
FREE COURSE - Apache Kafka
FREE COURSE - Core Java
More detailed course plan:
Lesson 01 - Introduction to Big Data and Hadoop
- Introduction to Big Data and Hadoop
- Introduction to Big Data
- Analyzing Big Data
- What is Big Data?
- Four different types of Big Data
- Royal Bank of Scotland case study
- Challenges with traditional systems
- Distributed systems
- Introduction to Hadoop
- Components of the Hadoop Ecosystem - Part One
- Components of the Hadoop Ecosystem - Part Two
- Components of the Hadoop Ecosystem - Part Three
- Commercial Hadoop Deployments
- Demo: Walkthrough of Simplilearn CloudLab
- Key takeaways
- Knowledge check
Lesson 02 - The Hadoop distributed storage (HDFS) architecture and YARN
- Hadoop distributed storage architecture (HDFS) and YARN
- What is HDFS?
- The need for HDFS
- Regular file system vs HDFS
- Characteristics of HDFS
- HDFS architecture and components
- Implementation of high availability clusters
- HDFS component: file system namespace
- Breakdown of data blocks
- Topology for data replication
- HDFS command line (a FileSystem API sketch follows this lesson outline)
- Demo: Common HDFS commands
- Practice Project: HDFS Command Line
- Introduction to YARN
- Use cases for YARN
- YARN and its architecture
- Resource Manager
- How the Resource Manager works
- Application Manager
- How YARN runs an application
- Tools for YARN developers
- Demo: Review of the cluster - part one
- Demo: Walkthrough of the cluster - part two
- Key takeaways
- Knowledge check
- Practice Project: Hadoop Architecture, Distributed Storage (HDFS) and YARN
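The HDFS command-line demo above has a programmatic counterpart. The following minimal sketch uses the standard Hadoop FileSystem API from Scala to mirror a few common hdfs dfs commands; the directory and file paths are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsBasics {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()             // picks up core-site.xml / hdfs-site.xml
        val fs = FileSystem.get(conf)

        val dir = new Path("/user/training/demo")  // hypothetical directory
        if (!fs.exists(dir)) fs.mkdirs(dir)        // roughly: hdfs dfs -mkdir -p

        // roughly: hdfs dfs -put /tmp/localfile.txt /user/training/demo
        fs.copyFromLocalFile(new Path("file:///tmp/localfile.txt"), dir)

        // roughly: hdfs dfs -ls /user/training/demo
        fs.listStatus(dir).foreach(status => println(status.getPath))

        fs.close()
      }
    }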
Lesson 03 - Data Ingestion into Big Data Systems and ETL
- Data ingestion into Big Data systems and ETL
- Overview of data ingestion - part one
- Overview of data ingestion - part two
- Apache Sqoop
- Sqoop and its uses
- Sqoop processing
- The Sqoop import process
- Sqoop connections
- Demo: Importing and exporting data from MySQL to HDFS
- Practice project: Apache Sqoop
- Apache Flume
- The Flume model
- Scalability in Flume
- Components of the Flume architecture
- Configuration of Flume components
- Demo: Ingest Twitter data
- Apache Kafka
- Aggregating user activity using Kafka
- Kafka data model
- Partitions
- Apache Kafka architecture
- Demo: Configuring the Kafka Cluster
- Producer-side API example (see the sketch after this lesson outline)
- Consumer-side API
- Consumer-side API example
- Kafka Connect
- Demo: Creating sample Kafka data pipeline using producer and consumer
- Key takeaways
- Knowledge check
- Practice Project: Data Ingestion into Big Data Systems and ETL
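To make the producer-side API topic above concrete, here is a minimal sketch using the standard Apache Kafka Java client from Scala; the broker address, topic name, key, and value are illustrative assumptions, not part of the course material.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ActivityProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")  // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // Send one user-activity event to a hypothetical "user-activity" topic
        producer.send(new ProducerRecord[String, String]("user-activity", "user42", "page_view"))
        producer.flush()
        producer.close()
      }
    }

A matching consumer would subscribe to the same topic and poll for records, which is the flow the producer/consumer pipeline demo walks through.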
Lesson 04 - Distributed Processing MapReduce Framework and Pig
- The MapReduce distributed processing framework and Pig
- Distributed processing in MapReduce
- Word count example
- Map execution phases
- Map execution in a distributed two-node environment
- MapReduce jobs
- Hadoop MapReduce job work interaction
- Setting up the environment for MapReduce development
- Setting up classes
- Creating a new project
- Advanced MapReduce
- Data types in Hadoop
- Output formats in MapReduce
- Using distributed caching
- Joins in MapReduce
- Replicated joins
- Introduction to Pig
- Components of Pig
- Data model for Pig
- Interactive methods for Pig
- Pig operations
- Various relations performed by developers
- Demo: Analyzing weblog data using MapReduce
- Demo: Analyzing sales data and solving KPIs using Pig
- Practice project: Apache Pig
- Demo: Wordcount
- Key takeaways
- Knowledge check
- Practice Project: Distributed Processing - MapReduce Framework and Pig
Lesson 05 - Apache Hive
- Apache Hive
- Hive SQL over Hadoop MapReduce
- Hive architecture
- Interface to run Hive queries
- Running Beeline from the command line
- Hive Metastore
- Hive DDL and DML
- Creating a new table
- Data types
- Validation of data
- Types of file formats
- Serialization of data
- Hive tables and Avro schema
- Hive optimization: partitioning, bucketing, and sampling (see the sketch after this lesson outline)
- Non-partitioned table
- Insertion of data
- Dynamic partitioning in Hive
- Hive bucketing
- What do buckets do?
- Hive analytics: UDF and UDAF
- Other features in Hive
- Demo: Real-time analytics and data filtering
- Demo: Problems in the real world
- Demo: Representation and import of data using Hive
- Key takeaways
- Knowledge check
- Practice project: Apache Hive
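As a companion to the partitioning topics above, here is a minimal sketch that issues HiveQL through a Hive-enabled SparkSession; the customers and staging_customers tables and the country partition column are assumptions, and the same statements can also be run directly from Beeline.

    import org.apache.spark.sql.SparkSession

    object HivePartitioningDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HivePartitioningDemo")
          .enableHiveSupport()
          .getOrCreate()

        // Partitioned table: one directory per country under the table location
        spark.sql("""
          CREATE TABLE IF NOT EXISTS customers (id INT, name STRING)
          PARTITIONED BY (country STRING)
          STORED AS PARQUET
        """)

        // Dynamic partitioning: partition values come from the query itself
        spark.sql("SET hive.exec.dynamic.partition=true")
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
        spark.sql("""
          INSERT INTO TABLE customers PARTITION (country)
          SELECT id, name, country FROM staging_customers
        """)

        // A filter on the partition column prunes unneeded directories
        spark.sql("SELECT COUNT(*) FROM customers WHERE country = 'FI'").show()
        spark.stop()
      }
    }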
Lesson 06 - NoSQL Databases - HBase
- NoSQL databases - HBase
- Introduction to NoSQL
- Demo: YARN tuning
- Overview of HBase
- HBase architecture
- Data model
- Connecting to HBase (see the client sketch after this lesson outline)
- Practice project: HBase Shell
- Key takeaways
- Knowledge check
- Practice project: NoSQL databases - HBase
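To complement the "Connecting to HBase" topic above, the following is a minimal sketch using the standard HBase Java client from Scala; the table name, column family, row key, and values are assumptions.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseQuickstart {
      def main(args: Array[String]): Unit = {
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = connection.getTable(TableName.valueOf("customers"))  // hypothetical table

        // Write one cell: row key "row1", column family "info", qualifier "name"
        val put = new Put(Bytes.toBytes("row1"))
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
        table.put(put)

        // Read the cell back
        val result = table.get(new Get(Bytes.toBytes("row1")))
        println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

        table.close()
        connection.close()
      }
    }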
Lesson 07 - Basics of functional programming and Scala
- Basics of functional programming and Scala
- Introduction to Scala
- Demo: Installation of Scala
- Functional programming
- Programming with Scala
- Demo: Basic literals and arithmetic programming
- Demo: Logical operators
- Type inference, classes, objects, and functions in Scala
- Demo: Type inference, functions, anonymous functions, and classes
- Collections
- Types of collections
- Demo: Five types of collections
- Demo: Operations on lists (see the sketch after this lesson outline)
- Scala REPL
- Demo: Features of Scala REPL
- Key takeaways
- Knowledge check
- Practice Project: Basics of Functional Programming and Scala
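The following small, self-contained Scala sketch touches several topics from this lesson: type inference, anonymous functions, and common collection operations. All values are illustrative.

    object ScalaBasics {
      def main(args: Array[String]): Unit = {
        val numbers = List(1, 2, 3, 4, 5)          // type inferred as List[Int]

        val doubled = numbers.map(n => n * 2)      // anonymous function
        val evens   = numbers.filter(_ % 2 == 0)   // placeholder syntax
        val total   = numbers.reduce(_ + _)        // fold the list to a single value

        val byParity = numbers.groupBy(n => n % 2 == 0)  // Map[Boolean, List[Int]]

        println(s"doubled=$doubled evens=$evens total=$total byParity=$byParity")
      }
    }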
Lesson 08 - Apache Spark Next Generation Big Data Framework
- Apache Spark next generation big data framework
- The history of Spark
- Limitations of MapReduce in Hadoop
- Introduction to Apache Spark
- Components of Spark
- Application of in-memory processing
- The Hadoop ecosystem versus Spark
- Advantages of Spark
- Spark architecture
- Spark clusters in the real world
- Demo: Running a Scala program in Spark Shell
- Demo: Configuring Execution Environment in IDE
- Demo: Spark Web UI
- Key takeaways
- Knowledge check
- Practice Project: Apache Spark Next-Generation Big Data Framework
Lesson 09 - Spark Core Processing RDD
- Introduction to Spark RDD
- RDD in Spark
- Creating Spark RDD
- Pair RDDs
- RDD operations
- Demo: Spark transformations - detailed exploration using Scala examples
- Demo: Spark actions - detailed exploration using Scala
- Caching and persistence
- Storage levels
- Lineage and the DAG
- The need for DAG
- Debugging in Spark
- Partitioning in Spark
- Scheduling in Spark
- Shuffling in Spark
- Sort shuffle
- Aggregating data with pair RDDs (see the sketch after this lesson outline)
- Demo: Spark application with data written back to HDFS and Spark UI
- Demo: Changing Spark application parameters
- Demo: Handling different file formats
- Demo: Spark RDD with real application
- Demo: Optimization of Spark jobs
- Key takeaways
- Knowledge check
- Practice Project: Spark Core Processing RDD
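As a companion to the pair-RDD, caching, and partitioning topics above, here is a minimal sketch; the purchase records, partition count, and storage level are illustrative assumptions.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PairRddDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("PairRddDemo").getOrCreate()
        val sc = spark.sparkContext

        // (customerId, amount) pairs - made-up data
        val purchases = sc.parallelize(Seq(("c1", 20.0), ("c2", 35.5), ("c1", 10.0)))

        val byCustomer = purchases
          .partitionBy(new HashPartitioner(4))      // control how keys are distributed
          .persist(StorageLevel.MEMORY_AND_DISK)    // cache across multiple actions

        val totals = byCustomer.reduceByKey(_ + _)                    // sum per key
        val counts = byCustomer.mapValues(_ => 1).reduceByKey(_ + _)  // count per key

        totals.collect().foreach(println)
        counts.collect().foreach(println)
        spark.stop()
      }
    }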
Lesson 10 - Spark SQL - Processing DataFrames
- Spark SQL - processing of DataFrames
- Introduction to Spark SQL
- Spark SQL architecture
- Dataframes
- Demo: Handling different data formats
- Demo: Implementing different dataframe operations
- Demo: UDF and UDAF (see the sketch after this lesson outline)
- Interoperating with RDDs
- Demo: Processing DataFrames using SQL queries
- RDD vs Dataframe vs Dataset
- Practice project: Processing dataframes
- Key takeaways
- Knowledge check
- Practice Project: Spark SQL - Processing Dataframes
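To make the UDF demo topic above concrete, here is a minimal sketch of defining and registering a user-defined function with Spark SQL; the people data and the is_adult rule are assumptions for this example.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object UdfDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("UdfDemo").getOrCreate()
        import spark.implicits._

        val people = Seq(("Alice", 34), ("Bob", 17)).toDF("name", "age")

        // DataFrame-side UDF
        val isAdult = udf((age: Int) => age >= 18)
        people.withColumn("adult", isAdult($"age")).show()

        // SQL-side registration of the same logic
        spark.udf.register("is_adult", (age: Int) => age >= 18)
        people.createOrReplaceTempView("people")
        spark.sql("SELECT name, is_adult(age) AS adult FROM people").show()

        spark.stop()
      }
    }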
Lesson 11 - Spark MLlib - Modeling Big Data with Spark
- Spark MLlib - modeling Big Data with Spark
- The role of data scientists and data analysts in Big Data
- Analytics in Spark
- Machine learning
- Supervised learning
- Demo: Classification with linear SVM
- Demo: Linear regression with real-world case studies
- Unsupervised learning
- Demo: Unsupervised clustering with K-means (see the sketch after this lesson outline)
- Reinforcement learning
- Semi-supervised learning
- Overview of MLlib
- MLlib pipelines
- Key takeaways
- Knowledge check
- Practice Project: Spark MLlib - Modeling Big Data with Spark
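As a companion to the K-means clustering demo topic above, here is a minimal sketch using Spark ML's DataFrame-based API; the two-dimensional points and k = 2 are illustrative assumptions.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object KMeansDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("KMeansDemo").getOrCreate()
        import spark.implicits._

        // Illustrative two-dimensional points
        val points = Seq((1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)).toDF("x", "y")

        // Assemble raw columns into the single "features" vector column Spark ML expects
        val assembler = new VectorAssembler()
          .setInputCols(Array("x", "y"))
          .setOutputCol("features")
        val data = assembler.transform(points)

        val model = new KMeans().setK(2).setSeed(1L).fit(data)
        model.clusterCenters.foreach(println)
        model.transform(data).select("x", "y", "prediction").show()

        spark.stop()
      }
    }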
Prerequisites
There are no prerequisites for this course. However, it is helpful to have some knowledge of Core Java and SQL. We offer a free self-paced online course “Java essentials for Hadoop” if you need to reinforce your Core Java skills.
Upcoming sessions
Contact us
Adding Value Consulting AB (AVC)
Adding Value Consulting (AVC) is a leading ATO (Accredited Training Organisation). We have introduced a large number of ‘Best Practice’ methods in the Nordic countries. We are experts in training and certification. Over the years, AVC has acquired extensive knowledge...
Read more about the training provider Adding Value Consulting AB and see their course offerings here