Big Data Hadoop Training

Course Curriculum: 

Lesson 01 – Introduction to Big Data and Hadoop

Introduction to Big Data and Hadoop 

Introduction to Big Data 

Big Data Analytics 

What is Big Data? 

Four Vs of Big Data

Case Study: Royal Bank of Scotland

Challenges of Traditional Systems

Distributed Systems 

Introduction to Hadoop 

Components of Hadoop Ecosystem Part One 

Components of Hadoop Ecosystem Part Two 

Components of Hadoop Ecosystem Part Three 

Commercial Hadoop Distributions 

Demo: Walkthrough of Simplilearn Cloudlab 

Key Takeaways 

Knowledge Check

Lesson 02 – Hadoop Architecture Distributed Storage (HDFS) and YARN

Hadoop Architecture Distributed Storage (HDFS) and YARN 

What is HDFS 

Need for HDFS 

Regular File System vs HDFS 

Characteristics of HDFS 

HDFS Architecture and Components 

High Availability Cluster Implementations 

HDFS Component File System Namespace 

Data Block Split 

Data Replication Topology 

HDFS Command Line 

Demo: Common HDFS Commands 

Practice Project: HDFS Command Line 

YARN Introduction

YARN Use Case

YARN and Its Architecture

Resource Manager

How Resource Manager Operates

Application Master

How YARN Runs an Application

Tools for YARN Developers

Demo: Walkthrough of Cluster Part One 

Demo: Walkthrough of Cluster Part Two 

Key Takeaways 

Knowledge Check 

Practice Project: Hadoop Architecture, Distributed Storage (HDFS) and YARN

Lesson 03 – Data Ingestion into Big Data Systems and ETL 

Data Ingestion into Big Data Systems and ETL

Data Ingestion Overview Part One 

Data Ingestion Overview Part Two 

Apache Sqoop 

Sqoop and Its Uses 

Sqoop Processing 

Sqoop Import Process 

Sqoop Connectors 

Demo: Importing and Exporting Data from MySQL to HDFS 

Practice Project: Apache Sqoop 

Apache Flume 

Flume Model 

Scalability in Flume 

Components in Flume’s Architecture 

Configuring Flume Components 

Demo: Ingest Twitter Data 

Apache Kafka 

Aggregating User Activity Using Kafka 

Kafka Data Model 


Apache Kafka Architecture 

Demo: Setup Kafka Cluster 

Producer Side API Example 

Consumer Side API 

Consumer Side API Example 

Kafka Connect 

Demo: Creating Sample Kafka Data Pipeline Using Producer and Consumer

Key Takeaways

Knowledge Check 

Practice Project: Data Ingestion Into Big Data Systems and ETL

Lesson 04 – Distributed Processing MapReduce Framework and Pig

Distributed Processing MapReduce Framework and Pig

Distributed Processing in MapReduce

Word Count Example 

Map Execution Phases 

Map Execution Distributed Two Node Environment 

MapReduce Jobs

Hadoop MapReduce Job Work Interaction

Setting Up the Environment for MapReduce Development

Set of Classes 

Creating a New Project 

Advanced MapReduce

Data Types in Hadoop 

Output Formats in MapReduce

Using Distributed Cache 

Joins in MapReduce

Replicated Join 

Introduction to Pig 

Components of Pig 

Pig Data Model 

Pig Interactive Modes 

Pig Operations 

Various Relations Performed by Developers 

Demo: Analyzing Web Log Data Using MapReduce

Demo: Analyzing Sales Data and Solving KPIs Using Pig

Practice Project: Apache Pig 

Demo: Word Count

Key Takeaways 

Knowledge Check 

Practice Project: Distributed Processing – MapReduce Framework and Pig

Lesson 05 – Apache Hive 

Apache Hive 

Hive SQL over Hadoop MapReduce

Hive Architecture 

Interfaces to Run Hive Queries 

Running Beeline from Command Line 

Hive Metastore 

Hive DDL and DML 

Creating New Table 

Data Types 

Validation of Data 

File Format Types 

Data Serialization 

Hive Table and Avro Schema 

Hive Optimization: Partitioning, Bucketing, and Sampling

Non-Partitioned Table

Data Insertion 

Dynamic Partitioning in Hive 


What Do Buckets Do? 

Hive Analytics UDF and UDAF 

Other Functions of Hive 

Demo: Real-time Analysis and Data Filtration 

Demo: Real-World Problem 

Demo: Data Representation and Import Using Hive

Key Takeaways

Knowledge Check 

Practice Project: Apache Hive 

Lesson 06 – NoSQL Databases HBase 

NoSQL Databases HBase 

NoSQL Introduction 

Demo: YARN Tuning

HBase Overview

HBase Architecture

Data Model 

Connecting to HBase 

Practice Project: HBase Shell 

Key Takeaways 

Knowledge Check 

Practice Project: NoSQL Databases – HBase

Lesson 07 – Basics of Functional Programming and Scala 

Basics of Functional Programming and Scala 

Introduction to Scala 

Demo: Scala Installation 

Functional Programming 

Programming With Scala 

Demo: Basic Literals and Arithmetic Programming 

Demo: Logical Operators 

Type Inference, Classes, Objects, and Functions in Scala

Demo: Type Inference, Functions, Anonymous Function, and Class


Types of Collections 

Demo: Five Types of Collections 

Demo: Operations on List 

Scala REPL 

Demo: Features of Scala REPL 

Key Takeaways 

Knowledge Check 

Practice Project: Apache Hive 

Lesson 08 – Apache Spark Next-Generation Big Data Framework

Apache Spark Next-Generation Big Data Framework 

History of Spark 

Limitations of MapReduce in Hadoop

Introduction to Apache Spark 

Components of Spark 

Application of In-memory Processing 

Hadoop Ecosystem vs Spark 

Advantages of Spark 

Spark Architecture 

Spark Cluster in Real World 

Demo: Running Scala Programs in Spark Shell

Demo: Setting Up Execution Environment in IDE 

Demo: Spark Web UI 

Key Takeaways 

Knowledge Check 

Practice Project: Apache Spark Next-Generation Big Data Framework

Lesson 09 – Spark Core Processing RDD 

Introduction to Spark RDD 

RDD in Spark 

Creating Spark RDD 

Pair RDD 

RDD Operations 

Demo: Spark Transformation Detailed Exploration Using Scala Examples

Demo: Spark Action Detailed Exploration Using Scala

Caching and Persistence 

Storage Levels 

Lineage and DAG 

Need for DAG 

Debugging in Spark 

Partitioning in Spark 

Scheduling in Spark 

Shuffling in Spark 

Sort Shuffle 

Aggregating Data With Paired RDD 

Demo: Spark Application With Data Written Back to HDFS and Spark UI

Demo: Changing Spark Application Parameters

Demo: Handling Different File Formats 

Demo: Spark RDD With Real-world Application 

Demo: Optimizing Spark Jobs 

Key Takeaways 

Knowledge Check 

Practice Project: Spark Core Processing RDD 

Lesson 10 – Spark SQL Processing DataFrames

Spark SQL Processing DataFrames

Spark SQL Introduction 

Spark SQL Architecture 


Demo: Handling Various Data Formats 

Demo: Implement Various DataFrame Operations

Demo: UDF and UDAF 

Interoperating With RDDs 

Demo: Process DataFrame Using SQL Query

RDD vs DataFrame vs Dataset

Practice Project: Processing DataFrames

Key Takeaways 

Knowledge Check 

Practice Project: Spark SQL – Processing DataFrames

Lesson 11 – Spark MLlib Modeling Big Data With Spark

Spark MLlib Modeling Big Data With Spark

Role of Data Scientist and Data Analyst in Big Data 

Analytics in Spark 

Machine Learning 

Supervised Learning 

Demo: Classification of Linear SVM 

Demo: Linear Regression With Real World Case Studies 

Unsupervised Learning 

Demo: Unsupervised Clustering K-means 

Reinforcement Learning 

Semi-supervised Learning 

Overview of MLlib

MLlib Pipelines

Key Takeaways 

Knowledge Check 

Practice Project: Spark MLlib – Modeling Big Data With Spark

Lesson 12 – Stream Processing Frameworks and Spark Streaming

Streaming Overview 

Real-time Processing of Big Data 

Data Processing Architectures 

Demo: Real-time Data Processing 

Spark Streaming 

Demo: Writing Spark Streaming Application 

Introduction to DStreams 

Transformations on DStreams 

Design Patterns for Using foreachRDD

State Operations 

Windowing Operations 

Join Operations: Stream-Dataset Join

Demo: Windowing of Real-time Data Processing 

Streaming Sources 

Demo: Processing Twitter Streaming Data 

Structured Spark Streaming 

Use Case: Banking Transactions

Structured Streaming Architecture Model and Its Components 

Output Sinks 

Structured Streaming APIs 

Constructing Columns in Structured Streaming 

Windowed Operations on Event-time 

Use Cases 

Demo: Streaming Pipeline 

Practice Project: Spark Streaming 

Key Takeaways 

Knowledge Check 

Practice Project: Stream Processing Frameworks and Spark Streaming

Lesson 13 – Spark GraphX

Spark GraphX

Introduction to Graph 

GraphX in Spark 

GraphX Operators 

Join Operators 

GraphX Parallel System 

Algorithms in Spark 

Pregel API 

Use Case of GraphX 

Demo: GraphX Vertex Predicate

Demo: PageRank Algorithm

Key Takeaways 

Knowledge Check 

Practice Project: Spark GraphX 

Project Assistance

Course End Projects: 

The course includes four real-world, industry-based projects. Successful evaluation of one of the following projects is part of the certification eligibility criteria:

Project 1: Analyzing Historical Insurance Claims 

Use Hadoop features to predict patterns and share actionable insights for a car insurance company.

This project uses New York Stock Exchange data from 2010 to 2016, captured from 500+ listed companies. The data set consists of each listed company’s intraday prices and volume traded. The data is used in both machine learning and exploratory analysis projects for the purposes of automating the trading process and predicting the next trading-day winners or losers. The scope of this project is limited to exploratory data analysis.

Domain: BFSI 

Project 2: Employee Review of Comment Analysis 

Use Hive features for data analysis and share the actionable insights with the HR team so that it can take corrective actions.

The HR team is surfing social media to gather current and ex-employee feedback and sentiments. This information will be used to derive actionable insights and take corrective actions to improve the employer-employee relationship. The data is web-scraped from Glassdoor and contains detailed reviews of 67K employees from Google, Amazon, Facebook, Apple, Microsoft, and Netflix.

Domain: Human Resources 

Project 3: K-Means Clustering for Telecommunication Domain

LoudAcre Mobile is a mobile phone service provider that has introduced a new open network campaign. As a part of this campaign, the company has invited users to complain about mobile phone network towers in their area if they are experiencing connectivity issues with their present mobile network. LoudAcre has collected the dataset of users who have complained.

Domain: Telecommunication 

Project 4: Market Analysis in Banking Domain 

Our client, a Portuguese banking institution, ran a marketing campaign to convince potential customers to invest in a bank term deposit promotion. The marketing campaign pitches were delivered by phone calls. Often, however, the same customer was contacted more than once.

You have to perform the marketing analysis of the data generated by this campaign, keeping in mind the redundant calls.