Big Data Hadoop Training

Course Curriculum:

Lesson 01 – Introduction to Big Data and Hadoop

Introduction to Big Data and Hadoop

Introduction to Big Data

Big Data Analytics

What is Big Data?

Four Vs of Big Data

Case Study: Royal Bank of Scotland

Challenges of Traditional System

Distributed Systems

Introduction to Hadoop

Components of Hadoop Ecosystem Part One

Components of Hadoop Ecosystem Part Two

Components of Hadoop Ecosystem Part Three

Commercial Hadoop Distributions

Demo: Walkthrough of Simplilearn Cloudlab

Key Takeaways

Knowledge Check

Lesson 02 – Hadoop Architecture, Distributed Storage (HDFS), and YARN

Hadoop Architecture, Distributed Storage (HDFS), and YARN

What is HDFS

Need for HDFS

Regular File System vs HDFS

Characteristics of HDFS

HDFS Architecture and Components

High Availability Cluster Implementations

HDFS Component File System Namespace

Data Block Split

Data Replication Topology

HDFS Command Line

Demo: Common HDFS Commands

Practice Project: HDFS Command Line

YARN Introduction

YARN Use Case

YARN and Its Architecture

Resource Manager

How Resource Manager Operates

Application Master

How YARN Runs an Application

Tools for YARN Developers

Demo: Walkthrough of Cluster Part One

Demo: Walkthrough of Cluster Part Two

Key Takeaways

Knowledge Check

Practice Project: Hadoop Architecture, Distributed Storage (HDFS), and YARN

Lesson 03 – Data Ingestion into Big Data Systems and ETL

Data Ingestion Into Big Data Systems and ETL

Data Ingestion Overview Part One

Data Ingestion Overview Part Two

Apache Sqoop

Sqoop and Its Uses

Sqoop Processing

Sqoop Import Process

Sqoop Connectors

Demo: Importing and Exporting Data from MySQL to HDFS

Practice Project: Apache Sqoop

Apache Flume

Flume Model

Scalability in Flume

Components in Flume’s Architecture

Configuring Flume Components

Demo: Ingest Twitter Data

Apache Kafka

Aggregating User Activity Using Kafka

Kafka Data Model

Partitions

Apache Kafka Architecture

Demo: Setup Kafka Cluster

Producer Side API Example

Consumer Side API

Consumer Side API Example

Kafka Connect

Demo: Creating Sample Kafka Data Pipeline Using Producer and Consumer

Key Takeaways

Knowledge Check

Practice Project: Data Ingestion Into Big Data Systems and ETL

Lesson 04 – Distributed Processing MapReduce Framework and Pig

Distributed Processing MapReduce Framework and Pig

Distributed Processing in MapReduce

Word Count Example

Map Execution Phases

Map Execution in a Distributed Two-Node Environment

MapReduce Jobs

Hadoop MapReduce Job Work Interaction

Setting Up the Environment for MapReduce Development

Set of Classes

Creating a New Project

Advanced MapReduce

Data Types in Hadoop

Output Formats in MapReduce

Using Distributed Cache

Joins in MapReduce

Replicated Join

Introduction to Pig

Components of Pig

Pig Data Model

Pig Interactive Modes

Pig Operations

Various Relations Performed by Developers

Demo: Analyzing Web Log Data Using MapReduce

Demo: Analyzing Sales Data and Solving KPIs Using Pig

Practice Project: Apache Pig

Demo: Wordcount

Key Takeaways

Knowledge Check

Practice Project: Distributed Processing – MapReduce Framework and Pig

Lesson 05 – Apache Hive

Apache Hive

Hive SQL over Hadoop MapReduce

Hive Architecture

Interfaces to Run Hive Queries

Running Beeline from Command Line

Hive Metastore

Hive DDL and DML

Creating New Table

Data Types

Validation of Data

File Format Types

Data Serialization

Hive Table and Avro Schema

Hive Optimization: Partitioning, Bucketing, and Sampling

Non-Partitioned Table

Data Insertion

Dynamic Partitioning in Hive

Bucketing

What Do Buckets Do?

Hive Analytics: UDF and UDAF

Other Functions of Hive

Demo: Real-time Analysis and Data Filtration

Demo: Real-World Problem

Demo: Data Representation and Import Using Hive

Key Takeaways

Knowledge Check

Practice Project: Apache Hive

Lesson 06 – NoSQL Databases HBase

NoSQL Databases HBase

NoSQL Introduction

Demo: YARN Tuning

HBase Overview

HBase Architecture

Data Model

Connecting to HBase

Practice Project: HBase Shell

Key Takeaways

Knowledge Check

Practice Project: NoSQL Databases – HBase

Lesson 07 – Basics of Functional Programming and Scala

Basics of Functional Programming and Scala

Introduction to Scala

Demo: Scala Installation

Functional Programming

Programming With Scala

Demo: Basic Literals and Arithmetic Programming

Demo: Logical Operators

Type Inference, Classes, Objects, and Functions in Scala

Demo: Type Inference, Functions, Anonymous Function, and Class

Collections

Types of Collections

Demo: Five Types of Collections

Demo: Operations on List

Scala REPL

Demo: Features of Scala REPL

Key Takeaways

Knowledge Check

Practice Project: Apache Hive

Lesson 08 – Apache Spark Next-Generation Big Data Framework

Apache Spark Next-Generation Big Data Framework

History of Spark

Limitations of MapReduce in Hadoop

Introduction to Apache Spark

Components of Spark

Application of In-memory Processing

Hadoop Ecosystem vs Spark

Advantages of Spark

Spark Architecture

Spark Cluster in Real World

Demo: Running Scala Programs in Spark Shell

Demo: Setting Up Execution Environment in IDE

Demo: Spark Web UI

Key Takeaways

Knowledge Check

Practice Project: Apache Spark Next-Generation Big Data Framework

Lesson 09 – Spark Core Processing RDD

Introduction to Spark RDD

RDD in Spark

Creating Spark RDD

Pair RDD

RDD Operations

Demo: Spark Transformation Detailed Exploration Using Scala Examples

Demo: Spark Action Detailed Exploration Using Scala

Caching and Persistence

Storage Levels

Lineage and DAG

Need for DAG

Debugging in Spark

Partitioning in Spark

Scheduling in Spark

Shuffling in Spark

Sort Shuffle

Aggregating Data With Paired RDD

Demo: Spark Application With Data Written Back to HDFS and Spark UI

Demo: Changing Spark Application Parameters

Demo: Handling Different File Formats

Demo: Spark RDD With Real-world Application

Demo: Optimizing Spark Jobs

Key Takeaways

Knowledge Check

Practice Project: Spark Core Processing RDD

Lesson 10 – Spark SQL Processing DataFrames

Spark SQL Processing DataFrames

Spark SQL Introduction

Spark SQL Architecture

DataFrames

Demo: Handling Various Data Formats

Demo: Implement Various DataFrame Operations

Demo: UDF and UDAF

Interoperating With RDDs

Demo: Process DataFrame Using SQL Query

RDD vs DataFrame vs Dataset

Practice Project: Processing DataFrames

Key Takeaways

Knowledge Check

Practice Project: Spark SQL – Processing DataFrames

Lesson 11 – Spark MLlib Modeling Big Data With Spark

Spark MLlib Modeling Big Data With Spark

Role of Data Scientist and Data Analyst in Big Data

Analytics in Spark

Machine Learning

Supervised Learning

Demo: Classification of Linear SVM

Demo: Linear Regression With Real World Case Studies

Unsupervised Learning

Demo: Unsupervised Clustering K-means

Reinforcement Learning

Semi-supervised Learning

Overview of MLlib

MLlib Pipelines

Key Takeaways

Knowledge Check

Practice Project: Spark MLlib – Modeling Big Data With Spark

Lesson 12 – Stream Processing Frameworks and Spark Streaming

Streaming Overview

Real-time Processing of Big Data

Data Processing Architectures

Demo: Real-time Data Processing

Spark Streaming

Demo: Writing Spark Streaming Application

Introduction to DStreams

Transformations on DStreams

Design Patterns for Using foreachRDD

State Operations

Windowing Operations

Join Operations: Stream-Dataset Join

Demo: Windowing of Real-time Data Processing

Streaming Sources

Demo: Processing Twitter Streaming Data

Structured Spark Streaming

Use Case: Banking Transactions

Structured Streaming Architecture Model and Its Components

Output Sinks

Structured Streaming APIs

Constructing Columns in Structured Streaming

Windowed Operations on Event-time

Use Cases

Demo: Streaming Pipeline

Practice Project: Spark Streaming

Key Takeaways

Knowledge Check

Practice Project: Stream Processing Frameworks and Spark Streaming

Lesson 13 – Spark GraphX

Spark GraphX

Introduction to Graph

GraphX in Spark

GraphX Operators

Join Operators

GraphX Parallel System

Algorithms in Spark

Pregel API

Use Case of GraphX

Demo: GraphX Vertex Predicate

Demo: PageRank Algorithm

Key Takeaways

Knowledge Check

Practice Project: Spark GraphX

Project Assistance

Course End Projects:

The course includes four real-world, industry-based projects. The successful evaluation of one of the following projects is a part of the certification eligibility criteria:

Project 1: Analyzing Historical Insurance Claims

Use Hadoop features to predict patterns and share actionable insights for a car insurance company

This project uses New York Stock Exchange data from 2010 to 2016, captured from 500+ listed companies. The data set consists of each listed company’s intraday prices and volume traded. The data is used in both machine learning and exploratory analysis projects for the purposes of automating the trading process and predicting the next trading-day winners or losers. The scope of this project is limited to exploratory data analysis.

Domain: BFSI

Project 2: Employee Review of Comment Analysis

Use Hive features for data analysis and share the actionable insights with the HR team for the purpose of taking corrective actions.

The HR team is surfing social media to gather feedback and sentiments from current and former employees. This information will be used to derive actionable insights and take corrective actions to improve the employer-employee relationship. The data is web-scraped from Glassdoor and contains detailed reviews from 67K employees of Google, Amazon, Facebook, Apple, Microsoft, and Netflix.

Domain: Human Resources

Project 3: K-Means Clustering for Telecommunication Domain

LoudAcre Mobile is a mobile phone service provider that has introduced a new open network campaign. As a part of this campaign, the company has invited users to complain about mobile phone network towers in their area if they are experiencing connectivity issues with their present mobile network. LoudAcre has collected the dataset of users who have complained.

Domain: Telecommunication

Project 4: Market Analysis in Banking Domain

Our client, a Portuguese banking institution, ran a marketing campaign to convince potential customers to invest in a bank term deposit promotion. The marketing campaign pitches were delivered by phone calls. Often, however, the same customer was contacted more than once.

You have to perform the marketing analysis of the data generated by this campaign, keeping in mind the redundant calls