Introduction to Linux and the Big Data Virtual Machine (VM)

Introduction/Installation of VirtualBox and the Big Data VM

Introduction to Linux

  • Why Linux?
  • Windows and the Linux equivalents
  • Different flavors of Linux
  • Unity Shell (Ubuntu UI)
  • Basic Linux Commands (enough to get started with Hadoop)

Understanding Big Data
  • The 3 Vs (Volume, Variety, Velocity)

  • Structured and Unstructured Data

  • Applications and use cases of Big Data

Limitations of traditional large-scale systems

Why distributed computing is superior (cost and scale)

Opportunities and challenges with Big Data

HDFS (The Hadoop Distributed File System)

HDFS Overview and Architecture
  • Deployment Architecture
  • NameNode, DataNode and Checkpoint Node (aka Secondary NameNode)
  • Safe mode
  • Configuration files
  • HDFS Data Flows (Read vs Write)

How HDFS addresses fault tolerance

  • CRC checksum
  • Data replication
  • Rack awareness and Block placement policy
  • Small files problem

HDFS Interfaces

  • Command Line Interface
  • File System
  • Administrative
  • Web Interface

Advanced HDFS features

  • Load Balancer
  • DistCp
  • HDFS Federation
  • HDFS High Availability

MapReduce - 1 (Theoretical Concepts)

MapReduce overview

  • Functional Programming paradigms
  • How to think in a MapReduce way

MapReduce Architecture

  • Legacy MR vs Next Generation MapReduce (aka YARN/MRv2)
  • Slots vs Containers
  • Schedulers
  • Shuffling, Sorting
  • Hadoop Data Types
  • Input and Output Formats
  • Input Splits
  • Partitioning (Hash Partitioner vs Custom Partitioner)
  • Configuration files
  • Distributed Cache

MR Algorithm and Data Flow

  • Word Count
  • Indexing

MapReduce - 2 (Practice)

Developing, debugging and deploying MR programs

  • Standalone mode (in Eclipse)
  • Pseudo-distributed mode (as in the Big Data VM)
  • Fully distributed mode (as in production)

  • Old and new MR APIs
  • Java Client API
  • Hadoop data types and custom Writables/WritableComparables
  • Different input and output formats

Hadoop Streaming (developing and debugging non-Java MR programs in Ruby and Python)

Optimization techniques
  • Speculative execution
  • Combiners
  • JVM Reuse
  • Compression

MR algorithms
  • Sorting
  • Term Frequency – Inverse Document Frequency
  • Student database
  • Max Temperature
  • Different ways of joining data

Higher Level Abstractions for MR (Pig)

  • Introduction and Architecture
  • Different Modes of executing Pig constructs
  • Data Types
  • Dynamic invokers
  • Pig streaming
  • Macros
  • Pig Latin language constructs (LOAD, STORE, DUMP, SPLIT, etc.)
  • User Defined Functions
  • Use Cases

Higher Level Abstractions for MR (Hive)
  • Introduction and Architecture
  • Different Modes of executing Hive queries
  • Metastore Implementations
  • HiveQL (DDL and DML operations)
  • External vs Managed Tables
  • Views
  • Partitions & Buckets
  • User Defined Functions
  • Transformations using non-Java languages
  • Use Cases

Comparison of Pig and Hive

NoSQL Databases - 1 (Theoretical Concepts)

NoSQL Concepts

  • Review of RDBMS
  • Need for NoSQL
  • Brewer's CAP Theorem
  • ACID vs BASE
  • Schema on Read vs. Schema on Write
  • Different levels of consistency
  • Bloom filters

Different types of NoSQL databases
  • Key Value
  • Columnar
  • Document
  • Graph

Columnar database concepts

NoSQL Databases - 2 (Practice)

HBase Architecture

  • Master and the Region Server
  • Catalog tables (ROOT and META)
  • Major and Minor compaction
  • Configuration files
  • HBase vs Cassandra

Interfaces to HBase (for DDL and DML operations)
  • Java API
  • Client API
  • Filters
  • Scan Caching and Batching
  • Command Line Interface

Advanced HBase Features

  • HBase Data Modeling
  • Bulk loading data in HBase
  • HBase Coprocessors - Endpoints (similar to Stored Procedures in RDBMS)
  • HBase Coprocessors - Observers (similar to Triggers in RDBMS)

Setting up a Hadoop Cluster using Apache Hadoop

Brief introduction to the cloud and AWS

Cloudera Hadoop cluster on the Amazon Cloud (Practice)

  • Using EMR (Elastic MapReduce)
  • Using EC2 (Elastic Compute Cloud)

SSH Configuration

Standalone mode (Theory)

Distributed mode (Theory)
  • Pseudo-distributed
  • Fully distributed

Getting started with Apache Spark

  • Limitations of the MR model and how Spark/RDD addresses them
  • Spark Installation demo
  • Different modes of running Spark
  • What are RDDs?
  • Different transformations and actions on RDDs
  • Integrating Spark with PyCharm
  • Developing Spark programs in PyCharm, the Spark shell, etc.
  • Spark Streaming overview and demo
  • Spark SQL overview and demo

Hadoop Ecosystem and Use Cases
  • Hadoop industry solutions
  • Importing/exporting data across RDBMS and HDFS using Sqoop
  • Getting real-time events into HDFS using Flume
  • Creating workflows in Oozie
  • Graph processing with Neo4j
  • NoSQL databases Cassandra and MongoDB
  • Distributed coordination using ZooKeeper

Proofs of concept and use cases

  • Two projects that closely resemble real-life projects
  • Further ideas for data analysis