Big Data Analytics
This course provides a basic introduction to big data and related quantitative research methods. Its aim is to familiarize students with big data analysis as a tool for addressing substantive research questions. The course begins with a basic introduction to big data and discusses what analyzing such data entails, as well as the technical, conceptual, and ethical challenges involved. The strengths and limits of big data research are examined in detail using practical examples. Students then work through case study exercises in which small groups develop and present a big data concept for a specific real-world case. This includes hands-on exercises that familiarize students with the formats big data comes in and provide initial practical experience in handling and analyzing large, complex data structures.
Course Details
Ask most people about big data and they will tell you only this: "You are talking about a huge collection of data that cannot be processed unless it is stored and operated on in an unconventional way." But big data is about much more than storing and extracting data. It encompasses so many technologies that it can be hard to know where to start. Some of the technologies that make up the big data ecosystem are Hadoop, MapReduce, Apache Pig, Hive, Flume, Sqoop, ZooKeeper, Oozie, and Spark.
Companies are urgently looking for qualified big data analysts, and as data is collected and stored faster than ever before, demand for these professionals continues to grow. Before getting into big data, we encourage you to review the full big data curriculum so you understand the topic as a whole. The next time you start a course, make sure it covers all of the major big data topics. Allsoft Solutions covers every module, up to and including data streaming.
Course Information
1. Provide an overview of the exciting and growing field of big data analytics.
2. Introduce the tools required to manage and analyze big data, such as Hadoop, NoSQL, and MapReduce.
3. Teach the basic techniques and principles for achieving scalable, streaming-capable big data analytics.
4. Equip students with skills that will help them solve complex real-world problems in decision support.
Introduction to Big Data
Limitations of the existing solutions for Big Data problems.
How Hadoop solves the Big Data problem
IBM’s 4 V’s
Types of Data
Installation of Cloudera and VMware
Setting up a single-node Hadoop cluster
Describing the functions and features of HDP
Listing the IBM value-add components
Explaining what IBM Watson Studio is
Giving a brief description of the purpose of each of the value-add components
Exploring the lab environment
Describing and comparing the open-source data processing languages Pig and Hive
Listing the characteristics of programming languages typically used by Data Scientists: R and Python
Understanding the challenges posed by distributed applications and how ZooKeeper is designed to handle them.
Explaining the role of ZooKeeper within the Apache Hadoop infrastructure and the realm of Big Data management.
Exploring generic use cases and some real-world scenarios for ZooKeeper.
Defining the ZooKeeper services that are used to manage distributed systems.
Exploring and using the ZooKeeper CLI to interact with ZooKeeper services.
Understanding how Apache Slider works in conjunction with YARN to deploy distributed applications and to monitor them.
HDFS Architecture
Hadoop Ecosystem
Linux-based commands (how to work with the local file system)
Hadoop Commands
Sqoop
Introduction to Sqoop
How to list tables from a MySQL database (RDBMS) with Sqoop
How to list databases in MySQL with Sqoop
How to import all tables from a specific MySQL database into HDFS (Hadoop)
How to import data from MySQL (RDBMS) into HDFS (Hadoop)
How to export data from HDFS to MySQL (RDBMS)
How to import part of a table from MySQL into HDFS
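Sqoop itself is driven from the command line rather than from Scala, so the following is only a minimal sketch of the same MySQL-to-HDFS import/export pattern using Spark's JDBC data source (a deliberate substitution so all code samples in this outline stay in one language, not Sqoop syntax). The hostname, database, credentials, table names, and HDFS paths are placeholders, and the MySQL JDBC driver is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object MySqlToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MySqlToHdfsSketch").getOrCreate()

    // "Import": read one MySQL table over JDBC (connection details are placeholders)
    val employees = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/company")
      .option("dbtable", "employees")
      .option("user", "student")
      .option("password", "secret")
      .load()

    // Land the imported rows in HDFS, roughly what a Sqoop import does with a target directory
    employees.write.mode(SaveMode.Overwrite).parquet("hdfs:///user/student/employees")

    // "Export": push rows from HDFS back into a MySQL table
    spark.read.parquet("hdfs:///user/student/employees")
      .write.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/company")
      .option("dbtable", "employees_export")
      .option("user", "student")
      .option("password", "secret")
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```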
HIVE
Hive concepts
Hive Data types
Hive Background
About Hive
Hive Architecture and Components
Metastore in Hive
Limitations of Hive
Comparison with Traditional Databases
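Hive queries are normally written in HiveQL and run from the Hive shell or Beeline; to keep the samples in one language, here is a minimal Scala sketch that issues the same kind of HiveQL statements through a SparkSession with Hive support enabled. It assumes a working Hive metastore configured via hive-site.xml, and the table and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object HiveQuickstartSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark use the Hive metastore configured in hive-site.xml
    val spark = SparkSession.builder()
      .appName("HiveQuickstartSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hive keeps the table schema in the metastore; the rows themselves live as files in HDFS
    spark.sql("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, salary DOUBLE)")
    spark.sql("INSERT INTO employees VALUES (1, 'Alice', 55000.0), (2, 'Bob', 48000.0)")
    spark.sql("SELECT name, salary FROM employees WHERE salary > 50000").show()

    spark.stop()
  }
}
```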
PIG
What is Pig?
Pig Run Modes
Pig Latin Concepts
Pig Data Types
Pig Example
Group Operator
COGROUP Operator
Joins
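Pig Latin is its own language, so rather than invent Pig code here, the sketch below shows the same GROUP, JOIN, and COGROUP ideas with Spark's pair-RDD operators in Scala (the datasets and key names are made-up examples).

```scala
import org.apache.spark.sql.SparkSession

object GroupAndCogroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GroupAndCogroupSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val orders  = sc.parallelize(Seq(("alice", 250), ("bob", 120), ("alice", 80)))
    val returns = sc.parallelize(Seq(("alice", 30)))

    // GROUP: collect all values for each key (like Pig's GROUP ... BY)
    orders.groupByKey().collect().foreach { case (user, amounts) =>
      println(s"$user -> ${amounts.toList}")
    }

    // JOIN: pair up matching keys from both datasets
    orders.join(returns).collect().foreach(println)

    // COGROUP: group both datasets by key in one pass, keeping each source's values separate
    orders.cogroup(returns).collect().foreach { case (user, (bought, returned)) =>
      println(s"$user bought=${bought.toList} returned=${returned.toList}")
    }

    spark.stop()
  }
}
```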
HBASE
What is HBase?
HBase Model
HBase Read
HBase Write
HBase MemStore
RDBMS vs HBase
HBase Commands
HBase Example
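For a feel of HBase reads and writes from code, here is a minimal sketch using the standard HBase client API from Scala. The table name "users", the column family "info", and the row key are placeholders, and a reachable HBase cluster (hbase-site.xml on the classpath) is assumed.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()                 // reads hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("users"))

    // Write path: a Put lands in the MemStore first and is later flushed to HFiles on disk
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
    table.put(put)

    // Read path: a Get checks the MemStore and HFiles for the requested row
    val result = table.get(new Get(Bytes.toBytes("row1")))
    val name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
    println(s"name = $name")

    table.close()
    connection.close()
  }
}
```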
MapReduce
Input Splits in MapReduce
Combiner & Partitioner
What are the file input formats in Hadoop (MapReduce)?
What type of key-value pair is generated when the input format is KeyValueTextInputFormat?
Can we set the required number of mappers and reducers?
Difference between the old and new APIs in MapReduce
What is the importance of the RecordReader in Hadoop?
MapReduce with a word count example
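A real Hadoop job would use the MapReduce Java API (Mapper, Reducer, Job), but the core logic of the word count example can be sketched in a few lines of plain Scala: the map phase emits (word, 1) pairs, and the shuffle/reduce phase groups them by key and sums the counts. The two input lines are made up for illustration.

```scala
object WordCountLogicSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data is big", "data is everywhere")   // stand-in for an input split

    // Map phase: split each line and emit a (word, 1) pair per word
    val mapped = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle + reduce phase: group pairs by key and sum the counts per word
    val reduced = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

    reduced.foreach { case (word, count) => println(s"$word\t$count") }
  }
}
```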
Advanced course
Big SQL
Overview of Big SQL
Understanding how Big SQL fits in the Hadoop architecture
Starting and stopping Big SQL using Ambari and the command line
Connecting to Big SQL using command line
Connecting to Big SQL using IBM Data Server Manager
Configuring images
Starting Hadoop components
Starting up the Big SQL and DSM services
Connecting to Big SQL using JSqsh
Executing basic Big SQL statements
IBM Watson Studio
Explain what IBM Watson Studio is.
Identify industry use cases.
List Watson Studio offerings.
Create Watson Studio projects.
Describe Watson Studio and the Apache Spark environment.
Describe Watson Studio and Cloud Object Storage.
Prepare and analyze data.
Use Jupyter Notebooks.
Describe Apache Spark environment options.
List Watson Studio default Apache Spark environment definitions.
Create machine learning (ML) models with an Apache Spark runtime (a minimal Spark ML sketch follows this list).
Describe cloud storage and its features.
Define various types of cloud storage (object storage, block storage, and file storage).
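Watson Studio itself is a hosted environment, so the sketch below only illustrates the kind of Spark ML model-building code you might run in such a notebook: a toy logistic regression using the spark.ml API in Scala. The tiny inline dataset is made up purely for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SparkMlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkMlSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy training data: a label column and a feature vector column
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    // Fit a logistic regression model and score the same data
    val model = new LogisticRegression().setMaxIter(10).fit(training)
    model.transform(training).select("label", "prediction").show()

    spark.stop()
  }
}
```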
Scala and Spark
What is Scala?
Why Scala for Spark?
Scala in other frameworks
Introduction to Scala REPL
Basic Scala operations
Variable Types in Scala
Control Structures in Scala
Foreach loop
Functions
Procedures
Collections in Scala: Array, ArrayBuffer, Map, Tuples, Lists, and more
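To make the Scala topics above concrete, here is a minimal, self-contained sketch touching variables, control structures, a function and a procedure, the main collection types, and a foreach loop (all names and values are made up for illustration).

```scala
import scala.collection.mutable.ArrayBuffer

object ScalaBasics {
  // A function returns a value; a procedure returns Unit and is called for its side effect
  def square(x: Int): Int = x * x
  def greet(name: String): Unit = println(s"Hello, $name")

  def main(args: Array[String]): Unit = {
    val language = "Scala"   // val: immutable reference
    var count = 0            // var: mutable reference
    count += 1

    // Control structure: if/else is an expression that yields a value
    val parity = if (count % 2 == 0) "even" else "odd"

    // Collections: Array, ArrayBuffer, Map, Tuple, List
    val numbers = Array(1, 2, 3)
    val buffer = ArrayBuffer(10, 20)
    buffer += 30
    val capitals = Map("India" -> "New Delhi", "France" -> "Paris")
    val pair = ("spark", 2)                       // a Tuple2
    val langs = List("Scala", "Python", "R")

    // foreach loop over a collection
    langs.foreach(l => println(s"$l has ${l.length} letters; squared: ${square(l.length)}"))

    greet(language)
    println(s"count=$count parity=$parity sum=${numbers.sum} buffer=${buffer.size} " +
            s"capitals=${capitals.size} first=${pair._1}")
  }
}
```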
What is Spark?
Spark Ecosystem
Modes of Spark
Spark installation demo
Overview of Spark on a cluster
Spark Standalone cluster
Spark Web UI
Basic Spark configuration
Components of the Spark unified stack
Spark Streaming
MLlib
Core
Spark SQL
RDD - The core concept of Spark
RDDs
Transformations on RDDs
Actions on RDDs
Loading data into an RDD
Saving data from an RDD
Key-value pair RDDs
MapReduce and pair RDD operations
Scala and Python shells
Word count example (see the sketch after this list)
Shared variables with examples
Submitting jobs to a cluster
Hands-on examples
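Here is a minimal end-to-end Spark word count in Scala that puts the RDD topics above together: loading data into an RDD, pair-RDD transformations, an action, and saving results. The HDFS input and output paths are placeholders, and local mode is used only for convenience; on a cluster the master would be set by spark-submit.

```scala
import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on the local machine; on a cluster the master is set by spark-submit
    val spark = SparkSession.builder().appName("SparkWordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Loading data into an RDD (the input path is a placeholder)
    val lines = sc.textFile("hdfs:///user/student/input.txt")

    // Transformations build a pair RDD of (word, 1); reduceByKey is the MapReduce-style aggregation
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Actions: collect triggers the computation; saveAsTextFile writes results back to HDFS
    counts.collect().foreach { case (word, n) => println(s"$word: $n") }
    counts.saveAsTextFile("hdfs:///user/student/wordcount-output")

    spark.stop()
  }
}
```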
Future Opportunity
Big data is not only an important part of the future, it can be the future itself. The way companies, organizations and the IT professionals who support them approach their jobs will continue to be shaped by developments in the way we store, move and understand data.
Tutor Information
- Databases (Oracle)
- ETL tools (Informatica, ODI)
- Analytical reporting tools (OBIEE, Tableau)
- Big data ecosystem (HDFS, Kafka, Sqoop, Spark, Scala, Python, Hive, Impala, Oozie, HBase)
Industry-Driven Projects
POCs using Hadoop technologies, for example:
1. Zero Copy Shared Memory Framework in KVM for Host Guest Data Sharing
2. Sampling Based Network Traffic Measurement Algorithm for Big Network Data
3. Novel Privacy Preserving and Efficient Protocols for Human Activity Recognition Based on Sensor
4. Credit Card Fraud Detection Project
5. Twitter dataset analysis