Tuesday, August 6, 2013

Basic Meaning of Big Data

For many days now, some students and juniors of mine have been asking about the basics of Big Data and how it is related to Cloud Computing. So I thought I would write an article explaining the meaning of Big Data.

As the name suggests, Big Data deals with huge amounts of data that cannot be processed using simple methods and tools. For example, modern high-energy physics experiments, such as DZero, typically generate more than one terabyte of data per day. The famous social networking website Facebook serves 570 billion page views per month, stores 3 billion new photos every month, and manages 25 billion pieces of content. The main question is: how do we process this much data in less time? Data collected from these sources is very loosely linked, so making decisions from it is complex and time consuming. Today's conventional databases cannot process data if they don't know the exact relations between terms. Nowadays organizations collect data from many different sources and through many different methods. For example, a laptop company can collect data about a product from social networking sites such as Facebook and Twitter, from laptop-related blogs, and even from online sites that sell laptops. Data collected from so many different sources cannot be processed by conventional databases to produce proper decisions. This data is too big, moves too fast, or doesn't fit the structure of conventional databases. To gain value from this data, we must follow a new approach. This new approach is known as Big Data.

Seeing the need for Big Data, on March 29, 2012 the American government announced the “Big Data Research and Development Initiative”, and Big Data became national policy for the first time. Recently, a definition of Big Data was also given by Gartner: “Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” (I will explain the three V's of Big Data in detail in coming posts).

Basically, the set of techniques used for getting useful information from very large unstructured data sets is known as Big Data. This data can be of any type. I hope you now have a basic understanding of Big Data. Stay tuned; I will post details about Big Data architecture, the 3 V's, and more in subsequent posts.

Rajinder Sandhu

Saturday, April 27, 2013

Hadoop on Multi Node

This tutorial provides a step-by-step way to install multi-node Hadoop on Ubuntu 10.04.

1. Install single-node Hadoop on all nodes before starting this tutorial.

2. If you followed my single-node tutorial and installed Ubuntu 10.04 on VMware Workstation, you can clone your single-node machine as shown in the figure below.

[Figure: cloning the single-node virtual machine in VMware Workstation]

3. Name the cloned virtual machine slave. This can be changed from System > Administration > Users and Groups.

1. Networking of Master and Slave

1.1 All nodes should be accessible to each other on the network.
1.2 Add the IP addresses of both the master and slave machines on all machines using the following commands.
  • ifconfig # run on each machine to find its IP
  • sudo nano /etc/hosts
Add the following lines to the file.
  • 192.168.216.135 master
  • 192.168.216.136 slave

2. SSH Access

SSH access should be enabled from the master to the slave so that jobs can be transferred from master to slave and vice versa.
2.1 Copy the master's RSA public key to the slave's authorized keys using the following commands.
  • su - hduser
  • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
Now check that SSH is working with the following commands.
  • ssh master
  • ssh slave

3. Configuration of Hadoop

3.1 Add the name of the master in the conf/masters file, on the master node only, using the following commands.
  • su - hduser
  • cd /home/hadoop/conf
  • nano masters
Add the name of the master, which in this case is:
  • master
and then

  • nano slaves
Add the names of the slaves, which in this case are:
  • master
  • slave
3.2 Change the Hadoop configuration files on all machines as follows, just as in the single-node installation.

In conf/core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>  <!-- the only line that requires a change -->
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

In conf/mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>  <!-- the only line that requires a change -->
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

In conf/hdfs-site.xml:

Set the replication value to 2, which was 1 earlier.
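
For reference, the changed property in conf/hdfs-site.xml could look like the following (a sketch that reuses the dfs.replication snippet from the single-node setup, only with the value raised to 2):

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>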

3.3 Format the NameNode

  • /home/hadoop/bin/hadoop namenode -format
3.4 If the NameNode does not get formatted, delete the Hadoop temp folders:
  • cd /app/hadoop/tmp
  • rm -R dfs
  • rm -R mapred

4. Starting the Multi node Hadoop

4.1 Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.

4.2 Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
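
To confirm that the daemons started on the right machines, you can run jps on both nodes. This is a quick optional check and the exact list depends on your configuration; since master is also listed in conf/slaves in this setup, it runs the worker daemons as well.
  • jps   # on master, expect NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker
  • jps   # on slave, expect DataNode and TaskTracker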

Monday, April 1, 2013

Install Hadoop on Single Node

In this tutorial I will give you line-by-line instructions on how to set up Hadoop on a single node. First we will install Hadoop on a single node and then move to multiple nodes in coming tutorials.

Prerequisites:

1. Java

* Update the package repositories and install Java using the following commands.
  • sudo add-apt-repository ppa:webupd8team/java
  • sudo apt-get update
  • sudo apt-get install oracle-java7-installer
* Now check the Java version:
  • java -version

2. Python environment.

Set up the Python environment using the following command.
  • sudo apt-get install python-software-properties

3. Create a new user for Hadoop.

Create a new user for Hadoop. We will give all Hadoop permissions to this user so that our Hadoop installation stays secure.
  • sudo addgroup hadoop
  • sudo adduser --ingroup hadoop hduser

4. Configure SSH

SSH is required for Hadoop to communicate between the master and slave nodes. Configure it using the following commands:
  • sudo apt-get install openssh-server
  • su - hduser
  • ssh-keygen -t rsa -P ""
  • cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Check that SSH is working using the following command.
  • ssh localhost

5. Disable IPV6

Sometimes, due to the IPv6 configuration, the Hadoop master node is not able to communicate with the slaves. It is better to disable IPv6 in the system control file using the following commands.
* Open the system control file in the nano editor.
  • sudo nano /etc/sysctl.conf
* Add the lines below to the end of the system control file.




  • # disable ipv6
  • net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

    * Restart the system for the settings to take effect. Check whether IPv6 is disabled using the following command. The output should be 1 when IPv6 is disabled.
    • cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    All prerequisites are complete.

    Now we start with the installation of Hadoop on a single-node cluster.

    1. Download hadoop from here.

    * Extract the tar file, rename the extracted folder to hadoop, and save it on the Desktop.
    * Move it to the /home folder so that it can also be used by slave nodes.
    • cd Desktop
    • sudo mv hadoop /home
    * Change the owner of this folder to hduser using the following commands.
    • cd /home
    • sudo chown -R hduser:hadoop /home/hadoop
    * Open the .bashrc file using the following commands to set the Hadoop and Java environment variables.
    • su - hduser
    • nano $HOME/.bashrc
    * Add the following lines to the end of the file.
    # Set Hadoop-related environment variables

    export HADOOP_HOME=/home/hadoop
    # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    # Some convenient aliases and functions for running Hadoop-related commands
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"
    # If you have LZO compression enabled in your Hadoop cluster and
    # compress job outputs with LZOP (not covered in this tutorial):
    # Conveniently inspect an LZOP compressed file from the command
    # line; run via:
    #
    # $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
    #
    # Requires installed 'lzop' command.
    #
    lzohead () {
        hadoop fs -cat $1 | lzop -dc | head -1000 | less
    }
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:$HADOOP_HOME/bin
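
    After saving the file, the new variables apply only to new shells. To load them into the current session you can source the file (a small optional step, not part of the original write-up):
    • source $HOME/.bashrc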

    2. Configure Hadoop script file

    * The only environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Run the following commands.
    • su - hduser
    • cd /home/hadoop/conf
    • nano hadoop-env.sh
    * and set JAVA_HOME as follows:
    • # The java implementation to use.  Required.
    • export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    3. Create a temporary directory for Hadoop.
    A temp folder is created so that Hadoop can store its temporary files in it.
    • exit
    • sudo mkdir -p /app/hadoop/tmp
    • sudo chown hduser:hadoop /app/hadoop/tmp

    4. Set Configuration files

    All Hadoop configuration files are set in this section. Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML files.
    In file conf/core-site.xml:
    • su - hduser
    • cd /home/hadoop/conf
    • nano core-site.xml
    Add the following between the configuration tags:
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>

    In file conf/mapred-site.xml:
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>
    In file conf/hdfs-site.xml:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>

    5. Formatting the HDFS filesystem via the NameNode
    • /home/hadoop/bin/hadoop namenode -format

    6. Start the single node cluster

    • /home/hadoop/bin/start-all.sh 


    Hadoop is now running on this machine. Check that it is working using the following commands.

    • cd /home/hadoop
    • jps

    The jps output lists the Hadoop daemons running on this machine along with their process IDs.
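
    To further verify the installation, you can run one of the example MapReduce jobs that ship with Hadoop. This is only a sketch: the exact jar name depends on the Hadoop version you downloaded (hadoop-examples-1.0.3.jar is an assumption here), so adjust it to match the jar present in your /home/hadoop folder.
    • cd /home/hadoop
    • bin/hadoop jar hadoop-examples-1.0.3.jar pi 2 5
    If the job finishes and prints an estimated value of Pi, the single-node cluster is working end to end.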

    Thursday, March 14, 2013

    Virtualization- Basic Meaning

    Hi everyone. In my previous post I talked about the three basic requirements for Cloud Computing to take hold in the market. First among them was Virtualization. For a long time I have wanted to talk about Virtualization in detail but didn't get the time. Today I am free, so let's dig deeper into the basic meaning of virtualization and how it came into the picture.

    Virtualization is not a technology of the 21st century. It was used by IBM for its mainframe computers in the 1960s. Mainframe computers have multiple resources, and consolidating these different resources so that they act like a single resource requires virtualization. Cloud Computing utilizes this idea more comprehensively, which makes it the first requirement for the implementation of Cloud Computing. First, let's look at a few definitions of Virtualization.

    According to Wikipedia, “Virtualization, in computing, is the creation of a virtual (rather than actual) version of something, such as a hardware platform, operating system, a storage device or a network device.”

    Gartner states that “Virtualization is the abstraction of IT resources in a way that masks the physical nature and boundaries of those resources from the resource user. An IT resource can be a server, client, storage, network, application or operating system.”

    What we pull out of these two definitions is that virtualization is the conversion of anything and everything from physical to logical, where the logical side doesn't know about the actual physical resource. Just remember: physical to logical. One more thing: if you simply search the web for definitions of virtualization, you may find some saying that virtualization is running multiple operating systems on a single hardware resource. That is not the definition of virtualization; it is a type of virtualization, namely server virtualization. So don't confuse yourself when you come across these kinds of definitions. Virtualization can be applied to anything, e.g. network virtualization, storage virtualization, etc.

    These are all formal definitions and explanations. I hardly understand any concept from a formal definition alone; I need examples and a detailed explanation to grasp a concept or technology. If you say we started using virtualization with mainframe computers, that may be wrong. Actually, we have been using this idea in our day-to-day life in one way or another, for example:

    Suppose I have a big shop. No one wants it on rent because it is very large and no one can afford that much rent either. What do we do in our day-to-day life? We build a partition in the middle of the shop and turn it into two shops. Both shops are rented to two different shopkeepers. Each shopkeeper thinks he has rented a whole shop; actually the floor and ceiling are shared. What the owner of the big shop did, in reality, is VIRTUALIZE the big shop and create two small virtual shops. We use the same concept for IT resources: physical resources (server, storage, network) are shared among different virtual machines, but each virtual machine is isolated from the others.

    Now, to do server virtualization we need a piece of software known as a hypervisor. The most crucial piece of any virtual infrastructure is the hypervisor, which is what makes server virtualization possible. A hypervisor creates a virtual host that hosts virtual machines. It is also responsible for creating the virtual hardware that the VMs will use. If you look up the term hypervisor, the definition will likely say that a hypervisor is an “abstraction layer.” That's because it abstracts the traditional server operating system (OS) from the server hardware. Another way of saying this is that the hypervisor decouples the OS from the hardware. Your server OS no longer has to be tied to physical hardware, and the newly virtualized server can be hardware-independent and encapsulated inside a virtual machine. There are two types of hypervisor, as shown in the figure below:

    [Figure: Type 1 (bare-metal) and Type 2 (hosted) hypervisors]

    A Type 1 hypervisor is installed directly on the physical server hardware, thus replacing the existing OS. This is the most efficient design, in that it offers the best performance as well as the most enterprise-level data center features. Examples are VMware vSphere and Microsoft Hyper-V.

    A Type 2 hypervisor is installed on and “hosted” by the existing OS, and the virtual machines it runs are known as guest OSes. This is less efficient, but it lets you keep the existing applications already installed on the host OS. Examples are VMware Workstation, VMware Fusion and Windows Virtual PC.

    I hope you now understand the basic meaning of virtualization. If you give this concept some thought, you can easily see the many places we use it in our lives.

    See you all soon and till then happy Virtualizing.

    Google App Engine Video Tutorial Series- Part 1

    Thursday, March 7, 2013

    Setup NFS (Network File System) on Ubuntu 10.04 LTS

    NFS allows a system to share directories and files with others over a network. By using NFS, users and programs can access files on remote systems almost as if they were local files.

    Some of the most notable benefits that NFS can provide are:

    • Local workstations use less disk space because commonly used data can be stored on a single machine and still remain accessible to others over the network.

    • There is no need for users to have separate home directories on every network machine. Home directories could be set up on the NFS server and made available throughout the network.

    • Storage devices such as floppy disks, CDROM drives, and USB Thumb drives can be used by other machines on the network. This may reduce the number of removable media drives throughout the network.

    In this post I will show you how to set up NFS on Ubuntu 10.04.

    Experimental setup: For the demonstration of NFS, my setup includes Oracle VirtualBox and an Ubuntu 10.04 ISO image. I have Windows 7 as the host operating system. I installed two virtual Ubuntu 10.04 machines using Oracle VirtualBox, so now I have two Ubuntu machines running on VirtualBox. Both machines are accessible to each other on the network.

    On Server Side:

    Start the server Ubuntu machine.

    1. First of all, install the NFS kernel server. An Internet connection is required for this.

    sudo apt-get install nfs-kernel-server

    2. Make the directory you want to share. Always create a new directory; never share your operating system's base directories.

    sudo mkdir -p /export/users

    3. Change the mode of these directories so that anyone can read them. Change the mode according to your needs; I am changing it to 777, which is the least secure.

    sudo chmod 777 /export
    sudo chmod 777 /export/users

    4. Now, if you want to share a folder from the base operating system, you can bind it to the exported folder. (server is the name of my machine.)

    sudo mount --bind /home/server /export/users

    5. The above binding will be removed when you restart your system. To keep this binding permanent, open:

    sudo nano /etc/fstab

    Add the following line:

    /home/server /export/users none bind 0 0
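
    To apply the new fstab entry without rebooting, you can ask mount to process fstab again (a small optional step, not part of the original tutorial):

    sudo mount -a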

    6. Now open:

    sudo nano /etc/default/nfs-kernel-server

    and change or add:

    NEED_SVCGSSD = no

    7. Now open

    sudo nano /etc/default/nfs-common

    and change or add:

    NEED_IDMAPD = yes
    NEED_GSSD = no

    8. Make sure the following lines are present in /etc/idmapd.conf.

    cat /etc/idmapd.conf

    check:

    Nobody_user = nobody
    Nobody_group = nogroup

    9. Most important step: add the exported folders to the exports file. (192.168.80.136 is my server IP.)

    Open

    sudo nano /etc/exports

    Add the following lines at the end:

    /export         192.168.80.136/24(rw,fsid=0,insecure,no_subtree_check,async)
    /export/users   192.168.80.136/24(rw,nohide,insecure,no_subtree_check,async)

    10. Now restart the NFS kernel server.

    sudo /etc/init.d/nfs-kernel-server restart

    On Client Side:

    1. Install NFS common on the client side. This requires an Internet connection.


    sudo apt-get install nfs-common


    2. Open


    sudo nano /etc/default/nfs-common


    Set:


    NEED_IDMAPD = yes
    NEED_GSSD = no


    3. Now mount the exported folder.


    sudo mount -t nfs4 -o proto=tcp,port=2049 192.168.80.136:/ /mnt


    /mnt is the folder created on the client side where all the shared files will be mounted.


    That's all; your NFS setup is complete.
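
    To quickly verify the share from the client, you can check the mount and create a test file (a rough check, assuming the /home/server bind set up in step 4 on the server):

    df -h /mnt
    touch /mnt/users/nfs_test.txt

    If nfs_test.txt also appears in /home/server on the server machine, the share is working.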


    Error:


    If the following error comes up:


    mount.nfs4: No such device


    you have to load the nfs kernel module using the following command.


    sudo modprobe nfs


    Wednesday, March 6, 2013

    Do you need the cloud? or Do you want the cloud?

    Cloud is everywhere now. It is the most prominent area in research, in the IT industry and in academics as well. Every presentation or lecture I have seen recently highlighted the benefits of cloud computing. Being a researcher in this field, I decided to find out the areas where cloud is not the best option. To my surprise, there are quite a few areas where cloud has had little impact. If an organization says, “I want the cloud because everyone has it”, that is not the way to start. In this post I will discuss the areas where cloud can be used and where it should not.

    Before going any further, let's see why you might need the cloud:

    • All the IT load is handled by professionals.
    • Capital expenditure is very small compared to a big upfront investment.
    • Time to market for any service is just “now”.
    • Flexibility (pay as you go).
    • They always make an offer which you cannot refuse.

    These are the benefits all cloud providers aim to provide, but that is not the case all the time. Let's see why some IT organizations don't want the cloud:

    • Security: Every organization has a very large amount of data, collected by spending millions of dollars. In the 21st century, data is actual wealth. So why would an organization send its data to a third party?
    • Uptime: There is no such thing as 99.99% uptime. Many IT organizations complain that their cloud providers were down for 3-4 days. This is not always the provider's fault; sometimes a power grid fails or someone cuts a fiber line.
    • We are working fine with old methods; our IT can handle it.
    • Our IT will not be able to handle the cloud.
    • Multiple vendors in the cloud and no portability or interoperability.

    After studying many papers and technical reports, I came to the following conclusions.

    Business functions that suit cloud deployment tend to be low-priority business applications, for example business intelligence against very large databases, partner-facing project sites, and low-priority services. Cloud favours traditional web applications and interactive applications that combine two or more data sources and services and have a short life span. Based on the above, we can say that cloud is suitable for applications that are modular and loosely coupled, with isolated workloads, and for applications that need significantly different levels of infrastructure throughout the month or have seasonal demands, such as an increase in traffic during holiday shopping.

    Cloud is not suitable for mission-critical applications that depend on critical data normally restricted to the organization (private clouds are nowadays used for this purpose to some extent), or for applications that run 24x7x365 with steady demand. Cloud also doesn't work well with applications that scale vertically on a single server.

    Saturday, March 2, 2013

    Virtualizing, Standardizing & Automating

    In a cloud environment, people expect self-service, being able to get started very quickly, self-provisioning or rapid provisioning, scalability, and better billing models. All these features demand that you have your fundamentals well in place. You cannot expect a cloud to deliver what a cloud is expected to deliver if it is not virtualized, standardized and automated, because people expect technology that is easily scalable, portable, interoperable and self-working. This is what drives down cost and improves service. The three main constituents for achieving these requirements are discussed below.

    • Virtualization: Virtualization isn't a vague concept; you are probably already engaged in virtualization in one fashion or another. Virtualization technology is around 30 years old now. So first, a simple definition of virtualization: “Virtualization is an abstraction layer (hypervisor) that decouples the physical hardware (CPU, storage, networking) from the operating system to deliver greater IT resource utilization and flexibility.” I will go deeper into virtualization some other day; for now, just see how it helps in achieving the goals stated above. Using virtualization, one can easily allocate and deallocate resources for cloud users without any human interaction. This property provides easy and reliable scalability in our systems.
    • Standardization: Adoption of cloud computing by MSBs and SSBs is often rejected due to the lack of standards in cloud computing. If an MSB uses the services of one cloud provider, it cannot move its services very easily to another cloud. So it requires collective acceptance to move to the cloud and to choose a cloud provider, which is very difficult. If we can achieve standardization, uniform offerings will be readily available from different providers on a metered basis. It will also decrease pricing because of the increase in providers.
    • Automation: The cloud idea was developed and became popular because of its self-service feature. This is the backbone for the acceptance of cloud computing by all fields of science. Automation requires self-service portals providing point-and-click access to all IT resources. Resources are provisioned on demand, helping to reduce IT resource setup and configuration cycle times.

    Friday, February 15, 2013

    Google App Engine Introduction PPT


    Google App Engine is a PaaS offering by Google for the development of web applications. It has many features, such as auto scaling, auto configuration and auto deployment. Following is a presentation introducing Google App Engine. Hope you like it.

    Rajinder Sandhu
    errajindersandhu.blogspot.com

    Friday, February 8, 2013

    Cloud Computing – Not the Technology of the Future

    Hope you liked Cloud Computing - The Start.

    When I started studying the basics of cloud computing about a year ago, all journals, articles and blogs had one thing in common: Cloud Computing is a future-generation technology; some even said it is hypothetical and not going to happen. Actually, it is not a next-generation technology, it is a this-generation technology. All the fog gathered around the question “What is Cloud Computing?” is now clearing and we are getting a clear sky with clouds. This becomes evident from the fact that people are now more interested in the “HOW” rather than the “WHAT” of cloud computing. Let me give you an example. Last week I was interviewed for the post of Assistant Professor at a university. I was expecting questions like: What is cloud computing? What are IaaS, PaaS and SaaS? What are the benefits of cloud computing? Instead, the questions asked were: How do we move the university ERP system to the cloud? How can I change my cloud provider? How do we do desktop virtualization? and many more Hows. I was surprised and excited at the same time. If people at the university level are considering cloud as their IT resource provider, at what level must IT industries be working? In the next paragraph I will show you some stats from the current world of IT.

    Many big names in the IT industry have progressed toward cloud technology. MSB and SSB companies are also moving quickly down the road of cloud computing. A lot of research, both in academia and in industry, is going on in cloud computing. Cloud Computing has held a place in Gartner's top 10 technologies for 4 successive years now (check it out). A report by IDC stated that spending on cloud is 12% of all IT spending around the globe and will reach 20% by the end of 2013. A survey conducted by F5 found that 70% of enterprises have a dedicated budget for cloud computing, and there are many more examples to prove that cloud is actually OUR generation's technology.

    So, if you don’t know what is cloud computing? I am afraid to say, you are 3 years behind the current IT technologies. Tie up your seat belts and start discovering about cloud and its derivatives. I love to discuss hot trends in cloud computing some day.

    Feel free to leave Comments.

    Rajinder Sandhu

    Errajindersandhu.blogspot.com

    Tuesday, February 5, 2013

    Cloud Computing- The Start

    Cloud Computing - 
    It seems everyone is interested in this area, from security to distributed computing. Cloud has taken its place at every researcher's desk across the globe. Many people ask me to explain cloud computing in naïve language. Actually, it is a bit confusing what to say. Let me think!!

    Whenever we draw two or more nodes and show them on a network, say in a client-server architecture, we often draw a cloud between the computers or machines connected across the network.

    So, now this cloud is doing all the computing. It sounds funny, but believe me, this is the simplest way I can explain what cloud computing is.

    Speaking professionally, the idea of cloud computing comes from a basic need of the IT industry. The IT industry needs just the solution, ready to use. It doesn't want to get into the managing and configuring part, because it would have to hire experts for that and of course there is a budget associated with it. So basically the idea was: why not rent services on the net and pay per use, just like taking a cab. This is what cloud enables us to do. We can rent resources: computation power, storage, network, platform, software.
    I am not sure you got a clear idea of cloud computing from this. It is the simplest explanation one can give, but no need to worry; I will explain it in detail in further posts. Till then, bye and take care.

    Rajinder Sandhu
    rajsandhu1989@gmail.com