Quotes

Tuesday, April 28, 2020

Spark - Databricks

Apache Spark is a sophisticated distributed computation framework for executing code in parallel across many different machines

Driver: The driver is the machine on which the Spark application runs. It is responsible for three main things:

1) Maintaining information about the Spark application.

2) Responding to the user's program.

3) Analyzing, distributing, and scheduling work across the executors.


Task: Tasks are created by the driver, and each is assigned a partition of data to process.
A partition is a collection of rows that sits on a physical machine in your cluster. Tasks are assigned to slots for parallel execution; once started, each task fetches data from its source.

Slots: Spark parallelizes at two levels. One is splitting the work among executors; the other is the slot. Each executor has a number of slots, and each slot can be assigned a task.
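
As a rough sketch of how executors and slots relate (the app name, the numbers, and the cluster manager are assumptions, not anything fixed):

from pyspark.sql import SparkSession

# 4 executors x 4 cores each = 16 slots, so up to 16 tasks can run in parallel.
spark = (
    SparkSession.builder
    .appName("slots-sketch")                   # hypothetical app name
    .config("spark.executor.instances", "4")   # number of executors
    .config("spark.executor.cores", "4")       # cores (slots) per executor
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)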

Job: The secret to Spark's performance is parallelism. Each parallelized action is called a job. Each job is broken down into stages, which are sets of ordered steps that together accomplish the job.

Executor: Each executor holds a chunk of data, and that chunk is called a partition. Executors are responsible for carrying out the work assigned by the driver program. They are responsible for two things:
1) Executing the code assigned by the driver.
2) Reporting the state of the computation back to the driver.


DataFrames: Tables are equivalent to Apache Spark DataFrames. A DataFrame is an immutable, distributed collection of data organized into named columns. In concept, a DataFrame is equivalent to a table in a relational database, except that DataFrames carry important metadata that allows Spark to optimize queries.
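
For example, a small DataFrame can be created directly from rows (a minimal sketch using the spark session from the sketch above; Databricks notebooks also provide one named spark, and the data and column names here are made up):

# Named columns, distributed data, and immutability: every transformation
# returns a new DataFrame rather than modifying this one.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 28)],
    ["name", "age"],
)
df.printSchema()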

Lazy Evaluation: Lazy Evaluation refers to the idea that Spark waits until the last moment to execute a series of operations. Instead of modifying the data immediately when you express some operation, you build up a plan of transformations that you will apply to your source data. That plan is executed when you call an action. 

Lazy evaluation makes it easier to parallelize operations and allows Spark to apply various optimizations. 

Transformations: 
Transformations are at the core of how you express your business logic in Spark. They are the instructions you use to modify a DataFrame to get the results that you want. We call them lazy because they will not be completed at the time you write and execute the code in a cell  - they will only get executed once you have called an action. 

There are two types of transformations: Narrow and Wide.

For narrow transformations, the data required to compute the records in a single partition reside in at most one partition of the parent dataset. 
For wide transformations, the data required to compute the records in a single partition may reside in many partitions of the parent dataset. 




Remember, spark partitions are collections of rows that sit on physical machines in the cluster. Narrow transformations mean that work can be computed and reported back to the executor without changing the way data is partitioned over the system. Wide transformations require that data be redistributed over the system. This is called a shuffle. 

Shuffles are triggered when data needs to move between executors. 
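
As a hedged illustration using the hypothetical df defined earlier: a filter is a narrow transformation because each output partition depends on at most one input partition, while a groupBy is a wide transformation that forces a shuffle.

from pyspark.sql import functions as F

narrow = df.filter(F.col("age") > 30)   # narrow: no data moves between partitions
wide = df.groupBy("name").count()       # wide: rows with the same key must be co-located,
                                        # so Spark redistributes (shuffles) the data
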
Actions: Actions are statements that are computed AND executed when they are encountered in the developer's code. They are not postponed, nor do they wait for other code constructs. While transformations are lazy, actions are eager.
Example commands that plan transformations or trigger actions:
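
A small sketch (again using the hypothetical df above): the first two lines only build up the plan, and nothing runs until an action such as count or show is called.

filtered = df.where("age > 30")       # transformation: lazily recorded in the plan
projected = filtered.select("name")   # transformation: still nothing has executed

projected.count()                     # action: triggers a job that executes the plan
projected.show()                      # action: another job, executed eagerly
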
Pipelining: Lazy evaluation allows Spark to optimize the entire pipeline of computations as opposed to the individual pieces. This makes it exceptionally fast for certain types of computation because it can perform all relevant computations at once. Technically speaking, Spark pipelines this computation: certain computations can all be performed at once (like a map and a filter), rather than having to do one operation for all pieces of data and then the following operation. Additionally, Apache Spark can keep results in memory, as opposed to other frameworks that immediately write to disk after each task.
Source: A Gentle Introduction to Apache Spark on Databricks
Catalyst Optimizer: The Catalyst Optimizer is at the core of Spark SQL's power and speed. It automatically finds the most efficient plan for applying your transformations and actions. 
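
You can ask Spark to print the plan Catalyst produces. For example, with the hypothetical df above:

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.where("age > 30").select("name").explain(extended=True)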





In applications that reuse the same datasets over and over, one of the most useful optimizations is caching. Caching will place a DataFrame or table into temporary storage across the executors in your cluster and make subsequent reads faster. 

The use case for caching is simple: as you work with data in Spark, you will often want to reuse a certain dataset. It is important to be careful how you use caching, because it is an expensive operation itself. If you're only using a dataset once, for example, the cost of pulling and caching it is greater than that of the original data pull. 
Once data is cached, the Catalyst optimizer will only reach back to the location where the data was cached. 
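
A minimal sketch of the caching pattern, assuming the hypothetical df above is reused several times:

df.cache()       # marks the DataFrame for caching; nothing is stored yet (lazy)
df.count()       # the first action materializes the cache across the executors
df.count()       # later actions read from the cached copy instead of the source

df.unpersist()   # release the cached data when it is no longer needed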


Saturday, April 25, 2020

Azure concepts

An availability set is a logical grouping of two or more VMs that helps keep your application available during planned or unplanned maintenance.
It is used for high availability of services.

Scale sets: auto-scale to keep up with demand.

Batch: large job processing.


Containers are lighter weight than VMs because containers run on the host OS with only their dependencies, instead of a full guest OS on top of the host OS as in VMs.


VMs virtualize hardware (RAM, disk space, etc.).

Containers virtualize the OS.



Friday, April 24, 2020

Azure

Cloud Computing:

Cloud computing is like renting services, such as storage or CPU cycles, on another company's computers. You only pay for what you use.

Compute,
Storage,
Network,
Analytics.

Its aim is to make running businesses easier and more efficient.







Benefits:

It's cost-effective

Cloud computing provides a pay-as-you-go or consumption-based pricing model.


CapEx, OpEx



Public cloud

This is the most common deployment model. In this case, you have no local hardware to manage or keep up-to-date – everything runs on your cloud provider's hardware. In some cases, you can save additional costs by sharing computing resources with other cloud users.
Businesses can use multiple public cloud providers of varying scale. Microsoft Azure is an example of a public cloud provider.

Advantages

  • High scalability/agility – you don't have to buy a new server in order to scale
  • Pay-as-you-go pricing – you pay only for what you use, no CapEx costs
  • You're not responsible for maintenance or updates of the hardware
  • Minimal technical knowledge to set up and use - you can leverage the skills and expertise of the cloud provider to ensure workloads are secure, safe, and highly available
A common use case scenario is deploying a web application or a blog site on hardware and resources that are owned by a cloud provider. Using a public cloud in this scenario allows cloud users to get their website or blog up quickly, and then focus on maintaining the site without having to worry about purchasing, managing or maintaining the hardware on which it runs.

Disadvantages

Not all scenarios fit the public cloud. Here are some disadvantages to think about:
  • There may be specific security requirements that cannot be met by using public cloud
  • There may be government policies, industry standards, or legal requirements which public clouds cannot meet
  • You don't own the hardware or services and cannot manage them as you may want to
  • Unique business requirements, such as having to maintain a legacy application might be hard to meet


Private cloud

In a private cloud, you create a cloud environment in your own datacenter and provide self-service access to compute resources to users in your organization. This offers a simulation of a public cloud to your users, but you remain completely responsible for the purchase and maintenance of the hardware and software services you provide.

Advantages

This approach has several advantages:
  • You can ensure the configuration can support any scenario or legacy application
  • You have control (and responsibility) over security
  • Private clouds can meet strict security, compliance, or legal requirements

Disadvantages

Some reasons teams move away from the private cloud are:
  • You have some initial CapEx costs and must purchase the hardware for startup and maintenance
  • Owning the equipment limits the agility - to scale, you must buy, install, and set up new hardware
  • Private clouds require IT skills and expertise that's hard to come by
A use case scenario for a private cloud would be when an organization has data that cannot be put in the public cloud, perhaps for legal reasons. An example scenario may be where government policy requires specific data to be kept in-country or privately.
A private cloud can provide cloud functionality to external customers as well, or to specific internal departments such as Accounting or Human Resources.


Hybrid cloud

A hybrid cloud combines public and private clouds, allowing you to run your applications in the most appropriate location. For example, you could host a website in the public cloud and link it to a highly secure database hosted in your private cloud (or on-premises datacenter).

This is helpful when you have some things that cannot be put in the cloud, maybe for legal reasons. For example, you may have some specific pieces of data that cannot be exposed publicly (such as medical data) which needs to be held in your private datacenter. Another example is one or more applications that run on old hardware that can't be updated. In this case, you can keep the old system running locally, and connect it to the public cloud for authorization or storage.

Advantages

Some advantages of a hybrid cloud are:
  • You can keep any systems running and accessible that use out-of-date hardware or an out-of-date operating system
  • You have flexibility with what you run locally versus in the cloud
  • You can take advantage of economies of scale from public cloud providers for services and resources where it's cheaper, and then supplement with your own equipment when it's not
  • You can use your own equipment to meet security, compliance, or legacy scenarios where you need to completely control the environment

Disadvantages

Some concerns you'll need to watch out for are:
  • It can be more expensive than selecting one deployment model since it involves some CapEx cost up front
  • It can be more complicated to set up and manage



Scale refers to adding network bandwidth, memory, storage, or compute power to achieve better performance.



Saturday, April 18, 2020

FAQ's

How Do I Troubleshoot SSH Connectivity Errors?

When you're experiencing an SSH connectivity error, there are a few steps you can take to troubleshoot it depending on the cause. Here are some tips for troubleshooting the reasons for a Connection refused error (a quick port-reachability check is also sketched after this list):
  • If your SSH service is down. Contact your hosting provider to see why your SSH service isn’t running. For localhost or dedicated servers, you can use the command sudo service ssh restart to try to get it running again.
  • If you entered the wrong credentials. Once you’ve double-checked the SSH port using the grep Port /etc/ssh/sshd_config command, try connecting again with the correct details.
  • If your SSH port is closed. This is usually a side effect of one of the two reasons listed below. Either install an SSH daemon on the server you want to connect to or change your firewall rules to accept connections to your SSH port.
  • If SSH isn’t installed on your server. Install an SSH tool such as OpenSSH on the server you want to connect to using the sudo apt install openssh-server command.
  • If your firewall is blocking your SSH connection. Disable the firewall rules blocking your SSH connection by changing the destination port’s settings to ACCEPT.
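
As a quick, hedged check that the SSH port is reachable at all (the host and port below are placeholders), a few lines of Python can attempt a plain TCP connection:

import socket

host, port = "example.com", 22   # placeholders: substitute your server and SSH port

try:
    # Success only proves the port is open, not that SSH authentication will work.
    with socket.create_connection((host, port), timeout=5):
        print(f"Port {port} on {host} is reachable")
except OSError as err:
    print(f"Connection failed: {err}")   # e.g. 'Connection refused' or a timeout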

LDAP/AD
Nov 13, 2017
Is there any way to install a package other than with apt-get install?

Friday, April 17, 2020

Firewall and Misc

A firewall is a set of rules to restrict IPs, ports, domain names, strings, etc. from the public network.


Hardware + Software firewall in general

Host-specific firewalls come along with the OS, like Windows Firewall.



If RDS can't be accessed:

1) Check security group inbound rules.
2) Check the DNS name using dig, nslookup, nc, etc., or digwebinterface.com.
3) Check memory on the EC2 instance, or ports, credentials, etc.


We need private subnets to save public IPs, as those are limited, and for security reasons too.


Private subnets: the route tables of the private subnet point to a NAT instance hosted in the public subnet, while the route tables of the public subnet are mapped to an internet gateway.





Friday, April 10, 2020

chmod 777 hdfs commands 2

chmod <Owner><Group><Others>
chmod 777

4 - read
2 - write
1 - execute
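
For instance, each octal digit is the sum of the read (4), write (2), and execute (1) bits. The hypothetical helper below decodes a mode such as 777 or 754 into the familiar rwx string:

def decode(mode: str) -> str:
    """Turn an octal chmod mode such as '754' into 'rwxr-xr--'."""
    bits = ""
    for digit in mode:           # one digit each for owner, group, others
        n = int(digit)
        bits += "r" if n & 4 else "-"
        bits += "w" if n & 2 else "-"
        bits += "x" if n & 1 else "-"
    return bits

print(decode("777"))   # rwxrwxrwx  (7 = 4 + 2 + 1)
print(decode("754"))   # rwxr-xr--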


1) Version Check

To check the version of Hadoop.
ubuntu@ubuntu-VirtualBox:~$ hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /home/ubuntu/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar

2) list Command

List all the files/directories for the given hdfs destination path.
ubuntu@ubuntu-VirtualBox:~ $ hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /usr

3) df Command

Displays free space at the given HDFS destination.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -df hdfs:/
Filesystem                Size   Used  Available  Use%
hdfs://master:9000  6206062592  32768  316289024    0%

4) count Command

  • Count the number of directories, files and bytes under the paths that match the specified file pattern.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -count hdfs:/
4            0                  0 hdfs:///

5) fsck Command

HDFS Command to check the health of the Hadoop file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs fsck /
Connecting to namenode via http://master:50070/fsck?ugi=ubuntu&path=%2F
FSCK started by ubuntu (auth:SIMPLE) from /192.168.1.36 for path / at Mon Nov 07 01:23:54 GMT+05:30 2016
Status: HEALTHY
Total size:           0 B
Total dirs:           4
Total files:          0
Total symlinks:                 0
Total blocks (validated): 0
Minimally replicated blocks:        0
Over-replicated blocks:  0
Under-replicated blocks:              0
Mis-replicated blocks:                   0
Default replication factor:            2
Average block replication:            0.0
Corrupt blocks:                0
Missing replicas:                             0
Number of data-nodes:                1
Number of racks:                            1
FSCK ended at Mon Nov 07 01:23:54 GMT+05:30 2016 in 33 milliseconds
The filesystem under path '/' is HEALTHY

6) balancer Command

Run a cluster balancing utility.
ubuntu@ubuntu-VirtualBox:~$ hdfs balancer
16/11/07 01:26:29 INFO balancer.Balancer: namenodes  = [hdfs://master:9000]
16/11/07 01:26:29 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
16/11/07 01:26:38 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.1.36:50010
16/11/07 01:26:38 INFO balancer.Balancer: 0 over-utilized: []
16/11/07 01:26:38 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
7 Nov, 2016 1:26:38 AM            0                  0 B                 0 B               -1 B
7 Nov, 2016 1:26:39 AM   Balancing took 13.153 seconds

7) mkdir Command

HDFS Command to create the directory in HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -mkdir /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:29 /hadoop
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:26 /system
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /usr

8) put Command

File
Copy file from single src, or multiple srcs from local file system to the destination file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -put test /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 1 items
-rw-r--r--   2 ubuntu supergroup         16 2016-11-07 01:35 /hadoop/test
Directory
HDFS Command to copy directory from single source, or multiple sources from local file system to the destination file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -put hello /hadoop/
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 2 items
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:43 /hadoop/hello
-rw-r--r--   2 ubuntu supergroup         16 2016-11-07 01:35 /hadoop/test

9) du Command

Displays the size of files and directories contained in the given directory, or the size of a file if it's just a file.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -du /
59  /hadoop
0   /system
0   /test
0   /tmp
0   /usr

10) rm Command

HDFS Command to remove the file from HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -rm /hadoop/test
16/11/07 01:53:29 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /hadoop/test

11) expunge Command

HDFS Command that makes the trash empty.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -expunge
16/11/07 01:55:54 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

12) rm -r Command

HDFS Command to remove the entire directory and all of its content from HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -rm -r /hadoop/hello
16/11/07 01:58:52 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /hadoop/hello

13) chmod Command

Change the permissions of files.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -chmod 777 /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxrwxrwx   - ubuntu supergroup          0 2016-11-07 01:58 /hadoop
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:26 /system
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:09 /usr

14) get Command

HDFS Command to copy files from hdfs to the local file system.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -get /hadoop/test /home/ubuntu/Desktop/
ubuntu@ubuntu-VirtualBox:~$ ls -l /home/ubuntu/Desktop/
total 4
-rw-r--r-- 1 ubuntu ubuntu 16 Nov  8 00:47 test

15) cat Command

HDFS Command that copies source paths to stdout.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cat /hadoop/test
This is a test.

16) touchz Command

HDFS Command to create a file in HDFS with file size 0 bytes.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -touchz /hadoop/sample
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 2 items
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 00:57 /hadoop/sample
-rw-r--r--   2 ubuntu supergroup         16 2016-11-08 00:45 /hadoop/test

17) text Command

HDFS Command that takes a source file and outputs the file in text format.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -text /hadoop/test
This is a test.

18) copyFromLocal Command

HDFS Command to copy the file from Local file system to HDFS.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -copyFromLocal /home/ubuntu/new /hadoop
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /hadoop
Found 3 items
-rw-r--r--   2 ubuntu supergroup         43 2016-11-08 01:08 /hadoop/new
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 00:57 /hadoop/sample
-rw-r--r--   2 ubuntu supergroup         16 2016-11-08 00:45 /hadoop/test

19) copyToLocal Command

Similar to get command, except that the destination is restricted to a local file reference.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -copyToLocal /hadoop/sample /home/ubuntu/
ubuntu@ubuntu-VirtualBox:~$ ls -l s*
-rw-r--r-- 1 ubuntu ubuntu         0 Nov  8 01:12 sample
-rw-rw-r-- 1 ubuntu ubuntu 102436055 Jul 20 04:47 sqoop-1.99.7-bin-hadoop200.tar.gz

20) mv Command

HDFS Command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -mv /hadoop/sample /tmp
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /tmp
Found 1 items
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 00:57 /tmp/sample

21) cp Command

HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cp /tmp/sample /usr
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /usr
Found 1 items
-rw-r--r--   2 ubuntu supergroup          0 2016-11-08 01:22 /usr/sample

22) tail Command

Displays last kilobyte of the file "new" to stdout
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -tail /hadoop/new
This is a new file.
Running HDFS commands.

23) chown Command

HDFS command to change the owner of files.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -chown root:root /tmp
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -ls /
Found 5 items
drwxrwxrwx   - ubuntu supergroup          0 2016-11-08 01:17 /hadoop
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:26 /system
drwxr-xr-x   - ubuntu supergroup          0 2016-11-07 01:11 /test
drwxr-xr-x   - root   root                0 2016-11-08 01:17 /tmp
drwxr-xr-x   - ubuntu supergroup          0 2016-11-08 01:22 /usr

24) setrep Command

The default replication factor for a file is 3. The HDFS command below is used to change the replication factor of a file.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -setrep -w 2 /usr/sample
Replication 2 set: /usr/sample
Waiting for /usr/sample ... done

25) distcp Command

Copies a directory from one cluster (namenode) to another. Note that distcp runs as a MapReduce job and is invoked with hadoop distcp rather than hdfs dfs:
ubuntu@ubuntu-VirtualBox:~$ hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

26) stat Command

Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner(%u), and modification date (%y, %Y). %y shows UTC date as “yyyy-MM-dd HH:mm:ss” and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -stat "%F %u:%g %b %y %n" /hadoop/test
regular file ubuntu:supergroup 16 2016-11-07 19:15:22 test

27) getfacl Command

Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -getfacl /hadoop
# file: /hadoop
# owner: ubuntu
# group: supergroup

28) du -s Command

Displays a summary of file lengths.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -du -s /hadoop
59  /hadoop

29) checksum Command

Returns the checksum information of a file.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -checksum /hadoop/new
/hadoop/new     MD5-of-0MD5-of-512CRC32C               000002000000000000000000639a5d8ac275be8d0c2b055d75208265

30) getmerge Command

Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
ubuntu@ubuntu-VirtualBox:~$ cat test
This is a test.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -cat /hadoop/new
This is a new file.
Running HDFS commands.
ubuntu@ubuntu-VirtualBox:~$ hdfs dfs -getmerge /hadoop/new test
ubuntu@ubuntu-VirtualBox:~$ cat test
This is a new file.
Running HDFS commands.