Friday, August 30, 2019

Hadoop vs Spark tech stack options

Workload | Hadoop | Spark
Batch | MapReduce, Hive, Pig | Spark Core
Realtime | Apache Storm (Nimbus, Supervisor, Topology (Spout & Bolt)) | Spark Streaming
Machine learning | Mahout | Spark MLlib
Graph data processing | Giraph | Spark GraphX
Interactive queries | Hive | Spark SQL

Friday, June 28, 2019

Python NumPy AWS Lambda


In my AWS account, I created a Python 3.7 Lambda function from a deployment package that bundles the NumPy library, following the steps below [5]:

1) I launched an EC2 instance from the Amazon Linux AMI, the same one the AWS Lambda Python 3.7 runtime uses [3], so that NumPy's compiled binaries match the Lambda environment

2) I updated the instance with "sudo yum update -y" and then installed Python 3.7 following the steps in [6]:

               $ python3 --version
               Python 3.7.0

               $ pip3 --version
               pip 10.0.1 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)

3) Created a new directory

               $ mkdir my_package_dir
               $ cd my_package_dir

4) Installed NumPy

               $ pip3 install numpy -t .

               # Successfully installed numpy-1.16.4

5) My Lambda function handler is

               $ nano lambda_function.py

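               import numpy as np

               def lambda_handler(event, context):
                   # Build a 3x5 array to confirm that NumPy loads inside Lambda
                   a = np.arange(15).reshape(3, 5)
                   # Return a plain nested list so the result is JSON-serializable
                   return a.tolist()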

6) Finally, I created a zip file containing the NumPy library and the handler file with my function's code

               $ zip -r9 ../function.zip .

I then used "function.zip" as the deployment package for my Python 3.7 Lambda function and confirmed that it works as expected. The same steps work with Python 3.6, which is included in the yum repository and can be installed with "sudo yum install -y python36"; use pip-3.6 instead of pip3, and all other steps are the same.
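For anyone who prefers to script step 6, here is a minimal sketch of the same packaging step using only Python's standard library. The function name build_deployment_package is hypothetical (it is not part of any AWS tooling), and the sketch assumes the zip file is written outside the package directory, as "../function.zip" is in the command above. Unlike "zip -r9", it stores file entries only and skips empty directories.

```python
import os
import zipfile


def build_deployment_package(package_dir, zip_path):
    """Zip every file under package_dir, storing paths relative to it.

    Hypothetical helper for illustration; zip_path should live outside
    package_dir so the archive does not try to include itself.
    """
    with zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Keep lambda_function.py and numpy/ at the archive root,
                # which is where the Lambda runtime looks for them.
                arcname = os.path.relpath(full_path, package_dir)
                zf.write(full_path, arcname)
    return zip_path
```

Calling build_deployment_package("my_package_dir", "function.zip") from the parent directory should produce an archive with the same layout as the zip command in step 6.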

Wednesday, May 22, 2019

Terraform

$ terraform
Usage: terraform [--version] [--help] <command> [args]

The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.

Common commands:
    apply              Builds or changes infrastructure
    console            Interactive console for Terraform interpolations
    destroy            Destroy Terraform-managed infrastructure
    fmt                Rewrites config files to canonical format
    get                Download and install modules for the configuration
    graph              Create a visual graph of Terraform resources
    import             Import existing infrastructure into Terraform
    init               Initialize a new or existing Terraform configuration
    output             Read an output from a state file
    plan               Generate and show an execution plan
    providers          Prints a tree of the providers used in the configuration
    push               Upload this Terraform module to Terraform Enterprise to run
    refresh            Update local state file against real resources
    show               Inspect Terraform state or plan
    taint              Manually mark a resource for recreation
    untaint            Manually unmark a resource as tainted
    validate           Validates the Terraform files
    version            Prints the Terraform version
    workspace          Workspace management

All other commands:
    debug              Debug output management (experimental)
    force-unlock       Manually unlock the terraform state
    state              Advanced state management

AWS X-Ray SDK

Installing

The AWS X-Ray SDK for Node.js is compatible with Node.js version 4 and later. The AWS X-Ray SDK for Node.js has been tested with versions 4.x through 12.x of Node.js. There may be issues when running on versions of Node.js newer than 12.x.
The SDK is available from NPM. For local development, install the SDK in your project directory with npm.

               $ npm install aws-xray-sdk

Use the --save option to save the SDK as a dependency in your application's package.json.

               $ npm install aws-xray-sdk --save

Thursday, May 9, 2019

DynamoDB Partitions links


1) Partitions and Data Distribution: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.Partitions.html

2) DynamoDB streams: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

3) DynamoDB streams Boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodbstreams.html

4) Queries with DynamoDB: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html 

Tuesday, May 7, 2019

Kafka vs AWS Kinesis

Feature | Apache Kafka | Amazon Kinesis
Developed/Hosted By | LinkedIn | Amazon
Software | Open-Source | Proprietary
SDK Support | Kafka SDK supports Java | AWS SDK supports Android, Java, Go, .NET
Configuration & Features | More control over configuration and better performance | Only the number of days/shards can be configured
Data Stored In | Kafka Partition | Kinesis Shard
Reliability | Replication factor can be configured | Kinesis writes synchronously to 3 different machines/data centers
Performance | Kafka wins | Kinesis writes each message synchronously to 3 different machines
Configuration Store | Apache ZooKeeper | Amazon DynamoDB
Setup | Weeks | A couple of hours
Data Retention | Configurable | 7 days at most
Log Compaction | Supported | Can only store logs for 7 days
Processing Events | More than 1000s of events/sec | At most 1000s of events/sec
Checkpointing | Offsets stored in a special topic | DynamoDB
Ordering | Partition level | Shard level
Human Costs | Requires human support for installing and managing clusters, and for meeting requirements such as high availability, durability, and recovery | Kinesis is just pay and use
Producer Throughput | Kafka wins | Kinesis is a bit slower than Kafka
Incident Risk/Maintenance | More in Kafka | Amazon takes care of it
Ordered sequence of immutable data records | Kafka Topic | Kinesis Stream
Each record has a unique number called | Offset number | Sequence number
Concepts | Kafka Streams | Kinesis Analytics
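The "Ordering" and "Each record has a unique number" rows are easier to picture with a toy model. The sketch below is an in-memory illustration only: the ToyLog class and its methods are hypothetical, it is not a Kafka or Kinesis client, and hashing the key with MD5 modulo the partition count is an assumption made for the example, not either system's actual partitioner. It shows that records with the same key land in the same partition (shard) and receive increasing per-partition positions, which is why ordering holds only within a partition or shard.

```python
import hashlib
from collections import defaultdict


class ToyLog:
    """Toy partitioned log standing in for a Kafka topic or Kinesis stream."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # partition id -> ordered list of (key, value) records
        self.partitions = defaultdict(list)

    def partition_for(self, key):
        # Assumption for illustration: hash the key and take it modulo
        # the partition count, so equal keys always map to one partition.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.num_partitions

    def append(self, key, value):
        partition = self.partition_for(key)
        records = self.partitions[partition]
        records.append((key, value))
        # The position within the partition plays the role of the
        # Kafka offset / Kinesis sequence number.
        return partition, len(records) - 1

    def read(self, partition, start_offset=0):
        # Records come back in the order they were appended.
        return list(self.partitions[partition][start_offset:])


log = ToyLog(num_partitions=3)
placements = [log.append("user-1", event) for event in ("login", "click", "logout")]
```

All three "user-1" events share one partition and are read back in append order; events for different keys may land in different partitions, and nothing in the model orders them relative to each other.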

Saturday, January 19, 2019

Misc



Think globally and act locally

Categorize and Conquer

Customer-centric, business-driven architecture.

Career + Impact
