Sunday, January 7, 2024

java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException.<init>(Ljava/lang/String;Lscala/Option;Lscala/Option;...)

"""
PYSPARK
Py4JJavaError: An error occurred while calling o560.save.
: java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException.<init>(Ljava/lang/String;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;)V
at org.apac
"""
df.withColumn("key",lit(1)) #Wrong
#lit should be String
df.withColumn("key",lit("a")) #COrrect

Tuesday, December 19, 2023

Tenacity with Asyncio - Defensive Programming

from tenacity import (retry,
                      stop_after_attempt,
                      wait_fixed,
                      retry_if_exception_type,
                      stop_after_delay)
import httpx
import asyncio

# stop after 3 seconds OR 5 attempts (whichever comes first), waiting 2s between tries
@retry(stop=(stop_after_delay(3) | stop_after_attempt(5)),
       wait=wait_fixed(2),
       retry=retry_if_exception_type(Exception))
async def fetch_with_retry():
    print("trying ...")
    async with httpx.AsyncClient() as client:
        resp = await client.get("https://www.youtube.com/aasd")
    if resp.status_code == 404:
        raise Exception("404 Not working")

asyncio.run(fetch_with_retry())

Saturday, October 21, 2023

Python : Find Sum using Parallel Processing with a Common Aggregator

import concurrent.futures
import threading

a = 0                      # shared aggregator
lock = threading.Lock()    # guards updates to the shared aggregator

def fn(i):
    global a
    with lock:             # avoid a race condition on the shared variable
        a = a + i
    print(a, threading.get_native_id())

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(3) as ep:
        ep.map(fn, range(0, 100))
    print("final " + str(a))

Thursday, September 21, 2023

Pyspark : Compression

# %% [markdown]
# Test Sample Details
#
# Format : JSON
# Count = 97
# File Size = 2.8 MB
#
# %% [markdown]
# Compression Codec | Time to Compress | Time to Decompress | Benefits | Drawbacks | Compression Size | Common Use Cases
# ----------------- | ----------------- | ------------------- | ----------------------------------- | ----------------------------------- | ---------------------- | -----------------
# None | Very Fast | Very Fast | No CPU overhead | No compression | No compression | No compression required
# LZ4 | Fast | Fast | Low compression overhead | Not the highest compression ratio | Smaller than original data | Real-time , in-memory storage
# Gzip | Slower | Slower | High compression ratio | High CPU overhead | Smaller than original data | Archiving, storage, long-term backups
# Snappy | Fast | Fast | Low compression overhead | Not as compact as Gzip or Zstd | Smaller than original data | Real-time , columnar storage
# Zstd | Balanced | Balanced | High compression ratio | Higher CPU usage compared to LZ4/Snappy | Smaller than Gzip | Batch processing, data storage
# -----------------------------------------------------------------
#
#
#
# Result
#
# partition = Default
# None (.parquet) = 6.7 MB * 9 files
# lz4 (.lz4.parquet) = 3.6 MB * 9 files
# gzip (.gzip.parquet) = 2.8 MB * 9 files (highest compression)
# snappy (.snappy.parquet) = 3.6 MB * 9 files
# zstd (.zstd.parquet) = 2.8 MB * 9 files
#
#
# Final Verdict :
#
# Performance / Speed Ranking :
# 1. None
# 2. LZ4
# 3. Snappy
# 4. Zstd
# 5. Gzip
#
# Compression Ranking :
# 1. Gzip
# 2. Zstd
# 3. Snappy
# 4. LZ4
# 5. None
# %%
from pyspark.sql import SparkSession
import os , pathlib , shutil
# %%
spark=SparkSession.builder.appName("test").getOrCreate()
# %%
path="/Users/deepakjayaprakash/Downloads/test"
df =spark.read.option("header","true").json(path+"/input")
# %%
NoCompression=path+"/NoCompression"
if os.path.exists(NoCompression): shutil.rmtree(NoCompression)
# Default Parquet
df.write.options(header="True",compression="none").save(NoCompression)
# %%
lz4=path+"/lz4"
if os.path.exists(lz4): shutil.rmtree(lz4)
df.write.options(header="True",compression="lz4").save(lz4)
# %%
gzip=path+"/gzip"
if os.path.exists(gzip): shutil.rmtree(gzip)
df.write.options(header="True",compression="gzip").save(gzip)
# %%
snappy=path+"/snappy"
if os.path.exists(snappy): shutil.rmtree(snappy)
df.write.options(header="True",compression="snappy").save(snappy)
# %%
zstd=path+"/zstd"
if os.path.exists(zstd): shutil.rmtree(zstd)
df.write.options(header="True",compression="zstd").save(zstd)
# %% [markdown]
# Decompression
# %%
spark.read.options(header="true").parquet(NoCompression).count()
# %%
spark.read.options(header="true").parquet(lz4).count()
# %%
spark.read.options(header="true").parquet(gzip).count()
# %%
spark.read.options(header="true").parquet(snappy).count()
# %%
spark.read.options(header="true").parquet(zstd).count()

Tuesday, September 5, 2023

PySpark Cache

# %%
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("performance").getOrCreate()
# %%
import string,random
# %%
rdd_student=spark.sparkContext.parallelize(zip(range(1,27),string.ascii_uppercase))
# %%
studentDF= spark.createDataFrame(rdd_student,["id","name"])
# %% [markdown]
# Without any Performance Enhancement
# %%
rddMaths=spark.sparkContext.parallelize(zip(range(1,27),[random.randint(1,100) for i in range(1,27)]))
mathsDF = spark.createDataFrame(rddMaths,["id","marks"])
studentDF.join(mathsDF,["id"],"leftouter").show()
# %% [markdown]
# With Checkpoint
# %%
rddScience= spark.sparkContext.parallelize(zip(range(1,27),[random.randint(1,100) for i in range(1,27)]))
scienceDF = spark.createDataFrame(rddScience,["id","marks"])
spark.sparkContext.setCheckpointDir("/tmp/spark-temp")
scienceDF=scienceDF.checkpoint()
studentDF.join(scienceDF,["id"],"leftouter").show()
# %% [markdown]
# With Cache
# %%
rddOther= spark.sparkContext.parallelize(zip(range(1,27),[random.randint(1,100) for i in range(1,27)]))
otherDF = spark.createDataFrame(rddOther,["id","marks"]).cache()
studentDF.join(otherDF,["id"],"leftouter").show()
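A couple of follow-up calls that are handy here (standard Spark APIs, not part of the original gist): confirm the DataFrame is cached and release it once it is no longer needed.

# %%
print(otherDF.is_cached)       # True once .cache() has been called
otherDF.count()                # an action materializes the cache
otherDF.unpersist()            # release the cached data
spark.catalog.clearCache()     # or drop every cached table/DataFrame at once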

Saturday, September 2, 2023

About Me


TriggerBlock is a cool tech blog that's all about helping you out with tutorials, problem-solving, hacks, and quick fixes for different coding languages and techie stuff like Java, Python, Scala, Kubernetes, and more. They've got plenty of articles on testing too. Check it out!

Thursday, March 16, 2023

Configure Pyspark

 Pyspark


Pre-Req:

  • Install Python 3.9
  • Find the location of python ($ which python) and keep it handy
  • pip3 install ipython #optional
  • pip3 install pyspark
  • Download the Apache Spark zip > unzip it to a path

Steps:

  • Create file ~/.bash_profile

  • Add Below Contents

      export PYSPARK_PYTHON=python3
      export PYTHONPATH="Python location"
      export PYSPARK_PATH="../spark-x.x.x-bin-hadoop3/bin/pyspark"
      alias pyspark=$PYSPARK_PATH
      export PATH=$PATH:$PYSPARK_PATH
    
      #optional
      alias ipython='python3 -m IPython' 
    

    Example :

      export PYSPARK_PYTHON=python3
      export PYTHONPATH="/Users/deepakjayaprakash/Library/Python/3.9/bin/python3"
      export PYSPARK_PATH="/Users/deepakjayaprakash/Downloads/spark-3.3.2-bin-hadoop3/bin/pyspark"
      alias pyspark=$PYSPARK_PATH
      export PATH=$PATH:$PYSPARK_PATH
    
      #optional
      alias ipython='python3 -m IPython' 
    
  • Save File

  • source ~/.bash_profile
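
  • Verify (a quick sanity check, assuming the pyspark alias resolves):

      $ pyspark
      >>> spark.version
      >>> spark.range(5).show()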


Thursday, February 16, 2023

Versioning

Versioning

  1. Calendar Versioning
  2. Semantic Versioning

Calendar Versioning

https://calver.org/

  • Ubuntu 16.04 = Ubuntu April 2016 (YY.MM)
  • Pycharm 2022.3.2

Semantic Versioning

https://semver.org/

MAJOR_VERSION.MINOR_VERSION.PATCH_VERSION Ex:5.4.2 = Major version 5, Minor version 4, Patch version 2.

Patch version = bug fixes; Minor version = new, backward-compatible functionality; Major version = breaking changes are introduced.

Notes:

  1. Initial version of development 0.1.0 (early development ).
  2. Once the public API is stable, then a release to version 1.0.0

Pre-Release /Beta

  1. If a library was in version 2.8.9
  2. If there is a plan to release beta for 3.0.0
  3. The beta version release will be 3.0.0-beta.1

Wednesday, January 18, 2023

ETL : Delta Lake

Delta Lake 



Issues in Spark :

  • Cannot update / change data
  • No schema enforcement
  • No delta load
  • Data can get messed up during overwrite

Advantages of Delta Lake

  • Delete / Update records (Upsert = Update + Insert)
  • Schema Enforcement
  • Time Travel Capabilities

Pre-Req:

  • spark = 3.3.2

  • pip3 install pyspark==3.3.2

      import findspark
      findspark.init('/Downloads/spark-3.3.2-bin-hadoop3/')
      
      pyspark --packages io.delta:delta-core_2.12:2.2.0 \
      --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
      --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
    

    or

      from pyspark.sql import SparkSession
    
      spark = SparkSession.builder.appName("test")\
      .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")\
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
      .getOrCreate()
    
      spark.range(1,10).write.format('delta').save('test_table')
    

Schema Enforcement

from pyspark.sql.functions import col
path="/tmp/deltatest"
spark.range(1,5).coalesce(1).write.format("delta").mode("overwrite").save(path)
new_data = spark.range(5,10).withColumn("id",col("id").cast("String"))

# Schema change caught error
new_data.coalesce(1).write.format("delta").mode("append").save(path)
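
If the schema change is intentional, Delta can be told to evolve instead of failing; a minimal sketch using the standard mergeSchema / overwriteSchema write options (not part of the original snippet):

from pyspark.sql.functions import lit

# mergeSchema lets an append add brand-new columns to the table schema
spark.range(5, 10).withColumn("extra", lit(1)) \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)

# a changed column type (as in new_data above) needs a full overwrite with overwriteSchema
new_data.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true").save(path)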

Deletion and Update ( Deletes and Update Directly from Source )

from delta.tables import *
delta_df = DeltaTable.forPath(spark, path)
delta_df.delete("id=2")#auto refresh 
delta_df.update(condition = "id = 5", set = { "id": "500" })
delta_df.toDF().show()  

Merge (source is any , target is delta)

deltaTable = DeltaTable.forPath(spark, path)
newData = spark.range(0, 20)
deltaTable.alias("oldData") \
  .merge(newData.alias("newData"),"oldData.id = newData.id") \
  .whenMatchedUpdate(set = { "id": col("newData.id") }) \
  .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
  .execute()	  

Delta- WaterMark Logic

# deltaColumn = "UPDATE_TS"; readDelta() is a user-defined helper that reads the Delta table
deltaDF = readDelta(sourceSubFolder, "latest")
maxTS = deltaDF.selectExpr(f"max({deltaColumn}) as maxTS").first()["maxTS"]
resultDF = srcDF2.filter(f"{deltaColumn} > '{maxTS}'")
if resultDF.count() == 0: exit(1)
resultDF.write.option("mergeSchema", "true").format("delta").mode("append").save(targetPath)

Upsert

from pyspark.sql.functions import sha2, concat_ws
df = spark.range(1,5).withColumn("sha", sha2(concat_ws("_", *["id"]), 256))
if( fullLoad == True):
	df.coalesce(1).write.format("delta").mode("overwrite").save(path)
else:
	deltaTable = DeltaTable.forPath(spark, path)
	deltaTable.alias("oldData") \
	.merge(df.alias("newData"),"oldData.sha = newData.sha") \
	.whenMatchedUpdateAll()\
	.whenNotMatchedInsertAll()\
	.execute()	

Time Travel

spark.read.format("delta").option("versionAsOf",1).load("/mnt/data/delta")
spark.read.format("delta").option("timestampAsOf", "2019-09-22 00:57:00").load("/tmp/stars")

Utils

delta_df.history(1).show() #last operation	
delta_df.vacuum()     # recursively delete files and directories more than 7 days old
delta_df.vacuum(100)  # vacuum files not required by versions more than 100 hours old

Tuesday, June 21, 2022

Azure Synapse

Azure Synapse 


Synapse

Note :

  • For JSON files make sure all new lines and blank spaces are removed.
  • Dataflow, Notebook etc. names have to start with a letter.
  • Blob = flat namespace; Gen2 = directories
  • Azure Active Directory (Entra ID) : used to create one functional ID (secret) called a Service Principal that applications use to communicate with each other
  • SP (Service Principal) : managed by the user, i.e., renewal etc.
  • Managed Identity (a type of service principal) : managed by Azure

DATA TAB ( Lake Database , SQL Database , Linked )

  • Workspace : Based on Azure Serverless (Also Called OnDemand )
    • Lake Database: Spark to Query Data
    • SQL Database: Serverless to query Data
  • Linked

Workspace ( Source is Gen2/Blob Storage )

  • Source is Gen2/Blob Storage
  • This comes default with Synapse
  • Data is not in table format but processed as tables in Runtime
  1. Lake Database:
    • Only Default Gen2 Location which comes along with Synapse to Create Database
    • Uses Templates
    • You can create SQL-like tables, add schema to tables, define column datatypes and add foreign keys over the underlying Gen2 storage
    • Can create above schema against CSV , Parq , json etc
  • Steps:
    • Open NoteBook
     %%sql
     Create Database DB;
     Use DB;
     create table T (id Int, name String);
     insert into T values(1,'A');
    
    • Open gen2 in Azure > create a folder "Customer" and upload csv file
    • Open synapse > Data > "+" > New Lake Database > Enter Name , linkedService ,above Folder path and Data Format > "+" to add table and Schema of Csv
    • Deploy (publish)
    • You should be able to read it with SQL / Notebooks
  2. SQL Database (Using SQL scripts)
    Operations :
    • read : OpenrowSet
    • write : CETAS , CET
-- CET : Create External Table
CREATE EXTERNAL TABLE T (c1 int) WITH (DATA_SOURCE = ds, LOCATION = 'path', FILE_FORMAT = csvFormat);

-- CETAS : Create External Table As Select (external tables cannot be altered afterwards)
CREATE EXTERNAL TABLE T2 WITH (DATA_SOURCE = ds, LOCATION = 'path2', FILE_FORMAT = csvFormat) AS SELECT * FROM T;

-- ad-hoc read
SELECT * FROM OPENROWSET(BULK 'path', DATA_SOURCE = 'ds', FORMAT = 'CSV') AS rows;

Example:

  • Open SQL script or connect to Serverless using DBeaver (Serverless endpoint in Azure Portal > Synapse Overview Page)

  • run "create database db"

  • Go to SQL Script > Create new Script

  • run " use db"

      CREATE EXTERNAL DATA SOURCE ParentFolder WITH (LOCATION='https://path.dfs.core.windows.net');
      CREATE EXTERNAL FILE FORMAT myFormat WITH (FORMAT_TYPE=DELTA);
      SELECT * FROM OPENROWSET(BULK '/Country', DATA_SOURCE='ParentFolder', FORMAT='DELTA') AS D
    
  • This DB should be visible along with External Source and File Formats

1. Select from Table
IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = 'Test')
	BEGIN
		CREATE DATABASE Test;
		CREATE EXTERNAL DATA SOURCE ParentFolder WITH (TYPE = HADOOP, LOCATION='https://path.dfs.core.windows.net/temp/');
		CREATE EXTERNAL FILE FORMAT PARQUET WITH (FORMAT_TYPE=PARQUET);
	END
GO
USE Test;
SELECT * FROM OPENROWSET(BULK 'subFolder/1.parquet', FORMAT='PARQUET', DATA_SOURCE='ParentFolder') AS rows

Note: Support JSON , CSV etc.,

2. Create new Table from Existing Table in Blob (destination = CSV , PARQUET formats )
CREATE EXTERNAL TABLE T 
WITH (LOCATION='subFolder/',DATA_SOURCE=ParentFolder,FILE_FORMAT=myFormat) 
AS 
SELECT * FROM OPENROWSET(BULK 'subFolder2/', DATA_SOURCE='ParentFolder', FORMAT='PARQUET') AS rows --src
GO;

DROP EXTERNAL DATA SOURCE DS;
DROP EXTERNAL FILE FORMAT myFormat;
3. Create View (CVAS)
 CREATE VIEW V AS SELECT * FROM OPENROWSET(BULK 'https://blob.parquet', FORMAT='PARQUET') AS rows;
 SELECT * FROM V;
4. Call SQL Script in Pipeline
To use in a pipeline:
- Add a Web Activity in the pipeline
- Call the current Synapse workspace: GET https://myworkspace.dev.azuresynapse.net/sqlScripts/{sqlScriptName}?api-version=2020-12-01
- Drill down to the required SQL script
5.Cosmos
SELECT * FROM OPENROWSET('CosmosDB','Account=my-cosmos-db;Database=my-db;Key=k',[my-container]) AS T

Linked

  • When you create a linked service and publish it, the icon should appear here.
  • Only connected services like Containers (Gen2 / Blob), Files, Queues, Tables, Cosmos

Dedicated SQL (Compute Power + Storage Space) -

  1. Relational DB
  2. Uses distribution to store data; data is not stored in one table, i.e., Control Node = Compute Node 1 + Compute Node 2 .....
  3. Greater than 1TB
  4. Max 60 Nodes
  5. DW5000c = 5000/500 = 10 nodes
  6. Unique constraints are not supported (i.e., duplicates need to be removed by other logic)
  7. Can Create Table in SQL DB
  8. High Cost
  9. Main Concepts: DW = Node1 (Dist1 = partition1, p2 ..) + Node2 (Dist2 = partition1, p2 ..) + .... + Node 60
    • Distribution :
      • Hash :
        • Data is distributed based on the hash result
        • Large tables
        • Better performance for joins that use the hash column
      • Round Robin:
        • Small tables
        • Even distribution
        • Bad performance for joins as there is no key column
      • Replicate : Clone data to all nodes
    • Index
      • Non-Clustered Column : for Nvarchar(max), varchar(max), Binary
      • Clustered ColumnStore : Recommended
    • Partition : Data is distributed across many partition types
  10. Creating Tables
    • Normal : create table t(i int) with (distribution = Hash(C), clustered columnstore index, partition = C)
    • COPY INTO : COPY INTO T FROM 'https://xxx/*.parq' WITH (FILE_TYPE = 'PARQUET', MAXERRORS = 0, IDENTITY_INSERT = 'OFF');
    • CTAS : Create table T With (...) As select * from T2
  11. Recommended :
    • Insert : RoundRobin
    • Fact : Cluster Column Store + Hash Distribution
    • Dim : Replicated

NoteBooks link

df.createOrReplaceTempView("T")
%%sql
select * from T

Delta Catalog Tables

df.write.format("delta").saveAsTable("MyManagedTable")
df.write.format("delta").option("path", "/mydata").saveAsTable("MyExternalTable")
spark.sql("CREATE TABLE MyExternalTable USING DELTA LOCATION '/mydata'")
%%sql

CREATE TABLE MyExternalTable
USING DELTA
LOCATION '/mydata'

Delta

SELECT * FROM OPENROWSET( BULK  'https://mystore.dfs.core.windows.net/folder/',FORMAT = 'DELTA' ) AS T

Connection String from Linked Service

from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
accountKey = token_library.getConnectionStringAsMap("<LINKED SERVICE NAME>").get("<KEY NAME>")
print(accountKey)

Read and Write Cosmos

df= spark.read.format("cosmos.olap")\
.option("spark.synapse.linkedService", "nm")\
.option("spark.cosmos.container", "cont").load()

df.write.format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "my_linked_service")\
    .option("spark.cosmos.container", "my-container")\
    .mode('append')\
    .save()

SQL Database (OnDemand / Serverless) vs Dedicated SQL (Relational DB)

                                 Dedicated       Serverless
1. Select                        Yes             Yes
2. Insert, update, delete, merge Yes             No
3. CTAS                          Yes             No
4. CETAS                         No              Yes
5. Cost                          High            Low
6. Strategy (hash/round robin)   Yes             No
7. Provisioned                   Separately      By default
8. Volume of data                High            Low
9. Data stored                   Relational DB   Blob/Gen2

External Tables and Views
Both are just pointers to a location in the data lake / Blob storage (Parquet, JSON, etc.)

                    External Tables     Views
1. Define Schema    Yes                 No
2. Indexing         Yes                 No
3. Performance      Fast                Slow

READ VIEW from NoteBooks

token = mssparkutils.credentials.getToken("key")
df = readServerlessTableViews(token)
def readServerlessTableViews(token):
    df = (spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", "jdbc:sqlserver://xxx-syn-01-ondemand.sql.azuresynapse.net:1433;databaseName=db;") \
    .option("dbtable", "myschema.viewName") \
    .option("accessToken", token) \
    .load())
    return df

Dataflow

DataFlow Ref

  • Linked Service : Service to Connect to different DataSources (External And Internal)
  • Integration Runtime : Compute which executes the DataFlow
  • Source(Linked Service) > Operation > Sink(Linked Service)

DataSet ( Connection to Source)

  • All parameters created in Dataset can be accessed in Dataflow at Debug Settings
  • The same parameters can be used at pipeline level on the Dataflow added
    • DataFlow > New DataSet > Parameters >New > test,string
    • Access : dataset().test
    • example: @concat('/',dataset().test,'/',substring(utcNow(),0,10 ),'.parquet')
  Note:
  • Dataflow can create a DB table on the fly.
  • CDC (Change Data Capture)
    • To handle delta records (all new records, including newly added duplicate records)
    • Available for SQL, Azure Dedicated, Gen2
    • Checkpoint Key (unique marker to identify the last successfully processed record)
      • Create a Dataflow and enable CDC
      • Add the Dataflow to pipelines
      • In the pipeline > select the Dataflow > Settings > Checkpoint key
    • Disadvantage: if delta is set to ON and the user wants to perform a full load in between, the user has to make changes to the flow and publish. [ref](https://learn.microsoft.com/en-us/answers/questions/838041/how-to-reset-the-change-data-capture-information-()

Data Flow Activities:

  • Uses Spark Engine

  • Src > Transform > Sink

  • Any number for Sources

  • Different Activities :

      Multiple Inputs:
      	- Source: spark.read
      	- New Branch: Clone  Data into new branch
      	- Join : Normal join (Use exists for left outer join )
      	- Conditional Split: Split records based on condition
      	- Exists: Compare two tables (inner join or left outer join)
      	- Union : Union /Append Files
      	- LookUp:  
      		- Left Outerjoin with 2 tables
      		- Verify Data between 2 tables 
      		
      Schema Modifier (x9)
      	 - DerivedColumn:  WithColumn , (Create Hierarchy for complex JSON)
      	 - Select:         Select (prune)/ WithColumnRenamed / Remove Columns
      	 - Aggregate:      GroupBY / All functions ( max ,sum , etc.,)
      	(User can give values to Agg and GroupBY in Same Activity to apply group level filter)
      	 - Surrogate Key:  Add unique primary Key
      	 - Pivot : Create pivot table
      	 - Unpivot: Reverse a pivot table
      	 - Window: 
      		- Without order(max(col),min(col) etc., ):Eg max('col).Over(Window.Partition('groupby_col))
      		- With Order (rank, denserank) : Eg rank.Over(Window.Partition('groupby_col).orderBy('sortCol))
      	  - Rank: 
      		- Used Single : Entire Dataframe is ranked ( Dense = serially with no gaps)
      		- Row Column : Name of new Column
      		- Row Condition : Existing Col - Asc / Desc
      	  - External Call: rest api
      	  
      Formatters
      	- Flatten: Json with Array ONLY Eg : {"A" : [1,2,3]}
      	- Parse: String > Json
      	- Stringify: Get particular sub-json from main json eg json > Table Cols
      
      Row modifier
      	- Filter:  Filter / Where 
      	- Sort: sort
      	- Alter Row (DB only) : Update DB, With+When , Drop , Fill , parameterized 
      	
      Delete  (on real time only)
      	- Assert: Assert + Derived Column ( isError ('col_name working'))
      
      Flowlet
      
      Destination(Sink)          
       	- spark.write
      	- Cached Sink (temp loc): Access "sink_name#outputs()[RowNo].ColumnName"  
      	- Cached Sink - Convert Data to StringDatatype  to avoid error - "undefined" is not valid JSON Eg : toString(Col1)
      	- cached Lookup (Dict): Sink> Settings> Add Key Column.Access "sink_name#outputs(columnname)" , "sink1#outputs()[1].MAX_UPDATE_TS"
      
      Misc
      	- WildCard:Normal Regex
      	- $$: All Columns
      	- $parameter: Access parameter created within same dataflow activity
      	- String Interpolation : "folder/file-{$n}.csv" #n is a parameter
      	- ByName('city'): Any column whose name is city
    

Dataflow Script Representation :

Dataflow Transformations

Derived Columns :

  • $$ (this) : All Cols

  • $0 current column name

  • name Eg: name!="abc"

  • type Eg: type != int

  • stream represents the name associated with each stream, or transformation in your flow

  • position is the ordinal position of columns in your data flow

  • origin is the transformation where a column originated or was last updated

    1. trim any col which is a String(Pattern): - Derived Col > Add > Col Pattern > type == 'string',$$ -> trim($$) - Auto mapping on
    2. Remove special character from all columns
      • Add Select
      • Add Mapping > Rule based Mapping
      • true() , regexReplace($$,'[^0-9a-zA-Z]','')
      • or true() , regexReplace('Column','[^0-9a-zA-Z]','')
    3. String Column to Date
      • Derived COlumn
      • toDate(byName('Date'),'MM/dd/yyyy')
    4. Metadata of Column inside Dataflow
      • Derived Col > col ,Columns()
      • Flatten > unroll by col > col
    5. Calculate Column Pattern (apply logic to data in multiple columns)
      • Derived Column > Columns > Add > Column Pattern type =='string', $$+'_clean', trim($$,'')
    6. Max : Agg > max(col)

Dataflow Json

1. Table to Json 
   - Read Source (import projection)
   - Derived Col > Create Structure 
   - Save to Json Sink
2. Handling json :
		- Parse
		- Flatten
		- Stringify
		- Derived Cols : Create Complex Structure
3. Stringify 
	- get particular sub-json from main json eg json > Col to Table 
	- DataFlow >REST APi Source (url="https://api.first.org/data/v1/countries" , Anonymous Request) or        https://restcountries.com/v3.1/name/united
	- Source > Projection > Import projection
	- Add Stringify > Cols > newcol , Expr > Under Input Schema > Select Structure to Flatten
	- Add Select Statement > filter the new_col
	- Select new column 
	- Sink
4. Parse (JSON in single Column in a table) :
	val j = List(("IN",1),("US",2)).toDF("country","medals")
	.withColumn("j", to_json(struct('country,'medals)))
        +-------+------+--------------------+
        |country|medals|                   j|
        +-------+------+--------------------+
        |     IN|     1|{"country":"IN","...|
	- Make sure the Source > projection > Import Schema
	- Add Parse
		- Document Form = Single Document
		- Columns > new col , j , (country as string,		medals as integer)
5. Flatten :
	 - Flattens  Json with Array ONLY
	 - If json has array {"A" : [1,2,3]}

Pipeline

  • Linked Service : Service to Connect to different DataSources (External And Internal)
  • Activity : Copy , Dataflow , Notebooks ,SQL etc.,
  • Private Endpoint : Endpoint within a private virtual network
  • Integration Runtime : Compute which executes the DataFlow
    • Azure IR : in cloud-to-cloud scenarios.
    • Self hosted IR : to connect to on-premises data sources or services.
    • SQL Server IR : to connect to SQL Server databases in a private network.
  • Can run activities in parallel
  • Throttling under Integration Runtime 'AutoResolveIntegrationRuntime' : There is a limit of simultaneous pipelines in an integration runtime. You need to split the pipeline to run into multiple runtimes.
  • Triggers : used to start the pipeline
    • Manual
    • Tumbling Window :
      • Runs continuously after finishing
      • Delays and concurrency can be set
      • Auto Retry in case of Fail
    • Event : storage Events etc.,
    • Scheduled Triggers
  Note:
  1. Pipelines and Synapse Live use functions from ARM templates.
  2. Hence keep the master branch in sync so that the ARM template stays updated.

Activity 1 : Print statement alternate :

1. create a variable
2. Set the variable in the pipeline
3. Check the contents of variable after the run 

Activity 2 : Prototype pipeline

1. Use wait statements 
2. Set Variable 

Activity 3 : Token Refresh logic

  1. generate token 1st time
  2. Parallel Connection to
    1. until loop(flag is true)
      1. Wait
      2. Keep generating new token
    2. Copy Activity(uses above token)
      1. fail : Set flag = true
      2. Pass : Set flag = true

Activity 4: Run SQL Script saved in Datatab

  1. Web Activity > Url : exampleWorkspace.dev.azuresynapse.net/sqlScripts/exampleSqlScript?api-version=2020-12-01, Method: GET , Auth: Managed Identity , Resource : https://dev.azuresynapse.net

HTAP (Hybrid Transactional and Analytical Processing Patterns)

  • Data sync between source (SQL Server, Azure SQL Database) and target - Dataverse (Datalakes), Azure SQL Dedicated and Cosmos targets

  • Src > Synapse > Target

  • Cosmos DB (OLTP):

    • No SQL
    • Type of Cosmos DB can be selected depending on Source (SQL ,Mongo etc., )
    • Fast Read and Write
    • Auto Scales
    • Uses API
    • Expensive
    • To Enable HTAP (On Demand - video8):
      • Cosmos Homepage : turn on the Analytical Container in Cosmos
      • In Synapse: Homepage > Integration > Enable Synapse link
    • Synapse read cosmos: Select * from openrowset() with()as T CROSS APPLY OPENJSON(C) with()
  • SQL server

    • Use Managed Identity
    • Azure Synapse Link enabled
    • Add IPs to the firewall
    • Create a Linked Service + Linked Connection in Synapse (Pipelines > "+")
    • Configure Blob Storage for intermediate results

Deployment Strategy [GIT]

Summary

  • Publish branch : ARM templates are stored and updated
  • Collaboration Branch = Master Branch.
  • Feature Branch = Individual Branch created by users.
  • Root Folder = root folder in your Azure Repos collaboration branch.
  • Pull Request =Request merge from Feature Branch > Master Branch.
  • commit all = Save All Changes to Git.
  • Publish Should only be allowed from master Branch.
  • Code is merged from Master Branch to Synapse Services. (Git Actions is configured to run pull)
  • Publish > ARM template gets created.
  • Once master gets published, the code is available in Synapse-live and the workspace_publish branch gets created.
  • ARM Template = Azure Resource Management Template. Note : ARM templates from workspace_publish branch is not used.

Whenever we click on the Publish button (master branch), Synapse creates 2 JSONs

  1. TemplateForWorkspace :
    • All work (pipelines, Notebooks, DataFlows etc., ) in 1 single file
    • This file does not contain parameters
    • It contains logics only
  2. TemplateParametersForWorkspace : Contains only Parameters

For Successful Deployment we need to :

  1. Move TemplateForWorkspace to higher Environment as is
  2. Replace "TemplateParametersForWorkspace" with values as per the higher Env

Git (parameter replace for all artifacts)

  1. When Code is merged from Master Branch to Synapse Services.

  2. This will trigger function in GitActions ( Azure/Synapse-workspace-deployment@V1.7.0 ) - which is configured in CICD

     	uses:  Azure/Synapse-workspace-deployment@V1.7.0
     	with:
     		TargetWorkspaceName:  ${{ env.SN_WORKSPACE }}  # Dev [Live]
     		ArtifactsFolder:  ${{ github.workspace }}
     		environment:  Azure Public
     		clientId:  ${{ env.CLIENT_ID }}
     		clientSecret:  ${{ env.CLIENT_SECRET }}
     		tenantId:  ${{ env.TENANT_ID }}
     		resourceGroup:  ${{ env.RESOURCEGROUP }}
     		subscriptionId:  ${{ env.SUBSCRIPTIONID }}
     		DeleteArtifactsNotInTemplate:  true
     		operation:  validateDeploy
    
  3. Two ARM template (Azure Resource Management Template) gets created.

    • TemplateParametersForWorkspace.json
    • TemplateForWorkspace.json
    • Note: ARM templates from workspace_publish branch is not used.
  4. Users Need to add 2 custom files at the root:

    • {{ synapse_workspace }}-parameteres.yaml : Value of Each Property Eg: linkedservice

    • template-parameter-definition.json :

    • Synapse will read this file and use its configuration to generate which properties get parameterized.

    • If no file is found, the default template is used.

    • file consists of [trigger, pipeline, linked service, dataset, integration runtime, and data flow]

    • "=" is current value ,"-" is don't keep the default .

    • Syntax : <action>:<name>:<stype>

           "Microsoft.Synapse/workspaces/linkedServices": {
               "*": {
                   "properties": {
                       "typeProperties": {
                           "baseUrl": "="
                       }
                   }
               }
           }
      

    If you have a linked service, "parameteres.yaml" will replace its values with the values in this yaml, following the structure of "template-parameter-definition.json"

  5. Above Function will Deploy Synapse artifacts using templates to given env

  6. You can get this file from the link.

To Change artifacts of properties , A custom parameter template can be used {{ workspace }}-parameteres.yaml. To override the default parameter template, a custom parameter template named template-parameters-definition.json should be placed in the root folder of the Git branch.

Points to Remember in Synapse - Git

  • Collaboration Branch = Master Branch.
  • Feature Branch = Individual Branch created by users.
  • Root Folder = root folder in your Azure Repos collaboration branch.
  • Pull Request =Request merge from Feature Branch > Master Branch.
  • commit all = Save All Changes to Git.
  • Publish Should only be allowed from master Branch.
  • Code is merged from Master Branch to Synapse Services. (Git Actions is configured to run pull)
  • ARM template gets created.
  • Once master gets published , the code is available in Synapse-live and workspace_publish branch get created.
  • ARM Template = Azure Resource Management Template. Note : ARM templates from workspace_publish branch is not used.

Questions

NoteBooks Questions :

- What is the use of temp tables? To reference data across languages
- How to reference another Notebook? `%run /<path>/Notebook1 { "parameterInt": 1 }`
- How to return values from a Notebook?
	from notebookutils import mssparkutils		
	mssparkutils.notebook.exit("hi")
- To pass external parameters from pipelines to a Notebook ?
  - Create a variable in the Notebook Eg: input=""
  - Convert it into a parameter ie., hover over the cell >> click on "..." > Toggle parameter cell
  - Create a new Pipeline
  - Add the notebook into the pipeline
  - In the Pipeline, select Notebook > Settings > Add notebook > Base parameters > "input" - String
  - Click outside of the Notebook > Variables > "input" - String
  - Click on the Notebook in the pipeline > Base parameters > "input", @variables('input')
- How to Read a CSV from Gen2 ? 
	df= spark.read.option('header','true').option('delimiter',',').csv('abfss://1.csv')	
- What are Magic Commands ?
	- Line Magic(%) - same line only Eg: ```python y = 2 if x < 3 else 4```
	- Cell Magic(%%) - entire cell Eg:
			%%timeit
			if x < 3:y=2
			else:y=4
	- How is Spark session configuration done with a magic command ?
     %%configure 
		{"driverMemory":"28g","driverCores":4,"executorMemory":"28g","executorCores":4
			,"jars":["abfs[s]://blob/myjar1.jar"]}
- How to Reference unpublished notebook ? Check box option on Notebook Settings
- How to do Python logging in a Notebook ? 
	import logging
	defaultLogger = logging.getLogger('default') 
	defaultLogger.debug("default debug message")

- File operations ? mssparkutils.fs.ls('path') #head,append,ls,cp,mv,mkdirs,rm
- read parq ? ```spark.read.parquet('abfss://parquet@deltaformatdemostorage.dfs.core.windows.net/employees')```

Dataflow Questions

1. StringInterpolation (parameter inside string)? 
	"folder/file-{$n}.csv" #n is a parameter
	"My age is :{toInteger(col)}"
2. Parameters Wildcard Dataflow ?
	-dataflow add parameter > add "path"
	-Source > Source Options > Wildcard paths > 'folder/'+$path+'*.csv'
3. Send whole Col itself as parameter?
	-Dataflow > Parameter > data , string, "test"       # this is data
	-Dataflow > Parameter > varcol , double, colname    # Dynamic join
4. schema Drift (Source) ?
	 Enable when column changes in a Schema
5. BroadCast Join (Join) ?
	When whole Table can be accomodated in memory 
6. DataFlow Assert ?
	toInteger($col_name) == 1000
7. How to "select max(revenue),* from T"?  (Self Join)  
	- Src > Projection > Import Projection
	- New branch > Aggregate  > Add : maxcolumn ,max(revenue)
	-  Inner Join (main branch) when revenue = max(revenue)
8. Select max(column) from table
	- Src > Projection > Import Projection
	- New branch > Aggregate  > Add : countcolumn ,count(col_name)
9. Find Surrogate Key
	[Ref](https://www.youtube.com/watch?v=9U-0VPU2ZPU)
	- Lookup (does left outer join of different source )
	- Lookup + Derived Column(isMatch) + Conditional Split (isMatch=True/false)+ others
	- Lookup : the result of each row is stored in new column
	- Derived Column =isMatch() 
	- Agg = first
10. Cast all Cols to Integer ?
	- Select > Regex > 1==1 , $$ , toInteger($$)
11. LAD
	- Cached lookup - https://www.youtube.com/watch?v=HqCn42FaRJs
	- Add Cache type Sink as "cachedata"
	- to call the cache : cachedata#outputs()[1].columnname
12. Multiple Sinks? Dataflow >Settings > Set order
13. Accesing sink output in Pipeline? @string(activity("name").output.runStatus.output.sink_name.value[0].colname)
14. Canonical Model (Flexible Schema)?
	Derived Col>RuleBased> locate('colname',lower('name')) !=0 -> colname
15. Selection of group of Columns from Schema? Select >type!='string' && name != 'id',$$+'_notstring'
16. Dynamic Values / Parameterization[ref](https://www.youtube.com/watch?v=CMOPPie9bXM)
	- if Default parameter is not set , the user will be forced to enter the parameter value at the start
	- Only parameters declared/created can be used 
	- Any parameter created on : 
		- Dataset will be available on Dataflow (in Debug Settings)
		- Create at Dataset ,Dataflow will be available in pipeline by clicking on the Dataflow object
17. Dataflow Parameterisation ?
	- Creation : Parameters >dataflow_name,String,defaultvalue = Blank
	- Access: $dataflow_name
	- Sending Value: 
		1. Add to a Dataflow
		2. Debug Settings > Parameters > Dataset Paramters >dataset_name as value
18. Passing parameters from Dataset >DataFlow>  Pipeline
	- If u already have set parameter in Dataset and used in Dataflow. 
	- U can use by : pipeline > Select Dataflow> Settings > u should be able to see "dataset_name" under 
19. Row to Columns (unpivot) : Unpivot > Ungroup > Unpivot Colum
20. Assert 
	- Expects true,Expects false,Expects Unique,Expects exists(col to col matching)
	- description :toString(col_name) +'working'
	- filter : col_name == 1000
	- Assert + Derived Column =( isError ('col_name working'))
	- Assert + Split =  (hasError())
21. External Call 
	 - loop the rest api for the given input -https://www.youtube.com/watch?v=dIMfbwX8r0A)
	 - source (List)
	 - Add a derived Column to construct a resource using List
	 - add external call 
		- linked Service : GET , Anonymous , the Url will be overrided by below
		- method : GET ,previousActivity
		- Relative Url : Endpoint
		- Row Relative url : resource (constructed above)
22. Create Hierarchy
	- Add Source (input a csv (flat file) )
	- Add Derived Column  > Exp Builder > Create Column > Add SubColumn Window
	- Over (Group Column) : c2
	- sort : sort column  (Window Column is arranged : rank , dense_rank ,row_number etc.,)
	- range : 
	- Window : max(col), denserank(),rank()
23. Create variable inside an Activity in Dataflow:
	- Derived col > locals
	or
	- Cache > WriteTo Activity > use in another activity
	- Dynamic Expression > Create varible > var1 ,trim(right(col,6))
	- Accessing it ":var1"
24. Join schema drift table (schema-less table) with a table with schema
   - Src1 (schema Drifted , allow schema drift , infer drifted col) > Derived Col: col_tojoin, toInteger(byName('col_tojoin'))
   - Src2 (Scheme Table)
   - join : src1 ,src2 on col_tojoin
   - Sink (allow schema drift turned ON)
25. remove spaces from all col names? replace($$,' ')
26. Add schema to Schema less src? Src > Data Preview > Map Drifted Column
27. Specific Partition Files?
  - Dataset : Target = Container base path only
  - Src > Src Options > 
	  - Wild Card = top_folder/*/*/*.parquet
	  - Partition root path = top_folder
	  - column to store file name = filtercolumn
28. Saving Data into specific location based on input
	- Create a Column in transformation which contains Target
	     - Derived col > month , split(col,"-")/[2]
	- Sink > Settings > Filename Option (default : generic name)
	- Sink > Optimize > Set Partioning
	- Sink > Optimize > Key > UNique value per partition > Key Col = month
29. Pipeline Trigger:
		Schedule Pipeline
		cron Job pipeline etc.,
30. Save Partition files with Name:
		- Sink > Settings > File name option > Pattern = movies[n].parquet
		- Optimize > Set Partitioning , Partition Type = Round Robin , 10         
31. Create Complex Datatype (subcols , keyvalue ,array):
		1. Derived Col
		   - all subcols to Main Cols
		   - access > Derived Col >  col.subcol
		2. New Col > ['key1' -> col1 , 'key2' -> col2]
		3. split(Name, ' ')
32. Coalesce and FileName
		- Sink >Setting > Pattern / Single Fle > movies[n].csv
		- Sink >Optimize > Set Partitioning > Hash > nos=5 , col =primarykey
33. Partitioning? Sink >Optimize > Set Partitioning > Key > col = "year"
34. Calculate Fuzzy:
	- 2 sources
	- join > soundex(name) == soundex(name2)

35. Deduplication / Distinct Rows  	
   - Method 1
       - Aggregate > GroupBy > sha2(256,columns()) , mycols
       - Aggregate > Aggregate >Each Col matches= true() , $$ , first($$)
   - Method 2
       - Aggregate > GroupBy > sha2(256,col1,col2,soundex(col2))
       - Aggregate > Aggregate > Each Col matches= true() , $$ , first($$)
   - Method 3
       - https://docs.microsoft.com/en-us/azure/data-factory/data-flow-script
       - https://www.youtube.com/watch?v=QOi26ETtPTw
36. Delta Loading using SQl Change Tracking Technology
	- https://www.youtube.com/watch?v=IN-4v0e7UIs
	- https://docs.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-tracking-feature-portal
37. Distinct
   - GroupBY (Aggregate) > Group By target_column 
   - Aggregates > Column pattern > name != 'target col', $$ , first($$)
   (select all columns except target column)
38. Row Count on a Table
	Aggregate > GroupBY (blank) , Aggregates = colname , count(1)
39. Routing /Saving Data to Different Target in Same DF:
        - Contitional Split
        - Using Parameter
        - New Column in Run Time
40. Start SurrogateKey from next Value (https://www.youtube.com/watch?v=tc283k8CWh8)
	Method 1:
	- Dim > Agg > groupBy blank , Agg = max(SurrogateKey) 
	- Src > Add Surrogate Key from One > Cross join with Dim Agg > key + max(SurrogateKey) 
	Method 2:
	- Dim > Agg > groupBy blank , Agg = max(SurrogateKey) > Sink the max value
	- Src > Add Surrogate Key from One > Derived Column > surrogateKey + sink

41. Group players into each array in each row by team name (collect_list)
   - GroupBY (Aggregate ) > Groupby Team name > Agg : collect(player)
42. Datadriven Framework:
	- Create a parameterised dataflow  (ie., Dataflow which needs parameters)
	- Create a pipeline : Lookup > Foreach > above Dataflow
	- Lookup = csv file
	- Foreach: Settings >Sequential , @activity('Lookup1').output.value
	- Inside Foreach >Parameter= pipeline Expression ,@item().columnNames
	- Run Debug
	- If error: check the Pipeline screen > on the Output tab > click "see error on failure" to debug
43. WARM Cluster Dataflow (to track performance)?
	-Abstract the starting of Cluster with dummy Dataflow
        ie., add a dummy dataflow at the beginning of target dataflow
	-Example: src > filter (1==2) > sink
44. SCD Type -1
	- Src > Alter Row > Sink
	- Alter Row > Upsert if
45. Create Datalake
	- Sink
	- Sink Type > Inline > Select Inline Dataset Type >  Delta
46. Replace Special Characters
	- Select > true() ,regexReplace($$,'[^a-zA-Z]','')
47. Hashing
	
	sha2(256,columns())
	or
	parameter = [c1,c2]
	sha2(256,parameter)
48. How to left outer join? Use exists
49. Remove Column 
	- Select Activity  
	- source :   name != 'Row_Number' && name != 'null' && left(name,2) != '_c'
	- and this for the output column name: $$
50: Cache Sink ("undefined" is not valid JSON)
	- Make sure the Data ur trying to save is converted to correct Data Type
	- Eg : toString(Col1)

Exercises:

  1. Find Max Row
  2. Find Distinct Rows
  3. Group by Country
  4. Parameters (to save data into a file in blob with the name given as a parameter in the pipeline)
  5. Creating Variables
  6. Regex / WildCard
  7. Save file as abcd_timestamp.parquet
  8. variable local inside Derived Column only
  9. Create Datadriven Framework ( Data is coming from csv /json into dataflow)
  10. Send function as parameter into Dataflow

PipeLine Questions

  1. Pipeline Expression should start with ? "@"
  2. Parameters in pipeline?
    • constant
    • Create = Pipeline Window > Parameters > New "param"
    • Access= @pipeline().parameters.param
  3. Access output of any activity in Pipeline ? @activity('name').output:
  4. Variables in pipeline?
    • modified
    • Create = Pipeline Window > Variabes > New "var"
    • Access= @variables('var')
  5. Access notebook return data in pipeline?
    • @activity('notebookname').output.status.Output.result.exitValue
    • How to get count of records Ingested after ETL ? @activity('name').output
  6. How Logging (Pipeline)?
    • capture the json from one activity into another
    • Dataflow (log) = Create simple csv with data "1" (dummy source)
    • Dataflow > Dataflow (log)> parameter =@activity('name').output.runStatus.metrics
    • Logging functionality directly available in Copy job
  7. How Dynamic Run time Resource Allocation (pipeline) ?
    • Pipeline > Dataflow > Settings > Core Count = @if( greater(activity('Name').output.size , 400000) ,128)
  8. Use of Lookup ??
    • Input= any data sources

    • output = single value /array

    • run Queries / SPROC

    • SELECT * FROM @{pipeline().parameters.T}

        Pipeline Lookup Example:
        	- pipeline>lookup > "Select col1 from table"
        	- Foreach > @activity('lookup').output.value
        	- Add Dataflow inside Foreach
        	- src (src dataset is parameterized)> Settings > @item().col1
      
  9. Lookup vs Script Activity ?
    • Lookup :returns the result of executing a query or stored procedure.
    • Script: can only execute Data Manipulation Language (DML- INSERT, UPDATE, DELETE and SELECT) and Data Definition Language (DDL - CREATE, ALTER and DROP)
  10. Which Activity to Access Folder Structure in gen 2 ? getMetadata
  11. SQL scripts (Develop > SQL Script) are disconnected from pipelines; to run them:
    • Option 1 : Add Script Activity and Directly add the SQL Code
    • Option 2 (Call the Current Synapse using REST to get all artifacts ) :
      • Save a SQL Script in Develop
      • url : https://.dev.azuresynapse.net
      • Resource : https://dev.azuresynapse.net/
      • GET , System-Managed-Identity
      • Then navigate in the response result to get the SQL

Thursday, May 26, 2022

ETL


ETL

BigData:

  • OLTP (online Transactional processing)
  • OLAP(online analytical processing)

Data processing systems

  • OLTP (OnLine Transaction Processing) : is used for managing current day to day data information.
    • ACID
      • Atomicity(entire transaction happens at once or nothing happens)
      • Consistency
      • Isolation
      • Durability (persisted)
  • OLAP (OnLine Analytical Processing): is used to maintain the past history of data and mainly used for data analysis, it can also be referred to as warehouse.

DataLake

  • large amounts of Data
  • structured, semi-structured, and unstructured data.
  • original/ raw format.
  • Any Formats - JSON, BSON, CSV, TSV, Avro, ORC, and Parquet etc.,
  • cost-effective way
  • No Transformation

Delta / Watermarking (like a water-level marker in a dam or a river)

  1. Timestamp-based Approach: the source should have a timestamp column

    a. Current-timestamp based

    • If the current ts = 24-07-2024-01:01:01:0001
    • Next run: Select * from T where ts > '24-07-2024-01:01:01:0001'

    b. Source-timestamp based

    • Save the max timestamp from the source column after each run.
    • In subsequent runs: Select * from T where ts > max(saved_value)
  2. Hash-based Approach: compare a hash of the columns on both sides

select src.*
from src, dest
where hashbytes('sha256',concat_ws('',src.c1,src.c2))    -- h1
   <> hashbytes('sha256',concat_ws('',dest.c1,dest.c2))  -- h2
  3. Primary Key-based Approach: Comparing the primary keys of Source and Target

  4. Change Tracking Technology : MS SQL and SQL Server

  5. Change Data Capture (CDC) - Azure Synapse and Data Factory only:

    • One Checkbox to handle delta
    • Native to Azure
    • Azure to Blob, Azure Gen2 , Cosmos , SQL DB , SAP , PostGres
    • Reverting to Delta is Difficult

    ALTER DATABASE DB
    SET CHANGE_TRACKING = ON
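
A minimal PySpark sketch of the source-timestamp approach (1b), assuming an active SparkSession, a source table named src with an UPDATE_TS column, and a watermark persisted as a tiny parquet file (all hypothetical names):

# 1. read the previously saved watermark
last_ts = spark.read.parquet("/tmp/watermark").first()["max_ts"]

# 2. pull only the rows that arrived after the watermark
delta_df = spark.read.table("src").filter(f"UPDATE_TS > '{last_ts}'")

# 3. process / load delta_df, then persist the new watermark for the next run
delta_df.selectExpr("max(UPDATE_TS) as max_ts") \
    .write.mode("overwrite").parquet("/tmp/watermark")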


Datawarehouse (Summary, Types, NF, Facts and Dim,Star vs SnowFlake, Key Types, LAD, Handle LAD)

  • structured information from various sources.
  • Data is transformed
  • has schema
  • Use RDBMS
  • Used for reporting
  • Facts and Dim
  • Data is Normalised

Datamart

  • similar to a data warehouse.
  • Caters to a specific business unit Eg: Finance Dept

Database Normalization

Used to reduce redundancy from the database table.

1NF(atomicity):

Each attribute holds a single value.

Name,Age,Dept <- BEFORE  
Melvin,32,(Marketing, Sales)

Name,Age,Dept <- AFTER  
Melvin,32,Marketing  
Melvin,32,Sales  

2NF (no partial dependencies on the key):

  • After 1NF

  • No Partial Dependencies (no column depends on only part of the composite/natural key)

      Subject |Teacher ID|Teacher Age  <- BEFORE
    
      Teacher ID|Teacher Age  <- AFTER
      Subject |Teacher ID
    

3NF (no transitive dependencies: non-key cols must not depend on other non-key cols)

  • After 2NF

  • A -> B and B -> C , A-> C

  • Primary Key: EmpID

  • Non-key attributes: Name, State, Country, ZIP

    EmpID| Name| State| Country| ZIP
    
  • EmpID --> ZIP , and ZIP --> State, Country (a transitive dependency)

    ID,NAME,ZIP
    ZIP,STATE,Country

Dimensions (Object)

  • Eg: a seller, a customer, a date, a price , Product
  • Contains Primary keys , Surrogate Keys
  • Dim = Person Details , Fact = Transaction done by the Person

Facts (Action)

  • Eg: a sale, a transaction, an access
  • contains foreign keys from Dimension

Grain:

  • Lowest level of measurement in the Fact table.

Dimension Modeling

Star Schema :

  • Fact table uses Foreign Keys from Dim tables
  • Schema looks like a star, with one fact table connected to multiple dimension tables

Snowflake Schema

  • An extension of the star schema
  • Multidimensional model
  • Dimension tables are further normalized into sub-dimensions

Types of Dimensions :

  • Role-Playing Dim : Eg - a "Date" dim can be used for "Date of Sale", "Date of Delivery", "DOB"
  • Junk dimension : Table with unrelated attributes to avoid a large number of foreign keys in the fact table.
  • Degenerate dimensions : Dimension attributes stored as part of the fact table and not in a separate dimension table. Eg: "transaction number"
  • Stacked dimension : two or more dimensions are combined
  • Slowly Changing Dimension : a dimension that undergoes changes over time
  • Slowly Growing Dimension : growth of records/elements in the dimension.
  • Conformed Dimension (Source of Truth) : shared across multiple data marts or subject areas
  • Reference Dimension : joined indirectly to the fact table through a key in another dimension table.
  • Static Dimension : not extracted from the original data source, but created within the context of the data warehouse.

Types of Slowly Changing Dimension

Type 0 : No changes are entertained

INSERT INTO dim
SELECT * FROM src
WHERE NOT EXISTS (SELECT * FROM dim WHERE dim.id = src.id)

Type 1 : Direct Update Old Data (Upsert Original Data)

SK,ID,Name,Cellphone. <- Before
100,1,ABC,1234567890

SK,ID,Name,Cellphone <- After
100,1,ABC,9876543210

UPDATE dim
SET cellphone = src.cellphone FROM src WHERE dim.id = src.id;

Type 2 : Flag Old Data as 0 , Insert new Data with 1(Flag)

SK,ID,Name,Cellphone,FROM_DT,TO_DT,flag
100,1,ABC,1234567890,2016-06-01,2016-06-10,0
100,1,ABC,9876543210,2016-06-10,NULL,1

INSERT INTO dim
SELECT src.* FROM src
JOIN dim
ON src.id = dim.id AND src.cellphone <> dim.cellphone;
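
For comparison, the same Type 2 flow expressed as a Delta Lake merge (a sketch, not the table above verbatim: it assumes dimension columns id, name, cellphone, from_dt, to_dt, flag, an incoming batch called updates, and it leaves surrogate-key generation out):

from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit

dim = DeltaTable.forPath(spark, "/tmp/dim_customer")

# rows whose tracked attribute changed; these must close the currently active record
changed = (updates.alias("u")
           .join(dim.toDF().alias("d"), "id")
           .where("d.flag = 1 AND u.cellphone <> d.cellphone")
           .selectExpr("u.*"))

# changed rows appear twice: once with a NULL mergeKey (forces an insert of the new
# version) and once with the real key (matches and closes the old version)
staged = (changed.selectExpr("NULL as mergeKey", "*")
          .union(updates.selectExpr("id as mergeKey", "*")))

(dim.alias("d")
   .merge(staged.alias("s"), "d.id = s.mergeKey AND d.flag = 1")
   .whenMatchedUpdate(condition="d.cellphone <> s.cellphone",
                      set={"flag": lit(0), "to_dt": current_date()})
   .whenNotMatchedInsert(values={"id": "s.id", "name": "s.name",
                                 "cellphone": "s.cellphone",
                                 "from_dt": current_date(),
                                 "to_dt": lit(None).cast("date"),
                                 "flag": lit(1)})
   .execute())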
Type 3 : Keeps history by adding a new column

SK,ID,Name,old_number,new_number
100,1,ABC,1234567890,9876543210

Type 4 : Uses separate history table

SK,ID,Name,Cellphone
100,1,ABC,9876543210

History Table
SK,ID,Name,Cellphone,CRT_DT
100,1,ABC,1234567890, 2016-06-01
100,1,ABC,9876543210, 2016-06-10

Type 6 :Combination of type 1, 2 and 3

  • New Row (Like type 2)
  • Each Active rows can be indicated with a boolean flag / start and end date.
  • using additional attributes or columns within the same record instead of creating new records

Eg: a new table record is added as in SCD type 2. Overwrite the old information with the new data as in type 1. And preserve history in a historical_record as in type 3.

ID,Name,new_number,old_number,EFF_FR_DT,EFF_TO_DT,flag
1,ABC,9876543210,2345678901, 2015-01-01,2016-01-20,0
1,ABC,9876543210,1234567890, 2016-01-21,Null,1

Surrogate Key vs Natural Key vs Primary Key :

Natural Key

  • Appears Naturally Eg: Registration Number ,Country Code
  • 1 or more columns needs to combined to make it unique
  • May or may not have a starting value
  • Can be Null

Surrogate Key

  • Surrogate Key is artificially generated and uniquely identifies each row in the dimension table
  • Commonly 1 Column and starts with Integer "1"
  • unique
  • local only to DataWarehouse , DataLake
  • Created using Natural Key
  • Cannot be Null
  • Skey :

from pyspark.sql.functions import sha2, concat_ws, col, monotonically_increasing_id

df\
.withColumn("sha2", sha2(concat_ws("_", "nat_key1", "nat_key2"), 256))\
.distinct()\
.orderBy(col("sha2").asc())\
.withColumn("skey", monotonically_increasing_id())

  • Use "ReIDL" logic to maintain the Skey-to-NaturalKey relation even during a full load.

Primary Key
  • 1 Column
  • May or may not be an Integer Column
  • Generated in RDBMS Database
  • Not Null

Foreign Key

  • Field that references a PRIMARY KEY / SURROGATE KEY
  • Used in 2NF Normalization Logic

Create Skey in Dim

  • withColumn ('pk' ,monotonically_increasing_id())
  • create table T (id int Identity (1,1) )
  • Add 2 extra Row :
    • -9 as 'NULL' : when joined with the fact (after replacing NULL in the fact with 'NULL'), all null values will get -9
    • -1 as NULL (Orphan Record) : when joined with the fact (NULL), orphan records will get -1

Create Fact Relation

Note : -1 = Orphan Records , -9 for Null

  • Replace Null with 'NULL' for the joining column
  • Left outer join with (Select Skey and NaturalKey from Dim)
  • Replace null with -1 [you can also do an anti join] - see the sketch below
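
A PySpark sketch of the join described above (factDF, dimDF and the natural key nat_key are hypothetical; the dimension is assumed to already contain the -1 and -9 rows):

from pyspark.sql.functions import coalesce, col, lit

# treat NULL natural keys as the literal 'NULL' so they hit the -9 dimension row
fact_prepped = factDF.withColumn("nat_key", coalesce(col("nat_key"), lit("NULL")))

# left join on the natural key; orphan records (no match) fall back to skey = -1
factWithSkey = (fact_prepped
                .join(dimDF.select("skey", "nat_key"), "nat_key", "left_outer")
                .withColumn("skey", coalesce(col("skey"), lit(-1)))
                .drop("nat_key"))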

Late Arriving Dimension / Early Arriving Fact (LAD)

In data warehouse ETL dimensions processed and loaded before Facts

Eg : In an e-commerce website, customers can place an order as a guest. So the fact record gets created without a foreign key from the Customer table.

Design Approaches for LAD

Never Process Fact : never process fact record until dimension record is available.

id,Name
1,Sam
2,Jack

CustomerID,Item
1,Candy
3,Bread <- X

Park and Retry : insert the unmatched transactional record into a landing / another Fact table and retry in next batch process.

id,Name
1,Sam
2,Jack

CustomerID,Item
1,Candy

CustomerID,Item <-Landing Table
3,Bread

Inferred Flag

from pyspark.sql.functions import lit

srcDF = spark.sql(f"select * from src where ts > '{delta_ts}'")
dimDF = spark.read.format("parquet").option("path", "dimDFPath").load()

# insert the missing dimension members with inferredFlag = True
newIDs = srcDF.selectExpr("id as sID").distinct()
newIDs.join(dimDF, dimDF.id == newIDs.sID, "left")\
    .filter("id is null")\
    .withColumn("inferredFlag", lit(True))\
    .write\
    .format("parquet")\
    .option("path", "dimDFPath")\
    .mode("append")\
    .save()

# join the fact to the dimension to pick up the surrogate key
srcDF.join(
        dimDF.select("id", "sKey")
        , srcDF.id == dimDF.id
        , "left").drop("id").write.save("factPath")
  • When the actual dimension record becomes available, columns marked as N/A will be updated with the new data received and inferredFlag will be set to False

Design for ETL

ETL with 3 Zones : Raw(Data Lake) , Enriched (Delta Lake )and Curated (Data Warehouse)

  1. Raw zone

    • Data Lake
    • Blob Storage Eg: Azure Gen2 , IBM cloud Object Storage
    • Data is moved from source
    • Every Load in new timestamp folder
    • Independent (does not depend on the last load type / data)
    • Immutable
    • Delta Load / Full Load
    • Duplicates can Exists ( Eg: If Source is Type 1 and records are deleted . Parquet files does not allow deletion / updation of existing records )
  2. Enriched Zone

    • Delta Lake Type (allows delete , update)
    • Blob Storage Eg: Azure Gen2 , IBM cloud Object Storage
    • Data moved from Raw Zone to Enriched Zone
    • Duplicates removed using Upsert Operation
    • Transformations are done here
    • Fact and Dim
  3. Curated Zone

    • Data Moved from Enriched to Here
    • DataWarehouse
    • SQL Database
    • Fact and Dim

Steps :

Step 1. Get all records from Source

maxDestTS = destDF.select(max("UPDATE_TS").alias("max")).collect()[0]["max"]
newDF = srcDF.filter(f"UPDATE_TS > '{maxDestTS}'")

Step 2. When Source = SCD type 2 or Soft Delete ie.,(Flag = D) :

  • Method 1:
latestIDDF = newDF.groupBy("id").agg(max("UPDATE_TS"))
latestDF = newDF.join(latestIDDF, ["id"], "inner")
  • Method 2:
latestDF = newDF.filter("Flag <> 'D'")\
    .withColumn("r", max("UPDATE_TS").over(Window.partitionBy("id")))\
    .filter("r == UPDATE_TS")

Step 3. Upsert (Update old records , insert New Records) to Delta Lake

	deltaDF = DeltaTable.forPath(spark, "/path")
	deltaDF.alias("target")\
	.merge(latestDF.alias("src"), "src.id = target.id")\
	.whenMatchedUpdateAll()\
	.whenNotMatchedInsertAll()\
	.execute()

Step 4. EZ to DB

Method 1 : Full Overwrite

df.write.format("jdbc").mode("overwrite").options(url="","usn"= ..).save()

Method 2 :

  1. Fetch all records from EZ with timestamp > max(timestamp) in CZ

  2. Delete those records in CZ if the same natural keys exist in the fetched data

     maxTS = <Select max(TS) from Target>
     delTarget = srcDF.filter(f"srcTS > '{maxTS}'").withColumn("key", concat_ws(",", "k1", "k2")).select("key").collect()
     DELETE FROM target WHERE CONCAT(c1, c2) IN (<delTarget values>);
    
  3. Re-insert the fetched records (see the sketch below)
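A rough sketch of this delete + re-insert flow, assuming the EZ data is in a DataFrame ezDF, maxTS holds the CZ watermark, k1/k2 are the natural key columns, and the JDBC connection details are placeholders; the DELETE itself is executed with the warehouse's own client:

from pyspark.sql.functions import concat_ws

fetched = ezDF.filter(f"UPDATE_TS > '{maxTS}'")          # step 1: EZ records newer than the CZ watermark

# step 2: build the key list and delete the matching rows in CZ (run delete_sql via the DB client)
keys = [r["key"] for r in
        fetched.withColumn("key", concat_ws(",", "k1", "k2")).select("key").distinct().collect()]
in_list = ",".join(f"'{k}'" for k in keys)
delete_sql = f"DELETE FROM target WHERE CONCAT(k1, ',', k2) IN ({in_list})"

# step 3: re-insert the fetched records
(fetched.write.format("jdbc").mode("append")
    .options(url="<jdbc_url>", user="<usn>", password="<pwd>", dbtable="target")
    .save())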

Performance :

  • Concat and hash the natural keys while saving, for faster deletes. Eg: hash(concat_ws(',', "k1", "k2"))
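For example, a hashed key column could be added at write time (a sketch; df and the key columns k1/k2 are the assumed inputs):

from pyspark.sql import functions as F

# persist a single hashed key derived from the natural keys; deletes can then filter on nk_hash
df_with_key = df.withColumn("nk_hash", F.hash(F.concat_ws(",", "k1", "k2")))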

Monday, May 16, 2022

Build Jar : Add multiple class files in the Jar using Bazel

 Add multiple class files in the Jar using Bazel 


#---------------------------
#Bazel build parent.jar
#This would add Parent.scala , Child1.scala , Child2.scala in the jar file created
#---------------------------
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_binary","scala_library")
package(default_visibility = ["//visibility:public"])
scala_library(
    name = "parent",
    srcs = [
        "Parent.scala",
        "//src/main/child1:Child1.scala",
        "//src/main/child2:Child2.scala",
    ],
    deps = [
        "//src/main/child1",
        "//src/main/child2",
    ],
)

Thursday, May 12, 2022

Azure Synapse - Working with Custom jar Scala (Bazel)

 Working with Custom jar in Azure 

There are 2 jars :

  1. Big Fat Jar or uber Jar (Contains main class  )
  2. Build Dependency Jar/ Library to be used in Synapse 

Big Fat Jar or uber Jar:


Build Dependency Jar/ Library

  • Does not Contain main class
  • Does not contain dependencies
  • Note : Download external dependencies from Maven and upload them separately (refer below link)
  • $ bazel build build_name.jar

Once you have the jars created based on your need, use this link - https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-scala-packages - to upload them to Azure

Friday, May 6, 2022

Azure Synapse : Passing and receiving Values to and from Notebooks

 Passing and receiving Values to and from Notebooks


Friday, April 22, 2022

SQL : Index

Index

  • Indexes are used to query data faster, speed up sort operations, and enforce unique constraints.
  • In a DB table, each row has a rowid / sequence number to identify the row
  • Eg : a table = a list of pairs (rowid, row)
  • An index is created as a separate table with the opposite relationship: (row, rowid)
  • SQLite uses a B-tree ie., balanced tree, so the index table has the same number of rows as the actual table

Mac:

$brew install sqlite3
>sqlite3

>CREATE TABLE contacts (
first_name text NOT NULL,
last_name text NOT NULL,
email text NOT NULL
);
>CREATE UNIQUE INDEX idx_contacts_email ON contacts (email);
>INSERT INTO contacts (first_name, last_name, email) VALUES('John','Doe','john.doe@sqlitetutorial.net');
>INSERT INTO contacts (first_name, last_name, email) VALUES('Johny','Doe','john.doe@sqlitetutorial.net');

SQLite issues an error message indicating that the unique index has been violated. Note : email should be unique across all rows.

>INSERT INTO contacts (first_name, last_name, email)
VALUES('David','Brown','david.brown@sqlitetutorial.net'),
    ('Lisa','Smith','lisa.smith@sqlitetutorial.net');
    
 >EXPLAIN QUERY PLAN 
 SELECT * FROM contacts WHERE email = 'lisa.smith@sqlitetutorial.net';
 >.quit
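The same behaviour can be reproduced from Python with the built-in sqlite3 module (a small sketch using the same table and index):

import sqlite3

con = sqlite3.connect(":memory:")   # throwaway in-memory database
con.execute("CREATE TABLE contacts (first_name text NOT NULL, last_name text NOT NULL, email text NOT NULL)")
con.execute("CREATE UNIQUE INDEX idx_contacts_email ON contacts (email)")
con.execute("INSERT INTO contacts VALUES ('John','Doe','john.doe@sqlitetutorial.net')")

try:
    con.execute("INSERT INTO contacts VALUES ('Johny','Doe','john.doe@sqlitetutorial.net')")
except sqlite3.IntegrityError as e:
    print("unique index violated:", e)

# EXPLAIN QUERY PLAN shows the lookup using idx_contacts_email instead of a full table scan
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM contacts WHERE email = 'john.doe@sqlitetutorial.net'"):
    print(row)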

Saturday, March 12, 2022

Scala : HTTP Client using Sttp

 HTTP Client using Sttp


HttpClient in spark-shell REPL

  • open link https://mvnrepository.com/artifact/com.softwaremill.sttp.client3/core

  • open compatible scalalink

  • click on Maven tab

  • copy groupId:artifactId:version ie., com.softwaremill.sttp.client3:core_2.12:3.7.4

  • create packagename as below

  • Open terminal and run below commands :

    $spark-shell --packages com.softwaremill.sttp.client3:core_2.12:3.7.4
    
    scala> import sttp.client3._
    import sttp.client3._
    
    scala> implicit val backend = HttpURLConnectionBackend()
    backent: sttp.client.SttpBackend[sttp.client.Identity,Nothing,sttp.client.NothingT] = sttp.client.FollowRedirectsBackend@50bac431
    
    scala> val r = basicRequest.get(uri"https://pokeapi.co/api/v2/pokemon/ditto").send(backend)
    r: sttp.client.Identity[sttp.client.Response[Either[String,String]]] = Response(Right({"ab....
    
    scala> println(r)
    Response(Right({"abilities":[{"ability":{"name":"limber
    
// Ref - https://sttp.softwaremill.com/en/latest/examples.html
/*
libraryDependencies += "com.softwaremill.sttp.client3" %% "core" % "3.5.1"
Bazel Dependencies :
com.softwaremill.sttp.client3.core
com.softwaremill.sttp.client3.model
com.softwaremill.sttp.client3.shared
*/
import sttp.client3._

object SttpClient extends App {
  val url = "http://0.0.0.0:9999/test/1"
  val headers: Map[String, String] = Map("Content-type" -> "application/json")
  val payload: String = """{"name":"deepsk"}"""

  post(url, headers, payload)
  get(url, headers)

  def get(url: String, headers: Map[String, String]): Unit = {
    val backend = HttpURLConnectionBackend()
    val response = basicRequest.headers(headers).get(uri"${url}").send(backend)
    println(response.body)
  }

  def post(url: String, headers: Map[String, String], payload: String): Unit = {
    val backend = HttpURLConnectionBackend()
    val response = basicRequest.body(payload).headers(headers).post(uri"${url}").send(backend)
    println(response.body)
  }
}

Thursday, January 20, 2022

Coding Principles

 Coding Principles

Principles

  • Loose Coupling
  • High Cohesion
  • Change is Local
  • It is Easy to Remove

Smells

  • Rigidity ( A -> B -> C . something hardcoded in C )
  • Fragility
  • Immobility
  • Viscosity of Design (Taking a shortcut and introducing technical debt requires less effort than doing it right.)
  • Viscosity of Environment (skipping rigorous testing)
  • Needless Complexity
  • Needless Repetition
  • Opacity

Class Design

  • Single Responsibility Principle (SRP)
  • Open Closed Principle (OCP)
  • Liskov Substitution Principle (LSP)
  • Dependency Inversion Principle (DIP)
  • Interface Segregation Principle (ISP)
  • Classes Should be Small

General

  • Follow Standard Conventions
  • Keep It Simple Stupid
  • Boy Scout Rule
  • Avoid Multiple Languages in One Source File

Design

  • Keep Configurable Data at High Levels
  • Don’t Be Arbitrary (Have a reason for the way you structure your code)
  • Be Precise

Dependencies

  • Make Logical Dependencies Physical
  • Singletons / Service Locator
  • Base Classes Depending On Their Derivatives

Naming

  • Choose Descriptive / Unambiguous Names (Names have to reflect the entire functionality)
  • Standard Nomenclature Where Possible
  • no Encodings in Names (No prefixes, no type/scope information)

Understandability

  • maintain Consistency
  • Use Explanatory Variables
  • Prefer Dedicated Value Objects to Primitive Types (path type , instead of String , enclosing class)
  • Poorly Written Comment
  • Obscured Intent (Too dense algorithms that lose all expressiveness.)
  • Obvious Behaviour Is Unimplemented
  • Hidden Logical Dependency

Methods

  • Methods Should Do One Thing
  • Methods Should perform what is described by the name of the function.
  • Avoid Method with Too Many Input Arguments
  • Avoid Method with Too Many output Arguments (return object instead)
  • Avoid Selector / Flag Arguments
  • Avoid Inappropriate Static

Source Code Structure

  • Variables and methods should be defined close to where they are used.
  • Local variables should be declared just above their first usage ,depending on scope
  • Nesting (should be more specific)
  • Structure Code into Namespaces by Feature
  • same feature together. Eg :A feature may use another feature; a business feature may use a core feature like logging

Useless Stuff

  • Avoid Dead Comment, Code
  • Avoid Clutter
  • Inappropriate Information

Maintainability Killers

  • Avoid Duplication
  • Magic Numbers / Strings (Replace Magic Numbers and Strings with named constants)

Exception Handling

  • Catch Specific Exceptions
  • Catch Where You Can React in a Meaningful Way
  • Use Exceptions instead of Return Codes or null
  • Fail Fast
  • Avoid Using Exceptions for Control Flow
  • Avoid Swallowing Exceptions

Sunday, January 9, 2022

Azure Django-postgres Application

Azure Django-postgres Application


postgres-Django

Ref:Tutorial

Azure App Service

Create App

    git clone https://github.com/Azure-Samples/djangoapp
    cd djangoapp
    az webapp up \
    --resource-group DjangoPostgres-tutorial-rg \
    --location centralus \
    --plan DjangoPostgres-tutorial-plan \
    --sku F1 \
    --name djangopostgresapp

Add extension to db extension to CLI

    az extension add --name db-up   #install db-up (ref- https://gist.github.com/j-thepac/b0ab1196585cf8b9a8478e4a3c8b6aae)

Create a new Postgres instance

    az postgres up \
    --resource-group DjangoPostgres \
    --location centralus \
    --sku-name B_Gen5_1 \
    --server-name djangopostgresserver \
    --database-name pollsdb \
    --admin-user sqlusr \
    --admin-password Azure@123 \
    --ssl-enforcement Enabled

Note down DB details from console

Connect App and DB

    az webapp config appsettings set \
    --name djangopostgresapp \
    --settings DBHOST="djangopostgresserver" DBUSER="sqlusr" DBPASS="Azure@123" DBNAME="pollsdb" 

DB Migration

    az webapp ssh #open ssh 
    python manage.py migrate
    python manage.py createsuperuser # usn=root/password Pollsdb1
    #skip email 
    exit

Test :

    az webapp browse

open https://djangopostgresapp.azurewebsites.net/ (http://.azurewebsites.net/admin)

Update python file and redeploy

    Make any changes to python file
    az webapp up
    az webapp log tail

group delete

    az group delete --name DjangoPostgres --no-wait
    az group delete --name DjangoPostgres-tutorial-rg --no-wait

Saturday, January 8, 2022

Azure Errors and Solution

 Azure Errors and Solution


az extension add --name db-up --debug

Error:

Building wheel for pymssql (PEP 517): finished with status 'error'
  ERROR: Failed building wheel for pymssql
Failed to build pymssql
ERROR: Could not build wheels for pymssql which use PEP 517 and cannot be installed directly
Exception information:

pip._internal.exceptions.InstallationError: Could not build wheels for pymssql which use PEP 517 and cannot be installed directly

Environment :

  • macOS Monterey
  • Python 3.6

Ref

Stackoverflow Github

Solution

brew install freetds openssl
export LDFLAGS="-L/usr/local/opt/freetds/lib -L/usr/local/opt/openssl@3/lib"
export CFLAGS="-I/usr/local/opt/freetds/include"
export CPPFLAGS="-I/usr/local/opt/openssl@3/include"
az extension add --name db-up --debug

Azure Linux VM

Azure Linux VM Connection

pre-Req: 

  1. For Mac Users
  2. Make sure linux machine is created

Steps:

  1. Open Linux Machine Panel
  2. Go to Settings > Connect > SSH
  3. Click on other Clients
  4. From the new page opened , Copy Create an SSH key Pair "ssh-keygen -m PEM -t rsa -b 4096"
  5. Run "ssh-keygen -m PEM -t rsa -b 4096" in Terminal
  6. Click Enter all the times

Testing Connection

  1. In the previous page, enter the path of the id_rsa.pub Eg : "~/.ssh/id_rsa.pub" or "(if you have added the file in a new folder) ~/.ssh/vm_linux/pushpa/id_rsa.pub"
  2. click on "test your connection"
  3. Your unsaved edits will be discarded, click Yes
  4. Click on Test Connection
  5. this should pass

Connection:

  1. Settings > Connect > SSH 
  2. copy the run example ie ., ssh -i <private key path> pushpaxxxxx@1xx.xx.1xx.xx
  3. Replace the private key path with the id_rsa which was created
  4. Make sure you give the azure account password when asked

Install Python3

  1. sudo su
  2. <azure password>
  3. yum install python3
  4. pip3 install requests

Monday, January 3, 2022

Azure Static WebPage

 Azure Static WebPage


Azure Static page:

Azure App Service

git clone https://github.com/Azure-Samples/html-docs-hello-world.git
cd html-docs-hello-world
az webapp up --location westeurope --name <app_name> --html
go to the app URL: http://<app_name>.azurewebsites.net

Update and redeploy the app

az webapp up --location westeurope --name <app_name> --html

Clean up

az group delete --name appsvc_rg_Windows_westeurope

Reference

https://docs.microsoft.com/en-us/azure/app-service/quickstart-html

Azure API

Azure API


Create Webapp using Flask

Azure App Service

Pre-Req

  • Make sure you have an account created for Azure
  • You have enabled Free Subscription
  • You have created a webapp instance manually, example "flaskserverpushparanimptest2"

Setup Env

$brew update && brew install azure-cli
$az upgrade
$az version
$az login

Code Flask Api

$git clone https://github.com/Azure-Samples/python-docs-hello-world
$cd python-docs-hello-world
$python3 -m venv .venv
$source .venv/bin/activate
$pip install -r requirements.txt
$flask run
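The cloned sample is essentially a minimal Flask app along these lines (a sketch, not the repo's exact code; file name app.py assumed):

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello, World!"

if __name__ == "__main__":
    app.run()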

Test locally

test locally in the browser : http://localhost:5000/ , Ctrl+C to quit

Deploy to Azure

$az webapp up --sku F1 --name flaskserverpushparanimptest2 #sku is the pricing tier; F1 is free

Test Azure Web

open http://flaskserverpushparanimptest2.azurewebsites.net

Add new enhancement for your file and run below code

    $az webapp up

Azure Cosmos DB

 Azure Cosmos DB 


"""
https://docs.microsoft.com/en-us/azure/cosmos-db/sql/create-sql-api-python
https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/cosmos/azure-cosmos/samples/examples.py
Pre-Req:
pip install --pre azure-cosmos
Have a Azure Account with Free Subscribption Enabled
Make sure Cosmosa DB is created along with DB and Container
#Azure portal > Azure Cosmos > ck on Existing Cosmos Db > Settings > Keys
to get Primary key and endpoint
"""
from azure.cosmos import exceptions, CosmosClient, PartitionKey
import uuid
import json
endpoint = "GET KEYS FROM : Azure portal > Azure Cosmos > ck on Existing Cosmos Db > Settings > Keys"
key = 'GET KEYS FROM : Azure portal > Azure Cosmos > ck on Existing Cosmos Db > Settings > Keys'
jsondata= {"id": "3","category": "hobby","name": "TV","description": "Pick up TV"}
dbname="ToDoDatabase"
containerid="ToDoList"
query="SELECT * FROM ITEMS"
class CosmosDb():
    def __init__(self, endpoint, key, dbname, containerid):
        self.endpoint = endpoint
        self.key = key
        self.client = CosmosClient(endpoint, key)
        self.database = self.client.get_database_client(database=dbname)  # create_database_if_not_exists
        self.container = self.database.get_container_client(containerid)  # create_container_if_not_exists

    def create_item(self, jsondata):
        self.container.create_item(body=jsondata)

    def list_containers(self, database_name):
        database = self.client.get_database_client(database_name)
        return database.list_containers()

    def modify_item(self, item_name, partition, key, new_value):
        item = self.container.read_item(item_name, partition_key=partition)
        item[key] = new_value
        updated_item = self.container.upsert_item(item)

    def query_items(self, query: str):
        return self.container.query_items(query=query, enable_cross_partition_query=True)

cosmos = CosmosDb(endpoint, key, dbname, containerid)
# res = cosmos.list_containers(dbname)
res = cosmos.query_items("SELECT * FROM ITEMS")
for i in res: print(json.dumps(i, indent=True))

Saturday, November 20, 2021

Node js

NODE JS



/*
Node.js is a JavaScript runtime built on Chrome's V8 engine,
which runs JavaScript code in the backend.
Node.js uses non-blocking I/O ie., if a line takes more time, the CPU does not block and continues executing other lines.
npm - package manager for JS, similar to pip in Python.
Note: Never push the node_modules dependencies to Git as they are huge.
docs: all npm functions are at
- https://nodejs.org/dist/latest-v17.x/docs/api/
- https://nodejs.dev/learn/introduction-to-nodejs
*/
//ubuntu Linux Install
sudo apt update
sudo apt install nodejs
node -v // --version
sudo apt install npm
//Run Node
node file.js //execute file
node //start nodejs shell
//package create and install
npm init //create a project using npm
npm install express --save
npm install -g package_name //install globally
5 //5
5+5 //10
console.log("hi") //print
console.log(__filename) //
//variable
var a=1
const c =1
//Map
const map1 = new Map();
map1.set('a', 1);
console.log(map1.get("a"))
//List
var a = [1, "two", "three", 4];
if (a[0] == 1)
console.log("equal")
else
console.log("not")
//Loop
for (i = 0; i <= a.length - 1; i++) console.log(i)
a.forEach(i => console.log(i))
//function
function f(name){console.log(name)} //f("hi")
//IMPORT OTHER FILE
//library.js
d = { "name": "deepak", "age": 100 }
module.exports = { add, d}
function add(x, y) { return x + y;}
//index.js
const lib = require("./library.js")
console.log(lib.add(4, 2), lib.d);
//common js vs ESM
const library = require("library") //commonjs
import * as lib from "library" //ESM - only works when the project is converted to a module ("type": "module")
//IMPORT DEPENDENCY
const dep = require("os")
console.log(os.freemem())
//SYNC VS ASYN LIBRARIES
const fs = require("fs")
fs.writeFileSync("f.txt","data to be written")
fs.writeFile("f.txt","data to be written", () =>{console.log("done writing")}) // intimate user once data is written
//IMPORTATN LIBRARIES
path //folder path
fs //file
url
events
express //to create server alternate to "http" package
//EVENTS
const event = require('events');
class E extends event {}
const e = new E();
e.on('WaterFull', () => {
console.log('Please turn off the motor!');
setTimeout(() => {
console.log('Please turn off the motor! Its a gentle reminder');
}, 3000);
});
console.log("started")
e.emit('WaterFull'); //start event when you receive "Waterfull"
console.log("The script is still running")
// Result
started
Please turn off the motor!
The script is still running
Please turn off the motor! Its a gentle reminder
//EXPRESS - https://expressjs.com/en/starter/hello-world.html
const express = require('express')
const path = require('path')
const app = express()
const port = 3000
//http://localhost:3000/home/hi?id=5
app.get('/home/:param', (req, res) => {
res.send('Hello World!' + req.params.param + req.query.id)
})
app.get('/file', (req, res) => {
res.sendFile(path.join(__dirname, "file.html"))
})
app.post('/file', function(req, res) {
res.send(("Working"));
// console.log('req.body.name', req.body['submit']);
});
app.get('/json', (req, res) => {
res.json({ "name": "deepak" })
})
app.listen(port, () => {
console.log(`Example app listening at http://localhost:${port}`)
})

Friday, October 22, 2021

Scala Advanced

Scala Advanced

Generics

trait A
class B extends A
class C extends B
object LowerBoundGeneric extends App {
  class Test[A >: B](val x: A) //Can have of type A and B not C
  val temp = new B() // new C() = Fail
  val test: Test[B] = new Test[B](temp)
}
object CovariantGeneric extends App {
  class Test2[+A]{ def run[B >: A](element: B)=print("working") }
  val temp2 =new C() // new C() = Fail
  new Test2[B]().run(temp2)
}

Apply

//whereby the compiler converts f(a) into f.apply(a)
object Applytest extends App{
  class Foo(x: Int) { def apply(y: Int) =x+y}
  val f = new Foo(3)
  println(f(4))  // returns 7 (3 + 4)
}

Partial Function

/*
A function is f: X -> Y.
A partial function does not force f to map every element of X to an element of Y
ie., several partial functions can handle different elements in the same data set
new PartialFunction[input , output]
if "isDefinedAt" is true then "apply" is executed
orElse, andThen
 */
object Partialtest extends App{
  val sample = 1 to 5
    val isEven = new PartialFunction[Int, String] {
      def apply(x: Int) = x + " is even"
      def isDefinedAt(x: Int) = (x != 0 && x%2 == 0)
    }
  val isOdd: PartialFunction[Int, String] = {
       case x if x % 2 == 1 => x + " is odd"
    }
  val evenNumbers = sample map (isEven orElse isOdd)
  print(evenNumbers)
}

Companion Object

/*
Companion object and its class can access each other’s private members (fields and methods)
Have same name
Same file
 */
object CompanionTest extends App{
  class Person {var name = ""}
  object Person {
    def apply(name: String): Person = {
      var p = new Person()
      p.name = name
      p
    }
  }
  print(Person("Fred Flinstone").name) //Person.apply("Fred Flinstone").
}


Future

/*
Anything inside Future {}, is run in a different thread
Application’s main thread doesn’t stop for Future to Complete
Result of Future is always  Try types: Success or Failure
To make main thread wait scala.concurrent.Await.result(future,15.seconds) is used
isComplete , value ,map , collect
 */
object FutureTest extends App{
  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.util.{Failure, Success}
  val f1:Future[Int] = Future { Thread.sleep(1000); 21 + 21 }
  while(f1.isCompleted!=true){println("future operation completed ?? -  "+f1.isCompleted)}
  println(f1.value)
  val f2:Future[Int]=f1.map(i => i+1)
  f2.onComplete {
    case Success(value) => println(s"Got the callback, value = $value")
    case Failure(e) => e.printStackTrace
  }
}



Implicit

object ImplicitTest extends App{
  case class Person(name: String) {def greet = println(s"Hi, my name is $name")}
  implicit def fromStringToPerson(name: String) = Person(name)
  "Peter".greet
}

Thursday, October 14, 2021

IBMCLOUD

 IBMCLOUD

Index:

  1. Basics
  2. Pre-Req
  3. Free CommandLine Tool
  4. Create Free Application
  5. API Keys
  6. Getting oAuth Tokens
    1. Standalone
    2. Ibm CLI tool
  7. Create AI application
  8. Cloudant Database
    1. Fetch the Clouddant Document from API
  9. Functions
  10. API GateWay
  11. Simple ETL from COS to DB2
  12. Copy ETL using REST
  13. Run Spark Job on COS 

Basics

  • IAM = Shared Account
  • Provisioning= Create an App
  • Helm Charts = Add Addons to the Provisioned App
  • There are 3 Types of App
    • Classic Infrastructure  - For Individuals
    • IAM Managed Services - For Enterprise / Resource Groups 
    • Cloud Foundry - Open Source 

Pre-Req

  • open ibmcloud
  • create a free account
  • Login as directed

Free CommandLine with python3.8+

  • Login to ibmcloud
  • On the tool bar of Landing Page, Click on IBMCloud Shell
  • $python3

Create Free Application

  • Login to ibmcloud
  • click on Catalog
  • Search for Cloud Foundry
  • Click on Cloud Foundry Application > Click on Create
  • Add details : Resource , App Name etc.
  • Click on Create
  • Go to homepage > Resource List > Cloud Foundry App > Click on the app
  • Click on link Visit app URL

API Keys

Getting oAuth Tokens 


1) Standalone installer (https://cloud.ibm.com/docs/cli?topic=cli-getting-started)

  • Run $curl -fsSL https://clis.cloud.ibm.com/install/linux | sh #Linux
  • ibmcloud login #ibmcloud login --sso
  • ibmcloud iam oauth-tokens
  • copy the result
  • SET IAM_TOKEN=<paste here>
  • Use "Authorization: Bearer IAM_TOKEN"

2) IBMCLOUD CLI

  • Log in to IBM Cloud 
  • select Manage > Security > Platform API Keys.
  • Create an API key for your own personal identity, 
  • copy the value
  • Run below
    $curl -X POST 'https://iam.cloud.ibm.com/identity/token' \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -d 'grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=<MY_APIKEY>'/

Response :
 {
        "access_token": "eyJraWQiOiIyMDExxxxxxxxxxx
  • copy access token and use as below
  • Syntax-
    • Authorization: Bearer <access_token_value_here>. 
  • example-
    • Authorization: Bearer eyJraWQiOiIyMDE3MDgwOS0wMDoxxxxxxxxx
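The same token exchange can be done from Python with requests (a sketch; the API key value is a placeholder):

import requests

resp = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    data={
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": "<MY_APIKEY>",
    },
)
access_token = resp.json()["access_token"]
headers = {"Authorization": f"Bearer {access_token}"}   # reuse on subsequent API calls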

Create a AI Application - Language Translator

  • Login to ibmcloud
  • goto to Catalog
  • filter :Pricing plan=lite 
  • Category : Select AI /MAchine Learning
  • Click on Language Translator 
  • Create
  • Check on consent on Agreement 
  • Create
  • Copy the api-key . url  under : Language Translator >Service Credentials
  • Replace api-key and url  (More REST calls : Language Translator >GettingStarted)
curl -X POST --user "apikey:{apikey}" \
  --header "Content-Type: text/plain" \
  --data "Language Translator translates text from one language to another" \
  "{url}/v3/identify?version=2018-05-01"
  • open Ibmcloud Shell from the ibmcloud tool bar
  • Run the new Command

Cloudant Database 

  • Login to IBMCloud
  • Goto Catalog
  • Select and Create a Cloudant Instance
  • Open the Cloudant Instance provisioned from Resource List > Services and Software >Cloudant
  • Click on Manage > Launch Dashboard
  • Create Database > test >Click on Create
  • Open test DB > Design Document > New Doc > add new json key value 
eg:
{
  "_id": "ce9575de70477c932e222bf5b6bd7fea",
  "name": "deepak"
}
  • Click on Create Document

Lets fetch this document from API 

  • Under Cloudant page > Service Credentials > Create New Role > Manager > Add
  • Open the new Service Credentials created, note down apikey , url
  • Open ibmcli  from ibmcloud tool bar (https://cloud.ibm.com/docs/account?topic=account-iamtoken_from_apikey&interface=api)
  • $curl -X POST 'https://iam.cloud.ibm.com/identity/token' -H 'Content-Type: application/x-www-form-urlencoded' -d 'grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=<MY_APIKEY>'
  • Copy the Token generated
  • Run below commands
API_BEARER_TOKEN=<paste token here>
curl -H "Authorization: Bearer $API_BEARER_TOKEN" -X GET "{url}/test/{_id from cloudant}"

Other Api:

curl -H "Authorization: Bearer $API_BEARER_TOKEN" -X PUT /{db}" #Create DB
curl -H "Authorization: Bearer $API_BEARER_TOKEN" -X PUT /{db}/{doc_id}" Create Document
curl -H "Authorization: Bearer $API_BEARER_TOKEN" -X GET "{url}/test/{_id from cloudant}" #Read Document

Ref : 

https://cloud.ibm.com/docs/account?topic=account-iamtoken_from_apikey&interface=api
https://cloud.ibm.com/docs/Cloudant
https://cloud.ibm.com/apidocs/cloudant#getdocument

Functions

  • Login to IBMCloud
  • catalog > search and click Functions
  • Click on StartCreating
  • Select QuickStart templates > Hello World
  • select python3 > clk Deploy
Note:
To modify the python code: Functions > Actions > helloworld

Test1:

  • click Invoke:Result - {"greeting": "Hello stranger!"}
  • click Invoke with parameters:{"name":"deepak"}
  • click Invoke :Result- {"greeting": "Hello deepak!"}

Test2

  • Open ibmCloud Cli
  • curl -u xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  • -X POST https://eu-gb.functions.cloud.ibm.com/api/v1/namespaces/j.thepac%40gmail.com_dev/actions/hello-world/helloworld?blocking=true

Test3

Open ibmcloud CLI
$python3    #open python shell
import requests
url="https://eu-gb.functions.cloud.ibm.com/api/v1/namespaces/j.thepac%40gmail.com_dev/actions/hello-world/helloworld?blocking=true"
auth=("xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx","xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
data={"name":"deepak"}
r=requests.post(url,json=data,auth=auth,verify=False)
r.content

API GateWay (Proxy) :

You can create a proxy link for "https://eu-gb.functions.cloud.ibm.com/api/v1/namespaces/j.thepac%40gmail.com_dev/actions/hello-world/helloworld?blocking=true" by creating an API Gateway and providing the above URL.

Simple ETL from COS to DB2

Pre- Req:

DB2:

  • Make sure u have created a DB2 instance in IBMCLoud
  • Create a table in DB2 (do not insert any records)
  • CREATE TABLE table_name (col1 int, col2 varchar(255)); -- successfully created
  • In Db2 Ui > Data icon >  Tables 
  • Click on the scheme
  • check if the table is created
    • Test it
      • Syntax : Select * from scheme.table;
      • Example:Select * from DXC02390.table_name;
  • note down the Scheme name and table name
  • Click on about icon in DB2 UI 
  • Note down from "<crn ..........::>" 

Cloud Object Storage (COS):

  • Create a Cloud Object Storage (COS) instance in IBM Cloud
  • Create a Bucket
  • Add a parquet file with a schema similar to the above table (use Apache Spark to create the file locally and drag and drop)
  • select the uploaded parquet file > Object Details > copy Object SQL URL

Steps:

  • Create SQL Query instance in ibmcloud 
  • Run the below command to copy the data from COS to DB2
Syntax :
SELECT * FROM <Object SQL URL>  STORED AS PARQUET INTO crn:xxxxxxx:/scheme.table PARALLELISM 2

Example:
SELECT * FROM cos://jp-tok/cloud-object-storage-7d-cos-standard-gsi/test2Cols.parquet STORED AS PARQUET
INTO 
crn:v1:bluemix:public:dashdb-for-transactions:eu-gb:a/e31b7085afca4ab8b6ac9b1077cd8af9:9257e5bc-49f0-43a1-b776-f7a0ff41b2b6::/DXC02390.MONOREPO_POC PARALLELISM 2

Copy ETL using REST 

Pre-Req:  Simple ETL from COS to DB2

curl -X POST 'https://iam.cloud.ibm.com/identity/token' \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -d 'grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={Create APi Key from Manage >Access>Api keys}'

Copy Response Token and save it to 
API_TOKEN = "xxxxxx"
or 
SET API_TOKEN="xxxxxx"

Get Current Jobs

curl -XGET   \
--url "https://api.sql-query.cloud.ibm.com/v3/sql_jobs?type=batch&instance_crn=crn:v1:bluemix:public:sql-query:in-che:a/e31b7085afca4ab8b6ac9b1077cd8af9:29b693b9-b195-4549-a2b0-03c93a26e3d1::"  \
 -H "Accept: application/json"  \
 -H "Authorization: Bearer <API_TOKEN>" 

#type=batch or type=stream

#Copy from 1 parq to another
curl -XPOST  \
--url "https://api.sql-query.cloud.ibm.com/v3/sql_jobs?instance_crn=crn:v1:bluemix:public:sql-query:in-che:a/e31b7085afca4ab8b6ac9b1077cd8af9:29b693b9-b195-4549-a2b0-03c93a26e3d1::"  \
-H "Accept: application/json"  \
-H "Authorization:Bearer <API_TOKEN>"  \
-H "Content-Type: application/json"   \
-d '{"statement":"SELECT * FROM cos://jp-tok/cloud-object-storage-7d-cos-standard-gsi/test2Cols.parquet STORED AS PARQUET INTO cos://jp-tok/cloud-object-storage-7d-cos-standard-gsi/test2Cols_result"  }'

Run Spark Job on COS Data

  • login to IBMCLOUD
  • Goto Catalog > Search for Watson Studio
  • Agree to terms and conditions> Click on Create 
  • Click On next >Next > click Create Watson Studio
  • Click on Projects > New Project >Empty Project
  • Add to Project > Notebook 
  • Select Runtime > python (least configuration)
!pip -q install ibmcloudsql
import ibmcloudsql

cloud_api_key="Create api key from Manage"
sql_crn="crn of SQL Query Instance"
sql_cos_endpoint="cosendpoint of bucket/result_prefix"
query="right click on the COS parq file and click on SQL Query"

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint) 
#sqlClient =ibmcloud.sqlSQLQuery(my_ibmcloud_apikey, my_instance_crn)

res=sqlClient.run_sql(query)
  • You can create a job and run the notebook at a specific time and results can be seen in the Jobs tab.

Note :

  1. Any file you drag and drop in Notebook will automatically get saved into COS . 
  2. Click on insert code to add spark code to work on the Dataframe.


Ref:
  1. https://cloud.ibm.com/docs/sql-query
  2. https://medium.com/codait/analyzing-data-with-ibm-cloud-sql-query-bc53566a59f5
  3. https://cloud.ibm.com/docs/sql-query?topic=sql-query-data-transport-automation-to-db2-on-cloud
  4. https://www.ibm.com/cloud/blog/announcements/automate-serverless-data-pipelines-for-your-data-warehouse-or-data-lakes
  5. https://dataplatform.cloud.ibm.com/exchange/public/entry/view/4a9bb1c816fb1e0f31fec5d580e4e14d
  6. https://cloud.ibm.com/docs/sql-query?topic=sql-query-sql-reference
  7. https://video.ibm.com/playlist/633112 #https://www.youtube.com/watch?v=s-FznfHJpoU
  8. https://cloud.ibm.com/apidocs/sql-query-v3#introduction #REST
  9. https://cloud.ibm.com/apidocs/db2-on-cloud/db2-on-cloud-v4
  10. https://video.ibm.com/playlist/633075 #jupyter notebook
  11. https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-working-with-sql#running-spark-sql-with-scala
  12. https://github.com/IBM-Cloud/sql-query-clients
  13. https://github.com/IBM-Cloud/sql-query-clients/tree/master/Python

Monday, October 11, 2021

Bazel

 Creating Bazel Project 

Fast Step up Guide

1.  make sure bazel is installed in your computer 
2.  create a new folder as Project
3.  cd inside the project folder
4.  create a new "WORKSPACE" file
5.  create python/Program folder
6.  cd to Program
7.  Create a new file BUILD file:

    package(default_visibility = ["//visibility:public"])
    py_binary(
        name = 'hello', #anyname
        main = 'hello.py', #reference path eg:  parentfolder.file
        srcs= ['hello.py'], #filename 
    )

8.  $echo "print('hi')" > hello.py 
9.  make sure ur in folder containing BUILD file
10. $bazel run hello

Bazel has default settings for Python and Java ie., you can start with an empty WORKSPACE and run Python/Java source files.

Refer for other languages:
https://docs.bazel.build/versions/4.2.1/rules.html

Other Languages (example scala):


You need to configure workspace - http_archive , skylib , language specific rules , Maven .

  1. start with http_archive -support to download package from https
    1. load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
  2. Download skylib - support for shell command 
    1. skylib_version = "0.8.0"
      http_archive(
      name = "bazel_skylib",
      type = "tar.gz",
      url = "https://github.com/bazelbuild/bazel-skylib/releases/download/{}/bazel-skylib.{}.tar.gz".format (skylib_version, skylib_version),
      sha256 = "2ef429f5d7ce7111263289644d233707dba35e39696377ebab8b0bc701f7818e",
      )
  3. Load 
    1. rules_scala  : like scala_binary,scala_test  etc., to use in BUILD file
    2. scala_config : Config scala version
    3.  scala_register_toolchain  : For using the jar file built from 1 language as input to another
    4. scala repositories : to download default libraries for scala 
  4. Set maven as third party repo

IntelliJ Setup

1. Make sure intelliJ has bazel plugin installed
2. import above project as bazel project
3. create new
4. next ( if you already have a .ijwb/ folder created , make sure it is deleted)
5. done

Common Commands :

  • bazel build target #target can be name of build or //path of package:target
  • bazel run target
  • bazel test target
  • bazel coverage target
  • bazel query deps(target)
  • bazel fetch target
  • bazel version
  • bazel clean --expunge

Advantages:

  • Google product
  • Language independent
  • Platform Independent (MAc, Linux etc)
  • Hermetic ( builds exactly the same every time )
  • Cross-Language Dependencies  (a Python library can call a Java binary etc., )
  • Large Code base
  • Caches Dependencies
  • Parallel Builds
  • Enable Remote  (download Dependencies at remote)
  • Dependency Tree Feature
  • Query Dependencies

Cons : 

  • Network Dependency (1st time and new Dependency)
  • Enlist every Dependency (ie., if a dependency uses another dependency, it has to be declared)
  • Manually declaring all dependencies might cause version conflicts when one library uses one version and another library uses a different version

Features

  • Bazel follows python syntax
  • Workspace: a folder with a WORKSPACE file, also called a Bazel repo.
  • package : a folder inside the Bazel repo with a BUILD file. This folder contains src code files and other files
  • Target :Everything inside your packages can be considered target
  • Label:The nomenclature of a target is known as a label. It’s just a way to recognize different targets 
  • .bazelrc :Settings that are taken into account every time Bazel builds your project.
  • buildifier : Used to ensure that all your build files are formatted in a similar fashion

WORKSPACE file

Enlists all external repo the bazel repo is dependent on

 Example :

workspace(name="intro_to_bazel") #name of the workspace

#load("filename","method")
load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")
git_repository(
    name = "com_github_xxx",
    commit = "xxxxxxxxxxxxxxxx",
    remote = "https://github.com/xxx"
)

Rule Definition in WORKSPACE

  • Example:load("//foo/bar:file.bzl", "some_library")
  • This code will load the file foo/bar/file.bzl and add the some_library symbol to the environment. 
  • This can be used to load new rules, functions or constants (e.g. a string, a list, etc.).
  • *_binary rules build executable programs in a given language. After a build, the executable will reside in the build tool's binary output tree 
  • *_library rules specify separately-compiled modules in the given programming language

  • *_test rules are a specialization of a *_binary rule, used for automated testing

Note :

  • https://github.com/bazelbuild/examples/tree/main/java-tutorial
  • In this project WORKSPACE is empty because  Native rules ship with the Bazel binary and do not require a load statement. Native rules are available globally in BUILD files.
  • But for scala, python etc you need to include load statements in the WORKSPACE and use them in BUILD files

Steps:

  • Open link https://github.com/bazelbuild
  • select repos u need for creating ur project
  • Example: if you want to add "bazel-skylib" (provides functions, file paths, and data types in build files)

####### WORKSPACE ########

load("@bazel_skylib//:workspace.bzl", "bazel_skylib_workspace")
bazel_skylib_workspace()


BUILD File
load("@bazel_skylib//lib:paths.bzl", "paths")
load("@bazel_skylib//lib:shell.bzl", "shell")
p = paths.basename("foo.bar")
s = shell.quote(p)

  • Since scala does not directly ship with bazel, you need to include "rules_scala" from bazelbuild in the WORKSPACE
  • And Use scala_binary , scala_library scala_test etc., to build and test 

BUILD

  • Folder with BUILD is called Package
  • Contains rules. scala_binary, java_binary etc.,

Example:

common/BUILD 
scala_library(
    name = "common",
    srcs = glob(["*.scala"]),
    visibility = ["//visibility:public"],
)


source/BUILD
scala_binary(
    name = "eid",
    srcs = glob(["filename.scala"]),
    main_class = "com.company.project.filename",
    deps = [
        "//path/common",
    ]
)

  • xxx_library takes sources , targets and labels (ie., paths to other Bazel packages)
  • xxxx_library creates a library 
  • //packagename:target
    • // - root
    • packagename - name of the Builds
    • target - particular target inside a package

  • srcs dependencies :Files consumed directly by the rule or rules that output source files.
  • deps dependencies: Rule pointing to separately-compiled modules providing header files, symbols, libraries, data, etc.
  • data dependencies:A build target might need some data files to run correctly.

Query Dependencies 

  • bazel query "deps(//foo)"
  • bazel query "allpaths(//foo, third_party/...)"
  • bazel query --noimplicit_deps 'deps(//package:target)' --output graph | dot -Tpng > graph.png
#if you are already inside the package
  • bazel query --noimplicit_deps 'deps(target)' --output graph | dot -Tpng > graph.png 
  • bazel query --noimplicit_deps 'deps(microservice)' --output graph | dot -Tpng > graph.png
  • bazel query --noimplicit_deps 'deps(microservice)' --output graph > simplified_graph.in

  • bazel query 'foo/...' --output package # What packages exist beneath foo?
  • bazel query 'kind(rule, foo:*)' --output label_kind #What rules are defined in the foo package?
  • bazel query 'kind("generated file", //foo:*)' #What files are generated by rules in the foo package?
  • bazel query 'attr(generator_function, foo, //path/to/search/...)' #What targets are generated by starlark macro foo?
  • bazel query 'buildfiles(deps(//foo))' | cut -f1 -d: #What's the set of BUILD files needed to build //foo?
  • bazel query 'tests(//foo:smoke_tests)' #What are the individual tests that a test_suite expands to?
  • bazel query 'kind(cc_.*, tests(//foo:smoke_tests))' #Which of those are C++ tests?
  • bazel query 'attr(size, small, tests(//foo:smoke_tests))' #Which of those are small? Medium? Large?
  • bazel query 'filter("pa?t", kind(".*_test rule", //foo/...))' #What are the tests beneath foo that match a pattern?
  • bazel query path/to/file/bar.java --output=package #What package contains file path/to/file/bar.java?
  • bazel query path/to/file/bar.java #What is the build label for path/to/file/bar.java?
  • bazel query 'buildfiles(deps(//foo:foo))' --output package #What packages does foo depend on?
  • bazel query 'deps(foo/... except foo/contrib/...)' --output package #What packages does the foo tree depend on, excluding foo/contrib
  • bazel query 'kind(genproto, deps(bar/...))' #What genproto rules does bar depend upon
  • bazel query 'kind("source file", deps(//path/to/target/foo/...))' | grep java$ #What file dependencies exist
  • bazel query 'deps(//foo) except deps(//foo:foolib)' #What targets does //foo depend on that //foo:foolib does not?
  • bazel query 'somepath(bar/...,groups2/...:*)' #Why does bar depend on groups2

Rules 

Read output of build file in another build files

Ref:


Sunday, October 10, 2021

Java Package Names and Naming Convention:

 Java Package Names and Naming Convention:

  • If you're just doing personal projects where nobody else will use the code, then you can make any name .
  • Don't make up something that starts with com. or net. or other top-level domain though, because that would imply that you own the domain name (ie. using com.john as your package name just because your name happens to be John is not a good idea).
  • The domain-name-backwards convention is there to prevent name collisions. Two different companies with the same product name will have different namespaces so everything works fine.

Ref:

  • https://stackoverflow.com/a/292175
  • https://docs.oracle.com/javase/tutorial/java/package/namingpkgs.html
  • https://stackoverflow.com/a/6247924

Thursday, September 30, 2021

Scala - Requests

Scala - Requests 


link:

https://index.scala-lang.org/com-lihaoyi/requests-scala/requests/0.6.9?target=_3.x


"com.lihaoyi" %% "requests" % "0.6.5" // sbt
compile "com.lihaoyi:requests_2.12:0.6.5" //gradle
val r = requests.get("https://api.github.com/users/lihaoyi")

r.statusCode
// 200

r.headers("content-type")

Saturday, September 11, 2021

Scala -Java : Convert String to Json

Scala -Java : Convert String to Json 

Code (Maven / https://mvnrepository.com/artifact/com.google.code.gson/gson):

import com.google.gson.{Gson, JsonParser}
val json="""{"hello": "world", "age": 42}"""
 val parser:JsonParser= new JsonParser();
val res= parser.parse(json).getAsJsonObject()
println(res.get("hello")) // world

//read from file
//val path="/path/file.json"
//val lines = scala.io.Source.fromFile(path, "utf-8").getLines.mkString

Code 2 (Requires Lift Web - Maven):

import net.liftweb.json._
import net.liftweb.json.Serialization.write
case class Address(city: String, country: String)
case class Person(name: String, address: Address)
implicit val formats: Formats = DefaultFormats
print(write(Person("Sam", Address("NY", "US"))))

python : Setup VS Code

 python : Setup VS Code

Steps:

  • Make sure VS code is installed
  • Install python extension from Microsoft
  • run this in shell

import sys 

for pth in sys.path:print(pth)

  • Goto > Preferences>Settings >users 
  • Search for Python  > Python Path >add path "/usr/lib/python39/python.exe"
  • At the bottom you should see the interpreter name

Pylint

  • Ctrl+shift+P
  • Search for Python:Select Linter
  • Choose pylint and enable

Saturday, August 21, 2021

Bootstrap

 Bootstrap


Pre-Req:

  • download and install VSCode
  • Start with base html 
  • https://getbootstrap.com/docs/5.1/getting-started/introduction/#starter-template
  • Copy paste individual components from Bootstrap Document

Class

  • Used to add extra features(like color , size etc)
  • classes + spaces = add new property to the element
  • <button type="button" class="btn btn-secondary btn-sm">Primary</button>

ID : Used to identify the component

Image 

  • img src="https://source.unsplash.com/400x400/?code"
  • "https://source.unsplash.com/1200x300/?van"
  • If you have a file, place it in the same folder and use img src="1.jpg"
  • Don't put components inside other components Eg: a button inside a paragraph
  • Image Centering - <img src="..." class="rounded mx-auto d-block" alt="..."
  • https://getbootstrap.com/docs/4.0/content/images/

Image Background

<head>
<style>
.page-holder {
min-height: 100vh;
}
.bg-cover {
background-size: cover !important;
}
</style>
</head>

<div style="background: url(https://i.postimg.cc/ZnHTP71s/aircraft-airplane-boat-1575833.jpg)" class="page-holder bg-cover">
<div class="container py-5">
<header class="text-center text-white py-5">
<h1 class="display-4 font-weight-bold mb-4">Contact us</h1>
<p class="lead mb-0">Address</p>
</header>
</div>
</div>

django:

#make sure css is added
#file is inside static
settings.py :
STATIC_URL = '/static/'
STATICFILES_DIRS = ( os.path.join('static'), )

<div style="background: url('{% static 'file_inside_static.jpeg'%}')" class="page-holder bg-cover">


Color 

  •     <button type="button" class="bg-danger">name</button>
  •     bg-danger = Red
  •     bg-warning = Yellow
  •     bg-success = Green
  •     bg-primary = blue
  •     bg-light 

    text Color

  •     <p class="text-white bg-dark">.text-white</p>
  •     <p class="text-light">.text-black-50</p>
  •     <p class="text-white-50 bg-dark">.text-white-50</p>
  •     <p class="text-success">.text-success</p>

container

  •  <div class="container"></div>
  •  Bootstrap > Container > Grid
  •     Bootstrap gives max of 12 columns in a row
  •     If "col-3" = use 3 column to make 1 column out of 12 .So now user is left wtih 9 columns 

    Example :

    <div class="container">

        <div class="row bg-light">

            <div class="col-10 bg-danger"> 10 of 12</div>

             <div class="col-2 bg-danger"> remaining 2 0f 2</div>

        </div>

    </div>

  •     It can be configured for different mobile devices as well 
  •     if "col-md-10 col-sm-6" means in medium device use size 10 and small device use size 6

Margins , Spacing and Padding:

  •     Create a new container
  •     add "my-1" inside the class
  •     You can give margins from 0-5
  •     ml-* (BS4) / ms-* (BS5) = left margin
  • <div class ="row gx-5"> = horizontal gutters
  • <div class ="row gy-5"> = vertical gutter
  • <div class ="row g-5"> =horz+vert gutter
  • <div class="container">
      <div class="row row-cols-2 row-cols-lg-5 g-2 g-lg-3">
  • <div class ="row px-5"> =padding
  • https://getbootstrap.com/docs/5.1/utilities/vertical-align/
  • https://getbootstrap.com/docs/5.1/utilities/spacing/
  • https://getbootstrap.com/docs/5.1/utilities/position/

Using Examples (https://getbootstrap.com/docs/5.1/examples/):

  •     goto Examples in Bootstrap
  •     open any Example
  •     Inspect the element in Browser
  •     Add a new container in the html 
  •     Paste the element inside the container

 Centering

    add the following to the class: "d-flex justify-content-center align-items-center"

Sequencing:

add this to class "order-md-1" /"order-md-2"

Linking html files :

href="about.hml" <!--about.html is a file in the same folder-->

Login Box (Modal):

  • Pops up a message box on click on of the button
  • https://getbootstrap.com/docs/5.1/components/modal/#static-backdrop

ThumbNail

https://stackoverflow.com/a/8858052

Sunday, June 13, 2021

Important Resources

Important Resources

Open Applications

  • OSX - open is a universal opener in OSX
  • Linux - has xdg-open (open bashrc > add "alias open='xdg-open'" > save file > source ./bashrc)
  • Windows - use start

Online Virtual terminals

  • Google API Dashboard > Cloud Shell
  • Docker playground

ReadMe Editor

Text to Table

https://www.tablesgenerator.com/text_tables

Host website Free Server

Host website non Server

  • https://www.netlify.com/
  • github (make sure home page is index.html and linked pages have path "/repo_name/about.html") and (repo>settings >pages>select branch and theme)

Image Hosting and server:

  • Gdrive
  • https://imgbb.com/
  • Github >Open any Repo > Create Issue > Drag and Drop Image > Copy the Link created

Banner / Header

Icons

Images:

Image from URL:

Method 1

Method 2

  • Search image in Google images
  • open source website where u have the image
  • Browser > Inspect
  • Copy the url

Free BootStrap Code sniffet

Free Templates ( Resume , Poster , Letterhead etc., )

Free GIFs:

Free Api

Free Animation:

Free CSS:

https://animista.net/

Mobile View of Websites:

https://ready.mobi/

Kafka Online

https://customer.cloudkarafka.com/login

DB Client

https://hub.docker.com/_/adminer

Thanks to :

https://www.youtube.com/watch?v=qEw8qlRX05A&t=28s

Sunday, May 16, 2021

Kubernetes : Basics


Kubernetes : Basics


Kubernetes

Problem with VM:

  • Virtual machines come with unwanted apps and services, Eg: a browser
  • All processes / threads were not in the user's control
  • Memory allocation was not in the user's control
  • Syncing virtual machines is not easy
  • A lot of configuration was needed
  • Deleting and reconfiguring was a nightmare

Kubernetes

  • configure, automate, and manage applications using containers
  • Open Source
  • Facility for scheduling of containers
  • Creation, deletion, and movement of containers
  • Easy scaling of containers
  • Monitoring

kubernetes local :

Terminoloy :

- Cluster: A collection of nodes(Computer) represent a cluster.
- Pod:
	- Runs Container Image
	- unique ip address
	- Pods are the smallest unit in kubernetes
	- private, isolated network.
- Container Image : code 
- kubectl: Command creates a proxy ,forwarding communications into the cluster using API. 

SetUP (Project )

minikube start
minikube stop
kubectl create namespaces somename                                                                              
kubectl config set-context --current --namespace=somename

Basic 1 Pod :

kubectl run -it alpine --image=alpine   # Create a Pod
kubectl get pod
kubectl exec -it <pod-id> bash #exit
kubectl logs <podname>

Basic with Deployment with 2 Pods

# Create a Deployment with 2 pods
kubectl create deployment mydep --image=nginx --replicas=2  
		#configmap(env) , secret , cronJob
kubectl get deployment
kubectl describe deployment mydep

# Create a service to expose deployment
kubectl expose deployment mydep --port=80 --type=LoadBalancer
kubectl get services
kubectl describe service mydep

#Port forward to outside world
kubectl port-forward svc/mydep 8080:80&
#open browser > localhost:8080

kubectl get deployment mydep -o yaml >> mydep.yml
kubectl get service mydep -o yaml >> mydepsvc.yml

kubectl delete deployment mydep
kubectl delete service mydep

kubectl apply -f mydep.yml
kubectl apply -f mydepsvc.yml
kubectl port-forward svc/mydepsvc 8080:80&

Env Variables

kubectl create configmap env --from-literal=key=1
kubectl get configmap env -o yaml
kubectl get configmap env -o=jsonpath='{.data.key}'

Secret

kubectl create secret my-secret literal-token --from-literal pass="123"
kubectl get secret my-secret -o=jsonpath='{.data.pass}' | base64 --decode

CronJob /Job

kubectl create cronjob hello --image=busybox:1.28  --schedule="*/1 * * * *" -- echo "Hello World"
kubectl create job hello --image=busybox:1.28 -- echo "Hello World"

Export Yaml kubectl create deployment html --image=nginx --dry-run=client -o yaml > deployment.yaml # Volume Mount

Stateful Vs Stateless

Stateful

  • All Databases ie., Relational and Non-Relational
  • All replicas are synced (maintain Data Consistency)
  • Mandatory Persistant Volumes ( Fault Tolerance)
  • Have Serial ID alloted from 0 to n
  • Complex to create
  • Not recommended to use databases in Kubernetes*
  • Uses Master and Slave :
    • Master is id-0
    • only one allowed to change data
    • Data gets transfered from id-0 > id-1> id-2 ....

Stateless:

  • Eg: Rest Api Server
  • Not Synced
  • Not mandatory Persistant Volumes

YML

	apiVersion: v1
	kind: Pod
	metadata:
	  name: myapp
	spec:
	  containers:
	    - name: myapp
	      image: dockerimage
	      command: [ "env" ] # print env var
  • SERVICE Git:
    • Service /portForward (Testing only)
    • ClusterIp (Within Cluster)
    • Loadbalancer (Commonly Used to expose Application)
    • Ingress

Simple Service/port-forward (Used for Testing Only - )

apiVersion: v1
kind: Service
metadata:
  name: newservice
spec:
  selector:
    app: targetappname
  ports:
  - port: 8085
    targetPort: 8085
    name: http-test

#kubectl get service
#kubectl port-forward svc/newservice 8085:80
#kubectl delete service newservice

ClusterIP - Within Cluster Only http://localhost:8080/api/v1/proxy/namespaces/<NAMESPACE>/services/<SERVICE-NAME>:<PORT-NAME>/

apiVersion: v1
kind: Service
metadata:  
  name: clusterservice
spec:
  selector:    
    app: targetappname
  type: ClusterIP
  ports:  
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP

#kubectl proxy --port=8080

LoadBalancer - load balancer functionality - Only Used in Cloud Services ( AWS, GCP ,Azure ..) - Each service needs new LoadBalancer and new IP address.

apiVersion: v1
kind: Service
metadata:
  name: myloadbalancer
spec:
  type: LoadBalancer
  selector:
    app: targetappname
  ports:
    - port: 8000
      targetPort: app_port

Ingress (Similar to port-forward / Simple Service)
- Ingress is an object that allows access to your Kubernetes services from outside
- Needs an ingress controller running
- Ingress controller ( will not run by default )
- Ingress Resource

			kubectl get pods --all-namespaces | grep ingress
			kubectl get service --all-namespaces | grep ingress
			kubectl get Ingress ingress-name
			kubectl delete ingress ingress-name

ConfigMap (Environment variables)

apiVersion: v1
kind: ConfigMap
metadata:
  name: myconfig
data:
  TEST_ENV: test
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: busybox
      image: k8s.gcr.io/busybox
      command: [ "env" ] # print env var
      envFrom:
        - configMapRef:
            name: myconfig

#kubectl apply -f configmap.yaml
#kubectl logs myapp
#kubectl delete -f configmap.yml

Secret

apiVersion: v1
kind: Secret
metadata:
  name: my-secret
type: Opaque
#$echo "test" | base64
data:
  testpassword: dGVzdAo=
---
apiVersion: v1
kind: Pod
metadata:
  name: env-pod
spec:
  containers:
    - name: test
      image: alpine
      command: ['env']
      env:
        - name: USER
          valueFrom:
            secretKeyRef:
              name: my-secret
              key: testpassword
		              
#kubectl describe secrets
#kubectl get secret my-secret -o jsonpath='{.data}'
#echo '[encoded-value]' | base64 --decode
#kubectl delete secret my-secret           

Volume( A Directory accessible to all containers running in a pod.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: test-file
data:
  data.csv: |
    name,age
    sam,1,
    tom,2
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: Never
  containers:
    - name: busybox
      image: k8s.gcr.io/busybox
      command: [ "sleep", "3600" ]
      volumeMounts:
        - name: config
          mountPath: "/datadir"
          readOnly: true
  volumes:
    - name: config
      configMap:
        name: test-file

#kubectl apply -f configmap-file.yaml
#kubectl get pod
#kubectl exec -it test-pod sh
# $cat datadir/data.csv
#kubectl delete -f configmap-file.yaml

Create vs Apply :

Create new Only (Cannot give pod in command) :

-	kubectl create -f manifest.yml	#create new Pod  

Create new / Overwrite existing x 3 :

-   kubectl apply -f manifest.yaml   
-   kubectl apply -f https://git.io/vPieo
-   cat <<EOF | kubectl apply -f -
        apiVersion: v1
        kind: Pod
        metadata:
        name: busybox-sleep
        spec:
        containers:
        - name: busybox
            image: busybox:1.28
            args:
            - sleep
            - "1000000"
        EOF

Examples

Pre-Req:

  • $minikube
  • make sure Docker application is running
  • run "eval $(minikube docker-env)" #make use of docker in minikube Docker Tutorial

Example Online:

labs.play-with-docker.com

  • Remove Taint :

          kubectl get nodes //copy node name
          kubectl describe node node1| grep -i taint      //copy name 
          kubectl taint node nodename pasteabove-
    
  • Create namespace:

          kubectl create namespaces somename                                                                              
          kubectl config set-context --current --namespace=somename
    

Example : Using hi.yml file (create)

apiVersion: v1
kind: Pod
# Pod / Job
metadata:
 name: hi
spec:
 containers:
   - name: hi
     image: ubuntu
     command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
     # imagePullPolicy: Never

kubectl create -f hi.yml
kubectl exec -it hi -- /bin/bash

Activities

Activity : Pull local Docker image into minikube

Create a file with name Dockerfile
Add below lines :
		FROM alpine:3.4
		RUN apk update
		RUN apk add vim 
		RUN apk add curl
open new terminal
minikube start
eval $(minikube docker-env)
docker build -t foo:1.0 . 
docker images #Check if foo is created
kubectl run foo -it --image=foo:1.0 
		-   $cat /proc/version
		-   $exit
kubectl get pods
kubectl get deployments   # empty on recent kubectl; older versions of kubectl run created a Deployment
kubectl delete pod foo    # on older versions: kubectl delete deployment foo

Activity : Create a "Hello world" Python program, push it to Docker Hub, and pull it from Kubernetes

Pre-Req:

  • Linux machine
  • Go to Docker Hub and create an account
  • Log in to Docker Hub and create a repository
  • Restart your system and open the BIOS
  • Enable VT-x / virtualization
  • Turn on the Docker application and make sure it is running
  • Create a folder called apps
  • Create 3 files inside it (deployment.yaml, main.py, Dockerfile)

Create deployment.yaml

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: whatever
      image: index.docker.io/usn/repo_name:latest
      imagePullPolicy: Always
  imagePullSecrets:
    - name: my_registry

main.py

print('hi')  

Dockerfile" (no extension)

FROM python:3.7
RUN mkdir /app
WORKDIR /app
ADD . /app/
EXPOSE 5000
CMD ["python","-u", "/app/main.py"]

Steps:

-   cd into apps
-   sudo docker images #empty table
-   sudo docker build -t any_name:v1 .  # note the '.' at the end
-   sudo docker images
-   sudo docker run -p 4000:80 any_name:v1
-   sudo docker images  # note down the image id/name; the tag is usually latest
-   sudo docker login
-   sudo docker tag TAG_id usn/repo_name:TAG_NAME_of_image
#docker tag 3a4677d31cde usn/test_repo:latest
-   sudo docker push usn/repo_name:TAG_NAME_of_image
#docker push usn/repo:latest
-   kubectl apply -f deployment.yaml #pull image from docker hub & create pod
-   kubectl logs -f foo #hi
-   kubectl get pods #shows all pods
-   kubectl delete pod pod_name #to delete pod

#Status = CrashLoopBackOff: the program has only one print statement, so the container exits right after printing "hi". Kubernetes keeps restarting it, and after several restarts the pod status shows CrashLoopBackOff (see the sketch below for a variant that stays alive).
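A hypothetical variant of main.py that avoids the CrashLoopBackOff by keeping the process alive after printing:

import time

print("hi", flush=True)

# Keep the container's main process running; replace with a real server/work loop.
while True:
    time.sleep(60)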

Activity : Send input to a running container from a different terminal

kubectl attach redis-container -i

Activity :Forward ports from Pods to your local machine

kubectl port-forward redis-izl09 6379
telnet localhost 6379 	# or: nc localhost 6379
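Instead of telnet, a quick Python check of the forwarded port (assumes the port-forward above is running and the pod is Redis): send an inline PING and expect +PONG.

import socket

with socket.create_connection(("localhost", 6379), timeout=5) as s:
    s.sendall(b"PING\r\n")      # Redis inline command
    print(s.recv(64))           # b'+PONG\r\n'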

Ref: https://kubernetesbyexample.com/ https://training.play-with-docker.com/ops-s1-hello/

Saturday, May 15, 2021

Django - Simple Registration Form

Django - Simple Registration Form 

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>

<p>Register</p>

<form method="POST" action="">
{% csrf_token %}
{{form.username.label}}
{{form.username}}
{{form.email.label}}
{{form.email}}
{{form.password1.label}}
{{form.password1}}
{{form.password2.label}}
{{form.password2}}
<input type="submit" name="new_user">
</form>

</body>
</html>
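The template above only renders the form. A minimal sketch of a matching form and view, assuming a custom RegisterForm built on Django's UserCreationForm with an added email field (names, template name, and URL name here are illustrative):

from django import forms
from django.contrib.auth.forms import UserCreationForm
from django.contrib.auth.models import User
from django.shortcuts import render, redirect


class RegisterForm(UserCreationForm):
    email = forms.EmailField()

    class Meta:
        model = User
        fields = ['username', 'email', 'password1', 'password2']


def register(request):
    form = RegisterForm(request.POST or None)
    if request.method == 'POST' and form.is_valid():
        form.save()                 # creates the user
        return redirect('login')    # assumes a URL named 'login' exists
    return render(request, 'register.html', {'form': form})  # template shown above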

Monday, March 22, 2021

CORRELATED SUBQUERIES


SubQuery :
A simple subquery doesn't use values from the outer query and is evaluated only once:

SELECT id, first_name FROM student_details
WHERE id IN (SELECT student_id FROM student_subjects
WHERE subject= 'Science');
 
Correlated Subquery :
The inner query references the outer query's row and is re-evaluated for each row of the outer query. Example: find all employees whose salary is above the average for their department.

SELECT employee_number, name FROM employees emp
WHERE salary > ( SELECT AVG(salary) FROM employees
WHERE department = emp.department);

Saturday, March 13, 2021

FastApi : Create Production ready Api

FastAPI : Create a production-ready API (faster than Flask)


https://fastapi.tiangolo.com/

Features

1. Asynchronous

2. High Performance

3. Less Code

4. Automatic data type and data model conversion (via Pydantic)

5. Auto Documentation

    - Swagger UI (/docs)

    - ReDoc (/redoc)


Pre-Req:

Install and activate a virtual environment to be safe.

Steps:

  1. pip install fastapi
  2. pip install hypercorn #server
  3. touch main.py
  4. copy paste below code
  5. hypercorn main:app --reload


from fastapi import FastAPI
from pydantic import BaseModel  # request/response model
# import requests

app = FastAPI()
db = []

class Person(BaseModel):
    name: str

@app.get('/')
def index():
    return {'person': 'name'}

@app.get('/persons')
def get_persons():
    results = []
    for person in db:
        results.append(person)
    return results

@app.get('/persons/{person_id}')
def get_person(person_id: int):
    return db[person_id - 1]

@app.post('/persons')
def create_person(person: Person):
    db.append(person)
    return db[-1]

@app.delete('/persons/{person_id}')
def delete_person(person_id: int):
    db.pop(person_id - 1)
    return {}
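A quick way to exercise the endpoints once the server is running; a sketch using httpx, assuming hypercorn is serving on its default 127.0.0.1:8000 (adjust the base URL if yours differs):

import httpx

base = "http://127.0.0.1:8000"
print(httpx.post(f"{base}/persons", json={"name": "sam"}).json())  # {'name': 'sam'}
print(httpx.get(f"{base}/persons").json())                         # [{'name': 'sam'}]
print(httpx.get(f"{base}/persons/1").json())                       # {'name': 'sam'}
print(httpx.delete(f"{base}/persons/1").json())                    # {}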

Friday, February 5, 2021

Raspberry Pi Pico

Raspberry Pi Pico

Micro Python:

1. Open Source
2. Modules are cheaper 	

Circuit Python :

1. Fork of MicroPython
2. Owned by Adafruit
3. Compatible only with Adafruit modules
4. Expensive
5. No concurrency and no state sharing

Pre- Req:

  1. Download one of the firmware files below (MicroPython or CircuitPython)
    1. MicroPython (recommended) - MicroPython
    2. CircuitPython UF2 file from - https://circuitpython.org/board/raspberry_pi_pico/
  2. Hold the Raspberry Pi Pico's on-board BOOTSEL button
  3. Plug it into USB (or pull the RUN/Reset pin down to ground)
  4. It will appear as a USB disk drive
  5. Copy the firmware file onto the drive
  6. The drive is renamed for CircuitPython (CIRCUITPY); for MicroPython the drive simply disappears while the board reboots

Micro Python :

  1. Install and open the Thonny IDE
  2. The IDE should automatically detect the Pico

Shell

from machine import Pin
led = Pin("LED", Pin.OUT)
led.value(1)
led.value(0)

Code - save as main.py in "MicroPython" device

from machine import Pin
import time
led = Pin("LED", Pin.OUT)  # Pin(25, Pin.OUT)
for i in range(1, 5):
    print(i)
    led.value(1)
    time.sleep(1)  # Use 1 second instead of 10 seconds for better visibility
    led.value(0)
    time.sleep(1)

Circuit Python :

Configure the Mu Editor so that code can be seen running in real time: as soon as the file is saved, the result is reflected on the LEDs.

  1. sudo apt-get update
  2. sudo apt-get -f upgrade
  3. apt install libfuse2
  4. Download and Install Mu Editor

Run

  1. Copy paste below program into code.py
  2. Save the file in the device
  3. Open Mu Editor
  4. Should automatically recognise PICO and Opens code.py

Blink Program

import board
import time
from digitalio import DigitalInOut, Direction,Pull
led = DigitalInOut(board.LED)
led.direction = Direction.OUTPUT
#Connect LED between Pin1 ie GP0 and Pin 2
op = DigitalInOut(board.GP0)
op.direction = Direction.OUTPUT

while 1:
    if led.value==0: led.value= True
    elif led.value==1:led.value = False
    time.sleep(0.5)
    
    if op.value==0: op.value= True
    elif op.value==1:op.value = False
    time.sleep(0.5)

Input Switch

import time
import board
import digitalio

button = digitalio.DigitalInOut(board.GP0)
button.switch_to_input(pull=digitalio.Pull.UP)
while True:
    print(button.value)
    time.sleep(0.5)
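Combining the two snippets above, a small sketch that lights the LED while the button is pressed; it assumes a CircuitPython build that defines board.LED (on the Pico the on-board LED is otherwise board.GP25) and a button wired to GP0 as in the input-switch example:

import board
import digitalio

led = digitalio.DigitalInOut(board.LED)      # on-board LED, as in the blink program above
led.direction = digitalio.Direction.OUTPUT

button = digitalio.DigitalInOut(board.GP0)   # same wiring as the input-switch example
button.switch_to_input(pull=digitalio.Pull.UP)

while True:
    led.value = not button.value             # pull-up: pressed reads False, so invert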

https://learn.adafruit.com/getting-started-with-raspberry-pi-pico-circuitpython/blinky-and-a-button https://www.youtube.com/watch?v=nYA4PVljE4Q


Sunday, January 31, 2021

CIRCUITPYTHON :SEEEDUINO XIAO [SEED]

 CIRCUITPYTHON :SEEEDUINO XIAO

Steps:

 Uses SAMD21 Processor

1. Connect the Seeeduino XIAO to the PC using a USB Type-C cable
2. Short the RST pads together twice in quick succession using a jumper wire
3. Once done successfully, an Arduino drive appears
4. Go to the website -

https://circuitpython.org/board/seeeduino_xiao/

https://wiki.seeedstudio.com/Seeeduino-XIAO-CircuitPython/

5. Download the latest .UF2 file
6. Copy it into the drive
7. The drive will now be renamed CIRCUITPY
8. Create a file named code.py
9. Copy the code below into code.py (the same pattern works for any CircuitPython board)

import time
import board
from digitalio import DigitalInOut, Direction


led = DigitalInOut(board.D13)  # D13 is a built-in LED

# A1 - A10 can be used as well if you wire a separate LED with a 100 - 400 ohm
# resistor; refer below for the calculation
led.direction = Direction.OUTPUT

while True:
    led.value = True
    time.sleep(1)
    led.value = False
    time.sleep(1)

10. Save file
11. The LED should start blinking


A simple LED circuit consists of an LED and a resistor. The resistor limits the current being drawn and is called a current-limiting resistor. Without it, the LED would run at too high a voltage and draw too much current, which would instantly burn out the LED and likely also the GPIO port.

To calculate the resistor value we need to examine the specifications of the LED. Specifically we need to find the forward voltage (VF) and the forward current (IF). 

A regular red LED has a forward voltage (VF) of 1.7 V and a forward current (IF) of 20 mA. The output voltage of the IC is 3.3 V.

We can then calculate the resistor needed:


R = (V_supply - V_F) / I_F = (3.3 V - 1.7 V) / 20 mA = 80 Ω
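The same arithmetic as a quick Python check:

# Worked arithmetic for the current-limiting resistor (values from the text above)
v_supply = 3.3   # GPIO output voltage (V)
v_f = 1.7        # LED forward voltage (V)
i_f = 0.020      # LED forward current (A)

r = (v_supply - v_f) / i_f
print(r)  # 80.0 ohms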