Showing posts with label spark. Show all posts

Friday, July 10, 2020

Spark DataFrame : agg()


agg() performs aggregate operations over the columns of a DataFrame.

  • ds.agg(...) is shorthand for ds.groupBy().agg(...) — the whole DataFrame is treated as a single group.
  • agg() is a DataFrame method that accepts aggregate functions (sum, max, avg, ...) as arguments.
Example :

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df3: DataFrame = Seq(1.6, 7, 42).toDF("CURRENCY")
  .withColumn("c2", lit(8.6).cast(DoubleType))
  .withColumn("c3", lit(1).cast(DoubleType))

df3.show()
+--------+---+---+
|CURRENCY| c2| c3|
+--------+---+---+
|     1.6|8.6|1.0|
|     7.0|8.6|1.0|
|    42.0|8.6|1.0|
+--------+---+---+
df3.agg(max("CURRENCY")).show()
+-------------+
|max(CURRENCY)|
+-------------+
|         42.0|
+-------------+
println(df3.agg(max("CURRENCY")).collect()(0))
[42.0]
println(df3.agg(sum("CURRENCY")).collect()(0))
[50.6]
df3.agg(sum("CURRENCY"),sum("c2")).show()
+-------------+------------------+
|sum(CURRENCY)| sum(c2)|
+-------------+------------------+
| 50.6|25.799999999999997|
+-------------+------------------+
In the above example, in order to write .sum directly (e.g. df3.groupBy().sum(...)), that method has to exist — it is hardcoded in the API. Using .agg you can pass any aggregate function; sum("column") is just one of them.


Same as :
println(df3.groupBy().max("CURRENCY").collect()(0))
[42.0]
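The same agg(...) call also works after a real groupBy. A self-contained sketch (the column names and data here are illustrative, not from the post):

```scala
import org.apache.spark.sql.functions._

// in spark-shell the implicits for toDF are already in scope
val sales = Seq(("A", 10.0), ("A", 5.0), ("B", 7.5)).toDF("store", "amount")

// several aggregates in one agg(...) call, renamed with .as
val totals = sales.groupBy("store")
  .agg(sum("amount").as("total"), max("amount").as("biggest"))
totals.show()

// collect()(0) returns a Row; typed getters unwrap plain Scala values
val totalA = totals.where(col("store") === "A").collect()(0).getDouble(1)
println(totalA)
```

Here getDouble(1) reads the second column ("total") of the collected Row, so totalA is 10.0 + 5.0 = 15.0 as a plain Double rather than a Row.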



Monday, September 9, 2019

scala : Basic Setup for Scala and Spark

Basic Setup for Scala

Scala can be run using :
  1. Spark-Shell
  2. SBT

Spark-Shell : usually this is used as it provides Scala along with Spark capabilities to create RDDs, Datasets and DataFrames.

Installation for Linux
  1. java -version #check Java is installed
  2. sudo apt-get install scala
    1. scala -version #verify the install
    2. type $scala to open the REPL
    3. println("hi")
    4. :q //to quit
  3. goto - https://spark.apache.org/downloads.html
  4. Download the latest tar file
  5. tar -zxvf spark-2.0.2-bin-hadoop2.7.tgz
  6. cd spark-2.0.2-bin-hadoop2.7
    1. ./bin/spark-shell
    2. :q //to quit
    3. open ~/.bashrc
  7. add line SPARK_HOME=~/Downloads/spark-2.0.2-bin-hadoop2.7 (use the path of the version you downloaded)
  8. $export PATH=$SPARK_HOME/bin:$PATH
  9. $source ~/.bashrc
  10. $spark-shell
Installation for mac
  1. Run homebrew installer : $/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" #(visit homebrew for more details)
  2. $xcode-select --install
  3. $brew cask install java
  4. $brew install scala
  5. $brew install apache-spark
  6. $spark-shell # to start spark-shell
  7. $brew upgrade apache-spark #to Upgrade 
Start session 
$spark-shell

To Create a Dataframe:
:imports #lists the default spark-shell imports
#the 4th entry is
#import org.apache.spark.sql.functions._ and alongside it you can use
#import org.apache.spark.sql._
import org.apache.spark.sql._
val mockDF1: DataFrame = Seq((0, "A"), (1, "B"), (0, "C")).toDF("col1", "col2")
mockDF1.show()

To uninstall 
brew uninstall scala
--------------

SBT shell

Mac (via Homebrew)
  • open terminal
  • Enter "/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)""
  • After brew is installed successfully
  • Enter "brew install sbt@1"
  • To uninstall - "brew uninstall sbt@1"

Getting SBT Started in Terminal:
  1. Open Terminal
  2. enter "sbt"
  3. After the sbt shell is opened
  4. Open a Scala REPL session inside SBT using "console" or "consoleQuick"
  5. Type "println("helloworld")"
  6. To quit the REPL type ":q" or ":quit"
  7. And "exit" to exit the sbt shell
Note : the plain REPL only supports basic operations; anything that depends on external libraries needs those libraries on the classpath, which is what a build tool like sbt or bazel manages for you.
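For instance, the kind of dependency-free snippet that works in the plain REPL (a sketch):

```scala
// no external libraries needed — runs as-is in scala, or in sbt's "console"
val nums = List(1, 2, 3, 4)
val evens = nums.filter(_ % 2 == 0)
val total = nums.sum
println(evens)  // List(2, 4)
println(total)  // 10
```

Anything that needs, say, spark-sql on the classpath would fail here with "object apache is not a member of package org" until the jar is added.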

To add a library into your REPL session :
  1. Download the required jar file, e.g. https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.12/2.4.4
  2. Open the sbt shell and start "console"
  3. run cmd ":require spark-sql_2.12.jar" #prints Added - file.jar
Now the library can be imported:
import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}

Using SBT in IntelliJ :
  1. IntelliJ by default comes with sbt
  2. Install IntelliJ IDEA
  3. Create a new sbt project
  4. goto project /src/main/scala
  5. Create a new file "test1.scala"
  6. copy the code below
  7. set the Scala SDK to 2.11
  8. add the contents below into the build.sbt file
  9. Run it (right click > Run)
build.sbt:

name := "sbt_test"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0")

libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.0"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.4" % "test"
libraryDependencies += "org.mockito" % "mockito-all" % "1.8.4"
libraryDependencies += "org.scalamock" %% "scalamock" % "4.3.0" % "test"
libraryDependencies += "org.testng" % "testng" % "6.10"

//Basic Class:

object test1 extends App{
  println("Hello World")
}
or 
object test1 {
  def main(args: Array[String]): Unit = println("Hello, World!")
}
or

import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object test1 extends App {
  val spark = SparkSession.builder().master("local").appName("spark-shell").getOrCreate()
  import spark.implicits._
  val df1: DataFrame = Seq((1,1), (2,2), (3,3)).toDF("col1", "col2")
  df1.show()
}
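The test libraries in build.sbt (scalatest, scalamock, mockito) belong in suites under src/test/scala; even without them, results can be sanity-checked with plain assertions. A sketch extending the example above (test1Check is an illustrative name, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object test1Check extends App {
  val spark = SparkSession.builder().master("local").appName("spark-shell").getOrCreate()
  import spark.implicits._

  val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("col1", "col2")

  // plain assertions — no test framework needed for a quick check
  assert(df1.count() == 3)
  assert(df1.columns.sameElements(Array("col1", "col2")))

  spark.stop()
}
```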

Issues :

1. Error:scalac: Multiple 'scala-library*.jar' files (scala-library.jar, scala-library.jar, scala-library.jar) in Scala compiler classpath in Scala SDK scala-sdk-2.11.7

Solution:
File > Project_Structure > Libraries     
Remove "SBT:org.scala-lang:scala-library:2.11.8:jar"

2. Cannot resolve App
Solution:
Set the current Scala SDK to 2.11.8
--------------------------------------------------------------------------------------------------------
Issues :

1. Error:scalac: Multiple 'scala-library*.jar' files (scala-library.jar, scala-library.jar, scala-library.jar) in Scala compiler classpath in Scala SDK scala-sdk-2.11.7

Solution : Goto File > Project Structure > Platform Settings > SDKs > Remove Duplicate
Solution 2: Remove scalaVersion := "2.xx.xx" from the build.sbt file


2. Could not find or load main class in scala in intellij IDE
Solution: Right click on "src folder" and select Mark Directory as -> Sources root
--------------------------------------------------------------------------------------------------------
