Parallel Execution of R Scripts in Oracle DB

With Oracle R Enterprise it is possible to run R scripts inside an Oracle database. The idea is to use the performance of a database server to make running R code fast. We can also use it on an Oracle exadata database machine. We have than two ways on query data in parallel: On the one hand we can query data in parallel in the database (SQL parallel query) on the other hand  we can run more than one external R processes.  The blog below shows to run R processes parallel as external processes on DB server.  It depends on the R code, if it is possible to run more than one R process. E.g for scoring it should be possible. If you use R (ORE) from oracle you have this possibilities:

1_R_possibilities

  • Running R on your local client, using data out of oracle DB
  • Calculating R code inside oracle DB, if you use oracle R packages
  • Running R Code as an external server process out of DB Server

To get the best performance we take the way to run R code direct on DB Server. We start it out of DB and use, if it is possible, Oracle R packages (ORE). If we need other R functions, which are not implemented in ORE, we run it as external process on DB Server.

For the development we take this approach:

  • Running R Script with Data out of DB
  • Running inside Database, pulling data into script
  • Running inside DB, getting data out of cursor
  • Running inside DB, getting data in chunks out of cursor
  • Running out of SQL

First we start to make a small, very small testcase. The goal is to see the different data types used.

 

user = "rquser"
password = "rquser"
sid = "orcl"
host = "dbdev11"
port = "1521"
service_name = "orcl.ise-informatik.de"
library(ORE)
ore.connect(user,,host,password,port,service_name, all=TRUE)

##Test extproc for R
ore.doEval(function() { 123 })

##Create Tabel for tests
test <- ore.pull(iris)
ore.drop("IRIS_TABLE")
ore.create(test,table="IRIS_TABLE")
#Transperancy
IRIS <- ore.get("IRIS_TABLE")
sum <- (summary(IRIS))
#local
IRIS <- (ore.pull(IRIS_TABLE))
sum <- (summary(IRIS))

First we connect to our DB with R installed. Then we make a short test if extproc is running correct. At the end we create a small table out of iris data. At the end we do my test with summary with oracle transparency Layer and without.

SQL Script to have more rows:

--more data 150 -> 19 MIO for performance testing
SELECT COUNT(*)
FROM IRIS_TABLE;
BEGIN
FOR i IN 1..17
LOOP
INSERT INTO IRIS_TABLE
SELECT * FROM IRIS_TABLE;
END LOOP;
COMMIT;
END;
/
SELECT COUNT(*) FROM IRIS_TABLE;

 

With this SQL script we increase the data to about 19 Million rows. In my Laptop the following tests runs between 30s and 100s.

First Test: Script on DB

We run a simple script, calling one function with no paramters. We use the ORacle connection inside the DB and pull the data there.

##############################
## function to R load on server check data type
##############################

ore.scriptDrop("myFunctionDB1")
ore.scriptCreate ( "myFunctionDB1", function (){
  #we need this because we load data in R
  #db knows connection, if it start out of R client
  library(ORE)
  ore.sync()
  IRIS <- ore.get("IRIS_TABLE")
  res <- summary(IRIS)
  string1 <- class(IRIS)
  string2 <- class(res)
  string <- paste("Type of loaded data" , string1, string2, sep=" ")
  return (string)
}
)

###################
##run it
system.time(
  ore.doEval(FUN.NAME="myFunctionDB1",ore.connect=TRUE)
)
##################

We see that we use a ore.object and get only one string. In the string we find an ore.frame object, it is not a data.frame. It is an oracle type.

The output results are:

cclass(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
> res_local <- ore.pull(res)
> class(res_local)
[1] "character"
> View(res_local)

String:
Type of loaded data ore.frame table

Second Test:

Now we try the table apply function. The input of our function would be a table. It also can be a view.

##############################
## function to check data type
##############################
ore.scriptDrop("myFunctionDB2")
ore.scriptCreate ( "myFunctionDB2", function (dataset_pointer, direction,start,stop,conflevel,threshold){
   #library(ORE)
  # not needed, running with pointer
  IRIS <- dataset_pointer ##ore.get("IRIS_TABLE")
  string1 <- class(IRIS)
  res <- summary(IRIS)
  string2 <- class(res)
  string <- paste("Result Type", string1, string2, "PARAMETER", direction, sep=" ")

  return (string)
}
)

###############################
###run it normal, no parallel no partition key
### to find the table
ore.sync()

system.time(res <- ore.tableApply(IRIS_TABLE, FUN.NAME = "myFunctionDB2",
                                  direction='S',
                                  start="201101",
                                  stop="201311",
                                  conflevel=0.995,
                                  threshold=10))

The scripts run complete inside DB, we don’t need any seperate DB connection.

> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
> res_local <- ore.pull(res)
> class(res_local)
[1] "character"
> View(res_local)
Result Type data.frame table PARAMETER S

We see that we get an data.frame as the input data. We have no oracle object.

Third Test

###############################
###run it, parallel partition key
##

system.time(res <- ore.groupApply(IRIS_TABLE, IRIS_TABLE$Species, FUN.NAME = "myFunctionDB2",
                                  direction='S',
                                  start="201101",
                                  stop="201311",
                                  conflevel=0.995,
                                  threshold=10, parallel = TRUE))

Now we split the calculation based on partition key. We use the group apply function. The R base funtion doesn’t change. Now we start a number of extpro processes. The number depends on the number of different partition keys

class(res)
[1] "ore.list"
attr(,"package")
[1] "OREembed"
> res_local <- ore.pull(res)
> class(res_local)
[1] "list"
> View(res_local)

Result Type data.frame table PARAMETER S
Result Type data.frame table PARAMETER S
Result Type data.frame table PARAMETER S

We also have only a data.frame object. But at the end we have three results. One for every partition key.

Forth Test out of SQL:

To start a SQL query we first must create a package and function for our extrproc call. It is described in ORE documentation, how to do it.

CREATE OR REPLACE PACKAGE myDataType
AS
type rec
IS
  record
  (
    "Sepal.Length" BINARY_DOUBLE,
    "Sepal.Width" BINARY_DOUBLE,
    "Petal.Length" BINARY_DOUBLE,
    "Petal.Width" BINARY_DOUBLE,
    "Species" VARCHAR2(4000) );
type cur
IS
  ref
  CURSOR
    RETURN rec;
  END;
  ------
CREATE OR REPLACE FUNCTION myDataFunction(
  inp_cur myDataType.cur,
  par_cur sys_refcursor,
  out_qry VARCHAR2,
  grp_col VARCHAR2,
  exp_txt VARCHAR2)
RETURN sys.AnyDataSet pipelined parallel_enable(
  partition inp_cur BY hash(
    "Species" )) cluster inp_cur BY(
  "Species") USING rqsys.rqGroupEvalImpl;
------

This is the way we must implement a package and function in SQL. We call it with this SQL table function:

SELECT *
FROM TABLE(myDataFunction(CURSOR
  (SELECT * FROM IRIS_TABLE
  ),cursor(SELECT 'S' "direction",
  '201101' "start",
  '201311' "stop",
  0.995 "conflevel",
  10 "threshold"
FROM "SYS"."DUAL" "DUAL"),'XML','Species','myFunctionDB2'));

AS the result we get three XML blog entries. One look like this:

2_R_sql

<root><vector_obj>
<ROW-vector_obj>
<value>Result Type data.frame table PARAMETER S</value>
</ROW-vector_obj>
</vector_obj></root>

Conclusion:
We see we can query R scripts with SQL in the database. After this we can store it into a seperate table with a insert statement. This way is perfect, if we have R Scripts we can run in parallel on the data e.g. for scoring or prediction.

We can write scripts in R with e.g. RStudio and call it with SQL. All this scripts are supported on Oracle Exadata.

About these ads

2 thoughts on “Parallel Execution of R Scripts in Oracle DB

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s