Thursday, February 26, 2009

Oracle RAC - Recap

What is RAC

RAC High Availability

Transparent Application failover
  • Connection failover is very fast (within a few seconds)
  • SELECT failover is transparent as well; the query is replayed on the surviving node and resumes from where it left off
  • DML (insert, update, delete) and DDL (create table, alter table, …) failover is not supported; application-layer error handling is needed to make it transparent to the user (a sketch follows below)
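A minimal sketch of that application-layer handling in Java, assuming ORA-25408 ("can not safely replay call") as the error to trap (one of the errors raised when an in-flight call cannot be replayed after failover); the statement and the retry limit are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class FailoverRetry {

    // Re-issues a DML call if the session failed over mid-call. TAF rolls
    // back the open transaction, so the application must replay it itself.
    public static void updateWithRetry(Connection conn, int id) throws SQLException {
        String sql = "update sequences_smaller set t_value = temp_s.nextval where t_id = ?";
        for (int attempt = 1; ; attempt++) {
            try {
                PreparedStatement pstmt = conn.prepareStatement(sql);
                pstmt.setInt(1, id);
                pstmt.executeUpdate();
                conn.commit();
                pstmt.close();
                return;
            } catch (SQLException e) {
                if (e.getErrorCode() != 25408 || attempt >= 3) {
                    throw e; // not a failover error, or retries exhausted
                }
                conn.rollback(); // transaction was lost with the failed node; replay it
            }
        }
    }
}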
Rolling upgrade
  • Not available for all patches and upgrades
Management Complexity
  • More moving parts, more chances to fail (Cluster Ready Services daemon, Cluster Synchronization Services daemon, voting disks, Oracle Cluster Registry, virtual IP addresses, shared file system, to name a few)
  • Some database initialization parameter changes that can be made dynamically in a non-RAC configuration require a database bounce in RAC
RAC High Scalability

Query scalability
An IO-intensive query, such as a full table scan, is limited by disk IO; therefore adding RAC nodes will not scale well even though multiple query processes share the work. RAC can scale well with the number of connections if most reads are logical reads (memory access), but that is not the typical situation for an Oracle database.
OLTP scalability
OLTP scalability is 30% based on our testing. The limiting factors here are memory latches and inter-node memory data transfer. Even with multiple nodes, memory latches are shared to protect data integrity. In our testing, the inter-node connections were gigabit network cards and, because of latency, remote memory access (cache fusion) was about 300 times slower than local memory access, only a few times faster than disk access.
Load balance
Connections to a node are randomly assigned, weighted by current node load. This works well for most cases. A connection will not be redistributed to a lightly loaded node once it is connected to an instance.
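For reference, the "rac" alias used by the OCI test code later in this post resolves to a descriptor of roughly this shape (a sketch; the hosts and service name are from our test environment, and the TAF parameters are illustrative). LOAD_BALANCE spreads new connections across the listeners, and FAILOVER_MODE enables TAF, which works with the OCI driver but not the thin driver:

public class RacConnectDescriptor {

    // tnsnames.ora-style descriptor, shown here as a Java constant.
    // TYPE=SELECT replays in-flight queries; RETRIES/DELAY control reconnect attempts.
    static final String RAC_ALIAS =
        "(DESCRIPTION=" +
        "(LOAD_BALANCE=yes)(FAILOVER=on)" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=10.8.166.91)(PORT=1521))" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=10.8.166.92)(PORT=1521))" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=10.8.166.93)(PORT=1521))" +
        "(CONNECT_DATA=(SERVICE_NAME=rac.wyeth.com)" +
        "(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=3))))";
}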

RAC Cost of Ownership

A large SMP always costs more than a cluster of smaller servers with a similar number of CPUs and similar memory. A 32-CPU SMP costs $1.25 million; a UNIX server cluster with the same combined number of CPUs costs about a third of that, and a Linux cluster about a fifth. However, the RAC license is $20,000 per CPU, so unless you build a RAC with more than 8 combined CPUs, you are not going to save money. Oracle gives away the RAC license for Standard Edition with ASM storage, up to a 4-CPU configuration, which is very attractive if you want a high-availability system. For a large system, an SMP always performs better than a RAC with the same amount of CPU and memory. You get what you pay for.


RAC web Links

Oracle RAC handson Part 9 Cost of ownership

RAC cost of ownership

The price of a RAC license is $20,000 per CPU. The graph below shows the cost of server hardware plus RAC license against the number of CPUs. As the graph indicates, the break-even point is 8 CPUs: below 8 CPUs an SMP setup is less expensive; above 8 CPUs, RAC is less expensive. Because RAC is slow at accessing remote memory, an SMP system always performs better than a RAC with the same number of CPUs. To penetrate the low-end market, Oracle offers the RAC option for Standard Edition free of charge on systems with 4 CPUs or fewer.

Oracle RAC handson Part 8 RAC on ASM

Summary:

ASM normal redundancy works, but at half the IO speed of external redundancy. Adding or removing a disk from an ASM disk group automatically starts a disk rebalance, which cuts system IO to about half of its pre-rebalance rate, so the right way to add capacity is in small increments with small failure groups. We stepped on a few bugs related to ASM 10g Release 2. In terms of performance, ASM is the clear winner compared with the other file systems (OCFS, ext3): it is more than twice as fast as OCFS in the RAC configuration, and 60% and 100% faster than OCFS and ext3, respectively, in the non-RAC configuration.

Test Case 16:

Name: ASM fail over to failure group (ASM normal mirroring)

Purpose: Test ASM mirroring

Detail:

1. Set up ASM disk group and failure group

2. Query a large table with full table scan and record the time to completion (select distinct species from sequences;). It takes 74 seconds on average.

3. Query the table again, introduce/simulate a disk failure ([blade1:root@ca-test1 /]# /etc/init.d/oracleasm deletedisk DISK1), and record the time to completion. It again takes 74 seconds on average.

Result:

A mirrored disk failure is transparent to the database.
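To confirm which disks and failure groups survived, V$ASM_DISK (also visible from the database instance) can be queried; a minimal JDBC sketch, assuming the test account is allowed to read the V$ views:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AsmDiskStatus {

    // Lists each ASM disk with its failure group and mode status, so a
    // dropped or failed mirror member shows up immediately.
    public static void main(String[] args) throws Exception {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        Connection conn = DriverManager.getConnection("jdbc:oracle:oci:@rac", "test", "tt123");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "select name, failgroup, mode_status from v$asm_disk order by failgroup, name");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "  " + rs.getString(2) + "  " + rs.getString(3));
        }
        conn.close();
    }
}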

Test Case 17:

Name: Impact of ASM add/remove disk operation

Purpose: Test ASM dynamic disk swapping

Detail:

1. Set up ASM disk group with normal mirroring.

2. Query a large table with full table scan (select distinct species from sequences;) and record the time to completion.

3. Query the table again and, at the same time, add a disk to the disk group (ALTER DISKGROUP DISKGRP1 ADD FAILGROUP DISK5 DISK 'ORCL:DISK5' NAME DISK5 SIZE 102398M REBALANCE POWER 5;). Record the time to completion of the query.

4. Query the table again and, at the same time, drop the disk from the disk group (ALTER DISKGROUP DISKGRP1 DROP DISK DISK5 REBALANCE POWER 5;). Record the time to completion of the query.

Result:

The baseline query (with no add/drop disk activity) takes 57 seconds on average (56, 56, 59 seconds). With disk group rebalance activity, the query takes 106 seconds on average (103, 109, 106 seconds). By default, add/drop disk runs with rebalance power 1. If a disk has no failure group available, the operation hangs. Bug 4747535 and Bug 4772979 make the operation more complicated, so it is a good idea to keep failure groups small, such as 100 GB. There is one more bug in EM: when dropping a disk from a disk group, you can select the rebalance power level under the "Advanced Options" button, but it always uses the default (rebalance power 1) no matter what setting you select.
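Rebalance progress, including the power level actually in effect (which is how the EM bug above shows itself), can be watched in V$ASM_OPERATION on the ASM instance. A sketch, assuming a SYSDBA logon and an ASM instance SID of +ASM1 (both assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class RebalanceWatch {

    // Polls V$ASM_OPERATION until the rebalance finishes. POWER is the power
    // level actually in use; EST_MINUTES is the remaining-time estimate.
    public static void main(String[] args) throws Exception {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        Properties props = new Properties();
        props.put("user", "sys");
        props.put("password", "change_me");    // placeholder
        props.put("internal_logon", "sysdba"); // ASM accepts SYSDBA logons in 10g
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@10.8.166.91:1521:+ASM1", props);
        Statement stmt = conn.createStatement();
        while (true) {
            ResultSet rs = stmt.executeQuery(
                    "select operation, state, power, sofar, est_work, est_minutes from v$asm_operation");
            if (!rs.next()) break; // no rows: rebalance is done
            System.out.println(rs.getString(1) + " " + rs.getString(2)
                    + " power=" + rs.getInt(3)
                    + " " + rs.getInt(4) + "/" + rs.getInt(5)
                    + " est_min=" + rs.getInt(6));
            Thread.sleep(5000);
        }
        conn.close();
    }
}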


Test Case 18:

Name: Impact of ASM normal and external redundancy

Purpose: Test cost associated with ASM redundancy

Detail:

1. Set up ASM disk group with normal and external redundancy. (DISKGRP1 normal redundancy, DISKGRP2 external redundancy).

2. Query a large table with full table scan and record the time to completion on both the normal and external redundancy disk groups (select distinct species from sequences; select distinct species from sequences_e;).

Result:

Querying the table on the normal redundancy disk group takes 70 seconds on average (71, 69, 69 seconds). Querying the identical table on the external redundancy disk group takes 37 seconds on average (38, 36, 37 seconds), which is 53% of the time needed under normal redundancy.


Test Case 19:

Compare performance of ASM, OCFS2 and ext3.

Summary


  1. ASM is much faster than OCFS for the Oracle RAC configuration (1.34 times faster).
  2. ASM is faster than OCFS for the Oracle non-RAC configuration (0.59 times faster).
  3. OCFS is faster than ext3 for the non-RAC configuration (0.25 times faster).
  4. OCFS in the non-RAC configuration is much faster than in the RAC configuration (0.78 times faster).

Normalized IO comparison for Oracle Storage (time)

              ASM      OCFS     ext3
RAC           1.00     2.34     NA
NON-RAC       0.861    1.37     1.72

* Normalization is based on the time taken to read a fixed amount of data, with RAC-ASM as 1. The lower the number, the better the IO performance.


Details

database    storage    normalized IO (reads)    normalized IO (time)
RAC         ASM        1                        1
RAC         OCFS       0.429                    2.34
NORAC       ASM        1.161                    0.86
NORAC       OCFS       0.729                    1.37
NORAC       ext3       0.581                    1.72



Oracle RAC handson Part 7 RAC High Scalability

RAC High Scalability

Summary:

Both the Oracle OCI and thin clients support connection load balancing, and instance load at connection time is a factor in the balancing. If the load on the RAC nodes is similar, Oracle randomly decides which node to connect to. Besides load balancing, we tested query and OLTP performance against 1-node, 2-node, and 3-node RAC. In the best test case, the performance increase was 35% going from 1 node to 2 (linear scaling would be 100%) and 15% going from 2 nodes to 3 (linear scaling would be 50%). CPU-intensive operations benefit the most from RAC scale-out. Inter-node memory access is very slow compared with local memory access (about 300 times slower, and about twice the speed of disk access). Other resource constraints, such as latch contention and disk IO limits on data files and log files, prevent near-linear scale-up for RAC unless everything is partitioned. (For example, application A uses schema A, connects to node A, on disk array A; application B uses schema B, connects to node B, on disk array B; and so on.)
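One way to implement that kind of partitioning is to pin each application to its own service running on one instance. A hedged sketch using the 10gR2 DBMS_SERVICE package (the service name, instance name, and credentials here are made up):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class PartitionByService {

    // Creates a service for application A and starts it on instance rac1 only,
    // so application A's connections and working set stay on one node.
    public static void main(String[] args) throws Exception {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:oci:@rac", "system", "manager"); // needs DBA privileges
        CallableStatement cs = conn.prepareCall(
                "begin dbms_service.create_service('appa.wyeth.com', 'appa.wyeth.com'); end;");
        cs.execute();
        cs = conn.prepareCall(
                "begin dbms_service.start_service('appa.wyeth.com', 'rac1'); end;");
        cs.execute();
        conn.close();
    }
}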

Test Case 9:

Name: Load balance

Purpose: Test load balance over nodes

Detail:

  1. Connect to RAC with 100 JDBC thick (OCI) connections and count the connections on each RAC instance.

Test result:

Each node gets about 30 connections, so it looks like connections are made randomly across the RAC instances. A separate test suggests that instance load is a factor when RAC makes the connection.
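The per-instance counts can be collected with one query against GV$SESSION instead of logging on to each node; a sketch, assuming the test account can read the GV$ views:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CountConnections {

    // Counts the TEST sessions on each RAC instance in a single query; this is
    // how the roughly 30/30/30 split above can be verified.
    public static void main(String[] args) throws Exception {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        Connection conn = DriverManager.getConnection("jdbc:oracle:oci:@rac", "test", "tt123");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "select inst_id, count(*) from gv$session"
                + " where username = 'TEST' group by inst_id order by inst_id");
        while (rs.next()) {
            System.out.println("instance " + rs.getInt(1) + ": " + rs.getInt(2) + " sessions");
        }
        conn.close();
    }
}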

Test Code:

Java code:

import java.sql.*;
import java.util.Date;

public class RacQuery {

    String host = "10.8.166.91"; //blade1.wyeth.com
    String host1 = "10.8.166.91"; //blade1.wyeth.com
    String host2 = "10.8.166.92"; //blade2.wyeth.com
    String host3 = "10.8.166.93"; //blade3.wyeth.com
    String databaseSID = "rac1";
    String serviceName = "rac.wyeth.com";
    String databasePort = "1521";
    //String url = "jdbc:oracle:thin:@" + host + ":" + databasePort + ":" + databaseSID;
    /*String url = "jdbc:oracle:thin:@(DESCRIPTION=" +
        "(LOAD_BALANCE=yes)" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host1 + ")(PORT=" + databasePort + "))" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host2 + ")(PORT=" + databasePort + "))" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host3 + ")(PORT=" + databasePort + "))" +
        "(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=" + serviceName + ")))";
    */
    String url = "jdbc:oracle:oci:@rac";
    String userName = "test";
    String password = "tt123";
    Connection conn;

    public RacQuery() {
    }

    public Connection getConnection() throws java.sql.SQLException {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        conn = DriverManager.getConnection(url, userName, password);
        return conn;
    }

    public static void main(String[] args) throws InterruptedException {
        RacQuery racquery = new RacQuery();
        try {
            int loopExt = 0;
            Date startTime = new Date(System.currentTimeMillis());
            Date endTime = new Date(System.currentTimeMillis());
            String sqlStr1 = "select instance_name from v$instance";
            String mySession = args[0];
            String failover;
            String sqlStr2 = "insert into session_failover (session_id, time) values (?, ?)";
            Connection conn = racquery.getConnection();
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(sqlStr1);
            rs.next();
            System.out.print("instance is " + rs.getString(1));
            // Poll v$session once a second until this session reports it has failed over.
            sqlStr1 = "select failed_over from v$session where audsid = sys_context('userenv', 'sessionid')";
            PreparedStatement pstmt;
            while (loopExt == 0) {
                Thread.sleep(1000);
                rs = stmt.executeQuery(sqlStr1);
                rs.next();
                failover = rs.getString(1);
                if (failover.equalsIgnoreCase("yes")) {
                    loopExt = 1;
                    stmt.close();
                    endTime.setTime(System.currentTimeMillis());
                    // Record how long the failover took: the time since the last
                    // successful poll before the failover was detected.
                    pstmt = conn.prepareStatement(sqlStr2);
                    pstmt.setString(1, mySession);
                    pstmt.setFloat(2, (endTime.getTime() - startTime.getTime()) / 1000f); // float division keeps sub-second precision
                    pstmt.executeUpdate();
                    pstmt.close();
                } else {
                    startTime.setTime(System.currentTimeMillis());
                }
            }
            conn.close();
        } catch (SQLException sqle) {
            System.out.print("JDBC exception: " + sqle.getMessage());
        }
    }
}

ksh code:

#!/usr/bin/ksh
#LD_LIBRARY_PATH must use 32 bit lib
LD_LIBRARY_PATH=/data01/app/oracle/product/9.2.0/lib32:$LD_LIBRARY_PATH
counter=0
echo `date` > runTesting.log
until [[ $counter = 100 ]] ; do
    echo The counter is $counter
    let counter=counter+1
    java -classpath /export/home/oracle/racTesting:/data01/app/oracle/product/9.2.0/jdbc/lib/ojdbc14_g.jar RacQuery $counter >> runTesting.log &
done

Table DDL

CREATE TABLE TEST.SESSION_FAILOVER (SESSION_ID VARCHAR2(4), TIME NUMBER);

Test Case 10:

Name: Query scalability

Purpose: Test scalability of query with full table scan.

Detail:

1. Start up with only 1 instance.

2. Record the time to full-table-scan a large table (3.6 GB) with parallel degree 6, three times:

select /*+ parallel (sequences, 6) */ distinct species from test.sequences
UNION
select /*+ parallel (sequences, 6) */ distinct species from test.sequences
UNION
select /*+ parallel (sequences, 6) */ distinct species from test.sequences;

3. Repeat step 2 against 2-node and 3-node RAC.

Retested with plain SQL (select distinct species from test.sequences): it takes 53 seconds on single-node RAC, 53 seconds on two-node RAC, and 49 seconds on three-node RAC. Only one session, on the connected instance, runs the query, which shows that the global cache service scales well with the number of nodes.

Test result:

For a single node, it started 10 parallel processes. For two-node RAC, there are 6 parallel processes on each node. For three-node RAC, there are 4 parallel processes on each node. Adding nodes to the RAC does not seem to speed up the query even though all nodes are working on it; the full table scan is limited by the IO of the SAN.

One Node: 35 seconds, 35 seconds, 35 seconds. Average 35 seconds

Two Node: 36 seconds, 36 seconds, 36 seconds. Average: 36 seconds

Three Node: 36 seconds, 36 seconds, 37 seconds. Average: 36 seconds

Test Case 11:

Name: Query scalability

Purpose: Test scalability of a query with full table scan of a hash-partitioned index-organized table with three partitions.

Detail:

1. Start up with only 1 instance.

2. Record the time to range-scan and full-table-scan the large table (3.6 GB) with parallel degree 6 (select /*+ parallel(cellfeature, 6) */ count(*) from cellfeature where valdbl > 2).

3. Repeat step 2 against 2-node and 3-node RAC.


Test result:

The query does a fast full scan of the primary key index. For a single node, it started 6 parallel processes. For two-node RAC, there are 6 parallel processes on the remote node and one on the local node. For three-node RAC, there are 2 and 4 parallel processes on the remote nodes, with only one process on the local node. Adding nodes to the RAC does not seem to speed up the query even though all nodes are working on it; the fast full index scan is limited by the IO of the SAN. The query takes 96 seconds to complete without parallel execution.

One Node: 12 seconds, 13 seconds, 12 seconds. Average 12 seconds

Two Node: 13 seconds, 13 seconds, 13 seconds. Average: 13 seconds

Three Node: 12 seconds, 13 seconds, 14 seconds. Average: 13 seconds

Test Case 12:

Name: Query remote cache

Purpose: Test buffer gets from the local and remote buffer cache

Detail:

1. Create table sequences_small as select temp_s.nextval as T_ID, SEQUENCE_ID, ACCESSION, AUTHORITY, WYSECAT_AVA, SPECIES, DESCRIPTION from sequences where rownum <>

2. Cache the sequences_smaller table (300 MB) in the buffer cache.

3. Record the time to scan the sequences_smaller table (300 MB) from a session connected to the same instance (select count(*) from sequences_smaller;). The query takes 7 seconds when reading from disk and 0.01 seconds when reading from memory.

4. Connect to the other node, run the same query, and record the time. It takes 3.04 seconds for the other node to run the query the first time. Autotrace shows all reads are consistent gets.

Test result:

The query that accesses the remote buffer cache is about 1.3 times faster than disk reads (7 seconds vs. 3.04) but roughly 300 times slower than local memory access (0.01 seconds vs. 3.04).
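A sketch of the timing harness behind these numbers: run the same COUNT(*) twice against the caching node (disk read, then local cache) and once from another node (cache fusion transfer). The thin URLs and the rac2 SID are assumptions based on the hosts used elsewhere in these tests:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CacheTiming {

    // Times one full scan of the small cached table over a given connection.
    static long timeQuery(String url) throws Exception {
        Connection conn = DriverManager.getConnection(url, "test", "tt123");
        Statement stmt = conn.createStatement();
        long t0 = System.currentTimeMillis();
        ResultSet rs = stmt.executeQuery("select count(*) from sequences_smaller");
        rs.next();
        long elapsed = System.currentTimeMillis() - t0;
        conn.close();
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        String node1 = "jdbc:oracle:thin:@10.8.166.91:1521:rac1";
        String node2 = "jdbc:oracle:thin:@10.8.166.92:1521:rac2"; // SID is an assumption
        System.out.println("node1, disk read:    " + timeQuery(node1) + " ms");
        System.out.println("node1, local cache:  " + timeQuery(node1) + " ms");
        System.out.println("node2, cache fusion: " + timeQuery(node2) + " ms");
    }
}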

Issues:

1. Setting sga_target to a non-zero value enables SGA auto-management, but the SGA did not grow for full table scans of the cached table, and Oracle did disk reads every time.

Test Case 13:

Name: OLTP scalability

Purpose: RAC scalability for online transaction processes (OLTP)

Detail:

1. Set up a table with 1 million rows that has a number field besides the primary key (sequences_smaller).

2. Connect to one node with 100 JDBC connections; each connection updates a random row with a sequence number for 5 minutes. Record the sequence number before and after the test.

3. Connect to 2-node and 3-node RAC with 100 JDBC connections; each connection updates a random row with a sequence number for 10 minutes. Record the sequence number before and after the test.

Test result:

The transaction throughput with one node is 4307 per second (43 per connection per second); with two nodes, 4550 per second (46 per connection per second); with three nodes, 5306 per second (53 per connection per second). The Java server CPU is at 100% and some connections generate a "shared pool too small" error, so the throughput is limited by client-side CPU and by shared_pool_size being too small.

Re-ran the test with 30 concurrent connections and a more powerful client server (cela166052). The transaction throughput with one node is 7813 per second (260 per connection per second); with two nodes, 9853 per second (328 per connection per second); with three nodes, 10000 per second (333 per connection per second). It looks like both the client server and the RAC have room for more load.

Re-ran the test with 100 concurrent connections. The transaction throughput with one node is 9658 per second (96 per connection per second); with two nodes, 13098 per second (131 per connection per second); with three nodes, 15081 per second (151 per connection per second). About 20% of the time goes to cluster overhead (interconnect block transfers), 40% to commit (memory latches, log file sync), and 40% to CPU.
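The 20/40/40 breakdown above comes from the wait-event profile; a before/after snapshot of GV$SYSTEM_EVENT gives the per-run numbers. A minimal one-shot sketch, again assuming the test account can read the GV$ views:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopWaits {

    // Prints the top non-idle wait events across all instances. The cluster
    // overhead shows up as "gc ..." events and the commit cost as "log file sync".
    public static void main(String[] args) throws Exception {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        Connection conn = DriverManager.getConnection("jdbc:oracle:oci:@rac", "test", "tt123");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "select event, time_waited from ("
                + " select event, time_waited from gv$system_event"
                + " where wait_class <> 'Idle' order by time_waited desc)"
                + " where rownum <= 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + ": " + rs.getLong(2) + " cs"); // centiseconds
        }
        conn.close();
    }
}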


Because of RAC's share-everything architecture, it is impossible to scale up near-linearly unless everything is partitioned (applications, instances, schemas, disks). For our best test cases, the scalability is about 35% from 1 node to 2 nodes, and 15% from 2 nodes to 3 nodes.

Test Code:

Java code:

import java.sql.*;
import java.util.Date;

public class RacQuery {

    String host = "10.8.166.91"; //blade1.wyeth.com
    String host1 = "10.8.166.91"; //blade1.wyeth.com
    String host2 = "10.8.166.92"; //blade2.wyeth.com
    String host3 = "10.8.166.93"; //blade3.wyeth.com
    String databaseSID = "rac1";
    String serviceName = "rac.wyeth.com";
    String databasePort = "1521";
    //String url = "jdbc:oracle:thin:@" + host + ":" + databasePort + ":" + databaseSID;
    /*String url = "jdbc:oracle:thin:@(DESCRIPTION=" +
        "(LOAD_BALANCE=yes)" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host1 + ")(PORT=" + databasePort + "))" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host2 + ")(PORT=" + databasePort + "))" +
        "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host3 + ")(PORT=" + databasePort + "))" +
        "(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=" + serviceName + ")))";
    */
    String url = "jdbc:oracle:oci:@rac";
    String userName = "test";
    String password = "";
    Connection conn;

    public RacQuery() {
    }

    public Connection getConnection() throws java.sql.SQLException {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        conn = DriverManager.getConnection(url, userName, password);
        return conn;
    }

    public static void main(String[] args) throws InterruptedException {
        RacQuery racquery = new RacQuery();
        try {
            int loopExt = 0;
            Date startTime = new Date(System.currentTimeMillis());
            Date endTime = new Date(System.currentTimeMillis());
            //String sqlStr1 = "commit";
            String mySession = args[0];
            String failover;
            String sqlStr2 = "update sequences_smaller set t_value = temp_s.nextval where t_id = ?";
            Connection conn = racquery.getConnection();
            ResultSet rs;
            PreparedStatement pstmt;
            Integer randomValue = new Integer(0);
            pstmt = conn.prepareStatement(sqlStr2);
            startTime = new Date(System.currentTimeMillis());
            // Update a random row and commit, repeatedly, for 5 minutes.
            while (loopExt == 0) {
                randomValue = new Integer(new Double(Math.random() * 1000000).intValue());
                //System.out.println("random number = " + randomValue);
                pstmt.setInt(1, randomValue.intValue());
                pstmt.executeUpdate();
                conn.commit();
                endTime.setTime(System.currentTimeMillis());
                if (endTime.getTime() - startTime.getTime() > 300000) { //5 minutes = 300000 ms
                    loopExt = 1;
                }
            }
            System.out.println("Time passed: " + (endTime.getTime() - startTime.getTime()) / 1000);
            pstmt.close();
            conn.close();
        } catch (SQLException sqle) {
            System.out.print("JDBC exception: " + sqle.getMessage());
        }
    }
}

ksh code:

compileTesting.sh

#!/usr/bin/ksh
LD_LIBRARY_PATH=/opt/app/oracle/product/10.2.0/lib32:.
ORACLE_HOME=/opt/app/oracle/product/10.2.0
/opt/app/oracle/product/10.2.0/jdk/bin/javac -classpath /home/oracle/racTesting:$ORACLE_HOME/jdbc/lib/ojdbc14_g.jar:$ORACLE_HOME/jdk/lib /home/oracle/racTesting/RacQuery.java

runTesting.sh

#!/usr/bin/ksh
ORACLE_HOME=/opt/app/oracle/product/10.2.0
LD_LIBRARY_PATH=$ORACLE_HOME/lib32:$ORACLE_HOME/jdbc/lib:$ORACLE_HOME/jdk/lib:.
counter=0
echo `date` > runTesting.log
until [[ $counter = 100 ]] ; do
    echo The counter is $counter
    let counter=counter+1
    $ORACLE_HOME/jdk/bin/java -classpath /home/oracle/racTesting:$ORACLE_HOME/jdbc/lib/ojdbc14_g.jar RacQuery $counter >> runTesting.log &
done

Database Objects:

SQL> desc sequences_smaller

 Name                          Null?       Type
 ----------------------------- ----------- ---------------
 T_ID                          NOT NULL    NUMBER
 SEQUENCE_ID                   NOT NULL    VARCHAR2(50)
 ACCESSION                     NOT NULL    VARCHAR2(30)
 AUTHORITY                     NOT NULL    VARCHAR2(30)
 WYSECAT_AVA                               VARCHAR2(60)
 SPECIES                                   VARCHAR2(200)
 DESCRIPTION                               VARCHAR2(4000)
 T_VALUE                                   NUMBER

create sequence temp_s start with 1 cache 50000;

Test Case 14:

Name: Table join scalability

Purpose: RAC scalability for join operation

Detail:

1. Set up a table with 12 million rows.

2. Connect to one node with sqlplus and run a query with a hash join on the table:

select /*+ parallel (a,4) parallel (b,4) */ b.accession, a.accession from sequences a, (select sequence_id, accession from sequences union select sequence_id, accession from sequences_small) b
where a.sequence_id = b.sequence_id and a.accession like 'NM%' and b.accession like '%_017067.1';

Record the time to complete the query, which is 30 seconds.

3. Repeat step 2 with 2-node and 3-node RAC. Since no remote parallel processes are started, adding nodes does not speed up the query.

4. Repeat steps 2 and 3 with a nested-loop join:

select /*+ use_nl(a,b) parallel (a,4) parallel (b,4) */ b.accession, a.accession from sequences a, (select sequence_id, accession from sequences union select sequence_id, accession from sequences_small) b
where a.sequence_id = b.sequence_id and a.accession like 'NM%' and b.accession like '%_017067.1';

Record the time to complete the query, which is 16 seconds. No remote parallel processes are started.

Test result:

Table join operations do not benefit directly from a multi-node RAC.