Friday, July 24, 2009

IO Measurement

It is not news that disk IO is slow. Compared with the speed at which a CPU processes data, or at which memory is accessed, disk IO is usually about 1000 times slower. Because the Oracle database relies on the underlying system for its IO, disk IO is often the limiting factor for database performance; this is especially true for large data warehouses. Measuring disk IO throughput is therefore important for both database performance tuning and capacity planning. With the trend toward virtualization of the storage layer, measuring IO throughput matters more than ever for Oracle database tuning and troubleshooting.

First, let us measure disk IO at the server level. I will use Linux as the example, and hdparm is a good starting point, such as

hdparm -tT /dev/sda

You need root privilege to do so. The -T option times reads from the Linux buffer cache (memory speed), while -t times buffered reads from the disk itself.

#oracle >sudo /sbin/hdparm -tT /dev/cciss/c0d0p3

/dev/cciss/c0d0p3:

Timing cached reads: 2264 MB in 2.00 seconds = 1130.48 MB/sec

Timing buffered disk reads: 130 MB in 3.00 seconds = 43.30 MB/sec

#oracle >

If you do not have root privilege to run hdparm, try dd:

#oracle>ls -l *

-rw-r--r-- 1 oracle dba 268435456 Jul 24 12:52 testfile

#oracle>time dd if=testfile of=/dev/null bs=16k

16384+0 records in

16384+0 records out

7.70s real 0.01s user 0.39s system

It took 7.7 seconds to read the 256 MB file, so the read speed is 33 MB/s. One catch is that the data may be read from the file system buffer cache. Run the same command again and the reading is totally different.

#oracle>time dd if=testfile of=/dev/null bs=16k

16384+0 records in

16384+0 records out

0.19s real 0.00s user 0.18s system

It took 0.19 seconds to read the same 256 MB file, so the read speed is 1347 MB/s. That is the cache access speed. To clean out the file system buffer cache, you can write a large file twice the size of the server's physical memory using dd, or use /proc/sys/vm/drop_caches to clear the cache as root. Newer versions of dd support the option "iflag=direct", which bypasses the file system buffer cache.
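A minimal sketch of both approaches, reusing the testfile from above (writing "3" drops the page cache plus dentries and inodes; drop_caches requires a 2.6.16 or newer kernel):

#oracle >sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

#oracle >time dd if=testfile of=/dev/null bs=16k iflag=direct

Either way, the next read of testfile comes from disk rather than memory.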

To benchmark the server's write speed, one can use a dd command like:

#oracle>time dd if=/dev/zero of=testfile bs=16k count=16384

16384+0 records in

16384+0 records out

8.70s real 0.03s user 0.65s system

It is easy to calculate that the write speed is 256 MB / 8.7 s = 29 MB/s. Similarly, newer versions of dd support the option "oflag=direct" to bypass the file system buffer cache on writes.
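A hedged sketch of the direct-write variant (O_DIRECT support depends on the kernel and file system; the file name and sizes carry over from the example above):

#oracle >time dd if=/dev/zero of=testfile bs=16k count=16384 oflag=direct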

Next, let us test how Oracle utilizes disk IO. In order to show true disk IO, the init parameter filesystemio_options needs to be set to "setall", which enables direct IO, bypassing the file system buffer cache, and asynchronous IO, allowing concurrent IO to disk. Before each test case, run "alter system flush buffer_cache;" to flush the database buffer cache.
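For reference, a minimal sketch of that setup; filesystemio_options is a static parameter, so an instance restart is required:

SQL> alter system set filesystemio_options=setall scope=spfile;

SQL> shutdown immediate

SQL> startup

SQL> alter system flush buffer_cache;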

It is easy to test large reads and writes by using "create table as select" with suitable parallelism and nologging. For example:

SQL> create table gene tablespace mm_data parallel (degree 4) nologging as select * from gene2 ;

Table created.

Elapsed: 00:00:25.90

Table gene2 is cached, so reading it does not compete for IO with writing table gene. The gene table is 450 MB in size, so the write speed is 450 MB / 25.9 s = 17 MB/s.

To test bulk read speed, the gene table is queried with a full table scan and suitable parallelism. Autotrace is set on in SQL*Plus to confirm that a full table scan is used and that physical reads account for the majority of the query time.

SQL> select /*+ parallel (gene, 4) */count(*) from gene;

COUNT(*)

----------

6311685

Elapsed: 00:00:12.40

A simple calculation shows that the read speed is 450 MB / 12.4 s = 36 MB/s. The calculation ignores the in-memory work needed to produce the count. It is also hard to time the physical IO of other database operations this way, such as index reads, DML and so on.

To time the IO of database operations other than bulk reads and writes, SQL trace is used. To start a SQL trace, run "alter session set events '10046 trace name context forever, level 12';" in SQL*Plus. The trace file is written under the UDUMP directory, and tkprof is used to format it into a report. Below is a sample output from such a report.
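A minimal sketch of the whole workflow (the tracefile_identifier value and the trace file path placeholders are assumptions; level 12 records both bind values and wait events):

SQL> alter session set tracefile_identifier = 'io_test';

SQL> alter session set events '10046 trace name context forever, level 12';

SQL> -- run the statements to be measured, then:

SQL> alter session set events '10046 trace name context off';

#oracle >tkprof <udump_dir>/<instance>_ora_<spid>_io_test.trc io_test.prf sys=no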

SELECT PROTEIN_ACCESSION

FROM

GENE WHERE ID = :B1

call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse        1      0.00       0.00          0          0          0          0
Execute 100001      8.26       7.87          0          0          0          0
Fetch   100001      5.38     114.56      37085     418332          0     100001
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total   200003     13.64     122.43      37085     418332          0     100001

Misses in library cache during parse: 0

Optimizer mode: ALL_ROWS

Parsing user id: 59 (recursive depth: 1)

Rows     Row Source Operation
-------  ---------------------------------------------------
 100001  TABLE ACCESS BY INDEX ROWID GENE (cr=418332 pr=37085 pw=0 time=114797430 us)
 100001   INDEX UNIQUE SCAN GENE_PK (cr=300003 pr=6875 pw=0 time=21453148 us)(object id 142926)

As shown in the execution plan, this is an index read followed by a table read by rowid. Let us calculate the speed of the small physical reads against the table. From the execution plan, there are 418332 memory reads and 37085 disk reads, and the total time for the table access is 114797430 microseconds. From another section of the report, we know a memory read takes about 30 microseconds, so the disk reads account for 114797430 - 30 * 418332 = 102247470 microseconds, or 102247470 / 37085 = 2757 microseconds per read. The data block size is 16 KB, so 16 KB / 2757 us gives a read speed of about 6 MB/s for small reads.

Oracle 11g has a new procedure, DBMS_RESOURCE_MANAGER.CALIBRATE_IO, that measures and calibrates disk IO for the cost-based optimizer. It reports max_iops, max_mbps and max_latency.

max_iops: The maximum number of I/O requests per second that can be sustained. The I/O requests are randomly distributed, database-block-sized reads. This corresponds to the small reads above.

max_mbps: The maximum sustainable I/O throughput, expressed in megabytes per second. The I/O requests are randomly distributed, 1-megabyte reads. This corresponds to the large reads above.

max_latency: The average latency of database-block-sized I/O requests at the max_iops rate, expressed in milliseconds.
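A minimal sketch of calling it (the num_physical_disks and max_latency inputs below are assumptions to adjust for your storage; asynchronous IO must be enabled on the data files):

SQL> set serveroutput on
DECLARE
  l_iops PLS_INTEGER;
  l_mbps PLS_INTEGER;
  l_lat  PLS_INTEGER;
BEGIN
  DBMS_RESOURCE_MANAGER.CALIBRATE_IO(
    num_physical_disks => 4,    -- assumption: number of spindles behind the data files
    max_latency        => 10,   -- assumption: acceptable latency in milliseconds
    max_iops           => l_iops,
    max_mbps           => l_mbps,
    actual_latency     => l_lat);
  DBMS_OUTPUT.PUT_LINE('max_iops = ' || l_iops);
  DBMS_OUTPUT.PUT_LINE('max_mbps = ' || l_mbps);
  DBMS_OUTPUT.PUT_LINE('latency  = ' || l_lat);
END;
/

The results are also recorded in the DBA_RSRC_IO_CALIBRATE view.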

Because of caching in storage devices such as disks, NAS and SAN, it is very hard to get a measurement of disk IO and disk IO only. From a practical point of view, though, the result is close to the IO speed an average database system will see. The actual measurement can vary with the percentage of data read from the storage cache and, if the storage is shared, with other system load.

Monday, July 20, 2009

Data pump sample parameter

The Data Pump SAMPLE parameter can be used to export a subset of data by table or by schema. For example, with sample=10, only 10% of the data is exported per table or per schema. It is great if you want to create a testing or development database from a production database and scale it down for various reasons. However, there is one big catch: referential integrity is not honored by sampling. When you import (impdp) the sampled dump, you will get an ORA-02298 error if you have foreign key constraints.
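A minimal sketch of the export side (the directory, dump file and schema names are assumptions chosen to match the import log below):

#oracle >expdp system DIRECTORY=WTC_EXP_DIR DUMPFILE=dbpump_sample10_exp.dmp SCHEMAS=dbpump SAMPLE=10 LOGFILE=dbpump_sample10_exp.log

Here is an example log file from the Data Pump import of that dump: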


;;;

Import: Release 10.2.0.2.0 - 64bit Production on Friday, 17 July, 2009 16:29:45

Copyright (c) 2003, 2005, Oracle. All rights reserved.

;;;

Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - 64bit Production

With the Partitioning, OLAP and Data Mining options

Master table "SYSTEM"."SYS_IMPORT_FULL_01" successfully loaded/unloaded

Starting "SYSTEM"."SYS_IMPORT_FULL_01": system/******** directory=WTC_EXP_DIR DUMPFILE=dbpump_sample10_exp.dmp remap_schema=dbpump:dbpump2 logfile=dbpump_sample10_imp.log

Processing object type SCHEMA_EXPORT/USER

ORA-31684: Object type USER:"DBPUMP2" already exists

Processing object type SCHEMA_EXPORT/SYSTEM_GRANT

Processing object type SCHEMA_EXPORT/ROLE_GRANT

Processing object type SCHEMA_EXPORT/DEFAULT_ROLE

Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA

Processing object type SCHEMA_EXPORT/SEQUENCE/SEQUENCE

Processing object type SCHEMA_EXPORT/TABLE/TABLE

Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA

. . imported "DBPUMP2"."CHILD" 2.132 MB 101427 rows

. . imported "DBPUMP2"."PARENT" 20.92 KB 979 rows

Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/CONSTRAINT

Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/REF_CONSTRAINT

ORA-39083: Object type REF_CONSTRAINT failed to create with error:

ORA-02298: cannot validate (DBPUMP2.SYS_C0052496) - parent keys not found

Failing sql is:

ALTER TABLE "DBPUMP2"."CHILD" ADD FOREIGN KEY ("P_ID") REFERENCES "DBPUMP2"."PARENT" ("P_ID") ENABLE

Job "SYSTEM"."SYS_IMPORT_FULL_01" completed with 2 error(s) at 16:29:52


It looks like I still need my PL/SQL package for dumping a subset of data from a production database.

Oracle direct IO

By default, the init parameter filesystemio_options is set to none, which means the operating system's file system buffer cache is used for data file IO and the like. Many Oracle professionals think it should be set to "directio" or "setall" (direct IO plus async IO); Metalink note 462072.1, "File System's Buffer Cache versus Direct I/O", and Steve Adams's web site (http://www.ixora.com.au/tips/avoid_buffered_io.htm) are such examples. The reasons direct IO is preferred are:
  • Caching data in the Oracle buffer cache and again in the file system buffer cache wastes memory through double caching.
  • A DBA can manage how data blocks are cached inside Oracle, but not in the file system buffer cache.
  • The file system buffer cache can mislead Oracle's IO statistics, since a read satisfied from the file system cache is still counted as a physical read from Oracle's point of view.
  • It also complicates database tuning, because the cost of a "physical" IO is not uniform: it may be served from the file system buffer cache or from true disk IO.
  • On some operating systems, the file system buffer cache can keep expanding until all physical memory is consumed, crashing the system.
I did some testing with Oracle 10g on RedHat AS4 update 3 over NFS (networked file system) and found that the read speed (full table scan) is similar whether data comes from the file system buffer cache or the Oracle buffer cache, at about 400 MB/second. Because both are reads from local memory, the result will be the same no matter what kind of file system the data sits on. Note that because the Oracle buffer cache and the Linux file system buffer cache age data out differently, and their sizes differ, data may be in one cache but not the other. For writes, it was a surprise to see that IO through the file system buffer cache is about four times faster:

directio=yes
SQL> create table gene_drop tablespace mm_data_temp nologging as select * from gene;
Table created.
Elapsed: 00:00:50.46


directio=none
SQL> create table gene_drop tablespace mm_data_temp nologging as select * from gene;
Table created.
Elapsed: 00:00:12.40

The table gene is about 500 MB in size and is cached, so no physical IO is needed to read it. According to the Linux documentation, written data actually sits in the file system buffer cache until the dirty pages are more than 30 seconds old or exceed about 10% of memory. Either way, the file system buffer cache can smooth out the disk write load.
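Those thresholds map to kernel tunables, which can be inspected directly (the values shown are common defaults, not guaranteed):

#oracle >cat /proc/sys/vm/dirty_expire_centisecs

3000

#oracle >cat /proc/sys/vm/dirty_background_ratio

10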

Whether Oracle actually uses direct IO depends not only on the filesystemio_options parameter, but also on the OS, file system type, storage device, mount options, device driver and so on. The following procedure can be used on Linux to check whether direct IO is used (a consolidated sketch follows the list).
  1. Start the database in mount mode: startup mount.
  2. Find the database writer with "select spid, program from v$process where program like '%DBW%';" and write down the spid number.
  3. Run strace against the database writer process, such as strace -o /tmp/strace.out -e trace=open -p 2260.
  4. Open the database: alter database open.
  5. Search the strace output file for "O_DIRECT"; a matching line looks something like open("/.../tst01.dbf", O_RDWR|O_SYNC|O_DIRECT) = 22.
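Putting the steps together in one pass (the PID 2260 carries over from the example; substitute the spid you recorded):

SQL> startup mount

SQL> select spid, program from v$process where program like '%DBW%';

#oracle >strace -o /tmp/strace.out -e trace=open -p 2260 &

SQL> alter database open;

#oracle >grep O_DIRECT /tmp/strace.out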
By default, Linux uses all memory not occupied by processes as file system buffer cache and hands it back to processes as needed. However, it is possible for memory used by Oracle to be paged out to swap while a large portion of physical memory is used for the file system buffer cache.

In conclusion, whether to set filesystemio_options to direct IO depends on your system, hardware, database, application and load pattern. This is what a DBA is paid for, and why one is needed to make such decisions.

Tuesday, July 7, 2009

Oracle Lock and Latch comparison

Oracle uses both locks and latches to protect data integrity and provide concurrency for user processes. But there are a number of differences between locks and latches.

• A lock protects on-disk data structures; a latch protects in-memory data structures.
• A lock is a complex data structure with shared access modes; a latch is a simple data structure with exclusive mode only.
• Locks are slow because they are complex and involve disk operations; latches are fast because they are simple and operate on memory only.
• Lock requests are queued, so deadlock is possible. A latch is hit-and-run, using spin and sleep while waiting.
• A lock is not needed for reads because of Oracle's multiversion concurrency control (MVCC); a latch is needed for all data access in memory, including reads.
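To watch the two mechanisms in a running instance, a hedged pair of queries against the standard dynamic views (the :sid bind is a placeholder for a session of interest):

SQL> select type, lmode, request from v$lock where sid = :sid;

SQL> select name, gets, misses, sleeps from v$latch order by sleeps desc;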