It is not news that disk IO is slow. Comparing with the speed of CPU to process the data, or memory access of the data, disk IO is usually 1000 times slower. Because Oracle database relies on underneath computer system functions, it is common that disk IO is the limiting factor for database performance, especially it is true for large data warehouse. Therefore, measuring the disk IO throughput is important for database performance tuning, database capacity planning. With the trend of virtualization of data storage layer, IO throughput measuring is more import that before for Oracle database tuning and troubleshooting.
First let us measure the disk IO at server level. I use Linux as example and hdparm is a good starting point, such as
hdparm –tT /dev/sda
You need root privilege to do so.
#oracle >sudo /sbin/hdparm -tT /dev/cciss/c0d0p3
/dev/cciss/c0d0p3:
Timing cached reads: 2264 MB in 2.00 seconds = 1130.48 MB/sec
Timing buffered disk reads: 130 MB in 3.00 seconds = 43.30 MB/sec
#oracle >
If you do not have root privilege to run hdparm, try dd
#oracle>ls -l *
-rw-r--r-- 1 oracle dba 268435456 Jul 24 12:52 testfile
#oracle>time dd if=testfile of=/dev/null bs=16k
16384+0 records in
16384+0 records out
7.70s real 0.01s user 0.39s system
7.7 seconds to read 256 MB file so the reads is 33 MB/s. One catch is that data may read from file system buffer. Run same command again and the reading is totally different.
#oracle>time dd if=testfile of=/dev/null bs=16k
16384+0 records in
16384+0 records out
0.19s real 0.00s user 0.18s system
0.19 second to read 256 MB file so the reads is 1347 MB/s. That is the cache access speed. To clean out file system buffer, you can write a large file with double size of physical memory of the server using “dd” or use “/proc/sys/vm/drop_caches” to clear cache as root. New version dd command can support option “iflag=direct” that bypass file system buffer.
To benchmark the server write speed, one can use dd command like:
#oracle>time dd if=/dev/zero of=testfile bs=16k count=16384
16384+0 records in
16384+0 records out
8.70s real 0.03s user 0.65s system
It is easy to calculate that the write speed is 256 MB / 8.7 = 29 MB/s. Similarly, you can use option “oflag=direct” to bypass file system buffer cache for new version of dd.
Next, let us test how oracle utilizes the disk IO. In order to show true disk IO, the initial parameter filesystemio_options need to be set to “setall”, which is using directio that bypasses file system buffer cache and is using “async io” thst allowes cocurrent IO to disk. Before each test case, do “alter system flush buffer_cache;” to flush the data cache buffer.
It is easy to test large read and write by using “create table as select” and using optimal parallelism and nologging. For example:
SQL> create table gene tablespace mm_data parallel (degree 4) nologging as select * from gene2 ;
Table created.
Elapsed: 00:00:25.90
Table gene2 is cached so the reading of gene table does not compete IO with creating table gene. The size of gene table is 450 MB so the write speed is 17 MB/s.
To test bulk reading speed, the gene table is queried by full table scan with optimized parallelism. Autotrace is set to on in SQLPLUS to make sure of full table scan and physical read is accounted for majority of query time.
SQL> select /*+ parallel (gene, 4) */count(*) from gene;
COUNT(*)
----------
6311685
Elapsed: 00:00:12.40
Simple calculation shows that the reading speed is 36 MB/s. The calculation ignores memory sort for to get the count. It is also hard to timing physical IO for other database operations, such as index read, DML and so on.
To timing IO for database operation other than bulk read and write, SQL trace is used. To start a sql trace, run “alter session set events '10046 trace name context forever, level 12';” in sql plus. The trace file is under UDUMP directory and tkprof is used to format the report. Below is an sample output in report.
SELECT PROTEIN_ACCESSION
FROM
GENE WHERE ID = :B1
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 100001 8.26 7.87 0 0 0 0
Fetch 100001 5.38 114.56 37085 418332 0 100001
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 200003 13.64 122.43 37085 418332 0 100001
Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 59 (recursive depth: 1)
Rows Row Source Operation
------- ---------------------------------------------------
100001 TABLE ACCESS BY INDEX ROWID GENE (cr=418332 pr=37085 pw=0 time=114797430 us)
100001 INDEX UNIQUE SCAN GENE_PK (cr=300003 pr=6875 pw=0 time=21453148 us)(object id 142926)
As shown in execution plan, it is an index read followed by table row read. We will calculate the small physical read of table. From execution plan, we know that there are 418332 memory read and 37085 disk read and total time for table access is 114797430 micro-seconds. From other section of report, we know that memory access is 30 microseconds, so disk read is 114797430 – 30 * 418332 = 102247470 micro-seconds or 2757 microseconds per IO read. We know the data block size is 16K, so the read speed is 6 MB/s for small read.
Oracle 11g has new package called DBMS_RESOURCE_MANAGER.CALIBRATE_IO that measure and calibrate disk IO for cost based optimizer. It measures max_iops, max_mbps and max_latency.
max_iops : Maximum number of I/O requests per second that can be sustained. The I/O requests are randomly-distributed, database-block-sized reads. It is similar to small read.
max_mbps: Maximum throughput of I/O that can be sustained, expressed in megabytes per second. The I/O requests are randomly-distributed, 1 megabyte reads. It is like large read.
max_latency: Average latency of database-block-sized I/O requests at max_iops rate, expressed in milliseconds
Because caching in storage devices, such as disk, NAS and SAN, it is very hard to get measurement for disk IO and disk IO only. But from practical point of view, it is close to what IO speed for an average database system. The actual measurement can vary based percentage of data read from storage cache and other system load if the storage is shared.