Saturday, September 14, 2013

Oracle: partitioning a huge table _online_

My customer has an application in production 24/7, with tables getting huge (billions of rows...). We'd like to partition them to compress and move older data.

Not a problem - as long as we have the disk space. With Oracle 10g/11g we can partition the table online, without any interruption.

The package to use is DBMS_REDEFINITION, and the basic idea is:
-Create a new table the way you want it (partitioned...)
-This new table becomes a materialized view, kept in sync with the existing table (so we need double the space!)
-Add indexes etc.
-When ready: briefly lock the table while a last resync is done and the segments are switched in the data dictionary

The segments formerly belonging to the MV now belong to the new table, and the switch is completely transparent to the application.

( I think in Oracle 12c, this is getting easier with a single command. -will check)



-- In this example we have a table: BIGAPP1.BIGTABLE


-- create the new table as we want it  (it will be transformed into a materialized view)
drop table BIGAPP1.BIGTABLE_TMPPARTNING ;

create table BIGAPP1.BIGTABLE_TMPPARTNING
PARTITION BY RANGE (TIME_STAMP_UTC)
( PARTITION P2010 VALUES LESS THAN (TO_DATE('2011-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2011 VALUES LESS THAN (TO_DATE('2012-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2012 VALUES LESS THAN (TO_DATE('2013-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2013 VALUES LESS THAN (TO_DATE('2014-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2014 VALUES LESS THAN (TO_DATE('2015-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110
) as select * from BIGAPP1.BIGTABLE where 0=1 ;


-- call dbms_redefinition
-- there are two methods, by PK or by ROWID; here we had to use ROWID
-- ref: Admin Guide / chapter 15 Managing Tables / Redefining Tables Online

set serveroutput on;
exec dbms_output.put_line('check');

BEGIN
DBMS_REDEFINITION.CAN_REDEF_TABLE(
   uname        => 'BIGAPP1',
   tname        => 'BIGTABLE',
   options_flag => DBMS_REDEFINITION.CONS_USE_ROWID);
END;
/

-- The following actually copies the data to the materialized view, and adds a temporary index (because of CONS_USE_ROWID)
-- on the (slow) test system it took 25 minutes for 11M rows
BEGIN
DBMS_REDEFINITION.START_REDEF_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING',
       null, dbms_redefinition.cons_use_rowid);
END;
/

-- now our table has become an MV, but it has only been synced once (an INSERT at this point is not yet copied over)
-- now we may create the indexes

-- local index
CREATE INDEX BIGAPP1.IX_BIGTABLE_AUDOBJID ON BIGAPP1.BIGTABLE_TMPPARTNING(SURV_OBJ_ID) LOCAL
( PARTITION P2010  TABLESPACE TS_DATA110, 
  PARTITION P2011  TABLESPACE TS_DATA110, 
  PARTITION P2012  TABLESPACE TS_DATA110, 
  PARTITION P2013  TABLESPACE TS_DATA110, 
  PARTITION P2014  TABLESPACE TS_DATA110 );
-- 3 minutes

...

-- a global index, because it is unique and does not contain the partition key (so it cannot be local)
-- we partition it with another range

CREATE UNIQUE INDEX BIGAPP1.IX_BIGTABLE_OBJID ON BIGAPP1.BIGTABLE_TMPPARTNING(OB_ID) GLOBAL PARTITION BY RANGE (OB_ID)
( PARTITION P10M VALUES LESS THAN (10000000) TABLESPACE TS_DATA110,
  PARTITION P20M VALUES LESS THAN (20000000) TABLESPACE TS_DATA110,
  PARTITION P30M VALUES LESS THAN (30000000) TABLESPACE TS_DATA110,
  PARTITION PMAX VALUES LESS THAN (MAXVALUE) TABLESPACE TS_DATA110
) ;


-- process grants, triggers, constraints and privileges
-- here we do not pass CONS_ORIG_PARAMS for the copy_indexes parameter (we pass 0), because we re-created the indexes manually

DECLARE
num_errors PLS_INTEGER;
BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING', 
   0, TRUE, TRUE, TRUE, TRUE, num_errors);
END;
/
-- 1 minute

--We can check for errors here, typically: already existing constraints
select object_name, base_table_name, ddl_txt from    DBA_REDEFINITION_ERRORS;

-- Optional re-sync: I prefer to do one here since it reduces the work of the final step, which is the only one that takes a lock

BEGIN
DBMS_REDEFINITION.SYNC_INTERIM_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING');
END;
/

-- And right after this, call the final step
BEGIN
DBMS_REDEFINITION.FINISH_REDEF_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING');
END;
/

-- 20 sec

-- this final step took a short lock, did a final sync and switched the segment references in the data dictionary
-- BIGTABLE is now partitioned, and BIGTABLE_TMPPARTNING now points to the old segments (including the data: we still have a copy)

-- We may keep the old copy around for a little while, just to verify everything is OK

-- Verify we do have data in different partitions now:

select count(*) from BIGAPP1.BIGTABLE partition (P2010);
select count(*) from BIGAPP1.BIGTABLE partition (P2011);
select count(*) from BIGAPP1.BIGTABLE partition (P2012);
select count(*) from BIGAPP1.BIGTABLE partition (P2013);
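
-- Once everything is verified (and the old copy has been kept around long enough), a possible
-- cleanup, just a sketch: check the new partition segments, then drop the interim copy to reclaim the space
select partition_name, round(bytes/1024/1024) as mb
from   dba_segments
where  owner='BIGAPP1' and segment_name='BIGTABLE';

drop table BIGAPP1.BIGTABLE_TMPPARTNING purge;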


-- OK DONE!  




References:
Admin Guide, online redefinition example: "This example demonstrates redefining a single partition. It moves the oldest partition of a range-partitioned sales table to a tablespace named TBS_LOW_FREQ."

Thursday, September 12, 2013

Generate stuff with SQL 'connect by'


'connect by' was added in Oracle to build hierarchical queries, but it is also really useful to generate STUFF.
(I think I first saw this usage from Jonathan Lewis...)

SQL> select rownum from dual connect by level <= 4;
         1
         2
         3
         4


Create a huge table:

With the help of RPAD:

SQL> select rpad('A',50,'B') from dual;
ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB


We can quickly create a big table with data:

SQL> create table testbig as
             select rownum as n,
                    rpad('A',50,'B') as txt
             from dual connect by level <= 100000;


Somewhat more varied with modulo:

create table testbig2 as
  select rownum as n,
         mod(rownum,100) as m,
         rpad('A',50,'B') as txt
  from dual connect by level <= 10000;



Generate days:
SQL> select TO_DATE('2012-01-01','YYYY-MM-DD')+rownum from dual connect  by level <=3;
2012-01-02
2012-01-03
2012-01-04





Generate months:
SQL> select ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum) from dual connect  by level <=3;
2012-02-01
2012-03-01
2012-04-01



Generate PARTITION clauses...

SQL> set pagesize 0
SQL> select 'PARTITION P'
     ||TO_CHAR(ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum),'YYYYMM')
     ||' VALUES LESS THAN (TO_DATE('''||TO_CHAR(ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum),'YYYYMMDD') ||''',''YYYYMMDD'')),'
     from dual connect by level <=3;
PARTITION P201202 VALUES LESS THAN (TO_DATE('20120201','YYYYMMDD')),
PARTITION P201203 VALUES LESS THAN (TO_DATE('20120301','YYYYMMDD')),
PARTITION P201204 VALUES LESS THAN (TO_DATE('20120401','YYYYMMDD')),






Sadly I know of no direct equivalent in MySQL
(discussed here: stackoverflow.com/questions/701444 )
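
A rough workaround (not as clean; just a sketch) is to abuse a big system table plus a user variable to number the rows:

SELECT @n := @n + 1 AS n
FROM information_schema.columns, (SELECT @n := 0) AS init
LIMIT 4;

(gives 1,2,3,4 - as long as information_schema.columns has enough rows, which it always does for small series)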


Of course in bash you can also generate things like so:

$ for i in {1..5}
> do
> echo PARTITION$i
> done
PARTITION1
PARTITION2
PARTITION3
PARTITION4
PARTITION5


But dates are not as easy to process in bash, though...
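
With GNU date (if it is available), something like this should do, though it is not as neat as the SQL version:

$ for i in {1..3}
> do
>   date -d "2012-01-01 +$i month" +%Y-%m-%d
> done
2012-02-01
2012-03-01
2012-04-01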

MySQL SHOW SLAVE STATUS Explained

Sometimes I don't have to deal with MySQL slaves for months, and then I forget again which entry means what.

Here is a SHOW SLAVE STATUS explained, taken while the slave is catching up (I had just restarted it after an error).





On the slave:
10:24 [vega]:> show slave status\G
...
      Master_Log_File: vega20-bin.000740  current log file and position on the master
  Read_Master_Log_Pos: 3396922             (== SHOW MASTER STATUS;
                                           sometimes takes a moment to be refreshed)

       Relay_Log_File: vega21-relay-bin.000074 relay file on the slave we are
                                                currently applying

        Relay_Log_Pos: 274423937          position in bytes within that relay file

Relay_Master_Log_File: vega20-bin.000727  corresponding log file on the master to
                                          what we're applying
                                          (727 -> 740: about 13 binlog files, roughly
                                          1 GB each here, are left to be applied)
...
  Exec_Master_Log_Pos: 274423781          position on master log corresponding to
                                          what we're applying
...
Seconds_Behind_Master: 110997             how far behind the master the SQL thread is,
                                          in seconds of master time (not an estimate of
                                          the time needed to catch up)

Files on the slave:
        219 Sep 10 04:08 vega21-relay-bin.000076
     510344 Sep 10 04:08 vega21-relay-bin.000075
 1073741921 Sep 10 04:08 vega21-relay-bin.000074 <- currently applying this one
     193266 Sep 10 07:18 vega21-relay-bin.000079
  (already-applied relay logs are purged a bit later)
...
  196984797 Sep 11 09:03 vega21-relay-bin.000112
    3397078 Sep 11 10:24 vega21-relay-bin.000114 <- latest events copied from the master
            (the master's binlog is copied into the slave's relay logs; sizes may not match
             the master's, e.g. after a stop/start)

Files on the master:
 1073904414 Sep  7 09:59 vega20-bin.000720
 1073774868 Sep 10 02:21 vega20-bin.000721
...
 1074251834 Sep 10 04:07 vega20-bin.000727 <- the data being applied comes from here
                                             (copied on the slave into relay log 74)
 1073934723 Sep 10 07:17 vega20-bin.000728
 1073993761 Sep 10 07:27 vega20-bin.000729
...
 1091352292 Sep 11 09:02 vega20-bin.000739
    3396922 Sep 11 10:23 vega20-bin.000740 <- current log being written by the master





Status on the master:

10:24 [(none)]:> show master status;
+-------------------+----------+--------------+------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| vega20-bin.000740 |  3396922 |              |                  |
+-------------------+----------+--------------+------------------+




Other notes:

11:00 [vega]:> show processlist;
+--+-----------+----+----+-------+------+--------------------------------+------+
|Id|User       |Host|db  |Command|Time  |State                           |Info  |
+--+-----------+----+----+-------+------+--------------------------------+------+
| 5|system user|    |NULL|Connect|  6979|Waiting for master to send event|NULL  |
|69|system user|    |NULL|Connect|111664|Reading event from the relay log|NULL  |


Waiting for master to send event: the master currently isn't writing much, so there is nothing new to send to the slave's relay log.

Reading event from the relay log: the SQL thread is reading the relay file. If this state shows up often, we're spending more time reading than applying... (slow I/O!)

Sometimes we can see the statement being replicated on the slave:

|97|system user |    |STOCKS|Connect|101289| NULL      |TRUNCATE TABLE `TOTA01`|

Monday, September 2, 2013

simulating a slow drive


I want to simulate a slow drive, to better grasp the symptoms of a database suffering from a slow disk.

Until now I used to follow up on such suspicions by testing with IOzone or sysbench/fileio, but:
  • This is sometimes difficult to do if the sysadmin is not me and will not or cannot cooperate
  • This is imperfect, because we would typically run such tests at night, when the SAN I suspect is less loaded ;-)
I could use an SD card or something, but most of my test setups are VMs, so I would prefer a software solution.
I almost started to write my own slow FUSE filesystem, when I finally found a nice trick from this post:
http://askubuntu.com/questions/13590/can-i-simulate-a-slow-hard-drive

Use nbd, the Network Block Device, and then rate limit access to it using say trickle.
sudo apt-get install nbd-client nbd-server trickle


Knowing neither trickle nor nbd, I tested each  before doing the setup:


1) Trickle test


Trickle will limit the network speed of a command:
trickle -u <upload speed> -d <download speed> command

Example:
$ wget http://192.168.0.6/videos2007/CIMG0500-small.avi
=> 11M/s

$ trickle -d 20 wget http://192.168.0.6/videos2007/CIMG0500-small.avi
=> 22 K/s


2) NBD test

NBD stands for Network Block Device. I understand it to be a simple way to use a block device from a remote machine (simpler than DRBD or iSCSI target/initiator setups).


The backing store may be a file, a partition, a logical volume... Let's create a file:



dd if=/dev/zero of=/home/phil/apps/NBD/nbd-drive bs=1024 count=10240
Then run the nbd server on that file:
nbd-server 9500 /home/phil/apps/NBD/nbd-drive 

It launches an 'nbd-server' daemon, which does listen on port 9500:


$ netstat -plnt
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
...          
tcp        0      0 0.0.0.0:9500            0.0.0.0:*               LISTEN      2195/nbd-server 



Let's try to connect the client:

sudo nbd-client localhost 9500 /dev/nbd0

Seems to launch a daemon as well:
root      2276     1  0 07:04 ?        00:00:00 nbd-client localhost 9500 /dev/nbd0
root      2277     2  0 07:04 ?        00:00:00 [nbd0]

We can now format this device and mount it:
sudo mkfs /dev/nbd0
sudo mount /dev/nbd0 /mnt

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/nbd0       9.7M   92K  9.1M   1% /mnt



The test is OK. Let's umount and release:

sudo umount /mnt
sudo nbd-client -d /dev/nbd0
killall nbd-server



3) Trickle + NBD to make a slow drive


I succeeded by running the server through trickle this way:

$ trickle -d 20 -u 20 nbd-server 9500 /home/phil/apps/NBD/nbd-drive bs=1024 count=10240
trickle: Could not reach trickled, working independently: No such file or directory

Then started the client again with:


sudo nbd-client localhost 9500 /dev/nbd0

And indeed this block device was really slow :

$ sudo dd if=/dev/nbd0 of=/dev/null bs=65536 skip=100 count=10
10+0 records in
10+0 records out
655360 bytes (655 kB) copied, 28.5783 s, 22.9 kB/s



Finally, I wrote this script to get the drive ready when I need it:

$ cat mountslowdrive.sh 
#!/bin/bash

FILE_OR_PARTITION=/home/phil/apps/NBD/nbd-drive
MOUNTPOINT=/home/phil/apps/NBD/mnt

if [ $# -eq 0 ] ; then

  echo making/mounting slow drive
  trickle -d 20 -u 20 nbd-server 9500 $FILE_OR_PARTITION bs=1024 count=10240
  sudo nbd-client localhost 9500 /dev/nbd0
  sudo mount /dev/nbd0 $MOUNTPOINT


else
  if [ \( $# -eq 1 \) -a \(  $1 == "-u" \) ]; then

    echo unmounting slow drive

    sudo umount $MOUNTPOINT
    sudo nbd-client -d /dev/nbd0
    killall nbd-server

  else
    echo  USAGE: $0 [-u]
    echo         -u : umount the slow drive
  fi
fi
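
Once mounted with the script, a quick write check through the filesystem should crawl just the same (a sketch, using the mount point from the script):

$ sudo dd if=/dev/zero of=/home/phil/apps/NBD/mnt/writetest bs=65536 count=10 conv=fsync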





Todo:
Use the server's -l / /etc/nbd_server.allow option to restrict the allowed clients (otherwise any client on the network can use the block device...)




Thursday, August 29, 2013

proving bad CPU


Acting as a DBA today, I am experiencing what seems to be bad CPU performance.
The Oracle database is running on an Oracle/SPARC virtualization system.

AWR/Statspack shows (top timed events):
Event                    Waits  Time(s)  Avg wait(ms)  % DB time  Wait Class
DB CPU                           11,825                    87.51
log file sync           21,933    1,292            59       9.56  Commit
db file parallel read   16,277      268            16       1.98  User I/O

But looking at the top "SQL ordered by CPU Time" I see no single demanding statement;
they all seem to be using a lot of CPU evenly.

Working on the system, everything is slow, for instance 'autotrace' and 'explain plan'.

I have a strong suspicion it's the system's fault. Maybe some other VM on the machine is eating all the CPU.
I reported it to the sysadmin team, but I'd like proof.

I thought about writing a CPU-intensive PL/SQL job, but that won't help with the system team: it might be Oracle's fault for all they know.

So I quickly wrote a bash prime-number script, and compared the results with an old PC at home:

$ cat p.sh
#!/bin/bash

if [ -z "$1" ]; then
  echo usage: $0 MAXNUM; exit 1
fi
nmax=$1

primes=(2)
n=3

while [ $n -le $nmax ]; do
  #echo checking $n
  isprime=1
  for d in ${primes[*]} ; do
        #echo check divisibility by $d
        r=$(( $n % $d ))
        if [ $r -eq 0 ]
        then
          isprime=0
          #echo no: $d divides $n
          break
        fi
  done

  if [ $isprime -eq 1 ]; then
    primes=(${primes[*]} $n)   # primes+=($n) doesn't work on older bash (Solaris 10)
  fi

  #echo primes = ${primes[*]}
  n=$(( $n + 2 ))
done

echo primes = ${primes[*]}





Results are indeed bad:

customer_system$ time ./p.sh 2000
primes = 2 3 5 7 11 ...
real    0m9.376s
user    0m8.496s
sys     0m1.182s

my_old_pc$ time ./p.sh 2000
primes = 2 3 5 7 11 ...
real    0m2.594s
user    0m2.348s
sys     0m0.226s


A VM on a 5-year-old PC is 3x faster! Wow!

Then I thought: this is a T-series SPARC system. I imagine they might tell me this is normal, since it is designed for parallel-intensive workloads (still, a 5-year-old PC...).
But to be sure I also tried some parallel work:

$ cat para.sh
#!/bin/bash

if [ -z "$2" ]; then
  echo usage: $0 parallel_degree primemax; exit 1
fi

max=$1
i=0

while [ $i -le $max ]; do
  ./p.sh $2 &

  i=$(( $i + 2 ))
done
wait
echo fin


Results:

my_old_pc$ time ./para.sh 10 500
primes = 2 3 5 7 ...
primes = 2 3 5 7 ...

primes = 2 3 5 7 ...
...
fin
real    0m1.487s
user    0m1.288s
sys     0m0.184s


customer_system$ time ./para.sh 10 500
primes = 2 3 5 7 ...
primes = 2 3 5 7 ...primes = 2 3 5 7 ...
...
fin
real    0m2.427s
user    0m5.027s
sys     0m4.148s




Less bad, but still BAD. There is definitely a weak CPU on this system, causing the poor DB performance.





Monday, November 19, 2012

mySQL fillup fake data & autocommit

Playing around with MySQL performance, I had to fill up tables with test data.

As with Oracle, how much you write before a commit has a huge effect on performance. If you don't pay attention, it takes AGES:


delimiter //
create table tbad (n integer, n2 integer, t varchar(20000), primary key(n))//

create procedure fill1 ()
begin
    declare x integer;
    set x=0;
    while x<50000 do
        insert into tbad values(x,x,'bobobobobobobobobo');
        set x=x+1;
    end while;
end;
//
mysql> call fill1 //
Query OK, 1 row affected (31 min 39.88 sec)

OUCH! 31 minutes for just 50K rows!

That's because MySQL has autocommit on by default.

The solution is to add something like this (say, a fill2 procedure):

create procedure fill2 ()
begin
    declare x integer;
    SET AUTOCOMMIT=0;
    set x=0;
    while x<50000 do
        insert into tbad values(x,x,'bobobobobobobobobo');
        set x=x+1;
        if mod(x,1000) = 0 THEN
            commit;
        end if;
    end while;
    commit;
end;
//

And this is of course much faster
Query OK, 0 rows affected (3.49 sec)


I often commit about every 1000 rows, but how much is good enough? I ran a test, writing a procedure that creates a test table and fills it with varying commit batch sizes.

delimiter //

# this is a result table
drop table if exists zztresults;
create table zztresults(testno int auto_increment primary key,nbrows int,commit_every int, time int);
//


drop procedure testperf1//

create procedure testperf1 (nbrows integer, commit_every integer)
begin
    declare x integer;
    declare t_start timestamp;
    declare totaltime integer;

    drop table if exists zztperf1;
    create table zztperf1 (n integer, n1 integer, v varchar(2000));

    select now() into t_start;

    SET AUTOCOMMIT=0;
    set x=0;
    while x<nbrows do
        insert into zztperf1 values(x,x,'bobobobobobobobobo');
        set x=x+1; 
        if mod(x,commit_every) = 0 THEN
   
        commit;
 
        end if;
    end while;
    commit;

    select (now()-t_start) into totaltime;
    insert into zztresults values (null,nbrows,commit_every,totaltime);
    select totaltime;
end;
//
call testperf1(1024000,1)//
call testperf1(1024000,10)//
call testperf1(1024000,50)//
call testperf1(1024000,100)//
call testperf1(1024000,500)//
call testperf1(1024000,1000)//
call testperf1(1024000,5000)//
call testperf1(1024000,10000)//
call testperf1(1024000,50000)//



mysql> select * from zztresults //
+--------+---------+--------------+--------+
| testno | nbrows  | commit_every | time   |
+--------+---------+--------------+--------+
|      1 | 1024000 |            1 | 870397 |
|      2 | 1024000 |           10 |  14597 |
|      3 | 1024000 |           50 |   1394 |
|      4 | 1024000 |          100 |    703 |
|      5 | 1024000 |          500 |    201 |
|      6 | 1024000 |         1000 |    120 |
|      7 | 1024000 |         5000 |     87 |
|      8 | 1024000 |        10000 |     83 |
|      9 | 1024000 |        50000 |     79 |
+--------+---------+--------------+--------+



So it can go from days down to a bit over a minute to load a million rows...

Put it in CSV file like this:

SELECT commit_every,time from zztresults
into outfile '/tmp/speed.csv'
fields terminated by ','
lines terminated by '\n'
//




The numbers are orders of magnitude apart, so I used a logarithmic scale on both axes:



commit_every   time(s)   log(commit_every)   log(time)
           1    870397          0.00             5.94
          10     14597          1.00             4.16
          50      1394          1.70             3.14
         100       703          2.00             2.85
         500       201          2.70             2.30
        1000       120          3.00             2.08
        5000        87          3.70             1.94
       10000        83          4.00             1.92
       50000        79          4.70             1.90

So in this case, with a default MySQL install, I'd go with committing about every 5000 rows.

What are the cons of larger commit batches? In my opinion: increased undo space.

On Oracle, that would mean the UNDO tablespace (or at least its internal undo segments) can grow.
On MySQL/InnoDB likewise, the rollback segment will grow to hold all the uncommitted data. In 5.5 and earlier it lives by default inside the one and only tablespace (ibdata).
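
On the Oracle side, a rough way to watch that undo usage while a big uncommitted batch is running (just a sketch) is:

-- undo blocks held by active transactions, converted to MB
select s.sid, s.username, t.used_ublk * p.value / 1024 / 1024 as undo_mb
from   v$transaction t
       join v$session s on s.taddr = t.addr
       cross join (select value from v$parameter where name = 'db_block_size') p;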





Sunday, October 28, 2012

Test HA 3: Oracle RAC setup

Following the setup of an iSCSI shared disk available to 2 VMs, this time I am installing a test Oracle RAC
- Again without any specific hardware (everything runs on a simple PC)

 

Interconnect network: 

In RAC parlance this is the private network between the nodes (== heartbeat net).
On each virtual machine I add a 'Host-only network' interface.
In my case I took the 192.168.160.0/24 network and changed DHCP to static addresses as follows (using system-config-network for example):

RAC1: eth1 192.168.160.101 / 255.255.255.0
RAC2: eth1 192.168.160.102 / 255.255.255.0


#service network restart
RAC1 $ ping 192.168.160.102
OK

 

Users & system setup


Reference: Database Installation Guide for Linux

On each node:

# yum install gcc elfutils-libelf-devel glibc-devel libaio-devel libstdc++-devel unixODBC unixODBC-devel gcc-c++

# groupadd dba
# groupadd oinstall
# groupadd asmdba
# useradd -m oracle -g oinstall -G dba,asmdba
# passwd oracle

# cat >>  /etc/security/limits.conf
 oracle          hard    nofile          65536
 oracle          soft    nofile          65536

# cat >>  /etc/sysctl.conf
 kernel.sem = 250        32000   100      128
 fs.file-max = 6815744
 net.ipv4.ip_local_port_range = 9000    65500
 net.core.rmem_default = 262144
 net.core.rmem_max = 4194304
 net.core.wmem_default = 262144
 net.core.wmem_max = 4194304
 fs.aio-max-nr = 1048576

# /sbin/sysctl -p

# mkdir /opt/oracle
# chown oracle.oinstall /opt/oracle/

ssh connection without password

RAC1# su - oracle
RAC1$ ssh-keygen -b 2048
(type enter for an empty passphrase)

Repeat on RAC2 (this creates .ssh directory and private/pub keys)

RAC1$ scp .ssh/id_rsa.pub rac2:/home/oracle/.ssh/authorized_keys
RAC2$ chmod 600 /home/oracle/.ssh/authorized_keys
RAC2$ scp .ssh/id_rsa.pub rac1:/home/oracle/.ssh/authorized_keys
RAC1$ chmod 600 /home/oracle/.ssh/authorized_keys

Then ssh works without password from one node to the other
RAC1$ ssh RAC2

OUI (the Oracle Installer) also needs to ssh to the local node itself, so on each node we also add our own public key:
$ cd ~/.ssh && cat id_rsa.pub >> authorized_keys

Choose some config names and IPs

in my case:

cluster name= raccluster
public hostname1 = rac1             192.168.0.201
public hostname2 = rac2             192.168.0.202
virtual hostname1 = rac1-vip       192.168.0.211
virtual hostname2 = rac2-vip       192.168.0.212

virtual IP:                 racvip          192.168.0.203
SCAN addresses:     rac-scan 192.168.0.213  192.168.0.214 192.168.0.215
(defined through the DNS, see my DNS post if, like me, you forgot how...)
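
If you do not want to put everything in the DNS, the non-SCAN names can also simply go into /etc/hosts on both nodes, something like this sketch (the -priv names are just a convention for the interconnect; the SCAN name should stay in DNS, since it needs several addresses):

# /etc/hosts (both nodes)
192.168.0.201     rac1
192.168.0.202     rac2
192.168.0.211     rac1-vip
192.168.0.212     rac2-vip
192.168.160.101   rac1-priv
192.168.160.102   rac2-priv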

vi /etc/nsswitch.conf
hosts: dns files

# service nscd restart

Created some directories:

# mkdir -p /u01/app/11.2.0/grid
# chown -R oracle:oinstall /u01/app/11.2.0/grid

# mkdir -p /u01/app/oracle/
# chown -R oracle:oinstall /u01/app/oracle/
# chmod -R 775 /u01/app/oracle/


NTPD:


ntpd is needed, and it needs the slewing option (-x):

# vi /etc/sysconfig/ntpd
     OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"

# service ntpd restart
# chkconfig ntpd on

 

Setup ASMLib

In this example we use ASM (alternatives are: OCFS2, GFS...).
We install ASMLib, which is just the lower-level software (kernel driver and low-level utilities). The rest of ASM is installed through the 'grid' software.

rac1 & rac2:
wget http://oss.oracle.com/projects/oracleasm/dist/files/RPMS/rhel5/x86/2.0.5/2.6.18-238.el5/oracleasm-2.6.18-238.el5-2.0.5-1.el5.i686.rpm
wget http://oss.oracle.com/projects/oracleasm-support/dist/files/RPMS/rhel5/x86/2.1.7/oracleasm-support-2.1.7-1.el5.i386.rpm
wget http://download.oracle.com/otn_software/asmlib/oracleasmlib-2.0.4-1.el5.i386.rpm
rpm -i oracleasm-support-2.1.7-1.el5.i386.rpm oracleasmlib-2.0.4-1.el5.i386.rpm oracleasm-2.6.18-238.el5-2.0.5-1.el5.i686.rpm

ASMlib configuration: (note the documentation is missing the '-i' option)

# /usr/sbin/oracleasm configure -i
Default user to own the driver interface []: oracle
Default group to own the driver interface []: dba
Start Oracle ASM library driver on boot (y/n) [n]: y
Scan for Oracle ASM disks on boot (y/n) [y]: y
Writing Oracle ASM library driver configuration: done


rac1# /usr/sbin/oracleasm createdisk ASMDISK1 /dev/sdd1

rac1# /usr/sbin/oracleasm listdisks
ASMDISK1

rac2# /usr/sbin/oracleasm scandisks
rac2# /usr/sbin/oracleasm listdisks
ASMDISK1

I can see the ASM disk on both nodes. Good!


Grid Installation

The grid software contains ASM and Oracle Clusterware.
In this test setup I used the same 'oracle' user (with hindsight I should have used a separate 'grid' user, which is much cleaner to separate the grid/clusterware from the DB itself).

export ORACLE_BASE=/u01/app/oracle/
export ORACLE_HOME=/u01/app/11.2.0/grid
./runInstaller

I met this error: [INS-40910] Virtual IP: entered is invalid.
Misleading: in my case it was due to a bad reverse DNS resolution...

Then run the root script, which starts a bunch of things and uses the ASM disk:

 ohasd is starting
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2672: Attempting to start 'ora.DATA.dg' on 'rac1'
CRS-2672: Attempting to start 'ora.registry.acfs' on 'rac1'


Verifications:

 $ ./crsctl check cluster -all
**************************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************


Oracle processes after clusterware + ASM (grid) install

Oh man, it works, but Oracle is not exactly lightweight... and we haven't installed any actual database yet!

root      8521     1  0 07:47 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root      8544     1  0 07:47 ?        00:00:04 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
root      9842     1  0 07:48 ?        00:00:01 /u01/app/11.2.0/grid/bin/orarootagent.bin
oracle   10624     1  0 07:51 ?        00:00:03 /u01/app/11.2.0/grid/bin/oraagent.bin
oracle   10639     1  0 07:51 ?        00:00:00 /u01/app/11.2.0/grid/bin/mdnsd.bin
oracle   10651     1  0 07:51 ?        00:00:00 /u01/app/11.2.0/grid/bin/gipcd.bin
oracle   10662     1  0 07:51 ?        00:00:01 /u01/app/11.2.0/grid/bin/gpnpd.bin
root     10677     1  0 07:51 ?        00:00:01 /u01/app/11.2.0/grid/bin/cssdmonitor
root     10694     1  0 07:51 ?        00:00:01 /u01/app/11.2.0/grid/bin/cssdagent
oracle   10696     1  0 07:51 ?        00:00:00 /u01/app/11.2.0/grid/bin/diskmon.bin -d -f
oracle   10715     1  0 07:51 ?        00:00:03 /u01/app/11.2.0/grid/bin/ocssd.bin
root     10792     1  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/octssd.bin
oracle   10852     1  0 07:52 ?        00:00:00 asm_pmon_+ASM1
oracle   10854     1  0 07:52 ?        00:00:00 asm_vktm_+ASM1
oracle   10858     1  0 07:52 ?        00:00:00 asm_gen0_+ASM1
oracle   10860     1  0 07:52 ?        00:00:00 asm_diag_+ASM1
oracle   10862     1  0 07:52 ?        00:00:00 asm_ping_+ASM1
oracle   10864     1  0 07:52 ?        00:00:00 asm_psp0_+ASM1
oracle   10866     1  0 07:52 ?        00:00:00 asm_dia0_+ASM1
oracle   10868     1  0 07:52 ?        00:00:00 asm_lmon_+ASM1
oracle   10870     1  0 07:52 ?        00:00:00 asm_lmd0_+ASM1
oracle   10873     1  0 07:52 ?        00:00:00 asm_lms0_+ASM1
oracle   10877     1  0 07:52 ?        00:00:00 asm_lmhb_+ASM1
oracle   10879     1  0 07:52 ?        00:00:00 asm_mman_+ASM1
oracle   10881     1  0 07:52 ?        00:00:00 asm_dbw0_+ASM1
oracle   10883     1  0 07:52 ?        00:00:00 asm_lgwr_+ASM1
oracle   10885     1  0 07:52 ?        00:00:00 asm_ckpt_+ASM1
oracle   10887     1  0 07:52 ?        00:00:00 asm_smon_+ASM1
oracle   10889     1  0 07:52 ?        00:00:00 asm_rbal_+ASM1
oracle   10891     1  0 07:52 ?        00:00:00 asm_gmon_+ASM1
oracle   10893     1  0 07:52 ?        00:00:00 asm_mmon_+ASM1
oracle   10895     1  0 07:52 ?        00:00:00 asm_mmnl_+ASM1
oracle   10897     1  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/oclskd.bin
oracle   10900     1  0 07:52 ?        00:00:00 asm_lck0_+ASM1
root     10912     1  0 07:52 ?        00:00:08 /u01/app/11.2.0/grid/bin/crsd.bin reboot
oracle   10928     1  0 07:52 ?        00:00:01 /u01/app/11.2.0/grid/bin/evmd.bin
oracle   10930     1  0 07:52 ?        00:00:00 asm_asmb_+ASM1
oracle   10932     1  0 07:52 ?        00:00:00 oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
root     10958     1  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/oclskd.bin
oracle   10960     1  0 07:52 ?        00:00:01 oracle+ASM1_ocr (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle   11017 10928  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/evmlogger.bin -o /u01/app/11.2.0/grid/evm/log/evmlogger.info -l /u01/app/11.2.0/grid/evm/log/ev
mlogger.log
oracle   11220     1  0 07:53 ?        00:00:02 /u01/app/11.2.0/grid/bin/oraagent.bin
root     11388     1  0 07:53 ?        00:00:11 /u01/app/11.2.0/grid/bin/orarootagent.bin
oracle   11415     1  0 07:53 ?        00:00:00 /u01/app/11.2.0/grid/opmn/bin/ons -d
oracle   11416 11415  0 07:53 ?        00:00:00 /u01/app/11.2.0/grid/opmn/bin/ons -d
oracle   11467     1  0 07:53 ?        00:00:03 /u01/app/11.2.0/grid/jdk/jre//bin/java -Doracle.supercluster.cluster.server=eonsd -Djava.net.preferIPv4Stack=true -Djava
.util.logging.config.file=/u01/app/11.2.0/grid/srvm/admin/logging.properties -classpath /u01/app/11.2.0/grid/jdk/jre//lib/rt.jar:/u01/app/11.2.0/grid/jlib/srvm.jar:/u01
/app/11.2.0/grid/jlib/srvmhas.jar:/u01/app/11.2.0/grid/jlib/supercluster.jar:/u01/app/11.2.0/grid/jlib/supercluster-common.jar:/u01/app/11.2.0/grid/ons/lib/ons.jar orac
le.supercluster.impl.cluster.EONSServerImpl
oracle   11609     1  0 07:53 ?        00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN2 -inherit
oracle   11620     1  0 07:54 ?        00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
oracle   12474     1  0 08:05 ?        00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit

 

ASM peek:

With export ORACLE_SID="+ASM1" set and $ORACLE_HOME/bin in the PATH:

$ asmcmd
ASMCMD> ls
DATA/
ASMCMD> du
Used_MB      Mirror_used_MB
    263                 263
...

Database install

Again with the same 'oracle' user, but with a different ORACLE_HOME:

$ export ORACLE_BASE=/u01/app/oracle/
$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1

$ ./runInstaller

-Create & configure
-Server Class
-RAC type
-Typical
-Storage type ASM, and location on DATA

-Global name: DTST


It went OK except for two problems.

Problem 1)
  ORA-00845: MEMORY_TARGET not supported (actually not enough shm!)
Added to the fstab:
shmfs             /dev/shm         tmpfs   size=1200m     0 0

and did it manually:
# mount -t tmpfs shmfs -o size=1200m /dev/shm

Problem 2)
 A strange error:
CRS-5804: Communication error with agent process
CRS-2632: There are no more servers to try to place resource 'ora.dtst.db' on that would satisfy its placement policy

RAC2$ ./srvctl status database -d DTST
Instance DTST1 is running on node rac1
Instance DTST2 is not running on node rac2

Tried to restart, expecting to see the error...
$ ./srvctl stop database -d DTST
$ ./srvctl start database -d DTST

$ ./srvctl status database -d DTST
Instance DTST1 is running on node rac1
Instance DTST2 is running on node rac2

but it went OK this time; I should have investigated this, but I skipped it for now...


Verifications

Documentation suggests this:

$ cd /u01/app/11.2.0/grid/bin
$ ./crsctl status resource -w "TYPE co 'ora'" -t
what an intuitive command!!!


( alternative: "./crsctl stat resource" is less nice, but I'm having difficulties remembering the other one )
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS      
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.asm
               ONLINE  ONLINE       rac1                     Started            
               ONLINE  ONLINE       rac2                     Started            
ora.eons
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.gsd
               OFFLINE OFFLINE      rac1                                        
               OFFLINE OFFLINE      rac2                                        
ora.net1.network
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.ons
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.registry.acfs
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2                                        
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac1                                        
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac1                                        
ora.dtst.db
      1        ONLINE  ONLINE       rac1                     Open               
      2        ONLINE  ONLINE       rac2                     Open               
ora.oc4j
      1        OFFLINE OFFLINE                                                  
ora.rac1.vip
      1        ONLINE  ONLINE       rac1                                        
ora.rac2.vip
      1        ONLINE  ONLINE       rac2                                        
ora.scan1.vip
      1        ONLINE  ONLINE       rac2                                        
ora.scan2.vip
      1        ONLINE  ONLINE       rac1                                        
ora.scan3.vip
      1        ONLINE  ONLINE       rac1        

Proper RAC shutdown

# ./crsctl stop cluster -all
"This command attempts to gracefully stop resources managed by Oracle Clusterware while attempting to stop the Oracle Clusterware stack."



Conclusion

Ok this was simplistic, but we do have our test RAC system working without any special hardware, using the iSCSI target from the previous post.

At this point we can backup the 2 Virtual machines (rac1, rac2), as well as the file used for the iSCSI disk. And experiment at will...