Saturday, September 14, 2013

Oracle: partitioning a huge table _online_

My customer has an application in production 24/7, with tables getting huge (billions of rows...). We'd like to partition them to compress and move older data.

Not a problem - as long as we have the disk space. With Oracle 10g/11g we can partition the table online, without any interruption.

The package to use is DBMS_REDEFINITION, and the basic idea is:
-Create a new table the way you want it (partitioned...)
-This new table becomes a materialized view, kept in sync with the existing table (so we need double the space!)
-Add indexes etc.
-When ready: briefly lock the table while a last resync is done and the segments are switched in the data dictionary

The segments formerly belonging to the MV now belong to the new table, and the switch is completely transparent to the application.

( I think in Oracle 12c, this is getting easier with a single command. -will check)



-- In this example we have a table: BIGAPP1.BIGTABLE


-- create the new table as we want it  (it will be transformed into a materialized view)
drop table BIGAPP1.BIGTABLE_TMPPARTNING ;

create table BIGAPP1.BIGTABLE_TMPPARTNING
PARTITION BY RANGE (TIME_STAMP_UTC)
( PARTITION P2010 VALUES LESS THAN (TO_DATE('2011-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2011 VALUES LESS THAN (TO_DATE('2012-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2012 VALUES LESS THAN (TO_DATE('2013-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2013 VALUES LESS THAN (TO_DATE('2014-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2014 VALUES LESS THAN (TO_DATE('2015-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110
) as select * from BIGAPP1.BIGTABLE where 0=1 ;


-- call dbms_redefinition
-- there are two methods, by PK or by ROWID; here we had to use ROWID
-- ref: Admin Guide / chapter 15 Managing Tables / Redefining Tables Online

set serveroutput on;
exec dbms_output.put_line('check');

BEGIN
DBMS_REDEFINITION.CAN_REDEF_TABLE(
   uname        => 'BIGAPP1',
   tname        => 'BIGTABLE',
   options_flag => DBMS_REDEFINITION.CONS_USE_ROWID);
END;
/

-- The following actually copies the data to the materialized view, and adds a temporary index (because of CONS_USE_ROWID)
-- on the (slow) test system it took 25 minutes for 11M rows
BEGIN
DBMS_REDEFINITION.START_REDEF_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING',
       null, dbms_redefinition.cons_use_rowid);
END;
/

-- now our table has become an MV, but it has only been synced once (an INSERT at this point is not yet copied over)
-- now we may create the indexes

-- local index
CREATE INDEX BIGAPP1.IX_BIGTABLE_AUDOBJID ON BIGAPP1.BIGTABLE_TMPPARTNING(SURV_OBJ_ID) LOCAL
( PARTITION P2010  TABLESPACE TS_DATA110, 
  PARTITION P2011  TABLESPACE TS_DATA110, 
  PARTITION P2012  TABLESPACE TS_DATA110, 
  PARTITION P2013  TABLESPACE TS_DATA110, 
  PARTITION P2014  TABLESPACE TS_DATA110 );
-- 3 minutes

...

-- a global index, because it is unique and does not contain the partition key (so it cannot be local)
-- we partition it with another range

CREATE UNIQUE INDEX BIGAPP1.IX_BIGTABLE_OBJID ON BIGAPP1.BIGTABLE_TMPPARTNING(OB_ID) GLOBAL PARTITION BY RANGE (OB_ID)
( PARTITION P10M VALUES LESS THAN (10000000) TABLESPACE TS_DATA110,
  PARTITION P20M VALUES LESS THAN (20000000) TABLESPACE TS_DATA110,
  PARTITION P30M VALUES LESS THAN (30000000) TABLESPACE TS_DATA110,
  PARTITION PMAX VALUES LESS THAN (MAXVALUE) TABLESPACE TS_DATA110
) ;


-- process grants, triggers, constraints and privileges
-- here we do not pass CONS_ORIG_PARAMS for the copy_indexes parameter (we pass 0), because we re-created the indexes manually

DECLARE
num_errors PLS_INTEGER;
BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING', 
   0, TRUE, TRUE, TRUE, TRUE, num_errors);
END;
/
-- 1 minute

--We can check for errors here, typically: already existing constraints
select object_name, base_table_name, ddl_txt from    DBA_REDEFINITION_ERRORS;

-- Optional re-sync: I prefer to do one here since it reduces the work of the final step, which is the only one that takes a lock

BEGIN
DBMS_REDEFINITION.SYNC_INTERIM_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING');
END;
/

-- And right after this, call the final step
BEGIN
DBMS_REDEFINITION.FINISH_REDEF_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING');
END;
/

-- 20 sec

-- this final step took a short lock, did a final sync and switched the segment references in the data dictionary
-- BIGTABLE is now partitioned, and BIGTABLE_TMPPARTNING now points to the old segments (including the data: we still have a copy)

-- We may keep the old copy around for a little while, just to verify everything is OK

-- Verify we do have data in different partitions now:

select count(*) from BIGAPP1.BIGTABLE partition (P2010);
select count(*) from BIGAPP1.BIGTABLE partition (P2011);
select count(*) from BIGAPP1.BIGTABLE partition (P2012);
select count(*) from BIGAPP1.BIGTABLE partition (P2013);
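
-- Once everything is verified (and the old copy has been kept around long enough), a possible
-- cleanup, just a sketch: check the new partition segments, then drop the interim copy to reclaim the space
select partition_name, round(bytes/1024/1024) as mb
from   dba_segments
where  owner='BIGAPP1' and segment_name='BIGTABLE';

drop table BIGAPP1.BIGTABLE_TMPPARTNING purge;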


-- OK DONE!  




References:
Admin Guide, online redefinition example: "This example demonstrates redefining a single partition. It moves the oldest partition of a range-partitioned sales table to a tablespace named TBS_LOW_FREQ."

Thursday, September 12, 2013

Generate stuff with SQL 'connect by'


'connect by' was added in Oracle to build hierarchical queries, but it is also really useful to generate STUFF.
(I think I first saw this usage from Jonathan Lewis...)

SQL> select rownum from dual connect by level <= 4;
         1
         2
         3
         4


Create a huge table:

With the help of RPAD:

SQL> select rpad('A',50,'B') from dual;
ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB


We can quickly create a big table with data:

SQL> create table testbig as
             select rownum as n,
                    rpad('A',50,'B') as txt
             from dual connect by level <= 100000;


Somewhat more varied with modulo:

create table testbig2 as
  select rownum as n,
         mod(rownum,100) as m,
         rpad('A',50,'B') as txt
  from dual connect by level <= 10000;



Generate days:
SQL> select TO_DATE('2012-01-01','YYYY-MM-DD')+rownum from dual connect  by level <=3;
2012-01-02
2012-01-03
2012-01-04





Generate months:
SQL> select ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum) from dual connect  by level <=3;
2012-02-01
2012-03-01
2012-04-01



Generate PARTITION clauses...

SQL> set pagesize 0
SQL> select 'PARTITION P'
     ||TO_CHAR(ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum),'YYYYMM')
     ||' VALUES LESS THAN (TO_DATE('''||TO_CHAR(ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum),'YYYYMMDD') ||''',''YYYYMMDD'')),'
     from dual connect by level <=3;
PARTITION P201202 VALUES LESS THAN (TO_DATE('20120201','YYYYMMDD')),
PARTITION P201203 VALUES LESS THAN (TO_DATE('20120301','YYYYMMDD')),
PARTITION P201204 VALUES LESS THAN (TO_DATE('20120401','YYYYMMDD')),






Sadly I know of no direct equivalent in MySQL
(discussed here: stackoverflow.com/questions/701444 )
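
A rough workaround (not as clean; just a sketch) is to abuse a big system table plus a user variable to number the rows:

SELECT @n := @n + 1 AS n
FROM information_schema.columns, (SELECT @n := 0) AS init
LIMIT 4;

(gives 1,2,3,4 - as long as information_schema.columns has enough rows, which it always does for small series)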


Of course in bash you can also generate things like so:

$ for i in {1..5}
> do
> echo PARTITION$i
> done
PARTITION1
PARTITION2
PARTITION3
PARTITION4
PARTITION5


But dates are not as easy to process in bash, though...
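
With GNU date (if it is available), something like this should do, though it is not as neat as the SQL version:

$ for i in {1..3}
> do
>   date -d "2012-01-01 +$i month" +%Y-%m-%d
> done
2012-02-01
2012-03-01
2012-04-01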

MySQL SHOW SLAVE STATUS Explained

Sometimes I don't have to deal with MySQL slaves for months, and then I forget again which entry means what.

Here is a SHOW SLAVE STATUS explained, taken while the slave is catching up (I had just restarted it after an error).





On the slave:
10:24 [vega]:> show slave status\G
...
      Master_Log_File: vega20-bin.000740  current log file and position on the master
  Read_Master_Log_Pos: 3396922             (== SHOW MASTER STATUS;
                                           sometimes takes a moment to be refreshed)

       Relay_Log_File: vega21-relay-bin.000074 relay file on the slave we are
                                                currently applying

        Relay_Log_Pos: 274423937          position in bytes within that relay file

Relay_Master_Log_File: vega20-bin.000727  corresponding log file on the master to
                                          what we're applying
                                          (727 -> 740: about 13 binlog files, roughly
                                          1 GB each here, are left to be applied)
...
  Exec_Master_Log_Pos: 274423781          position on master log corresponding to
                                          what we're applying
...
Seconds_Behind_Master: 110997             how far behind the master the SQL thread is,
                                          in seconds of master time (not an estimate of
                                          the time needed to catch up)

Files on the slave:
        219 Sep 10 04:08 vega21-relay-bin.000076
     510344 Sep 10 04:08 vega21-relay-bin.000075
 1073741921 Sep 10 04:08 vega21-relay-bin.000074 <- currently applying this one
     193266 Sep 10 07:18 vega21-relay-bin.000079
  (already-applied relay logs are purged a bit later)
...
  196984797 Sep 11 09:03 vega21-relay-bin.000112
    3397078 Sep 11 10:24 vega21-relay-bin.000114 <- latest events copied from the master
            (the master's binlog is copied into the slave's relay logs; sizes may not match
             the master's, e.g. after a stop/start)

Files on the master:
 1073904414 Sep  7 09:59 vega20-bin.000720
 1073774868 Sep 10 02:21 vega20-bin.000721
...
 1074251834 Sep 10 04:07 vega20-bin.000727 <- the data being applied comes from here
                                             (copied on the slave into relay log 74)
 1073934723 Sep 10 07:17 vega20-bin.000728
 1073993761 Sep 10 07:27 vega20-bin.000729
...
 1091352292 Sep 11 09:02 vega20-bin.000739
    3396922 Sep 11 10:23 vega20-bin.000740 <- current log being written by the master





Status on the master:

10:24 [(none)]:> show master status;
+-------------------+----------+--------------+------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| vega20-bin.000740 |  3396922 |              |                  |
+-------------------+----------+--------------+------------------+




Other notes:

11:00 [vega]:> show processlist;
+--+-----------+----+----+-------+------+--------------------------------+------+
|Id|User       |Host|db  |Command|Time  |State                           |Info  |
+--+-----------+----+----+-------+------+--------------------------------+------+
| 5|system user|    |NULL|Connect|  6979|Waiting for master to send event|NULL  |
|69|system user|    |NULL|Connect|111664|Reading event from the relay log|NULL  |


Waiting for master to send event: the master currently isn't writing much, so there is nothing new to send to the slave's relay log.

Reading event from the relay log: the SQL thread is reading the relay file. If this state shows up often, we're spending more time reading than applying... (slow I/O!)

Sometimes we can see the statement being replicated on the slave:

|97|system user |    |STOCKS|Connect|101289| NULL      |TRUNCATE TABLE `TOTA01`|

Monday, September 2, 2013

simulating a slow drive


I want to simulate a slow drive, to better grasp the symptoms of a database suffering from a slow disk.

Until now I used to follow up on such suspicions by testing with IOzone or sysbench/fileio, but:
  • This is sometimes difficult to do if the sysadmin is not me and will not or cannot cooperate
  • This is imperfect, because we would typically run such tests at night, when the SAN I suspect is less loaded ;-)
I could use an SD card or something, but most of my test setups are VMs, so I would prefer a software solution.
I almost started to write my own slow FUSE filesystem, when I finally found a nice trick from this post:
http://askubuntu.com/questions/13590/can-i-simulate-a-slow-hard-drive

Use nbd, the Network Block Device, and then rate limit access to it using say trickle.
sudo apt-get install nbd-client nbd-server trickle


Knowing neither trickle nor nbd, I tested each  before doing the setup:


1) Trickle test


Trickle will limit the network speed of a command:
trickle -u <upload speed> -d <download speed> command

Example:
$ wget http://192.168.0.6/videos2007/CIMG0500-small.avi
=> 11M/s

$ trickle -d 20 wget http://192.168.0.6/videos2007/CIMG0500-small.avi
=> 22 K/s


2) NBD test

NBD stands for Network Block Device. I understand it to be a simple way to use a block device from a remote machine (simpler than DRBD or iSCSI target/initiator setups).


The backing store may be a file, a partition, a logical volume... Let's create a file:



dd if=/dev/zero of=/home/phil/apps/NBD/nbd-drive bs=1024 count=10240
Then run the nbd server on that file:
nbd-server 9500 /home/phil/apps/NBD/nbd-drive 

It launches an 'nbd-server' daemon, which does listen on port 9500:


$ netstat -plnt
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
...          
tcp        0      0 0.0.0.0:9500            0.0.0.0:*               LISTEN      2195/nbd-server 



Let's try to connect the client:

sudo nbd-client localhost 9500 /dev/nbd0

Seems to launch a daemon as well:
root      2276     1  0 07:04 ?        00:00:00 nbd-client localhost 9500 /dev/nbd0
root      2277     2  0 07:04 ?        00:00:00 [nbd0]

We can now format this device and mount it:
sudo mkfs /dev/nbd0
sudo mount /dev/nbd0 /mnt

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/nbd0       9.7M   92K  9.1M   1% /mnt



The test is OK. Let's umount and release:

sudo umount /mnt
sudo nbd-client -d /dev/nbd0
killall nbd-server



3) Trickle + NBD to make a slow drive


I succeeded by running the server through trickle this way:

$ trickle -d 20 -u 20 nbd-server 9500 /home/phil/apps/NBD/nbd-drive bs=1024 count=10240
trickle: Could not reach trickled, working independently: No such file or directory

Then started the client again with:


sudo nbd-client localhost 9500 /dev/nbd0

And indeed this block device was really slow :

$ sudo dd if=/dev/nbd0 of=/dev/null bs=65536 skip=100 count=10
10+0 records in
10+0 records out
655360 bytes (655 kB) copied, 28.5783 s, 22.9 kB/s



Finally, I wrote this script to get the drive ready when I need it:

$ cat mountslowdrive.sh 
#!/bin/bash

FILE_OR_PARTITION=/home/phil/apps/NBD/nbd-drive
MOUNTPOINT=/home/phil/apps/NBD/mnt

if [ $# -eq 0 ] ; then

  echo making/mounting slow drive
  trickle -d 20 -u 20 nbd-server 9500 $FILE_OR_PARTITION bs=1024 count=10240
  sudo nbd-client localhost 9500 /dev/nbd0
  sudo mount /dev/nbd0 $MOUNTPOINT


else
  if [ \( $# -eq 1 \) -a \(  $1 == "-u" \) ]; then

    echo unmounting slow drive

    sudo umount $MOUNTPOINT
    sudo nbd-client -d /dev/nbd0
    killall nbd-server

  else
    echo  USAGE: $0 [-u]
    echo         -u : umount the slow drive
  fi
fi
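
Once mounted with the script, a quick write check through the filesystem should crawl just the same (a sketch, using the mount point from the script):

$ sudo dd if=/dev/zero of=/home/phil/apps/NBD/mnt/writetest bs=65536 count=10 conv=fsync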





Todo:
Use the server's -l / /etc/nbd_server.allow option to restrict the allowed clients (otherwise any client on the network can use the block device...)




Thursday, August 29, 2013

proving bad CPU


Acting as a DBA today, I am experiencing what seems to be bad CPU performance.
The Oracle database is running on an Oracle/SPARC virtualization system.

AWR/Statspack shows (top timed events):
Event                    Waits  Time(s)  Avg wait(ms)  % DB time  Wait Class
DB CPU                           11,825                    87.51
log file sync           21,933    1,292            59       9.56  Commit
db file parallel read   16,277      268            16       1.98  User I/O

But looking at the top "SQL ordered by CPU Time" I see no single demanding statement;
they all seem to be using a lot of CPU evenly.

Working on the system, everything is slow, for instance 'autotrace' and 'explain plan'.

I have a strong suspicion it's the system's fault. Maybe some other VM on the machine is eating all the CPU.
I reported it to the sysadmin team, but I'd like proof.

I thought about writing a CPU-intensive PL/SQL job, but that won't help with the system team: it might be Oracle's fault for all they know.

So I quickly wrote a bash prime-number script, and compared the results with an old PC at home:

$ cat p.sh
#!/bin/bash

if [ -z "$1" ]; then
  echo usage: $0 MAXNUM; exit 1
fi
nmax=$1

primes=(2)
n=3

while [ $n -le $nmax ]; do
  #echo checking $n
  isprime=1
  for d in ${primes[*]} ; do
        #echo check divisibility by $d
        r=$(( $n % $d ))
        if [ $r -eq 0 ]
        then
          isprime=0
          #echo no: $d divides $n
          break
        fi
  done

  if [ $isprime -eq 1 ]; then
    primes=(${primes[*]} $n)   # primes+=($n) doesn't work on older bash (Solaris 10)
  fi

  #echo primes = ${primes[*]}
  n=$(( $n + 2 ))
done

echo primes = ${primes[*]}





Results are indeed bad:

customer_system$ time ./p.sh 2000
primes = 2 3 5 7 11 ...
real    0m9.376s
user    0m8.496s
sys     0m1.182s

my_old_pc$ time ./p.sh 2000
primes = 2 3 5 7 11 ...
real    0m2.594s
user    0m2.348s
sys     0m0.226s


A VM on a 5-year-old PC is 3x faster! Wow!

Then I thought: this is a T-series SPARC system. I imagine they might tell me this is normal, since it is designed for parallel-intensive workloads (still, a 5-year-old PC...).
But to be sure I also tried some parallel work:

$ cat para.sh
#!/bin/bash

if [ -z "$2" ]; then
  echo usage: $0 parallel_degree primemax; exit 1
fi

max=$1
i=0

while [ $i -le $max ]; do
  ./p.sh $2 &

  i=$(( $i + 2 ))
done
wait
echo fin


Results:

my_old_pc$ time ./para.sh 10 500
primes = 2 3 5 7 ...
primes = 2 3 5 7 ...

primes = 2 3 5 7 ...
...
fin
real    0m1.487s
user    0m1.288s
sys     0m0.184s


customer_system$ time ./para.sh 10 500
primes = 2 3 5 7 ...
primes = 2 3 5 7 ...primes = 2 3 5 7 ...
...
fin
real    0m2.427s
user    0m5.027s
sys     0m4.148s




Less bad, but still BAD. There is definitely a weak CPU on this system, causing the poor DB performance.





Monday, November 19, 2012

mySQL fillup fake data & autocommit

Playing around with MySQL performance, I had to fill up tables with test data.

As with Oracle, how much you write before a commit has a huge effect on performance. If you don't pay attention, it takes AGES:


delimiter //
create table tbad (n integer, n2 integer, t varchar(20000), primary key(n))//

create procedure fill1 ()
begin
    declare x integer;
    set x=0;
    while x<50000 do
        insert into tbad values(x,x,'bobobobobobobobobo');
        set x=x+1;
    end while;
end;
//
mysql> call fill1 //
Query OK, 1 row affected (31 min 39.88 sec)

OUCH! 31 minutes for just 50K rows!

That's because MySQL has autocommit on by default.

The solution is to add something like this (say, a fill2 procedure):

create procedure fill2 ()
begin
    declare x integer;
    SET AUTOCOMMIT=0;
    set x=0;
    while x<50000 do
        insert into tbad values(x,x,'bobobobobobobobobo');
        set x=x+1;
        if mod(x,1000) = 0 THEN
            commit;
        end if;
    end while;
    commit;
end;
//

And this is of course much faster
Query OK, 0 rows affected (3.49 sec)


I often commit about every 1000 rows, but how much is good enough? I ran a test, writing a procedure that creates a test table and fills it with varying commit batch sizes.

delimiter //

# this is a result table
drop table if exists zztresults;
create table zztresults(testno int auto_increment primary key,nbrows int,commit_every int, time int);
//


drop procedure testperf1//

create procedure testperf1 (nbrows integer, commit_every integer)
begin
    declare x integer;
    declare t_start timestamp;
    declare totaltime integer;

    drop table if exists zztperf1;
    create table zztperf1 (n integer, n1 integer, v varchar(2000));

    select now() into t_start;

    SET AUTOCOMMIT=0;
    set x=0;
    while x<nbrows do
        insert into zztperf1 values(x,x,'bobobobobobobobobo');
        set x=x+1; 
        if mod(x,commit_every) = 0 THEN
   
        commit;
 
        end if;
    end while;
    commit;

    select (now()-t_start) into totaltime;
    insert into zztresults values (null,nbrows,commit_every,totaltime);
    select totaltime;
end;
//
call testperf1(1024000,1)//
call testperf1(1024000,10)//
call testperf1(1024000,50)//
call testperf1(1024000,100)//
call testperf1(1024000,500)//
call testperf1(1024000,1000)//
call testperf1(1024000,5000)//
call testperf1(1024000,10000)//
call testperf1(1024000,50000)//



mysql> select * from zztresults //
+--------+---------+--------------+--------+
| testno | nbrows  | commit_every | time   |
+--------+---------+--------------+--------+
|      1 | 1024000 |            1 | 870397 |
|      2 | 1024000 |           10 |  14597 |
|      3 | 1024000 |           50 |   1394 |
|      4 | 1024000 |          100 |    703 |
|      5 | 1024000 |          500 |    201 |
|      6 | 1024000 |         1000 |    120 |
|      7 | 1024000 |         5000 |     87 |
|      8 | 1024000 |        10000 |     83 |
|      9 | 1024000 |        50000 |     79 |
+--------+---------+--------------+--------+



So it can go from days down to a bit over a minute to load a million rows...

Put it in CSV file like this:

SELECT commit_every,time from zztresults
into outfile '/tmp/speed.csv'
fields terminated by ','
lines terminated by '\n'
//




The numbers are orders of magnitude apart, so I used a logarithmic scale on both axes:



commit_every   time(s)   log(commit_every)   log(time)
           1    870397          0.00             5.94
          10     14597          1.00             4.16
          50      1394          1.70             3.14
         100       703          2.00             2.85
         500       201          2.70             2.30
        1000       120          3.00             2.08
        5000        87          3.70             1.94
       10000        83          4.00             1.92
       50000        79          4.70             1.90

So in this case, with a default MySQL install, I'd go with committing about every 5000 rows.

What are the cons of larger commit batches? In my opinion: increased undo space.

On Oracle, that would mean the UNDO tablespace (or at least its internal undo segments) can grow.
On MySQL/InnoDB likewise, the rollback segment will grow to hold all the uncommitted data. In 5.5 and earlier it lives by default inside the one and only tablespace (ibdata).
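
On the Oracle side, a rough way to watch that undo usage while a big uncommitted batch is running (just a sketch) is:

-- undo blocks held by active transactions, converted to MB
select s.sid, s.username, t.used_ublk * p.value / 1024 / 1024 as undo_mb
from   v$transaction t
       join v$session s on s.taddr = t.addr
       cross join (select value from v$parameter where name = 'db_block_size') p;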





Sunday, October 28, 2012

Test HA 3: Oracle RAC setup

Following the setup of an iSCSI shared disk available to 2 VMs, this time I am installing a test Oracle RAC
- Again without any specific hardware (everything runs on a simple PC)

 

Interconnect network: 

In RAC parlance this is the private network between the nodes (== heartbeat net).
On each virtual machine I add a 'Host-only network' interface.
In my case I took the 192.168.160.0/24 network and changed DHCP to static addresses as follows (using system-config-network for example):

RAC1: eth1 192.168.160.101 / 255.255.255.0
RAC2: eth1 192.168.160.102 / 255.255.255.0


#service network restart
RAC1 $ ping 192.168.160.102
OK

 

Users & system setup


Reference: Database Installation Guide for Linux

On each node:

# yum install gcc elfutils-libelf-devel glibc-devel libaio-devel libstdc++-devel unixODBC unixODBC-devel gcc-c++

# groupadd dba
# groupadd oinstall
# groupadd asmdba
# useradd -m oracle -g oinstall -G dba,asmdba
# passwd oracle

# cat >>  /etc/security/limits.conf
 oracle          hard    nofile          65536
 oracle          soft    nofile          65536

# cat >>  /etc/sysctl.conf
 kernel.sem = 250        32000   100      128
 fs.file-max = 6815744
 net.ipv4.ip_local_port_range = 9000    65500
 net.core.rmem_default = 262144
 net.core.rmem_max = 4194304
 net.core.wmem_default = 262144
 net.core.wmem_max = 4194304
 fs.aio-max-nr = 1048576

# /sbin/sysctl -p

# mkdir /opt/oracle
# chown oracle.oinstall /opt/oracle/

ssh connection without password

RAC1# su - oracle
RAC1$ ssh-keygen -b 2048
(type enter for an empty passphrase)

Repeat on RAC2 (this creates .ssh directory and private/pub keys)

RAC1$ scp .ssh/id_rsa.pub rac2:/home/oracle/.ssh/authorized_keys
RAC2$ chmod 600 /home/oracle/.ssh/authorized_keys
RAC2$ scp .ssh/id_rsa.pub rac1:/home/oracle/.ssh/authorized_keys
RAC1$ chmod 600 /home/oracle/.ssh/authorized_keys

Then ssh works without password from one node to the other
RAC1$ ssh RAC2

OUI (the Oracle Installer) also needs to ssh to the local node itself, so on each node we also add our own public key:
$ cd ~/.ssh && cat id_rsa.pub >> authorized_keys

Choose some config names and IPs

in my case:

cluster name= raccluster
public hostname1 = rac1             192.168.0.201
public hostname2 = rac2             192.168.0.202
virtual hostname1 = rac1-vip       192.168.0.211
virtual hostname2 = rac2-vip       192.168.0.212

virtual IP:                 racvip          192.168.0.203
SCAN addresses:     rac-scan 192.168.0.213  192.168.0.214 192.168.0.215
(defined through the DNS, see my DNS post if, like me, you forgot how...)
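
If you do not want to put everything in the DNS, the non-SCAN names can also simply go into /etc/hosts on both nodes, something like this sketch (the -priv names are just a convention for the interconnect; the SCAN name should stay in DNS, since it needs several addresses):

# /etc/hosts (both nodes)
192.168.0.201     rac1
192.168.0.202     rac2
192.168.0.211     rac1-vip
192.168.0.212     rac2-vip
192.168.160.101   rac1-priv
192.168.160.102   rac2-priv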

vi /etc/nsswitch.conf
hosts: dns files

# service nscd restart

Created some directories:

# mkdir -p /u01/app/11.2.0/grid
# chown -R oracle:oinstall /u01/app/11.2.0/grid

# mkdir -p /u01/app/oracle/
# chown -R oracle:oinstall /u01/app/oracle/
# chmod -R 775 /u01/app/oracle/


NTPD:


ntpd is needed, and it needs the slewing option (-x):

# vi /etc/sysconfig/ntpd
     OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"

# service ntpd restart
# chkconfig ntpd on

 

Setup ASMLib

In this example we use ASM (alternatives are: OCFS2, GFS...).
We install ASMLib, which is just the lower-level software (kernel driver and low-level utilities). The rest of ASM is installed through the 'grid' software.

rac1 & rac2:
wget http://oss.oracle.com/projects/oracleasm/dist/files/RPMS/rhel5/x86/2.0.5/2.6.18-238.el5/oracleasm-2.6.18-238.el5-2.0.5-1.el5.i686.rpm
wget http://oss.oracle.com/projects/oracleasm-support/dist/files/RPMS/rhel5/x86/2.1.7/oracleasm-support-2.1.7-1.el5.i386.rpm
wget http://download.oracle.com/otn_software/asmlib/oracleasmlib-2.0.4-1.el5.i386.rpm
rpm -i oracleasm-support-2.1.7-1.el5.i386.rpm oracleasmlib-2.0.4-1.el5.i386.rpm oracleasm-2.6.18-238.el5-2.0.5-1.el5.i686.rpm

ASMlib configuration: (note the documentation is missing the '-i' option)

# /usr/sbin/oracleasm configure -i
Default user to own the driver interface []: oracle
Default group to own the driver interface []: dba
Start Oracle ASM library driver on boot (y/n) [n]: y
Scan for Oracle ASM disks on boot (y/n) [y]: y
Writing Oracle ASM library driver configuration: done


rac1# /usr/sbin/oracleasm createdisk ASMDISK1 /dev/sdd1

rac1# /usr/sbin/oracleasm listdisks
ASMDISK1

rac2# /usr/sbin/oracleasm scandisks
rac2# /usr/sbin/oracleasm listdisks
ASMDISK1

I can see the ASM disk on both nodes. Good!


Grid Installation

The grid software contains ASM and Oracle Clusterware.
In this test setup I used the same 'oracle' user (with hindsight I should have used a separate 'grid' user, which is much cleaner to separate the grid/clusterware from the DB itself).

export ORACLE_BASE=/u01/app/oracle/
export ORACLE_HOME=/u01/app/11.2.0/grid
./runInstaller

I met this error: [INS-40910] Virtual IP: entered is invalid.
Misleading: in my case it was due to a bad reverse DNS resolution...

Then run the root script, which starts a bunch of things and uses the ASM disk:

 ohasd is starting
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2672: Attempting to start 'ora.DATA.dg' on 'rac1'
CRS-2672: Attempting to start 'ora.registry.acfs' on 'rac1'


Verifications:

 $ ./crsctl check cluster -all
**************************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************


Oracle processes after clusterware + ASM (grid) install

Oh man, it works, but Oracle is not exactly lightweight... and we haven't installed any actual database yet!

root      8521     1  0 07:47 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root      8544     1  0 07:47 ?        00:00:04 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
root      9842     1  0 07:48 ?        00:00:01 /u01/app/11.2.0/grid/bin/orarootagent.bin
oracle   10624     1  0 07:51 ?        00:00:03 /u01/app/11.2.0/grid/bin/oraagent.bin
oracle   10639     1  0 07:51 ?        00:00:00 /u01/app/11.2.0/grid/bin/mdnsd.bin
oracle   10651     1  0 07:51 ?        00:00:00 /u01/app/11.2.0/grid/bin/gipcd.bin
oracle   10662     1  0 07:51 ?        00:00:01 /u01/app/11.2.0/grid/bin/gpnpd.bin
root     10677     1  0 07:51 ?        00:00:01 /u01/app/11.2.0/grid/bin/cssdmonitor
root     10694     1  0 07:51 ?        00:00:01 /u01/app/11.2.0/grid/bin/cssdagent
oracle   10696     1  0 07:51 ?        00:00:00 /u01/app/11.2.0/grid/bin/diskmon.bin -d -f
oracle   10715     1  0 07:51 ?        00:00:03 /u01/app/11.2.0/grid/bin/ocssd.bin
root     10792     1  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/octssd.bin
oracle   10852     1  0 07:52 ?        00:00:00 asm_pmon_+ASM1
oracle   10854     1  0 07:52 ?        00:00:00 asm_vktm_+ASM1
oracle   10858     1  0 07:52 ?        00:00:00 asm_gen0_+ASM1
oracle   10860     1  0 07:52 ?        00:00:00 asm_diag_+ASM1
oracle   10862     1  0 07:52 ?        00:00:00 asm_ping_+ASM1
oracle   10864     1  0 07:52 ?        00:00:00 asm_psp0_+ASM1
oracle   10866     1  0 07:52 ?        00:00:00 asm_dia0_+ASM1
oracle   10868     1  0 07:52 ?        00:00:00 asm_lmon_+ASM1
oracle   10870     1  0 07:52 ?        00:00:00 asm_lmd0_+ASM1
oracle   10873     1  0 07:52 ?        00:00:00 asm_lms0_+ASM1
oracle   10877     1  0 07:52 ?        00:00:00 asm_lmhb_+ASM1
oracle   10879     1  0 07:52 ?        00:00:00 asm_mman_+ASM1
oracle   10881     1  0 07:52 ?        00:00:00 asm_dbw0_+ASM1
oracle   10883     1  0 07:52 ?        00:00:00 asm_lgwr_+ASM1
oracle   10885     1  0 07:52 ?        00:00:00 asm_ckpt_+ASM1
oracle   10887     1  0 07:52 ?        00:00:00 asm_smon_+ASM1
oracle   10889     1  0 07:52 ?        00:00:00 asm_rbal_+ASM1
oracle   10891     1  0 07:52 ?        00:00:00 asm_gmon_+ASM1
oracle   10893     1  0 07:52 ?        00:00:00 asm_mmon_+ASM1
oracle   10895     1  0 07:52 ?        00:00:00 asm_mmnl_+ASM1
oracle   10897     1  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/oclskd.bin
oracle   10900     1  0 07:52 ?        00:00:00 asm_lck0_+ASM1
root     10912     1  0 07:52 ?        00:00:08 /u01/app/11.2.0/grid/bin/crsd.bin reboot
oracle   10928     1  0 07:52 ?        00:00:01 /u01/app/11.2.0/grid/bin/evmd.bin
oracle   10930     1  0 07:52 ?        00:00:00 asm_asmb_+ASM1
oracle   10932     1  0 07:52 ?        00:00:00 oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
root     10958     1  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/oclskd.bin
oracle   10960     1  0 07:52 ?        00:00:01 oracle+ASM1_ocr (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle   11017 10928  0 07:52 ?        00:00:00 /u01/app/11.2.0/grid/bin/evmlogger.bin -o /u01/app/11.2.0/grid/evm/log/evmlogger.info -l /u01/app/11.2.0/grid/evm/log/ev
mlogger.log
oracle   11220     1  0 07:53 ?        00:00:02 /u01/app/11.2.0/grid/bin/oraagent.bin
root     11388     1  0 07:53 ?        00:00:11 /u01/app/11.2.0/grid/bin/orarootagent.bin
oracle   11415     1  0 07:53 ?        00:00:00 /u01/app/11.2.0/grid/opmn/bin/ons -d
oracle   11416 11415  0 07:53 ?        00:00:00 /u01/app/11.2.0/grid/opmn/bin/ons -d
oracle   11467     1  0 07:53 ?        00:00:03 /u01/app/11.2.0/grid/jdk/jre//bin/java -Doracle.supercluster.cluster.server=eonsd -Djava.net.preferIPv4Stack=true -Djava
.util.logging.config.file=/u01/app/11.2.0/grid/srvm/admin/logging.properties -classpath /u01/app/11.2.0/grid/jdk/jre//lib/rt.jar:/u01/app/11.2.0/grid/jlib/srvm.jar:/u01
/app/11.2.0/grid/jlib/srvmhas.jar:/u01/app/11.2.0/grid/jlib/supercluster.jar:/u01/app/11.2.0/grid/jlib/supercluster-common.jar:/u01/app/11.2.0/grid/ons/lib/ons.jar orac
le.supercluster.impl.cluster.EONSServerImpl
oracle   11609     1  0 07:53 ?        00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN2 -inherit
oracle   11620     1  0 07:54 ?        00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
oracle   12474     1  0 08:05 ?        00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit

 

ASM peek:

With export ORACLE_SID="+ASM1" set and $ORACLE_HOME/bin in the PATH:

$ asmcmd
ASMCMD> ls
DATA/
ASMCMD> du
Used_MB      Mirror_used_MB
    263                 263
...

Database install

Again with the same 'oracle' user, but with a different ORACLE_HOME:

$ export ORACLE_BASE=/u01/app/oracle/
$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1

$ ./runInstaller

-Create & configure
-Server Class
-RAC type
-Typical
-Storage type ASM, and location on DATA

-Global name: DTST


It went OK except for two problems.

Problem 1)
  ORA-00845: MEMORY_TARGET not supported (actually not enough shm!)
Added to the fstab:
shmfs             /dev/shm         tmpfs   size=1200m     0 0

and did it manually:
# mount -t tmpfs shmfs -o size=1200m /dev/shm

Problem 2)
 A strange error:
CRS-5804: Communication error with agent process
CRS-2632: There are no more servers to try to place resource 'ora.dtst.db' on that would satisfy its placement policy

RAC2$ ./srvctl status database -d DTST
Instance DTST1 is running on node rac1
Instance DTST2 is not running on node rac2

Tried to restart, expecting to see the error...
$ ./srvctl stop database -d DTST
$ ./srvctl start database -d DTST

$ ./srvctl status database -d DTST
Instance DTST1 is running on node rac1
Instance DTST2 is running on node rac2

but it went OK this time; I should have investigated this, but I skipped it for now...


Verifications

Documentation suggests this:

$ cd /u01/app/11.2.0/grid/bin
$ ./crsctl status resource -w "TYPE co 'ora'" -t
what an intuitive command!!!


( alternative: "./crsctl stat resource" is less nice, but I'm having difficulties remembering the other one )
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS      
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.asm
               ONLINE  ONLINE       rac1                     Started            
               ONLINE  ONLINE       rac2                     Started            
ora.eons
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.gsd
               OFFLINE OFFLINE      rac1                                        
               OFFLINE OFFLINE      rac2                                        
ora.net1.network
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.ons
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
ora.registry.acfs
               ONLINE  ONLINE       rac1                                        
               ONLINE  ONLINE       rac2                                        
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2                                        
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac1                                        
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac1                                        
ora.dtst.db
      1        ONLINE  ONLINE       rac1                     Open               
      2        ONLINE  ONLINE       rac2                     Open               
ora.oc4j
      1        OFFLINE OFFLINE                                                  
ora.rac1.vip
      1        ONLINE  ONLINE       rac1                                        
ora.rac2.vip
      1        ONLINE  ONLINE       rac2                                        
ora.scan1.vip
      1        ONLINE  ONLINE       rac2                                        
ora.scan2.vip
      1        ONLINE  ONLINE       rac1                                        
ora.scan3.vip
      1        ONLINE  ONLINE       rac1        

Proper RAC shutdown

# ./crsctl stop cluster -all
"This command attempts to gracefully stop resources managed by Oracle Clusterware while attempting to stop the Oracle Clusterware stack."



Conclusion

Ok this was simplistic, but we do have our test RAC system working without any special hardware, using the iSCSI target from the previous post.

At this point we can backup the 2 Virtual machines (rac1, rac2), as well as the file used for the iSCSI disk. And experiment at will...