Saturday, September 14, 2013

Oracle: partitioning a huge table _online_

My customer has an application in production 24/7, with tables growing huge (billions of rows...). We would like to partition them to compress and move older data.

Not a problem, as long as we have disk space: with Oracle 10g/11g we can partition the table online, without any interruption.

The package to use is DBMS_REDEFINITION, and the basic idea is:
- Create a new table the way you want it (partitioned...)
- This table becomes a materialized view, synchronized with the existing table (so we need double the space!)
- Add indexes, etc.
- When ready: briefly lock the table for a last resync and switch the segments in the data dictionary

The segments formerly belonging to the MV now belong to the new table, and the whole operation was completely transparent.

(I think in Oracle 12c this is getting easier, with a single command. Will check.)
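For the record, the kind of single command I have in mind would look roughly like this (I believe it only arrived in a later 12c release; hypothetical and untested here):

-- 12c-style online conversion in one statement (sketch)
ALTER TABLE BIGAPP1.BIGTABLE MODIFY
  PARTITION BY RANGE (TIME_STAMP_UTC)
  ( PARTITION P2012 VALUES LESS THAN (TO_DATE('2013-01-01','YYYY-MM-DD')),
    PARTITION P2013 VALUES LESS THAN (TO_DATE('2014-01-01','YYYY-MM-DD')),
    PARTITION PMAX  VALUES LESS THAN (MAXVALUE) )
  ONLINE;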



-- In this example we have a table: BIGAPP1.BIGTABLE


-- create the new table as we want it (this will be transformed into a Materialized View)
drop table BIGAPP1.BIGTABLE_TMPPARTNING ;

create table BIGAPP1.BIGTABLE_TMPPARTNING
PARTITION BY RANGE (TIME_STAMP_UTC)
(
  PARTITION P2010 VALUES LESS THAN (TO_DATE('2011-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2011 VALUES LESS THAN (TO_DATE('2012-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2012 VALUES LESS THAN (TO_DATE('2013-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2013 VALUES LESS THAN (TO_DATE('2014-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110,
  PARTITION P2014 VALUES LESS THAN (TO_DATE('2015-01-01','YYYY-MM-DD' )) TABLESPACE TS_DATA110
) as select * from BIGAPP1.BIGTABLE where 0=1 ;


-- call dbms_redefinition
-- there are two methods, by PK or by ROWID; we had to use ROWID here
-- ref: Admin guide / chapter 15 Managing Tables / Redefining Tables Online

-- quick check that server output is displayed
set serveroutput on;
exec dbms_output.put_line('check');

BEGIN
DBMS_REDEFINITION.CAN_REDEF_TABLE(
   uname        => 'BIGAPP1',
   tname        => 'BIGTABLE',
   options_flag => DBMS_REDEFINITION.CONS_USE_ROWID);
END;
/

-- The following will actually copy the data to the materialized view, and add a temporary index (because of USE_ROWID)
-- on the (slow) test system it took 25 minutes for 11M rows
BEGIN
DBMS_REDEFINITION.START_REDEF_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING',
       null, dbms_redefinition.cons_use_rowid);
END;
/
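On a big table this initial copy takes a while; it can usually be followed from another session in v$session_longops (a generic long-operations filter, nothing specific to redefinition):

-- watch the progress of the initial data copy
select opname, target, sofar, totalwork, time_remaining
from   v$session_longops
where  totalwork > 0 and sofar < totalwork;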

-- now the interim table is a MV, synced once with the original (an INSERT on the original at this point is not copied over yet)
-- now we may create the indexes

-- local index
CREATE INDEX BIGAPP1.IX_BIGTABLE_AUDOBJID ON BIGAPP1.BIGTABLE_TMPPARTNING(SURV_OBJ_ID) LOCAL
( PARTITION P2010  TABLESPACE TS_DATA110, 
  PARTITION P2011  TABLESPACE TS_DATA110, 
  PARTITION P2012  TABLESPACE TS_DATA110, 
  PARTITION P2013  TABLESPACE TS_DATA110, 
  PARTITION P2014  TABLESPACE TS_DATA110 );
-- 3 minutes

...

-- a global index, because it is unique and does not contain the partition key
-- we partition it by another range

CREATE UNIQUE INDEX BIGAPP1.IX_BIGTABLE_OBJID ON BIGAPP1.BIGTABLE_TMPPARTNING(OB_ID) GLOBAL PARTITION BY RANGE (OB_ID)
(
  PARTITION P10M VALUES LESS THAN (10000000) TABLESPACE TS_DATA110,
  PARTITION P20M VALUES LESS THAN (20000000) TABLESPACE TS_DATA110,
  PARTITION P30M VALUES LESS THAN (30000000) TABLESPACE TS_DATA110,
  PARTITION PMAX VALUES LESS THAN (MAXVALUE) TABLESPACE TS_DATA110
) ;


-- process grants, triggers, constraints and privileges
-- here we pass 0 for the copy_indexes parameter instead of DBMS_REDEFINITION.CONS_ORIG_PARAMS, because we re-created the indexes manually

DECLARE
num_errors PLS_INTEGER;
BEGIN
DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING', 
   0, TRUE, TRUE, TRUE, TRUE, num_errors);
END;
/
-- 1 minute

-- We can check for errors here; typical case: already existing constraints
select object_name, base_table_name, ddl_txt from DBA_REDEFINITION_ERRORS;

-- Optional re-sync: I prefer to do it since it shortens the final step, which is the only one that takes a lock

BEGIN
DBMS_REDEFINITION.SYNC_INTERIM_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING');
END;
/

-- And right after this, call the final step
BEGIN
DBMS_REDEFINITION.FINISH_REDEF_TABLE('BIGAPP1', 'BIGTABLE','BIGTABLE_TMPPARTNING');
END;
/

-- 20 sec

-- this final step took a short lock, did a final sync and switched the segment references in the data dictionary
-- BIGTABLE is now partitioned, and BIGTABLE_TMPPARTNING now points to the old segments (including the data: we have a copy)

-- We may keep the older version around for a while, just to verify everything is OK

-- Verify we do have data in different partitions now:

select count(*) from BIGAPP1.BIGTABLE partition (P2010);
select count(*) from BIGAPP1.BIGTABLE partition (P2011);
select count(*) from BIGAPP1.BIGTABLE partition (P2012);
select count(*) from BIGAPP1.BIGTABLE partition (P2013);
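
-- A dictionary check is also worth doing; once everything is confirmed,
-- the old copy can be dropped to reclaim the disk space:

select partition_name, tablespace_name, high_value
from   dba_tab_partitions
where  table_owner = 'BIGAPP1' and table_name = 'BIGTABLE'
order  by partition_position;

-- when everything is verified:
-- drop table BIGAPP1.BIGTABLE_TMPPARTNING;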


-- OK DONE!  




References:
Admin Guide, chapter 15 Managing Tables / Redefining Tables Online. One of its examples demonstrates redefining a single partition: it moves the oldest partition of a range-partitioned sales table to a tablespace named TBS_LOW_FREQ.

Thursday, September 12, 2013

Generate stuff with SQL 'connect by'


'connect by' was added in Oracle to build hierarchical queries, but is really useful to generate STUFF.
(I think I first saw this usage from Jonathan Lewis...)

SQL> select rownum from dual connect by level <= 4;
         1
         2
         3
         4


Create a huge table:

With the help of RPAD:

SQL> select rpad('A',50,'B') from dual;
ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB


We can quickly create a big table with data:

SQL> create table testbig as
       select rownum as n,
              rpad('A',50,'B') as txt
       from dual connect by level <= 100000;


Somewhat more varied with modulo:

create table testbig2 as
  select rownum as n,
         mod(rownum,100) as m,
         rpad('A',50,'B') as txt
  from dual connect by level <= 10000;



Generate days:
SQL> select TO_DATE('2012-01-01','YYYY-MM-DD')+rownum from dual connect  by level <=3;
2012-01-02
2012-01-03
2012-01-04





Generate months:
SQL> select ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum) from dual connect  by level <=3;
2012-02-01
2012-03-01
2012-04-01



Generate PARTITION clauses...



SQL> set pagesize 0
SQL> select 'PARTITION P'
  ||TO_CHAR(ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum),'YYYYMM')
  ||' VALUES LESS THAN (TO_DATE('''||TO_CHAR(ADD_MONTHS( TO_DATE('2012-01-01','YYYY-MM-DD'),rownum+1),'YYYYMMDD') ||''',''YYYYMMDD'')),'
from dual connect by level <=3;
PARTITION P201202 VALUES LESS THAN (TO_DATE('20120301','YYYYMMDD')),
PARTITION P201203 VALUES LESS THAN (TO_DATE('20120401','YYYYMMDD')),
PARTITION P201204 VALUES LESS THAN (TO_DATE('20120501','YYYYMMDD')),






Sadly I know of no equivalent in MySQL
(discussed here: stackoverflow.com/questions/701444 )
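
The usual workaround there is to cross-join small derived tables of digits; a minimal sketch in MySQL:

-- generate 1..100 without connect by
SELECT t.d * 10 + u.d + 1 AS n
FROM (SELECT 0 d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) u,
     (SELECT 0 d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t
ORDER BY n;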


Of course in bash you can also generate things like so:

$ for i in {1..5}
> do
> echo PARTITION$i
> done
PARTITION1
PARTITION2
PARTITION3
PARTITION4
PARTITION5


But processing dates in the shell is less obvious...
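
GNU date can do the arithmetic, though; a minimal sketch (assuming GNU coreutils' date with its -d option):

$ for i in {1..3}
> do
>   date -d "2012-01-01 +$i month" +"PARTITION P%Y%m"
> done
PARTITION P201202
PARTITION P201203
PARTITION P201204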

MySQL SHOW SLAVE STATUS Explained

Sometimes I don't have to deal with MySQL slaves for months, and then again I forget which entry means what.

Here is a SHOW SLAVE STATUS explained, captured while a slave catches up (I had just restarted the slave after an error).





On the slave:
10:24 [vega]:> show slave status\G
...
      Master_Log_File: vega20-bin.000740  the current log and position on the master
  Read_Master_Log_Pos: 3396922             == show master status
                                           (sometimes takes time to be refreshed)

       Relay_Log_File: vega21-relay-bin.000074 relay file on the slave we're
                                               currently applying

        Relay_Log_Pos: 274423937          position in bytes in the relay file

Relay_Master_Log_File: vega20-bin.000727  log file on the master corresponding to
                                          what we're applying
                                          (727 -> 740 is left to be done)
...
  Exec_Master_Log_Pos: 274423781          position in the master's log corresponding
                                          to what we're applying
...
Seconds_Behind_Master: 110997             estimated lag: how far behind the master
                                          we currently are (here, 727 -> 740)
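
Quick arithmetic on the backlog: binlogs 727 through 740 remain to be applied; at roughly 1 GiB per binlog that is about (740 - 727) x 1 GiB = 13 GiB, minus the ~274 MB of file 727 already executed (Exec_Master_Log_Pos).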

Files on the slave:
        219 Sep 10 04:08 vega21-relay-bin.000076
     510344 Sep 10 04:08 vega21-relay-bin.000075
 1073741921 Sep 10 04:08 vega21-relay-bin.000074 <- applying this one
     193266 Sep 10 07:18 vega21-relay-bin.000079
  (a bit later, applied relay logs are deleted)
...
  196984797 Sep 11 09:03 vega21-relay-bin.000112
    3397078 Sep 11 10:24 vega21-relay-bin.000114 <- last data we copied from the master
  (data from the master's log is copied to a slave relay log; sizes may not match the
   master's, e.g. if we stop/start the slave)

Files on the master:
 1073904414 Sep  7 09:59 vega20-bin.000720
 1073774868 Sep 10 02:21 vega20-bin.000721
...
 1074251834 Sep 10 04:07 vega20-bin.000727 <- applying data which is here
 1073934723 Sep 10 07:17 vega20-bin.000728
 (copied on the slave into relay log 74)
 1073993761 Sep 10 07:27 vega20-bin.000729
...
 1091352292 Sep 11 09:02 vega20-bin.000739
    3396922 Sep 11 10:23 vega20-bin.000740 <- current log written by the master





Status on the master:

10:24 [(none)]:> show master status;
+-------------------+----------+--------------+------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| vega20-bin.000740 |  3396922 |              |                  |
+-------------------+----------+--------------+------------------+




Other notes:

11:00 [vega]:> show processlist;
+--+-----------+----+----+-------+------+--------------------------------+------+
|Id|User       |Host|db  |Command|Time  |State                           |Info  |
+--+-----------+----+----+-------+------+--------------------------------+------+
| 5|system user|    |NULL|Connect|  6979|Waiting for master to send event|NULL  |
|69|system user|    |NULL|Connect|111664|Reading event from the relay log|NULL  |


Waiting for master to send event: the master currently isn't writing much, so there is nothing to send to the slave's relay log.

Reading event from the relay log: we're reading the relay file. If this state shows up often, we're spending more time reading than applying... (slow I/O!)

Sometimes we see the actual statement being replicated on the slave:

|97|system user |    |STOCKS|Connect|101289| NULL      |TRUNCATE TABLE `TOTA01`|
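
A related trick: to block until the slave has applied everything up to a known master position (e.g. the coordinates from SHOW MASTER STATUS above), MASTER_POS_WAIT can be run on the slave:

-- wait up to 60 seconds for the slave to reach this master position
SELECT MASTER_POS_WAIT('vega20-bin.000740', 3396922, 60);
-- returns -1 on timeout, NULL if the slave SQL thread is not running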

Monday, September 2, 2013

simulating a slow drive


I want to simulate a slow drive, to better grasp the symptoms of a database suffering from a slow disk.

Until now I used to follow up on suspicions by testing with IOzone or sysbench/fileio, but:
  • This is sometimes difficult to do if the sysadmin is not me, and will not or cannot cooperate
  • This is imperfect, because we would typically run such tests at night, when the SAN I suspect is less loaded ;-)
I could use an SD card or something, but most of my test setups are on VMs; I would prefer a software solution.
I almost started to write my own slow FUSE filesystem, when I finally found a nice trick in this post:
http://askubuntu.com/questions/13590/can-i-simulate-a-slow-hard-drive

Use nbd, the Network Block Device, and then rate limit access to it using say trickle.
sudo apt-get install nbd-client nbd-server trickle


Knowing neither trickle nor nbd, I tested each  before doing the setup:


1) Trickle test


Trickle will limit the network speed of a command:
trickle -u <upload speed> -d <download speed> command

Example:
$ wget http://192.168.0.6/videos2007/CIMG0500-small.avi
=> 11 M/s

$ trickle -d 20 wget http://192.168.0.6/videos2007/CIMG0500-small.avi
=> 22 K/s


2) NBD test

NBD stands for Network Block Device. I understand it to be a simple way to use a block device from a remote machine (simpler than DRBD or iSCSI target/initiator implementations).


The exported block device may be backed by a file, a partition, a logical volume... Let's try to create a file:



dd if=/dev/zero of=/home/phil/apps/NBD/nbd-drive bs=1024 count=10240
Then run the nbd server on that file:
nbd-server 9500 /home/phil/apps/NBD/nbd-drive 

It launches a daemon, 'nbd-server', which listens on port 9500:


$ netstat -plnt
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
...          
tcp        0      0 0.0.0.0:9500            0.0.0.0:*               LISTEN      2195/nbd-server 



Let's try to connect the client:

sudo nbd-client localhost 9500 /dev/nbd0

Seems to launch a daemon as well:
root      2276     1  0 07:04 ?        00:00:00 nbd-client localhost 9500 /dev/nbd0
root      2277     2  0 07:04 ?        00:00:00 [nbd0]

We can now format this device and mount it:
sudo mkfs /dev/nbd0
sudo mount /dev/nbd0 /mnt

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/nbd0       9.7M   92K  9.1M   1% /mnt



The test is OK. Let's umount and release:

sudo umount /mnt
sudo nbd-client -d /dev/nbd0
killall nbd-server



3) Trickle + NBD to make a slow drive


I succeeded by running the server through trickle this way:

$ trickle -d 20 -u 20 nbd-server 9500 /home/phil/apps/NBD/nbd-drive
trickle: Could not reach trickled, working independently: No such file or directory

Then started the client again with:


sudo nbd-client localhost 9500 /dev/nbd0

And indeed, this block device was really slow:

$ sudo dd if=/dev/nbd0 of=/dev/null bs=65536 skip=100 count=10
10+0 records in
10+0 records out
655360 bytes (655 kB) copied, 28.5783 s, 22.9 kB/s
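
The write path can be tested the same way; oflag=dsync forces synchronous writes, otherwise the page cache would hide the slowness (this overwrites the device, which is fine here since it is backed by a scratch file):

$ sudo dd if=/dev/zero of=/dev/nbd0 bs=65536 count=10 oflag=dsync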



Finally, I wrote this script to make the drive ready whenever I need it:

$ cat mountslowdrive.sh 
#!/bin/bash

FILE_OR_PARTITION=/home/phil/apps/NBD/nbd-drive
MOUNTPOINT=/home/phil/apps/NBD/mnt

if [ $# -eq 0 ] ; then

  echo "making/mounting slow drive"
  trickle -d 20 -u 20 nbd-server 9500 "$FILE_OR_PARTITION"
  sudo nbd-client localhost 9500 /dev/nbd0
  sudo mount /dev/nbd0 "$MOUNTPOINT"

else
  if [ $# -eq 1 ] && [ "$1" = "-u" ]; then

    echo "unmounting slow drive"

    sudo umount "$MOUNTPOINT"
    sudo nbd-client -d /dev/nbd0
    killall nbd-server

  else
    echo "USAGE: $0 [-u]"
    echo "       -u : unmount the slow drive"
  fi
fi





Todo:
Use the server's -l option (or /etc/nbd_server.allow) to restrict which clients may connect; otherwise any client on the network can use the block device...