My version of SQL string to table tokenizer

This one’s a short post on a fairly random topic, as unfortunately I don’t have time today to come up with anything deeper :)

I needed to come up with a delimited-string-to-table tokenizer for an Oracle development project. There are quite a few examples out there of how to do that, including ones at Adrian Billington’s www.oracle-developer.net and at http://technology.amis.nl/blog/?p=1631.

So far the simplest solution I had seen used a bunch of INSTRs, SUBSTRs and DECODEs in a CONNECT BY loop.
However, as this application runs on Oracle 10g, I could use regular expressions to make the SQL even shorter:

Tanel@Sol01> define separator=":"
Tanel@Sol01> define mystring="A:BCD::EFG:H:IJKL"
Tanel@Sol01>
Tanel@Sol01> SELECT
  2             LEVEL,
  3             REGEXP_REPLACE(
  4                     REGEXP_SUBSTR( '&mystring'||'&separator', '(.*?)&separator', 1, LEVEL )
  5                     , '&separator$'
  6                     , ''
  7             ) TOKEN
  8  FROM
  9     DUAL
 10  CONNECT BY
 11     REGEXP_INSTR( '&mystring'||'&separator', '(.*?)&separator', 1, LEVEL ) > 0
 12  ORDER BY
 13     LEVEL ASC
 14  /

LEVEL TOKEN
----- ----------
    1 A
    2 BCD
    3
    4 EFG
    5 H
    6 IJKL

6 rows selected.
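
The non-greedy (.*?) group grabs the shortest run of characters before the next separator, and the REGEXP_REPLACE merely strips the trailing separator off each match. As a side note, 11g added a sixth “subexpression” argument to REGEXP_SUBSTR which returns just the capture group, so there the outer REGEXP_REPLACE would not be needed at all. A minimal sketch of that variant (not an option on this 10g system, so consider it untested here):

SELECT
        LEVEL,
        -- the final argument 1 asks for subexpression (capture group) 1 only,
        -- so the separator itself is never part of the token
        REGEXP_SUBSTR( '&mystring'||'&separator', '(.*?)&separator',
                       1, LEVEL, NULL, 1 ) TOKEN
FROM
        DUAL
CONNECT BY
        REGEXP_INSTR( '&mystring'||'&separator', '(.*?)&separator', 1, LEVEL ) > 0
ORDER BY
        LEVEL ASC
/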

For anyone not yet familiar with regular expressions, I definitely recommend getting to know at least the basics – they can save you hours (or even days) of work in some cases. And the cool thing is that regular expressions are fairly standard across development platforms too (e.g. if you know Oracle regexp, then it’s very easy to handle Java, Perl or C# regexp as well).

And regexp is not only a developer thing, it can be very useful for everyday production DBAs as well. For example, every decent “grep” utility has regexp support built in – this makes searching for complex conditions in log or config files much easier.
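
For instance, a quick one-liner in grep’s extended regexp mode that pulls the classic ORA-00600 and ORA-07445 errors out of an alert log (the file name here is made up – adjust for your environment):

grep -E 'ORA-(00600|07445)' alert_PROD.log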


10 Responses to My version of SQL string to table tokenizer

  1. prodlife says:

    I completely agree with this. I keep shocking fellow DBAs and developers by solving complex problems with a single SQL statement using RegExps. The only downside is that older Oracle versions don’t have RegExp support, and sometimes I only remember that after solving the problem with RegExp – and then I have to solve it again, only because we never got around to upgrading that particular system.

  2. carlosal says:

    This one works in 9i too ;-)

    carlosal@db01.xxxxxx> SELECT * FROM V$VERSION
      2  /

    BANNER
    ----------------------------------------------------------------
    Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
    PL/SQL Release 9.2.0.7.0 - Production
    CORE 9.2.0.7.0 Production
    TNS for 32-bit Windows: Version 9.2.0.7.0 - Production
    NLSRTL Version 9.2.0.7.0 - Production

    carlosal@db01.xxxxxx> define separator=";"
    carlosal@db01.xxxxxx> define mystring="1;22;333;4444;5;66;77"
    carlosal@db01.xxxxxx>
    carlosal@db01.xxxxxx> SELECT SUBSTR( '&mystring' || '&separator',
      2             NVL(LAG(TOKEN) OVER (ORDER BY TOKEN),0) + 1,
      3             TOKEN - (NVL(LAG(TOKEN) OVER (ORDER BY TOKEN),0) + 1))
      4  FROM (
      5     SELECT LEVEL,
      6            INSTR( '&mystring' || '&separator', '&separator', 1, LEVEL ) TOKEN
      7     FROM DUAL
      8     CONNECT BY
      9        INSTR( '&mystring' || '&separator', '&separator', 1, LEVEL ) != 0
     10     ORDER BY
     11        LEVEL
     12  )
     13  /

    SUBSTR('1;22;333;4444;
    ----------------------
    1
    22
    333
    4444
    5
    66
    77

    7 rows selected.

    carlosal@db01.xxxxxx>

  3. Anup Pani says:

    In case we want only the non-empty tokens, we can use:

    CREATE OR REPLACE FUNCTION Tokenizer (p_string     VARCHAR2,
                                          p_separators IN VARCHAR2)
    RETURN dbms_sql.varchar2s IS
        l_token_tbl dbms_sql.varchar2s;
        pattern     VARCHAR2(250);
    BEGIN
        -- bracket expression: one or more characters that are not separators
        -- (no parentheses around the separator list – inside [^...] they would
        -- be treated as separator characters themselves)
        pattern := '[^' || p_separators || ']+';

        SELECT regexp_substr(p_string, pattern, 1, LEVEL) token
        BULK COLLECT INTO l_token_tbl
        FROM dual
        WHERE regexp_substr(p_string, pattern, 1, LEVEL) IS NOT NULL
        CONNECT BY regexp_instr(p_string, pattern, 1, LEVEL) > 0;

        RETURN l_token_tbl;
    END Tokenizer;
    /

    DECLARE
        l_token_tbl dbms_sql.varchar2s;
        delimiters  VARCHAR2(12) := ' .;!@#$%^&*,';
    BEGIN
        l_token_tbl := Tokenizer(' anup.pani@ gmail.com ', delimiters);
        FOR i IN l_token_tbl.FIRST .. l_token_tbl.LAST LOOP
            dbms_output.put_line(l_token_tbl(i));
        END LOOP;
    END;
    /

  4. Michael says:

    This is great stuff. Thanks a lot.

  5. radino says:

    I use a select from a pipelined function as a tokenizer..

    Something like the pipelined function here:
    http://www.oracle-base.com/articles/misc/DynamicInLists.php
    (but with more intelligent parsing – shifting the position instead of trimming the string), roughly as in the sketch below.
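
    A minimal sketch of that approach (type and function names are made up, and it assumes a literal, possibly multi-character separator):

    CREATE OR REPLACE TYPE token_ttype AS TABLE OF VARCHAR2(4000);
    /

    CREATE OR REPLACE FUNCTION tokenize_pipelined(
        p_string    IN VARCHAR2,
        p_separator IN VARCHAR2 DEFAULT ':'
    ) RETURN token_ttype PIPELINED IS
        l_pos  PLS_INTEGER := 1;   -- current parse position in p_string
        l_next PLS_INTEGER;        -- position of the next separator
    BEGIN
        LOOP
            l_next := INSTR(p_string, p_separator, l_pos);
            EXIT WHEN l_next = 0;
            -- emit the token between the current position and the separator
            PIPE ROW( SUBSTR(p_string, l_pos, l_next - l_pos) );
            -- shift the parse position past the separator instead of
            -- trimming the consumed part off the front of the string
            l_pos := l_next + LENGTH(p_separator);
        END LOOP;
        -- trailing token after the last separator
        PIPE ROW( SUBSTR(p_string, l_pos) );
        RETURN;
    END;
    /

    SELECT column_value AS token
    FROM   TABLE( tokenize_pipelined('A:BCD::EFG:H:IJKL') );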

  6. Scott says:

    Way cool!

    I will have to play more with regexes and sql.

    Thanks!

  7. Tanel Poder says:

    I agree! (both that it’s cool and that you should play with regex/SQL!)

  8. Adam says:

    But what if we have strings like

    Schema.Table1.Column1 = Schema.Table2.Column2 -> Schema.Table1.Column11
    Schema.Table3.Column3 = Schema.Table4.Column4 -> Schema.Table4.Column44 = Schema.Table5.Column5 -> Schema.Table5.Column55

    How to get a result like this:

    Schema   Table    Column
    ------   ------   --------
    Schema   Table1   Column1
    etc.?

    with t as (
      select 'Schema.Table1.Column1 = Schema.Table2.Column2 -> Schema.Table1.Column11' col from dual
    )
    select *
    from t,
         xmltable('for $i in ora:tokenize(., " = | -> ")
                   let $m := ora:tokenize($i, "\.")
                   return <r><schema>{$m[1]}</schema><table>{$m[2]}</table><column>{$m[3]}</column></r>'
                  passing t.col
                  columns
                    "Schema" varchar2(30) path '/r/schema',
                    "Table"  varchar2(30) path '/r/table',
                    "Column" varchar2(30) path '/r/column')
