What is so cool in analytic functions? Analytic functions give you possibility to look at previous and following rows in a row set without expensive and often impossible joins because there are times you don’t know how many joins you’ll need. With analytics you can simply look for a value (or compute it using) either foregone or calculated rows forward or backwards. That means Oracle can scan source data once and then sort it (if necessary more than once) in memory or temp. My experience shows that sorting even several times is much faster than scanning source data more than once. In next four cases I’ll try to show four different real world reports that exploited analytic functions a lot. Case one: persons and their citizenships Report basically counted people and their citizenships. Mostly people have only one citizenship, but those who are agile enough get more than one. Respectively report stated following algorithm: 1) If person has citizenship X (most of the people has it) then is counted it and only it regardless of all other his citizenships; 2) If person hasn’t citizenship X then are counted all his citizenships. Table PERSONS contains each person along with its name and surname. Table CITIZENSHIPS is joined to PERSONS and each person’s citizenship has one row in CITIZENSHIPS. So how to get the report using algorithm defined above? Traditional approach would be something like count all distinct people with citizenship X (scan table CITIZENSHIPS via full scan, because X is very common citizenship) and then count all citizenships whose people haven’t X (scan source data second time most probably again via full scan and perform some index scan to check if the current person has X or not). Query is shown in Code listing 1:
SELECT count(distinct ctz_prs_id) cnt, ctz_code code FROM citizenships WHERE ctz_code = 'X' GROUP BY ctz_code UNION ALL SELECT count(distinct ctz_prs_id) cnt, ctz_code code FROM citizenships c1 WHERE ctz_code <> 'X' AND NOT EXISTS ( SELECT 'X' FROM citizenships c2 WHERE c1.ctz_prs_id = c2.ctz_prs_id AND c2.ctz_code = 'X' ) GROUP BY ctz_code ORDER BY cnt desc; Code listing 1 Traditional query for Case one But looking from our perspective after first full scan of all citizenships Oracle already knows all information it needs to determine whether the current citizenship record has to be counted or not. The only problem is that we have to look unknown number of rows back or forth in the same query result. But if we’d at first try to order our query result then we’ll know exactly for what row we should look. So the first step is clear – let’s order so that row with citizenship X for each person will be the first one. In the second step we should follow the algorithm defined in the beginning of this chapter using some temporary flag column to mark whether we’ll count this row or not. And the last step is to perform final counting. First step is implemented using partition and order by clause PARTITION BY ctz_prs_id ORDER BY decode (ctz_code, 'X', 0, 1) of analytical function first_value (ctz_code). Using this analytical function for each row we can move on to our second step – perform calculation whether we are interested in current row or not i.e. if the current row belongs to another person than previous row (which is counted using some other analytical function) then it is applicable; if the current row belongs to the same person than previous row then we should check whether the first citizenship for this person is X or not and set flag accordingly. The second step is as follows: CASE WHEN ctz_prs_id <> prev_prs_id THEN 1 WHEN first_code = 'X' THEN 0 ELSE 1 END flag Third step is to filter out only rows with flag = 1 and perform simple group by citizenship. And putting together all pieces we get final query (see Code listing 2). SELECT count(*) cnt, ctz_code FROM ( SELECT ctz_code, CASE WHEN ctz_prs_id <> prev_prs_id THEN 1 WHEN first_code = 'X' THEN 0 ELSE 1 END flag FROM ( SELECT ctz_prs_id, ctz_code, first_value (ctz_code) OVER ( PARTITION BY ctz_prs_id ORDER BY decode (ctz_code, 'X', 0, 1)) first_code, lag(ctz_prs_id, 1, -1) OVER ( ORDER BY ctz_prs_id, decode (ctz_code, 'X', 0, 1)) prev_prs_id FROM citizenships ) ) WHERE flag = 1 GROUP BY ctz_code ORDER BY cnt desc; Code listing 2 Analytic query for Case one As a result instead of two times full scanning and additional some index scan we have scanned source once and according to autotrace performed the same count of sorts. Using generated data (see here) Table 1 shows compared results of these two queries: | Query/Metric | Scans | Elapsed time (seconds) | Consistent gets | Sorts | | 1st traditional | 2 full scans, 1 index scan | 0.40 | 50977 | 3 | | 2nd with analytics | 1 full scan | 0.18 | 61 | 3 | Table 1 Compared runtime statistics for Case one Case two: peoples without valid identity cards This query illustrates opposing situation than previous one. Current report had to count peoples without (as opposing to previous with) something, particularly valid identity cards. Unfortunately it is even worse – if person is citizen of citizenship X then he needs valid Citizen ID card, but if he is non-citizen of citizenship X then he needs valid Non-citizen ID card. Valid document means that today’s date is between its start date and end date. The traditional approach would be scanning all citizenships and see whether this person hasn’t valid ID card of appropriate type (see Code listing 3). SELECT COUNT(distinct ctz_prs_id), ctz_code, ctz_type FROM citizenships WHERE ctz_code = 'X' AND NOT EXISTS ( SELECT 'X' FROM documents WHERE doc_valid_from < sysdate AND doc_valid_to > sysdate AND doc_prs_id = ctz_prs_id AND (ctz_type = 'C' AND doc_type = 'Citizen ID card' OR ctz_type = 'N' AND doc_type = 'Non-citizen ID card') ) GROUP BY ctz_code, ctz_type; Code listing 3 Traditional query for Case two As a result seeking possible valid document for almost each person we had effectively scanned almost all table documents. Unfortunately it wasn’t done via full scan but using index-first-table-after approach. What if we’d try to join them somehow? The idea can be expressed in following steps: 1) Join both tables via outer join (because some people have citizenships but haven’t valid ID cards). 2) Use again already known technique of a “flag” column – this time it will show us whether person has valid ID card or not. 3) After that simple analytic partitioning by person and ordering by flag is performed row_number() OVER (PARTITION BY ctz_prs_id ORDER BY flag desc). It means first row for each person will show us whether he has valid ID card or not. 4) Final step is to count just these persons that are flagged they haven’t valid ID cards (see Code listing 4). SELECT count(distinct ctz_prs_id), ctz_code, ctz_type FROM ( SELECT ctz_prs_id, ctz_code, ctz_type, flag, row_number() OVER (PARTITION BY ctz_prs_id ORDER BY flag desc) rn FROM ( SELECT /*+ FULL(citizenships) FULL(documents) */ ctz_prs_id, ctz_code, ctz_type, CASE WHEN ctz_type = 'C' AND doc_type = 'Citizen ID card' THEN 1 WHEN ctz_type = 'N' AND doc_type = 'Non-citizen ID card' THEN 1 ELSE 0 END flag FROM citizenships, documents WHERE ctz_code = 'X' AND doc_valid_from (+) < sysdate AND doc_valid_to (+) > sysdate AND doc_prs_id (+) = ctz_prs_id ) ) WHERE rn = 1 AND flag = 0 GROUP BY ctz_code, ctz_type; Code listing 4 Analytic query for Case two Using generated data (see here) Table 2 shows compared results of these two queries: | Query/Metric | Scans | Elapsed time (seconds) | Consistent gets | Sorts | | 1st traditional | 1 full scan, 1 index scan (full table) | 0.10 | 17965 | 1 | | 2nd with analytics | 2 full scans | 0.06 | 309 | 2 | Table 2 Compared runtime statistics for Case two Case three: getting the current value and first value in a hierarchy In this case I’ll try to explain the method proposed by Thomas Kyte in http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:13946369553642 as well as in [1]. In fact methods are two: (1) One general method – how to solve complicated problem using analytics. With complicated I mean problems that can’t be solved in one/two obvious steps; (2) One technical method for a special case – how to make groups in a row set. The original problem for a report was as follows: Addresses have tendency to change over time. Most usual reason is street name change, and some administrative reforms. To record this information address history was designed as recursive relationship in ADDRESS table pointing to next address in hierarchy. So the problem was to find all current addresses for a given set of address and output both 1) Address identifying information that conforms given criteria. 2) Most recent address for each found address. Both methods will be explained using generated example data that can be found here. So let’s start iterative thinking with very simple first step – create a query that shows information for the given address criteria (see Code listing 5): SELECT adr_id, adr_adr_id, adr_country_code, adr_street FROM addresses WHERE adr_country_code = 'WIN' AND adr_region_code = 'DOW'; Code listing 5 Step one query for Case three Output from query looks as follows: ADR_ID ADR_ADR_ID ADR ADR_STREET ---------- ---------- --- ---------- 208452 408452 WIN WEEKEND_WI 608453 808453 WIN MAINTENANC 8451 WIN WEEKNIGHT_ Last row haven’t child (adr_adr_id is empty), but first two rows has at least one child. Having recursive relationship in ADDRESS table obviously leads us to hierarchical query clause (CONNECT BY). Let’s look at query that shows all address history over time for above given criteria (Code listing 6). SELECT rownum rn, level lvl, adr_id, adr_adr_id, adr_country_code, adr_street FROM addresses START WITH adr_country_code = 'WIN' AND adr_region_code = 'DOW' CONNECT BY PRIOR adr_adr_id = adr_id; Code listing 6 Step two query for Case three Output from query looks as follows: RN LVL ADR_ID ADR_ADR_ID ADR ADR_STREET ---- ---------- ---------- ---------- --- ---------- 1 1 8451 WIN WEEKNIGHT_ 2 1 208452 408452 WIN WEEKEND_WI 3 2 408452 WIN EKEND_WIND 4 1 608453 808453 WIN MAINTENANC 5 2 808453 1008453 WIN INTENANCE_ 6 3 1008453 WIN ENANCE_WIN Looking at data output we can see that actually we have groups of records that always start with lvl = 1, but end with lvl >= 1. Rows with lvl = 1 always indicate rows that conforms to starting address criteria. Row with maximal lvl value just before next group of records indicates most current address. So we can draw at least following conclusions: 1) There are obvious groups of records. 2) Count of groups is the same as row count conforming starting address criteria. So what do we really need for final output? We need somehow to combine first and last records in each group and our task would be achieved! We can remember that we have used analytical function first_value to found a first row for some partitioning. The only problem is to somehow mark each group with some unique tag and we already know how to deal with such a situation. As you already may guess rownum for above row set wasn’t added just for fun, it is very useful unique identifier. So let’s mark start of each group with a unique marker adding column in the end (see Code listing 7). SELECT rownum rn, level lvl, adr_id, adr_adr_id, adr_country_code, adr_street, CASE WHEN level = 1 THEN rownum ELSE NULL END new_rn FROM addresses START WITH adr_country_code = 'WIN' AND adr_region_code = 'DOW' CONNECT BY PRIOR adr_adr_id = adr_id; Code listing 7 Step three query for Case three Output from query looks as follows: RN LVL ADR_ID ADR_ADR_ID ADR ADR_STREET NEW_RN ---- ---------- ---------- ---------- --- ---------- ---------- 1 1 8451 WIN WEEKNIGHT_ 1 2 1 208452 408452 WIN WEEKEND_WI 2 3 2 408452 WIN EKEND_WIND 4 1 608453 808453 WIN MAINTENANC 4 5 2 808453 1008453 WIN INTENANCE_ 6 3 1008453 WIN ENANCE_WIN Next task is to mark all rows for each group with this unique identifier. It is done using an analytical function that almost everyone has used as a set function – max. This time the only difference is that it is used in analytical way. So we need max(new_rn) OVER (ORDER BY rn). We cannot use partitioning clause for max simply because there isn’t any column suitable for that. There is one thing that is hidden in this function use – by default it is working from unbounded preceding and current row. It means using this function for every row for column new_rn it will display maximal value of this column since the beginning up to current row. So the first level of analytics has been applied and query now looks as follows (see Code listing 8). SELECT rn, lvl, adr_id, adr_adr_id, adr_country_code, adr_street, max(new_rn) OVER (ORDER BY rn) grp FROM ( SELECT rownum rn, level lvl, adr_id, adr_adr_id, adr_country_code, adr_street, CASE WHEN level = 1 THEN rownum ELSE NULL END new_rn FROM addresses START WITH adr_country_code = 'WIN' AND adr_region_code = 'DOW' CONNECT BY PRIOR adr_adr_id = adr_id ); Code listing 8 Step four query for Case three Output as you already can imagine looks as follows: RN LVL ADR_ID ADR_ADR_ID ADR ADR_STREET GRP ---- ---------- ---------- ---------- --- ---------- ---------- 1 1 8451 WIN WEEKNIGHT_ 1 2 1 208452 408452 WIN WEEKEND_WI 2 3 2 408452 WIN EKEND_WIND 2 4 1 608453 808453 WIN MAINTENANC 4 5 2 808453 1008453 WIN INTENANCE_ 4 6 3 1008453 WIN ENANCE_WIN 4 Each group has been successfully marked and you already know what to do with such a result set – simply use analytic function first_value and partition by just now computed column grp. But this time to make things a bit different we will use exactly opposed function last_value. And this will give us another possibility use analytic constructions because the working window from unbounded preceding and current row is true also for functions first_value and last_value. Because we will show rows with lvl = 1 along with it’s corresponding group last row that is after it in the ordering we have to use non-default working window from current row and unbounded following. The last step performed is to show only one row for each group – of course we have to leave only rows with lvl = 1. So the final query with second level of analytics and last filter is (see Code listing 9). SELECT * FROM ( SELECT last_value(adr_id) OVER ( PARTITION BY grp ORDER BY rn ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) curr_addr_id, last_value(adr_street) OVER ( PARTITION BY grp ORDER BY rn ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) curr_street, lvl, adr_id, adr_country_code, adr_street FROM ( SELECT rn, lvl, adr_id, adr_adr_id, adr_country_code, adr_street, max(new_rn) OVER (ORDER BY rn) grp FROM ( SELECT rownum rn, level lvl, adr_id, adr_adr_id, adr_country_code, adr_street, CASE WHEN level = 1 THEN rownum ELSE NULL END new_rn FROM addresses START WITH adr_country_code = 'WIN' AND adr_region_code = 'DOW' CONNECT BY PRIOR adr_adr_id = adr_id ) ) ) WHERE lvl = 1; Code listing 9 Step five query for Case three And the result is the same as we have asked in the very beginning: CURR_ADDR_ID CURR_STREE LVL ADR_ID ADR ADR_STREET ------------ ---------- ---------- ---------- --- ---------- 8451 WEEKNIGHT_ 1 8451 WIN WEEKNIGHT_ 408452 EKEND_WIND 1 208452 WIN WEEKEND_WI 1008453 ENANCE_WIN 1 608453 WIN MAINTENANC One more remark for this particular Case three – I’m by no means saying that the solution provided here was the ultimate best one. The main idea was to show method for solving queries step by step. I can imagine at least 3 more solutions for this problem: 1) Method provided by Thomas Kyte in (http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:489772591421) that definitely could be written with much less code, but asks ~5 sorts for each returned row compared to constant sort count for query with analytic functions (see Code listing 10). SELECT to_number(substr(( SELECT max(to_char(level, 'fm000009') || ' ' || adr_id) FROM addresses adr2 START WITH adr2.adr_id = adr1.adr_id CONNECT BY PRIOR adr_adr_id IS NOT NULL AND PRIOR adr_adr_id = adr_id ), 8)) curr_addr_id, adr_id FROM addresses adr1 WHERE adr_country_code = 'WIN' AND adr_region_code = 'DOW'; Code listing 10 Another approach for Case three 2) One can use built-in function SYS_CONNECT_BY_PATH and filter out result later even in 9i. 3) Using 10g one can use CONNECT_BY_ROOT operator that returns column value using data from the root row. Case four: do all that in few sections This case wouldn’t be about analytics although impact on a report execution time could be in the same magnitude or sometimes even more. The basic idea is – for complicated reports divide target numbers into preaggregated values and then manipulate with them. I assume if you have created several complicated reports in your life there were times you have used or at least thought of using temporary tables holding temporary preaggregated values. WITH clause introduced as of 9i provides this ability right in the very SELECT statement itself. The real example is taken from several of our reports that showed peoples citizenship count over their address, like shown in Table 3: | Territory | CitizenshipX Citizens | CitizenshipX Non-citizens | Other citizenships Citizens | Other citizenships Non-citizens | | All territories | | | | | | Country A | | | | | | Big city1 | | | | | | Suburb1 | | | | | | Suburb2 | | | | | | Big city2 | | | | | | Region1 | | | | | | District1 | | | | | | District2 | | | | | | Big city1 | | | | | | Region2 | | | | | Table 3 People count breakdown by citizenships and address The problem is that I cannot use ROLLUP because 1) sum of elements is before elements (probably even that could be overcome after with some ordering of rows) 2) the main reason – big cities are shown several times in report and the overall sum isn’t sum of all elements. The original query tried to solve this situation very straightforward – for each territory type scan all involved tables schematically it was something like: 1) SELECT count of citizens and non-citizens from CITIZENSHIP table representing all territories row. 2) SELECT count of citizens and non-citizens from CITIZENSHIP table, joined to PERSON ADDRESSES table joined to ADDRESSES table joined to COUNTRIES table representing rows for each country. 3) SELECT count of citizens and non-citizens from CITIZENSHIP table, joined to PERSON ADDRESSES table joined to ADDRESSES table joined to CITIES table filtering out only big cities representing rows for each big city. 4) And similarly for suburbs, regions, and districts. As a result some of such reports never ended on production data i.e. after running for more than 48 hours they were simply killed. A simple question what to do was raised. Of course we could eliminate some of these continuous scans of the same data over and over again but I somehow had feeling that after the first full scan of all involved large base tables we have collected all information we need to display the result. The only problem was how to twist and format the result to show it desired times in the desired place. Solution was simple – create preaggregated counts grouped by citizenship, citizenship type and address components. After that we could easily sum and join them to necessary address classifier components. Following query is basically the same query taken from “Case one: persons and their citizenships” and joined with PERSON ADDRESSES and ADDRESSES tables (see Code listing 11). SELECT count(*) cnt, ctz_code ctz, ctz_type, adr_country_code cnt, adr_region_code reg, adr_city_code cty FROM addresses, person_addresses, -- a bit modified select from case1, -- just without grouping (SELECT * FROM ( SELECT ctz_prs_id, ctz_code, ctz_type, CASE WHEN ctz_prs_id <> prev_prs_id THEN 1 WHEN first_code = 'X' THEN 0 ELSE 1 END flag FROM ( SELECT ctz_prs_id, ctz_code, ctz_type, first_value (ctz_code) OVER ( PARTITION BY ctz_prs_id ORDER BY decode (ctz_code, 'X', 0, 1)) first_code, lag(ctz_prs_id, 1, -1) OVER (ORDER BY ctz_prs_id, decode (ctz_code, 'X', 0, 1)) prev_prs_id FROM citizenships ) ) WHERE flag = 1) ctz_source WHERE ctz_source.ctz_prs_id = pra_prs_id AND pra_adr_id = adr_id GROUP BY ctz_code, ctz_type, adr_country_code, adr_region_code, adr_city_code; Code listing 11 Query used n WITH clause for Case four So here it is – a preaggregated source to use in WITH clause and now we can take the same steps we had in the original query: 1) Sum all counts of citizens and non-citizens from preaggregate representing all territories row. 2) Sum counts of citizens and non-citizens from preaggregate joined to COUNTRIES table representing rows for each country. 3) Sum counts of citizens and non-citizens from preaggregate joined to CITIES table filtering out only big cities representing rows for each big city. 4) And similarly for suburbs, regions, and districts. As a result we eliminated roughly 5/6 of all work and additionally using technique described in “Case one: persons and their citizenships” all such reports run on production data in less than 20 minutes down from at least 48 hours and it was achieved mostly because following factors: 1) Scanned big source tables (CITIZENSHIPS, ADDRESSES, PERSON_ADDRESSES in example, actually there were more) only once via full scan. 2) Scan created small temporary table in memory. In our real production system it contained several thousand rows. 3) Instead of joining big source table results several times to classifier tables we join small temporary table. The same number of times though. Complete query and sample data generation can be seen here. Summary I had few goals in my mind writing this article: 1) Using analytic functions makes reporting easier in many cases. It is like cycle operator you can do on your result set. If you can successfully exploit it, you can free up much time and resources for other tasks. 2) Solving problems step by step is an ancient method for solving various tasks that could be well used also finding solutions for complex reports. 3) Divide et impera was (and is) successfully used method in politics, why not use it in SQL? WITH clause and preaggregate values working like this ancient method can sometimes be even more effective than analytic functions. The minimum goal of this article would be achieved if you’d be at least a little more interested in analytic functions and WITH clause. The maximum goal would be achieved if you’d be encouraged enough to try out them yourself. About author Gints Plivna
Этот e-mail защищен от спам-ботов. Для его просмотра в вашем браузере должна быть включена поддержка Java-script
is system analyst in Rix Technologies Ltd. (www.rixtech.lv). He has experience in working with Oracle since 1997 and his interests mostly have been connected with analyzing system requirements, design, development and SQL tuning. Contacts: e-mail -
Этот e-mail защищен от спам-ботов. Для его просмотра в вашем браузере должна быть включена поддержка Java-script
website - http://www.gplivna.eu/ Licenses This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA. Appendix 1 Database model diagram used in this document  Figure 1 Database model diagram used in this document |