Notes from the Field: 2010

Thursday, October 21, 2010

Database Mirroring monitor for ETL

BY Jim Bennett

Do you have databases that are mirrored that your ETL is populating? Does the database mirror ever start to lag?

If you answered yes to either of these questions then this is for you. SQL Server has a handy table that is created when database mirroring is configured. This table (msdb.dbo.dbm_monitor_data) allows you to see the status of your mirrored databases. So how can we make use of this in ETL? By using the code provided below, you can pause your ETL if database mirroring is behind and give SQL Server a chance to catch up before you start loading more data from your ETL process.

The usp_CheckMirroringStatus stored procedure does two things for you: 1) it captures information about the status of the mirrored database and logs this information to a table (dbo.MirroringStatus). This can help show you longer term trends about how mirroring is working on the server. 2) the stored procedure will not finish until the redo_queue falls below your threshold. This essentially can give the SQL Server some breathing room before more data is introduced from your ETL.

Overall this is a handy piece of logic that can be baked-in to your ETL process for those times when your server is not keeping up with your ETL process.

IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MirroringStatus]') AND type in (N'U'))
DROP TABLE [dbo].[MirroringStatus]
GO

PRINT 'Create Table dbo.MirroringStatus'
GO
CREATE TABLE [dbo].[MirroringStatus](
       [Id] [int] IDENTITY(1,1) NOT NULL,
       database_name varchar(128),
       mirroring_state varchar(20),
       witness_status varchar(15),
       log_flush_rateKBs int,
       send_queue_sizeKB int,
       send_rateKBs int,
       redo_queue_sizeKB int,
       redo_rateKBs int,
       transaction_delayMS int,
       transactions_per_sec int,
       local_time datetime,
       end_of_log_lsn numeric(25,0),
       failover_lsn numeric(25,0)
CONSTRAINT [PK_MirroringStatus] PRIMARY KEY CLUSTERED
(
       [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
GO

IF EXISTS (SELECT 1 FROM sys.procedures WHERE name = 'usp_CheckMirroringStatus')
DROP PROCEDURE [dbo].[usp_CheckMirroringStatus]
GO

PRINT 'Create SP dbo.usp_CheckMirroringStatus'
GO
CREATE PROCEDURE [dbo].[usp_CheckMirroringStatus]
@dbname varchar(128),
@RedoQueueThresholdKB int = 1024,
@WaitDelayTimeInSeconds int = 15
AS
/*
Procedure Name: usp_CheckMirroringStatus
Created By:     Jim Bennett – CapstoneBI
Created Date:   10/20/2010
Website:        www.capstonebi.com
Purpose:        Allows database mirroring to be monitored and cause calling
    process to wait until mirroring is caught up.
*/

DECLARE @isFinished tinyint = 0;
DECLARE @CurrentRedoQueueSizeKB int;

CREATE TABLE #Result
       (
       database_id int,
       role int,
       mirroring_state int,
       witness_status int,
       log_flush_rate int,
       send_queue_size int,
       send_rate int,
       redo_queue_size int,
       redo_rate int,
       transaction_delay int,
       transactions_per_sec int,
       [time] datetime,
       end_of_log_lsn numeric(25,0),
       failover_lsn numeric(25,0),
       local_time datetime
       )

IF (
SELECT COUNT(*)
FROM sys.database_mirroring
WHERE database_id = DB_ID(@dbname)
AND mirroring_guid IS NOT NULL
) > 0
BEGIN
--The database is being mirrored
PRINT 'Database: ' + @dbname + ' is mirrored'

       WHILE (@isFinished = 0)
       BEGIN
              TRUNCATE TABLE #Result;

              PRINT 'Checking Redo Queue Size'
             INSERT INTO #Result
              SELECT TOP 1 *
              FROM msdb.dbo.dbm_monitor_data WITH (NOLOCK)
              WHERE database_id = DB_ID(@dbname)
              ORDER BY [time] DESC

              SELECT
                     @isFinished = CASE WHEN redo_queue_size <= @RedoQueueThresholdKB THEN 1 ELSE 0 END,
                     @CurrentRedoQueueSizeKB = redo_queue_size
              FROM #Result
              PRINT 'Current RedoQueue Size in KB: ' + CAST (@CurrentRedoQueueSizeKB as varchar(50))

IF (@isFinished = 0)
WAITFOR DELAY @WaitDelayTimeInSeconds

              INSERT INTO dbo.MirroringStatus
                     (
                     [database_name],
                     mirroring_state,
                     witness_status,
                     log_flush_rateKBs,
                     send_queue_sizeKB,
                     send_rateKBs,
                     redo_queue_sizeKB,
                     redo_rateKBs,
                     transaction_delayMS,
                     transactions_per_sec,
                     local_time,
                     end_of_log_lsn,
                     failover_lsn
                     )
              SELECT
                     db_name(database_id),
                     CASE mirroring_state
                           WHEN 0 THEN 'Suspended'
                           WHEN 1 THEN 'Disconnected'
                           WHEN 2 THEN 'Synchronizing'
                           WHEN 3 THEN 'Pending Failover'
                           WHEN 4 THEN 'Synchronized'
                           ELSE 'Unknown'
                     END,
                     CASE witness_status
                           WHEN 0 THEN 'Unknown'
                           WHEN 1 THEN 'Connected'
                           WHEN 2 THEN 'Disconnected'
                            ELSE 'Unknown'
                     END,
                     log_flush_rate,
                     send_queue_size,
                     send_rate,
                     redo_queue_size,
                     redo_rate,
                     transaction_delay,
                     transactions_per_sec,
                     local_time,
                     end_of_log_lsn,
                     failover_lsn
              FROM #Result

END

DROP TABLE #Result

END
ELSE
PRINT 'Database: ' + @dbname + ' is not mirrored'

Monday, October 4, 2010

SSIS Package Configurations

By Jim Bennett

Recently I was asked to verify whether a directory full of SSIS packages all had their package configurations enabled. So instead of opening each one manually I decided to write a Powershell script to do it for me. In each SSIS package file there is an XML node that is designated for the package configurations. The XML looks like this for disabled package configurations:

<DTS:Property DTS:Name="EnableConfig">0</DTS:Property>
and this for enabled package configurations:
<DTS:Property DTS:Name="EnableConfig">-1</DTS:Property>

The Powershell script is rather straightforward and could easily be modified to change this setting.

Set-ExecutionPolicy Unrestricted
## ScriptName: SearchForPackagesWithConfigurationsOff
## WrittenBy: Jim Bennett – CapstoneBI
## WrittenDate: 09/30/2010
## Website: www.capstonebi.com
## Purpose: Search for SSIS packages in a folder with their configurations setting turned off
if ([String]$args[0] -eq "")
{write-output "Usage: $($MyInvocation.InvocationName) DirectoryToSearch"}
else
{
$sourcefolder = [String]$args
if(!$sourcefolder.EndsWith("\")) {$sourcefolder+="\"}
$configurationSearchString = '<DTS:Property DTS:Name="EnableConfig">0</DTS:Property>'
$includedFileTypes = "*.dtsx"
get-childitem $($sourcefolder+"*") -include $includedFileTypes ForEach-Object {
if ($(Select-String -quiet $configurationSearchString $($sourcefolder + $_.Name)) -eq "true")
{write-output $_.Name}
}
}

Tuesday, September 21, 2010

Predictive Analytics in the Cloud

By Irit Eizips

The ability to do predictive analytics and data mining without being a statistician is here! Predixion Insight is a very cool public and private (on-premise) cloud service that is accessed via Predixion’s client software, which is installed into Excel (as an add-in) and lets you build sophisticated models without really needing to know all the technical stuff that would typically go into performing predictive analytics. Predixion provides powerful collaboration and reporting features, and best of all they are natively integrated to Microsoft’s new in memory BI (IMBI) tool, PowerPivot so you can perform predictive analytics on up to 100 million rows all in the client, then write results back to PowerPivot for slicing and dicing, then publishing via SharePoint.

The service is over the cloud, so all you really need is the Predixion Insight for Excel client and an internet connection. If your company uses SAS or SPSS, Predixion Insight will soon allow you to leverage that investment by running those models in the cloud, sharing them, reporting on them, then sharing the results with anyone via SharePoint. Predictive Analytics for the masses is finally here!

Here’s a link to a bunch of video tutorials on their website which are easy to follow – and a free 7 day trial version you could download and play with.

http://www.predixionsoftware.com/spv-90.aspx

Here are some interesting articles about their tool:

http://axis.aberdeen.com/index.php/information-technology/89-information-technology-insights/1488-predixion-software-should-find-a-home-at-the-best-in-class

http://www.b-eye-network.com/blogs/myers/archives/2010/09/analytics_movin.php

http://jtonedm.com/2010/09/14/first-look-predixion/

http://www.wiseanalytics.com/blog/2010/09/14/changing-the-predictive-analytics-game/

http://powerpivotpro.com/ (See interview with JamieMac of Predixion)

Monday, September 20, 2010

Getting Started with SQL Server Data Mining for Retail/Finance

By Rick Durham

There are two informative, technical webcasts found at this link that cover how SSAS Data Mining is used to solve general retail/marketing problems…

“Overview of SQL Server Data Mining”
“Applying SQL Server 2005 Data Mining to Enterprise Business Problem”

Here is the URL: http://www.sqlserverdatamining.com/ssdm/Home/Webcasts/tabid/62/Default.aspx

Types of finance/marketing problems that data mining can be used for include:

Database marketing applications include offer response, up-sell, cross-sell, and attrition models
Financial risk management models attempt to predict monetary events such as credit default, loan prepayment, and insurance claims
Fraud detection methods attempt to detect or impede illegal activity involving financial transactions
Process monitoring applications detect deviations from the norm in manufacturing, financial, and security processes
Pattern detection models are used in applications that range from handwriting analysis to medical diagnostics
(please note this is by no means an exhaustive list)

Another good site in terms of general tutorial is this following:

http://www.thearling.com/dmintro/dmintro_frame.htm

Hope this helps get you started!

Thursday, September 9, 2010

Standardize Your MDX Parameter Queries

By Dan Meyers

One thing that I have found useful when writing a lot of Reporting Services reports that have parameters and use MDX is to create some standard calculated members in my cube for the Label and Value properties of the parameters. Just like any other calculated measures that get built into the cube script, you get the advantage of reusing them instead of doing them over and over again in the WITH clause of your query. The code below is dynamic and is not specific to a particular dimension or anything so it will work with whatever you put on ROWS in your data set query for your report parameters.

Below is the MDX for the calculated members and a sample query.

Insert this code into your cube script (at the bottom)

CREATE MEMBER CURRENTCUBE.[Measures].[ParameterValue] AS
Axis(1).Item(0).Item(0).Dimension.CurrentMember.UNIQUE_NAME,
VISIBLE = 1;

CREATE MEMBER CURRENTCUBE.[Measures].[ParameterCaption] AS
String(Axis(1).Item(0).Item(0).Dimension.CurrentMember.Level.Ordinal * 1 , ' ' ) + Axis(1).Item(0).Item(0).Dimension.CurrentMember.Member_Caption,
VISIBLE = 1;

Sample Query
SELECT
      {[Measures].[ParameterCaption], [Measures].[ParameterValue]} ON 0,
      {[Date].[Calendar].ALLMEMBERS} ON 1
FROM
      [Adventure Works]

Thursday, September 2, 2010

Interpreting Causation in Time Series Forecast models

By Rick Durham

· Building data mining models is one thing
· Determining the primary factors that contribute to the final predicted data point is quite another

One of the most useful algorithms in the SSAS suite of data mining tools is the Time Series algorithm. Its forecast method is really a combination of two other algorithms (ARIMA, ARTxp) and it takes the historic data values in a series to make future predictions regarding that series. It does this by assigning weights to past data values to make future predictions. One of the most important features of this method is its ability to allow the data from other series in the model to be incorporated in the final prediction values.

For example, in the model below, we can see a direct historic relationship between the price of gas and the price of oil. When the time series algorithm was used to build the model it looked at all of the past data points for both gas and oil to make the final prediction for the price of gas.

We can infer from this model, based on historic data points, that the price of oil affects the price of gas (no big surprise here). Please note the oil price value is normalized against Feb 2008 oil price in %.

But what affected oil prices? What caused the huge spike in oil prices July 2008 in our time series forecast? My experience has been that when we create these type models we are always going to be ask what factors caused or drove certain data points in the series to extreme values.

In this case, we might casually answer “ It was a lack of supply in oil with high demand” but this is not true as the following set of historic data charts prove. The chart below shows that in July 2008 production of oil was at an all time high as suppliers were willing increase production when the price point was high. Again no surprise, it’s simple Econ 101.

We would think that demand for oil during this period would also be high driving up the price point -but that is not the case. In fact, demand for oil during this period was very low as this historic data chart reveals. It started dropping in 2007 and hit a low in the summer of 2008 when the price of oil and gas were both at all time highs.

So how can we explain what was driving the price of oil up and thus the price of gas in July 2008? It turns out that one of the primary factors driving up the price of oil was the value of the dollars value against other currencies. In effect, because the value of the dollar was low oil suppliers wanted more dollars for the same units of oil thus driving up the price.

Many types of data that are typically used in time series analysis (think commodities, stock prices, long term weather forecasts…) are driven by complex factors that are often changing and may be non-stationary. The factors that drive a forecast today are not the same as what might be driving it tomorrow. This is what makes understanding what series need to be included in the mining models and what indirect factors drive them tricky.

If we take our oil example, many factors have driven the price around historically. These include: supply, demand, war, strikes, geopolitical tension (or lack of), weather… In some cases, the input factors are so complex and varied that the only way to predict future values is to use the historic time values as there is no way to determine all of the factors that move the data. This is certainly true of the stock market as well.

In this blog, I have taken a simple example of time series analysis based on historic oil and gas prices to show how once we have developed our mining model we can potentially dive deeper to understand what factors are driving the forecasts, and ultimately provide better insight to the business. This is the essence of what BI should be about and is often overlooked by technicians who are overly occupied in the complexity of the tools they are using rather than how to use the results they generate to impact the organization.

Notes from the Field