zondag 14 mei 2017

Getting filtered data from a Tabular model with Excel

Introduction

Building a PowerPivot model on filtered data from a SQL Database table is fairly simple: in the table properties you can switch the table view to query view and change the query. I haven't found a similar way when using a live connection from Excel to a Tabular model.

The only option I have found so far is using the CommandText in the connection properties, but that changes the look and feel of browsing the cube: the data retrieved from the Tabular model comes back as one plain table. This blogpost describes how to adjust the .odc file of Excel and get the data as such a plain table.

The goal of this blogpost is to show how to filter the fact data based on a dimension value (DimCustomer).

The Datamodel

I've created a simple model in SQL Server and imported it into the Tabular project. It is modeled as follows.


One fact table with a key to the DimCustomer table and a measure Amount. I added a few records.

Some simple EVALUATES

In SSMS, I experimented with some EVALUATE queries to get some data from the Tabular model. Below is the result of this exercise.

--Getting the data from a table
EVALUATE Fact

--A simple  query (and join between tables)
EVALUATE
SUMMARIZE (Fact
,DimCustomer[Name]
,"TotalAmount", SUM(Fact[Amount])
)

--A simple query and a calculated measure is used
EVALUATE
SUMMARIZE (Fact
,DimCustomer[Name]
,"TotalAmount", Fact[TotalAmount]
)

--A calculated measure and a Filter
EVALUATE
FILTER(
 SUMMARIZE (Fact
 ,DimCustomer[Name]
 ,"TotalAmount", Fact[TotalAmount]
 ),
 DimCustomer[Name] = "Hennie"
)

Adjusting the .odc file

From the experiments in SSMS I took one of the DAX EVALUATE queries and inserted it in the CommandText tag in the .odc file of Excel.
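As an illustration, the relevant part of the .odc file could look roughly like the fragment below. This is a sketch: the server, catalog and query are placeholders for your own values, and the rest of the .odc file (it is an HTML file with embedded XML) stays untouched.

<odc:Connection odc:Type="OLEDB">
  <odc:ConnectionString>Provider=MSOLAP;Data Source=MyTabularServer;Initial Catalog=MyTabularDB</odc:ConnectionString>
  <odc:CommandType>Default</odc:CommandType>
  <odc:CommandText>EVALUATE FILTER(SUMMARIZE(Fact, DimCustomer[Name], "TotalAmount", Fact[TotalAmount]), DimCustomer[Name] = "Hennie")</odc:CommandText>
</odc:Connection>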



This is the result in the connection properties of the Excel file. The Command Type combobox has changed to Default and the Command Text box now contains the DAX EVALUATE expression.

The result in Excel

Below you can see the fields in the field list, and as you can see the look and feel of the field list has changed.



Conclusion

This blogpost showed how to filter the data from a Tabular model in Excel by adjusting the .odc file.


Best regards,

Hennie



woensdag 3 mei 2017

DAX : Selecting a measure with a Slicer

Introduction

Someone asked me a question about selecting a measure in Excel with PowerPivot/Tabular, and I found a blog about selecting a measure written by James Thomas. I made my own example and wrote this blogpost about it. I'll describe the following steps:
  • Create some sample data.
  • Create a helper table.
  • Create the datamodel.
  • Create the measures.
  • Create a pivot table and add the proper fields to it.
  • Testing the result.

The source data

I've created three tables: DimCustomer, FactSales and a Measure table.
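Based on the MeasureValue expression further down, the Measure helper table simply contains one row per selectable measure. Something along these lines (an assumed sketch; the exact contents are only visible in the screenshots):

MeasureID  MeasureName
1          MeasureA
2          MeasureB
3          MeasureC
4          MeasureD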


I added these to the PowerPivot model and the datamodel looks like the following screenshot:



The DAX Expressions

I've created the following DAX expressions (the + 2, etc. is just to get some different test data):

MeasureA:=SUM([Amount]) + 2
MeasureB:=SUM([Amount]) + 3
MeasureC:=SUM([Amount]) + 8
MeasureD:=SUM([Amount]) + 4

MeasureValue:=SWITCH (TRUE;
    MIN ('Measure'[MeasureID]) = 1; FactSales[MeasureA];
    MIN ('Measure'[MeasureID]) = 2; FactSales[MeasureB];
    MIN ('Measure'[MeasureID]) = 3; FactSales[MeasureC];
    MIN ('Measure'[MeasureID]) = 4; FactSales[MeasureD])

Configure the Pivottable

This is how the pivot table is set up: the MeasureName is added to the rows and the MeasureValue measure is in the VALUES area.


The result

Here is the result of the solution. You can see that selecting the slicer buttons MeasureA and MeasureC returns only the measures A and C, and that is exactly what I want.


Yet another example of a selection with the slicer buttons:


Here you can see the different measures according to the selected slicer buttons.

Conclusion

This blogpost is about selecting the proper measures with a slicer.


Greetz,
Hennie

zondag 30 april 2017

Powershell : Backing up your project

Introduction

In a project where people work together in a team it is advisable to use Team Foundation Server for tracking and storing your Visual Studio project. When I start a project, or when I'm working on my own, I normally make a backup of the Visual Studio project every day or every couple of days. In this blogpost I describe a way to automate the backup of a Visual Studio project: zipping it, tagging it with a date and time, storing it in a backup folder, and sending an email when it succeeds or fails.

The Script

Below is an example of the PowerShell script, which executes a couple of steps:
  1. If a zipped backup file with the same name already exists, remove it.
  2. Try to compress the project. 
  3. Copy the file to the backup folder.
  4. Send an email when it succeeds or when it fails.

#########################################
# Author : Hennie de Nooijer
# Description : Script zips the project folder into a backup file with a date and time tag.
# Date : 20170430
#########################################
$source = "F:\Project"
$strDate = Get-Date -format yyyyMMdd
$strTime = Get-Date -format HHmmss
 
$destination = "F:\Backup\Project_" + $strDate + "_" + $strTime +  ".zip"
 
If(Test-path $destination) {Remove-item $destination}
 
Add-Type -assembly "system.io.compression.filesystem"
Try 
{
    [io.compression.zipfile]::CreateFromDirectory($Source, $destination) 
    Send-MailMessage -From <putyouremailhere> `
                     -To <putyouremailhere> `
                     -Subject "Project Backup Succeeded" `
                     -SmtpServer <SMTPServerHere>
}
Catch
{
    Remove-Item $destination
    $ErrorMessage = $_.Exception.Message
    Send-MailMessage -From <putyouremailhere> `
                     -To <putyouremailhere> `
                     -Subject "Project Backup Failed" `
                     -SmtpServer <SMTPServerHere> `
                     -Body "The error message was $ErrorMessage"
    Break
}

This is the result if you execute the script a couple of times (without _).


Schedule a windows task

In order to automate the backup of the project, a Windows scheduled task is needed. To set this up, the Computer Management tool is used; there you can create Windows tasks. Below, the steps are explained one by one.
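(As an aside, roughly the same task can also be created from the command line with schtasks; a sketch, assuming the script path used later in this post and a daily run at 20:00:

schtasks /create /tn "BackupProject" /tr "powershell -file D:\ps1\BackupProject.ps1" /sc daily /st 20:00

The wizard route is shown step by step below.)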

Give the task an appropriate name and press Next.


Select the frequency of the Windows task. In this example we choose Daily. Press Next.


The next step is the time of execution. Select the start moment and after how many days the task should recur. Press Next.


Choose start a program and press Next.


The following step is an important one: here you enter the command that executes the script. The command is powershell and the argument is -file "D:\ps1\BackupProject.ps1". Press Next.


Here is the command again.

       
powershell -file "D:\ps1\BackupProject.ps1"
        

A warning appears and you can press Yes.


Finally the task is ready and now press Next.


The task below is shown and is ready for usage.


With the context menu it is possible to run the task in an ad hoc manner.



And you can even schedule the task every five minutes ;-)



Conclusion

This blogpost describes a PowerShell script that automates zipping and backing up a project, together with a Windows task that schedules it.

Greetz,

Hennie

donderdag 13 april 2017

Azure and AzureML : Loading data in Azure and using it in AzureML

Introduction

Using a database in Azure is easy. With SSMS on your local system (on-premise), a connection with an Azure database is easy to create and maintain. In this blogpost I'll show a simple way to create a connection with the Azure database and how to get this data into AzureML.

Logging into Azure

Log into Azure with the following server name: <servername>.database.windows.net.


Certain firewall settings should be set in order to get access.


Now the object explorer shows the database of Azure.


Querying the database with a simple SELECT Statement.
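For example, something along these lines (the table name is hypothetical; use whatever table is present in your database):

SELECT TOP 10 *
FROM dbo.CaliforniaDairyProduction;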


With the Import and Export Wizard I've uploaded some data (California Dairy information) into Azure.


And here a connection is made from AzureML to the Azure database.


Press Run, and with Visualize the data that comes from the Azure database is presented.




Conclusion

Just a simple blogpost on how to connect SSMS (on-premise) with an Azure database and how to import the data from the Azure database into AzureML with the Import Data module.

Greetz,

Hennie

zondag 9 april 2017

Time series

Introduction

A time series is a series of data points in a certain time order; it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, power usage, heart rate, counts of sunspots, stock market values, click data and the demand on power plants. But also think about sales forecasting, financial analysis and future projections.

Time series data is normally plotted in line graphs.

The real goal is forecasting the time series.

Examples of time series

I've gathered a couple of examples of time series.


Here is another example of a time series with a trend and a seasonal effect.


The Microsoft share price, which goes up and down during the year.




Definitions

First, let's start with some definitions that help to understand this topic in a more thorough way. Time series have certain characteristics. A time series needs some patterns, because without patterns it is impossible to predict future values. So, the data should contain a certain repeatability in order to make predictions for the future.

Stationary
The mean should be constant and not dependent on time: if the overall time series trends upwards or downwards in a graph, it is not stationary. The variance shouldn't change over time either; if the variance of the data changes over time, for example with larger and larger peaks, the series is not called stationary. The covariance should also be the same over time. You can find more information here. A small R sketch of the difference follows after the list below.

So stationarity means:
  • a constant mean.
  • a constant variance.
  • a constant covariance over time.
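A minimal sketch in R: a stationary white-noise series next to a non-stationary random walk (the tseries package also offers a formal test, adf.test, if you want more than eyeballing the plots).

# stationary series: white noise around a constant mean
set.seed(42)
stationary <- rnorm(200)

# non-stationary series: a random walk, the mean drifts over time
random.walk <- cumsum(rnorm(200))

par(mfrow = c(1, 2))
plot.ts(stationary, main = "Stationary")
plot.ts(random.walk, main = "Non-stationary")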

Irregular reporting
In order to do time series analysis it is important that the data points are registered at regular moments. For instance, when the data is registered at a monthly level and then for half a year no data is registered, it is difficult to find regular patterns because of the lack of data.

Auto correlation
Auto means self, and Wikipedia says: "Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals."


Auto Correlation Function (ACF)
One important function that is often used in time series analysis is the autocorrelation function. It gives the correlation between the data points in a time series for different lags; if the series is random, hardly any correlation is found between the different data points. (A small sketch of both the ACF and the PACF follows after the next definition.)


Partial Auto Correlation Function (PACF)
Wikipedia :"In time series analysis, the partial autocorrelation function (PACF) gives the partial correlation of a time series with its own lagged values, controlling for the values of the time series at all shorter lags. It contrasts with the autocorrelation function, which does not control for other lags."

Decomposition
The decomposition of a time series is a statistical method that deconstructs the series into several components. There are a couple of major components in the decomposition of a time series: the trend, the (seasonal) pattern and a stationary signal.

LOWESS / LOESS regression
LOWESS (Locally Weighted Scatterplot Smoothing), or LOESS (locally weighted smoothing), is a popular method used in regression analysis. It creates a smooth line through a time plot or scatter plot to help you see the relationship between variables and foresee trends. LOESS uses local polynomial modelling to find trends, and you need to determine the bandwidth parameter, which can be a bit difficult. More information on LOESS here.
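A small sketch in R with lowess(); the f argument is the bandwidth (smoother span) that has to be chosen:

# plot the raw series and draw a LOWESS line through it
x <- as.numeric(time(AirPassengers))
y <- as.numeric(AirPassengers)
plot(x, y, type = "l", xlab = "Year", ylab = "Passengers")

# larger f = smoother line, smaller f = follows the data more closely
lines(lowess(x, y, f = 2/3), col = "red", lwd = 2)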


Moving Average smoothing
Moving average smoothing is used to remove the shorter-period variation, in contrast with LOESS, which is used for longer-period patterns. Moving average smoothing works on a more fine-grained level.



With moving average smoothing you find trends by averaging over a cycle length. You need to know the cycle length; this can be a bit tedious and difficult to see, but with experimenting and trial and error it is possible to determine trends.
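A sketch of a centred moving average in R with stats::filter(), assuming monthly data with a yearly cycle (cycle length 12):

# a 12-point centred moving average smooths away the within-year pattern
ma12 <- stats::filter(AirPassengers, rep(1/12, 12), sides = 2)

plot(AirPassengers)
lines(ma12, col = "red", lwd = 2)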


STL
STL is a combination of the LOESS and moving average smoothing techniques. First LOESS is used to find the general trend, then moving average smoothing is used to find the seasonal effect; what is left after these passes is the remainder, the residual.

Use LOESS to find the general trend T and subtract it (X - T); moving average smoothing on X - T then gives the seasonal effect S. Subtracting that as well leaves the remainder R = X - T - S.

In the above example the original diagram is shown ("data"), and below it the seasonal effect, the trend and the remainder are presented separately. Without the seasonal effect the trend is clearly recognizable, and what is left is the remainder. If the remainder is normally distributed you're on the right track, because if it does not show a normal distribution there is perhaps still some correlation hidden in it.


When the remainder is normally distributed we have removed all of the hidden relations; it is pure random noise.
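In R the whole decomposition into seasonal effect, trend and remainder can be done in one call with stl(); a minimal sketch:

# s.window = "periodic" assumes the seasonal pattern is the same every year
fit <- stl(AirPassengers, s.window = "periodic")
plot(fit)                 # panels: data, seasonal, trend, remainder
head(fit$time.series)     # the three extracted components as columns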

Forecasting 

If we know the trend and the seasonal effect we can predict the future. In the diagram below the blue line is the forecast: a predicted value of the future based on the trend and the seasonal effect.
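A minimal sketch of such a forecast in base R with HoltWinters(), which estimates the level, trend and seasonal effect and extrapolates them (the forecast package with auto.arima() is a popular alternative):

fit <- HoltWinters(AirPassengers)
pred <- predict(fit, n.ahead = 12)   # forecast the next 12 months
plot(fit, pred)                      # observed, fitted and forecast values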


Conclusion

This blogpost is an introduction to forecasting based on time series.

Hennie

zondag 19 maart 2017

R : Working with data and some basic dataframe manipulations

Introduction

For a long time I have wanted to write a blogpost about the basic manipulations of data frames in R. Data frames are a fundamental concept in R and probably the most used structure for storing tabular information.

This is the first in a series about data frame manipulations. For each example I'll first show the script and after that the output.

Create a data frame

First let's create a data frame for the manipulations.
rm(list=ls())

data.frame.x<-data.frame(names=c("Hennie","Robert","John","Jan"),
                         age=c(30,18,25,70),
                         Hobby=c("Tennis","Football","Tennis","motorcross"))

data.frame.y<-data.frame("Person_name"=c("Hennie","Robert","Kees","Henry"),
                         age=c(15,75,32,51),
                         "Pet"=c("Dog","Cat","Bird","Mouse"))
data.frame.x
data.frame.y

Below the result when the code is executed.
> data.frame.x<-data.frame(names=c("Hennie","Robert","John","Jan"),
+                          age=c(30,18,25,70),
+                          Hobby=c("Tennis","Football","Tennis","motorcross"))
> 
> data.frame.y<-data.frame("Person_name"=c("Hennie","Robert","Kees","Henry"),
+                          age=c(15,75,32,51),
+                          "Pet"=c("Dog","Cat","Bird","Mouse"))
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.y
  Person_name age   Pet
1      Hennie  15   Dog
2      Robert  75   Cat
3        Kees  32  Bird
4       Henry  51 Mouse

Select a column

There are multiple ways of selecting columns in R.
data.frame.x[,1]

data.frame.x$names       

Here is the output when the code is executed.
> data.frame.x[,1]
[1] Hennie Robert John   Jan   
Levels: Hennie Jan John Robert
> data.frame.x$names
[1] Hennie Robert John   Jan   
Levels: Hennie Jan John Robert  

Select multiple columns

It is also possible to select multiple columns of a dataframe.
data.frame.x[,1:3]
data.frame.x[,c(1:3)]
data.frame.x[,c(1,3)]    

And this results in the following output when executed.
 > data.frame.x[,1:3]
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.x[,c(1:3)]
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.x[,c(1,3)]
   names      Hobby
1 Hennie     Tennis
2 Robert   Football
3   John     Tennis
4    Jan motorcross

Select a row

We can do the same with rows: selecting a single row.
data.frame.x[3,] 

Resulting in:
> data.frame.x[3,]
  names age  Hobby
3  John  25 Tennis

Selecting multiple rows

Selecting multiple rows can be done in a similar way as selecting multiple columns
data.frame.x[2:5,]
data.frame.x[c(2:5),]
data.frame.x[c(2:5),]     
 

Resulting in :
> data.frame.x[2:5,]
    names age      Hobby
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
NA   <NA>  NA       <NA>
> data.frame.x[c(2:5),]
    names age      Hobby
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
NA   <NA>  NA       <NA>
> data.frame.x[c(2:5),]
    names age      Hobby
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
NA   <NA>  NA       <NA>

Notice the NA values for the row numbers that do not exist in the data frame.

Selecting values in a data frame

It is also possible to select a cell value or to subset a data frame.
data.frame.x[1,2]
data.frame.x[c(2:3),c(1,3)]     

Resulting in:
> data.frame.x[1,2]
[1] 30
> data.frame.x[c(2:3),c(1,3)]
   names    Hobby
2 Robert Football
3   John   Tennis

Adding rows with rbind()

It is also possible to add rows to a data frame. There is, however, a 'problem' with the factors in the data frame: when R creates a factor, it only allows the levels that were present when the data frame was first created. Anything new that doesn't fit into one of these categories is rejected (it becomes NA).

For instance "Jim" was not originally in the data frame and when the following code is executed an error occurs.

data.frame.x
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Football"))  

Resulting in:
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
5   <NA>  90   Football
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Football"))
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "Jim") :
  invalid factor level, NA generated

The warning occurs because Jim is not one of the existing factor levels (Hennie, Robert, John, Jan), and therefore NA is inserted instead of Jim. When we execute the following code with a value that was already a level in the data frame, it works.

data.frame.x <-rbind(data.frame.x, c("Hennie",90,"Football"))
data.frame.x     

No NA in the data frame.
> data.frame.x <-rbind(data.frame.x, c("Hennie",90,"Football"))
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
5 Hennie  90   Football

But how do we add new values to the list of factor levels? We have to manually add the new categories as new levels with the levels() function.

levels(data.frame.x$names) <- c(levels(data.frame.x$names), 'Jim')
levels(data.frame.x$Hobby) <- c(levels(data.frame.x$Hobby), 'Bridge')
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
data.frame.x

Here is the executed code.
> levels(data.frame.x$names) <- c(levels(data.frame.x$names), 'Jim')
> levels(data.frame.x$Hobby) <- c(levels(data.frame.x$Hobby), 'Bridge')
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
5    Jim  90     Bridge    

Or convert the columns to characters, add the rows and turn them back to factors.
data.frame.x$names <- as.character(data.frame.x$names)
data.frame.x$Hobby <- as.character(data.frame.x$Hobby)
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
str(data.frame.x)
data.frame.x
data.frame.x$names <- as.factor(data.frame.x$names)
data.frame.x$Hobby <- as.factor(data.frame.x$Hobby)
str(data.frame.x)   

The code executed..
> data.frame.x$names <- as.character(data.frame.x$names)
> data.frame.x$Hobby <- as.character(data.frame.x$Hobby)
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
> str(data.frame.x)
'data.frame': 10 obs. of  3 variables:
 $ names: chr  "Hennie" "Robert" "John" "Jan" ...
 $ age  : chr  "30" "18" "25" "70" ...
 $ Hobby: chr  "Tennis" "Football" "Tennis" "motorcross" ...
> data.frame.x
    names age      Hobby
1  Hennie  30     Tennis
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
5     Jim  90     Bridge
6     Jim  90     Bridge
7     Jim  90     Bridge
8     Jim  90     Bridge
9     Jim  90     Bridge
10    Jim  90     Bridge
> data.frame.x$names <- as.factor(data.frame.x$names)
> data.frame.x$Hobby <- as.factor(data.frame.x$Hobby)
> str(data.frame.x)
'data.frame': 10 obs. of  3 variables:
 $ names: Factor w/ 5 levels "Hennie","Jan",..: 1 5 4 2 3 3 3 3 3 3
 $ age  : chr  "30" "18" "25" "70" ...
 $ Hobby: Factor w/ 4 levels "Bridge","Football",..: 4 2 4 3 1 1 1 1 1 1
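A third option, as a sketch: create the data frame with stringsAsFactors = FALSE in the first place (in the R versions current at the time of writing the default was TRUE), so the text columns stay character vectors and rbind() accepts new values without warnings.

data.frame.x <- data.frame(names = c("Hennie","Robert","John","Jan"),
                           age   = c(30,18,25,70),
                           Hobby = c("Tennis","Football","Tennis","motorcross"),
                           stringsAsFactors = FALSE)
data.frame.x <- rbind(data.frame.x, c("Jim",90,"Bridge"))   # no factor warning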

Adding columns with cbind()

With cbind() adding columns is a lot easier than adding rows to a data frame, because a column has only one datatype.

data.frame.x
new.column<-c('190', '168','174', '123')
data.frame.x <-cbind(data.frame.x, new.column)
colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
data.frame.x  

And this is the executed code
> data.frame.x
   names age      Hobby height
1 Hennie  30     Tennis    190
2 Robert  18   Football    168
3   John  25     Tennis    174
4    Jan  70 motorcross    123
> new.column<-c('190', '168','174', '123')
> data.frame.x <-cbind(data.frame.x, new.column)
> colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
> data.frame.x
   names age      Hobby height height
1 Hennie  30     Tennis    190    190
2 Robert  18   Football    168    168
3   John  25     Tennis    174    174
4    Jan  70 motorcross    123    123

Option number 2 is adding an extra column in the following manner.
data.frame.x
data.frame.x$new.column<-c('190', '168','174', '123')
colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
data.frame.x      

Resulting in
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.x$new.column<-c('190', '168','174', '123')
> colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
> data.frame.x
   names age      Hobby height
1 Hennie  30     Tennis    190
2 Robert  18   Football    168
3   John  25     Tennis    174
4    Jan  70 motorcross    123 

Show first n rows with head()

The head() function is like the TOP keyword in SQL and gives the first n rows of a data frame.
data.frame.x
head(data.frame.x,2)      

And this is the executed code
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> head(data.frame.x,2)
   names age    Hobby
1 Hennie  30   Tennis
2 Robert  18 Football

Show bottom n rows with tail()

The tail() function is similar to the head() function, but it returns the bottom n rows of the data frame.
data.frame.x
tail(data.frame.x,2)

And the results are..
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> tail(data.frame.x,2)
  names age      Hobby
3  John  25     Tennis
4   Jan  70 motorcross

Getting some columninformation with class()

If information is needed about a particular column, the class() function can be used.
class(data.frame.x$age)

Resulting in.
> class(data.frame.x$age)
[1] "numeric"      
 

names()

names() is the name of the function for getting or setting the names of an object.
data.frame.x
names(data.frame.x) 

And this is the executed code.
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> names(data.frame.x)
[1] "names" "age"   "Hobby"  
 

dim()

Knowing the dimensions of a data frame can be handy sometimes; this can be done with the dim() function.
data.frame.x
dim(data.frame.x)      

The executed code...
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> dim(data.frame.x)
[1] 4 3

Showing some statistical information about an object with summary()

In case more information is needed about the values in an object, the summary() function is available.

summary(data.frame.x)   


>summary(data.frame.x)
    names        age               Hobby  
 Hennie:1   Min.   :18.00   Football  :1  
 Jan   :1   1st Qu.:23.25   motorcross:1  
 John  :1   Median :27.50   Tennis    :2  
 Robert:1   Mean   :35.75                 
            3rd Qu.:40.00                 
            Max.   :70.00 
The occurrence of every name is 1, the average age is 35.75 and Tennis is counted twice.

Showing the structure of an object (data frame) with str()

If more information is needed about an object, in this particular case a data frame, you can use the str() function in R.
str(data.frame.x) 

Resulting in the following output.
> str(data.frame.x)
'data.frame': 4 obs. of  3 variables:
 $ names: Factor w/ 4 levels "Hennie","Jan",..: 1 4 3 2
 $ age  : num  30 18 25 70
 $ Hobby: Factor w/ 3 levels "Football","motorcross",..: 3 1 3 2
 

Factor is a type that categorizes the values and is used in statistical functions. Age is a numerical column.

Count the rows with nrow()

With nrow() it is possible to count the rows in a data frame
nrow(data.frame.x)     

Resulting in
> nrow(data.frame.x)    # number of data rows
[1] 4

Count the columns with ncol()

Of course it is also possible to count the number of columns; this is done with the ncol() function.
ncol(data.frame.x)     

Resulting in
> ncol(data.frame.x)
[1] 3

attributes()

The attributes() function accesses the attributes of an object, in this example the data frame.
data.frame.x
attributes(data.frame.x)    

And this is the result of the executed code.
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> attributes(data.frame.x)
$names
[1] "names" "age"   "Hobby"

$row.names
[1] 1 2 3 4

$class
[1] "data.frame"   
 

merge()

It is also possible to merge two data frames with the merge() function. The first merge() call in the code below tries to join on the columns the two frames have in common (here age), but there are no common age values and therefore no rows are returned. The second call has the column names specified, and with all=TRUE it performs a full outer join between the two data frames.

data.frame.x
data.frame.y
merge(data.frame.x, data.frame.y)
merge(data.frame.x, data.frame.y, by.x="names", by.y="Person_name",all=TRUE)     

And this is the executed code
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.y
  Person_name age   Pet
1      Hennie  15   Dog
2      Robert  75   Cat
3        Kees  32  Bird
4       Henry  51 Mouse
> merge(data.frame.x, data.frame.y)
[1] age         names       Hobby       Person_name Pet        
<0 rows> (or 0-length row.names)
> 
> merge(data.frame.x, data.frame.y, by.x="names", by.y="Person_name",all=TRUE)
   names age.x      Hobby age.y   Pet
1 Hennie    30     Tennis    15   Dog
2    Jan    70 motorcross    NA  <NA>
3   John    25     Tennis    NA  <NA>
4 Robert    18   Football    75   Cat
5  Henry    NA       <NA>    51 Mouse
6   Kees    NA       <NA>    32  Bird 
 

subset()

The subset() function is an easy way to select variables and observations. In the following example we select all rows that have an age greater than or equal to 20 or an age less than 10, and the columns names and Hobby are returned.

data.frame.x
subset.frame.x <- subset(data.frame.x, age >= 20 | age < 10, 
                  select=c(names, Hobby))
subset.frame.x

And this is the result...
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> subset.frame.x <- subset(data.frame.x, age >= 20 | age < 10, 
+                   select=c(names, Hobby))
> subset.frame.x
   names      Hobby
1 Hennie     Tennis
3   John     Tennis
4    Jan motorcross  
 

Conclusion

In this blogpost I've explored the different data frame manipulations in R.

Greetz,
Hennie

zondag 19 februari 2017

DAX : Bidirectional problem with Product and Bill of Materials

Introduction

Suppose your customer wants to calculate the weight of the components that make up your products (the Bill of Materials, aka the BOM). This blogpost describes summarizing and calculating the quantities and the weight of products and sales.

For this blogpost I've used the explanation of Exceleratorbi.com.

The Data

So, I've created a small example of the problem. There are three tables in the sample workbook: a Material (Product) table, a BOM table and a Sales fact table. In order to make the example more interesting I've created some data like car, bicycle and others, together with the parts that make them up, like wheels, frame and steering wheel. Lastly, a fact table with the sales data is created, where a quantity tells how many were sold on a day.


So, I'm interested in knowing the weight of the materials and also the weight of the parts of the Material (Product).

The data model

In the diagram I've plotted the data model. The keys of the DimMaterial table are propagated in different directions, not in the one direction we 'normally' use in dimensional modelling.


The DAX expressions

For the initial implementation of the problem in the workbook the following DAX expressions are used. Pay attention to the TotalMaterialWeight calculation: this expression uses the SUMX function, which calculates the weight in a row-by-row manner. SUMX is an iterator and is less efficient than the SUM function. Normally I don't use the SUMX function very often, but in this case it is needed because you need to multiply per row first and then sum it up.

       
TotalQuantity:=SUM(FactSales[Quantity])

TotalBOMWeight:=SUM(DimBom[BOMWeight])

TotalMaterialWeight:=SUMX(DimMaterial;[TotalQuantity]*[TotalBOMWeight])

The first DAX expression calculates the quantity sold in FactSales, the TotalBOMWeight expression calculates the weight of the individual BOM parts, and TotalMaterialWeight calculates the sum of the multiplication of TotalQuantity and TotalBOMWeight.

The outcome

Below is a first impression of the pivot table based on the PowerPivot model; here you can see that TotalQuantity and TotalMaterialWeight are calculated correctly. Next to the pivot table I've included a manual calculation, and it is the same as the automated calculation.


Below is a screenshot of the weights of the BOM parts, and that calculation is also correct.


How to improve the model

Now, there is something wrong with the TotalBOMWeight: it is 1225 on every row and that doesn't seem right. The problem is that the filter does not propagate towards DimBom in the standard manner. In Power BI Desktop you can set the relationship to be bidirectional in order to make it work; in order to change this behavior in PowerPivot you have to add some black magic to the DAX expression.

       
TotalSalesQuantity:=CALCULATE(SUM(FactSales[Quantity]);DimBom)

TotalBOMWeight:=Calculate(SUM(DimBom[BOMWeight]);FactSales)

TotalMaterialWeight:=SUMX(DimMaterial;[TotalSalesQuantity]*[TotalBOMWeight])
 

The sums of TotalSalesQuantity and TotalBOMWeight are now wrapped in a CALCULATE function with the related table as a filter argument, and this handles the calculation of the TotalBOMWeight correctly.




For a full explanation of this solution, see the blogpost of Exceleratorbi.com.

Conclusion

This blogpost is about a measure calculation that is not a standard pattern. The method described handles the calculation of the total weight of the Materials (products) sold and their BOM parts (components). We also explored the problem of bidirectional relationships.

Greetz,

Hennie

vrijdag 17 februari 2017

SSAS : Where did my measures go?

Introduction

Currently I'm developing a Tabular model in SSDT and somehow I managed to make my measures disappear. What? Yep! This has happened to me a couple of times now. This blogpost is about the problem and how to solve it.

Situation described

It occurred when I was entering some measures in SSDT.


Even though I didn't see any errors in the formulas, a blue mark is shown on the tab. A bit annoying, because you can't see why it's giving the error: no measure with an error is shown.




Solution option 1

On the Microsoft forums someone suggested deleting the layout file in order to get the erroneous measures back. So I did...



When I do that, the erroneous measures are shown again, but all the measures are now shown on two rows and the measure grid is completely reformatted.


Not what I want! I have to manually copy and paste the formulas back into a format that I appreciate, and this is a bit labor intensive when you have a lot of measures.

Solution option 2

After investigating the files more closely I opened the .bim file. Always make a copy of the file first!


And I removed the following snippet (although the comment says that you shouldn't modify it manually ;-))

       
<Command>
  <Text>----------------------------------------------------------
-- PowerPivot measures command (do not modify manually) --
----------------------------------------------------------
CREATE MEASURE 'FactSales'[Copy of PYSalesNetRevenue]=CALCULATE([TotalSalesNetRevenue], FILTER(SAMEPERIODLASTYEAR(DimCalendar[Date]), DimCalendar[Date] &lt;= LASTNONBLANK(DimCalendar[Date], [TotalSalesNetRevenue])));
</Text>
  <Annotations>
    <Annotation>
      <Name>FullName</Name>
      <Value>Copy of PYSalesNetRevenue</Value>
    </Annotation>
    <Annotation>
      <Name>Table</Name>
      <Value>FactSales</Value>
    </Annotation>
  </Annotations>
</Command>

That solved my problem. The layout of my measures stayed the same and the wrong measures were removed from the model.


Conclusion

It's possible for measures to disappear in SSDT, but by editing the .bim file you can remove the offending measures while keeping the layout of the remaining ones.

Greetz,

Hennie

zondag 12 februari 2017

SSAS : Processing a Tabular Model

Introduction

This blogpost is about processing a tabular model in Analysis Services in SQL Server 2014. At first glance the number of processing options is a bit overwhelming, but once you have a certain understanding of the options you get the idea behind them. Although I've gathered this information for SQL Server 2014, most of it is applicable to other versions as well.

I've read a couple of blogposts about processing a tabular model, and most of this blogpost is based on those blogs, so all the credit goes to the people who wrote them.

Stages of loading data

There are two stages of loading data in a tabular model. First, the xVelocity engine takes each column in each table and builds a dictionary from it.
Secondly, data recalculation takes place, meaning that the engine populates any calculated columns, hierarchies, relationships and indexes for the database.

Processing options

As said before, an SSAS Tabular model comes with a couple of processing options that can be applied differently depending on the object you want to process.


Process Add

With the processing option Process Add it is possible to implement incremental loading of the cube: you define the rows that are new to the cube. As Marco Russo writes in his blog, when executing the Process Add option the SSAS engine creates a new partition and merges it with the existing partition.

A side effect of Process Add is that it automatically recalculates the affected objects.

Process Clear

Process Clear drops all the data in a database, table, or partition. Is there more to say? ;-)


Process Data

Process Data loads data into a partition or a table. It reads the data and it builds dictionaries, but doesn't process any calculated columns, indexes, relationships or hierarchies. Therefore it's required to perform a Process Recalc after all tables' dictionaries have been updated using Process Data.


Process Default

Process Default loads data into unprocessed partitions or tables. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated. It recalculates the affected objects (and only the affected objects).


Process Defrag

Process Defrag optimizes the table dictionary (an internal engine structure) for a given table or for all tables in the database. This operation removes all dictionary entries that no longer exist in the data set and rebuilds the partition data based on the new dictionaries.

If you often add or remove partitions from a table, or if you often reprocess individual partitions in a table, you should Process Defrag regularly but carefully. 

Process Defrag may be an expensive operation for a large table.

If you Process Data on a table level, it is never necessary to do a Process Defrag.

Run Process Recalc after running Process Defrag.

Process Full

Process Full loads data into all selected partitions or tables. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated. It recalculates the affected objects. Process Full runs a Process Data on the table or partition, followed by a Process Recalc.


Process Recalc

Process Recalc recalculates calculated columns and rebuilds relationships, user hierarchies and other internal engine structures for all tables in the database. Table dictionaries are not affected.

Process Recalc is always the last thing you do, and you can batch up as many processing operations as you want before issuing that last Process Recalc.

Keep in mind that you can only issue Process Recalc at the database level. This means that after a Process Data operation on a dimension, you may be stuck recalculating a large fact table when it was not impacted by the initial processing operation.


Processing scenarios

I've read a couple of interesting options for loading the tabular model. 


1) Full load
With Process Full you can perform a full load of a database, partition or table. It will drop all of the data and then process the object. This comes in handy for the initial load of the cube, or in case of a structural change of an object in the cube or of the cube itself; in that case you need to redeploy the cube and perform a Process Full.

2) Incremental processing
Two options:
  • 1) Create a new partition and then use Process Data to populate it. This new partition can then be merged with an existing partition if you like.
  • 2) Use Process Add to add data to the existing partition. If used make sure that you don’t add duplicate rows. 
          Two options:
    • Change the partition definition to the query that returns the set of rows you want returned by the incremental processing operation. 
    • If your partition definition is based on a view, make sure that the view on the original data source now returns just the set of rows you want returned by the incremental processing operation.

3) Semi-incremental processing
Yet another option is to create a partition for a year or a month in a table and process that with a Process Data. This limits the data loading significantly; it is simple but effective.

No matter how you do incremental processing, you must Process Recalc on the database afterwards before moving on with your life. If you do this enough, you’ll want to Process Defrag too.
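As a rough sketch, such a batch can be expressed in XMLA (for example generated from the Process dialog in SSMS and then adapted); the database and table IDs below are placeholders for your own model, and in a 2014 tabular database a table shows up as a dimension:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Process>
    <Object>
      <DatabaseID>MyTabularDB</DatabaseID>
      <DimensionID>FactSales</DimensionID>
    </Object>
    <Type>ProcessData</Type>
  </Process>
  <Process>
    <Object>
      <DatabaseID>MyTabularDB</DatabaseID>
    </Object>
    <Type>ProcessRecalc</Type>
  </Process>
</Batch>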


Greetz,

Hennie