DA-100 Microsoft Power BI – Part 3 Level 2 – Statistical functions
- Measures – an introduction, with standard aggregations including Countblank
In this part of the course, what we’re going to do is have a look at not new columns though we will be using what we’ve learnt in the previous session in the future part of this course, but also new measures. Now we’ve used measures quite a lot in part one of this course. Whenever you drag anything through to a visualization it’s quite often if it’s numerical going to be measure. So previously we had this sales territory and sales amount and we are summing it and that is a measure. A measure is something that you can sum, you can average, you can mean, you can max. It also happens to be something that you can count though you can also count things that aren’t measures. For instance, I can count colors, I can count how many distinct colors there are in the rainbow, so six or seven of them. But there are some limitations as to what we can do with measures.
And what we’re going to do in this video is have a look at the standard measures plus one or two more as they are represented in the DAX language. So here are the standard functions that you probably are very familiar with. Sum, average, count, max and min. So sum adds together the numbers. Average is an arithmetical mean of numbers, so it adds all the numbers together and divides by the total number of numbers. Count counts the number of numbers. Max gives the largest numeric value and Min gives the smallest numeric value. However, there are also these other functions with an A after it. And what these do is not just use numbers but also logical values and on occasions other values. Count a for instance, counts everything that is not empty, so it could be text and dates. But in reality what we’ve got in Power Bi is a very tight version.
We can only have one particular type of field contents in a particular column. We can’t mix up numbers and strings as we saw in the previous version, in the previous section. So a lot of this is actually just going to repeat what we’ve got over here, but let’s find one that doesn’t repeat what we’ve got as the standards and that is count blank. Count blank counts a number of blank cells in a column. And in our previous version, in our previous section we had quantity four plus 4. 99 which calculates, well in this case, anything that’s under five or over ten or equal to seven. We expanded it, but it includes a lot of blanks. So let’s have a look at the quantity 4. 99 and count what we’ve got. So if I go to count, you’ll see that we have got a total of 60,000 rows which are not blank in sales amount. And that’s because sales amount, if I scroll down to sales amount, always has a figure.
When somebody is placing an order, it always has value. But we also have this quantity 4. 99 which doesn’t always contain a value. So let’s count quantity 4. 99 and you can see there are 52,000 values which are numeric. But how many values are not numeric? Well, this is when we use the count blank function. So let’s create a new measure. Now, you can create new measures based on sums and averages and counts and so forth. But how user friendly is that really going to be? Well, maybe a little user friendly. You might want to have a value for total sales amount but quite frankly it just adds to the confusion, adds to the number of fields you’ve got unless you are then hiding sales amount.
So let’s concentrate on newer stuff. So I want to use count blank and I’m going to call this quantity blank or quantity 4. 99 blank equals and it is count blank and you can see the computer autocompleting. And which column are we using? We’re using the quantity 4. 99. So this is how we create a new measure. We put in the measure name and an equal sign and then the formula. So we check that, we commit it and nothing happens. I can’t see it anywhere here in my fields but it is actually there. It’s just that we’re in the Fact Internet Sales or I’m looking in the Fact Internet Sales but it’s actually added it into the Dim product. So why is that? Well, maybe I had Dim products highlighted at one point, I don’t know. What I do know is it’s definitely in the wrong place.
So a better way of creating a new measure is not to click on this new measure right up there, but to actually go into the relevant table, in this case the Fact Internet Sales which is where I want it to be and go to the dot dot dot and go to new measure. So there’s my measure which I’ve previously done, click check to commit it and there it is now in the right place. Another way of making sure it goes into the right table is by doing it in the data because there you only have one table highlighted at once and then if you click new measure we know which table that you’re talking about. Now notice the difference in the icons we have here a calculator as opposed to something that looks more like a spreadsheet or column.
Also notice that when I go into the data it is not there as a separate column because it does not operate as a column, it operates as a summation of a column. So now if I move this into the values, you can now see that the number of blank rows is 7403 and when you add these two together you get the total number. Now because it’s already a measure, you can’t change the summation. So like you can change the summation here, the aggregation. You can’t do that in quantity 4. 99 blank because it’s already a formula. Now as I say, you could use some or average on any of these existing ones, any of these existing fields columns, but it just adds to the amount of clutter that there might be and also reduces the flexibility. Because suppose I was putting in order quantity, I would be forcing it to some order quantity. I can’t count it, I can’t average it, I can’t minute, I can’t max it. Of course, if that’s what you want, great. If you want to restrict the user in just doing a summation, then you can create a new measure and this would be total order quantity equals sum of order quantity.
Now, I just want to point out one small other difference. You remember when we were doing new column and I started typing in a field, it starts to auto complete it. So if I try and put in order quantity into the field, into the calculation, it’s going don’t know what you’re talking about because you have to use a formula first. You can’t just say order quantity plus one or something like that. So we have to go sum of order quantity. So there is our new field and then if you so choose, you can then hide order quantity so that it is not in display for the end user. Now, we’ll be using some in the next video and finding out some of the limitations of some but discovering a whole new set of formula.
- Aggregation of calculations
In the previous video we used the sum function to sum the order quantity. And what I’m going to do is just get rid of everything we’ve got apart from sales territory and add in the total quantity order. And I’ll also add in the original one, which defaults the sum. Just so you can see we’ve got working. Now, suppose I didn’t want this particular value, this particular measure, to be the sum of the order quantity itself. Suppose instead I wanted to have a bit of a goal. So every time somebody orders something, it could be one, it could be several of them, I want a goal of them ordering another one. So why it says one, I want this to be two. Now, we did something similar in the quantity 4. 99 plus one. So we said where a certain criteria happens, then give me one on top. So for every single row I want an additional item, an additional one quantity.
So let’s change this total order quantity from sum of order quantity to sum of order quantity plus one. And we can see that there is a problem. The sum function only accepts a column reference as an argument. So in other words, we can’t just put in an expression, it’s only a single column, so we can’t do that with some. So what we could do, of course, is use a helper column, as we’ve done previously. But a better way is to use an extended form of these aggregation.
And these extended functions have an x at the end. So instead of sum, it’s sum x, average x, count x, count all x, which is count ax max x. So max x has two x’s at the end, median x and min x. And in each of these it takes two arguments, two expressions that we have to put in. The first is the table and then it goes down, in this case, the entirety of the table and does a calculation per row. So let’s see how this works. So instead of sum, we’re putting sum x. So we need the table where the table is going to be the fact Internet sales. Now, we can actually do a calculation to bring a calculated table in. We’ll do that much later.
So you will see we don’t have to actually put the entirety of a table in, but we’ll do so just in this particular instance. So the table is the fact Internet sales and then the expression. Well, we want the expression of the order quantity plus one. So this is our standalone, what we want to do, but we are referencing this for every single row within this table. So no squiggly underlines. I will just change this so that it is our goal. And you can see that we have in fact doubled the values because each order quantity was one, it’s now two. Now, this is similar in Excel to array formulas. So if I put in the values 1234. We can just sum that up very easily. Now, if I wanted each of these values to have one first, I could use a helper function. So here we have 2345, and again, sum that up, make 14.
But what I can’t do in Excel is say, have all of these values and add one each time with computer gods. I don’t know what you’re talking about, but we can do that in Excel using an array formula where it takes each of these as a separate value and does something to it. In this case, add one before passing it to the sum. And to do this in Excel, instead of pressing Enter or Return, you press CTRL Shift and Enter and the computer adds these brackets. You don’t do that yourself, but it gives exactly the same result without the need of this helper column.
And that’s what these extended functions do. So the x pertains to an extension of the original function, where we have a table and then an expression which is done for every single row within the table. So you could do a new column and then you’ve got the possibility of doing a sum on that column. But it does take up additional memory because the computer has to maintain for each row the answer to this. Or you can use one of these x functions, these extended aggregations. It’s a lot simpler in terms of the amount of work that the computer has to do. And it takes up a lot less memory inside the computer. So I would prefer using these x functions rather than a helper column, unless I had a specific reason for using the helper column. For instance, I wanted to check the answer row by row just to see whether it actually makes sense. So these are the extended functions.
- Other statistical functions
Now, a lot of the other statistical functions in DAX are fairly advanced and I don’t want to overwhelm you with this number of functions, especially since some of them are to do with statistics or the inverse of the left tailed probability of the chi squared distribution. But don’t forget if any of these look of particular interest, and we do have hyperlinks in the spreadsheet that will take you to the official Microsoft documentation and they’ll show you things like the full syntax and how you can use it. So add columns, for instance, can be quite useful later on when you are manipulating tables in memory and you want to add a calculated column count rows counts the number of rows in a specified column or in a table expression. So you can see a lot of these are about manipulating tables in memory. Distinct count that counts a number of different cells in a column. So for instance, suppose we had a column with a lot of ones, lot of twos, lot of threes, or distinct count would then give you the number three being the number of different types, different results. However, you can do that in power bi itself. So it’s more useful as part of another function, I think percentiles.
So this can be useful if you need to know, say, what the 50th percentile of a particular range is. And you’ll notice some of these are percentile X functions. So these are probably more useful, say, if you wanted to find out the most common percentile, rank x. Again, an extended function, more useful in measures, but very complicated, as you can see from the number of arguments that it takes. Raw function, not what you might think. It doesn’t give you the raw number, it returns tables with a single row. And as king, probably quite advanced sample function gives you a sample number of roles.
It’s similar to top n, except top n gives you the top, say, ten number of roles, whereas sample will give you a random sample. Now, we’ve also got some trigonometric functions like syn or sine and Sineh. But you can see the majority of these functions are fairly advanced. So if you do want to find out more about them, please use the hyperlink. In ending this particular part of the section, I do want to talk about one more function and that is the rank EQ function. So this returns the ranking of a number in a list of numbers. So let’s say we want to find out in this sales amount, where does 499 fit into everything? Well, we can introduce a new calculator column. Is it top, is it bottom? So here we have the sales amount. Rank equals rank dot EQ EQ means equal. So when more than one value is the same rank, the top rank will be returned as opposed to the average rank. So what value are we testing? Where we want to test the sales amount what is the column name that we are testing that particular value in? Well, it also happens to be the sales amount.
So the fact that these first two are identical don’t treat them as actually referring to the same thing. We’re looking for the sales amount on this particular row and we’re looking at the sales amount in the entirety of the column. And then the order is it ascending or descending. So we’re going to say descending, so that the biggest amount gives a number one. So if I get that, you’ll see that all of these four pound 99 amounts are ranked 46,001st, whereas if we look at the very top amount, we will see that that is for 3578 pound 27. So rank EQ, as I say, probably one of the more useful functions that we haven’t looked at in the statistical functions, it gives you the ranking of a number in a list of numbers. So the first one is the value in that particular row and then the second argument is the column name and quite often they will be the same.
And finally, just one word about simplification. We are in the fact internet sales table. So we have got a function here which refers to the table fact Internet sales and then the column sales amount, but we’re already in the Fact Internet sales. So the computer gives this by default. But if you prefer to just simplify the formula, if you’re in a table and you’re phone to the same table, you can get rid of the reference to a table. So if you think that looks simpler, then feel free to omit the table reference. Now, personally, I generally keep the table reference unless it’s really getting a very complicated formula or it’s fairly obvious what I’m referring to. So the choice is yours. If you are in the table and you’re referring to another column in that table, you can get rid of the table name.
So statistical functions, there’s quite a lot of them, they’re fairly advanced, the remaining ones that we’ll looking at. But if you want to know more about any particular function, then just click on the Hype link and it will take you to the documentation.