We have talked a lot recently about Data Modelling. The first video of Geordie Intelligence’s Back2Basic Series even goes through the first stages.

Power BI 101: Data Modelling

The Query Editor is where you do much of your modelling, but this week let us look at the main part of modelling that we do outside of the Query Editor. What do we do after the ETL process is complete?

Relationships are what make analysis possible beyond a certain scale. Looking at the Superstore dataset used in the Back2Basics series, we are starting from a single data table, or "Flat File". The sheet has fewer than 10,000 rows of data, so it could easily be used in this format without any of the fancy modelling Geordie Intelligence took you through. So why bother? In a real-world dataset, the number of rows could easily be 10, 100, or even 1,000 times greater. At that point, evaluating everything from a simple Flat File becomes impractical for several reasons relating to processing efficiency. Efficiency is achieved through modelling, and the purpose of modelling is to group related attributes together into tables. What does that all mean?

Looking back at the Superstore data, four columns relate to the Product being purchased: Product Name, Sub-Category, Category and ProductID. So what does that mean? To a customer, Category and Sub-Category may help them find the Product Name they want to purchase. A Sales Manager may wish to know Sales by Category or Sub-Category. A Warehouse Manager may want to know how much of each Product Name gets sold over time. No one seems to care much about ProductID, so why do we have it? ProductID is, in fact, a Key: it forms a unique identifier for all attributes of a Product. Within the context of the sales report, we only have a small number of additional product-related attributes: name, category and sub-category. In a real-world organisation, additional values would be added over time, from simple ones related to the sale of these items to more complex ones related to procurement or shipping criteria (mass, volume, etc.). It becomes clear that ProductID is the identifier for our Products table, which, at our small and straightforward report level, holds just the Name and the hierarchy information. The unique identifier picks out a single row of product data that can then be matched against the other data. ProductID and Product Name should, in essence, be equally unique; this is common in modelling: the Name is a "friendly" description, while the ID is an identifier better suited to the data structure.
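To make that concrete, here is a minimal sketch of carving a Products table out of the flat file. It uses Python and pandas rather than Power Query, and the file name and column headers ("Product ID", "Product Name" and so on) are assumptions about the Superstore export rather than guaranteed names:

```python
import pandas as pd

# Load the flat Superstore export (file name and headers are assumptions).
superstore = pd.read_csv("superstore.csv")

# One row per ProductID: the key, the "friendly" name, and the
# Category / Sub-Category hierarchy it sits under.
dim_product = (
    superstore[["Product ID", "Product Name", "Category", "Sub-Category"]]
    .drop_duplicates(subset="Product ID")
    .reset_index(drop=True)
)

print(dim_product.head())
```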

The remainder of the table will comprise attributes related to those two abstract columns, ProductID and Product Name. The table describes something related to a physical entity: a product exists, and one could pick it up. Tables like these are called Dimensions, and they give us a way of analysing the interactions between the physical world and those entities. Those interactions are called Facts. So, we have laid the groundwork for our models.

Dimensions

Let us start by calling out the exceptional Dimension, the one that is used most often and is not physical: Time. Time dimensions are almost always needed, although in the context of analysis they are typically just Dates. In some ways, this oddity simplifies exactly what goes on with Dimensions. While a Day is not a physical object, it is a clearly defined term for us to work with. We all understand what Today means, or Tomorrow, or the 3rd of November 2024. So, we can say that Dimensions are based on discrete things: a Day, a product name, a customer or a building.
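A Date dimension shows this nicely because it can be generated rather than extracted. Here is a rough pandas sketch; the date range and attribute columns are illustrative choices, not part of the Superstore data:

```python
import pandas as pd

# Generate a Date dimension covering an assumed reporting window.
dim_date = pd.DataFrame({"Date": pd.date_range("2013-01-01", "2014-12-31", freq="D")})

# Each day is a discrete, clearly defined thing with its own attributes.
dim_date["Year"] = dim_date["Date"].dt.year
dim_date["Month"] = dim_date["Date"].dt.month_name()
dim_date["Day of Week"] = dim_date["Date"].dt.day_name()

print(dim_date.head())
```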

We have to have a way of identifying these discrete items while also being aware that the identity does not have to point to a single physical instance. For example, we could have 100 instances of the same product sitting on our warehouse shelves, but we would not have 100 different ProductIDs for them; we would have one, since the purpose is to understand how that product (name) sells.

Let us consider a sale. When a product is sold, it is sold as part of an order, to a customer, and shipped to an address on a specific date. Each of these, the product, the order, the customer, the address and the date, is a dimension. Each dimension stores the categorical information that relates to those instances, and each has an Identifier, a "name" and attributes. When dimensions come together, they form an abstract entity, a Fact, so a row of a Fact table consists of multiple ID columns, one for each dimension, every time that entity occurs.
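As a sketch of that structure, the fact table below keeps only the dimension identifiers plus the numbers describing each sale. Again this is pandas rather than Power Query, and the column names are assumed from the Superstore export:

```python
import pandas as pd

superstore = pd.read_csv("superstore.csv")  # assumed file name

# A fact row is mostly identifiers pointing back at dimensions,
# followed by the numeric values describing what happened.
fact_sales = superstore[
    [
        "Order ID", "Order Date", "Customer ID", "Product ID",  # dimension IDs
        "Sales", "Quantity", "Discount", "Profit",              # measures
    ]
]

print(fact_sales.head())
```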

The Identifier column within a dimension is known as a Primary Key.

Facts

When the Superstore data pulls the primary keys from the dimensions together, it adds additional information: sales amount, quantity, discount percentage, and profit amount. The resulting row of the Fact table is then made up of several Dimension Primary Keys; however, calling them Primary Keys would be confusing, as we know that Primary Keys must be unique, and we would hope to sell a product more than once or to have more than one line item in an order. Therefore, in the interests of clarity, when they are combined in another table that needs to cope with multiple instances of the same ID, they are referred to as "Foreign Keys". The remainder of your fact table is then the typically numeric data relating to the occurrence of the "Fact": Sales amount, Quantity, Discount % and Profit amount. You typically seek to aggregate these numeric values for your analysis, normally against a dimension Foreign Key or Keys. For example, Sales of the product with ProductID TEC-PH-10001949 = $1,000, or the Quantity of items sold on the 4th of July 2013 is 345.
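Those aggregations look something like the sketch below. The product ID and date are the illustrative examples above, the column names are assumptions about the Superstore export, and the $1,000 and 345 figures are examples rather than values the code is guaranteed to return:

```python
import pandas as pd

superstore = pd.read_csv("superstore.csv", parse_dates=["Order Date"])  # assumed headers
fact_sales = superstore[["Order Date", "Product ID", "Sales", "Quantity"]]

# Sales aggregated against the Product foreign key...
sales_by_product = fact_sales.groupby("Product ID")["Sales"].sum()

# ...and Quantity aggregated against the Date foreign key.
quantity_by_date = fact_sales.groupby("Order Date")["Quantity"].sum()

# Look up a single product or a single day.
print(sales_by_product.get("TEC-PH-10001949"))
print(quantity_by_date.get(pd.Timestamp("2013-07-04")))
```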

Results

When we understand that we use the dimensions to filter the column values to be aggregated, and that those values are drawn from a fact table, we start to understand the principles of data modelling. The purpose of ETL is to prepare the tables that will become fact and dimension tables, with the relationships between them established in the Relationships pane of Power BI. These relationships allow us to filter the fact table on multiple foreign keys: not just sales of product TEC-PH-10001949, but those sales on the 4th of July 2013. We could look at products in the "Technology" category sold in July 2013, then compare that against different months or categories.
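In Power BI the relationships do that filtering for you. Outside the model, the same idea is a join from the fact table to each dimension on its foreign key; a rough pandas sketch, using the same assumed column names as above:

```python
import pandas as pd

superstore = pd.read_csv("superstore.csv", parse_dates=["Order Date"])  # assumed headers

# Dimensions carved out of the flat file, as earlier in the article.
dim_product = superstore[["Product ID", "Category", "Sub-Category"]].drop_duplicates("Product ID")
fact_sales = superstore[["Order Date", "Product ID", "Sales", "Quantity"]]

# "Relationship": join the fact table to the Product dimension on its foreign key,
# then filter on a dimension attribute and on the date.
technology_july_2013 = (
    fact_sales.merge(dim_product, on="Product ID")
    .query("Category == 'Technology'")
    .loc[lambda df: (df["Order Date"].dt.year == 2013) & (df["Order Date"].dt.month == 7)]
)

print(technology_july_2013["Sales"].sum())
```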

Success in modelling relies on understanding this interplay. It also explains why you may need to rework relationships to achieve the required analytical results. In the Week One video, there is a deliberate mistake where the customer and address are linked together as a single dimension. If every customer only had a single address, this would work. However, we know that over time people may move house, or may be corporate customers with multiple addresses. These two entities are handled this way in the training videos for training purposes, which may not be the best way to treat them if you are building this for yourself.

Watch the videos and follow along; share your experiences with relationships and what you feel we should do to correct the relationships in our model.

Geordie Consulting uses tools from the Microsoft Power Platform to significantly enhance the data modelling process. Organisations can achieve higher levels of data consistency and accuracy by integrating data from diverse business applications into a cohesive and manageable platform. Services include enabling users to generate custom reports using Power BI and Excel, leveraging Semantic Models for uniform data analysis, and enhancing user autonomy by implementing super-user communities to support self-service and promote agility. Comprehensive process documentation and ongoing support are critical in maximising the business value derived from data, ensuring that the insights gained are actionable and aligned with the organisation's strategic objectives. These practices foster a data-driven culture that supports informed decision-making and drives business success.