Math & Stats for Data Analytics Guidebook
This guidebook is designed to assist professionals and students alike in understanding and implementing key mathematical calculations relevant to data analysis, business intelligence, and statistical research.
💡 Each concept is explained with a step-by-step tutorial across four popular data analytics tools: Excel, SQL, Tableau, and Python.
Table of Contents
Percent of Total
Percentage Difference
Month Over Month Calculations
Year Over Year Calculations
Ratios and Proportions
Year-to-Date Calculations
Cumulative Calculations
Rolling Calculations
Trailing Calculations
Mean, Median, and Mode
Standard Deviation
Range
T-test Statistics
ANOVA
Chi-square Test Statistics
Correlation
Regression
Percent of Total
Percent of Total provides a way to understand the proportion that a part represents out of a whole. This concept is essential in data analysis to compare categories or classes within a dataset.
Step-by-Step Guide for "Percent of Total"
Excel:
Data Setup: Assume you have data in two columns: Category (A) and Value (B).
Formula: In cell C1, enter =SUM(B:B) to calculate the total. In cell C2, enter =B2/$C$1 to calculate the percent of total for the first category, then drag the formula down for the remaining rows.
SQL:
Query Setup: Suppose you have a table sales with columns category and amount.
SQL Query: Use the following query:
Tableau:
Data Import: Load your data into Tableau.
Calculation: Drag your 'Amount' measure to the Rows shelf, right-click on it, go to Quick Table Calculation > Percent of Total.
Python (using pandas):
Data Setup: Assume your DataFrame is named df with columns category and amount.
Calculation:
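A minimal pandas sketch of the percent-of-total calculation, assuming the df layout described above. The sample values and the percent_of_total column name are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({
    "category": ["A", "B", "C"],
    "amount": [50.0, 30.0, 20.0],
})

# Each row's share of the grand total, expressed as a percentage.
df["percent_of_total"] = df["amount"] / df["amount"].sum() * 100

print(df)
```

The shares should sum to 100, which is a quick sanity check on the result.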
How to Proceed:
Verify data integrity and correct format in all platforms.
Visualize or use the results as needed for further analysis or reporting.
Percentage Difference
This metric is used to describe the difference between two values as a percentage of the average of those two values. It is widely used to compare data points in time series analysis, financial reports, and more.
Step-by-Step Guide for "Percentage Difference"
Excel:
Data Setup: Assume you have two columns of data: Previous (A) and Current (B).
Formula: In cell C1, enter the formula to calculate the percentage difference:
SQL:
Query Setup: Suppose you have a table financials with columns year, previous_sales, and current_sales.
SQL Query: Use the following query to compute the percentage difference:
Tableau:
Data Import: Load your data into Tableau.
Calculation: Create a calculated field named "Percentage Difference" with the formula:
Python (using pandas):
Data Setup: Assume your DataFrame is named df with columns previous and current.
Calculation:
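One way to sketch this in pandas, following the definition given at the start of this section (difference relative to the average of the two values). Taking the absolute difference is an assumption made here so the result is symmetric; the sample values and the pct_difference column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({
    "previous": [100.0, 80.0],
    "current": [120.0, 80.0],
})

# Absolute difference divided by the mean of the two values, as a percentage.
average = (df["previous"] + df["current"]) / 2
df["pct_difference"] = (df["current"] - df["previous"]).abs() / average * 100

print(df)
```

If you need the direction of the change instead, drop the .abs() call or divide by the previous value, which gives a percentage change rather than a percentage difference.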
How to Proceed:
Ensure data is correctly formatted and clean for accurate calculations.
Use visualizations to better understand the changes between the two periods.
Month Over Month (MoM) Calculations
Next, we'll cover Month Over Month (MoM) calculations, which are pivotal for tracking trends, identifying seasonal effects, and managing performance by comparing the difference between values from one month to the next.
Step-by-Step Guide for "Month Over Month"
Excel:
Data Setup: Assume you have data in two columns: Date (A) and Value (B).
Formula: To calculate MoM percentage change, enter in cell C2 (assuming headers in row 1):
Drag the formula down to apply it to other cells.
SQL:
Query Setup: Suppose you have a table monthly_sales with columns date and sales.
SQL Query: Use the following query to compute MoM changes:
Tableau:
Data Import: Load your data into Tableau.
Calculation: Create a calculated field named "Month Over Month" with the formula:
This calculates the MoM change and handles potential null values with ZN.
Python (using pandas):
Data Setup: Assume your DataFrame is named df with columns date and sales. Ensure date is in datetime format and sorted.
Calculation:
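A short pandas sketch of the MoM change, assuming one row per month with no gaps. The sample values and the mom_change_pct column name are hypothetical.

```python
import pandas as pd

# Hypothetical monthly data, deliberately out of order.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-01-01", "2024-02-01"]),
    "sales": [99.0, 100.0, 110.0],
})

# Sort chronologically so each row is compared with the previous month.
df = df.sort_values("date").reset_index(drop=True)

# pct_change() computes (current - previous) / previous.
# The first row has no prior month, so its result is NaN.
df["mom_change_pct"] = df["sales"].pct_change() * 100

print(df)
```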
How to Proceed:
Ensure that dates are sorted and data does not have missing months for accurate calculations.
Visualize the results to identify trends and patterns over time.
Year Over Year (YoY) Calculations
Let's delve into Year Over Year (YoY) calculations, which compare the value of a metric on a specific date (or aggregate over a period) to the value of the same metric at the same point in time the previous year. This approach helps highlight trends, growth, and cyclical patterns.
Step-by-Step Guide for "Year Over Year"
Excel:
Data Setup: Assume your dataset includes Date (Column A) and Value (Column B), with dates spanning multiple years.
Formula: In cell C2 (and below), input the following to calculate YoY change:
This formula assumes no leap year adjustments. Adjust accordingly for leap years or use date functions to precisely align dates year-over-year.
SQL:
Query Setup: Consider a table annual_sales with columns date and sales.
SQL Query: This query will compute YoY changes:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation: Create a calculated field "Year Over Year" using:
This formula assumes monthly data and calculates the YoY change from the same month last year, adjusting for nulls.
Python (using pandas):
Data Setup: Assume df is your DataFrame with date (datetime type) and sales.
Calculation: To compute YoY change:
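A possible pandas sketch, assuming monthly data with exactly one row per month and no gaps, so that the row 12 positions earlier is the same month of the previous year. The sample values and the yoy_change_pct column name are hypothetical.

```python
import pandas as pd

# Hypothetical data: 24 consecutive month-start dates across two years.
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=24, freq="MS"),
    "sales": [100.0] * 12 + [110.0] * 12,
})
df = df.sort_values("date").reset_index(drop=True)

# Compare each month with the row 12 periods earlier (same month, prior year).
# The first 12 rows have no prior-year value and come out as NaN.
df["yoy_change_pct"] = df["sales"].pct_change(periods=12) * 100

print(df.tail())
```

For daily or irregular data, a fixed row offset no longer lines up with calendar years; merging the data with a copy shifted by one year on the date column is a safer alternative.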
How to Proceed:
Verify date alignment and consistency in the dataset to ensure accurate YoY calculations.
Use data visualization to effectively communicate the year over year trends and insights.
Ratios and Proportions
Let's dive into Ratios and Proportions, which are fundamental concepts used to compare quantities and measure relative sizes. These calculations can reveal relationships and distributions within your data, making them valuable for in-depth analysis and decision-making.
Step-by-Step Guide for "Ratios and Proportions"
Excel:
Data Setup: Assume you have two columns of data: Numerator (A) and Denominator (B).
Formula: In cell C1, enter the formula for ratio:
To express this as a proportion (i.e., a percentage), modify the formula:
SQL:
Query Setup: Suppose you have a table data with columns numerator and denominator.
SQL Query: Use the following query to compute ratios and proportions:
Tableau:
Data Import: Load your data into Tableau.
Calculation: Create two calculated fields:
Ratio:
Proportion:
Format the Proportion field to display as a percentage.
Python (using pandas):
Data Setup: Assume your DataFrame is named df with columns numerator and denominator.
Calculation:
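A minimal pandas sketch of the ratio and its percentage form. Turning zero denominators into NaN is a design choice assumed here, not something the tutorial prescribes; the sample values and the ratio and proportion_pct column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including one zero denominator.
df = pd.DataFrame({
    "numerator": [1, 3, 5],
    "denominator": [4, 0, 10],
})

# Replace zero denominators with NaN so the division yields NaN instead of inf.
safe_denominator = df["denominator"].replace(0, np.nan)

df["ratio"] = df["numerator"] / safe_denominator
df["proportion_pct"] = df["ratio"] * 100  # the ratio expressed as a percentage

print(df)
```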
How to Proceed:
Validate that the denominator is not zero before performing divisions.
Use visualizations to compare ratios and proportions across different groups or categories effectively.
Year-to-Date (YTD) Calculations
Let's tackle Year-to-Date (YTD) Calculations, a critical metric for assessing cumulative performance from the start of the year up to a specific date. This measure helps in tracking progress against goals and benchmarks throughout the year.
Step-by-Step Guide for "Year-to-Date Calculations"
Excel:
Data Setup: Assume your dataset includes Date (Column A) and Value (Column B), with dates for at least one year.
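Formula: To calculate YTD sum at each row, use:
This formula sums all values in column B that have corresponding dates in column A that are less than or equal to the current row's date. If the data spans more than one year, also restrict the sum to dates on or after January 1 of the current row's year so that the total restarts each year.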
SQL:
Query Setup: Suppose you have a table financial_data with columns date and amount.
SQL Query: This query will compute the YTD sum:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation: Drag your date to Columns and your value measure to Rows. Right-click the value field, go to Quick Table Calculation, and select Running Total. Set 'Compute Using' to Advanced, ensure Date is in the addressing fields, and set the calculation to restart every year.
Python (using pandas):
Data Setup: Ensure your DataFrame df has date (datetime type) and value. The DataFrame should be sorted by date.
Calculation:
How to Proceed:
Make sure your data is sorted by date and properly formatted.
Consider visualizing YTD trends to highlight growth patterns or identify issues quickly.
Cumulative Calculations
Now we'll explore Cumulative Calculations, essential for summarizing data over time and providing running totals that update with each new data entry. This concept is highly useful in financial analysis, inventory tracking, and when monitoring cumulative metrics like sales or customer acquisitions.
Step-by-Step Guide for "Cumulative Calculations"
Excel:
Data Setup: Assume your dataset includes Date (Column A) and Value (Column B).
Formula: To calculate a cumulative total, enter in cell C2:
Drag this formula down along column C to extend the cumulative total down the dataset.
SQL:
Query Setup: Suppose you have a table daily_data with columns date and value.
SQL Query: Use the following query to compute cumulative totals:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation: Create a calculated field called "Cumulative Total" with the formula:
Set the computation to run along the Date field in your view, typically sorted by Date.
Python (using pandas):
Data Setup: Assume df is your DataFrame with date and value, sorted by date.
Calculation:
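A minimal pandas sketch of a running total, assuming the rows are already in chronological order. The sample values and the cumulative_total column name are hypothetical.

```python
import pandas as pd

# Hypothetical daily data, already sorted by date.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "value": [5, 3, 8],
})

# cumsum() adds each value to the sum of all earlier rows.
df["cumulative_total"] = df["value"].cumsum()

print(df)
```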
How to Proceed:
Ensure that the dataset is sorted appropriately to avoid logical errors in running totals.
Implement the cumulative calculations using the described methods.
Visualize cumulative totals to provide insights into total growth or accumulation over time.
Rolling Calculations
Let's move on to Rolling Calculations, most commonly seen as moving averages. These are used to smooth out short-term fluctuations and highlight longer-term trends or cycles. This method is particularly useful in financial markets, weather data analysis, and inventory control systems.
Step-by-Step Guide for "Rolling Calculations"
Excel:
Data Setup: Assume you have data in Column A (Date) and Column B (Value).
Formula: To calculate a rolling average for a specified number of periods (e.g., 3-day), enter in cell C4: =AVERAGE(B2:B4)
Drag this formula down to apply it to other cells as needed. Adjust the range B2:B4 based on the desired window size.
SQL:
Query Setup: Suppose you have a table daily_values with columns date and value.
SQL Query: To compute a rolling average for 3 days:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation: Create a calculated field named "Rolling Average" with the formula:
This calculates the rolling average over the current and the previous two records.
Python (using pandas):
Data Setup: Assume df is your DataFrame with date and value, sorted by date.
Calculation: Use the pandas rolling method on the value column and take the mean. Adjust the window parameter to change the number of periods for the moving average.
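A short pandas sketch of a 3-period rolling average, assuming evenly spaced rows with no missing dates. The sample values and the rolling_avg column name are hypothetical.

```python
import pandas as pd

# Hypothetical daily data, already sorted by date.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="D"),
    "value": [10.0, 20.0, 30.0, 40.0],
})

# Mean over the current row and the two rows before it.
# The first two rows lack a full window, so they are NaN by default.
df["rolling_avg"] = df["value"].rolling(window=3).mean()

print(df)
```

Passing min_periods=1 to rolling() would produce partial-window averages for the first rows instead of NaN.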
How to Proceed:
Ensure that your data is appropriately sorted and that there are no missing dates (fill or handle them as necessary).
Implement the rolling calculations as described, adjusting the window size as needed for your specific analysis.
Utilize visualization tools to demonstrate the effect of smoothing and to analyze trends more clearly.
Trailing Calculations
Next, we'll cover Trailing Calculations, which are similar to rolling calculations but specifically refer to trailing sums, averages, or other statistical measures over a fixed number of previous periods or data points. These are useful for analyzing performance over a specified recent period.
Step-by-Step Guide for "Trailing Calculations"
Excel:
Data Setup: Assume you have data in Column A (Date) and Column B (Value).
Formula: To calculate a trailing sum for a specified number of periods (e.g., 7 days), enter in cell C8: =SUM(B2:B8)
Drag this formula down to continue the calculation for each new row. Adjust the range B2:B8 based on the desired length of the trailing window.
SQL:
Query Setup: Suppose you have a table daily_data with columns date and value.
SQL Query: To compute a trailing sum for 7 days:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation: Create a calculated field named "Trailing Sum" with the formula:
This formula computes the sum of the current and the previous six values.
Python (using pandas):
Data Setup: Assume df is your DataFrame with date and value, sorted by date.
Calculation: Use the pandas rolling method on the value column and take the sum. Modify the window parameter to change the length of the trailing window as required.
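A minimal pandas sketch of a 7-period trailing sum, assuming one row per day with no gaps. The sample values and the trailing_sum column name are hypothetical.

```python
import pandas as pd

# Hypothetical daily data: values 1 through 8 on consecutive days.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Sum of the current row and the six rows before it.
# Rows without seven observations available are NaN by default.
df["trailing_sum"] = df["value"].rolling(window=7).sum()

print(df)
```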
How to Proceed:
Ensure data is sorted chronologically and adequately preprocessed to handle missing values or gaps.
Implement the trailing calculations as described, adapting the window size to your data's specific needs.
Use visualizations to show how trailing sums or averages can provide insights into recent trends or performance metrics.
Mean, Median, and Mode
Let’s move on to the fundamental statistical concepts of Mean, Median, and Mode. These measures of central tendency are critical for summarizing data points with a single value that represents the center of the data distribution.
Step-by-Step Guide for "Mean, Median, and Mode"
Excel:
Data Setup: Assume you have data in Column A.
Calculations:
Mean: In cell B1, enter the formula: =AVERAGE(A:A)
Median: In cell B2, enter the formula: =MEDIAN(A:A)
Mode: In cell B3, enter the formula: =MODE.SNGL(A:A)
Use MODE.MULT for datasets that may have more than one mode.
SQL:
Query Setup: Suppose you have a table data with a column value.
SQL Query: Standard SQL provides AVG for the mean but typically has no built-in median or mode. You can calculate the median using an ordered subquery and the mode using a frequency table approach:
Tableau:
Data Import: Load your dataset into Tableau.
Calculations: Drag the Value field into a worksheet:
Mean: Use the built-in average calculation.
Median: Use the built-in median calculation.
Mode: Tableau does not have a built-in mode function, but it can be calculated using calculated fields or by visualization techniques that might show the most frequent values.
Python (using pandas):
Data Setup: Assume your DataFrame is named df with a column value.
Calculations:
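A short pandas sketch of the three measures. The sample values and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({"value": [1, 2, 2, 3, 7]})

mean_value = df["value"].mean()
median_value = df["value"].median()

# mode() returns a Series because a dataset can have several modes.
mode_values = df["value"].mode().tolist()

print(mean_value, median_value, mode_values)
```

Because mode() can return more than one value, it plays a similar role to MODE.MULT in Excel.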
How to Proceed:
Ensure your data is cleaned and preprocessed for null values which might skew these calculations.
Implement the mean, median, and mode calculations as described above.
Consider visualizing the distribution of your data to further understand where these central tendencies lie and to identify any skewness or outliers.
Standard Deviation
Let's now discuss Standard Deviation, a crucial statistical measure that quantifies the amount of variation or dispersion in a set of data values. A low standard deviation indicates that the data points tend to be close to the mean, whereas a high standard deviation indicates that the data points are spread out over a wider range of values.
Step-by-Step Guide for "Standard Deviation"
Excel:
Data Setup: Assume you have data in Column A.
Calculation: To calculate the standard deviation, enter =STDEV.P(A:A) or =STDEV.S(A:A) in cell B1. Use STDEV.P if you have the entire population or STDEV.S if you have a sample of a population.
SQL:
Query Setup: Suppose you have a table data with a column value.
SQL Query: SQL can calculate standard deviation directly for you:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation: Drag the Value field into a worksheet:
For standard deviation, right-click the field, go to Measure (Sum) > Standard Deviation.
Python (using pandas):
Data Setup: Assume your DataFrame is named df with a column value.
Calculation:
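A minimal pandas sketch showing both the sample and the population standard deviation. Note that pandas defaults to the sample formula (ddof=1); the sample values and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({"value": [2, 4, 4, 4, 5, 5, 7, 9]})

# Sample standard deviation (divides by n - 1); this is the pandas default.
sample_std = df["value"].std()

# Population standard deviation (divides by n).
population_std = df["value"].std(ddof=0)

print(sample_std, population_std)
```

The ddof argument mirrors the choice between STDEV.S (ddof=1) and STDEV.P (ddof=0) in Excel.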
How to Proceed:
Validate that your data does not include outliers or errors that could significantly skew the standard deviation.
Implement the standard deviation calculations as described, choosing between sample or population formulas based on your data context.
Use this measure in conjunction with mean and median to analyze the data distribution effectively.
Range
Moving on, we'll explore the concept of Range, which measures the difference between the highest and lowest values in a dataset. This is a simple measure of dispersion and can provide insights into the spread or variability of the data.
Step-by-Step Guide for "Range"
Excel:
Data Setup: Assume you have data in Column A.
Calculation: To calculate the range, enter in cell B1: =MAX(A:A)-MIN(A:A)
This formula finds the maximum value and the minimum value in column A and computes the difference.
SQL:
Query Setup: Suppose you have a table data with a column value.
SQL Query: SQL can calculate the range by determining the maximum and minimum values:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation: You can calculate the range by creating two calculated fields (Max Value and Min Value) and then creating a third to find the difference:
Max Value: MAX([Value])
Min Value: MIN([Value])
Create a new calculated field "Range" that subtracts Min Value from Max Value.
Python (using pandas):
Data Setup: Assume your DataFrame is named df with a column value.
Calculation:
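A minimal pandas sketch of the range. The sample values and the value_range variable name are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({"value": [3, 9, 4, 15]})

# Range = largest value minus smallest value.
value_range = df["value"].max() - df["value"].min()

print(value_range)
```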
How to Proceed:
Make sure your data is clean and adequately preprocessed to ensure that the range is representative of the dataset.
Implement the range calculations as described.
Consider using this measure alongside other variability and central tendency metrics to provide a comprehensive view of your dataset's distribution.
T-test Statistics
Let's delve into T-test Statistics, a statistical test used to determine whether there is a significant difference between the means of two groups that may be related in certain features. It's commonly used in hypothesis testing to infer whether two populations differ significantly from each other based on sample data.
Step-by-Step Guide for "T-test Statistics"
Excel:
Data Setup: Assume you have two sets of data, Group 1 in Column A and Group 2 in Column B.
Calculation: Excel provides functions to perform a T-test:
For an independent samples T-test, where you compare two unrelated groups, enter: =T.TEST(A:A, B:B, 2, 2)
The third argument (2) specifies a two-tailed test, and the fourth argument (2) indicates equal variances (use 3 for unequal variances).
SQL:
SQL does not typically support T-tests natively. You would need to extract the data and use a statistical package or scripting language like Python or R to perform the T-test.
Tableau:
Tableau does not support conducting T-tests directly through its interface. You would need to calculate necessary statistical measures in Tableau and then use another tool (like R or Python) integrated through Tableau's analytics extensions for conducting the T-test.
Python (using pandas and scipy):
Data Setup: Assume your DataFrame is named df with columns group1 and group2.
Calculation: Run an independent-samples T-test with scipy.stats.ttest_ind. Adjust equal_var based on whether variances are assumed equal or not.
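A short sketch of an independent-samples T-test with scipy, assuming both groups have the same number of observations so they fit in one DataFrame. The sample values are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical measurements for two unrelated groups.
df = pd.DataFrame({
    "group1": [5.1, 4.9, 5.0, 5.2, 4.8],
    "group2": [5.9, 6.1, 6.0, 6.2, 5.8],
})

# equal_var=True runs Student's t-test; equal_var=False runs Welch's t-test.
t_stat, p_value = stats.ttest_ind(df["group1"], df["group2"], equal_var=True)

print(f"t = {t_stat:.3f}, p = {p_value:.6f}")
```

The sign of the t statistic reflects the order of the arguments: a negative value here means the first group's mean is lower than the second's.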
How to Proceed:
Ensure data sampling methods and assumptions of the T-test are correctly met (e.g., independence, normality of the distributions, and homogeneity of variance).
Implement the T-test calculations as described.
Interpret the results: A low p-value (typically < 0.05) indicates strong evidence against the null hypothesis, suggesting a significant difference between the groups.
ANOVA
Next, we'll explore ANOVA (Analysis of Variance), a statistical method used to compare the means of three or more independent groups to determine if there is a significant difference among them. It's particularly useful when testing hypotheses about group differences across multiple categories.
Step-by-Step Guide for "ANOVA"
Excel:
Data Setup: Assume you have data for three groups in three columns, Column A, Column B, and Column C.
Calculation: Use Excel's Data Analysis Toolpak:
Go to Data > Data Analysis. If the button is not visible, install the Toolpak from Excel Options > Add-ins.
Select ANOVA: Single Factor and click OK.
Input the data ranges for your groups and specify the output range where you'd like the ANOVA table to appear. Choose labels and other settings as necessary.
The output will include F-statistic and P-value, which help determine the significance of the differences between the groups.
SQL:
SQL does not support ANOVA natively. Data needs to be extracted and analyzed in a statistical software or programming environment that supports ANOVA, like R or Python.
Tableau:
Tableau does not directly perform ANOVA, but you can visualize and prepare data to inspect variances manually or use R or Python scripts integrated through Tableau's external services connection to perform ANOVA.
Python (using pandas and statsmodels):
Data Setup: Assume your DataFrame is named df with columns group1, group2, and group3.
Calculation:
How to Proceed:
Verify that the assumptions for ANOVA are met: independence within and between groups, homogeneity of variances (homoscedasticity), and normally distributed group populations.
Implement the ANOVA calculations using the detailed steps for your preferred tool.
Interpret the results to understand if the mean differences between any of the groups are statistically significant.
Chi-square Test Statistics
Now, let's explore Chi-square Test Statistics, a test used to determine whether there's a significant association between categorical variables. It's commonly used in market research, A/B testing, and other fields requiring categorical data analysis.
Step-by-Step Guide for "Chi-square Test Statistics"
Excel:
Data Setup: Assume you have a contingency table in range A1:B3, where the rows represent categories and the columns represent outcomes.
Calculation: Use Excel's CHISQ.TEST function:
Enter the observed data in A1:B3 and the expected data in C1:D3.
In a new cell, type: =CHISQ.TEST(A1:B3, C1:D3)
This will give you the p-value to help determine the significance of the association.
SQL:
SQL does not support Chi-square tests natively. You would need to compute the observed and expected frequencies manually and use a statistical package or a scripting language like Python to perform the test.
Tableau:
While Tableau does not directly perform Chi-square tests, you can use it to visually inspect the distribution and calculate observed and expected frequencies. These can then be analyzed in a statistical tool.
Python (using pandas and scipy):
Data Setup: Assume your DataFrame df has categorical variables category and outcome.
Calculation:
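A short sketch of a Chi-square test of independence with scipy, assuming one row per observation. The contingency table is built with pd.crosstab; the sample values are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical observations: 40 rows for category A and 40 for category B.
df = pd.DataFrame({
    "category": ["A"] * 40 + ["B"] * 40,
    "outcome": ["yes"] * 30 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 30,
})

# Contingency table of observed counts (rows: category, columns: outcome).
contingency = pd.crosstab(df["category"], df["outcome"])

# For 2x2 tables, scipy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2 = {chi2:.3f}, p = {p_value:.6f}, dof = {dof}")
```

The expected array holds the counts that would be expected if the two variables were independent, which is what Excel's CHISQ.TEST asks you to supply by hand.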
How to Proceed:
Ensure your data is properly formatted as categorical and the contingency table is correctly set up with observed counts.
Implement the Chi-square calculations as described.
Interpret the p-value: a low p-value (typically < 0.05) suggests a statistically significant association between the variables.
Correlation
Let's proceed to Correlation, which measures the strength and direction of a relationship between two quantitative variables. This is essential for determining how variables are related without assuming a cause-and-effect relationship.
Step-by-Step Guide for "Correlation"
Excel:
Data Setup: Assume you have two series of data, one in Column A and one in Column B.
Calculation: To compute the Pearson correlation coefficient, use: =CORREL(A:A, B:B)
This function will return a value between -1 and 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation.
SQL:
Query Setup: Suppose you have a table data with columns variable1 and variable2.
SQL Query: Some SQL databases can calculate correlations using an aggregate or window function:
Tableau:
Data Import: Load your dataset into Tableau.
Calculation:
Drag one variable to the Columns shelf and the other to the Rows shelf to create a scatter plot.
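Click on "Analytics" > "Trend Line" > "Linear" to add a trend line. The trend line reports R-squared rather than the correlation coefficient itself; for a single predictor, the correlation is the square root of R-squared with the sign of the slope.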
Python (using pandas):
Data Setup: Assume your DataFrame is named df with columns variable1 and variable2.
Calculation: Call the corr method on the DataFrame. This will produce a matrix showing the correlation coefficients between the variables.
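A minimal pandas sketch of the Pearson correlation matrix. The sample values and the corr_matrix and r variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({
    "variable1": [1, 2, 3, 4, 5],
    "variable2": [2, 4, 5, 4, 5],
})

# corr() uses the Pearson method by default and returns a symmetric matrix.
corr_matrix = df[["variable1", "variable2"]].corr()

# Pull out the single coefficient between the two variables.
r = corr_matrix.loc["variable1", "variable2"]

print(corr_matrix)
```

Passing method="spearman" to corr() gives a rank-based alternative that is less sensitive to outliers.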
How to Proceed:
Verify that the data does not contain outliers, as they can significantly affect the correlation.
Implement the correlation calculation using the described methods.
Interpret the correlation coefficient to understand the degree of relationship between the two variables.
Regression
Let’s delve into Regression, specifically focusing on linear regression, which is used to model the relationship between a dependent variable and one or more independent variables. This method is invaluable for prediction and forecasting, and it can support causal analysis when the study design justifies it.
Step-by-Step Guide for "Regression"
Excel:
Data Setup: Assume you have the independent variable in Column A and the dependent variable in Column B.
Calculation:
Use the Data Analysis Toolpak and select "Regression":
Input Y Range (dependent variable): B1:Bn
Input X Range (independent variable(s)): A1:An
Choose the output range and click OK to generate the regression analysis output, which includes coefficients, R-squared value, and more.
SQL:
SQL isn't typically used for performing regression directly. Data should be exported to a statistical software or a programming environment capable of regression analysis like R or Python.
Tableau:
Data Import: Load your dataset into Tableau.
Calculation:
Drag the independent variable to the Columns shelf and the dependent variable to the Rows shelf, so that the dependent variable is on the vertical axis.
Right-click on the view and choose "Trend Lines" > "Show Trend Lines".
Select "Linear" to add a linear regression line to the scatter plot. Tableau will display the equation and R-squared value on the chart.
Python (using pandas and statsmodels):
Data Setup: Assume your DataFrame is named df with columns independent_var and dependent_var.
Calculation: Fit an ordinary least squares model with statsmodels and print its summary. This output will include the regression coefficients, R-squared, and other diagnostic measures.
How to Proceed:
Ensure your data is appropriate for regression (check for linearity, multicollinearity, etc.).
Implement the regression analysis using the described methods.
Use the model’s output to interpret the relationship between variables, predict future values, and evaluate the model’s accuracy (through R-squared, adjusted R-squared).