Simple stats actions translated from Excel to Python
Updated on 2020-11-08.
There are two foundational ideas that are driving me to carve out some time to sit down and right this today:
This idea that writing technical blog posts is in many ways writing documentation for you to reference in the future.
It amazes me that in 2020 this wasn’t easier to find.
I recently started my MBA program, and one of the first required courses is a statistics class. Coursework starts in probability and moves on to typical stats subjects. Almost half way in, we’ve covered probability, probability distributions, sampling, confidence intervals, and hypothesis testing.
My course, like most MBA courses, presents the theory behind these things before presenting how to do, and those presentations are done in Excel. Let me tell you- I haven’t used Excel since… 2015? Maybe? (When I took this class in undergrad, we only used R, but R is no longer my primary language.)
Excel is a tool, but it’s never the tool that I go for. The same is true for Sheets. I do use the Google Suite and I’ve used GSheet for work occasionally, but I still almost always find myself gravitating towards a Jupyter notebook or something better than a spreadsheet to do the thing whatever that may be.
I can see, though, why my professors proclaim Excel as the way to go. For non-engineering folks, you can:
Open software and use a GUI
Lots of internet tutorials and help videos
For most people, these are probably the tools they’ll be using at work to do the job
In that case, I definitely see the value, but for me, Excel brings up a series of bad memories:
No version control
Difficult to parse formulas
Poorly organized or presented information that is not easy to understand or read
I’m an engineer, give me engineering tools!
I also like the narrative ability that comes with moving from top to bottom as I move through a notebook. I can take a word problem, translate it to a bunch of comments, and use those to drive what comes next.
This also makes it easier to return to later on, if I want to revisit the problem. I don’t need to redecipher what the word problem is saying, especially because in my notes, I delineate between information that is given versus information that is calculated or assumed.
While my professor teaches how to do these things in Excel, if I wanted to learn how to do it in a notebook, I needed to figure out how to do it on my own. I knew I probably wanted to use SciPy, but I didn’t know much more.
Below my cheatsheet on converting Excel formulas into SciPy formulas.
Imports needed for a notebook
Probability
Complements: P!(A) = 1-P(A)
Addition: P(A or B) = P(A) + P(B) - P(A and B)
Mutually Exclusive: P(A and B) = 0
Conditional: P(A|B)= P(A and B) / P(B)
Multiplication: P(A and B) = P(A|B) * P(B)
Independence: P(A and B) = P(A) * P(B)
Probability Distributions
Binomial Distribution
Discrete
Success or failure
n = trials
π = probability of success
mean = n * π
standard deviation = √n*π(1-π)
Notes on Excel formulas:
r = target number, see Purpose
n = trials
π = probability of success
Notes for mapping SciPy to Excel:
k = r
n = n
p = π
Normal Distribution
Bell shaped curve
µ = mean
σ = standard deviation
Notes on Excel formulas:
c (or d) = target number, see Purpose
µ = mean
σ = standard deviation
Notes for mapping SciPy to Excel:
x = c (or d)
loc = µ
scale = σ
q = prob
Standard Normal Distribution is just a normal distribution where the mean = 0 and the standard deviation = 1.
Statistical Inference
Confidence Intervals (Estimation)
µ is unknown
use x̄ (aka xbar)
interval is about x̄
Hypothesis Testing
some belief about µ leads to the H_0 (aka null hypothesis)
observe x̄ and ask how unlikely it is
Confidence Intervals
For calculating the test statistic:
number to enter into the formula = (1-confidence interval / 2) + confidence interval
95% Confidence Interval =
NORM.S.INV(.975)
degrees of freedom = n - 1
Notes for mapping SciPy to Excel:
q = number to enter into the formula = (1-confidence interval / 2) + confidence interval
loc = µ
scale = σ
df = df
Specified Margin of Error (MoE)
MoE = margin of error
n = number of trials needed
use π = .5 for most conservative estimate
Hypothesis Testing
Steps:
Determine H_0 and H_a
Specify value of α
Calculate the test statistic (zobs or tobs)
Draw a picture
Use the test statistic to calculate the p-value
Compare the p-value to α; Reject H_0 if p-value < α
Translate to words
Putting it together
Confidence interval for means when σ is known
[to come]
Confidence interval for means when σ is unknown
s = sample standard deviation
confidence_interval = confidence interval
n = trials or sample size
xbar = x̄ or mean of the sample
Sample conclusion: “There is a 98% probability that a confidence interval calculated in this way captures the true population average claims per employee.”
Confidence interval for proportions
n = trials or sample size
confidence_interval = confidence interval
success = number correct
Sample conclusion: “”
Hypothesis testing for means
mu = mean of the population # what we’re testing
n = trials or sample size
xbar = x̄ or mean of the sample
s = sample standard deviation (used when σ unknown)
alpha = signficance (e.g. 0.05 alpha = 5% significance)
Hypothesis testing for proportions
alpha = signficance (e.g. 0.05 alpha = 5% significance)
success = number of successes
n = trials or sample size
mu_p = mean of the proportion
sigma_p = standard deviation of the proportion
proportion = success/n
Typing Shortcuts
π = option+p
√ = option+v
∑ = option+w
µ = option+m
≤ = option+,
≥ = option+.
± = shift+option+=
σ requires symbol font
x̄ requires symbol font
α requires symbol font