Sorry for missing this Tuesday; I had to spend more time learning the statistics than I was expecting to.
In my last post, I talked about using the Ace:Error ratio to identify whether servers have a suboptimal shape. This approach is attractive because the statistics needed to calculate it are relatively easy to acquire. I didn’t have to use VolleyMetrics or have someone other than the official statisticians at each match take statistics. If you’re coaching at the club or high school level, where there is no official statistician, there is no interpretive work required of whoever is keeping the stats; all they need to do is count aces, errors, and attempts and track hold serve percentage. But it’s not as granular as we might like. With better data, a more granular picture of an athlete’s shape is possible, and that more granular picture can influence how we practice and what approach we develop for games. In this post, I’ll talk about statistical significance in creating athlete-level data.
When we run a statistical test, we assume a null hypothesis. For a simple test, that might be “the coin I’m flipping is fair.” The p-value produced by the test is the probability that we would see results at least as extreme as the ones we observed if the null hypothesis were true. A statistical test can’t say “that’s an unfair coin.” A statistical test can say “if that’s a fair coin, there’s only a 5% chance that you would flip it that many times and get a proportion of heads under .4.” Before running a statistical test, you might decide on a significance level. This is a threshold p-value: any value under it is one you are treating as significant. For example, if you pick a .05 significance level, run your test, and get a p-value of .045, you might say “I have good reason to think this coin is unfair.”
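To make that concrete, here’s a minimal sketch of the coin test in Python, using only the standard library. The function name and the 19-heads-in-50-flips result are made up for illustration.

```python
from math import comb

def binom_p_value(heads: int, flips: int, fair_p: float = 0.5) -> float:
    """One-sided p-value: the chance of seeing this many heads or fewer
    in `flips` tosses if the null hypothesis (the coin is fair) is true."""
    return sum(comb(flips, k) * fair_p**k * (1 - fair_p)**(flips - k)
               for k in range(heads + 1))

# Hypothetical result: 19 heads in 50 flips (a proportion of heads under .4).
p = binom_p_value(19, 50)
print(f"p-value: {p:.3f}")  # compare against the significance level you chose, e.g. .05
```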
When running a statistical test, you can make two types of errors: false positives and false negatives. A false positive is when you say “I have good reason to think this coin is unfair” when the coin is actually fair. A false negative is when you say “I lack a good reason to think this coin is unfair” when the coin is unfair.
We can plan ahead of collecting a sample to know how large that sample needs to be. Let’s shift examples to volleyball. Recall from a previous post that shape of production is the share of the total made up by each possible result. We’re going to use an experiment to generate an image of an athlete’s shape of production. We’ll design that experiment using some statistical principles and then analyze the results with some statistical testing to find out what we can learn from the data.
I foresee data collection happening in practice, for a couple of reasons. First, it’d be nice to have the information in preseason and be able to take it into early games. Second, reps come slowly in games; it might take a whole season to get a good chunk of data, and by that point it’s not very useful. I see the experiment going like this: we instruct athletes to think of three serving approaches, a 2 out of 5, a 4 out of 5, and a 5 out of 5. Whatever that means to them is great. We’ll then have them serve some number of balls with each approach against a serve-receive formation on the other side, stat the results, and then compare. Then we can use those representations of an athlete’s shape to choose which approach we’d like from them in a game. Are we playing against a team that sides out well, in a 3-hitter rotation, while our setter is front row? Maybe we tell them to serve a 5. Are we in the reverse situation, where they’re in a 2-hitter rotation and we’re in a 3-hitter rotation with our best blocking middle? Maybe we tell them to serve a 3. Maybe some athletes should serve a 4 in the first situation, and maybe some should serve a 2 in the second. Either way, we’d have the data to help them deliberately pick an approach in a game.
To design our experiment, we need to know what data we’ll be collecting and how large a sample we’ll need to generate statistically significant results. For data, we’ll stat the first result of the serve: ace, overpass, 3 pass, 2 pass, 1 pass, or error. For our sample size, we need to know our significance level (p-value), the approximate proportion that our target result will make up, and the size of change we want to be able to detect.
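For the collection side, nothing fancy is needed. Here’s a minimal sketch, in Python, of tallying first-contact results and turning them into a shape; the outcome labels and counts are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical first-contact tallies for one athlete serving one approach.
results = Counter({"ace": 3, "overpass": 2, "3": 9, "2": 18, "1": 10, "error": 5})

total = sum(results.values())
shape = {outcome: count / total for outcome, count in results.items()}
print(shape)  # the athlete's observed shape of production for this approach
```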
For significance level, we can start at .05; that’s standard. For approximate proportions, we can look at Chad Gordon’s data, which had this distribution:
Ace proportion: .064
Overpass proportion: .042
1 proportion: .207
3 proportion + 2 proportion: .575
Error proportion: .104
(These numbers do not add up to 1.0; Gordon’s data only has two or three significant digits, so it doesn’t come out perfectly.)
To make this compatible with the data I want to collect, I’ll split out the 3 and 2 passes by making stuff up and saying that ⅔ of the 3+2s are 2s. So we end up with:
Ace proportion: .064
Overpass proportion: .042
1 proportion: .207
2 proportion: .380
3 proportion: .195
Error proportion: .104
Now we need an effect size that we’d like to detect for each. This will also be somewhat arbitrary. Remember that we’re just calculating sample size right now and that once we get our data we’ll run tests on it to see if we’ve found anything significant. Recall that our season total opponent side out proportion was .565. In Gordon’s sample the total was .5725. That’s a difference of .075. I figured that this was as good an indicator of what magnitude of change we could expect as any. .075/.565 is .133. So I’ll multiply each proportion by .133 to find out the effect size we’re looking for.
Ace effect size: .0085
Overpass effect size: .0056
1 effect size: .028
2 effect size: .051
3 effect size: .026
Error effect size: .014
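As a sanity check, here’s a small sketch that reproduces these effect sizes from the baseline proportions and the .133 multiplier discussed above; the dictionary names are mine, and the printed values match the list above up to rounding.

```python
# Baseline proportions (adapted from Gordon's data, with the 3/2 split estimated above).
baseline = {"ace": 0.064, "overpass": 0.042, "1": 0.207,
            "2": 0.380, "3": 0.195, "error": 0.104}

# Relative change we'd like to be able to detect, from the side-out comparison above.
relative_change = 0.133

effect_sizes = {outcome: p * relative_change for outcome, p in baseline.items()}
for outcome, effect in effect_sizes.items():
    print(f"{outcome}: {effect:.4f}")
```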
The formula for sample size for an experiment where the data are proportions of a population made up by mutually exclusive results is:

n = Z² × p(1 − p) / ω²
Here n is the sample size, Z is the Z-value for your chosen p-value (it gets squared in the formula), p (which is not your p-value) is the approximate proportion of your dataset that your target result will make up, and ω is the size of change you would like to detect.
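Here’s a minimal sketch of that formula in Python; the function name and signature are mine, not anything standard.

```python
def sample_size(z: float, p: float, effect: float) -> float:
    """Serves needed to detect a change of `effect` in a result that makes up
    proportion `p` of outcomes, using the Z-value `z` for your chosen p-value."""
    return z**2 * p * (1 - p) / effect**2
```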
To find our sample size for overpasses, we can plug values into the right side: 1.96 is the Z-value for a p-value of .05, .042 is the overpass proportion, and .0056 is the effect size we discussed earlier. That gives 1.96² × .042 × (1 − .042) / .0056², which comes out to roughly 4,900 serves.
To find our sample size for 2 passes, we can do the same: 1.96² × .38 × (1 − .38) / .051² comes out to about 348 serves.
That’s a lot of serves. Especially when we’re talking about A/B testing 3 different approaches, that’s over 1,000 serves for our low-end estimate.
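Continuing with the sample_size function and the baseline and effect_sizes dictionaries from the sketches above, here’s the same calculation run across all six outcomes; the numbers are my own arithmetic from the proportions and effect sizes listed earlier.

```python
# Per-approach sample sizes at a .05 p-value (Z = 1.96).
for outcome in baseline:
    n = sample_size(1.96, baseline[outcome], effect_sizes[outcome])
    print(f"{outcome}: about {n:.0f} serves per approach")
```

The 2 pass needs the fewest serves of any outcome, which is why it’s the low-end estimate; the rare outcomes, the overpass and the ace, need by far the most.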
We might need to make some compromises on the strength of our results to make the experiment doable. I think the place to start is with our p-value. A .05 p-value is the industry standard, but that’s not necessarily a very strong reason to use it. It means that we’ll have a 5% chance of saying two approaches are different when they’re actually the same. We could go higher and still get meaningful data.
Let’s try the low-end (2 pass) estimate again with .1 and .2 p-values. With a .1 p-value (Z = 1.645), we need about 245 serves per approach, or about 735 across three approaches; with a .2 p-value (Z = 1.28), we need about 148 per approach, or about 444 total.
The 444 serves for a .2 p-value are much more manageable than the 1,000+ for a .05. It’s still a lot, especially if you don’t want your athletes to take 100+ swings in a practice (you probably shouldn’t), but it’s probably doable in two weeks.
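If you want to try other significance levels, here’s a small sketch that derives the two-sided Z-value directly from the p-value (matching the 1.96, 1.645, and 1.28 used above) and reruns the low-end estimate with the sample_size function from earlier.

```python
from statistics import NormalDist

def z_for_p_value(p_value: float) -> float:
    """Two-sided Z-value for a significance level (e.g. .05 -> about 1.96)."""
    return NormalDist().inv_cdf(1 - p_value / 2)

# Low-end (2 pass) estimate at a few significance levels, times three approaches.
for alpha in (0.05, 0.10, 0.20):
    n = sample_size(z_for_p_value(alpha), 0.380, 0.051)
    print(f"p-value {alpha}: about {n:.0f} per approach, about {3 * n:.0f} total")
```

(Using the exact Z-value of 1.2816 rather than the rounded 1.28 puts the .2 estimate a few serves above 444; the difference doesn’t matter in practice.)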
Next post I’ll talk about balancing limited time in practice with a need for good data, and building an approach with athletes from data about their shape.
P.S. Does anybody know if we should be pluralizing “side out” as “sides out,” à la “attorneys general”?
Next post: Serving Series Part 7: What a Data-Driven Approach Can Do, What It Can't, and How to Trust It
Special thanks to the sites that taught me about statistics.
Also special thanks to DALL-E mini for providing this post's image.