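Eji @ejiwarp (unlisted, link only)
November 7, 2020
by @statsguyphd. The code can be checked on GitHub. The data is public information from the federal government: https://county.milwaukee.gov/EN/County-Clerk/Off-Nav/Election-Results/Election-Results-Fall-2020 *: It is currently difficult to access; I wonder why.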
Statsguyphd @statsguyphd

I am making these tweets to explain in one place some analysis that was done last night. 1 - I was asked offline about doing Benford's on election data. I explained that this is common and a useful way to detect anomalies in data that are driven by an artificial process (e.g. fraud).

2020-11-05 23:05:47
Statsguyphd @statsguyphd

2 - My student then pointed me towards a tweet that was exploring this type of analysis (but they hadn't done Benford's). So I chimed in.

2020-11-05 23:06:47
Statsguyphd @statsguyphd

3 - However, I did not know what data they used, so I found a source for the context they referenced. I could not initially find write-ins versus non-write-ins, so I looked at candidate counts.

2020-11-05 23:07:46
Statsguyphd @statsguyphd

4 - I then wrote a quick script to gather that data, here is an example of what the data gathering portion of this process looked like. pic.twitter.com/9zfKvJhqU2

2020-11-05 23:08:26
Statsguyphd @statsguyphd

5 - With this data now available to look at in code, I created a process to analyze first digit conformity to the Benford's distribution. This is a test that is often conducted via Chi-squared.

2020-11-05 23:09:22
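The first-digit counting step mentioned in tweet 5 can be sketched as follows. This is a minimal illustration, assuming the vote totals are already available as a list of non-negative integers; the function name `first_digit_counts` and the sample totals are hypothetical and are not taken from the author's script.

```python
from collections import Counter

def first_digit_counts(values):
    """Count leading digits 1-9; zeros are skipped because they have no leading digit."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    return [counts.get(d, 0) for d in range(1, 10)]

# Made-up totals for illustration only (not real ward data).
sample_totals = [120, 45, 0, 311, 19, 7, 1024, 58]
digit_counts = first_digit_counts(sample_totals)
print(digit_counts)
```

The result is a list of nine counts, ordered by digit, which is the shape of the raw numbers reported in tweet 12.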
Statsguyphd @statsguyphd

6 - I wrote the code to produce the Benford's discrete distribution. This code looks like this. pic.twitter.com/hKpAZgAW1b

2020-11-05 23:11:44
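The code in the tweet is only available as an image, so here is a minimal sketch of what producing the Benford first-digit distribution could look like, using the standard formula P(d) = log10(1 + 1/d). The variable name `benford_probs` is an assumption for illustration.

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9.
benford_probs = [math.log10(1 + 1 / d) for d in range(1, 10)]

for d, p in enumerate(benford_probs, start=1):
    print(d, round(p, 4))
```

The nine probabilities sum to 1, with digit 1 expected roughly 30.1% of the time.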
Statsguyphd @statsguyphd

7 - Now that I had the data and the distribution, I simply needed to perform the test. To do that, I leveraged scipy's chisquare. However, prior to doing that, you need to produce the expected result values (not just the percentages). But this is simple.

2020-11-05 23:12:58
Statsguyphd @statsguyphd

8 - To do that, you take the total number of observations (the number of numbers that the first digit counts are derived from) and multiply it by each of the Benford's distribution frequencies. This looks like this: pic.twitter.com/JRkbFvfBcZ

2020-11-05 23:18:12
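As a rough sketch of the step in tweet 8 (the original code is in the image), the expected counts are the total number of observations times each Benford frequency. The observed list below is the Biden first-digit counts quoted in tweet 12 of this thread; the variable names are assumptions.

```python
import math

# Observed first-digit counts (digits 1-9) as quoted in tweet 12.
biden_counts = [86, 35, 52, 69, 79, 62, 42, 28, 22]

n_obs = sum(biden_counts)  # total number of observations
benford_probs = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Expected count per digit = total observations * Benford frequency.
expected_counts = [n_obs * p for p in benford_probs]

for d, (obs, exp) in enumerate(zip(biden_counts, expected_counts), start=1):
    print(d, obs, round(exp, 1))
```

Scaling by the total matters because a chi-squared test compares counts, and the expected counts must sum to the same total as the observed ones.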
Statsguyphd @statsguyphd

9 - The final process, put together, has some additional code to handle data and count the digits from that webpage (comes in 2 parts, first script setup and function definition, then the script on next tweet): pic.twitter.com/2biVrnZ2P3

2020-11-05 23:21:32
Statsguyphd @statsguyphd

11 - In the end, Biden's vote data from that page is far more anomalous than Trump's. Here is what it looks like visually: pic.twitter.com/7qPivR9zQX

2020-11-05 23:23:09
Statsguyphd @statsguyphd

12 - And here are the raw numbers (1 to 9):
Biden: [86, 35, 52, 69, 79, 62, 42, 28, 22]
Trump: [115, 85, 89, 57, 35, 36, 27, 16, 16]

2020-11-05 23:25:41
Statsguyphd @statsguyphd

13 - Here are the respective p-values:
Biden 1.5076774999383611e-27
Trump 0.00048111250713426005

2020-11-05 23:26:40
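For readers who want to rerun the test on the counts in tweet 12, here is a stdlib-only sketch. It is not the author's code: the thread uses scipy's chisquare, whereas this version computes the Pearson statistic by hand and swaps in the closed-form upper-tail probability that exists for an even number of degrees of freedom (9 digit categories give 8). All function names are hypothetical.

```python
import math

def benford_expected(total):
    """Expected first-digit counts (digits 1-9) under Benford's law."""
    return [total * math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_square_stat(observed):
    """Pearson chi-squared statistic against the Benford expectation."""
    expected = benford_expected(sum(observed))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; this closed form is valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# First-digit counts as quoted in tweet 12.
observed = {
    "Biden": [86, 35, 52, 69, 79, 62, 42, 28, 22],
    "Trump": [115, 85, 89, 57, 35, 36, 27, 16, 16],
}

p_values = {}
for name, counts in observed.items():
    stat = chi_square_stat(counts)
    p_values[name] = chi2_sf_even_df(stat, df=len(counts) - 1)
    print(f"{name}: chi2 = {stat:.2f}, p = {p_values[name]:.3g}")
```

Under these assumptions the results should land close to the p-values quoted in tweet 13. Both are small, so the chi-squared test rejects Benford conformity for both candidates at conventional thresholds; the difference between them is one of degree.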
Statsguyphd @statsguyphd

14 - What is notable is the extreme difference in their p-values. The drawback to this analysis is that there is a better test for Benford's goodness of fit. It is the Watson version of the Cramér-von Mises test (U2). You can read about why it is better here (next message)

2020-11-05 23:28:16
Statsguyphd @statsguyphd

15 - Here: Lesperance, M., Reed, W. J., Stephens, M. A., Tsao, C., & Wilton, B. (2016). Assessing Conformance with Benford’s Law: Goodness-Of-Fit Tests and Simultaneous Confidence Intervals. PLoS ONE, 11(3). doi.org/10.1371/journa…

2020-11-05 23:28:25
Statsguyphd @statsguyphd

16 - What is undeniable is that the first digit frequencies of Biden's vote totals are extremely anomalous in comparison to Trump's.

2020-11-05 23:29:04
Statsguyphd @statsguyphd

I should make clear that I am not trying to endorse or refute any political party. I'm analyzing data and I feel we should all be free to analyze data. Ideally I wouldn't even know who the numbers are associated with and I could be free to just analyze them in the blind.

2020-11-06 00:05:37
Statsguyphd @statsguyphd

People are asking for the code not in image form. I was just told I can use this site to do it, so here you go. pastebin.com/YKFyKtbc

2020-11-06 03:55:18
Statsguyphd @statsguyphd

So I am being accused of being:
- a Russian bot
- a fake account
- a QAnon handler (unless I misread that)
- a criminal
- an election interfere
At this point I want to go back to making silly stats jokes that no one laughed at.

2020-11-06 08:57:31
Statsguyphd @statsguyphd

*interferer
Look, I would like to point out that no one has to trust anything and you are strongly encouraged to run your own analysis. That's sort of the point. Learn to code a bit, collect some data, and run some analysis.

2020-11-06 08:59:42
Statsguyphd @statsguyphd

Now I am being called a fascist as well. I am trying to remember the stoics: Weak men are murderous in mobs, cowards bold in the crowd. I will leave this account up, but I will not engage with the hatefulness I am receiving.

2020-11-06 09:05:52
Spotted Toad @toad_spotted

Alright, a less snazzy version of the earlier analysis, & then I'll quit for a while while I'm behind. Here are early & late counts for Trump and Biden in Milwaukee. Trump's are 1 to 1, while Biden doesn't just get more, his totals are all over the place. (Correlations 0.86 & 0.46 respectively) pic.twitter.com/NUylGOlnWk

2020-11-06 09:56:35
Spotted Toad @toad_spotted

You could say that the later counts were shifted towards vote-by-mail, which favors Biden, & so a higher slope of his late counts to earlier counts makes sense. But I'm not sure I know why distribution relative to earlier counts would be so chaotic for him + so regular for Trump.

2020-11-06 09:59:05
Spotted Toad @toad_spotted

Here's another way of looking at the pattern. These are quantile plots which show the mean levels of new votes by early votes. Trump follows the "line of perfect agreement" almost, well, perfectly, while Biden's increases are pretty wild. pic.twitter.com/xKyTImzWC2

2020-11-06 10:04:22
Statsguyphd @statsguyphd

Who knew that with less than 100 lines of code you could make half the country wish you were dead, the other half appreciate math, and approximately 0.00000001% laugh at what you thought were high quality stats jokes.

2020-11-06 10:32:54
Statsguyphd @statsguyphd

Thank you to whoever took this over and developed an interest in the data collection and exploration/analysis. I am glad people are getting interested in learning this stuff!

2020-11-06 11:32:54
Statsguyphd @statsguyphd

I meant to provide this link (someone put this kind of work in Jupyter and began expanding the datasets): github.com/cjph8914/2020_…

2020-11-06 11:34:53
Spotted Toad @toad_spotted

I wonder if the "real" trend line for later vote by mail votes would be like this pic.twitter.com/TCDb8AREXR

2020-11-06 12:30:15
Compiled by
Eji @ejiwarp

I tweet about Vocaloid, PC-related topics, and whatever else catches my interest. I love Miku. Nice to meet you.
