I am making these tweets to explain in one place some analysis that was done last night.
1 - I was asked offline about running a Benford's law analysis on election data. I explained that this is common and a useful way to detect anomalies in data that are driven by artificial processes (e.g. fraud).
2020-11-05 23:05:47
2 - My student then pointed me towards a tweet that was exploring this type of analysis (but they hadn't done Benford's). So I chimed in.
2020-11-05 23:06:47
3 - However, I did not know what data they used, so I found a source for the context they referenced. I could not initially find write-ins versus non-write-ins, so I looked at candidate counts instead.
2020-11-05 23:07:46
4 - I then wrote a quick script to gather that data. Here is an example of what the data gathering portion of this process looked like. pic.twitter.com/9zfKvJhqU2
2020-11-05 23:08:26
5 - With this data now available to look at in code, I created a process to analyze first-digit conformity to Benford's distribution. This test is often conducted via chi-squared.
2020-11-05 23:09:22
6 - I wrote the code to produce Benford's discrete distribution. This code looks like this. pic.twitter.com/hKpAZgAW1b
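The code in the tweet above was posted as an image, so what follows is only a minimal sketch of one way to produce Benford's first-digit distribution, not a transcription of that script. It assumes the standard formula P(d) = log10(1 + 1/d) for d = 1..9; the function name `benford_first_digit_probs` is made up for illustration.

```python
import math

def benford_first_digit_probs():
    """Return P(first digit = d) for d = 1..9 under Benford's law."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]

probs = benford_first_digit_probs()

# Print each digit next to its expected relative frequency.
for digit, p in zip(range(1, 10), probs):
    print(digit, round(p, 4))
```

The nine probabilities sum to 1 because the logarithms telescope to log10(10).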
2020-11-05 23:11:44
7 - Now that I had the data and the distribution, I simply needed to perform the test. To do that, I leveraged scipy's chisquare. However, before doing that, you need to produce the expected counts (not just the percentages). But this is simple.
2020-11-05 23:12:58
8 - To do that, you take the total number of observations (number of numbers that the first digit counts are derived from) and multiply them by the Benford's distribution frequencies accordingly. This looks like this: pic.twitter.com/JRkbFvfBcZ
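As a rough illustration of that step (again a sketch, not the code from the image), the expected counts are the Benford probabilities scaled by the number of observations. The helper name `expected_counts` is hypothetical; the example total of 475 is the sum of the Biden digit counts quoted further down in the thread.

```python
import math

def expected_counts(n_observations):
    """Scale Benford's first-digit probabilities up to expected counts."""
    return [n_observations * math.log10(1 + 1 / d) for d in range(1, 10)]

# 475 is the sum of the Biden first-digit counts listed later in the thread.
expected = expected_counts(475)

for digit, count in zip(range(1, 10), expected):
    print(digit, round(count, 2))
```

The expected counts are generally not whole numbers, which is fine for a chi-squared test as long as they sum to the same total as the observed counts.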
2020-11-05 23:18:12
9 - The final process, put together, has some additional code to handle data and count the digits from that webpage (comes in 2 parts, first script setup and function definition, then the script on next tweet): pic.twitter.com/2biVrnZ2P3
2020-11-05 23:21:32
10 - And the rest of that script: pic.twitter.com/di7qGkfBSX
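The full script is only available as images, so the digit-counting step is sketched here under assumptions: the vote totals are taken to be non-negative integers, and zero totals are skipped because they have no leading digit from 1 to 9. The function name and the sample values are hypothetical and are not the thread's data.

```python
from collections import Counter

def first_digit_counts(values):
    """Count leading digits 1..9 of positive integers; zeros are skipped."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    return [counts.get(d, 0) for d in range(1, 10)]

# Hypothetical vote totals, for illustration only (not the thread's data).
sample_totals = [120, 87, 0, 345, 19, 1002, 56, 230, 9]
digit_counts = first_digit_counts(sample_totals)
print(digit_counts)
```

The result is a nine-element list ordered by digit, the same layout as the raw counts reported in the thread.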
2020-11-05 23:21:59
11 - In the end, Biden's vote data from that page is far more anomalous than Trump's. Here is what it looks like visually: pic.twitter.com/7qPivR9zQX
2020-11-05 23:23:09
12 - And here are the raw numbers (1 to 9):
Biden: [86, 35, 52, 69, 79, 62, 42, 28, 22]
Trump: [115, 85, 89, 57, 35, 36, 27, 16, 16]
2020-11-05 23:25:41
13 - Here are the respective p-values:
Biden 1.5076774999383611e-27
Trump 0.00048111250713426005
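For readers who want to rerun the test from the counts quoted above, here is a dependency-free sketch. The thread used scipy's `chisquare`; this version instead computes the Pearson chi-squared statistic by hand and uses the closed-form survival function that exists for even degrees of freedom (8 here, from 9 digits minus 1). That substitution is a choice made for this example, and `benford_chi_square` is a made-up name. The results should land close to the p-values quoted above.

```python
import math

# First-digit counts (digits 1 to 9) as quoted in the thread.
biden = [86, 35, 52, 69, 79, 62, 42, 28, 22]
trump = [115, 85, 89, 57, 35, 36, 27, 16, 16]

def benford_chi_square(observed):
    """Return (statistic, p_value) for a first-digit test against Benford's law."""
    total = sum(observed)
    expected = [total * math.log10(1 + 1 / d) for d in range(1, 10)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-squared survival function for 8 degrees of freedom.
    # For even degrees of freedom 2m: exp(-x/2) * sum_{j<m} (x/2)^j / j!
    half = stat / 2
    p_value = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(4))
    return stat, p_value

stat_biden, p_biden = benford_chi_square(biden)
stat_trump, p_trump = benford_chi_square(trump)
print("Biden:", stat_biden, p_biden)
print("Trump:", stat_trump, p_trump)
```

With scipy installed, `scipy.stats.chisquare(observed, f_exp=expected)` performs the same test with the default 8 degrees of freedom.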
2020-11-05 23:26:40
14 - What is notable is the extreme difference in their p-values. The drawback to this analysis is that there is a better test for Benford's goodness of fit: the Watson version of the Cramér-von Mises test (U²). You can read about why it is better here (next message)
2020-11-05 23:28:16
15 - Here: Lesperance, M., Reed, W. J., Stephens, M. A., Tsao, C., & Wilton, B. (2016). Assessing Conformance with Benford’s Law: Goodness-Of-Fit Tests and Simultaneous Confidence Intervals. PLoS ONE, 11(3). doi.org/10.1371/journa…
2020-11-05 23:28:25
16 - What is undeniable is that the first digit frequencies of Biden's vote totals are extremely anomalous in comparison to Trump's.
2020-11-05 23:29:04
I should make clear that I am not trying to endorse or refute any political party. I'm analyzing data and I feel we should all be free to analyze data. Ideally I wouldn't even know who the numbers are associated with and I could be free to just analyze them in the blind.
2020-11-06 00:05:37
People are asking for the code not in image form. I was just told I can use this site to do it, so here you go. pastebin.com/YKFyKtbc
2020-11-06 03:55:18
So I am being accused of being:
- a Russian bot
- a fake account
- a QAnon handler (unless I misread that)
- a criminal
- an election interfere
At this point I want to go back to making silly stats jokes that no one laughed at.
2020-11-06 08:57:31
*interferer
Look, I would like to point out that no one has to trust anything and you are strongly encouraged to run your own analysis. That's sort of the point. Learn to code a bit, collect some data, and run some analysis.
2020-11-06 08:59:42
Now I am being called a fascist as well. I am trying to remember the stoics: Weak men are murderous in mobs, cowards bold in the crowd. I will leave this account up, but I will not engage with the hatefulness I am receiving.
2020-11-06 09:05:52
Alright, a less snazzy version of the earlier analysis, & then I'll quit for a while while I'm behind. Here are early & late counts for Trump and Biden in Milwaukee. Trump's are 1 to 1, while Biden doesn't just get more, his totals are all over the place. (Correlations 0.86 & 0.46 respectively) pic.twitter.com/NUylGOlnWk
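The tweet does not say which correlation was computed; assuming it is the usual Pearson coefficient between early and late counts, a minimal sketch looks like this. The function name and all of the numbers below are made up purely to show the mechanics and are NOT the Milwaukee data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up ward counts for illustration only; NOT the Milwaukee data.
early = [100, 200, 300, 400, 500]
late_regular = [110, 190, 310, 390, 510]  # tracks the early counts closely
late_noisy = [300, 120, 520, 250, 480]    # scattered relative to early counts

r_regular = pearson_r(early, late_regular)
r_noisy = pearson_r(early, late_noisy)
print(r_regular, r_noisy)
```

A late series that tracks the early one closely gives a coefficient near 1, while a scattered one gives a noticeably lower value.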
2020-11-06 09:56:35
You could say that the later counts were shifted towards vote-by-mail, which favors Biden, & so a higher slope of his late counts to earlier counts makes sense. But I'm not sure I know why distribution relative to earlier counts would be so chaotic for him + so regular for Trump.
2020-11-06 09:59:05
Here's another way of looking at the pattern. These are quantile plots which show the mean levels of new votes by early votes. Trump follows the "line of perfect agreement" almost, well, perfectly, while Biden's increases are pretty wild. pic.twitter.com/xKyTImzWC2
2020-11-06 10:04:22
Who knew that with less than 100 lines of code you could make half the country wish you were dead, the other half appreciate math, and approximately 0.00000001% laugh at what you thought were high quality stats jokes.
2020-11-06 10:32:54
Thank you to whoever took this over and developed an interest in the data collection and exploration/analysis. I am glad people are getting interested in learning this stuff!
2020-11-06 11:32:54
I meant to provide this link (someone put this kind of work in Jupyter and began expanding the datasets): github.com/cjph8914/2020_…
2020-11-06 11:34:53
I wonder if the "real" trend line for later vote by mail votes would be like this pic.twitter.com/TCDb8AREXR
2020-11-06 12:30:15