Determining Authorship of Ron Paul’s Newsletters Through Text Analysis: Part 2
Update: Part 3 is here.
Readers (all two of you) may remember that last week I attempted to explore the question of who wrote Ron Paul’s controversial newsletters. I compared the frequency of word lengths in a number of Paul’s known writings with that in four newsletter excerpts of which Paul denies authorship.
The trouble with the approach I took is that the tests are designed to show differences in authorship, but do not address the question of similarity. A chi-square test of independence can give statistical evidence that two pieces of writing come from different authors when it produces a small p-value. A large p-value, however, does not necessarily show that the same author wrote both pieces, though many take this result to be implicit.
What the results of the previous post do call for, however, is further testing.
I focused on four articles: one on the coming race war, one on carjacking, one on AIDS and another calling for Paul’s re-election to Congress. By analyzing word length, punctuation and letter appearance, I was able to determine that Paul probably did not write two of the four articles, namely the re-election article and the particularly offensive article on carjacking. The articles on AIDS and the coming race war, however, are still in dispute.
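For readers who want to see what such a test involves, here is a minimal sketch of the chi-square statistic on word-length counts from two texts. It is an illustration under my own assumptions, not the exact procedure from the previous post: the function names, the crude tokenizer and the choice to pool all words longer than `max_len` into one bin are all hypothetical.

```python
import re
from collections import Counter

def word_length_counts(text, max_len=10):
    """Count words by length; words longer than max_len share the last bin."""
    words = re.findall(r"[A-Za-z']+", text)  # crude tokenizer (assumption)
    counts = Counter(min(len(w), max_len) for w in words)
    return [counts.get(n, 0) for n in range(1, max_len + 1)]

def chi_square_independence(row_a, row_b):
    """Chi-square statistic and degrees of freedom for a 2 x k count table."""
    # Drop word lengths that appear in neither text (expected count of zero).
    cols = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    total_a = sum(a for a, _ in cols)
    total_b = sum(b for _, b in cols)
    if total_a == 0 or total_b == 0:
        raise ValueError("both texts must contain at least one word")
    grand = total_a + total_b
    stat = 0.0
    for a, b in cols:
        col_total = a + b
        expected_a = total_a * col_total / grand
        expected_b = total_b * col_total / grand
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    dof = len(cols) - 1  # (rows - 1) * (columns - 1) with two rows
    return stat, dof

# Usage with two made-up count rows (not data from the newsletters):
stat, dof = chi_square_independence([10, 10], [10, 30])
print(stat, dof)
```

The statistic is then compared against the chi-square distribution with `dof` degrees of freedom to get a p-value. With one degree of freedom, for example, the 5% critical value is about 3.84. A statistic below the critical value only means the test failed to detect a difference, which is exactly the limitation described above.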
Taking a cue from a paper sent by Mr. JD Klein, who kindly took the time to comment on the last post, I ran a principal component analysis (PCA) on word length. I have since added several articles by Lew Rockwell, head of the Mises Institute (a libertarian think tank), a few articles written by other members of the institute, more of Paul’s articles and three more of his books.
PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. It is normally used as a data reduction technique when one has multiple correlated variables and wishes to reduce them to one, two or possibly three compact but uncorrelated variables. In this case, there are 30 variables, each representing the percentage of words of a given length (from 1 to 30 letters) in each of the texts.
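To make the procedure concrete, here is a rough sketch of how the 30 word-length percentages could be built and reduced to two principal component scores per text. This is not the code I ran. It is a plain-Python illustration that finds the components by power iteration on the covariance matrix, and every function name in it is made up for the example. It assumes words longer than 30 letters are simply dropped and that there are at least two texts.

```python
import math
import re

def word_length_percentages(text, max_len=30):
    """Percentage of words at each length 1..max_len (longer words are dropped)."""
    lengths = [len(w) for w in re.findall(r"[A-Za-z']+", text)]
    lengths = [n for n in lengths if n <= max_len]
    total = len(lengths) or 1  # avoid dividing by zero on empty input
    return [100.0 * lengths.count(n) / total for n in range(1, max_len + 1)]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _orthonormalize(vec, previous):
    """Remove the parts of vec along earlier components, then scale to length 1."""
    for comp in previous:
        proj = _dot(vec, comp)
        vec = [x - proj * c for x, c in zip(vec, comp)]
    norm = math.sqrt(_dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else vec

def _leading_direction(cov, previous, iterations=500):
    """Power iteration for the next principal direction of cov."""
    # Non-uniform start: percentages sum to a constant, so a uniform vector
    # would sit in a direction with no variance at all.
    vec = [float(i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        vec = _orthonormalize(vec, previous)
        nxt = [_dot(row, vec) for row in cov]
        if _dot(nxt, nxt) < 1e-24:  # no variance left in this direction
            break
        vec = nxt
    return _orthonormalize(vec, previous)

def pca_scores(rows, n_components=2):
    """Project each row onto the first n_components principal components."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    components = []
    for _ in range(n_components):
        components.append(_leading_direction(cov, components))
    return [[_dot(row, comp) for comp in components] for row in centered]

# Usage: one row of percentages per text, then two scores per text.
texts = ["a short made up sample", "another slightly longer invented sample text"]
rows = [word_length_percentages(t) for t in texts]
scores = pca_scores(rows)
```

Each text ends up with a pair of numbers (its first and second component scores), which is what gets plotted. In practice a numerical library would do the decomposition more reliably; the sign of each component is arbitrary, so a plot may appear mirrored from one tool to another.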
One can also look for important clusters of observations by plotting the first and second PCs against one another. Thus, if Paul wrote some works but someone else wrote others, we might see that all of Paul’s writings occupy a particular region on the plot, whereas the other author’s occupy another. I have included the plot on the right. Interestingly, Paul’s writings are all over the place. What is of note is that some of Lew Rockwell’s writings appear to be clustered in one region, along with the re-election and carjacking articles, the very two articles that the previous post found were likely NOT written by Paul. I have circled the appropriate region.
Searching further on authorship attribution and text analysis (this field is rather new to me), I also found a software package called JGAAP (Java Graphical Authorship Attribution Program). It is a Java-based textual analysis program. It allows one to feed in a number of text files, assign authorship to each one of them, and compare them with a number of texts of unknown authorship. While the program allows for a number of comparison methods, I opted for the path of least resistance (and time) and compared word length between the texts using a nearest neighbor driver and a histogram distance.
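The idea behind that comparison can be sketched in a few lines. To be clear, this is not JGAAP code and I have not checked how JGAAP defines its histogram distance internally. The sketch below assumes a Euclidean distance between relative word-length frequencies, and the function names are invented for the illustration.

```python
import math
import re
from collections import Counter

def word_length_histogram(text):
    """Relative frequency of each word length in the text."""
    lengths = [len(w) for w in re.findall(r"[A-Za-z']+", text)]
    total = len(lengths) or 1  # avoid dividing by zero on empty input
    return {n: c / total for n, c in Counter(lengths).items()}

def histogram_distance(h1, h2):
    """Euclidean distance between two histograms (an assumed definition)."""
    keys = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(k, 0.0) - h2.get(k, 0.0)) ** 2 for k in keys))

def rank_authors(known, unknown_text, top=3):
    """Rank authors by the distance of their closest known text.

    known is a list of (author, text) pairs; nearest neighbor means each
    author is represented by whichever of their texts is closest.
    """
    target = word_length_histogram(unknown_text)
    best = {}
    for author, text in known:
        d = histogram_distance(word_length_histogram(text), target)
        if author not in best or d < best[author]:
            best[author] = d
    return sorted(best.items(), key=lambda item: item[1])[:top]

# Usage with placeholder texts (not the real corpus):
known = [("Author A", "a bb ccc"), ("Author B", "dddd eeee ffff")]
ranking = rank_authors(known, "x yy zzz")
print(ranking)
```

The output is a list of (author, distance) pairs with the closest author first. A nearest neighbor method always names someone, even if the true author is not among the candidates, so a ranking like this is a lead rather than a verdict.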
I have included a table of the three most likely authors of each of the four articles based on word length. Interestingly, Paul is not the most likely author of any of the texts. In fact, he is not even in the top three for the re-election article. Lew Rockwell, however, is implicated in all four of these articles. Michael Rozeff (I included a number of “control” articles) made the top three for the race war article, a result that I’m not sure how to interpret.
Clearly, further analysis is in order. Given extra time, I will pursue this to the best of my ability. I find these results fascinating, however. Paul maintains that he did not write the articles and, given these results, that may be true. Lew Rockwell, long involved in Paul’s activities, could in fact have written them.
That Paul himself disavows these articles is not surprising in an election campaign. What remains unanswered, though, is who wrote these articles and how much Paul knew of what was written in his name. I think that I have, in some way, cracked this egg for further investigation.