Determining Authorship of Ron Paul’s Newsletters Through Text Analysis: Part 2

Update: Part 3 is here.

Readers (all two of you) may remember that last week I attempted to explore the question of who wrote Ron Paul’s controversial newsletters. I compared the frequency of word lengths in a number of Paul’s known writings with that in four newsletter excerpts of which Paul denies authorship.

The trouble with the approach I took is that the tests are designed to show differences in authorship, but do not address the question of similarity. A small p-value from a chi-square test of independence may let us show statistically that two pieces of writing come from different authors. A large p-value, however, does not necessarily show that the same author wrote both pieces, though many take this result to be implicit.
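To make the test concrete, here is a minimal sketch of the word-length comparison, not the exact code used for the previous post. The function names and the tokenization rule (runs of ASCII letters count as words) are my own assumptions for illustration. It returns only the Pearson chi-square statistic and the degrees of freedom; turning that into a p-value requires the chi-square distribution from a statistics package.

```python
import re
from collections import Counter


def word_length_counts(text, max_len=30):
    """Count how many words of each length (1..max_len) appear in text.

    Assumption: a "word" is any run of ASCII letters; longer words are
    lumped into the max_len bin.
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    return [counts.get(n, 0) for n in range(1, max_len + 1)]


def chi_square_statistic(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x k table of word-length counts.

    Word lengths that appear in neither text are dropped. Both texts must
    contain at least one word. Returns (statistic, degrees_of_freedom).
    """
    cols = [(a, b) for a, b in zip(counts_a, counts_b) if a + b > 0]
    total_a = sum(a for a, _ in cols)
    total_b = sum(b for _, b in cols)
    grand = total_a + total_b
    stat = 0.0
    for a, b in cols:
        col = a + b
        expected_a = total_a * col / grand  # expected count if lengths are independent of text
        expected_b = total_b * col / grand
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat, len(cols) - 1
```

A large statistic relative to the degrees of freedom corresponds to a small p-value, which is evidence of different word-length profiles. A small statistic is only a failure to find a difference, which is the limitation described above.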

What the results of the previous post do require, however, is further testing.

I focused on four articles: one on the coming race war, one on carjacking, one on AIDS and another calling for Paul’s re-election to Congress. By analyzing word length, punctuation and letter appearance, we were able to determine that Paul probably did not write two of the four articles, namely the re-election article and the particularly offensive article on carjacking. The articles on AIDS and the coming race war, however, are still in dispute.

Taking a cue from a paper sent by Mr. JD Klein, who kindly took the time to comment on the last post, I ran a principal component analysis (PCA) on word length. I have since added several articles by Lew Rockwell, head of the Mises Institute (a libertarian think tank), a few articles written by other members of the institute, more of Paul’s articles and three more of his books.

PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. It is normally used as a data reduction technique when one has multiple correlated variables and wishes to reduce them to one, two or possibly three compact but uncorrelated variables. In this case, there are 30 variables, one for each word length from 1 to 30, each holding the percentage of words of that length in a given text.

One can also look for important clusters of observations by plotting the first and second PCs against one another. Thus, if Paul wrote some works but someone else wrote others, we might see that all of Paul’s writings occupy a particular region of the plot, whereas the other author’s occupy another.
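For readers who want to see the mechanics, the following is a rough sketch of how the first two PC scores could be computed from a table of word-length percentages (one row per text). It is not the code behind the plot; it uses plain power iteration so that it runs with the standard library alone, and every function name here is hypothetical. A statistics package would normally do this with an eigendecomposition or SVD.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _matvec(matrix, vec):
    return [_dot(row, vec) for row in matrix]


def _remove_projections(vec, basis):
    """Subtract the part of vec that lies along each unit vector in basis."""
    for b in basis:
        c = _dot(vec, b)
        vec = [x - c * y for x, y in zip(vec, b)]
    return vec


def _normalize(vec):
    """Return vec scaled to unit length, or None if it is (numerically) zero."""
    norm = math.sqrt(_dot(vec, vec))
    return None if norm < 1e-12 else [x / norm for x in vec]


def principal_components(rows, n_components=2, iterations=500):
    """Return (variances, components, scores) for the leading components.

    rows: one list of numbers per text (e.g. word-length percentages).
    Fewer than n_components are returned if the data has no variance left.
    """
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
           for i in range(k)]
    components, variances = [], []
    for _component in range(n_components):
        # Non-uniform start vector: percentages sum to a constant, so the
        # all-ones vector carries no variance and would be a poor start.
        start = [float(i + 1) for i in range(k)]
        vec = _normalize(_remove_projections(start, components))
        for _step in range(iterations):
            if vec is None:
                break
            vec = _normalize(_remove_projections(_matvec(cov, vec), components))
        if vec is None:
            break
        components.append(vec)
        variances.append(_dot(vec, _matvec(cov, vec)))
    # Each text's coordinates on the PCs; these are what one would plot.
    scores = [[_dot(r, c) for c in components] for r in centered]
    return variances, components, scores
```

Plotting the first column of `scores` against the second, with a label per text, gives the kind of picture described above. Note that the sign of each component is arbitrary, so a plot may appear mirrored from one implementation to another.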

Biplot of First Two PCs of Word Length of Writings by Ron Paul, Other Authors and Disputed Articles from the Ron Paul Newsletters

I have included the plot on the right. Interestingly, Paul’s writings are all over the place. What is of note is that some of Lew Rockwell’s writings appear to be clustered in one region, along with the re-election and carjacking articles, the very two articles that the previous post found were likely NOT written by Paul. I have circled the appropriate region.

Searching further on authorship attribution and text analysis (this field is rather new to me), I also found a software package called JGAAP (Java Graphical Authorship Attribution Program). It is a Java-based textual analysis program. It allows one to feed in a number of text files, assign authorship to each one of them, and compare them with a number of texts of unknown authorship. While the program allows for a number of comparison methods, I opted for the path of least resistance (and time) and compared word length between the texts using a nearest neighbor driver and a histogram distance.
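The idea behind a nearest neighbor comparison can be sketched in a few lines. This is not JGAAP’s code, and I am assuming that “histogram distance” means the Euclidean distance between relative word-length frequencies; JGAAP’s own definition may differ. The function names are illustrative only.

```python
import math
import re
from collections import Counter


def word_length_histogram(text, max_len=30):
    """Relative frequency of each word length (1..max_len) in text.

    Assumption: a "word" is any run of ASCII letters.
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    return [counts.get(n, 0) / total if total else 0.0
            for n in range(1, max_len + 1)]


def histogram_distance(hist_a, hist_b):
    """Euclidean distance between two histograms (an assumed definition)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hist_a, hist_b)))


def rank_candidates(unknown_text, known_texts, top=3):
    """Rank authors by their single closest known text to unknown_text.

    known_texts: list of (author, text) pairs.
    Returns up to `top` (author, distance) pairs, nearest first.
    """
    target = word_length_histogram(unknown_text)
    best = {}
    for author, text in known_texts:
        dist = histogram_distance(target, word_length_histogram(text))
        if author not in best or dist < best[author]:
            best[author] = dist
    return sorted(best.items(), key=lambda item: item[1])[:top]
```

Calling `rank_candidates(disputed_text, [("Paul", text1), ("Rockwell", text2)])` with real texts would yield a short ranked list of candidate authors. The nearest author is only the closest of the candidates supplied, so the ranking says nothing about authors who were left out.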

I have included a table of the three most likely authors of the four articles based on word length. Interestingly, Paul is not the definitive author on any of the texts. In fact, he is not even in the top three for the re-election article. Lew Rockwell, however, is implicated in all four of these articles. Michael Rozeff (I included a number of “control” articles) made the top three for the race war article, a result that I’m not sure how to interpret.

Clearly, further analysis is in order. Given extra time, I will pursue this to the best of my ability. I find these results fascinating, however. Paul maintains that he did not write the articles and, given these results, that may be true. Lew Rockwell, long involved in Paul’s activities, could in fact have written these.

That Paul himself disavows these articles is not surprising in an election campaign. What is missing, though, is an answer to the question of who wrote these articles and how much Paul knew about what was written in his name. I think that I have, in some way, cracked this egg for further investigation.

About Pete Larson

Assistant Professor of Epidemiology at the Nagasaki University Institute for Tropical Medicine

12 responses to “Determining Authorship of Ron Paul’s Newsletters Through Text Analysis: Part 2”

  1. JD Klein says :

    I think that this analysis shows how difficult author attribution is, and how deceptive the pure p-values are, and how risky it is to use canned programs. Perhaps word length is the wrong metric to use, because it has low sensitivity, as the big overlap between Paul and Rockwell illustrates. The PCA paper I cited in the previous post used vocabulary instead of word length. Specifically, it singled out a class of words (function words like ‘of’) that are independent of the immediate subject matter, avoiding systematic biases. A good test is to get samples of known columnists, like Tom Friedman, Maureen Dowd et al, and run any method on them as well. If it doesn’t work in the known case, it won’t say much about unknown cases.

    For amusement, when I was tinkering with these methods, I considered a variant in which specific words would be replaced by NOUN, VERB, ADVERB, GERUND …. and then one would run a frequency analysis, thereby stripping out the subject matter and leaving only the structure. You’d need some pretty sophisticated pre-processing code, so I never did it.

  2. V. Chem says :

    You should have a look at the Wordnet data and some of the associated libraries. They provide tags for words. You could then develop chains of tagged words for each sentence and paragraph. This could allow for a higher level of analysis than just word and letter distribution. This exercise borders on reverse engineering content generation software. The same techniques are discussed as ways for detecting generated content.

    Aside from the interesting technical matters, I would like to speak to the broader issue at hand. Paul’s philosophy prohibits him from voting for government mandated or sanctioned events or functions. Some people have called him Dr. No in reference to his frequent non-support of legislation. Yet, within this frame he supported creating a federal holiday in the name of MLK. How does this square with the contents of the newsletter, specifically: “hate whitey day”.

    When you admit that you have your biases, as we all have, it shows honesty. But the above contradiction should be a clue to you that your biases have prevented you from viewing the issue objectively. Just as you cherry-picked the data in the original article, you seem bent on rationalizing your preconceptions by avoiding information that does not fit your conclusions.

    To pursue information is to admit that you presently have an insufficient amount. When you attach your ego to statements or perceived facts you are inviting disaster. Truth seeking should be equally gratifying when the results do not match your conclusion. There is nothing more noble than the pursuit of information. It stands to reason that more information is gained when you have to question and reexamine your perspective.

    What you have done here is dress up your predetermined conclusions in the air of scientific and statistical significance, only to backpedal and dress down your conclusions in the above post. Would it not make sense to examine the processes that brought you to this juncture? Perhaps you find yourself struggling with the technical implementation because the results you seek do not fit with reality. This could be indicative of a poorly constructed hypothesis. If you honestly are seeking truth, it may be prudent to examine the information and methods which gave rise to your unobtainable expectations.

  3. Pete Larson says :

    Sir,

    I appreciate your thoughtful comments.

    I take issue with your point, however. In fact, here, I have shown that, given the simple methodologies and the simplistic use of word length, evidence exists that indicates that Paul may NOT have written the articles in question. I have shown here, that there is reason to believe that Lew Rockwell may be the culprit. Yes, you are correct, my politics would have very much liked to have seen definitive evidence that Paul wrote the articles, but, alas, that result proved elusive (though the determination of authorship in any context is more complicated than I originally anticipated).

    Also, far from “cherry picking data,” I used the entire text of three of Paul’s books, blindly chose more than 30 of his articles along with several blindly chosen articles written by other authors, including Lew Rockwell. I made every attempt to create as impartial a data set as time would allow.

    As for the newsletter text, I was restricted to only the four articles that I could find in text format. While the contents of three of them were of a disgusting nature, one wasn’t incendiary in the least, but rather a call for electoral support. I had hoped this would serve as a control.

    This blog represents the personal and rather boring explorations of a single individual. You are correct: not all of it is well thought out, some of it may be flat-out incorrect and my political biases may conflict with available information but, as an exploration of my own personal ideas, it serves its purpose. In this country, we make the great mistake of believing that hardheadedness is a virtue. I like to think that my ideas are malleable given new information. You seem to believe that, as well.

    Truth be told, though, hardly anyone ever reads this and even fewer actually take the time to comment.

    I will point out that my opinion of Paul is less informed by these newsletters and more informed by writings which are, in fact, attributable to him. After reading the three books, and more than 50 articles by him, I can say confidently that I cannot support the man.

    Again, I welcome your thoughtful comments. Comments such as yours do help one to improve.

    Pete

  4. steve says :

    James B Powell is the author of the so-called racist articles in the Ron Paul newsletter. Take a look at the YouTube link below.

  5. Pete Larson says :

    Thank you for letting me know. I will see if I can add him to the analysis.

  6. JD Klein says :

    Yes, I’ve considered WordNet. But it takes a while to get it working with one’s language of choice, and any tags are ambiguous. The entire field of correctly tagging word types is very hard, probably even harder than statistical speech analysis: see http://en.wikipedia.org/wiki/Part-of-speech_tagging

    It’s still only on my to-do list, if I ever get into it again.

  7. Paul Skeptic says :

    I would appreciate seeing what you can find. Comparing the two articles Mr Swann (a libertarian and Ron Paul supporter) cited on his site, they did not look at all like two documents written by the same author to me. They used different vocabulary, emphasis, style, rhetorical techniques (One of them has at least one rhetorical question for each topic covered, the other does not have a single rhetorical question), flow… really, apart from covering some of the same topics, and being published in the Ron Paul newsletters, I saw no similarities between them. I am very skeptical of this so-called explanation, and would like to see some more rigorous analysis.

  8. Pete Larson says :

    So which Jim Powell? The person who runs the stock report or the historian at the Cato Institute?

    I think this is a sham.

    This is all that appears on the CNN blog site:

    “I recently discovered that the author of the so-called ‘racist’ newsletter produced by the Ron Paul group, was written by James B. Powell. James B. Powell is now working as a high-level director at Forbes magazine, which has great relations with Fox News. Seems like the media dug up the wrong grave, huh?”

    This text is flying around the internet:

    “It is a 1993 edition of the Ron Paul Strategy Guide. The article is titled “How to Protect Against Urban Violence.” The author is James B. Powell.

    The full eight pages of his article match so closely to some of those other so-called “racist newsletters” it is stunning.”

    Yet the text isn’t available, nor are the criteria for establishing the match disclosed. The only portion of “How to Protect Against Urban Violence” that is available is the heading with Mr. Powell’s name on it.

    This author is not convinced.

  9. Pete Larson says :

    Unfortunately, the article on Mr. Swann’s site isn’t text readable. Reading it, though, I’m skeptical, as well.
