Determining Authorship of Ron Paul Newsletters through Text Analysis

Update: Part 2 is here.

Update: Part 3 is here.

Ron Paul sold newsletters in the ’80s and ’90s. The content of these newsletters was appalling though unsurprising. Here’s a sample:

“We don’t think a child of 13 should be held responsible as a man of 23. That’s true for most people, but black males age 13 who have been raised on the streets and who have joined criminal gangs are as big, strong, tough, scary and culpable as any adult and should be treated as such.”

“And Stanford, Michigan, and many other universities have banned speech that offends privileged groups. Anti-white, anti-male, anti-heterosexual or anti-Christian remarks are perfectly OK, of course.” You can imagine, then, what a relief it must be to minorities, homosexuals, women and non-Christians to find themselves the privileged people of America. The rest of this page and part of the second details a cabal of homosexuals in the Bush administration who like to lead “the young” astray.

“Boy, it sure burns me to have a national holiday for that pro-communist philanderer, Martin Luther King. I voted against this outrage time and time again as a Congressman. What an infamy that Ronald Reagan approved it! We can thank him for our annual Hate Whitey Day. Listen to a black radio talk show in any major city. The racial hatred makes a KKK rally look tame.”

“Dr. Douglass believes that AIDS is a deliberately engineered hybrid of these two animal viruses cultured in human tissue, and he blames World Health Organization experimentation at Ft. Detrick, Maryland…. Could the government have experimented with it in the civilian population, as it did in the 1950s with LSD, and had things get out of control? I don’t know, but these sure are interesting questions.”

“A well-known libertarian editor just back from New York told me: ‘The ACT-UP slogan, on stickers plastered all over Manhattan, is “Silence = Death.” But shouldn’t it be “Sodomy = Death”?’”

Paul claims not only that he did NOT write the trash in his newsletters, but also that he did not know of their content. Given Paul’s prolific written output, I find it highly unlikely that he would not have had the time to write the content of newsletters that bear his name and signature. I also find it unlikely that he himself wouldn’t have read them, given that he drew a portion of his income from their continued sale.

Regardless, the claim that Paul did NOT write the content of his own newsletters needs to be put to a rigorous test. Clearly, Paul himself is of no use in this venture, given his precarious position as a Presidential candidate.

PhiloComp.net offers “The Signature Stylometric System,” a free text analysis software package. One can use the package, for example, to determine if the same author wrote all of Shakespeare’s plays or to determine the authorship of the Federalist Papers. It compares word and sentence length between texts, and determines frequency of letter usage and punctuation. Authors have particular styles. For example, one author may often use three-letter words (or four-letter words!). We may take a disputed work, compare its word lengths against those of all other works by said author, and then statistically test whether there is evidence to suggest that the work came from that author.
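
As a rough illustration of the word-length comparison described above, the sketch below bins word lengths and computes a Pearson chi-square statistic for a disputed text against a reference corpus. It is a sketch under assumptions, not the Signature software's own code: the post does not say exactly which test Signature runs, and the function names, the simple tokenizer and the cap at 10 letters are all illustrative choices.

```python
import re
from collections import Counter

MAX_LEN = 10  # assumption: words of 10 or more letters share one bin


def word_lengths(text):
    """Lengths of alphabetic words; contractions are split at the apostrophe."""
    return [min(len(w), MAX_LEN) for w in re.findall(r"[A-Za-z]+", text)]


def length_shares(text):
    """Share of words at each length from 1 to MAX_LEN."""
    lengths = word_lengths(text)
    if not lengths:
        raise ValueError("text contains no words")
    counts = Counter(lengths)
    return {k: counts[k] / len(lengths) for k in range(1, MAX_LEN + 1)}


def chi_square_stat(disputed, corpus):
    """Pearson chi-square of the disputed text's word-length counts
    against the counts expected from the corpus's word-length shares."""
    observed = Counter(word_lengths(disputed))
    shares = length_shares(corpus)
    n = sum(observed.values())
    stat = 0.0
    for k in range(1, MAX_LEN + 1):
        expected = shares[k] * n
        if expected > 0:  # bins the corpus never uses are skipped (a simplification)
            stat += (observed[k] - expected) ** 2 / expected
    return stat
```

A larger statistic means the disputed text's word lengths sit further from the corpus. Turning the statistic into a p-value would still require a chi-square table or a statistics library, with degrees of freedom matching the number of bins used.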

I collected a number of works known for a fact to be written by Paul. I included a couple of chapters from “End the Fed,” a number of his speeches, and more than 20 articles and compiled them into a single corpus. On the internet, I then found four articles from his newsletters: one asking readers to assist in his re-election to office (his present seat in Congress, actually), one on the supposed government conspiracy to create and spread AIDS (partially quoted above), one on the coming race war, and one particularly deplorable article on carjacking and the need for an armed populace.

A graph of the distribution of word length in Paul’s output can be seen below.

Word Length

Using the software, I compared the word length and sentence length of each of the four newsletter articles to works known to be written by Paul. The results are below. For those unfamiliar with stats and/or p-values, the gist is this: If the p-value is less than, say, .05, there is reason to believe that the author of the newsletter article is someone other than Paul. If the p-value is greater than .05, we might conclude that there is not enough evidence to suggest that Paul did not write the article, and move on to other methods of testing (as is seen in the next post).
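
To make the p-value idea concrete, here is a minimal permutation-test sketch. It assumes mean word length as the summary statistic, which is a simplification (the comparison in this post is over whole distributions), and it is not how the Signature software computes its values. The function name and defaults are illustrative.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(lengths_a, lengths_b, n_perm=2000, seed=0):
    """Share of random relabelings whose difference in mean word length
    is at least as large as the difference actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(lengths_a) - mean(lengths_b))
    pooled = list(lengths_a) + list(lengths_b)
    n_a = len(lengths_a)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # pretend both texts came from one source
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # the +1 terms count the observed arrangement itself
    return (at_least_as_extreme + 1) / (n_perm + 1)
```

A small value says the observed gap would be rare if both texts were drawn from one pool of words. A large value only says the gap is unremarkable, not that the two texts share an author.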

Results of textual analysis on Ron Paul writings

The results are interesting. There is not enough evidence to suggest that someone other than Paul wrote the piece on AIDS and the piece speculating on a coming race war, though to confirm (or refute) Paul’s authorship, we may have to resort to other methodologies. On the other hand, there is reason to believe that someone else may have written the other two articles, the one on carjacking and the re-election piece.

I have also included a comparison with a piece on health care that is known to be written by Paul. The tests confirm that it compares nicely with the rest of Paul’s known writings (or at least provides no evidence that it is significantly different). For reference, I have also included the results of a test between Paul’s writings and the entire text of this blog starting in 2007. Again, the test confirms that the authors are likely different people (which I already knew).

A visual comparison of word length between the feature on the coming race war and the rest of Paul’s works shows that the two are very similar. For reference, I have included a comparison of Paul’s works with my last blog post, which, incidentally, was also statistically different from Paul’s writing on all measures.

Race War and Paul Works

My Last Blog Post and Ron Paul

Obviously, we will never know without a doubt who did or did not write the trash that appeared in Paul’s incendiary newsletters, though results like these and more casual spot-check analyses indicate that the case is hardly closed. I am convinced that Paul happily exploited the worst elements of the American political landscape. He willfully mixes with racists, conspiracy nuts and paranoid gun freaks for nothing more than political gain, political contributions and worse yet, book sales. I am also convinced that he was aware of the newsletters that he has “disavowed” though the results above indicate that he may, in fact, have farmed out some of the writing to other people.

Subjecting myself to his writing was one of the most painful and useless experiences of my life. I really wanted to give the man a chance, particularly after his impressive display at the Republican foreign policy debate. “End the Fed” read more like a paper from freshman comp than a serious book, though it somehow attempts to pass itself off as a work of deep economic analysis. Not to disparage people I know that may support Paul (and I do apologize), but I think that Krugman’s recent quip that Newt Gingrich is “a stupid man’s idea of what a smart man sounds like” is actually more true of Ron Paul.

It doesn’t take a piece of software to know that it is possible that Paul at least signed off on some of the nonsense in his newsletters. The jury on whether he did or did not write these articles may be out, but a reading of his works shows that philosophically, it doesn’t take a great leap of faith to move from Paul’s public persona to some of the ugliest portions of right wing politics.


Update: Please see further analysis in the next post that expands upon these results. If Paul didn’t write these letters, who did?

Further discussion of methods and criticism of this post on another blog can be found here.

About Peter Larson

Assistant Professor of Epidemiology at the Nagasaki University Institute for Tropical Medicine

37 responses to “Determining Authorship of Ron Paul Newsletters through Text Analysis”

  1. Matthew Saint-Germain says :

    A question regarding methodology: In terms of Paul’s speeches, were you able to verify that he alone wrote each?

  2. Matthew Saint-Germain says :

    Also, considering that many technical writers (and I realize the brazen assumption I’m making here) do tend to write differently for journal articles and speeches, did you run another test with the speeches excluded from the database?

  3. Ellen G says :

    Typically a p value under .05 indicates significance.

  4. Ellen G says :

    Okay, it’s been a while since I’ve taken stat, but the p value is basically the chance Paul wrote the article, right?

    • Pete Larson says :

      No, the p value is the probability that one would see a statistical difference between the distributions of (for example) word lengths at least as extreme as the one found, assuming random sampling alone.

      Basically, if the p value is small, the likelihood that the difference occurred by chance is small. If it is large, the likelihood that the small differences in the pattern of (for example) word lengths are just due to random variation is large.

      We would expect that the pattern of word lengths varies to some extent in Paul’s writings, since they are different articles. The question is whether that difference is significant enough to assume that different people wrote them. If the p-value is large, we have no reason to assume that different people wrote them.

      I hope this helps.

    • Pete Larson says :

      In a nutshell:

      p-value big (>.05): Paul may or may not have written it.
      p-value small (less than .05): Paul probably did not write it.

  5. Ellen G says :

    Great! One more question. Shouldn’t the null hypothesis assume there is no relationship?

    Null hypothesis, no relationship
    Alternative hypothesis, relationship

    ?

    • Pete Larson says :

      In general terms, the null hypothesis here is that the writers are the same, the alternative is that they are different.

      If we do not find convincing evidence to reject the null (the notion that the writers are the same), then we might suspect that Paul wrote the articles, though we would have to confirm this with further tests.

  6. Aras says :

    Hello:

    This is a very interesting analysis, but I have two points which call your analysis into question. Perhaps you can address these and explain to me why I am wrong:

    (1) You choose to focus on word length, when you could just as easily have focused on letter frequency (your second column in the chart above.) If you had chosen to focus on letters, the p-value for the “Race War” article is 0.001, meaning that there is only a 0.1% probability that the differences in the letter frequency between that article and Ron Paul’s known writings occurred by chance. That is, there is a 99.9% likelihood that the letter frequency differences did NOT occur by chance— implying an extreme likelihood they were written by two different people.

    By your own analysis, that only leaves the “AIDS” letter in question.

    (2) The “AIDS” letter is the only letter where the p-value is greater than 0.05 for all three columns (word length, letter frequency and punctuation.) However, the p value is still very low, 0.20. This means there is only a 20% likelihood that the word length, letter frequency and punctuation differences occurred by chance— implying that there is an 80% likelihood that they DIDN’T appear by chance— meaning an 80% chance that they were written by two different people.

    Far from convincing me that Ron Paul didn’t write these letters, the data that you present actually convinces me of the opposite.

    If Ron Paul actually wrote these items, I would expect a p-value across all three columns of at LEAST 0.50, and probably more like 0.80 or even 0.99. 0.20 is hardly convincing to me.

    What I would like to see would be for you to take half a chapter of Ron Paul’s book and compare it to the last half of that same chapter– presumably written by the same person. That would give you baseline figures for making a conclusion. Or, take one page of your blog and compare it to rest of your blog. Check the p-values comparing known apples to known apples. I really don’t know what kind of variation you will find, but this is a necessary baseline from which to draw conclusions.

    Again, very interesting analysis, although I feel that you are misinterpreting the data.

    I await your reply.

    • Pete Larson says :

      Sir,

      Thank you for your kind and thoughtful input.

      I have included the actual p-values for all tests. The software does not produce exact p-values for some reason, only thresholds of significance.

      Yes, I recognize that the letter frequency results may call my conclusions into question, though this is true only for the piece on AIDS. The other measures do indicate that there is reason to believe that Paul is the author of both the piece on AIDS and the piece on the coming race war.

      I have also included a comparison of an article on health care that is known to be written by Paul and compared it with the other works I have on hand. As you can see, the tests indicate that Paul is the likely author.

      I have also included a test of the entire text of my blog and compared it with Paul’s work. The test confirmed that Paul is not the author of this blog, a fact I have been aware of since 2007.

      Running the test of different portions of the same chapter as you suggest is difficult. The sample size of word frequency, etc of both halves will be quite small, hence my choice of comparing a work known to be written by Paul with the rest of the works in its place.

      You are correct in calling the results into question though I point out that no statistical test can confirm or refute allegations of authorship. I do not intend these results to be definitive, though I believe that the results are enough to keep a conversation on Paul’s accountability going.

      Clearly, I am not an impartial analyst here. This is a personal blog. I admit that I do not like Paul though I have made every attempt to be as impartial as possible when performing the statistical tests.

      Thank you again for your input. It is very much appreciated.

      Pete
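
The baseline check raised in this thread (comparing one part of a known text against the rest of the same text) can be sketched as follows. This is an illustration under assumptions, not what the Signature software does: it cuts a text into consecutive chunks rather than two halves, and it measures the gap with total variation distance between word-length shares, a swapped-in measure chosen for simplicity. All names are hypothetical.

```python
import re
from collections import Counter

MAX_LEN = 10  # assumption: words of 10 or more letters share one bin


def length_shares(words):
    """Share of words at each length from 1 to MAX_LEN."""
    counts = Counter(min(len(w), MAX_LEN) for w in words)
    return {k: counts[k] / len(words) for k in range(1, MAX_LEN + 1)}


def total_variation(p, q):
    """Half the summed absolute difference (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


def chunk_vs_rest(text, n_chunks=4):
    """Distance of each consecutive chunk from the remainder of the same text."""
    words = re.findall(r"[A-Za-z]+", text)
    if n_chunks < 2 or len(words) < n_chunks:
        raise ValueError("need at least two non-empty chunks")
    size = len(words) // n_chunks
    distances = []
    for i in range(n_chunks):
        chunk = words[i * size:(i + 1) * size]
        rest = words[:i * size] + words[(i + 1) * size:]
        distances.append(total_variation(length_shares(chunk), length_shares(rest)))
    return distances
```

The spread of these same-author distances gives a yardstick: a disputed text would need to sit well outside it before the gap says much about authorship. As noted in the reply above, short chunks make the distances noisy.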

  7. Louer Adun says :

    I am sorry, but your logic here is flawed. Just because the works examined used a similar number of words of the same length, does not prove or disprove authorship. I downloaded the free software and attempted the same test on papers written by me over the last few years. All my papers varied by more than the margins shown here, and they are all written by me.

    In fact if you take an average (create a corpus) of all my results and compare it to John Hamilton’s; using your logic, I wrote John Hamilton’s papers. This is an interesting tool, but definitely doesn’t prove anything one way or the other.

    Oh and for anyone interested, I created a corpus using about 25,000 words, so it was a solid sample size, probably larger than the author used for his Ron Paul corpus, but it’s hard to know since that information wasn’t given.

    • Pete Larson says :

      Sir,

      Fair enough. You are correct, there is nothing about a statistical test which can definitively determine who did or did not write the contents of Paul’s newsletters. We can only discuss differences between what is likely or unlikely.

      I also ran a number of nonsense tests and comparisons. I also found some variation.

      I intend for the presentation of the results to keep the conversation regarding Paul’s accountability alive.

      Thank you for your kind input.

      Pete

  8. nh@yahoo.com says :

    You say you “intend for the presentation of the results to keep the conversation regarding Paul’s accountability alive.” But really, the conversation here is about the accuracy of your statistical analysis.

    With actual regard to the accountability of Ron Paul, what is needed is substantive discourse, not speculative discourse echoed and transmuted over and over until people just accept it as fact. You may not agree with Ron Paul, but you should base your arguments in reality, and not noise.

    Watch this for a real, first-hand account of his disposition toward blacks (and healthcare):

    • Pete Larson says :

      First, I have no control over what kind of comments people leave here. Second, I came to my conclusions based on reality, namely Paul’s writings that are known to, in fact, be his own.

      I read them, I read “End the Fed” and have watched a number of videos of his speaking in public. I feel that I am versed on Paul at this point, much more than I have been on any other Presidential candidate. I am convinced that he did, in fact, write much of the content of those newsletters, despite what he may say. After reading his works, I can also confidently say that I do not support his candidacy. You are free to send me more videos if you think I need more convincing. I will certainly give them a chance.

      Perhaps you have not considered the very obvious possibility that Paul might be lying?

    • brainswithteeth says :

      I’m going to take issue with just two words you choose to use in your question for Pete: “speculative discourse.”

      Nevermind that the entirety of Paul’s newsletters are exactly what you are claiming as to be so damaging to the question of authorship, though, I’m not even so sure you are attempting to address that in your rejoinder. Instead, it feels as if you, as well as Pete, have a subjective opinion about the authorship issue. As someone who tries to objectively view such things, I have a feeling you are on the losing side. Let’s just get that out of the way.

      Now, I get that Paul may have changed his opinions, but the fact of the matter is that a man who is running for President, there just needs to be a standard of honesty that needs to at least have the appearance of being kept. Claiming that you have no idea who authored articles for your newsletter that you yourself made a substantial amount of money off of and gained an incredible amount of notoriety from just does not pass the smell test. Keep in mind, I’m operating on probably the same level you are: OK, I believe you, Ron Paul. However, I need to go one step further and then say: Then who did?

      I myself wrote a newsletter about arcane music that no one read, and I still remember the author for every piece from over a decade ago. It’s your baby. It would seem that one of your editorial duties would be to ensure you kept a paper trail on authorship, and if you could not confirm authorship, you shouldn’t be publishing the pieces.

      Make no mistake, the onus of responsibility here is on Paul to prove if he did not indeed write these pieces, then who did? By avoiding this question, it naturally leads to pieces like Pete’s to spring up which operate almost from an incredulous position of shock that responsibility is being so neglected. Getting angry at Pete is getting angry at the human condition. Paul is running for president and plain and simple he is side-stepping this issue, hoping it goes away. If I am to vote for Paul, I would need the answer of who wrote these pieces, and why in the world did you include them in your newsletters? Time may heal all, but my father never held these views. I’m less inclined to believe we need to be electing more guys like Strom Thurmond, regardless if they’ve “changed their ways.”

      So, when you say “speculative discourse” I would think you would at least appreciate the fact that Pete has in fact run some statistical analyses, rather than just trolling the comments’ section of numerous newspapers and blogs looking for anyone who disagrees with their preconceived and unwavering belief system and spends way too much time trying to control what other people say/type, rather than dealing with the underlying reason why they feel the need to have to control people.

      Either way, “discourse” assumes a lot, especially in an academic sense. However, I’m not so sure about how you are defining discourse, so before I put words in your fingers, could you please unload what you mean by “discourse”?

      • justamom says :

        A couple of completely subjective opinions that you have expressed during your “objective” analysis:

        a) “Yourself made a ‘substantial’ amount of money off of” How much did Paul actually make on an annual basis from this venture? What were his expenses compared to what he was making off of his medical practice at the same time? It seems reasonable that his medical practice was much more lucrative. Even if it was not in the early years, it certainly had a higher potential for profit. It would therefore justify occupying more of Paul’s time than publishing a special interest investment newsletter. A newsletter, which seems to most reasonable people as a way to profit from investment and economic research that one is interested in and enjoying as a hobby anyway. It was also a perfectly legal way to reduce taxable income by expensing (sic) all costs related to research & publishing a hobby newsletter. Remember, in those days, a small home based hobby business operating at a loss (meaning expenses were greater than profits) was an excellent & legal way to reduce taxable income on a much more lucrative venture.

        b) “an incredible amount of notoriety” Are you really saying that everyone subscribed to these newsletters? What exactly was the circulation and how much was a yearly subscription and how many complimentary subscriptions were issued and to whom?

        c) “It’s your baby” Quite an assumption that Paul’s publishing style while running a medical practice is the same as yours. Were you also running an 80 hour per week business while writing your newsletter? Did you hire an editor to handle those responsibilities for you? Did your newsletter have the same circulation and number of pages and issues per month/year? To be objective, we should compare like things, no?

        Your entire post, actually, is a good example of “speculative discourse”.

      • Pete Larson says :

        Your comment is a little difficult to read. I broke up the first point to make it a little more digestible. Actually, I’m not even sure who you are attempting to address here, me or Mr. Brains With Teeth? Perhaps you are a bit confused as to the difference in our identities. Regardless…

        a) I hope that you are implying that I make millions off this blog. It would make me feel a little more important than I actually am.

        As for b): I don’t know how much money Paul made, nor how large his circulation was. Some have claimed that revenues from the newsletter were more than one million dollars in 1993 alone. It’s up to you whether you wish to believe it or not, though I suspect that you are unaware of it.

        As for c) I did not say “It’s your baby,” though I do find it hard to believe that Paul, whose name it bore and whose reputation would suffer from inaccurate content, would not at least have given the newsletter his time on the toilet. If I ran a newsletter, or even had, for example, Mr. Teeth write a post for this blog, I would take the time to read and sign off on it.

        It’s not interesting that you wish to defend Paul; you appear to be a supporter. What is interesting, is that you (and others) assume that he is infallible and, if I may be so bold, super-human.

        I may have come to the conclusion that Mr. Paul did not write some of the more controversial texts that I was able to find (I assume you did not read parts 2 and 3 of this series), but he, in my opinion, is not to be removed from accountability for its contents. After all, they did have HIS name on it.

        I appreciate your comment,

        Pete

  9. Pete Larson says :

    I said “he did it” which I believe to be true. I explain why in this post. I then left the title to this post on NYT.

    I will leave the “disavowing” to Dr. Paul.

    Thank you for your time, though.

  10. Pete Larson says :

    Believe what you like, sir. You are entitled to your opinion.

    For transparency’s sake, Mr. DeSucre is taking issue with this text that I posted in the comments section of the NY Times:

    “He did it. Checking authorship of Ron Paul newsletters through text analysis.

    http://peterslarson.com/2011/12/23/determining-authorship-of-ron-paul-ne…”

    The word “prove” does not appear in the text, as Mr. Desucre initially suggested.

  11. brainswithteeth says :

    I think what would work here is instead of trying to control someone else’s behavior, that you take a page from someone like Nate Silver, and try to disprove his results.

  12. Jose Desucre says :

    Pete:

    In the NYT you posted “He did it”. And here, you posted: “nothing about what I have done definitively determines who did or did not write the contents of Paul’s newsletters.”

    Now, please, tell me if those two statements are consistent.

    • Pete Larson says :

      The first is a statement of belief. Based on the evidence, I believe that Paul wrote at least some of the disputed newsletters. Based on the evidence, I believe that Paul is lying about the entire affair.

      The second is a statement regarding the nature of certainty, note the key word “definitive.” Neither I, nor you, can know with 100% certainty whether Paul did or did not write those articles. Only Paul himself knows whether he did or did not write those articles with 100% certainty.

      Clearly, though, it is impossible to determine authorship even from his own statements. He could very well be lying, a possibility which seems to escape his supporters.

      Rather than being inconsistent, I believe that the two statements are, in fact, complementary to one another.

    • Pete Larson says :

      I do like Mr. Teeth’s suggestion. If you are so concerned with whether my allegations are accurate or not, why not investigate it yourself?

  13. JD Klein says :

    I’ve written this before, but my comment was rejected. I’ll try again. I am not a Paul supporter. I just dislike bad math.

    Pete Larson is misusing statistics in a fundamental manner. A small p value shows that two pieces of writing are probably dissimilar. A large p value does not, as Larson implies, show that two things are similar.

    The only way to attribute significance to a large p value is to perform a large number of trials, using similar size text samples, to see how rarely works by different authors produce a large p value compared against Ron Paul. Only then can one say that the large-p agreement between the disputed articles and Paul’s writings indicates authorship.

    To say with a confidence of 99% (ie, 1-1/100) that Paul is the author, you would have to perform tests against 100 similar text samples. But Pete Larson has, at best, performed a couple of casual tests. Not the large number required to make positive claims about authorship. I have refereed some dozens of ‘hard’ scientific papers, and I can say with confidence that Larson’s reasoning would not be accepted for publication.

    • Pete Larson says :

      Sir,

      I take your concerns and statements very seriously and agree completely. I would remind you, however, that this is a personal blog and not a refereed journal.

      That being said, I am compiling texts as you have indicated, as has been suggested to me by another reader.

      I emphatically state that I recognize your very correct statements here.

      Pete

      Typing on my phone, please excuse terseness and/or errors.

    • Pete Larson says :

      I just reread the post. While I appreciate your input, I believe that you are assigning far more weight to the post than is necessary.

      I intended the exercise as a 15 minute diversion addressing an aspect of the topic that others may not have considered, not as a rigorous analysis. I simply do not have time for that.

      I also believe that I am sufficiently careful to point out that the results do not, in and of themselves, definitively show that Paul is the author but, importantly, do not rule out the possibility that he is not. I am sorry if this was not sufficiently clear.

      This is a personal blog. I write about things of interest to me without concern over political bias; posts here represent my personal opinions, diversions and explorations of topics outside of the confines of my professional career. I now realize, however, that this strategy could, at some point, cause me harm, a realization that, to be honest, saddens me.

      I apologize for the rejection of your previous comment. I had thought I had approved it, but something must have gone wrong. I certainly appreciate your input.

      Thank you,

      Pete

      • JD Klein says :

        People were drawn here by a very strongly positive claim in the New York Times comments, stating that Paul is in fact the author. The weakened claims offered here are at odds with the initial claim, which was probably a bit overstated to attract readers to the blog.

        It is excusable to present a carefully worded casual analysis as an invitation for others to do the job correctly. The problem is with the far stronger claims posted at the NYT.

        I think that this sort of forensic detective work is very valuable. I’ve dabbled in it myself, and I’m a big admirer of Nate Silver’s efforts. I also think you MIGHT be onto something with respect to Paul, but it will need lots more work. It’s unfortunate that you don’t have the time for it, because time (and lots of work) is exactly what is needed for a good analysis. When I was mucking about with these methods (using my own codes, and statistical methods), I was struck at how awfully hard it was to make meaningful statements about authorship. I used the major NYT columnists as my sample set, collecting dozens of their columns. Sometimes I could tell the columnists apart statistically, and sometimes a columnist was inconsistent with him/her-self on different days. Often, the Gaussian or binomial or Poisson techniques used to assign p-values woefully understated systematic variations in an individual author’s writing, or in the influence of subject matter on the quantities computed. The most persuasive measures of authorship, I decided, are those that graphically plot a large number of samples on a 2d plane (eg the biggest components of a SVD/PCA decomposition, or similar). This lets the reader eyeball the systematic scatter. This wasn’t done here.

      • Pete Larson says :

        Yeesh. I’ll remember to be more careful when posting in comments to an NYT article. I certainly never foresaw this type of reaction. It would seem that a single link out of 391 comments to one article out of millions would be as insignificant as any post on this blog, but alas, that is not the case.

        I agree, this type of analysis merits further work. I wrote to the creators of the software to see if I could get some advice, but they also suggested obtaining a large number of controls. Somehow, that approach seems a bit unsatisfying.

        What are you referring to here:

        “The most persuasive measures of authorship, I decided, are those that graphically plot a large number of samples on a 2d plane (eg the biggest components of a SVD/PCA decomposition, or similar). This lets the reader eyeball the systematic scatter. This wasn’t done here.”

        Is there a specific software package (or plugin/library) you are referring to? It sounds like you’ve attempted this kind of thing before. Did you find any useful papers on the subject?

        Again, I appreciate your input.

        Pete

      • JD Klein says :

        A large number of controls is necessary because the statistics have to be empirical: we don’t know the underlying distributions.

        I don’t know of any packages; my experience was with my own code. Once you have a word parser and a library of articles (and some programming skill in a high-level language) you have a lot more leeway than when you use a canned package.

        The Wikipedia article on ‘Stylometry’ hints at how uncertain the field is.

        Some links I found: “Exploring Textual Data” in Google Books. But the best source I’ve found is a good 320-page review article: http://www.mathcs.duq.edu/~juola/papers.d/fnt-aa.pdf

        In particular, see Fig 5.1 for what I mean by plotting principal SVD components. And see Fig 5.2 for how two authors can be separated using this method. The example I saw elsewhere (and can’t find) was SVDs of Marlowe and Shakespeare, plotted as clusters, I think. The scatter of the clusters gave one a visual idea about the uncertainty of the method, as in Fig. 5.2.

        But note the chronological development of Henry James in Fig 7.1: authors change. This is what I mean by naive p-values being messed up by systematics.
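
        A rough way to see how much a naive binomial model understates the scatter is to compare the observed spread of a word’s rate across one author’s samples with the spread the binomial model predicts. The sketch below is only an illustration: the function name and the counts in the usage note are hypothetical, not taken from any of the analyses discussed here.

```python
import math

def overdispersion(counts, totals):
    """Ratio of observed to binomial-expected std dev of a word's per-sample rate.

    counts[i] is how often the word appears in sample i,
    totals[i] is the total number of words in sample i.
    A ratio near 1 means the binomial model fits; a ratio well above 1
    means the samples vary more than the binomial model allows.
    """
    rates = [c / n for c, n in zip(counts, totals)]
    pooled = sum(counts) / sum(totals)          # pooled rate over all samples
    mean_rate = sum(rates) / len(rates)
    observed_var = sum((r - mean_rate) ** 2 for r in rates) / (len(rates) - 1)
    # Average binomial variance of a rate estimated from n words.
    expected_var = sum(pooled * (1 - pooled) / n for n in totals) / len(totals)
    return math.sqrt(observed_var / expected_var)
```

        For example, with made-up counts of 10, 30, 20 and 40 occurrences in four 1000-word columns, overdispersion([10, 30, 20, 40], [1000] * 4) comes out above 2, meaning a binomial p-value would treat the author as far more consistent than the samples actually are.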

  14. JD Klein says :

    For the sake of completeness (and so others can see it more easily, rather than digging through a huge PDF), here is a reproduction of Fig 5.1 from the Juola PDF paper on authorship attribution: http://i.imgur.com/dGq2D.png

    This figure shows the separation of several major authors graphically, using the two strongest components from a PCA (Principal Component Analysis) of frequencies of certain selected ‘function’ words. On the one hand, one can see that the works of each author fall into a cluster distinct from the other authors. On the other hand, the separation is not crystal clear. For instance, the variation within Henry James is larger than the separation between Henry James and Edith Wharton. And this is with entire books as the working sample, not short articles!

    With respect to Ron Paul, perhaps one should break his work into (say) 5000-word chunks, and break the comparison articles into 5000-word chunks, and the disputed articles into 5000-word chunks, and make a similar plot, with one point for each chunk. If Ron Paul and the disputed articles fall into a tight cluster but the other plausible authors don’t, I think you’d have a good case.
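
    The chunk-and-plot idea can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the method behind the Juola figure: the function-word list is a short hypothetical one, the helper names are made up, and the two principal components are approximated with plain power iteration so that the sketch needs only the standard library.

```python
import math
import re

# Hypothetical function-word list; a real study would use a longer, vetted one.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "but", "it", "is"]

def chunk_words(text, size):
    """Split text into consecutive chunks of exactly `size` words (remainder dropped)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]

def function_word_freqs(chunk):
    """Relative frequency of each function word within one chunk."""
    return [chunk.count(w) / len(chunk) for w in FUNCTION_WORDS]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_direction(rows, iterations=200):
    """Approximate the top principal direction of centered rows by power iteration."""
    cols = len(rows[0])
    v = [float(j + 1) for j in range(cols)]  # arbitrary non-uniform starting vector
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        scores = [dot(r, v) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(cols)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left to find
            break
        v = [x / norm for x in w]
    return v

def project_2d(rows):
    """Return one (PC1, PC2) point per row of function-word frequencies."""
    cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(cols)]
    centered = [[r[j] - means[j] for j in range(cols)] for r in rows]
    pc1 = leading_direction(centered)
    # Deflate: remove the PC1 part of every row, then find the next direction.
    residual = [[x - dot(r, pc1) * p for x, p in zip(r, pc1)] for r in centered]
    pc2 = leading_direction(residual)
    return [(dot(r, pc1), dot(r, pc2)) for r in centered]
```

    To use it, one would build one row per chunk with function_word_freqs(c) for every c in chunk_words(text, 5000), keep track of which source each row came from, pass all rows to project_2d, and scatter-plot the resulting points with one colour per source. The sign of each axis is arbitrary, so only the clustering matters, not which side a cluster lands on.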

    • Pete Larson says :

      Mr. Klein,

      Please see my new post. I attempted some of what you indicated in the time I had.

      http://peterslarson.com/2011/12/31/determining-authorship-of-ron-pauls-newletters-through-text-analysis-part-2/

      Thank you for your help,

      Pete

      • JD Klein says :

        I think that this sort of analysis is what is necessary. The new conclusion seems to be that we can’t tell much from the writing samples.

        A few more thoughts (partly contained in my comment on your 2nd blog post):

        Word length may be a very weak way of distinguishing authors. It’s the easiest one to use, but it doesn’t seem like it contains much information. A better choice might be non-subject-specific vocabulary, words like ‘of’ and ‘but’ that connect ideas, but are not specific to any subject. Also, the frequency of commas indicates subclauses. If I wanted to distinguish between Hemingway and Faulkner, I’d probably look at sentence length and comma counts, not word length.
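
        As a small illustration of those alternative measures, the sketch below computes sentence length and comma rate next to mean word length for a single text. The function name and the sentence-splitting rule (split on ., ! and ?) are simplifying assumptions; abbreviations and quoted speech would need more careful handling.

```python
import re

def style_features(text):
    """Return a few crude style measures for one text sample."""
    # Naive sentence split on terminal punctuation; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        raise ValueError("text contains no words")
    n_words = len(words)
    return {
        "words_per_sentence": n_words / len(sentences),
        "commas_per_100_words": 100 * text.count(",") / n_words,
        "mean_word_length": sum(len(w) for w in words) / n_words,
    }
```

        Run on chunks from each candidate author, these numbers could be plotted or tabulated in the same way as word length, which would show whether they separate authors any better.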

  15. steve says :

    It just came out today that the author of “How to Protect Yourself Against Urban Violence” is James B Powell, not Ron Paul. Take a look at this YouTube link: http://www.youtube.com/watch?v=cDKvOlV-Pvo&feature=g-all-u&context=G266a42eFAAAAAAAABAA
