Wikipedia: Sources and Methods

By sashi on 8-18-2018 in Wikipedia

How tweet it is….

This project began when I noticed a badly spun tweet being added to a biography on Wikipedia, sourced to a click-baity headline from Politico. Now, a month later, the decontextualized tweet has been removed after much discussion, and an exclusive article the subject of the biography had written for the Daily Mail has been disappeared without any discussion. The biographical entry remained on full-protect lockdown all throughout (§), because earlier manipulation of the article had led to bad press for Wikipedia and an Arbcom case.¹

This affair, along with recent highly publicized furors about public figures’ pithy snark, got me wondering just how many tweets were sufficiently notable to be included in Wikipedia. A fellow exile taught me the proper syntax for searching inside citation templates (insource:"web.site") and, ever since, I’ve enjoyed watching the unexpected portrait of an elephant emerge as I investigate the source-linking data.

Blind monks examining an elephant, Hanabusa Itchō (1652–1724)

There were 35,735 links to Twitter in the elephant’s belly that day. Since then it has been fed just under a dozen tweets a day, so by now the number will have grown to over thirty-six thousand. No worries, though: the internal pressure has simultaneously been reduced each day by shedding a half dozen links to the Daily Mail. (This is because 50 people decided back in February 2017 that this news outlet should be banned from Wikipedia (§), at least in part because of its click-baity headlines.)
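
The growth claim is easy to check by hand. Here is a minimal sketch in Python, assuming that "just under a dozen" means about 11 new links a day (12 is shown as an upper bound); the function name is only illustrative.

```python
import math

BASELINE = 35_735  # Twitter links counted on the first day (figure from the article)
TARGET = 36_000    # "over thirty-six thousand"


def days_to_pass(target: int, baseline: int, per_day: float) -> int:
    """Whole days of steady growth until the count strictly exceeds the target."""
    return math.floor((target - baseline) / per_day) + 1


# 11 per day is an assumed reading of "just under a dozen"; 12 is the upper bound.
for rate in (11, 12):
    print(rate, "links/day ->", days_to_pass(TARGET, BASELINE, rate), "days")
```

At either rate the count passes thirty-six thousand in a little over three weeks, which fits the month that had gone by.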

The English-language Wikipedia indulges in tweets much more than some other languages do. While Spanish Wikipedia does link to Twitter almost 30% as often, both German and French Wikipedia have limited themselves to fewer than a tenth of the links to Twitter the English site currently shows. But just how important a source does 36 thousand links make Twitter in the pecking order of sources on English Wikipedia anyway?

(note: comparisons have been rounded to the nearest whole number and are based on July-August 2018 numbers)

WikiSource Index

- Factor by which theguardian.com is more linked to than twitter.com: 3
- Factor by which nytimes.com is more linked to than theguardian.com: 2
- Percentage by which links to nytimes.com outnumber its nearest competing news outlet (bbc.co.uk): 31
- Factor by which the total number of links to nytimes.com is dwarfed by the most frequently linked source (archive.org): 7
- Ratio of twitter.com links to links to the Australian educational domain (edu.au): 1:1
- Percentage by which youtube.com is more frequently pointed to than the UK educational domain (edu.uk): 23
- Ratio of links to tronc.com papers as compared to twitter.com: 7:2
- Percent of the 124,483 links to tronc.com newspapers represented by the LA Times: 50²
- Percent of the 124,483 tronc.com targets that Europeans can read without using a proxy: 0
- Percent of NY Daily News photographers fired in tronc.com’s July 2018 budget slash: 100 (§)
- Number of links to the Church of England’s website: 660
- Relative frequency of links to The Daily Beast and to the C of E: 8:1
- Number of links (in thousands) to 3 Mormon organizations: 12
- Factor by which this exceeds the number of links to vatican.va: 3
- Number of links (in thousands) to 3 brands located at the same address in Lehi, UT: 49
- Number of links (in thousands) pointing to 1 company at 1 Hacker Way, Menlo Park: 56
- Number of links (in thousands) pointing to 1 company at 1 Infinite Loop, Cupertino: 31
- Ratio of apple.com links to:
  - links targeting samsung.com: 60:1
  - links targeting ten of its legacy competitors: 2:1
  - links to Amazon: 1:2
  - links to Amazon if the Internet Movie Database is added to Amazon’s numbers: 2:9
  - links to Google: 1:20
- Relative population of the People’s Republic of China (CN) and the Republic of China (TW): 59:1
- Relative frequency of links to gov.cn and to gov.tw: 2:1
- Factor by which Google Books is more frequently linked to than the Library of Congress: 15
- Percentage by which links to Breitbart exceed the number of links to icij.com (the consortium responsible for the Panama Papers & the Paradise Papers): 244
- Relative population of Russia (RU) and Ukraine (UA): 16:5
- Relative frequency of links to gov.ru and to gov.ua: 2:5
- Percentage by which links to nato.int outnumber links to tass.com: 35
- Relative population of India (IN) and Canada (CA): 36:1
- Relative frequency of links to gov.in and to gc.ca: 1:1
- Relative population of Singapore (SG) and Bangladesh (BD): 1:29
- Relative frequency of links to gov.sg and to gov.bd: 3:2
- Number of links (in thousands) by which a dozen video-game sites taken together surpass the entire UK government domain (gov.uk): 3
- Factor by which the number of links to the US armed forces exceeds those pointing to the House, Senate, White House & Supreme Court: 2
- Factor by which the number of links to the 10 most cited social media sites exceeds those pointing to the US armed forces: 7
- Percentage by which links to these same 10 sites outnumber links to the nytimes.com: 30
- Number of Wikipedia entries (in thousands) tagged as completely unsourced: 196³
- Number of times wiki/Wikipedia:Wikipedia_Signpost was cited in a template in mid-August 2018: 18
- Number of times both Wikitribune and The Gateway Pundit were (a bit earlier in August): 7
- Number of times Wikipediocracy was: 3
- Number of days Wikipedia pointed to its page “Enemy of the People” in a special “see also” section of a public official’s BLP: 14 (§)
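
The factors, percentages, and ratios in the index are all rounded comparisons of raw hit counts. Here is a minimal sketch of that arithmetic in Python, using the only two raw counts the article quotes (Twitter and tronc.com). The helper names are hypothetical, and capping ratio terms at 9 is an assumption about how the small ratios were chosen, not the author's stated method.

```python
from fractions import Fraction

TRONC_LINKS = 124_483   # links to tronc.com papers (from the index)
TWITTER_LINKS = 35_735  # links to twitter.com (from the opening section)


def small_ratio(a: int, b: int, max_term: int = 9) -> str:
    """Approximate a:b by the closest ratio whose second term is at most max_term."""
    approx = Fraction(a, b).limit_denominator(max_term)
    return f"{approx.numerator}:{approx.denominator}"


def factor(a: int, b: int) -> int:
    """How many times larger a is than b, rounded to a whole number."""
    return round(a / b)


def percent_more(a: int, b: int) -> int:
    """Percentage by which a exceeds b, rounded to a whole number."""
    return round((a - b) / b * 100)


print(small_ratio(TRONC_LINKS, TWITTER_LINKS))  # 7:2, matching the index entry
print(factor(TRONC_LINKS, TWITTER_LINKS))
print(percent_more(TRONC_LINKS, TWITTER_LINKS))
```

The two counts may not have been taken on the same day, so the match with the 7:2 entry is only approximate.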

Social Media

This is what the relevant Wikipedia editorial guideline has to say about blogs, tweets, Facebook posts, and other user-generated content:

[S]elf-published media are largely not acceptable. Self-published books and newsletters, personal pages on social networking sites, tweets, and posts on Internet forums are all examples of self-published media.

Content from websites whose content is largely user-generated is also generally unacceptable. Sites with user-generated content include personal websites, personal blogs, group blogs, internet forums, the Internet Movie Database (IMDb), Ancestry.com, content farms, most wikis including Wikipedia, and other collaboratively created websites.

Wikipedia:Identifying Reliable Sources (commonly abbreviated WP:RS)

The four sites specifically mentioned above (Twitter, IMDb, Ancestry.com, & Wikipedia itself) are all among the 100 most frequently linked sites on English Wikipedia (#13, #37, #91 & #19 according to an insource Wikipedia query). Wikipedia links to itself much more often than to any of the others, but not necessarily as a reference. Upon looking at the occurrences the insource query turns up, one forum member characterized them as:

A collection of stuffed up templates, hidden comments, and circular citations, with a few bollocked internal links in the wrong format for good measure!

Dysklyver: source

Die Zwitscher-Maschine, Paul Klee, 1922

It is primarily, though not exclusively, for this reason that this source has been struck through in the spreadsheet accompanying this article (like so: ~~Wikipedia~~). Google says Wikipedia points to itself 15.8 million times, which is much more realistic, given the syntax of [[internal links]]. This sort of Wikipedia-as-social-media-source is frequently gamed, as the last member of the index above (“enemy of the people”) suggests. Other social media addresses, like those in the next paragraph, lead to citation templates, rather than to hidden comments, so I consider the results from the insource:"wikipedia.org" search to be an anomaly due to the internal linking syntax.

The top-ten social media sources on the Index represent nearly 300,000 links, and adding such sites as LiveJournal (3,111), Google Groups (2,961), Baidu (2,541), Medium (2,472), Wikia (1,864), Reddit (1,815), Patheos (732), TV Tropes (587), DeviantArt (547), and Yelp (532) should help to bring that figure over the 300K mark quite soon. It would seem that there is either a problem with Wikipedian sourcing or a problem with the policy not reflecting current practice.
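
To see how far the ten additional sites move the total, the counts quoted above can simply be summed. This is a small Python sketch. The exact top-ten total is not given in the article, so the code only reports what that total would need to be for the combined figure to reach 300,000.

```python
# Link counts for the ten additional sites, as quoted in the paragraph above.
EXTRA_SITES = {
    "LiveJournal": 3_111,
    "Google Groups": 2_961,
    "Baidu": 2_541,
    "Medium": 2_472,
    "Wikia": 1_864,
    "Reddit": 1_815,
    "Patheos": 732,
    "TV Tropes": 587,
    "DeviantArt": 547,
    "Yelp": 532,
}

MARK = 300_000  # the "300K mark"

extra_total = sum(EXTRA_SITES.values())
needed_from_top_ten = MARK - extra_total  # top-ten total required to reach the mark

print(f"extra sites add {extra_total:,} links")
print(f"top ten would need at least {needed_from_top_ten:,} links")
```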

But I’ve said enough; I would like you to save some energy for the sourcelist! In it, you will find answers to all your sourcing data questions! So, to whet your appetite:
- Which sport do you think is the best represented on Wikipedia? basketball? football! it’s football already! insider baseball? cricket? curling? Watch for the pale yellow highlights as you scroll.
- Which churches? Scientology? AME? Watch for the pale orange-rose highlights.
- Which benefactors? GlaxoSmithKline? Mozilla? Goldman Sachs? Apple? Google? … ? ⁴
- Which American university is the most linked to? Harvard? UC Berkeley? Brigham Young? Stanford? MIT? Watch for the green highlights…

References

The complete list of Wikipedia sources studied so far (categorized version). Feel free to comment on the article with any major (or minor) oversights (the latter version helps to spot them).

Methodology

For each website, three queries have been run: an insource:"web.site" search at Wikipedia (article namespace), a basic search at Wikipedia for "web.site" (article namespace), and a search at Google for site:en.wikipedia.org +"web.site". This last search includes all namespaces that Google is allowed to search, including most article talk pages and their archives, & the Wikipedia namespace. Throughout this article I’ve spoken about “links” and not “references” because this is a brute search for a string of characters: some occurrences of the string may appear in the “source” field of a reference, while others may occur in the “url” field. Insofar as the search process is far from infallible, I’ve avoided searching for terms that could occur frequently in-text (e.g. scoop.it), though the data concerning some sites like gov.in should be viewed with some skepticism.
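
For readers who want to repeat the counts, the three queries can be generated mechanically. What follows is a minimal Python sketch, not the procedure actually used for this article (the searches may well have been run by hand in the search box). It assumes the public MediaWiki search API at en.wikipedia.org/w/api.php with its `list=search`, `srsearch`, `srnamespace`, and `srinfo=totalhits` parameters, and it assumes the Google query reuses the same web address string. The function names are hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"  # English Wikipedia's MediaWiki API


def build_queries(site: str) -> dict:
    """Return the three search strings described above for one web address."""
    return {
        "insource": f'insource:"{site}"',              # string inside the page source
        "basic": f'"{site}"',                          # plain full-text search
        "google": f'site:en.wikipedia.org +"{site}"',  # to be pasted into Google
    }


def wikipedia_search_url(search: str) -> str:
    """Build an API URL that asks for the hit count in the article namespace."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": search,
        "srnamespace": 0,       # 0 = article namespace
        "srinfo": "totalhits",  # include the total number of matches
        "srlimit": 1,           # the results themselves are not needed
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


queries = build_queries("twitter.com")
url = wikipedia_search_url(queries["insource"])
print(url)
```

Fetching that URL should return JSON with the count under `query.searchinfo.totalhits`. The number reflects the search index at the moment of the request, so repeated runs will drift.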

It is well, too, to keep in mind that you can’t query the same Wikipedia twice: on average 32 citation templates per hour are added to the site…

Finally, with the exception of the Oxford & Cambridge university presses (included for the sake of comparison), I have only searched for web addresses. Many of the best entries are referenced primarily to published books, which, regardless of publisher, tend to be linked to Google Books. A look at the citations for the featured entry on the Balfour Declaration should serve as a useful point of comparison.

Footnotes

¹ a BBC interview with George Galloway (§),
the ArbCom evidence phase of the Philip Cross “BLP issues” case (§),
coverage of the case on the Wikipediocracy forum (§),
an outlaw thread about the Daily Mail (HTD headgear recommended) (§).
² On June 18th 2018, tronc.com announced that the sale of the LA Times and the San Diego Union-Tribune had been completed, arguing that the relief from pension liabilities put them in a considerably stronger financial position. (§) As of August 24, 2018, Europeans still land on the GDPR page hosted by tronc.com if they click on a Times or Union Tribune link.
³ According to “Category: All Articles lacking source” (§)
⁴ anony-zakat (§).

Zeitgeist: Opération Blockhaus

Peter Hitchens, Spectator, “War of Words: my battle to correct Wikipedia” (§), August 2018.

Annalisa Merelli, Quartz, “Seeking Disambiguation: Running for office is hard when you have a porn star’s name. This makes it worse” (§), 18 August 2018

Acknowledgements

I would like to thank those members of Wikipediocracy (where this article is also hosted), of Wikipedia Sucks (and so do its critics), and of the Gender Desk blog who either encouraged me in this folly of an enterprise, offered suggestions on earlier drafts, or included sources in their posts that wound up being among those listed in this study. Finally, I would like to close by acknowledging that, concerning methods, I have left much of the work to the imagination of the reader.
