Things Internet Researchers Should Know About Search Engines

This page lists useful tips for doing research with search engines, particularly in combination with the Search Engine Scraper and the Lippmannian Device.

Good query design

See our lecture on query design.
  • Consider what it takes to turn search into research.
  • Look into the search operators the search engine supports (here are Google’s, for example).
  • Use quotes around every word that needs to be included literally, as some engines (such as Google) may:
    • make automatic spelling corrections
    • personalize search by using information such as sites visited before
    • include synonyms of search terms (matching “car” when you search [automotive])
    • find results that match terms similar to those in the query (finding results related to “floral delivery” when searching [flower shops])
    • search for words with the same stem, e.g. “running” when [run] was submitted
    • make some of the terms optional, like “circa” in [the scarecrow circa 1963]
  • Some queries might result in overly fresh results, or as Google puts it: “Search results, like warm cookies right out of the oven or cool refreshing fruit on a hot summer’s day, are best when they’re fresh”
  • Take into account transliterations: “osama bin laden” vs “osama ben laden” vs the Arabic spelling. Wikipedia’s articles in other languages, for example, can help you find alternative spellings.
  • Discrete and underspecified search terms often work well
  • When noting down queries in a research report one can use brackets. E.g. we queried [HIV] in google.co.uk and later refined the query to [“AIDS”] so that no synonyms are included.
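
The quoting and bracket conventions above can be sketched in a few lines of Python. This is only an illustration: the helper names exact_query and as_reported are hypothetical and not part of the Search Engine Scraper or the Lippmannian Device.

```python
def exact_query(terms):
    """Wrap every term in double quotes so the engine matches it literally."""
    # Strip quotes the researcher may already have typed, then re-quote.
    return " ".join('"{}"'.format(term.strip('"')) for term in terms)


def as_reported(query):
    """Format a query in the bracket notation used in research reports."""
    return "[{}]".format(query)


if __name__ == "__main__":
    query = exact_query(["AIDS"])
    print(query)               # the string to submit to the engine
    print(as_reported(query))  # the notation for the write-up
```

Keeping the submitted string and the reported notation separate avoids accidentally sending the brackets to the engine.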

Disentangling the researcher from the results

See our video tutorials on analyzing engine results and localizing web sources.

When using the Search Engine Scraper or Lippmannian Device with our Firefox toolbar, the researcher needs to take a few steps to ensure that day-to-day activities do not interfere with research.
  • Consider installing a separate version of Firefox, a so-called research browser, used solely for research purposes.
  • In the research browser, make sure to log out of any services that may be linked to the search engine. See our video on setting up Google for research; this applies to other engines as well, e.g. Bing may be linked to your Microsoft account if you have one.
    • Even when logged out, a search engine may personalize results based on previously stored cookies. For the most neutral search results, clear your cookies before searching, or configure Firefox to not allow cookies at all.

Search Engine Peculiarities

The search engine scraper tools allow searching with a search engine of your choice. Most search engines have some peculiarities, which are useful to keep in mind while scraping and analysing results:

Baidu

  • Baidu is a Chinese search engine, and thus focuses on Chinese sites and results.
  • Baidu does not show the URLs of found sites in its results, but rather a redirect URL. If it is the URLs you are interested in, you will need to resolve those redirects yourself.
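
Resolving such redirect URLs could be sketched as follows, using only Python's standard library. This is a minimal sketch under assumptions: the function name resolve_redirect is hypothetical, it assumes the redirect is a plain HTTP redirect, and Baidu may require additional headers or block automated requests.

```python
import urllib.request


def resolve_redirect(url, open_url=urllib.request.urlopen, timeout=10):
    """Follow HTTP redirects and return the final URL, or None on failure.

    open_url is injectable so the function can be tried without network access.
    """
    try:
        # urlopen follows HTTP redirects; geturl() reports where we ended up.
        with open_url(url, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers malformed URLs.
        return None
```

When resolving many URLs, it is worth pausing between requests and recording the ones that return None so they can be checked by hand.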

Bing

  • Bing supports many search engine operators.
  • Many of Bing's search query customizations (e.g. searching by region or tweaking safe search settings) are set via a cookie. This method of customizing a query is currently not supported by the scraper tools, but you can go to https://www.bing.com/account/ to change search settings. Make sure to do this after you've cleared your cookies.
  • The way Bing indicates result count estimates is a little complicated (it has various ways of phrasing the estimate, e.g. "approximately 5000 results" and "result 50-100 of 5000"). The scraper will attempt to parse this value, but if results look off it is a good idea to double-check.
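
To illustrate why parsing the estimate is fragile, here is a minimal sketch that handles only the two phrasings quoted above. It is not the scraper's actual parser; the function name parse_result_count is hypothetical, and other phrasings or languages would need extra patterns.

```python
import re


def parse_result_count(text):
    """Extract the total result estimate from a result-count phrase, or None."""
    cleaned = text.replace(",", "")  # drop thousands separators such as 5,000
    # Phrasing like "result 50-100 of 5000": the total follows "of".
    match = re.search(r"\bof\s+(\d+)", cleaned)
    if match is None:
        # Phrasing like "approximately 5000 results": the total precedes "results".
        match = re.search(r"(\d+)\s+results?\b", cleaned)
    return int(match.group(1)) if match else None
```

A return value of None signals that the phrase was not recognised, which is a good moment to double-check the results page manually.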

DuckDuckGo

  • See the DuckDuckGo help for a list of supported operators.
  • DuckDuckGo does not offer (an estimate of) the total number of results.

Google

  • A list of Google-supported query operators.
  • A CAPTCHA is triggered after a number of results when scraping results automatically (i.e. via a tool).
  • The number of results varies from request to request. It usually stays approximately the same, but may fluctuate by a small amount.
  • It is virtually never possible to get as many results as Google initially estimates, even when using a normal browser.
  • Google personalizes results to a great extent. For the best results, log out of all Google services, clear cookies, and opt out of further personalization.
    • Results are further personalized based on locale (i.e. location and language). This is difficult to avoid entirely, but in extreme cases you may want to consider using a browser in a different language.
  • Google publishes a transparency report on search result removals. This may give some insights in case of conspicuously 'missing' search results.
  • If a non-English language is selected, Google may include translated English language results.
  • How Google decides what ‘nationality’ a site has: its geotargeting factors include the ccTLD, geotargeting settings for gTLDs (set in Webmaster Tools), server location, and other signals (such as addresses and phone numbers). At the bottom of this list is a list of local domain Googles.

Naver

  • This is a Korean search engine, and will therefore be particularly biased towards (South) Korean sites and results.

Yahoo Japan

  • It is worth emphasizing that this is not the same search engine as the international Yahoo you are probably more familiar with. Yahoo Japan is a separate company and search engine, focused on the Japanese market.
  • Like Baidu, Yahoo Japan results link to an internal redirect URL rather than the actual page URL. Unlike Baidu, these redirects contain the actual URL, so the scraper extracts these, and there is no need to do this manually.
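
Extracting a target URL that is embedded in a redirect URL could look roughly like the sketch below. It does not reproduce the scraper's code or the real Yahoo Japan redirect format: the function name embedded_url and the redirect.example URLs are hypothetical, and the sketch simply assumes the target appears percent-encoded in either the query string or the path.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def embedded_url(redirect_url):
    """Return the first URL found inside a redirect URL, or None."""
    parts = urlsplit(redirect_url)
    # Case 1: the target is the value of a query parameter (parse_qsl decodes it).
    for _name, value in parse_qsl(parts.query):
        if value.startswith(("http://", "https://")):
            return value
    # Case 2: the target is percent-encoded somewhere in the path.
    decoded_path = unquote(parts.path)
    for scheme in ("https://", "http://"):
        index = decoded_path.find(scheme)
        if index != -1:
            return decoded_path[index:]
    return None


if __name__ == "__main__":
    print(embedded_url("https://redirect.example/r?u=https%3A%2F%2Fexample.org%2Fpage"))
```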

Yandex

  • Yandex also allows you to use a number of search operators.
  • CAPTCHAs are triggered rather easily, which makes scraping large numbers of results more difficult (or at least time-consuming).
  • Yandex does not offer an easily parseable estimate of the total number of results. The scraper will attempt to parse the number, but it may not be able to do so, e.g. if due to your location or browser Yandex returns results in a language that isn't English.

Analysing results

See our video tutorials on analyzing engine results and localizing web sources.

Harvesting and triangulating results

The symbiosis of search results and Wikipedia

  • Wilkinson and Huberman (2007) find evidence of a direct correlation between an article's visibility (measured by its Google PageRank) and the number of edits that article receives. See Wilkinson, D.M., and B.A. Huberman, 2007. “Assessing the value of cooperation in Wikipedia,” First Monday.
  • Google PageRank has been shown to correlate strongly with the number of times a Wikipedia page is viewed. See Spoerri, A., 2007. “What is popular on Wikipedia and why?,” First Monday.

Algorithm changes

Google

The table below lists Google algorithm changes that have resulted in new or changed modes of research that were not possible before the change, from the first named and confirmed update (Boston, 2003) until June 2015. The timeline is by no means exhaustive: Google changes its algorithm 500-600 times per year. While most of these changes are minor, others are ‘major’ in that they have the biggest impact on (re-)search. The selection is drawn from the work of SEO consultancy MOZ, which keeps track of these major algorithm changes by tracking changes in results for a set of queries with their ‘Rank Tracker’ tool, community submissions and updates reported by Google. Table adapted from Weltevrede, Esther (2016). Repurposing digital methods: The research affordances of platforms and engines. Ph.D. dissertation, Amsterdam, NL: University of Amsterdam (p. 120).

| Year | Update name | Update type | Key Google algorithm change |
| 2003 | Boston | Anti-manipulation / Quality | More emphasis on quality back-links |
| 2003 | Cassandra | Anti-manipulation / Quality | Cracking down on link-quality issues, such as co-linking from domains, hidden text & hidden links |
| 2003 | Dominic | Anti-manipulation / Quality | Improving on counting and reporting backlinks |
| 2003 | Esmeralda | Infrastructure | Improvements on the index infrastructure |
| 2003 | Fritz | Infrastructure | Improvements on the index infrastructure |
| 2003 | Supplemental Index | Anti-manipulation / Quality | Update splitting off results of lesser quality into the "supplemental index" |
| 2003 | Florida | Anti-manipulation / Quality | Crack-down on low-value late 90s SEO tactics, like keyword stuffing |
| 2004 | Austin | Anti-manipulation / Quality | Crack-down on SEO tactics, including deceptive on-page tactics such as invisible text and META-tag stuffing |
| 2004 | Brandy | Semantic / Query | Latent Semantic Indexing (LSI), anchor text relevance, synonyms and keywords, introduction of the idea of link "neighbourhoods" |
| 2005 | Allegra | Anti-manipulation / Quality | Crack-down on suspicious-looking links |
| 2005 | Bourbon | Anti-manipulation / Quality | Improvements in how duplicate content and non-canonical (www vs. non-www) URLs were treated |
| 2005 | Personalized Search | Personalization / Social | Results take users' search histories into account |
| 2005 | Jagger | Anti-manipulation / Quality | Crack-down on low-quality links, including reciprocal links, link farms, and paid links |
| 2005 | Google Local/Maps | Local | Maps data is integrated into the Local Business Center |
| 2005 | Big Daddy | Infrastructure | Infrastructure update changing the way URL canonicalization, redirects, and other technical issues are handled |
| 2006 | Supplemental Update | Anti-manipulation / Quality | Change to the supplemental index and how filtered pages were treated |
| 2007 | Universal Search | Universal | Integrating traditional search results with News, Video, Images, Local, and other verticals |
| 2007 | Buffy | Semantic / Query | Update to single-word search results and other small changes |
| 2008 | Dewey | Universal | Unspecified update to the index, reportedly pushing Google's own internal properties, including Google Books |
| 2008 | Google Suggest | Semantic / Query | Update displaying suggested searches in a dropdown below the search box and later powering Instant |
| 2009 | Vince | Trust | Big brands get a boost in the results |
| 2009 | Real-time Search | Real-time / Freshness | Twitter feeds, Google News, newly indexed content, and other sources were integrated into a real-time feed on some SERPs |
| 2010 | Google Places | Local | "Places", originally only a part of Google Maps, was now integrated more closely with local search results |
| 2010 | May Day | Anti-manipulation / Quality | Crack-down on low-quality pages ranking highly for long-tail searches |
| 2010 | Caffeine | Real-time / Freshness | Launch of new web indexing infrastructure resulting in a 50% fresher index |
| 2010 | Brand Update | Trust | Same domains are allowed to appear multiple times on a SERP |
| 2010 | Google Instant | Semantic / Query | Displaying search results as a query was being typed |
| 2010 | Social Signals | Personalization / Social | Social signals are included in determining ranking, including data from Twitter and Facebook |
| 2010 | Negative Reviews | Trust | Update to ranking based on negative reviews |
| 2011 | Panda | Anti-manipulation / Quality | Crack-down on thin content, content farms, sites with high ad-to-content ratios, and a number of other quality issues |
| 2011 | Freshness Update | Real-time / Freshness | Update primarily affecting time-sensitive results, signaling a much stronger focus on recent content |
| 2012 | Search + Your World | Personalization / Social | Update pushing Google+ social data and user profiles into SERPs |
| 2012 | Venice | Local | More localized organic results and tighter integration of local search data |
| 2012 | Penguin | Anti-manipulation / Quality | Crack-down on spam factors, including keyword stuffing and link schemes |
| 2012 | Knowledge Graph | Semantic / Query | Rolling out a SERP-integrated display providing supplemental information about certain people, places, and things |
| 2012 | Exact-Match Domain (EMD) Update | Anti-manipulation / Quality | Crack-down on low-quality websites that have search terms in their domain names |
| 2012 | DMCA Penalty ("Pirate") | Anti-piracy | Crack-down on software and digital media piracy |
| 2013 | In-depth Articles | Universal | New type of result, dedicated to more evergreen, long-form content |
| 2013 | Hummingbird | Semantic / Query | Core algorithm update that powers changes to semantic search and the Knowledge Graph |
| 2014 | Payday Loan | Anti-manipulation / Quality | Crack-down on spammy queries |
| 2014 | Pigeon | Local | Altering local results and modifying how location cues are handled, creating closer ties between the local and core algorithm(s) |
| 2014 | HTTPS/SSL Update | Trust | Giving preference to secure sites |
| 2014 | Authorship Removed | Trust | Authorship bylines disappearing from all SERPs |
| 2014 | "In The News" Box | Universal | Change to News-box results expanding news links to a much larger set of potential sites |
| 2014 | Pirate 2.0 | Anti-piracy | Crack-down on software and digital media piracy |
| 2015 | Mobile Update / "Mobilegeddon" | Mobile | Mobile friendliness becomes a stronger ranking factor for mobile searches |
| 2015 | The Quality Update | Anti-manipulation / Quality | Core algorithm change impacting "quality signals" |

Other useful observations

| Period of observance | Observation | Reference / example | Google service |
| 2002 - 2012 | The maximum number of results served by Google is 1000. In this example query Google indicates it has indexed about 226,000,000 results, while one cannot click beyond result 874. | http://www.google.com/search?q=Things+one+should+know+about+google&hl=en&client=firefox-a&rls=org.mozilla:en-US:official&hs=6bK&num=100&start=900&sa=N | all |
| 2008 - 2012 | Google Trends is based on 'successful queries.' | How does Google Trends for Websites work? When you enter the address of a website into the search box, Trends for Websites shows you a graph reflecting the number of daily unique visitors (the number of people who visit a website) to that website. http://www.google.com/intl/en/trends/websites/help/index.html#1 | trends |
| 2004 - 2012 | Screen scraping Google might get you blocked. | DMI Google scraping experience | all |
| 2004 - 2012 | The US version of Google (google.com) returns the most "international" results. | You can also go to http://google.com/ncr (No country redirect) | google web search |
| 2004 - 2012 | Cheat sheet / search operators | http://www.google.com/help/cheatsheet.html | google web search |
| 2004 - 2012 | Cheat sheet / search operators | http://jwebnet.net/advancedgooglesearch.html | google web search |
| 2004 - 2012 | Cheat sheet / search operators | http://www.internettutorials.net/boolean.asp | google web search |
| 2012 | Cheat sheet / URL parameters | http://code.google.com/intl/en/apis/customsearch/docs/xml_results.html |  |
| - 2012 | Different results are returned when one is logged in. |  | all |
| 2002-2012 | The maximum number of results returned by Google per request (results page) is 100. | Add &num=100 to the URI | all |
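
The num and start parameters from the example URL above can be combined to page through a result set. The sketch below only builds the URLs. The function name google_page_urls is hypothetical, the parameters are taken from the 2002-2012 observations and may behave differently today, and requesting these URLs automatically may trigger a CAPTCHA or a block.

```python
from urllib.parse import urlencode


def google_page_urls(query, per_page=100, max_results=1000, language="en"):
    """Build one search URL per results page using the q, hl, num and start parameters."""
    urls = []
    for start in range(0, max_results, per_page):
        params = {"q": query, "hl": language, "num": per_page, "start": start}
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in google_page_urls("Things one should know about google"):
        print(url)
```

With the defaults this yields ten URLs, matching the observations of at most 100 results per page and at most 1000 results per query.
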
Topic revision: r21 - 02 Oct 2018, StijnPeeters