Thanga

Wednesday, September 10, 2008

Data Mining Using Google

Google has quickly become one of the best-known words in the world and is used by millions daily, including myself. In an advanced database class back in university, we spent a couple of weeks studying the inner workings of search engines, and one topic that came up was data mining using Google. Much to my surprise, out of a class of 80 fourth-year computer engineers, maybe four or five knew how to use Google to perform any sort of advanced query.

Google (and many other search engines) can search not only on keywords but also with a more “database-ish” query language that really narrows down your search results. Below is a summary of a few of the most useful lesser-known features. Note: in the examples, replace cwire.org with your own domain.

Basic Usage:

  • Use quotation marks “ ” to locate an entire string.
    e.g. “bill gates conference” will only return results with that exact string.
  • Mark essential words with a +
    If the results must contain certain words or phrases, mark each one with a + symbol. e.g. +“bill gates” conference will return all results containing “bill gates”, but not necessarily those pertaining to a conference.
  • Negate unwanted words with a -
    Suppose you search for the term bass, meaning the fish, and get back a list of music links as well. To narrow down your search a bit more, try: bass -music. This will return results that contain “bass” but NOT “music”.
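
The three building blocks above combine mechanically, so here is a minimal Python sketch of them. The function name build_query and its parameters are my own invention for illustration, not anything Google provides, and the code only assembles the text you would type into the search box; it does not check whether a search engine honours each operator.

```python
def build_query(phrases=(), required=(), plain=(), excluded=()):
    """Assemble a search string from the basic operators.

    phrases  -> wrapped in quotation marks (exact string)
    required -> prefixed with + (essential words)
    plain    -> ordinary keywords, left untouched
    excluded -> prefixed with - (unwanted words)
    """
    parts = []
    parts += [f'"{p}"' for p in phrases]
    parts += [f"+{w}" for w in required]
    parts += list(plain)
    parts += [f"-{w}" for w in excluded]
    return " ".join(parts)


# The fish, not the instrument.
print(build_query(plain=["bass"], excluded=["music"]))

# An exact phrase.
print(build_query(phrases=["bill gates conference"]))
```

Keeping each operator in its own argument makes it easy to add or drop a single restriction when the first attempt returns too much or too little.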

General Tips: (I use many of these almost daily)

  • site:www.cwire.org
    This will search only pages which reside on this domain.
  • related:www.cwire.org
    This will display all pages which Google finds to be related to your URL.
  • link:www.cwire.org
    This will display a list of all pages which Google has found to be linking to your site. Useful for seeing how popular your site is.
  • spell:word
    Runs a spell check on your word
  • define:word
    Returns the definition of the word
  • stocks: [symbol, symbol, etc]
    Returns stock information. e.g. stocks: msft
  • maps:
    A shortcut to Google Maps
  • phone: name_here
    Attempts to look up the phone number for a given name
  • cache:
    Shows the version of the page that Google has in its cache. If you include other words in the query, Google will highlight those words within the cached document. For instance, cache:www.cwire.org web will show the cached content with the word “web” highlighted.
  • info:
    The query [info:] will present some information that Google has about that web page. For instance, info:www.cwire.org will show information about the CyberWyre homepage. Note there can be no space between the “info:” and the web page url.
  • weather:
    Used to find the weather in a particular city. e.g. weather: new york

Advanced Tips:

  • filetype:
    Restricts the search to a specific file type. If you put a minus sign (-) in front of it, it excludes results of that file type instead. Try it with .mp3, .mpg or .avi if you like.
  • daterange:
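    Restricts results to a range of dates, which must be given as Julian day numbers; no other date format is supported. 2452384 is an example of a Julian date (it corresponds to April 19, 2002).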
  • allinurl:
    If you start a query with [allinurl:], Google will restrict the results to those with all of the query words in the url. For instance, [allinurl: google search] will return only documents that have both “google” and “search” in the url.
  • inurl:
    If you include [inurl:] in your query, Google will restrict the results to documents containing that word in the url. For instance, [inurl:google search] will return documents that mention the word “google” in their url, and mention the word “search” anywhere in the document (url or no). Note there can be no space between the “inurl:” and the following word.
  • allintitle:
    If you start a query with [allintitle:], Google will restrict the results to those with all of the query words in the title. For instance, [allintitle: google search] will return only documents that have both “google” and “search” in the title.
  • intitle:
    If you include [intitle:] in your query, Google will restrict the results to documents containing that word in the title. For instance, [intitle:google search] will return documents that mention the word “google” in their title, and mention the word “search” anywhere in the document (title or no). Note there can be no space between the “intitle:” and the following word.
  • allinlinks:
    Searches only within links, not text or title.
  • allintext:
    Searches only within text of pages, but not in the links or page title.
  • bphonebook:
    If you start your query with bphonebook:, Google shows U.S. business white page listings for the query terms you specify. For example, [ bphonebook: google mountain view ] will show the phonebook listing for Google in Mountain View.
  • phonebook:
    If you start your query with phonebook:, Google shows all U.S. white page listings for the query terms you specify. For example, [ phonebook: Krispy Kreme Mountain View ] will show the phonebook listing of Krispy Kreme donut shops in Mountain View.
  • rphonebook:
    If you start your query with rphonebook:, Google shows U.S. residential white page listings for the query terms you specify. For example, [ rphonebook: John Doe New York ] will show the phonebook listings for John Doe in New York (city or state). Abbreviations like [ rphonebook: John Doe NY ] generally also work.
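
Since daterange: in the list above only understands Julian dates, a small converter saves looking them up by hand. This is a rough sketch using only Python's standard library. The names to_julian_day and daterange are hypothetical helpers of mine, and the start-end form with a hyphen is my assumption about how the operator is usually written, since the tip itself does not spell it out.

```python
from datetime import date

# date.toordinal() counts days with 1 January of year 1 as day 1.
# Adding this constant turns that count into the Julian day number
# of the same calendar date.
JDN_OFFSET = 1721425


def to_julian_day(d):
    """Return the Julian day number for a calendar date."""
    return d.toordinal() + JDN_OFFSET


def daterange(start, end):
    """Build a daterange: term (assumed syntax: start-end)."""
    return f"daterange:{to_julian_day(start)}-{to_julian_day(end)}"


print(to_julian_day(date(2002, 4, 19)))  # 2452384, the example above
print(daterange(date(2002, 4, 19), date(2002, 4, 26)))
```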

Putting it all Together:

Now that we have the basics, it’s time to get creative and combine them into a single search term that really narrows down our results.

Example #1: Search for some MP3s
Let’s say you’re a Beatles fan and want to see if you can find some of their songs on the Internet without using Kazaa, etc. Try this query:

“index of” + “mp3” + “beatles” -html -htm -php
or you could try this query:
* “index of/mp3” -playlist -html -lyrics beatles

Right away, the first few results returned by Google tend to be open directory listings from which you can download MP3s.

Example #2: Mixing some techniques together

Here’s a simple exercise. We’ll combine a few terms to get more accurate results. Let’s say we want to research sleep recommendations. One assumption could be that research papers on this topic would most likely be on an educational website, perhaps one with a .edu domain. We could try this query:

sleep recommendations site:edu

Maybe you’re in my situation and are thinking of applying to grad school. Let’s see if we can find the Graduate Studies Admissions Requirements at the University of Toronto. We could try this query:

grad school admission requirements site:utoronto.ca
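
If you run site-restricted searches like these often, you can generate the search link instead of retyping the query. Below is a minimal sketch, assuming the usual https://www.google.com/search?q=... address. The function name search_url and its parameters are made up for this example.

```python
from urllib.parse import urlencode


def search_url(terms, site=None, base="https://www.google.com/search"):
    """Return a search link, optionally restricted with site:."""
    query = terms if site is None else f"{terms} site:{site}"
    # urlencode escapes spaces, colons and other special characters.
    return f"{base}?{urlencode({'q': query})}"


print(search_url("sleep recommendations", site="edu"))
print(search_url("grad school admission requirements", site="utoronto.ca"))
```

Paste either printed link into a browser and you land on the same results page as typing the query by hand.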

Summary:

After reading this article, you might be thinking “well, I could probably find those results without remembering these advanced search terms”. The truth is that you probably could. The reason to start using these advanced search tips is that they help you find what you’re looking for faster. They greatly narrow down the results, and more often than not, the information you were looking for will be in the first two or three results.
