Thanga

Wednesday, September 10, 2008

Data Mining Using Google

Google has quickly become one of the best-known words in the world and is used by millions of people daily, myself included. In an advanced database class back in university, we spent a couple of weeks studying the inner workings of search engines, and one topic that came up was data mining using Google. Much to my surprise, out of a class of 80 fourth-year computer engineers, maybe four or five knew how to use Google to perform any sort of advanced query.

Google (and many other search engines) can not only search on keywords, but also accept a more “database-ish” query language that really narrows down your search results. Below is a summary of a few of the most useful lesser-known features. Note: in the examples, replace cwire.org with your own domain.

Basic Usage:

  • Use quotation marks “ ” to locate an entire string.
    eg. “bill gates conference” will only return results with that exact string.
  • Mark essential words with a +
    If the results must contain certain words or phrases, mark them with a + symbol. eg: +“bill gates” conference will return all results containing “bill gates”, but not necessarily those pertaining to a conference.
  • Negate unwanted words with a -
    You may search for the term bass, meaning the fish, and get a list of music links as well. To narrow down your search a bit more, try: bass -music. This will return all results with “bass” and NOT “music”.

General Tips: (I use many of these almost on a daily basis)

  • site:www.cwire.org
    This will search only pages which reside on this domain.
  • related:www.cwire.org
    This will display all pages which Google finds to be related to your URL
  • link:www.cwire.org
    This will display a list of all pages which Google has found to be linking to your site. Useful to see how popular your site is
  • spell:word
    Runs a spell check on your word
  • define:word
    Returns the definition of the word
  • stocks: [symbol, symbol, etc]
    Returns stock information. eg. stocks: msft
  • maps:
    A shortcut to Google Maps
  • phone: name_here
    Attempts to lookup the phone number for a given name
  • cache:
    Shows the copy of the page that Google has stored in its cache. If you include other words in the query, Google will highlight those words within the cached document. For instance, cache:www.cwire.org web will show the cached content with the word “web” highlighted.
  • info:
    The query [info:] will present some information that Google has about that web page. For instance, info:www.cwire.org will show information about the CyberWyre homepage. Note there can be no space between the “info:” and the web page url.
  • weather:
    Used to find the weather in a particular city. eg. weather: new york

Advanced Tips:

  • filetype:
    Does a search for a specific file type, or, if you put a minus sign (-) in front of it, it won’t list any results with that filetype. Try it with .mp3, .mpg or .avi if you like.
  • daterange:
    Is supported in Julian date format only. 2452384 is an example of a Julian date.
  • allinurl:
    If you start a query with [allinurl:], Google will restrict the results to those with all of the query words in the url. For instance, [allinurl: google search] will return only documents that have both “google” and “search” in the url.
  • inurl:
    If you include [inurl:] in your query, Google will restrict the results to documents containing that word in the url. For instance, [inurl:google search] will return documents that mention the word “google” in their url, and mention the word “search” anywhere in the document (url or no). Note there can be no space between the “inurl:” and the following word.
  • allintitle:
    If you start a query with [allintitle:], Google will restrict the results to those with all of the query words in the title. For instance, [allintitle: google search] will return only documents that have both “google” and “search” in the title.
  • intitle:
    If you include [intitle:] in your query, Google will restrict the results to documents containing that word in the title. For instance, [intitle:google search] will return documents that mention the word “google” in their title, and mention the word “search” anywhere in the document (title or no). Note there can be no space between the “intitle:” and the following word.
  • allinlinks:
    Searches only within links, not text or title.
  • allintext:
    Searches only within text of pages, but not in the links or page title.
  • bphonebook:
    If you start your query with bphonebook:, Google shows U.S. business white page listings for the query terms you specify. For example, [ bphonebook: google mountain view ] will show the phonebook listing for Google in Mountain View.
  • phonebook:
    If you start your query with phonebook:, Google shows all U.S. white page listings for the query terms you specify. For example, [ phonebook: Krispy Kreme Mountain View ] will show the phonebook listing of Krispy Kreme donut shops in Mountain View.
  • rphonebook:
    If you start your query with rphonebook:, Google shows U.S. residential white page listings for the query terms you specify. For example, [ rphonebook: John Doe New York ] will show the phonebook listings for John Doe in New York (city or state). Abbreviations like [ rphonebook: John Doe NY ] generally also work.
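
The daterange: operator above expects Julian dates, which are awkward to work out by hand. Here is a minimal sketch of the conversion in Python. It assumes whole-day Julian day numbers and the daterange:start-end form of the operator; the helper names are made up for this example.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian day number.
# date(2000, 1, 1).toordinal() is 730120 and its Julian day number is 2451545.
JDN_OFFSET = 1721425


def to_julian_day(d: date) -> int:
    """Convert a calendar date to a whole-day Julian day number."""
    return d.toordinal() + JDN_OFFSET


def from_julian_day(jdn: int) -> date:
    """Convert a whole-day Julian day number back to a calendar date."""
    return date.fromordinal(jdn - JDN_OFFSET)


def daterange(start: date, end: date) -> str:
    """Build a daterange: term (assumed form: daterange:start-end)."""
    return f"daterange:{to_julian_day(start)}-{to_julian_day(end)}"


if __name__ == "__main__":
    print(from_julian_day(2452384))                        # the example date above
    print(daterange(date(2002, 4, 19), date(2002, 4, 26)))
```

With this conversion, the sample value 2452384 corresponds to April 19, 2002.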

Putting it all Together:

Now that we have the basics, it’s time to get creative and combine them into a single search to really narrow down our results.

Example #1: Search for some MP3s
Let’s say you’re a Beatles fan and want to see if you can find some of their songs on the Internet without using Kazaa, etc. Try this query:

“index of” + “mp3” + “beatles” -html -htm -php
or you could try this query:
* “index of/mp3” -playlist -html -lyrics beatles

Right away on the first few results returned by Google you can download MP3s.

Example #2: Mixing some techniques together

Here’s a simple exercise. We’ll mix around a few terms to get more accurate results. Let’s say we want to research sleep recommendations. One assumption could be that research papers on this topic would most likely be on an educational website — perhaps with a .edu domain. We could try this query:

sleep recommendations site:edu

Maybe you’re in my situation and are thinking of applying to grad school. Let’s see if we can find the Graduate Studies Admissions Requirements at the University of Toronto. We could try this query:

grad school admission requirements site:utoronto.ca
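
If you build queries like these often, a small helper can assemble them for you. The sketch below is only an illustration: the function names are my own, and it assumes the usual https://www.google.com/search?q=... URL form.

```python
from urllib.parse import urlencode


def build_query(terms, site=None, filetype=None, exclude=()):
    """Combine plain search terms with site:, filetype: and -word operators."""
    parts = [terms]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    # Each excluded word is prefixed with a minus sign.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


def search_url(query):
    """URL-encode the query so it can be pasted into a browser (assumed URL form)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    q = build_query("grad school admission requirements", site="utoronto.ca")
    print(q)
    print(search_url(q))
    print(build_query("bass", exclude=["music"]))
```

The first call rebuilds the University of Toronto query from Example #2, and the last one rebuilds the bass -music query from the basic usage tips.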

Summary:

After reading this article, you might be thinking “well, I could probably find those results without remembering these advanced search terms”. Well, the truth is that you probably could. The reason you want to start to use these advanced search tips is because they will help you find what you’re looking for faster. They greatly help narrow down the results, and more often than not, the information you were looking for will be in the first two or three results.


Thursday, May 15, 2008

Thiruthangal Nadar College

A wonderful college, where I did my undergraduate studies: Thiruthangal Nadar College


Thursday, September 27, 2007

Secure Sockets Layer

What is HTTPS?

HTTPS (Hypertext Transfer Protocol Secure) is the secure version of the Hypertext Transfer Protocol (HTTP). It is used on the World Wide Web wherever data must be transferred securely, such as in online banking and other e-commerce transactions. In other words, HTTPS is HTTP over SSL (Secure Sockets Layer): the session is encrypted, and the server identifies itself with a digital certificate. Web browsers and other HTTPS-capable client programs can use it, and nearly all browsers can connect to web servers using either HTTP or HTTPS. When a site is served securely, the address in the browser's address bar begins with https:// instead of http://, and most browsers, such as IE and Firefox, also display a padlock icon. The padlock appears only when an SSL certificate is installed on the web server. Together, the padlock and the https:// prefix indicate that the connection is encrypted and that the server has presented a certificate, which makes the site a safer place to enter confidential information or carry out financial transactions. SSL thus relies on a combination of programs: the browser and the web hosting server each handle their side of the encryption and decryption. HTTP data is typically sent over TCP port 80, whereas HTTPS data is sent over TCP port 443.
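
As a small illustration of the scheme and port conventions described above, here is a sketch in Python. The function name connection_info and the example.com addresses are placeholders of mine, and the code only inspects the URL; it does not open a connection or check a certificate.

```python
from urllib.parse import urlsplit

# Well-known default ports mentioned above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def connection_info(url):
    """Return (is_https, port) for a URL, using the default port if none is given."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme == "https", port


if __name__ == "__main__":
    for url in ("http://www.example.com/", "https://www.example.com/login"):
        secure, port = connection_info(url)
        print(url, "->", "encrypted" if secure else "not encrypted", "on port", port)
```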

How Does SSL Work?

SSL uses a cryptographic system with two keys to encrypt data: a public key known to everyone, and a private key known only to the recipient of the message. SSL encryption is an effective way to secure data and e-commerce transactions. When an SSL digital certificate is installed on a website, a padlock icon appears in the browser (at the bottom of the window in many browsers), and during a secure session the address in the address bar begins with "https" instead of "http", which means that the data is encrypted.

Why is an SSL certificate required?

The Internet has boomed in popularity and is creating more opportunities for both the commercial and non-commercial sectors. Today it has paved the way for e-commerce on a large scale, and the number of online transactions keeps growing. For fear of scams, people are not willing to submit their private details on the web unless they know that the information they provide will be secure. The best way to provide web security and gain more customers is to install an SSL certificate that proves the legitimacy of your website.

What is SSL Encryption and why is it required?

SSL encryption, or HTTPS, is a technique used to safeguard private information that is sent over the Internet. To prove the site's legitimacy and protect the data, SSL uses a PKI (Public Key Infrastructure), that is, a public/private key pair, to encrypt IDs, documents, or messages so that they are transmitted securely over the World Wide Web. To show that a transmission is encrypted, most browsers display a small icon that looks like a padlock or a key, and the URL begins with "https" instead of "http". An SSL certificate issued by a digital certification authority helps secure a site that handles confidential information.

Encryption Systems

Computer encryption is based on the science of cryptography, which has been used throughout history. Before the digital age, the biggest users of cryptography were governments, particularly for military purposes. The existence of coded messages has been verified as far back as the Roman Empire. But most forms of cryptography in use these days rely on computers, simply because a human-based code is too easy for a computer to crack.

Most computer encryption systems belong in one of two categories:

Symmetric-key encryption

Public-key encryption

Symmetric Key

In symmetric-key encryption, each computer has a secret key (code) that it can use to encrypt a packet of information before it is sent over the network to another computer. Symmetric-key encryption requires that you know which computers will be talking to each other so you can install the key on each one. It is essentially the same as a secret code that each of the two computers must know in order to decode the information. The code provides the key to decoding the message. Think of it like this: You create a coded message to send to a friend in which each letter is substituted with the letter that is two down from it in the alphabet. So "A" becomes "C," and "B" becomes "D". You have already told a trusted friend that the code is "Shift by 2". Your friend gets the message and decodes it. Anyone else who sees the message will see only nonsense.
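
The "Shift by 2" code is easy to try out. Below is a minimal sketch in Python; the function name shift_text is my own, and a letter-shift code like this is of course only a teaching example, not real encryption.

```python
import string

ALPHABET = string.ascii_uppercase


def shift_text(text, shift):
    """Replace each letter with the one `shift` places further along the alphabet."""
    result = []
    for ch in text.upper():
        if ch in ALPHABET:
            # Wrap around at the end of the alphabet (Y -> A, Z -> B for shift 2).
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            result.append(ch)  # leave spaces and punctuation unchanged
    return "".join(result)


if __name__ == "__main__":
    secret = shift_text("HELLO FRIEND", 2)   # both sides know the key: 2
    print(secret)
    print(shift_text(secret, -2))            # shifting back decodes the message
```

Both sides use the same key (the number 2), which is exactly what makes the scheme symmetric.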

Public Key

Public-key encryption uses a combination of a private key and a public key. The private key is known only to your computer, while the public key is given by your computer to any computer that wants to communicate securely with it. A message encrypted with a computer's public key can be decoded only with that computer's matching private key, so anyone can send it a secure message but only it can read the result. A very popular public-key encryption utility is called Pretty Good Privacy (PGP), which allows you to encrypt almost anything.
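
To make the public/private split concrete, here is a toy sketch in Python using RSA, one well-known public-key algorithm. RSA is my choice for illustration and is not named in the text above, and the primes are far too small to be secure. Real systems use keys that are hundreds of digits long. Requires Python 3.8 or later.

```python
# Toy RSA key pair built from two tiny primes (illustration only, NOT secure).
P, Q = 61, 53
N = P * Q                    # 3233, part of both keys
PHI = (P - 1) * (Q - 1)      # 3120
E = 17                       # public exponent: (E, N) is the public key
D = pow(E, -1, PHI)          # private exponent: (D, N) is the private key


def encrypt(message, public_key):
    """Anyone who has the public key can encrypt a number smaller than N."""
    e, n = public_key
    return pow(message, e, n)


def decrypt(ciphertext, private_key):
    """Only the holder of the private key can recover the number."""
    d, n = private_key
    return pow(ciphertext, d, n)


if __name__ == "__main__":
    secret = encrypt(65, (E, N))
    print("private exponent:", D)
    print("ciphertext:", secret)
    print("decrypted:", decrypt(secret, (D, N)))
```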

To implement public-key encryption on a large scale, such as a secure Web server might need, requires a different approach. This is where digital certificates come in. A digital certificate is basically a bit of information that says that the Web server is trusted by an independent source known as a certificate authority. The certificate authority acts as a middleman that both computers trust. It confirms that each computer is in fact who it says it is, and then provides the public keys of each computer to the other.

Symmetric and Public Keys in Action

Take an email as an example. The sending computer encrypts the document with a symmetric key, then encrypts the symmetric key with the public key of the receiving computer. The receiving computer uses its private key to decode the symmetric key. It then uses the symmetric key to decode the document.
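
The steps above can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: a toy RSA key pair with tiny primes plays the role of the receiver's public and private keys, and a simple XOR keystream plays the role of the symmetric cipher. Neither is secure, and the function names are hypothetical. Requires Python 3.8 or later.

```python
import hashlib
import secrets

# Receiver's toy RSA key pair (tiny primes, illustration only).
P, Q, E = 61, 53, 17
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))


def xor_cipher(data, session_key):
    """Toy symmetric cipher: XOR the data with a keystream derived from the key.
    Applying it twice with the same key restores the original data."""
    stream = hashlib.sha256(str(session_key).encode()).digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(data))


def send(document):
    """Sender: encrypt the document with a fresh symmetric key, then wrap the key."""
    session_key = secrets.randbelow(N - 2) + 2      # random key in 2..N-1
    encrypted_doc = xor_cipher(document, session_key)
    wrapped_key = pow(session_key, E, N)            # encrypted with the PUBLIC key
    return encrypted_doc, wrapped_key


def receive(encrypted_doc, wrapped_key):
    """Receiver: unwrap the symmetric key with the PRIVATE key, then decode."""
    session_key = pow(wrapped_key, D, N)
    return xor_cipher(encrypted_doc, session_key)


if __name__ == "__main__":
    packet = send(b"Meet me at noon.")
    print(receive(*packet))
```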

Public Key: SSL

A popular implementation of public-key encryption is the Secure Sockets Layer (SSL). Originally developed by Netscape, SSL is an Internet security protocol used by Internet browsers and Web servers to transmit sensitive information. SSL has become part of an overall security protocol known as Transport Layer Security (TLS).

In your browser, you can tell when you are using a secure protocol, such as TLS, in a couple of different ways. You will notice that the "http" in the address line is replaced with "https," and you should see a small padlock in the status bar at the bottom of the browser window.

Public-key encryption takes a lot of computing, so most systems use a combination of public-key and symmetric-key encryption. When two computers initiate a secure session, one computer creates a symmetric key and sends it to the other computer using public-key encryption. The two computers can then communicate using symmetric-key encryption. Once the session is finished, each computer discards the symmetric key used for that session. Any additional sessions require that a new symmetric key be created, and the process is repeated.

Hashing Algorithms

The key in public-key encryption is based on a hash value. This is a value that is computed from a base input number using a hashing algorithm. Essentially, the hash value is a summary of the original value. The important thing about a hash value is that it is nearly impossible to derive the original input number without knowing the data used to create the hash value. Here's a simple example:

Input number    Hashing algorithm    Hash value
10,667          Input # x 143        1,525,381

You can see how hard it would be to determine that the value 1,525,381 came from the multiplication of 10,667 and 143. But if you knew that the multiplier was 143, then it would be very easy to calculate the value 10,667. Public-key encryption is actually much more complex than this example, but that is the basic idea.
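
The multiplication example can be checked in Python. The sketch below also shows a real hashing algorithm, SHA-256 from the standard hashlib module, purely for comparison; SHA-256 is my addition and is not mentioned in the text above.

```python
import hashlib

MULTIPLIER = 143  # the "secret" used by the toy hashing algorithm above


def toy_hash(number, multiplier=MULTIPLIER):
    """The example's hashing algorithm: input number x 143."""
    return number * multiplier


if __name__ == "__main__":
    value = toy_hash(10667)
    print(value)                 # the hash value from the table
    print(value // MULTIPLIER)   # easy to reverse once the multiplier is known

    # A real hash function produces a fixed-size summary of any input.
    print(hashlib.sha256(b"10667").hexdigest())
```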

Public keys generally use complex algorithms and very large hash values for encrypting, including 40-bit or even 128-bit numbers. A 128-bit number has 2^128, or 340,282,366,920,938,463,463,374,607,431,768,211,456, possible combinations! This would be like trying to find one particular grain of sand in the Sahara Desert.
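
Python handles integers of any size, so the number of combinations for a given key length is easy to check. A quick sketch (the function name is my own):

```python
def combinations(bits):
    """Number of distinct values an n-bit number can take."""
    return 2 ** bits


if __name__ == "__main__":
    for bits in (40, 128):
        print(f"{bits}-bit: {combinations(bits):,} combinations")
```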

Authentication

As stated earlier, encryption is the process of taking all of the data that one computer is sending to another and encoding it into a form that only the other computer will be able to decode. Another process, authentication, is used to verify that the information comes from a trusted source. Basically, if information is "authentic," you know who created it and you know that it has not been altered in any way since that person created it. These two processes, encryption and authentication, work hand-in-hand to create a secure environment.

There are several ways to authenticate a person or information on a computer:

Password - The use of a user name and password provides the most common form of authentication. You enter your name and password when prompted by the computer. It checks the pair against a secure file to confirm. If either the name or the password does not match, then you are not allowed further access.

Pass cards - These cards can range from a simple card with a magnetic strip, similar to a credit card, to sophisticated smart cards that have an embedded computer chip.

Digital signatures - A digital signature is basically a way to ensure that an electronic document (e-mail, spreadsheet, text file) is authentic. The Digital Signature Standard (DSS) is based on a type of public-key encryption method that uses the Digital Signature Algorithm (DSA). DSS is the format for digital signatures that has been endorsed by the U.S. government. The DSA algorithm consists of a private key, known only by the originator of the document (the signer), and a public key. The public key has four parts. If anything at all is changed in the document after the digital signature is attached to it, the value that the digital signature is compared against changes, rendering the signature invalid.
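
To show why any change to a signed document invalidates the signature, here is a rough sketch in Python. Note that it does not implement DSA: it swaps in a toy RSA-style signature over a SHA-256 hash, with primes that are much too small for real use, and all names are made up for the example. Requires Python 3.8 or later.

```python
import hashlib

# Toy signing key built from two known primes (illustration only, NOT secure).
P, Q = 2**61 - 1, 2**31 - 1
N = P * Q
E = 65537                            # public part: (E, N)
D = pow(E, -1, (P - 1) * (Q - 1))    # private part, known only to the signer


def digest(document):
    """Hash the document and reduce it to a number smaller than N."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % N


def sign(document):
    """Signer: transform the document's hash with the private key."""
    return pow(digest(document), D, N)


def verify(document, signature):
    """Anyone: undo the signature with the public key and compare hashes."""
    return pow(signature, E, N) == digest(document)


if __name__ == "__main__":
    original = b"Pay Alice 100 dollars"
    signature = sign(original)
    print(verify(original, signature))                    # unchanged document
    print(verify(b"Pay Alice 900 dollars", signature))    # altered document
```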

Recently, more sophisticated forms of authentication have begun to show up on home and office computer systems. Most of these new systems use some form of biometrics for authentication. Biometrics uses biological information to verify identity. Biometric authentication methods include:

Fingerprint scan

Retina scan

Face scan

Voice identification

Checking for Corruption

Another secure-computing need is to ensure that the data has not been corrupted during transmission or encryption. There are a couple of popular ways to do this:

Checksum - Probably one of the oldest methods of ensuring that data is correct, checksums also provide a form of authentication because an invalid checksum suggests that the data has been compromised in some fashion. A checksum is determined in one of two ways. Let's say the checksum of a packet is 1 byte long. A byte is made up of 8 bits, and each bit can be in one of two states, leading to a total of 256 (2^8) possible combinations. Since the first combination equals zero, a byte can have a maximum value of 255.

If the sum of the other bytes in the packet is 255 or less, then the checksum contains that exact value.

If the sum of the other bytes is more than 255, then the checksum is the remainder of the total value after it has been divided by 256.

Let's look at a checksum example:

Byte 1  Byte 2  Byte 3  Byte 4  Byte 5  Byte 6  Byte 7  Byte 8  Total  Checksum
212     232     54      135     244     15      179     80      1,151  127

· 1,151 / 256 = 4.496 (round to 4)

· 4 x 256 = 1,024

· 1,151 - 1,024 = 127
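
The same calculation takes one line in Python, since the remainder operator covers both cases (a sum of 255 or less is its own remainder). A minimal sketch, with a function name of my choosing:

```python
def checksum(data):
    """1-byte checksum: the sum of the bytes, keeping only the remainder mod 256."""
    return sum(data) % 256


if __name__ == "__main__":
    packet = [212, 232, 54, 135, 244, 15, 179, 80]
    print(sum(packet))        # total of the eight bytes
    print(checksum(packet))   # the checksum byte
```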

Cyclic Redundancy Check (CRC) - CRCs are similar in concept to checksums, but they use polynomial division to determine the value of the CRC, which is usually 16 or 32 bits in length. The good thing about CRC is that it is very accurate. If a single bit is incorrect, the CRC value will not match up. Both checksum and CRC are good for detecting random errors in transmission but provide little protection from an intentional attack on your data. Symmetric- and public-key encryption techniques are much more secure.
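
Python's standard zlib module includes a 32-bit CRC, so the single-bit claim is easy to try. This is a small sketch; the helper flip_bit is a made-up name for the example.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of the data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = b"123456789"
    damaged = flip_bit(message, 10)

    print(hex(zlib.crc32(message)))   # CRC sent along with the data
    print(hex(zlib.crc32(damaged)))   # CRC the receiver computes after the error
    print(zlib.crc32(message) == zlib.crc32(damaged))
```

A mismatch between the two values tells the receiver that the data was damaged in transit.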

All of these various processes combine to provide you with the tools you need to ensure that the information you send or receive over the Internet is secure. In fact, sending information over a computer network is often much more secure than sending it any other way. Phones, especially cordless phones, are susceptible to eavesdropping, particularly by unscrupulous people with radio scanners. Traditional mail and other physical media often pass through numerous hands on the way to their destination, increasing the possibility of corruption. Understanding encryption, and simply making sure that any sensitive information you send over the Internet is secure (remember the "https" and padlock symbol), can provide you with greater peace of mind.

Thursday, April 12, 2007

Useful Tech Terms

ADSL - Asymmetric Digital Subscriber Line
AIFF - Audio Interchange File Format
API - Application Programming Interface
ATM - Asynchronous Transfer Mode
Blog - Web Log
CDMA - Code Division Multiple Access
CGI - Common Gateway Interface
CISC - Complex Instruction Set Computing
DDR - Double Data Rate
DHCP - Dynamic Host Configuration Protocol
EPS - Encapsulated PostScript
GPRS - General Packet Radio Service
GPS - Global Positioning System
GSM - Global System for Mobile Communications
IDE - Integrated Drive Electronics
IEEE - Institute of Electrical and Electronics Engineers
iPod - A portable music player developed by Apple Computer
IRC - Internet Relay Chat
ISDN - Integrated Services Digital Network
LCD - Liquid Crystal Display
MAC address - Media Access Control Address
MIDI - Musical Instrument Digital Interface
NIC - Network Interface Card
NTFS - New Technology File System
OCR - Optical Character Recognition
OpenGL - Open Graphics Library
PCB - Printed Circuit Board
PCI - Peripheral Component Interconnect
PHP - PHP: Hypertext Preprocessor
RAID - Redundant Array of Independent Disks
RDRAM - Rambus Dynamic Random Access Memory
SATA - Serial Advanced Technology Attachment
SDSL - Symmetric Digital Subscriber Line
TFT - Thin-Film Transistor
VGA - Video Graphics Array
VRML - Virtual Reality Modeling Language
WAIS - Wide Area Information Server
Wi-Fi - Wireless Fidelity

Wednesday, January 10, 2007

Latest Japan CPUs





Wednesday, January 25, 2006

WISH YOU ALL QUALITY LIFE

Hi Folks,




Somebody once asked God, "What surprises you most about mankind?" God replied: "They lose their health to make money, and then lose their money to restore their health. By thinking anxiously about the future, they forget the present, such that they live neither for the present nor for the future. They live as if they will never die, and they die as if they had never lived."

"WISH YOU ALL QUALITY LIFE"