To be or not to be

February 21, 2008

Amazing Indexing Speed of Google

Filed under: Indexing, Search — Tags: , , , , — tdas @ 6:12 pm

Google seems to index documents within moments of their publication. It is amazing that they can achieve something like this while staying within the limits of hardware and software. This post, by the way, is a test to see how fast they index it ;)

Just to prove what I meant by FAST, I took a screenshot of the results page showing the time they indexed it. I happen to know a little bit about search engines and how they work in general, but this simply blows my mind. Kudos to Google and its engineers. Someday I hope to figure out…

Google Indexing speed

February 17, 2008

How To Extract lines from a file in Unix?

Filed under: How To, Shell Script — Tags: , , , , — tdas @ 8:27 pm

To extract k lines from a text file in Unix, we can use a combination of head and tail.

head -n 20 file.dat | tail -n 11   # this gives us lines [10-20] from file.dat (tail -n 10 would start at line 11)

Another elegant and easy way to extract a range of lines from a text file in Unix is to use sed.

sed -n '10,20p' file.dat > output.dat   # this also extracts lines [10-20] from file.dat
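Here is a minimal sketch of both approaches wrapped in small functions, so the line arithmetic is done for you. The function names extract_lines and extract_lines_sed are made up for illustration, and the demo file is a throwaway built with seq (assumed to be installed, as it is on most Linux and macOS systems).

```shell
#!/bin/sh
# Hypothetical helper: print lines START..END (inclusive) of FILE
# using head and tail.
extract_lines() {
    start=$1
    end=$2
    file=$3
    # tail must keep (end - start + 1) lines, e.g. 11 lines for 10..20.
    count=$((end - start + 1))
    head -n "$end" "$file" | tail -n "$count"
}

# Same range with sed; the trailing "q" stops reading the file once
# line END has been printed, which helps on very large files.
extract_lines_sed() {
    sed -n "${1},${2}p;${2}q" "$3"
}

# Demo on a temporary file whose line N contains the number N.
demo_file=$(mktemp)
seq 1 100 > "$demo_file"
extract_lines 10 20 "$demo_file"
extract_lines_sed 10 20 "$demo_file"
rm -f "$demo_file"
```

One caveat: if the file has fewer lines than END, the head/tail version simply returns the last few lines it can find, while the sed version returns only the lines that really fall inside the range.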

How To Convert Lower Case to Upper Case and Vice Versa in Unix?

Filed under: How To, Shell Script — Tags: , , — tdas @ 6:32 pm

Converting lower case characters to upper case and vice versa is a fairly common task in the computer world. In Unix this can be done very easily by using the tr command.

To convert a file containing lower case characters to upper case characters :

tr '[:lower:]' '[:upper:]' < foo.dat   # Note: this prints the contents of foo.dat with everything changed to upper case

To convert a file containing upper case characters to lower case characters :

tr '[:upper:]' '[:lower:]' < foo.dat   # Note: this prints the contents of foo.dat with everything changed to lower case
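The commands above only print the converted text; foo.dat itself is left untouched. Below is a small sketch of how the conversion could be wrapped up, including one way to rewrite a file with its upper case version. The names to_upper, to_lower and upcase_file are hypothetical helpers of my own, not standard commands.

```shell
#!/bin/sh
# Hypothetical helpers wrapping tr; both read stdin and write stdout.
to_upper() { tr '[:lower:]' '[:upper:]'; }
to_lower() { tr '[:upper:]' '[:lower:]'; }

# tr only reads standard input, and "tr ... < f > f" would truncate f
# before tr reads it, so go through a temporary file instead.
upcase_file() {
    tmp=$(mktemp) || return 1
    to_upper < "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

printf 'Hello World\n' | to_upper
printf 'Hello World\n' | to_lower
```

Copying the result back with cat instead of mv keeps the original file's permissions and ownership.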

How To Section

Filed under: How To — Tags: , , — tdas @ 5:56 pm

I have been thinking about creating a How To section, where I would explain how to do simple things in the CS world. Topics can range from simple shell scripts to databases to programming languages. There are a few websites that provide an exhaustive list of how-tos; of those, I really like the WikiHow website. Anyway, the main motivation for this section is that I often forget the most trivial commands at work and end up searching the web for a solution. What better than having my own little section for all the common how-tos I have compiled over the past few years?

February 11, 2008

Unix Sort

Filed under: Shell Script — Tags: , , — tdas @ 3:20 am

The Unix sort command is one of the most useful/powerful commands I have ever used. Below I have listed some cool things you can do with the sort command:

Sort and output to the same file (-o) : sort -o foo.dat foo.dat

Sort and keep only unique values (-u): sort -u -o foo.dat foo.dat

Sort numbers (-n): sort -n -o foo.dat foo.dat

Sort numbers in reverse (-r) : sort -n -r -o foo.dat foo.dat

Union of two files : sort file1 file2 | uniq > file3

Intersection of two files (assuming neither file contains duplicate lines of its own) : sort file1 file2 | uniq -d > file3
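The union and intersection one-liners can be sketched as two small functions. The function names and sample files are made up for illustration. The intersection de-duplicates each file first, because uniq -d would otherwise report a line that is merely repeated inside a single file.

```shell
#!/bin/sh
# Union: every distinct line that appears in either file.
set_union() { sort -u "$1" "$2"; }

# Intersection: lines present in both files. Each input is
# de-duplicated first so a repeat inside one file is not counted
# as a match between the two.
set_intersection() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    sort "$a" "$b" | uniq -d
    rm -f "$a" "$b"
}

# Demo with two throwaway files; note "banana" appears twice in file1.
dir=$(mktemp -d)
printf 'apple\nbanana\nbanana\n' > "$dir/file1"
printf 'banana\ncherry\n' > "$dir/file2"
set_union "$dir/file1" "$dir/file2"
set_intersection "$dir/file1" "$dir/file2"
rm -rf "$dir"
```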

    Information Extraction 101

    Filed under: Information Extraction — Tags: , — tdas @ 2:48 am

    Lately I have been fascinated with the topic of Information Extraction (IE) and its application to web data. More specifically, I am interested in extracting meaningful, structured information from unstructured text data (i.e. web pages), and the definition of IE seems to fit the bill. I found this lecture by Kamal Nigam about various Information Extraction techniques. For people interested in IE, this should be a great starting point.

    Text Information Extraction – Kamal Nigam

    February 3, 2008

    Speed up Grep

    Filed under: Shell Script — Tags: , — tdas @ 5:02 pm

    GNU grep, as of this writing, is very slow in the UTF-8 locale. It can be orders of magnitude faster in the C locale. To check your current
    locale, type the following at the shell prompt: locale

    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

    In the above example, my locale is en_US.UTF-8. If you are
    grep’ing very large files, you can greatly improve the speed by changing
    the locale to C. In bash, you would type: export LC_ALL=C

    Then type locale again; the output should look something like this:

    LANG=en_US.UTF-8
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=C

    Future versions of grep are planned to address this issue. Until then,
    use the C locale with grep. If you frequently use grep to search large text files, you should add the export line to your .bash_profile. Keep in mind that LC_ALL=C then applies to every program started from that shell, not just grep.
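If you would rather not change the locale for the whole session, one alternative is to set LC_ALL only for the grep call itself. The sketch below assumes a made-up wrapper name, cgrep, and a tiny sample file; big.log in the comments is a placeholder for a large file of your own.

```shell
#!/bin/sh
# Hypothetical wrapper: run grep in the C locale for this one call,
# leaving the rest of the shell session in its original locale.
cgrep() { LC_ALL=C grep "$@"; }

# Tiny sample file just to show the wrapper behaves like grep.
sample=$(mktemp)
printf 'alpha\nbeta\nabc123\n' > "$sample"
cgrep -c '^abc' "$sample"
rm -f "$sample"

# To compare timings yourself (big.log is a placeholder name):
#   time grep -c 'pattern' big.log
#   time LC_ALL=C grep -c 'pattern' big.log
```

In the C locale grep treats the input as plain bytes, so patterns that rely on non-ASCII characters (for example case-insensitive matching of accented letters) may behave differently.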

    Grep – The Magic Unix Command

    Filed under: Shell Script — Tags: , , , — tdas @ 4:50 pm

    I have been using the grep command for a while now, and I have been in awe of it ever since. Assuming readers have some basic knowledge of Unix commands and grep in general, I would like to mention a couple of really cool features of grep that I find handy.

    Looking for the exact match : Imagine you want to extract a specific pattern from a text file, but you do not want the rest of the matching line. In that case, use grep -o. For example, to extract the domain name from the URL http://www.cs.dal.ca/abc/report.html?report=34A, use the following command: echo "http://www.cs.dal.ca/abc/report.html?report=34A" | grep -o "www\.[^/]*"; this will return www.cs.dal.ca.
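As a small runnable sketch of the -o option: the helper name extract_host is my own invention, and the pattern assumes the host name starts with "www.". The -o option is available in GNU and BSD grep but is not part of POSIX.

```shell
#!/bin/sh
# Hypothetical helper: read URLs on stdin, print only the host part.
# -o prints just the matched text instead of the whole line.
# "www\.[^/]*" matches "www." and everything up to the next slash.
extract_host() { grep -o 'www\.[^/]*'; }

url='http://www.cs.dal.ca/abc/report.html?report=34A'
printf '%s\n' "$url" | extract_host
```

The URL is fed in with printf (or echo) because cat would treat it as a file name.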

    Looking for adjacent lines : Often, when searching a text file with grep, we want to see the lines adjacent to each match. grep supports this through the -B, -A and -C options.

    grep -A 5 "^abc" file.dat   # returns each line starting with abc in file.dat and the 5 lines after it.

    grep -B 5 "^abc" file.dat   # returns each line starting with abc in file.dat and the 5 lines before it.

    grep -C 5 "^abc" file.dat   # returns each line starting with abc in file.dat and the 5 lines before & after it.
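To see the three options side by side, here is a short sketch on a throwaway ten-line file in which only line 5 starts with abc. The file contents are invented for the demo, and smaller context sizes are used so the difference is easy to spot.

```shell
#!/bin/sh
# Ten-line sample; only line 5 starts with "abc".
sample=$(mktemp)
printf 'l1\nl2\nl3\nl4\nabc here\nl6\nl7\nl8\nl9\nl10\n' > "$sample"

echo '--- -A 2: the match plus 2 lines after it'
grep -A 2 '^abc' "$sample"

echo '--- -B 2: the match plus 2 lines before it'
grep -B 2 '^abc' "$sample"

echo '--- -C 1: the match plus 1 line on each side'
grep -C 1 '^abc' "$sample"
```

When a file contains several matches that are far apart, grep separates the groups of context lines with a line containing "--".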

     

    Hopefully, these tricks will be of help to someone :)

      Google let me down :(

      Filed under: Personal — tdas @ 3:22 pm

      I have been an ardent fan of Google for the past 5 years or so. I seem to like every product that Google rolls out. About 2 years ago I joined the blog bandwagon and signed up for an account on Blogspot. I will admit that I am not an active blogger, but the lack of a Category feature in Blogspot is shocking. Maybe I missed something, but after reading a few forum postings, I think there is no easy way to group your posts nicely in Blogspot. (P.S. Obviously there are hacks available.) To cut a long story short, I have been disappointed by Google for the first time, and that's the main reason I have decided to give WordPress a shot. The title of the blog does not relate at all to its content; it's there more for historical reasons. So here I am posting my first blog entry. Hopefully this will be the first of many more to come.
