To be or not to be

September 6, 2008

Extracting Spam Blogs with Co-citation Clusters – WWW 2008

Filed under: Search — tdas @ 7:25 pm

This poster paper was published at the recently concluded WWW 2008 conference in Beijing. The author suggests that spam blogs (splogs) form co-citation clusters because they share advertisement links with one another. The experiments were based on 691,674 blogs collected from the Japanese blogosphere during one week in 2007. The author reports that high out-degree blogs are usually spam blogs (95% of the time). An iterative spam traversal algorithm extracts spam blogs through co-citation cluster analysis of known spam blogs (the seed set). The seed set was generated automatically from high out-degree pages and pages containing adult and commercial keywords.

Comments:
* The author does not define what counts as a "high" out-degree.
* The method seems simple and intuitive, but more exhaustive experiments would provide more insight.
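
To make the idea concrete, here is a rough sketch of how such an iterative traversal might look. This is my own guess, not the paper's algorithm: I assume a blog joins the spam set when it shares at least `min_shared` link targets with blogs already labelled spam, and the names `pick_seeds`, `expand_spam`, `min_out_degree`, `min_shared` and `max_rounds` are all made up for illustration. The keyword-based part of the seeding is left out.

```python
def pick_seeds(out_links, min_out_degree):
    """Pick seed blogs by out-degree alone.

    The threshold is hypothetical: the paper summary does not say
    what "high out-degree" means.
    """
    return {blog for blog, targets in out_links.items()
            if len(targets) >= min_out_degree}


def expand_spam(out_links, seeds, min_shared=2, max_rounds=10):
    """Grow the spam set from the seeds, one round at a time.

    Assumption: sharing at least `min_shared` link targets with the
    current spam set is enough to be pulled into the cluster.
    """
    spam = set(seeds)
    for _ in range(max_rounds):
        # All targets linked from blogs currently labelled as spam.
        spam_targets = set().union(*(out_links.get(b, set()) for b in spam))
        # Unlabelled blogs that share enough of those targets.
        newly_found = {
            blog for blog, targets in out_links.items()
            if blog not in spam and len(targets & spam_targets) >= min_shared
        }
        if not newly_found:
            break  # nothing new was reached, so stop early
        spam |= newly_found
    return spam


# Toy link graph (invented data): blog -> set of link targets.
toy_links = {
    "a": {"ad1", "ad2", "ad3"},
    "b": {"ad1", "ad2"},
    "c": {"ad2", "x"},
    "d": {"y", "z"},
}
toy_seeds = pick_seeds(toy_links, min_out_degree=3)
print(sorted(expand_spam(toy_links, toy_seeds)))
```

Both thresholds would have to be tuned on real data. A low `min_shared` will drag in legitimate blogs that happen to link to one popular site.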

July 23, 2008

Automatic HomePage Finding

For my Machine Learning course, I have to develop an intelligent system that can automatically identify official websites for search queries. If you are interested in the details, you can read more about it in the Project Proposal. I have to submit the completed project by August 5th, and I haven’t started implementing it. So I have decided to keep a journal of the next few days leading up to the completion of the project. I will jot down all the ideas, updates and everything else on this blog. Without further ado, let’s get the ball rolling …

Update: I successfully completed my project and handed it in on time :) . For the mathematically inclined, I achieved a ten-fold cross-validation accuracy of 80.48%. For details, read the complete report: Automatic HomePage Identification.
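
For anyone wondering what a ten-fold cross-validation accuracy actually measures, here is a minimal sketch. The classifier and the data below are toy placeholders that have nothing to do with my project, and the helper names (`k_fold_accuracy`, `train_threshold`, `predict_threshold`) are invented for this example. The data is split into ten folds, each fold is held out once while the model trains on the other nine, and the accuracy is the fraction of held-out predictions that were correct.

```python
def k_fold_accuracy(samples, labels, train, predict, k=10):
    """Fraction of correct predictions over k held-out folds."""
    n = len(samples)
    correct = 0
    for fold in range(k):
        # Every k-th item, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, n, k))
        train_x = [samples[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        model = train(train_x, train_y)
        correct += sum(predict(model, samples[i]) == labels[i]
                       for i in test_idx)
    return correct / n


def train_threshold(xs, ys):
    """Toy model: the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    """Predict the positive class at or above the threshold."""
    return x >= threshold


# Invented data: the numbers 0..19, labelled positive from 10 upwards.
toy_samples = list(range(20))
toy_labels = [x >= 10 for x in toy_samples]
toy_accuracy = k_fold_accuracy(toy_samples, toy_labels,
                               train_threshold, predict_threshold)
print(f"ten-fold accuracy: {toy_accuracy:.2%}")
```

Real setups usually shuffle the data before splitting it into folds. The interleaved split here just keeps the sketch short.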

July 10, 2008

GenieKnows where your business is …

Filed under: Personal, Search — tdas @ 1:48 pm

After four years in the making and a lot of patient waiting, GenieKnows finally launched its local search engine. It’s a tremendous achievement for a very small company that dares to dream big. I still remember how, back in early 2005 when I joined the R&D team (P.S. I was the third member :P ), I was kinda skeptical about the prospects of our product. Working weekends and overtime, we slowly trudged towards our goal. Keeping the emotional rants aside, I just want this wonderful feeling to sink in. Obviously this is not a finished product, and we are continually trying to improve it. We are well aware of the stiff competition we face, and we have neither the resources nor the financial clout to compete directly with the other G ;) , but that does not bother us. As Randy Pausch so rightly said, “Do the right thing and the good things will come to you…”. That’s all we are trying to do: “Keeping it real and Keeping it local”. In some future posts I’ll discuss some of the cool features of our product, and maybe some secrets about ranking higher in GenieKnows search :P

February 21, 2008

Amazing Indexing Speed of Google

Filed under: Indexing, Search — tdas @ 6:12 pm

Google seems to index documents within moments of their being published. It is amazing how they can achieve something like this while staying within the limits of hardware and software. This post, by the way, is a test to see how fast they index this ;)

Just to prove what I meant by FAST, I took a screenshot of the results page showing the time they indexed it. I happen to know a little bit about search engines and how they work in general, but this simply blows my mind. Kudos to Google and its engineers. Someday I hope to figure out…

Google Indexing speed
