Sunday, August 14, 2005

Trust rank and trust.

Run the following search on a search engine (SE):

THE PROTOCOLS OF THE LEARNED ELDERS OF ZION

Who can decide between false pages and true pages?

http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=2004-17&format=pdf&compression=

http://advogato.org/trust-metric.html

Objective search is not an easy game. When you use mathematical algorithms, you must know what they say and what they do not say. You must know their limitations. That many people vote for the same person does not imply that the person is more trustworthy. It simply says that the person got a lot of votes. You are not more right just because most people agree with you.

"Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper."
http://www-db.stanford.edu/~backrub/google.html
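
A minimal Python sketch of the simple iterative algorithm mentioned in the quote, using the formula exactly as quoted. The function name, the toy link graph and the fixed iteration count are assumptions for illustration, not anything from the paper. Note that with the formula in this form, the scores on a toy graph where every page has outgoing links sum to the number of pages rather than to one; dividing the (1-d) term by the number of pages would give a probability distribution.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    `links` maps each page to the list of distinct pages it links to
    (hypothetical toy data).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links here contributes PR(T) / C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy web, C collects links from both other pages and ends up ranked first, while B, with a single half-weight link from A, ends up last.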

A natural modification is this:

PR(A) = (1-d) + (d1*(PR(T1)/C(T1)) + ... + dn*(PR(Tn)/C(Tn)))

where d1 + d2 + ... + dn = d.

If you have a true metric that other sites are valued against, human beings or perhaps AI could be used to set di=0 if page i is false. I never promised you a rose garden.
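
A minimal Python sketch of the modified formula for a single page A. The text only says that di is set to 0 for false pages and that the weights sum to d; how the remaining weight is shared is not stated, so the equal split among the trusted pages below is an assumption, as are the function names and the sample numbers.

```python
def link_weights(inbound_pages, is_false, d=0.85):
    """Return one weight d_i per inbound page, with d_i = 0 for false pages.

    Assumption (not in the text): the budget d is split equally among the
    pages judged true, so that the weights still sum to d.
    """
    trusted = [p for p in inbound_pages if not is_false(p)]
    share = d / len(trusted) if trusted else 0.0
    return {p: (0.0 if is_false(p) else share) for p in inbound_pages}


def modified_score(inbound, weights, d=0.85):
    """PR(A) = (1-d) + sum_i d_i * PR(T_i) / C(T_i).

    `inbound` maps each page T_i to a pair (PR(T_i), C(T_i)).
    """
    return (1 - d) + sum(weights[p] * pr / c for p, (pr, c) in inbound.items())


# Hypothetical inbound pages: name -> (PageRank, number of outgoing links).
inbound = {"T1": (1.0, 2), "T2": (2.0, 4), "T3": (5.0, 1)}
flagged = {"T3"}  # pages a human reviewer (or a classifier) judged false
weights = link_weights(inbound, lambda p: p in flagged)
score = modified_score(inbound, weights)
print(weights, score)
```

Here T3 has by far the highest PageRank of the three inbound pages, but because it is flagged as false it contributes nothing to the score of A.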

Related links:
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SrchEngCriteria.pdf

Kjell Gunnar Bleivik

Wednesday, August 10, 2005

Canonical links.

1. Most SEs weight inbound links. Link spamming is a well-known concept.
2. Some SEs, like Teoma (http://www.teoma.com/), use expert links. Expert links are about outgoing links.
3. Some say you are who you link to, but you are also who links to you. If your mind is too open, people will start throwing garbage at you. Teoma adds a new dimension and level of authority to search results through its approach, known as Subject-Specific Popularity(SM).

"Subject-Specific Popularity analyzes the relationship of sites within a community, ranking a site based on the number of same-subject pages that reference it, among hundreds of other criteria. In other words, Teoma determines the best answer for a search by asking experts within a specific subject community about who they believe is the best resource for that subject. By assessing the opinions of a site's peers, Teoma establishes authority for the search result. Relevant search results ranked by Subject-Specific Popularity are presented under the heading "Results" on the Teoma.com results page".
http://sp.teoma.com/docs/teoma/about/searchwithauthority.html

There was one first site on the web, and it was then the most important. When a second site was introduced, links between the two sites became possible. How should we measure the importance of the two sites
1. When there were no links between the sites?
2. When links were introduced?

Does importance mean the number of surfers on the two sites (or the hours they spend there), or should experts vote on their relevance? As the number of sites increases, the number of possible links grows without bound. There are both outbound and inbound links, and some outbound links are more relevant than others. The same may apply to inbound links. Is a newspaper more important on a subject than an expert journal (magazine) because the newspaper has more inbound links?

If you know principal component analysis and canonical correlation, it should not be difficult to find the links on a site that give real value to the site, what we will call canonical links. These links may, in the spirit of Teoma, be both inbound and outbound.

Headache: Is the concept of linking inflated? Is it possible to concentrate on factors other than links to measure the rank of a site? This discussion will never end as long as there are ranking algorithms, SEs and surfers on the internet.


Related articles:
http://seoarticles.seoforgoogle.com/Why-You-Need-Outbound-Links.cfm
http://pr.efactory.de/e-outbound-links.shtml

Comments on WebProWorld, if timely:

http://www.webproworld.com/viewtopic.php?t=50056&postdays=0&postorder=asc&&start=25


Kjell Gunnar Bleivik

Tuesday, August 09, 2005

Imagination is more important than knowledge.

“Imagination is more important than knowledge”.
Albert Einstein.

The equation x*x+1=0 has no real solution, but it has two complex solutions. You must be able to abstract from the real line to understand that. The Riemann integral over a point (zero length) is zero; nevertheless the Dirac delta function, a distribution, has all its mass at the origin, so the integral over that point equals one. Again you have to abstract, and it is easy if you think of a triangle, with … please help me, Professor GoogleBot, what is the English word?

Search term: “Area of a triangle”

First hit: http://www.aaamath.com/B/geo78_x6.htm

Of course it is base. Thank you, Professor GoogleBot. I have used you for a long time as a dictionary, and you are faster than my “The Concise Oxford English Dictionary.” I do not use an online dictionary, since you are on my toolbar and you are fast enough for me. Nor do I know of a good free Norwegian/English dictionary, and I have no time to search for one, since I am writing now.

So in order to imagine a function with an integral (area) = 1, you take a triangle with base 2/n and height n, and then the area = ((2/n)*n)/2 = 1. When n goes to infinity, the base collapses to a point and you get a Dirac pulse, which is “so infinite” that its integral at the origin equals 1.
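
The triangle argument can be checked numerically. The following is a minimal Python sketch under stated assumptions: the triangle is centred at the origin, and the function names and the midpoint-rule integrator are illustrative choices, not part of the argument above.

```python
def triangle(x, n):
    """Triangle centred at 0 with base 2/n and height n."""
    half_base = 1.0 / n
    if abs(x) >= half_base:
        return 0.0
    return n * (1.0 - abs(x) / half_base)


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# The peak grows with n while the base shrinks, but the area stays close to 1.
areas = {n: integrate(lambda x, n=n: triangle(x, n), -1.0, 1.0) for n in (1, 10, 100)}
for n, area in areas.items():
    print(n, triangle(0.0, n), round(area, 6))
```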


Kjell Bleivik
http://www.multifinanceit.com/

Sunday, August 07, 2005

Green and red lights in searching.

I have collected links for 10 years, http://www.multifinanceit.com/. It started as a hobby, but as more and more people asked for my collection and updated versions, I used the opportunity to build an ad-driven website around the collection. What are my tools to uncover the net? I use so many that I do not remember them all, and the tools change all the time. Sometimes I follow links from great sites, and then I may find other great sites. If I were to search for African sites, I would start on the central bank web pages and follow links. So what do PageRank and Google mean to me? Google is my workhorse, and I mostly find what I am searching for on Google. In finance we used to use Yahoo, and when I double-checked with http://www.twingine.com/ (search term: “Financial stability”), I used to like the first hits by Yahoo best. I am told that the Yahoo directory is the best on China, and it was when I checked half a year ago. It was better than the Google directory and dmoz. But more and more young people (they are the future) told me to use Google, and now it is a great workhorse. I had other toolbars installed, but I now only use the Google toolbar (http://toolbar.google.com/). Do not confuse it with the Google deskbar (http://deskbar.google.com/).

What do that toolbar with the green indicator (if it is not installed, install it from the options menu) and Googlerankings (http://googlerankings.com/) mean to me?

1. First of all, when Google dances its wildest dance during the great updates, like Florida (http://searchenginewatch.com/searchday/article.php/3285661) and the latest, Bourbon (http://www.webpronews.com/insiderreports/searchinsider/wpn-49-20050602ClinkGoogleGuySpillsBourbonCheers.html), I may turn it off.
2. Googlerankings is mostly for SEOs, so I do not use it, since I am not in that business. If you are in that business, Marketleap (http://www.marketleap.com/) is another great tool; I may check my position there every half year.
3. The green indicator tells me the following. It may tell me that a site is stable and mature and most probably not a spam or scam site. But it may also tell me that the site is made by SEO specialists, that is, specialists in manipulating the SEs, and has little value. Sites in dmoz, the largest human-edited directory in the world, SHOULD be better, but that is not always my experience. To use the language of an economist: ceteris paribus, the greener the indicator, the better.
4. Contact information on a site is very important if it is a site offering financial services and/or an ecommerce site selling goods and services. But it is neither a necessary nor a sufficient condition for inclusion in my collection.
5. Bad sites are never included. By a bad site, I mean a site that the Spyware Doctor (real-time) OnGuard guide tells me may contain harmful content. That applies, for example, to some (especially Asian) SEs and directories.
6. There are some sites with no contact information beyond an email address. The site owners participate in Google AdSense https://www.google.com/adsense/ . I will never know how Google does its due diligence http://duediligencechecklist.com/ . Mostly these sites are not included. In the words of dmoz, they MAY be sites that offer no net value to my link collection’s real estate. They may even be a negative factor.
7. Spam emails are a great tool for finding spam sites. I have my own list with 16 categories, where one category is “email spammer.” I get unserious emails even from large companies; should I include a new category, unserious companies? Do you remember the movie “Far from the Madding Crowd” with Julie Christie? Time for a new movie: “Far from the Mobbing Crowd” with Dustin Hoffman etc.

There is also a tool for developers, the Google Deskbar plug-ins.
“With the Google Deskbar API, you can write plug-ins to add your own features to the Google Deskbar. Plug-ins can be written in any .NET language, such as C# or Visual Basic.NET.”
http://toolbar.google.com/deskbar/help/api/index.html

And if you are a mobile surfer, try

http://www.google.com/mobile/

And do not forget to check the Google sitemap http://www.google.com/sitemap.html, at least every half year.

But there is much more under the sun than Google. My preferred tool for site search is the http://www.mamma.com/ search box. It is a Canadian SE. One obvious advantage for future development: Canadians are bilingual, and mamma is related to the professional Interactive Marketing Agency company http://www.digitalarrow.com/ . Their SE may be very good if you are building a directory or a portal, or working in the SEO business.

“Mamma.com is a "smart" metasearch engine — every time you type in a query Mamma simultaneously searches a variety of engines, directories, and deep content sites, properly formats the words and syntax for each, compiles their results in a virtual database, eliminates duplicates, and displays them in a uniform manner according to relevance. It's like using multiple search engines, all at the same time” http://www.mamma.com/info/about.html
Eliminating duplicates increases efficiency.
But their rSort ranking algorithm (based on the Condorcet Method) uses the duplicates when it ranks results.
“Spammers often have difficulty spamming more than one engine at the same time, as different spamming methods must be used for each search engine. Spam results will tend to receive fewer votes from multiple sources. A spammer may have top ranking on one search engine, but they won't achieve it on Mamma unless they're able to spam ALL of our sources, an insurmountable task for even the best spammer”.
http://www.mamma.com/info/about.html
Should I use you more, mamma? I checked the keywords (KWs) when I last visited my daughter. One of the most used KWs was mamma.
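
The Condorcet idea behind the ranking described above can be sketched in a few lines of Python. This is not Mamma's actual rSort algorithm, whose details are not given here; it is a Copeland-style count of pairwise majority wins, one simple Condorcet-type rule. The function name, the treatment of missing results and the example domains are all assumptions.

```python
from itertools import combinations


def condorcet_rank(rankings):
    """Order results by pairwise majority wins (Copeland scoring).

    `rankings` is a list of ranked result lists, one per engine. A result
    missing from an engine's list is treated as ranked below every result
    that engine did return (an assumption of this sketch).
    """
    candidates = sorted({url for ranking in rankings for url in ranking})
    wins = {url: 0 for url in candidates}
    for a, b in combinations(candidates, 2):
        prefers_a = prefers_b = 0
        for ranking in rankings:
            pos_a = ranking.index(a) if a in ranking else len(ranking)
            pos_b = ranking.index(b) if b in ranking else len(ranking)
            if pos_a < pos_b:
                prefers_a += 1
            elif pos_b < pos_a:
                prefers_b += 1
        # The result preferred by a majority of engines wins this duel.
        if prefers_a > prefers_b:
            wins[a] += 1
        elif prefers_b > prefers_a:
            wins[b] += 1
    # Most duels won first; ties are broken alphabetically.
    return sorted(candidates, key=lambda url: (-wins[url], url))


# Hypothetical results from three engines; only the first one has been spammed.
engine_results = [
    ["spam.example", "good.example", "ok.example"],
    ["good.example", "ok.example"],
    ["good.example", "ok.example", "other.example"],
]
merged = condorcet_rank(engine_results)
print(merged)
```

The spam page tops one engine but loses its duels against the pages that the other engines agree on, which is the effect the quote describes.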

Tuesday, August 02, 2005

Make your own (Meta) SE.

1. High dimensions are no problem for an economist used to world trade models with millions of equations. In C++ a 5-dimensional object is a Pointer*****. So if you have a supercomputer with petabytes of storage capacity, you can make an n-dimensional time series, that is, an (n+1)-dimensional data structure. You have one dimension for each category, e.g. for each country in the world. In that structure you store your data.

2. Queries from this structure are done by projections. Analogy: http://www.craigslist.org/

3. You need fast iterators, generic iterators on function objects, if you want to do more advanced data manipulations on the (n+1)-dimensional structure.

4. Additional links:
http://lucene.apache.org/
http://lucenebook.com/
http://www.manning.com/books/hatcher2
http://www-db.stanford.edu/~backrub/google.html
http://www.intex.com/ Specialist in data structures.
http://www.data.sungard.com/ Data solution.
http://www.multifinanceit.com/search.htm Category: "Advanced topics"
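
Points 1 and 2 above can be illustrated with a small Python sketch of an (n+1)-dimensional structure with n category axes plus time. It reads a "projection" as fixing some axes and summing over the rest, which is only one interpretation. The class name, the sparse dictionary storage, the axis names and the trade figures are all hypothetical.

```python
from collections import defaultdict


class TimeSeriesCube:
    """Sparse (n+1)-dimensional store: n category axes plus one time axis."""

    def __init__(self, axes):
        self.axes = tuple(axes)  # names of the n category axes
        self.cells = {}          # (category values..., time) -> number

    def put(self, categories, time, value):
        """Store one observation at the given category coordinates and time."""
        key = tuple(categories[axis] for axis in self.axes) + (time,)
        self.cells[key] = value

    def project(self, **fixed):
        """Fix some axes and sum over the rest, returning {time: total}."""
        series = defaultdict(float)
        for key, value in self.cells.items():
            coords = dict(zip(self.axes, key[:-1]))
            if all(coords[axis] == wanted for axis, wanted in fixed.items()):
                series[key[-1]] += value
        return dict(series)


# Hypothetical trade flows between countries (made-up numbers).
cube = TimeSeriesCube(axes=("exporter", "importer"))
cube.put({"exporter": "NO", "importer": "SE"}, 2004, 10.0)
cube.put({"exporter": "NO", "importer": "DK"}, 2004, 5.0)
cube.put({"exporter": "NO", "importer": "SE"}, 2005, 12.0)
cube.put({"exporter": "SE", "importer": "NO"}, 2005, 7.0)
norway_exports = cube.project(exporter="NO")
print(norway_exports)
```

A dictionary keyed by coordinate tuples avoids allocating the full dense array, which matters when most combinations of categories are empty.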

Kjell Gunnar Bleivik
http://www.multifinanceit.com/