Friday, May 6, 2011

Indexing A Site

Before a site appears in search results, a search engine must index it. An indexed site has been visited and analyzed by a search robot, with the relevant information saved in the search engine's database. If a page is present in the search engine's index, it can be displayed in search results; otherwise, the search engine knows nothing about it and cannot display information from the page.

There are two ways to allow a search engine to learn about a new site:

   - Submit the address of the site manually using a form provided by the search engine, if one is available. In this case, you inform the search engine about the new site yourself, and its address goes into the queue for indexing. Only the main page needs to be submitted; the search robot will find the rest of the pages by following links.

   - Let the search robot find the site on its own. If there is at least one inbound link to your site from a resource that is already indexed, the search robot will soon visit and index your site as well. In most cases this method is recommended, and it is often quicker than waiting in the manual submission queue. Google is typically the fastest of the major search engines to pick up new sites.

Try to make your site friendly to search robots by following these rules:

   - Try to make every page of your site reachable from the main page in no more than three clicks. If the structure of the site does not allow this, create a so-called site map that keeps the rule satisfied.
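A site map can be as simple as a single page that links, directly or through one level of nesting, to every page on the site. A hypothetical minimal example (all paths and titles are invented for illustration):

```html
<!-- Hypothetical site map page: every page is reachable from here in
     one click, so main page -> site map -> page stays within the
     three-click rule described above. -->
<html>
  <head><title>Site Map</title></head>
  <body>
    <h1>Site Map</h1>
    <ul>
      <li><a href="/articles/">Articles</a>
        <ul>
          <li><a href="/articles/indexing.html">Indexing a Site</a></li>
        </ul>
      </li>
      <li><a href="/contact.html">Contact</a></li>
    </ul>
  </body>
</html>
```

Linking to the site map from the main page then guarantees a short click path to every page listed on it.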

   - Remember that search engines index no more than the first 100-200 KB of text on a page. Hence the rule: keep a page's text under 100 KB if you want it to be indexed completely.
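A quick way to enforce the size rule is to measure a page's raw size before publishing it. A minimal sketch in Python, assuming a 100 KB threshold as described above (the function name and limit are illustrative):

```python
# Minimal sketch: check whether a page stays under the 100 KB limit
# so that search engines will index its text completely.
INDEX_LIMIT_BYTES = 100 * 1024  # 100 KB, per the rule above

def fits_index_limit(html_text: str) -> bool:
    """Return True if the page's raw text is small enough to be indexed fully."""
    return len(html_text.encode("utf-8")) <= INDEX_LIMIT_BYTES

print(fits_index_limit("<html>" + "a" * 500 + "</html>"))  # prints True
print(fits_index_limit("x" * (200 * 1024)))                # prints False
```

In practice you would run such a check over every generated page and split any page that exceeds the limit.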

You can manage the behavior of search robots with the robots.txt file. This file allows you to explicitly permit or forbid them to index particular pages of your site.
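For example, a robots.txt placed at the root of the site might look like this (the directory names are hypothetical):

```
# Allow all robots everywhere except two private directories.
User-agent: *
Disallow: /admin/
Disallow: /tmp/

# Forbid one specific robot from the entire site (illustrative name).
User-agent: BadBot
Disallow: /
```

The file must be named robots.txt and must sit in the site's root directory; robots request it before crawling anything else.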

The databases of search engines are constantly being updated; records in them may change, disappear, and reappear. That is why the number of indexed pages on your site may vary over time. One of the most common reasons for a page to disappear from an index is server unavailability: the search robot could not reach the server at the moment it attempted to index the site. Once the server is available again, the site should eventually reappear in the index.

Note that the more inbound links your site has, the more quickly it gets re-indexed. You can track the indexing of your site by analyzing the server log files, where every visit by a search robot is recorded.
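Robot visits are easy to spot in a standard Apache-style combined log, because crawlers identify themselves in the User-Agent field. A minimal Python sketch, assuming that log format; the sample line and the list of bot signatures are illustrative:

```python
import re

# Substrings that identify common search robots (illustrative list).
BOT_SIGNATURES = ("Googlebot", "bingbot", "YandexBot")

# Matches the tail of an Apache combined log line:
# "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$')

def robot_visits(lines):
    """Yield the user-agent string of every line that looks like a crawler visit."""
    for line in lines:
        m = LOG_PATTERN.search(line)
        if m and any(bot in m.group("agent") for bot in BOT_SIGNATURES):
            yield m.group("agent")

# Hypothetical sample log lines for demonstration.
sample = [
    '66.249.66.1 - - [06/May/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [06/May/2011:10:01:00 +0000] "GET /page HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows NT 6.1)"',
]
print(list(robot_visits(sample)))
```

Counting such lines per day gives a rough picture of how often each search robot re-visits the site.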