RE: Website Crawling

Website Crawling

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.  Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or (especially in the FOAF community) Web scutters.

This process is called Web crawling or spidering.  Many sites, in particular search engines, use spidering as a means of providing up-to-date data.  Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.  Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.  Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent.  In general, it starts with a list of URLs to visit, called the seeds.  As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.  URLs from the frontier are recursively visited according to a set of policies.

The large volume of the Web implies that the crawler can only download a limited number of Web pages within a given time, so it needs to prioritize its downloads.  The high rate of change implies that the pages might have already been updated or even deleted.  -- Wikipedia.
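The seeds and the crawl frontier described above can be sketched in a few lines of Python.  This is only an illustration, not how any particular search engine does it.  The PAGES dictionary is a made-up stand-in for the Web so the example runs without a network connection, and the names LinkExtractor, crawl and max_pages are invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "Web": URL -> HTML.  A real crawler would download these.
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "http://example.com/about": '<a href="/">Home</a>',
    "http://example.com/blog": '<a href="/blog/post-1">Post</a> <a href="/about">About</a>',
    "http://example.com/blog/post-1": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages):
    """Visit pages breadth-first, starting from the seeds, up to max_pages."""
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so nothing is queued twice
    visited = []              # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:      # the page is missing or has been deleted
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # turn "/about" into a full URL
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


print(crawl(["http://example.com/"], max_pages=3))
```

With max_pages set to 3, the sketch visits the home page, then /about, then /blog, and stops before it reaches the blog post.  That limit is a very crude version of the prioritizing the quote talks about: the crawler cannot fetch everything, so something has to decide what gets fetched first.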

How big is the World Wide Web?  Billions of pages, not millions.  That is a bunch of website pages!  But that is just part of the whole iceberg: only the pages indexed by the search engines are seen by the general public.

Prevent Crawling

No Index, No Follow

Are there certain areas of your website that you do not want indexed?  Some that you want private?  Maybe you have areas that are only meant for the employees of your company.  Maybe you have a page with insider information that you don't want the whole world to have access to.  Maybe these website pages need to be excluded from being indexed.  Keep in mind that excluding a page from the index only keeps it out of the search results; anything truly private should also sit behind a login.

There are instructions that can be put into the back end of the website to make this possible.  The four combinations are: Index, Follow; No Index, Follow; Index, No Follow; and No Index, No Follow.  Index refers to whether or not you want that particular page to be indexed.  Follow refers to whether or not you want the search engine spider to follow the links on that page.
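One common way to give these instructions is a robots meta tag in the head of the page, for example <meta name="robots" content="noindex, nofollow">.  The Python sketch below shows roughly how a well-behaved spider might read that tag.  The names RobotsMetaParser and robots_policy are made up for this example, and real search engines understand more directives than the two handled here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Finds <meta name="robots" content="..."> and records its directives."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for part in content.split(","):
                self.directives.add(part.strip().lower())


def robots_policy(html):
    """Return (may_index, may_follow).  Both are True when the page says nothing."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


private_page = '<head><meta name="robots" content="noindex, nofollow"></head>'
print(robots_policy(private_page))
```

A page with no robots meta tag at all comes back as index and follow, which matches the Index, Follow default.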

Instructing the search engines in this manner to do exactly what you would like will help your website be more successful for your individual business needs.

Keeping the spiders away from pages that do not need to be indexed can also help your PageRank overall.

301 Redirect

To a search engine, http://www.mywebsite.com and http://mywebsite.com are seen as 2 totally different websites, even though someone typing either one is probably going to wind up in the same place.  This can divide your possible PageRank into two parts and hurt the potential of searches for your website.

A 301 Redirect can merge these website addresses together from a search engine's perspective, and therefore cut its workload and increase your PageRank to its maximum potential.

OK.  Caution Now.  Here is some code for you that should work in most Joomla! situations.  And note that I said most.  If you don't know what to do with this stuff -- don't try this at home, and please leave it to the professionals!  Add this to the .htaccess file in your website's root directory and replace mywebsite and .com with your own relevant domain name information.

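     # Always use www in the domain
     # (requires mod_rewrite; place after the "RewriteEngine On" line)
     # Match mywebsite.com, with or without something in front of it
     RewriteCond %{HTTP_HOST} ^([a-z.]+)?mywebsite\.com$ [NC]
     # ...but only when the host does not already start with www.
     RewriteCond %{HTTP_HOST} !^www\. [NC]
     # Redirect permanently (301) to the www address, keeping the path
     RewriteRule .? http://www.%1mywebsite.com%{REQUEST_URI} [R=301,L]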

This is a 301 redirect.  The R=301 flag tells the browser or spider that the move is permanent.
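If you want to see what these rules should do before touching a live site, here is a rough Python imitation of them.  It is only a sketch, not a replacement for testing on a real server: redirect_target is a made-up helper, and it assumes that %1 carries the text captured by the first RewriteCond.

```python
import re

# Mirrors the first RewriteCond; the [NC] flag becomes re.IGNORECASE.
HOST_PATTERN = re.compile(r"^([a-z.]+)?mywebsite\.com$", re.IGNORECASE)
# Mirrors the pattern of the second RewriteCond (negated in the rule).
WWW_PREFIX = re.compile(r"^www\.", re.IGNORECASE)


def redirect_target(host, request_uri):
    """Return the URL the rules would redirect to, or None if no redirect."""
    match = HOST_PATTERN.match(host)
    if match is None:              # first RewriteCond fails: not our domain
        return None
    if WWW_PREFIX.match(host):     # second RewriteCond fails: already has www
        return None
    prefix = match.group(1) or ""  # %1: whatever sat in front of the domain
    return "http://www." + prefix + "mywebsite.com" + request_uri


for host in ["mywebsite.com", "www.mywebsite.com", "blog.mywebsite.com"]:
    print(host, "->", redirect_target(host, "/about"))
```

Under that assumption, mywebsite.com/about is sent to http://www.mywebsite.com/about and the www address is left alone.  Note that a subdomain such as blog.mywebsite.com would be sent to www.blog.mywebsite.com, which may or may not be what you want.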

And if you need a good code editor, we like Bluefish.  It's free and open source, which makes it much less expensive than Dreamweaver ($399.00).  We like free!

      Step Ten -- Increasing Prominence

RE: can assist you with any of these basic SEO tasks.  Please contact us with any requirements that you may have.