How Search Engines Index Websites

When a search engine looks for web pages that match a search query, it does so by consulting a search engine index. The index is a pre-existing collection of data that allows the search engine to work out which web pages are most relevant to the query entered. So, for example, if a user searches for “Red Diesel use laws”, the search engine interrogates its index to find existing web pages that contain the terms “Red Diesel” and “use laws”.
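
To make the idea concrete, the sketch below builds a tiny inverted index in Python: a map from each word to the set of pages that contain it, which a query can then intersect. The page names and contents are invented purely for illustration; a real index is vastly larger and more sophisticated.

```python
# A minimal sketch of an inverted index: a map from each term to the set of
# pages containing it. Page names and contents are made up for illustration.
pages = {
    "page1.html": "red diesel use laws in the uk",
    "page2.html": "diesel engine maintenance tips",
    "page3.html": "red diesel regulations and use laws explained",
}

index = {}
for url, text in pages.items():
    for term in text.split():
        index.setdefault(term, set()).add(url)

def search(query):
    """Return the pages that contain every term in the query."""
    terms = query.lower().split()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

print(search("red diesel use laws"))
# e.g. {'page1.html', 'page3.html'} (set order may vary)
```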

A search engine uses an index so that it can return results from those thousands or millions of pages as quickly as possible. The index tells the search engine, almost instantly, how many pages on the web contain the phrase or phrases it is looking for. By retrieving only the pages that contain the right words, the search engine separates the pages that are probably relevant from those that, textually speaking, have no relevance at all. It then ranks the pages it has found according to how often the words in the query occur as proper phrases, or embedded in genuinely grammatical English, and so “guesses” which pages are most relevant to the query it has been given. That “guess”, built on an interlocking set of fiendishly complicated rules about grammar, slang, probability and so on, is almost always very accurate.
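
A very rough illustration of that ranking step, again in Python, might score each page by counting the query’s individual words and adding a bonus when the exact phrase appears intact. The weighting and the sample documents are arbitrary assumptions; production engines use a great many additional signals.

```python
def score(text, query):
    """Crude relevance score: count occurrences of each query term, plus a
    bonus when the exact phrase appears intact. Weights are illustrative."""
    text, query = text.lower(), query.lower()
    term_hits = sum(text.count(term) for term in query.split())
    phrase_bonus = 5 * text.count(query)  # arbitrary weight for the full phrase
    return term_hits + phrase_bonus

# Hypothetical documents, invented for the example.
docs = {
    "page1.html": "Red diesel use laws changed recently. Red diesel is rebated fuel.",
    "page2.html": "Diesel cars and the laws on their use.",
}

ranked = sorted(docs, key=lambda url: score(docs[url], "red diesel"), reverse=True)
print(ranked)  # ['page1.html', 'page2.html']
```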

The search engine index is created by search engine spiders, which “crawl” web pages in order to gather information on what they are “about”. Naturally, anything as logical as a computer program (which is basically what a search engine is) finds it easier to index pages that have themselves been logically ordered. Anyone looking to improve the ranking of their website in a search return (which, as we have seen, is the direct result of a search engine querying its index) therefore needs to make sure that all the information on their site is organised in a fashion currently “approved” by search engine spiders.
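
For a sense of what a spider does, the sketch below crawls pages breadth-first using only Python’s standard library: it fetches a page, records its contents, extracts the links it finds, and queues them up to visit next. This is a bare-bones assumption of how crawling works; real spiders respect robots.txt, throttle their requests, and parse pages far more carefully.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: fetch a page, store its HTML, queue its links."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # unreachable page: skip it
        pages[url] = html  # a real spider would extract and index the text here
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http"):  # skip mailto:, javascript:, etc.
                queue.append(absolute)
    return pages

# crawled = crawl("https://example.com")  # hypothetical starting point
```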

The spiders “like” information that is clearly headed and nested in an appropriate fashion: the most relevant parts of a site’s textual information are assumed to appear as primary headings (you may know these as “heading 1” in web speak) in the site’s overall structure, slightly less relevant text as secondary headings (“heading 2”), and so on. Any site, then, that contains the phrase “Red Diesel” in its primary headings is going to be returned as relevant to our hypothetical query. A site whose secondary headings contain both phrases, “Red Diesel” and “use laws”, will be seen as more relevant than one that only refers to Red Diesel, while a site that has “Red Diesel” and “use laws” in its primary headings will end up topping the list.
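
One way to picture that weighting, under the assumption that primary headings simply count for more than secondary ones, is the small Python scorer below. The weights of 10, 5 and 1 are made up for illustration; no search engine publishes its actual figures.

```python
import re

def heading_score(html, query):
    """Weight query matches by where they appear: heading 1 counts most,
    then heading 2, then anywhere else on the page. Weights are illustrative."""
    query = query.lower()
    weights = {"h1": 10, "h2": 5}
    score = 0
    for tag, weight in weights.items():
        for text in re.findall(rf"<{tag}[^>]*>(.*?)</{tag}>", html, re.I | re.S):
            if query in text.lower():
                score += weight
    if query in html.lower():
        score += 1  # the phrase appears at least somewhere on the page
    return score

# A hypothetical page fragment.
page = "<h1>Red diesel use laws</h1><h2>Red diesel penalties</h2><p>...</p>"
print(heading_score(page, "red diesel use laws"))  # 10 (h1) + 1 (body) = 11
print(heading_score(page, "red diesel"))           # 10 + 5 + 1 = 16
```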

Search engine spiders are designed to look at more than just words and phrases, of course, though everything they log in the search engine index has to be categorisable in some way, otherwise they cannot refer to it logically. This is a great help to web developers, who can deliberately follow the search engines’ preferred conventions throughout their sites. If you arrange all of the pages in your website so that a search engine spider finds them easy to index, that site is more likely to be returned when pertinent queries are entered.

The rules that govern the indexing process are continually changing, as the search engines attempt to stay one step ahead of the optimisation game. Understanding the indexing process, then, is not enough. If you want your site to meet the search engines’ preferred criteria, you need to be partnered with a company that knows what those criteria currently are.