A Programmer’s View of SEO: A conversation with Jaimie Sirovich, co-author of Professional Search Engine Optimization with PHP

In their new book Professional Search Engine Optimization with PHP (Wiley Publishing, Inc., 2007), Jaimie Sirovich and Cristian Darie argue that the days of trying to outsmart a search engine through keyword densities and HTML tags are pretty much over. Search engines are now driven by a sophisticated set of algorithms that mimic how we human beings actually evaluate the importance of a webpage. Trying to reverse engineer those algorithms has become a fool’s errand―their operation is so complex that even the computer scientists at Google no longer fully understand how a given page ranking came about.

Sirovich and Darie argue that a better approach is to consider where search engines can stumble―such as when encountering duplicate content. The overriding goal of technical SEO, they say, is to make certain that the website architecture is search engine friendly―so that your site is not undervalued. Their book stresses the importance of site architecture to SEO: a good architecture actually assists a search engine in traversing, and therefore “understanding,” the site.

The authors make for an interesting duo. Sirovich (http://www.seoegghead.com) is a search engine marketing consultant in Queens, New York, with a degree in computer science. Darie (http://www.cristiandarie.ro) is a software engineer and author in Bucharest. Sirovich once bought one of Darie’s books, objected to a section, and said so very publicly on a website. Darie emailed a response, and after their argument cooled, they kept corresponding. That email exchange turned into a book idea and then into a collaboration. All of this has taken place entirely over the Internet. The two have never met.

“It was an interesting process because no one has written a technical book like this, even though the subject is discussed each year at search engine conferences,⁠”⁠ Sirovich says. “⁠SEO is tricky because it’s a hybrid discipline: part marketing, part technology.⁠”⁠ Their book is about the latter.

You note in the book that SEO has gotten a lot more complicated than when people just worried about keyword density. What has changed?: Search engines have gotten much more sophisticated. They used to have relatively simple algorithms that were easy to reverse engineer. One used to try to repeat a keyword a certain number of times―not too many, but enough. It was all fairly straightforward. Now it’s much trickier. Google, Yahoo! and Microsoft are developing algorithms that try to approximate human behaviors and sentiments. They essentially ask: what would a human use or do when rating a given website? In a sense, they are trying to measure the amount of “buzz”―outside attention―a given site attracts in addition to actual page content.
So if you want to get noticed on the Internet, you need to be “buzz-worthy,” which is not completely a matter of technique.: Right―in some ways, search engine marketing has become more like traditional marketing, though it’s not what our book primarily addresses. We address the technical aspects of SEO. One good way to look at the problem is to consider how Web designers actually sit down and create a website. They start out by thinking about their audience―who they are and what they require. To cite an obvious example, if you are designing a website for an optometrist, you might want to keep the font size large because your visitors may not have great vision.
We suggest that you consider another kind of visitor that stops by periodically. I’m referring to search engine spiders. They may not be human, but they can play an important role in your site’s success. This means you should base part of your Web design on making sure the spider’s “experience” is a good one. A lot of companies miss this point because the spider is largely invisible. The designers don’t have direct access to it. They can’t do focus groups. They can’t go directly to the Google or Yahoo! algorithms and ask what they think. But you can really mess up a search engine by doing things that seem completely okay―and sometimes they’re really not.
What’s the distinction between a spider and the search engine algorithm?: Sometimes the two are interchangeable, and sometimes they’re not. The spider is just a piece of a search engine’s software application that goes out on the Web and traverses its content. Frequently what happens is that the spider will simply list the URLs in a database, and then another piece of the application will actually get the content. Spiders operate with a certain degree of logic, but their behavior is pretty basic. So you want to make sure that when the spider makes its rounds and stops by your site, it doesn’t get confused.
For example, you want to avoid “spider traps,” in which a set of Web pages causes a spider to make an infinite number of requests without acquiring any substantial content. This can happen when a dynamic website generates different URLs that host very similar or identical content. Of course, the spider cannot be completely ignorant of duplicate content―otherwise, it would crash whenever it hit a spider trap, or at the very least, would waste an enormous amount of time fetching useless content. But even so, duplicate content is a major problem for SEO―because as a result, a website may not get spidered as often, or as comprehensively.
A spider trap sounds like an easy trap to fall into.: That’s true. You can throw spiders a curve ball even if you are just following standard practice. Another example is the use of tracking parameters. Typically, marketing will want to know if a user performs a certain action on a website. So a parameter will be added to the URL that tracks when, say, a user clicks on a particular link.
This is part of analytics.: Yes, it is now standard practice. But here again, this can result in an enormous number of pages that really say the same thing. Which means a search engine must try to figure out if those pages are really equivalent. It must guess which page is the authoritative page. Otherwise, you essentially divide up your “⁠link equity⁠”―that is, the value that is transferred by a particular link to another URL. If a Web page is linked from many websites, search engines will consider it more important than a page with fewer links. So ideally, you want all the links pointing to a single page that is defined by a URL. You don’t want to link to five different pages that contain the same information.
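To make the tracking-parameter problem concrete, here is a minimal Python sketch (not taken from the book, which uses PHP) of how a site might collapse tracked URLs into a single canonical URL. The parameter names in TRACKING_PARAMS and the example.com addresses are hypothetical; a real site would substitute whatever its analytics setup actually appends.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical tracking parameter names; a real site would list its own.
TRACKING_PARAMS = {"ref", "src", "campaign"}


def canonical_url(url: str) -> str:
    """Return the URL with tracking parameters removed.

    The remaining parameters are sorted so that two URLs that differ only
    in parameter order map to the same canonical form (an assumption of
    this sketch; it only holds if parameter order is not meaningful).
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    kept.sort()
    # Host names are case-insensitive, so lowercase the host; drop any fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


if __name__ == "__main__":
    for tracked in (
        "http://www.example.com/products?id=7&ref=newsletter",
        "http://www.example.com/products?ref=banner&id=7",
        "http://www.example.com/products?id=7",
    ):
        print(canonical_url(tracked))
```

All three example URLs reduce to the same address. One common way to use such a function is to redirect any request whose URL differs from its canonical form, so that inbound links end up credited to one page.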
So generally speaking, duplicate content is a big problem for SEO.: It’s one of the most serious architecture problems, and one of the biggest areas of concern in our book. In fact, many other topics we cover are in some way related to, or are solutions for, duplicate content. For obvious reasons, search engines spend a lot of time eliminating duplicate content. But you’ll do yourself a huge favor and probably make your site rank better if you eliminate as much of it as possible to begin with.
For example: search engines use links to determine not just which sites are popular, but which pages are the most popular. So if you have lots of pages, and a few of those have lots of links to them, a search engine will assume those pages are important. That’s all well and good. But imagine now that you have three nearly identical pages, with people linking to all three. Now you are dividing up the equity, so that even if the page is very popular, it won’t register as such. And as the number of duplicate pages increases, so does the problem. Given enough duplicate content, a search engine will just give up. That’s the extreme case.
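The arithmetic behind divided link equity can be illustrated with a toy Python model. It assumes, purely for illustration, that a page’s popularity is just its count of inbound links (real search engines weigh links in far more complicated ways). The URLs and link counts are invented.

```python
from collections import Counter

# Hypothetical inbound links: each entry is the URL that some external page
# links to. The "widget" page exists under three URLs with identical content.
inbound_links = [
    "http://www.example.com/widget",
    "http://www.example.com/widget?ref=blog",
    "http://www.example.com/widget?ref=blog",
    "http://www.example.com/widget?ref=email",
    "http://www.example.com/other",
    "http://www.example.com/other",
    "http://www.example.com/other",
]


def strip_query(url: str) -> str:
    """Crude stand-in for canonicalization: drop everything after '?'."""
    return url.split("?", 1)[0]


def most_linked(counts: Counter) -> tuple:
    """Return the (url, link_count) pair with the most inbound links."""
    return counts.most_common(1)[0]


# Links counted per distinct URL: the widget page's links are split three ways.
split_counts = Counter(inbound_links)

# Links counted after duplicates are folded into one URL per page.
merged_counts = Counter(strip_query(url) for url in inbound_links)

if __name__ == "__main__":
    print("duplicates counted separately:", most_linked(split_counts))
    print("duplicates consolidated:      ", most_linked(merged_counts))
```

With the duplicates counted separately, the widget page’s four links are spread over three URLs and the three-link page looks like the most popular one. Once the duplicates are consolidated, the widget page comes out ahead.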
Eliminating duplicate content seems like a legitimate way to gain attention from a search engine. But you also talk about so-called “black hat” techniques that are more controversial, and about some, like cloaking, that can be “used for both good and evil.”: We define “white hat” as techniques that follow the guidelines of the search engine and “black hat” as those that do not, and which may also exploit the work of others. We like what Dan Thies wrote in his book on search engine marketing―the difference is whether you, as an SEO consultant, see yourself as a lawyer or an accountant.
Cloaking means you are delivering different content to the search engine than you are to humans. It’s a good example of a technique that wears both hats. Cloaking can be used for ethical purposes to present information or remove things that would confuse the spider. It’s controversial, but even large sites have been known to do it.
I noticed that you used the New York Times site as an example of cloaking.: We did pick a prominent example. In this case, I’d say the use is definitely controversial, because they are using the technique not to aid the spider, but to make money by monetizing their content. The New York Times requires that users pay for some premium content. But search engines aren’t so restricted. As a result, search engines are indexing the content of many pages from the site in the SERP―the search engine results page. But those results aren’t actually available to you. When you go ahead and click on a link, the result essentially says: “⁠give me your money.⁠”⁠
So the boundary line between black hat and white hat is more blurred than people might think.: Yes, because ostensibly Google will say that you’re going to Google Hell for cloaking. So why isn’t the New York Times in Google Hell? It’s a grey area. Google will never say that cloaking is okay, but they are allowing it here. Obviously, cloaking is not entirely forbidden.
Your book is specifically about doing SEO with PHP. Why PHP in particular?: The idea of using a developer language is that we wanted to show real world examples. We didn’t want to just speak about the technology in SEO in the abstract―we wanted to show it in action. In fact, PHP is just one way to approach it, and that’s where we started. We have another edition coming out for Microsoft’s ASP.NET. We’ve selected those two languages simply because they are the most popular technologies for writing Web applications. Someone who uses Java could also use our book; they would just have to expend a little more effort to understand how the principles apply.
You have written that SEO is organic to the design process―you can’t leave SEO until the end of the project.: We mean that a website’s architecture is the foundation of SEO―it “⁠grounds⁠”⁠ all future search engine marketing efforts. Obviously, content itself is also important: copywriting, graphics, and good page design, as well as an understanding of user psychology. But as we write, architectural minutiae can make or break you, and even small mistakes in implementation can become a big problem later on.
You say that it’s much harder to do SEO with an existing website than with a new one.: The process is much more painful. Of course this is true for many software projects. Legacy concerns always haunt the developer. Windows 3.1 haunted Microsoft when it had to keep Windows 95 compatible with it. Apple has gotten around this problem by simply abandoning their old architectures. They don’t have to worry about legacy all the time because they just toss it out the window, and then assert they are superior to Microsoft. That may be true―the resulting architecture may be cleaner―but the practice is often disruptive to their customers.
So do you suggest to developers that it is sometimes better to start from scratch and completely rebuild the site?: It depends on the situation. If you have a website that works and is getting indexed and you are getting traffic, the answer is no: you just want to make iterative improvements. But there are some websites that are hopeless―like those designed entirely in Flash or AJAX. Doing so will make a site largely invisible to search engines, because you are embedding all of your content in an application. Instead, one should use Flash or AJAX inside an HTML-based page where applicable. The approach I would take is to build the core of the site in HTML. Then, wherever I need something that’s highly interactive or animated, I would use Flash or AJAX―selectively and sparingly. But you don’t want to use either technology for the entire site. Search engines look for information, not applications.
A lot of art-oriented or creative websites use Flash because it’s easier to create interactive, cool-looking content with it. But if you begin to notice you don’t have any traffic, the best answer may be a complete redesign of the site.
You talk about three search engines: Google, Yahoo! and MSN. Is Google the 2000 pound gorilla?: Google is the 2000 pound gorilla. You can certainly monetize and exploit traffic from MSN and Yahoo! but you’re primarily looking at Google these days. That could change down the road. I see lots of improvement from Microsoft, in particular, so it’s hard to predict long-term. But according to traffic statistics, Google accounts for over 50 percent of the traffic that people get. And I think those statistics tend to understate the case. From what I see much of the time, Google often represents 70 or 80 percent of a typical site’s traffic.
Even so, does each search engine require a different SEO strategy?: The concepts in our book are pretty much universal among search engines. The algorithms might differ somewhat in how they analyze popularity, or the quality of the algorithms might differ. One engine might do a better job than the other of ranking documents, for example. But the architectural concerns are the same. If you design a website that works for Google, for the most part it’s going to work for Yahoo! and MSN architecturally.
What kind of response have you gotten to the book?: Pretty positive. We’ve got a bunch of five-star reviews at Amazon. We’re seeing two audiences: total computer nerds with an interest in search engine marketing, like us, as well as people who are interested in online marketing books. We’re happy about that, because we really set out to speak to both groups. Our message is that these people need to talk to each other.
They need to talk to each other even though the technical and marketing aspects are so different?: There’s no way to avoid it. A marketer needs to know about how search engines work in relation to websites, and a developer needs to understand how search engines affect marketing. So they need to talk to each other, which means they can’t speak entirely different languages. Marketing people need to know the importance of architecture, and, for example, the pitfalls of designing spider traps. And equally important, technical people need to know why ranking number one in Google could be worth a lot of money. Plenty of developers could write a 10-page explanation on how a search engine works, yet don’t know how much a high ranking for, say, Viagra, is worth.
Where do you think SEO is headed? Are we going to continue to see an arms race between the search engines and Web designers?: Some of this competition is going away, including the whole idea of trying to reverse-engineer search engines, because doing so is becoming more and more difficult. Increasingly, people will focus more on posting creative, useful content―because eventually, that’s what garners lots of links. But some things won’t change. I think the architectural aspects of SEO, the whole concept of our book, will be valid for a long time.
In the first chapter, we have a simple diagram of three boxes, labeled “Search Engine,” “Site Architecture,” and “Content.” What it is saying is that the architecture of a website is the lens through which a search engine looks at your content. It’s the most trivial diagram in the world, but it pretty much conveys the whole message of the book.