SEO hidden text and AJAX (devx.lt)
81 points by Bogdanas on July 30, 2015 | 47 comments



OK, I think this doesn't mean much and is a flawed experiment. The test URL is https://devx.lt/en/node/9, which contains ALL the content used in this experiment. All 9 tests are done on the same page, at once.

In the 4th experiment, he searches for "site:devx.lt GetJar is an independent mobile phone app store founded in Lithuania in..." and the result is returned WITHOUT any bold text. So this was a "most relevant" result, not a "found" case. The problem is that these paragraphs share many of the same words. Google returns "LinkedIn /ˌlɪŋkt.ˈɪn/ is a business-oriented social networking service. Founded in..." because, for one, it at least contains the "founded" part, and I'm sure the rest of the paragraph overlaps further. Since the results are limited to "site:devx.lt" and the paragraphs share the same words, of course all 9 tests end with "found": it's the only page even close to being relevant.

I'd like to see the same test repeated, -NOT- all tests on the same page.

Some tests return bolded results, so I consider those to be true positives. I think tests 1, 2, 4, and 5 are false positives.


We also can't be sure that "literal searches in quotes" are treated the same way for SERP ranking etc., which I guess would be the motive for hiding the text.


Also, these texts contain double quotes, so I wouldn't be surprised if they aren't even considered.

A few examples:

Its mission statement from the outset was "to organize the world's information and make it universally accessible and useful,"

140-character messages called "tweets"

LinkedIn filed for an initial public offering in January 2011 and traded its first shares on May 19, 2011, under the NYSE symbol "LNKD"


What you did is prove that Google indexes hidden text. What you did not prove is how that impacts your ranking. Google will know that this text is hidden, even though it is indexed, and that is what is important.


Exactly. I see many comments here about hidden text, switching the user agent, CSS tricks and a bunch of things that don't work anymore. The reason we used to use that stuff was that Google read the whole page and used keyword density as one of many ranking factors. Since those factors aren't used anymore, and you can also get penalized, there's no reason to do it, unless you want a page with just one picture and no text but still want the benefit of the content.


I suspect there will always be a way to hide text in a way that a search engine won't be able to detect. You can do this without applying any properties directly to the element containing the text (raise other elements above it), hide it with JavaScript, or use any number of such techniques.
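
For illustration, a minimal sketch of both approaches described above; the class name and styles are made up, not taken from the article:

    <!-- the paragraph itself carries no hiding styles at all -->
    <div style="position: relative;">
      <p class="stuffed">keyword-rich text a crawler can still read</p>
      <!-- an opaque sibling stacked above it hides it from human visitors -->
      <div style="position: absolute; top: 0; left: 0; right: 0; bottom: 0;
                  background: #fff; z-index: 10;"></div>
    </div>

    <script>
      // or hide it only after load, so the static HTML looks innocent
      document.querySelector('.stuffed').style.display = 'none';
    </script>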


I believe Google is moving towards a graphical understanding of pages as well; it should be able to render the pages and see what is going on. Those days will soon be over, and technical excellence in SEO will do the talking.


What if you use custom blank fonts? Or simply detect the User Agent and serve a different CSS file?
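
A rough sketch of the second idea, just to make clear why it amounts to cloaking; the file names and the bot check are made up:

    var http = require('http');
    var fs = require('fs');

    http.createServer(function (req, res) {
      var ua = req.headers['user-agent'] || '';
      // crawlers get a stylesheet that hides nothing,
      // humans get one that hides the extra text
      var file = /Googlebot/i.test(ua) ? 'crawler.css' : 'humans.css';
      res.writeHead(200, { 'Content-Type': 'text/css' });
      res.end(fs.readFileSync(file));
    }).listen(8080);

As the replies point out, Google deliberately crawls from undisclosed IPs with a non-identifying user agent to catch exactly this pattern.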


I think it's known that Google can masquerade as a "normal" visitor if it suspects you of cloaking (a non-identifying UA string, coming from an undisclosed Google crawl IP):

https://support.google.com/webmasters/answer/66355?hl=en

I think they also use this to notify you in Webmaster Tools when/if your site is hacked and the hack is trying to avoid detection by normal users.


Or detect the user agent and simply serve different content? I'm pretty sure that Google penalizes this, though.


Yep. Here you go: https://support.google.com/webmasters/answer/66355

Here's the full "Don't do this sh!t" article from Google: https://support.google.com/webmasters/answer/35769#quality_g...


When the hidden text and the query are exact matches as well as unique, of course Google has no option but to show the only result it has.

What's important, though, is not whether it's indexed. What's important is the weight that the text carries in SEO ranking. I'd venture to guess that if two sites had the exact same text and one was hidden and the other wasn't, the latter would show higher in Google results.


A few years ago sites would get penalised for hiding tags, as this was used by some (often dodgy) sites to cheat their search rankings (e.g. lots of hidden text about popular search terms to direct people to an anti-virus scam page).

So then Google started giving heavy precedence to larger text (this also puts emphasis on titles), which led to some CMSs having a "tag cloud" where tags are meshed together, with more relevant / popular search terms in a larger font than less relevant / popular tags.

I think Google has moved on again since then and now detects whether keywords appear in a natural sentence or in an artificial cloud of tags designed purely for SEO - but I could be wrong there. However, this is the sort of detail I would have expected / hoped an article with "SEO" in the title* would comment on.

* Article's title rather than the HN submission title.


Not sure who the audience is here. If the text is in the DOM, it's gonna get indexed. Computers don't care if text is white on a white background.

That being said, Google getting dynamically generated content (ajax) is newish (I don't think it's new enough to be "news").

More interesting is whether Google can "parse" things like <noscript>, or notice that CSS hides the content or that text is too small, and decide to rank based on that (contributing factors to a "quality" rating).
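
Concretely, the kinds of patterns being asked about look something like this (a made-up sketch, not taken from the article):

    <noscript>keyword-stuffed fallback text</noscript>

    <p style="display: none;">hidden via CSS</p>
    <p style="color: #fff; background: #fff;">white text on a white background</p>
    <p style="font-size: 1px;">text too small to read</p>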

I'd guess "yeah, probably". After all, they can rank based on whether your site is mobile-friendly, and that must take some interesting metrics to decide.

Google specifically fighting "black hat" SEO techniques is really, extremely old news. Google being good at indexing "all the things" - also old news.


Computers might not care if text is white on white, but humans do. So just because the text is in the DOM doesn't necessarily mean it'll get indexed. It's definitely in Google's interest to fight against techniques like these. They already mitigate against things like keyword spam, etc.


    Google getting dynamically generated content (ajax) is newish
    (I don't think it's new enough to be "news").
They've been doing it for years.

    More interesting is if Google can "parse" things like <noscript> or
    if the CSS hides content, or if text is too small and decide to
    rank based on that (contributing factors to a "quality" rating).
Signs point to Googlebot using a browser, so it doesn't need to "parse" things, per se. It just loads the page and checks what happens.

    Google specifically fighting "black hat" SEO techniques is really,
    extremely old news. 
I have a slang dictionary website. It showed thousands of citations of slang use from major publications, TV shows, movies, etc. Google penalized the site because of that.[1]

I wanted to show the citations because they're a major way that I differentiate my site from other slang dictionaries like Urban Dictionary. So I used AJAX to load the citations, hoping that Google wouldn't index that and penalize the site.

I was wrong - but that doesn't mean what I was doing was black hat. It was the opposite.

[1] I'm fairly confident that Google penalized the site in part because of the citations, and that this was done algorithmically. But it may also be the case that former Google employee, and head of the web spam team, Matt Cutts played a primary role in manually - and permanently - penalizing the site.


Google definitely penalizes keyword stuffing via white text/small text/etc. and has done so for years and years.


This proves nothing.

Sure, it indexes the text, but that has no bearing on SEO.

Most people using hidden text use it as a keyword stuffing tool, stuffing more relevant keywords into the page in order to get higher page rankings.

A lot of SEO techniques and how Google views them come down to intent. It's pretty clear that using hidden text and other gray- and black-hat methods will be picked up by Google and penalized, because it's clear the intent is to try to gain an advantage in the SERPs. This has been true for going on ten years or more:

source: https://moz.com/google-algorithm-change#2000

Cassandra — April 2003: Google cracked down on some basic link-quality issues, such as massive linking from co-owned domains. Cassandra also came down hard on hidden text and hidden links.


Right, the whole article seems to prove something fairly obvious:

"So here You have it. Google can index (and indexes) hidden text and dynamically inserted text."

Then concludes the entire thing with something completely unrelated:

"If content is relavant [sic] to Your website, You won't get penalty from google."


The last line is related, but is based on some faulty logic. The penalty isn't going to be evident if you're doing a search like:

site:mysite.example.com "explicit text search"

All you're seeing is that your site was indexed. The author posits that since the page is visible in the index, there was no penalty. That's not true at all, of course. If there were 1,000,000 results for a more generalized search like "explicit text search," the author's pages could well rank very, very low, and that could very well be due to detected grey-hat techniques.


Well-said.

As another commenter put it, Indexing != SEO. Google has for years been quite dedicated to sniffing out webspam and penalizing those who use black hat techniques.

You can't hack SEO. It may lead to some temporary gains - but inevitably the house of cards will crumble, and you'll find yourself having burned the domain you built with penalties.

TL;DR - If you hide something on a page, you may in fact be indexed. But in terms of rankings, your content will likely wind up on page 9,743 of the SERPs.

From the horse's mouth: https://support.google.com/webmasters/answer/66353

Thanks to the OP for the original article, though - I'm giving a talk on 'SEO for Developers' next month, and this is a great example of false conclusions and how NOT to approach SEO.


A couple of months ago, a Google engineer did an interview on this very subject where, I believe, he mentioned that the text would be indexed and a penalty would only be applied when possible. This was interesting because obvious white text on a white background would be penalized, but not text dynamically hidden through JavaScript, because they were not able to reliably determine whether the practice was intentional and legitimate (a carousel, for example) or malicious.

If anyone is able to find that interview, I'd be extremely thankful.


In my experience Google will find and index hidden text, but this may negatively affect your ranking.


I would say, just to be safe, use the <details> and <summary> tags with a polyfill if you need to hide and then reveal text.

It's more meaningful, and it shows that you're not trying to be deceptive. I've not tested it myself, as it is relatively new and I've not had the opportunity or time yet.
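
A minimal sketch of what that looks like; the feature check is a common pattern, and which polyfill to load is left open:

    <details>
      <summary>Read more</summary>
      <p>The long-form text you want collapsed by default.</p>
    </details>

    <script>
      // load a polyfill only in browsers without native <details> support
      if (!('open' in document.createElement('details'))) {
        // e.g. inject a script tag for your polyfill of choice here
      }
    </script>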

edit: link to <details> docs https://developer.mozilla.org/en-US/docs/Web/HTML/Element/de...


Just because they will index it doesn't mean it won't harm your ability to rank for that text.

Googlebot is a headless version of Chrome - so while they can see that stuff, they also know when stuff isn't "visible" to the user and treat it accordingly.

The problem with tests like this is that you either need to test with site: or with made-up terms - but the algorithm isn't static - it changes based on the corpus of relevant results (e.g. if there are only 3 relevant results, they won't apply spam penalties or Panda weights, etc.).

When you have such a small scale test, the corpus of results is always small - so it's not accurate.

I'm confident that if you tried these techniques on a site with content currently ranking in a highly competitive area, then changed it to one of these, your rankings would fall.


> Googlebot is a headless version of Chrome

I highly doubt this is the case; what is your source?


Everything Google has said that the bot does, plus a knowledge of coding. They aren't using a Lynx-style browser to render JavaScript and determine the position of ads on pages. They have to be using an actual headless browser for that - and it's a safe bet they aren't using Firefox or IE or Safari.

They aren't telling us how Googlebot is coded, but it's pretty easy to deduce.

http://ipullrank.com/googlebot-is-chrome/ for more details


> but it's pretty easy to deduce.

Indeed. I would try to make bots report the value of window.chrome to verify this assertion.
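
Something along these lines, for instance; the /log endpoint is made up:

    // report whether the visiting "browser" exposes the chrome object,
    // using an image beacon so the result lands in your server logs
    var hasChrome = typeof window.chrome !== 'undefined';
    new Image().src = '/log?ua=' + encodeURIComponent(navigator.userAgent) +
        '&chrome=' + hasChrome;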


They wouldn't identify as Chrome - it'd still be a custom browser based on the Chrome code.


Indexing != SEO

A better test is to insert mentions to another site for relevant keywords. Then measure impact on ranking for the other site.

My guess is likely zero to negative impact (due to hidden text being penalized).


Is the argument that Google doesn't index hidden text, or that Google penalizes the use of hidden text?

This experiment proves what we knew -- Google indexes hidden text; however, it doesn't prove or test what people want to know -- whether Google penalizes results that hide text.


Yeah, ultimately this doesn't demonstrate anything. We know Google can index text if it's hidden - the only issue with hidden/obfuscated text is that if Google determines it to be so, there's likely an SEO penalty to be paid.

The searches done here are so explicit that all it's demonstrating is that pages that are indexed by Google exist in the index. We have no information on how that would compare against competition (poorly, likely).

And to boot, OP may have degraded his/her own blog's overall SEO.

In short: not very valuable.


The article ends with: "Summary So here You have it. Google can index (and indexes) hidden text and dynamically inserted text. If content is relavant to Your website, You won't get penalty from google."

I think that's disingenuous. The panda algorithm is believed to specifically target duplicate content. You're saying that if I had a website full of content stolen from more authoritative web sources, I wouldn't lose any ability to rank? I think you're wrong :)


    I think that's disingenuous. 
It's not disingenuous, it's just plain incorrect. :)

    The panda algorithm is believed to specifically target duplicate content.
My dictionary website was Panda-penalized because it showed citations of slang use from books, news articles, TV shows, movies, etc. The citations were loaded via AJAX.

I have more details in a prior comment on this post, here: https://news.ycombinator.com/item?id=9977372


Hm. "Found" does not mean "OK". If this were a serious, site wide proposition in a competitive vertical on a high ranking domain, you'd be risking everything going down this route.

A much nicer way to solve the problem I think these tests are trying to address is to implement pre-rendering of your ajax pages. Take a look at something like prerender.io.
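
The underlying pattern such services use is easy to sketch: detect a crawler and hand it a snapshot of the rendered HTML instead of the empty application shell. The bot regex and file paths below are placeholders:

    var http = require('http');
    var fs = require('fs');

    var BOTS = /googlebot|bingbot|baiduspider/i;  // incomplete, illustrative only

    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/html' });
      if (BOTS.test(req.headers['user-agent'] || '')) {
        // a snapshot rendered ahead of time by a headless browser
        res.end(fs.readFileSync('snapshots' + req.url + '.html'));
      } else {
        // normal visitors get the single-page-app shell
        res.end(fs.readFileSync('app-shell.html'));
      }
    }).listen(8080);

Unlike the cloaking discussed elsewhere in this thread, the snapshot is supposed to contain the same content a user would eventually see once the JavaScript runs.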


Isn't this something everyone already knows? Basically, if it's in the code, it's gonna get indexed. The real issue here is how those different methods rank against other "options" with similar search results; searching specifically within your own site isn't gonna yield anything that interesting.


I was concerned about this when recently releasing a meteor.js site: whether Google would index the pages and content.

Meteor serves pages as a single-page application (same as AngularJS and other Ajax frameworks), so if you simply view the source of the home page, you'll see mostly empty HTML. I didn't add any extra modules for SEO and simply published the site.

It looks like Google has indeed indexed the pages and text: https://www.google.com/webhp?sourceid=chrome-instant&ion=1&e...


GoogleBot has understood JavaScript for years. Their mobile page tester is proof of this fact.


For tests 7 and 8, what if the JavaScript were stored in an external file, which of course is generally best practice? For 7, would the string still be parsed and indexed? For 8, would the file be downloaded and indexed? I'd be surprised!
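
In other words, the difference between something like these two variants (file and element names made up; no claim that this matches the article's exact tests):

    <p id="target"></p>

    <!-- inline: the string is right there in the page source -->
    <script>
      document.getElementById('target').textContent = 'text inserted by JavaScript';
    </script>

    <!-- external: the crawler has to fetch and execute insert.js to ever see it -->
    <script src="insert.js"></script>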


Regarding Ajax, I've heard Google was indexing high-profile sites, not _every_ site, due to the cost of running JavaScript.

Is that still the case? It would have been better if the test had been done on a brand-new domain.


I am the co-founder of SEO4Ajax and I can confirm that, every day, I see Ajax sites that are not properly indexed on Google because its bots can't interpret the JavaScript correctly, when they try to at all.


I would really like not to care, because I design for humans and not search engines, but the sad part is that I can't.

Startup idea: make a search engine that runs OCR on the rendered page ...


It doesn't mean that ranking is just as effective with hidden text, only that Google can index hidden text. Big difference.


You are not going to get a penalty for hidden text or for trying to hide text; you will still be indexed, but that doesn't mean you will rank for what you are intending to rank for. Google is now miles better at detecting what's trying to trick it and what's not, and if you compare these and see what ranks better, the ones that are trying to hide stuff will rank much lower or won't rank for certain keywords.


Showing that it gets indexed does not show how well it would rank relative to normally visible text, though.


Is Google crawling the site for every search, or is it using the same indexed version for all tests?


I assumed that this was a long-running experiment, with Google reindexing a new cached version between each test.



