Architecture: Filters, counts and product searchs
I've been thinking about this for some time and thought this would be the best place to post since you guys know what you're talking about.
For a long time ecommerce sites have had "Filters" in the left nav. But recently, I've been noticing some of them have superquick "stackable" filters (e.g. checkboxes which you can keep on ticking) and a count next to the the number of results if you tick each box (e.g. http://www.zappos.com/womens-casual-shoes~94 - same with Amazon Ebay)
Now, they use ajax pagination to load only the main div rather than the whole page which will speed everything up obviously, not to mention they have some pretty hardcore hardware, but the output just seems crazy quick for everything it's counting in the background (< 0.5 seconds). Even though they have mega high traffic, I still think the number of possible combinations of filters seems to me that if a cache ID was implemented on the counts then hit rate would be very low because so many URLs would be unique and a lot of the queries would be raw. They would also need to expire cache in categories when products are added, I imagine.
So, does anyone have any idea what they are using for this? I'm starting to think it's not mysql + memcached... their stuff is just too fast.. can it be? especially with millions of rows?
e.g. They can't surely be doing one query for every count? e.g. "Slippers (1286)","Comfort (7068)", or are they?
OR perhaps they getting the current result set of product IDs using a 3rd party search tool like amazon A9 and then using dynamic language to parse the results and calculate the counts coupled with a global cache based on the URL?
Or are they using something totally different to mysql which I am not yet aware of?
Sorry to use specific examples of sites (I checked the rules and this seems OK) but just wanted to be clear what I was talking about.
Any input / ideas on how they do this, or the best implementation of this would be appreciated.
[quote]They can't surely be doing one query for every count? e.g. "Slippers (1286)","Comfort (7068)", or are they?[/qoute]
They could be.
But you know, they COULD be doing those counts in JS code. If the total number of items is in the low thousands, say, that would kind of make sense. Take the load off the server.
Ah, thanks. I never considered JS - interesting idea. Looking at their code in the example I gave, the counts appear to be prepopulated in the original wget source so in this cause I guess it's server side.
I'm just trying to figure out how this can be done fast with 200k products, each with 50 odd attributes in a joined table with a regularly updated product record set.
I guess the third option would be to perform one mysqlquery, get all attributes for every current product then increment variable variables in the while loop and put the load onto PHP rather than MySQL.
I know there are various ways to achieve this, but if your target was speed without scaling out the hardware to read slaves during peaks how would you guys do it in the example I gave please?
Do all the counts ahead of time and just remember them? In a separate "counts" table?
You would presumably only need to update them when a product or product variant was added, no?
Sorry I didn't reply sooner, and for bumping a now old-ish thread, and thanks again. I have been thinking about your response and testing various things. I now think you are right in that the counts need to be precalculated and stored in a "counts" table, and then just re-calculated when an update occurs to a product involved in the counts.
The bit I'm having trouble getting my head around is how to calculate all of the counts efficiently, even on a nightly cron. Because the number of counts required grows exponentially with both the number of products and the number of count options. I believe there are millions of different counts in my example (Zappos) as they are stackable (i.e. you can select multiple options from each field).
Right now, I am caching on the fly with 3 hour cache after a cache miss. This causes data to be out of date sometimes, but when you get a cache hit its super quick - just not so quick on the cache misses. So I have inconsistent page load times. The DB servers could also be vulnerable to (D)DOS if someone threw rarely-used random parameters at it.
I need to take another step back to think about it. I wish I could find some more info on it... but I'd like to update this thread when I make progress.
|All times are GMT +1. The time now is 03:39 PM.|
Powered by vBulletin®
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.