Learned lessons from the largest players (Flickr, YouTube, Google, etc)

I would like to write today about some learned lessons from the biggest player in the high Scalable Web application. I will divide the lessons into 4 points:

  • Start slow, and small, and measuring the right thing.
  • Vertical Scalability vs. Horizontal Scalability.
  • Every problem has its own solution.
  • General learned lesson

Start slow, and small, and measuring the right thing:

If you working in new website; don’t buy too much equipment just because you want your website to be faster. Start slow, and small, and using the right statistics you will know the right direction. Measuring the right thing is fundamental and important. Selecting a benchmark and comparing the results seems, initially at least, to be a simple problem, but a host of mistakes are made during this process. For example you shouldn’t ask ‘How many registered user’ instead you should ask:

  • How much time it’s take to answer the request?
  • How many concurrent users do you have in your Application?
  • What is the most used feature?

This question will give you good points about the future direction, so I don’t suggest to buy too much equipment from the start. The best way to collect this data is use a logging system that usually don’t consume much CPU cycle.

Vertical Scalability vs. Horizontal Scalability:

If you need scalability, urgently, going to vertical scaling is probably will to be the easiest, but be sure that Vertical scaling, gets more and more expensive as you grow, and While infinite horizontal linear scalability is difficult to achieve, infinite vertical scalability is impossible.
On the other hand Horizontal scalability doesn’t require you to buy more and more expensive hardware. It’s meant to be scaled using commodity storage and server solutions. But Horizontal scalability isn’t cheap either. The application has to be built ground up to run on multiple servers as a single application.

Every problem has its own solution:

This don’t mean that I don’t believe in the multiple solutions, but actually this mean that always there are better solution, and to found a good solution first you should understand the semantic segment of the problem well. Some examples:

Latency Challenges:

Latency is the time it takes packets to flow from one part of the world to another. Everyone knows it exists. The second fallacy of distributed computing is “Latency is zero”. Yet so many designs attempt to work around latency instead of embracing it. This is unfortunate and in fact doesn’t work for large-scale systems.  So how the biggest player work to face this challenge:

  1. Decreases the response size (HTML, CSS, JS).
  2. Geo-distributed clusters.
  3. Move to Asynchronous Architecture.

YouTube (Thumbnails problem):

Surprisingly difficult to do efficiently. Because there are about 4 thumbnails for each video so there are a lot more thumbnails than videos. Thumbnails are hosted on just a few machines. Saw problems associated with serving a lot of small objects:

  • Lots of disk seeks and problems with inode caches and page caches at OS level.
  • Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea.
  • A high number of requests/sec as web pages can display 60 thumbnails on page.
  • Under such high loads Apache performed badly.
  • Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20.
  • Tried using lighttpd but with a single threaded it stalled. Run into problems with multiprocesses mode because they would each keep a separate cache.
  • With so many images setting up a new machine took over 24 hours.
  • Rebooting machine took 6-10 hours for cache to warm up to not go to disk.

To solve all their problems they started using Google’s BigTable, a distributed data store:

  • Avoids small file problem because it clumps files together.
  • Fast, fault tolerant. Assumes its working on a unreliable network.
  • Lower latency because it uses a distributed multilevel cache. This cache works across different collocation sites.

Storage and caching:

How we will store the data and what kind of caching we need. Always there are trade-off between the data structure and the algorithm; this mean if you used a fancy data structure it will affect the algorithm and how much time we need to retrieve it and save it. So choice how we will design our storage system (DB, Disk-based Data Structure, etc) is important because it will affect the performance. Also using the right caching model for your situation is important. For more information about caching go there. Important notes about DBs, and your storage system:

  • Keep the data design simple, so you will be able to change it, or redesign it if you need in the future. Simple not mean simpler do not ignore feature just think how I can store/save it better.
  • Much better cache locality which means less IO.
  • Went to database partitioning.
  • Avoid the distributed transaction.
  • Try to spreads writes and reads.

General learned lesson:

  • Creativity must flow from everywhere.
  • Innovation can only come from the bottom. Those closest to the problem are in the best position to solve it. Any organization that depends on innovation must embrace chaos. Loyalty and obedience are not your tools.
  • Hide updates using Ajax. Updates are slow so big bang updates of many entities will appear slow to users. Instead, use Ajax to update the database in little increments. As a user enters form data update the database so the update cost is amortized over many calls rather than one big call at the end. The result is a good user experience and a more scalable app
  • Be sensitive to the usage patterns for your type of application.
    • Do you have event related growth? For example: disaster, news event.
    • Flickr gets 20-40% more uploads on first work day of the year than any previous peak the previous year.
    • 40-50% more uploads on Sundays than the rest of the week, on average
  • Be sensitive to the demands of exponential growth. More users means more content, more content means more connections, more connections mean more usage.
  • Plan for peaks. Be able to handle peak loads up and down the stack.
  • Go stateless. Statelessness makes for a simpler more robust system that can handle upgrades without flinching.
  • Prioritize. Know what’s essential to your service and prioritize your resources and efforts around those priorities.
  • Pick your battles. Don’t be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system.
  • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It’s not just about getting more writes performance.

Some points in the general learned lesson, chould conflict with each other, but I meant that because every problem has it’s own solution.


Haytham El-fadeel is a researcher in Computer Science, Software Engineer and has interest in every topic related to Research, Innovation, and Software Development. In his blog – you will found blogs, and articles about :

  • Researches, Innovation, and Ideas.
  • Algorithms.
  • Code Optimization.
  • Software Development, and more.


An Africa based entrepreneur in the pursuit of opportunities without regard to resources currently controlled striving to build services that have real-world value for my beloved continent and beyond while having fun along the way.

Share this:

Comments are closed, but trackbacks and pingbacks are open.