System Design - pastebin.com

Introduction 

On a pastebin site, users can paste, write, or store text for a specific period of time, and that content can be accessed and shared via a unique URL. The idea behind the system is to let people share large amounts of text online simply and conveniently.

Functional requirements

Users should be able to paste their content, and each paste should be accessible via a unique URL.

Registered users can edit or delete their pastes.

A paste is removed from the system, and its URL expires, after 1 year.

Non-functional requirements

The system should be highly available and reliable for both creating and accessing pastes.

Constraints

A paste can contain at most 2 MB of content.

A paste expires after at most 1 year.

Estimations

Reads will outnumber writes. Let's assume that every paste written is read by 10 users, which makes this a read-heavy system.

Traffic: Let's assume we receive 1 million write requests per day. That implies 10 million read requests per day (write = x, read = 10x), or roughly 12 writes and 116 reads per second on average. The system has to scale to handle this traffic.

Storage: Let's assume we store each paste for 1 year. At 30 million write requests per month, we store 360 million pastes per year (30 million * 12 months). In the worst case, where every paste is at the 2 MB limit, we need about 900 TB of storage for 1 year (360 million objects * 2,500 KB, i.e. 2 MB of text + 500 KB for user data and other telemetries).

APIs offered by this system

CreatePaste(userId, text, expirationInDays): Key – This API creates the paste in the database. If the user is logged in, userId is available; otherwise it is null, since anonymous pastes are also allowed. expirationInDays takes one of these values: {1, 2, 3, 7, 30, 60, 90, 180, 365}.

ReadPaste(key): text – This API takes the key as a parameter and returns the paste text. If the key does not exist, the API returns a 404, which the caller of the API must handle.

Database

There is a relationship between a user and the pastes they create, so we can use any RDBMS here.

System design

TrafficManager receives all user requests. Based on the location a request comes from, it forwards the request to the API instance hosted in the nearest region. We will host an instance of the API in every region (one per continent) to keep latency minimal.
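As a rough illustration of location-based routing, the sketch below picks the closest region by great-circle distance. The region names and coordinates are made up for the example, and a real traffic manager would more likely route by DNS geolocation or measured latency than by raw distance.

```python
import math

# Hypothetical regions with (latitude, longitude) of the hosting location.
REGIONS = {
    "north-america": (39.0, -77.5),
    "europe": (50.1, 8.7),
    "asia": (1.35, 103.8),
}

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_region(client_location):
    """Return the name of the region closest to the client's (lat, lon)."""
    return min(REGIONS, key=lambda name: haversine_km(client_location, REGIONS[name]))
```

For example, `nearest_region((48.85, 2.35))` (a client in Paris) selects the `europe` entry.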

Initially, we generate keys offline and store them in the UnUsedKeys table. Whenever PasteAPI receives a request to create a new paste, it forwards the request to KeyGenerationService. This service picks the first key from the UnUsedKeys table and returns it to PasteAPI. A background thread in KeyGenerationService then removes that key from the UnUsedKeys table.
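The create and read flow can be sketched in memory as follows. This is only an illustration under some assumptions: a deque stands in for the UnUsedKeys table, a dict stands in for UserContent, the key is removed at the moment it is handed out rather than by a background thread, and the method names are illustrative Python spellings of CreatePaste and ReadPaste.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

ALLOWED_EXPIRATION_DAYS = {1, 2, 3, 7, 30, 60, 90, 180, 365}
MAX_PASTE_BYTES = 2 * 1024 * 1024  # 2 MB limit per paste

class KeyGenerationService:
    """Hands out pre-generated keys; the deque stands in for the UnUsedKeys table."""

    def __init__(self, unused_keys):
        self._unused = deque(unused_keys)

    def next_key(self):
        if not self._unused:
            raise RuntimeError("UnUsedKeys is empty; more keys must be generated offline")
        # popleft() picks the first key and removes it, so it is never handed out twice.
        return self._unused.popleft()

class PasteAPI:
    def __init__(self, key_service):
        self._keys = key_service
        self._user_content = {}  # stands in for the UserContent table

    def create_paste(self, user_id, text, expiration_in_days):
        # Validate before taking a key so that invalid requests do not consume keys.
        if expiration_in_days not in ALLOWED_EXPIRATION_DAYS:
            raise ValueError("unsupported expiration")
        if len(text.encode("utf-8")) > MAX_PASTE_BYTES:
            raise ValueError("paste exceeds 2 MB")
        key = self._keys.next_key()
        expires_at = datetime.now(timezone.utc) + timedelta(days=expiration_in_days)
        # user_id is None for anonymous pastes.
        self._user_content[key] = {"user_id": user_id, "text": text, "expires_at": expires_at}
        return key

    def read_paste(self, key):
        row = self._user_content.get(key)
        if row is None or row["expires_at"] <= datetime.now(timezone.utc):
            raise KeyError(key)  # an HTTP layer would translate this into a 404
        return row["text"]
```

A caller would create a paste with `api.create_paste(None, "hello", 7)` and pass the returned key to `api.read_paste`.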

The KeyCleanUp service is responsible for recycling keys. We can run it at the end of every day. It moves the keys of all expired pastes from the UserContent table back to the UnUsedKeys table.
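A minimal sketch of such a clean-up job, using SQLite from the Python standard library as a stand-in for the RDBMS. The column names (`paste_key`, `content`, `expires_at`) and the ISO-8601 text timestamps are assumptions made for the example.

```python
import sqlite3

def make_demo_db():
    """Create an in-memory database with the two tables used in this design."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE UnUsedKeys (paste_key TEXT PRIMARY KEY);
        CREATE TABLE UserContent (
            paste_key  TEXT PRIMARY KEY,
            user_id    INTEGER,          -- NULL for anonymous pastes
            content    TEXT NOT NULL,
            expires_at TEXT NOT NULL     -- ISO-8601, so string comparison orders by time
        );
        """
    )
    return conn

def clean_up_expired_keys(conn, now):
    """Move keys of expired pastes back to UnUsedKeys; return how many were moved."""
    with conn:  # both statements commit together or roll back together
        conn.execute(
            "INSERT OR IGNORE INTO UnUsedKeys (paste_key) "
            "SELECT paste_key FROM UserContent WHERE expires_at <= ?",
            (now,),
        )
        cur = conn.execute("DELETE FROM UserContent WHERE expires_at <= ?", (now,))
        return cur.rowcount
```

Running the insert and the delete in one transaction means a key is never lost or left in both tables if the job fails halfway.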

Both the UnUsedKeys and UserContent tables should be highly available. To achieve this, we can use database replication (master-slave), so that whenever one node goes down, another node can serve the requests. In addition, one node can accept the requests for creating pastes while another node serves the read requests.
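The read/write split can be pictured with a toy model like the one below. It is a simplification under stated assumptions: `Node` and `ReplicatedStore` are hypothetical names, replication is done synchronously inside `write`, and there is no replica promotion or catch-up, all of which a real database handles itself.

```python
class Node:
    """A database node; the dict stands in for its copy of the table."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.healthy = True

class ReplicatedStore:
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)

    def write(self, key, value):
        # All writes go to the master node.
        if not self.master.healthy:
            raise RuntimeError("master is down; a real setup would promote a replica")
        self.master.rows[key] = value
        for replica in self.replicas:
            if replica.healthy:
                # Synchronous copy; a replica that is down misses this write.
                replica.rows[key] = value

    def read(self, key):
        # Prefer replicas for reads and fall back to the master if none is healthy.
        for node in self.replicas + [self.master]:
            if node.healthy:
                return node.rows.get(key)
        raise RuntimeError("no healthy node available")
```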

Caching

To reduce latency, we can use caching in a couple of places.

1) Whenever KeyGenerationService receives a request for a new key, it can return the key from the cache instead of reading it from the database.

2) Another place to use a cache is when reading a paste. Whenever a new paste is created, it is likely to be retrieved by multiple users (10x), so we can keep such in-demand pastes in the cache. We can use LRU (least recently used) as the cache eviction policy.
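To make the second point concrete, here is a small LRU cache built on `collections.OrderedDict`, plus a read-through helper. The names `LRUCache`, `read_paste_cached`, and `load_from_db` are illustrative; in practice a shared cache such as Redis or Memcached would likely play this role.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

def read_paste_cached(key, cache, load_from_db):
    """Return the paste text from the cache, loading it from the database on a miss."""
    text = cache.get(key)
    if text is None:
        text = load_from_db(key)
        cache.put(key, text)
    return text
```

With a read-to-write ratio of 10:1, most reads of a freshly created paste would be served from the cache after the first database lookup.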

Offline key generation service

Earlier, we decided to generate keys offline and store them in the UnUsedKeys table. This raises a few questions: why generate keys offline rather than when they are needed, how many keys to insert initially, and how long the keys should be.

We need a unique key for every write request. The problem with generating keys on demand is that, after generating a key, we need to check whether it is already used by another paste. If it is, we need to regenerate the key, and this repeats until we get a unique one. Generating keys offline avoids this check on the write path.
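For comparison, on-demand generation would look roughly like the loop below. It is a sketch of the approach being avoided: the `used_keys` set stands in for a database lookup per attempt, and `-` and `_` are assumed as the two special characters.

```python
import secrets
import string

# 62 letters and digits plus two special characters (assumed here to be "-" and "_").
ALPHABET = string.ascii_letters + string.digits + "-_"

def generate_key_on_demand(used_keys, length=6):
    """Generate random keys until an unused one is found; return (key, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # In a real system this membership test is a database round trip.
        if candidate not in used_keys:
            used_keys.add(candidate)
            return candidate, attempts
```

The number of attempts grows as the key space fills up, and every attempt adds latency to the write request.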

The paste URL looks like this: sitename.com/{key}. The key can contain uppercase letters, lowercase letters, and digits, 62 characters in total. If we add any two special characters, we get 64 characters. With this base64 alphabet, an 8-character key gives 64^8 = approx. 281 trillion strings, and a 6-character key gives 64^6 = approx. 68 billion strings.

In our estimates, we assumed 1 million write requests per day, so we need 365 million keys for 1 year. 6-character keys are therefore sufficient.
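The arithmetic above, together with one possible way to pre-generate a batch of keys, can be checked with a few lines of Python. The choice of `-` and `_` as the two special characters and the function names are assumptions for the example.

```python
import secrets
import string

# 26 + 26 + 10 = 62 characters, plus two assumed special characters = 64.
ALPHABET = string.ascii_letters + string.digits + "-_"

def key_space(length):
    """Number of distinct keys of the given length."""
    return len(ALPHABET) ** length

def generate_unused_keys(count, length=6):
    """Pre-generate `count` distinct random keys for the UnUsedKeys table."""
    keys = set()
    while len(keys) < count:  # the set silently drops duplicates
        keys.add("".join(secrets.choice(ALPHABET) for _ in range(length)))
    return keys

KEYS_NEEDED_PER_YEAR = 1_000_000 * 365      # 365,000,000

print(key_space(6))                          # 68,719,476,736
print(key_space(8))                          # 281,474,976,710,656
print(KEYS_NEEDED_PER_YEAR / key_space(6))   # about 0.0053, i.e. roughly 0.5% of the 6-character space
```

Since one year of pastes uses only about half a percent of the 6-character key space, and expired keys are recycled, there is ample headroom.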

Telemetries

It is always good to store telemetry such as visitor country, date and time of access, UI widget clicks, change events, etc.

Happy designing!