Flash sale (seckill) system design

High concurrency:

Yes, high concurrency is the one point we don't even have to think about. So many people arriving at the same moment: if that isn't high concurrency, what is?

That is the defining trait of a flash sale: a very short, sudden rush of users.

Typically the store sets a very low price and promotes it through SMS and targeted app push notifications to draw a huge number of users into the sale. Great for the business, painful for the developers.

If the marketing lands and the price is attractive, hundreds of thousands of requests are not a stretch at all. I think a single Redis instance can handle around 30,000 to 40,000 QPS, but not much more than that, and a flash sale on one hot item can easily exceed it.

When that many requests arrive there is a lot to consider. Cache avalanche, cache breakdown and cache penetration, which I covered before, can all happen. If something goes wrong and the traffic takes down the DB, recovery is hard: the event fails, the user experience is poor, the buzz is gone, and in the end the developers take the blame.

Every flash sale also fears overselling. My example here was just diapers. Swap them for 100 MacBook Pros: the business budgeted for selling 100, which earns a little and builds some buzz. Then a bug in your code sells more than 200. If you don't ship, users complain and the platform shuts down your store; if you do ship, you lose a fortune. What do you do? (Read Ao Bing's articles and you have nothing to fear.)

In the end a developer gets sacrificed to calm everyone down. Flash sale prices are already low and margins are thin, so overselling is frightening. That makes it another critical point.

Malicious requests:

With a price this low, if I get the item I can resell it at a profit, right? Even if I can't sell it, I won't lose money. The users know it, you know it, and people with ulterior motives (hackers, scalpers...) certainly know it too.

That makes it easy. I know when your sale starts, so I set up a few dozen machines with scripts and simulate requests from about 100,000 people. Doesn't that give me something like an 80% success rate?

The real situation may be far worse, because machines send requests much faster than human hands. I'm from Guizhou, and every year the high-speed rail tickets home sell out in seconds. I don't know how much of that is down to scalpers, but scalpers, I'm calling you out. I can't get Jay Chou concert tickets either, so I'm calling you out for that too.

Tip: a bit of trivia I picked up. Scalpers' ticket-grabbing systems are far better than the systems of many small companies in China. Their architecture is top-notch, and they run it on top-tier services. Against that, do you still expect to see the concert? Do you still expect to get home?

But without scalpers it's hard for me to get home. There are so many of us from Yunnan, Guizhou and Sichuan trying to go home for the New Year. 555!

Link Exposure:

The first few problems are probably easy to understand, but this one may confuse some readers. What is link exposure?

Any developer will recognize this. Anyone with a little know-how can open the browser's developer tools (in Google Chrome, for example) and read your page source, and some pages have the URL right there. When I write Vue, the click event calls an interface defined in another file, so the address isn't visible in the page source, but I can still click the button and see the address of the request. You can, however, keep the button grayed out until the sale starts.

Either way there is a risk. Suppose you block everyone outside. You are selling this item at an absurdly low price, which is tempting. Can you guarantee the developers won't be tempted? They know the address and can send their own request ahead of the sale. (Developer: how is it me again?)

Database:

Tens of thousands or even hundreds of thousands of QPS (queries per second) hitting the database directly will basically take it down. Your service doesn't only run the flash sale; it also carries other business. If you have no degradation, rate limiting or circuit breaking in place, everything else goes down with it, and at a small company the whole site may crash.

However the flash sale fails, it must not take other services down with it. That kind of damage can't be fixed by sacrificing a programmer.

Programmer: this job is so damn hard!

The problems are listed. The next step is how to design the system to solve them, with the right remedy for each.

I'll walk through the flash sale system I designed from top to bottom, giving an overview of what a typical e-commerce flash sale system does at each layer and the problems and difficulties at each.

Let's start with the front end:

A flash sale is usually delivered through a store web page, H5, an app or a mini program.

A lot can actually be done at the front-end layer. If you use Node, you could even handle the whole flash sale there, but Node really belongs to the back end, so I won't discuss Node services.

Static resources:

A flash sale usually has fixed products and page templates. Front end and back end are generally separated these days, so pages usually don't go through the back end, but the front end still has its own servers. Those pages can be pushed to CDN servers ahead of time. Every step that improves efficiency is worth taking, to reduce the load on your servers when the sale actually starts.

Add salt to the flash sale link:

As mentioned above, if the link is exposed early, someone can hit the URL directly before the sale starts. Some readers will say: just add a time check. Even so, someone who knows the URL still has a big advantage over someone clicking manually on the page.

If I know the URL, my program can keep fetching the current Beijing time with millisecond precision and fire the request at exactly 00 milliseconds. My success rate will be far higher than yours clicking by hand, and I can send N requests per millisecond. You put 100 items on sale and I might take them all.

So how can this be avoided?

Simple: make the URL dynamic, so that even the people who wrote the code don't know it. Run a random string through a digest algorithm such as MD5 to form part of the URL. The front end fetches the URL, and the request goes through only if the back end's check passes.
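Here is a minimal sketch of that idea in Python. Everything in it is an assumption for illustration: the function names `issue_sale_path` and `verify_sale_path`, the in-memory `_salts` dictionary (a real system would more likely keep the salt in a shared cache with an expiry), and the choice to bind the digest to a user ID and item ID.

```python
import hashlib
import hmac
import secrets

# Hypothetical in-memory salt store, keyed by (user_id, item_id).
_salts = {}


def _digest(salt: str, user_id: str, item_id: str) -> str:
    # MD5 is used here only because the text mentions it as the digest.
    return hashlib.md5(f"{salt}:{user_id}:{item_id}".encode()).hexdigest()


def issue_sale_path(user_id: str, item_id: str) -> str:
    """Generate a random salt and return the digest to use as the URL path segment."""
    salt = secrets.token_hex(16)
    _salts[(user_id, item_id)] = salt
    return _digest(salt, user_id, item_id)


def verify_sale_path(user_id: str, item_id: str, path: str) -> bool:
    """Back-end check: recompute the digest from the stored salt and compare."""
    salt = _salts.get((user_id, item_id))
    if salt is None:
        return False
    expected = _digest(salt, user_id, item_id)
    # Constant-time comparison on bytes.
    return hmac.compare_digest(expected.encode(), path.encode())
```

Because the salt is random and only generated when the path is issued, nobody (including the developers) can know the final URL in advance.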

This only stops the attackers who lack the patience to keep digging; the patient ones can still crack it. There are plenty of such "wool party" bargain abusers in e-commerce, so what do you do about them?

I'll come back to that later.

Rate limiting, in my view, splits into front-end limiting and back-end limiting.

Physical control:

Have you noticed that the button is usually grayed out until the sale starts and only becomes clickable when the time arrives?

That is to stop people from hammering the server in the last few seconds before the start, which could take the server down before the sale even begins.

This needs cooperation from the front end: it periodically asks your back-end server for the current Beijing time and enables the button only when the start time arrives.

After a click, the button should be grayed out again for a few seconds; otherwise users will simply keep clicking once the sale starts.

Admit it, isn't that exactly what you did in your last flash sale?

Front-end rate limiting: this is simple. A flash sale generally won't let you click continuously; after one or two clicks you wait a few seconds before you can click again. This also protects the server.

Back-end rate limiting: a flash sale leads to order creation, payment and other follow-up operations, but only the lucky successful buyers get that far. Once the 100 items are sold out, the back end returns false, the front end ends the sale immediately, and the back end also stops processing the invalid requests that follow.

Tip: real rate limiting also adds components such as Alibaba's Sentinel or Hystrix. I won't expand on those here and will only talk about physical rate limiting.

Say we sell 1,000 items and 100,000 requests arrive. There is no need to let all 100,000 in; you can admit 10,000 and work with those. The flash sale is a black box to users, so they can't see how you do it. Why admit 10,000 rather than just 1,000? Because some of those requests will be filtered out as wool-party users. How to identify them is covered in the risk control stage below.
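The admission step can be sketched roughly like this. `AdmissionGate` and its method names are made up for the example, and a single-process lock stands in for whatever shared counter a real deployment would use.

```python
import threading


class AdmissionGate:
    """Admit at most `admit_limit` requests; reject everything once sold out."""

    def __init__(self, admit_limit: int):
        self._remaining = admit_limit
        self._sold_out = False
        self._lock = threading.Lock()

    def try_enter(self) -> bool:
        # Check and decrement under one lock so concurrent callers cannot over-admit.
        with self._lock:
            if self._sold_out or self._remaining <= 0:
                return False
            self._remaining -= 1
            return True

    def mark_sold_out(self) -> None:
        # Called once the stock is gone; later requests are rejected immediately.
        with self._lock:
            self._sold_out = True
```

With `AdmissionGate(10_000)`, the first 10,000 calls to `try_enter()` succeed and the other 90,000 are turned away before they touch anything expensive.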

Nginx:

Nginx needs no introduction. It is a high-performance web server that handles tens of thousands of concurrent connections with ease, whereas one of our Tomcat instances can only handle a few hundred. The answer is load balancing: if each service instance handles a few hundred, add more instances, and rent some pay-by-traffic machines for the duration of the sale.

Tip: as far as I know, one major Chinese company rented practically every available server in Asia for last year's Chinese New Year event, and small companies also like to buy pay-by-traffic machines around Double 11 to absorb the load.

By comparison, doesn't your cluster suddenly feel like it could handle a lot more?

Nginx is also where malicious requests get intercepted. When a single user's request count is too high to be human, it has to be blocked at the gateway layer. Otherwise, whether that user wins anything is one matter, but the extra requests raise server load and may saturate network bandwidth, bring servers down, cause cache breakdown and so on.
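In practice this kind of check lives in the gateway itself. To show the logic only, here is a sliding-window limiter per user in Python. The class name, the thresholds and the injectable `clock` parameter are all assumptions for the sketch.

```python
import time
from collections import defaultdict, deque


class PerUserLimiter:
    """Allow at most `max_requests` per user within any `window_seconds` window."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # user_id -> timestamps of recent requests

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False  # more requests than a human would plausibly send
        hits.append(now)
        return True
```

A request that fails `allow()` is dropped at the edge, so it never costs the back end anything.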

I can tell you plainly that all the measures so far still can't stop much of the wool party. They are professional teams. They register large numbers of accounts to fleece you, and instead of scripted requests they use group control (one operator driving many devices), so their behavior looks almost exactly like a real user's.

What then? Is there no solution?

This is where the risk control team comes in. Before a request reaches the back end, risk control analyzes the account's behavior and estimates the probability that it is a bot. At my company I'm currently responsible for one such system: each user's behavior is sent to our big data team, which analyzes it and attaches the corresponding labels.

The attackers have an answer to that too: account farming.

They go to the black market and buy accounts that real users have built up a long history on. Between sales they don't leave the accounts idle; they keep them shopping and browsing, so the system can't tell whether it's a black-market account or a real user's.

What to do?

Blanket rejection! There is no other way. It means that if risk control estimates this user's probability of being real to be lower than other users', we treat the user as a machine and discard the request.

Earlier, rate limiting admitted 10,000 requests while the real inventory is only 1,000. We pick the 1,000 users most likely to be real for the sale and discard the rest. Because the flash sale is a black box that users can't see into, this design lets real users buy while lowering our own chances of being fleeced.
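Picking the most likely real users out of the admitted requests might look like the sketch below. It assumes risk control has already attached a score to each request; the `(user_id, score)` tuple shape and the function name `pick_winners` are illustrative only.

```python
import heapq


def pick_winners(scored_requests, stock: int):
    """Return the user IDs with the highest real-user scores, at most `stock` of them.

    `scored_requests` is an iterable of (user_id, real_user_score) pairs,
    where a higher score means "more likely a real user".
    """
    top = heapq.nlargest(stock, scored_requests, key=lambda r: r[1])
    return [user_id for user_id, _ in top]
```

For 10,000 admitted requests and a stock of 1,000, `pick_winners(requests, 1000)` keeps the 1,000 best-scoring users and everything else is discarded.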

Risk control is the last gate that traffic passes through, which is why many companies invest heavily in it. If you know Ant Financial's risk control, you'll understand why they can afford to compensate you in full when funds are stolen from your Alipay account.

Single-responsibility service:

I think designing a system that withstands high concurrency still comes down to single responsibility.

What does that mean? Systems today are designed as microservices and deployed in a distributed fashion.

Orders have an order service, login and user management have a user service, and so on. So why not give the flash sale its own service and keep all of its business logic together?

The advantage of single responsibility is that even if the flash sale can't hold, its database collapses and its service goes down, other services are not affected. (High availability)

Redis Cluster:

A single Redis instance can't cope, as noted earlier, so bring in a few more. A flash sale is a read-heavy, write-light workload, which should immediately bring to mind what I've covered before: Redis clusters, master-slave replication and read-write separation. Add Sentinel and turn on persistence, and you have a highly available setup.

Inventory preheating:

At its core a flash sale is a race for inventory. If every user's request queries the database for stock, checks it, and then deducts it, then performance aside, isn't that cumbersome? It is unfriendly to the business developers, and the database can't take the load.

Dev: for once you're actually looking out for me.

What then?

The database can't take it, but its sibling Redis, a non-relational database, can!

So before the sale starts, a scheduled task or the ops team loads the product inventory into Redis. The whole sale then runs in Redis, and once it is over the database inventory is updated asynchronously.

But using Redis raises a problem. We said above that we use master-slave replication: we read the inventory, check it, and deduct only if stock remains. Normally that's fine, but under high concurrency it is a big problem.

**Read this one a few times!** Say only 1 item is left. Under high concurrency, 4 servers query at the same moment and all see 1. Each believes it has won and deducts the inventory, so the result is -3. Only one of them really got the item; the others are oversold. What should we do?

Redis supports transactions and has many atomic commands. You can also use Lua scripts or pipelines, and it supports optimistic locking as well. A Lua script runs atomically, so the stock check and the deduction can happen in a single step that no other request can interleave with.
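As one way to use Lua here, the sketch below does the check and the deduction inside a single script. The key name, `DEDUCT_LUA` and `try_deduct` are names I made up; `client` is assumed to be a redis-py style client whose `eval(script, numkeys, *keys_and_args)` method runs the script on the server.

```python
# Lua script: read the stock, and decrement only if it is still positive.
# Redis runs the whole script atomically, so no other command can slip in
# between the GET and the DECR.
DEDUCT_LUA = """
local stock = tonumber(redis.call('GET', KEYS[1]))
if stock and stock > 0 then
    redis.call('DECR', KEYS[1])
    return 1
end
return 0
"""


def try_deduct(client, stock_key: str) -> bool:
    """Return True if one unit was deducted, False if the item is sold out.

    `client` is assumed to expose eval(script, numkeys, *keys_and_args),
    as a redis-py client does.
    """
    return client.eval(DEDUCT_LUA, 1, stock_key) == 1
```

With only 1 item left and 4 servers calling `try_deduct` at once, exactly one call returns True and the stock stops at 0 instead of dropping to -3.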

Rate limiting & degradation & circuit breaking & isolation:

Why bother with these? Better safe than sorry, in case you really can't hold. Rate limiting: if you can't handle it all, turn part of the traffic away rather than refusing everything. Degradation: if limiting isn't enough, fall back to reduced service. Circuit breaking: if even the degraded service gets knocked over, trip the breaker so that at least other systems aren't affected. Isolation: your service is independent but still calls other systems, so when you are about to go down, don't drag your neighbors down with you.
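To make circuit breaking and degradation concrete, here is a very small breaker in Python. It is a simplified sketch, not how Sentinel or Hystrix implement it; the class, the thresholds and the `fallback` callable (the degraded response) are all assumptions.

```python
import time


class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 5.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()      # open: fail fast with the degraded response
            self._opened_at = None     # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()          # degrade instead of propagating the error
        self._failures = 0             # success closes the breaker again
        return result
```

While the breaker is open, the failing downstream service is not called at all, which gives it room to recover and keeps the caller's threads free.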

Message queue (peak shaving):

Most readers will know the term: MQ. With few buyers, 100 requests updating the database directly is no problem. But what about 10,000 or 100,000 in a flash sale? The server goes down, and the programmer takes the blame again.

A flash sale has extremely high instantaneous traffic and almost none the rest of the time. A message queue suits that scenario perfectly: it shaves the peak and fills the valley.

Tip: some readers may say their business will never reach this volume, so there's no need. But we shouldn't write code with logical holes in it. At the very least, when the company's volume grows, whoever reads the code finds it doesn't need changing, sees that the author is Ao Bing, and thinks: not bad!

You can put the requests on a message queue and consume them gradually to update the inventory. For a single product, one update is actually enough; what I mean is the scenario where many products go on sale at the same moment, such as midnight on Double Eleven.
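Peak shaving can be illustrated with an in-process queue standing in for a real message broker. `PeakShaver`, its capacity and the `apply_to_db` callback are hypothetical names for this sketch; a production system would use an actual MQ product.

```python
import queue


class PeakShaver:
    """Buffer a burst of orders and let a consumer drain them at its own pace."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)

    def accept(self, order) -> bool:
        """Producer side: enqueue without blocking; reject when the buffer is full."""
        try:
            self._queue.put_nowait(order)
            return True
        except queue.Full:
            return False

    def drain(self, apply_to_db, batch_size: int = 100) -> int:
        """Consumer side: apply up to `batch_size` orders and return how many were handled."""
        handled = 0
        while handled < batch_size:
            try:
                order = self._queue.get_nowait()
            except queue.Empty:
                break
            apply_to_db(order)
            handled += 1
        return handled
```

The burst hits `accept()` all at once, while the database only ever sees the steady rate at which `drain()` is called.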

For the database, MySQL is generally fine as long as the connection pool is sized sensibly. But large companies aren't short of money, and they run flash sales very frequently; at a company I worked at, sales like this ran non-stop.

So create a separate database that serves only the flash sale, and keep the table design as simple as possible. Internet architectures today are deployed with split databases anyway.

How you split tables depends on your design. Put indexes where indexes belong, and after creating them remember to use EXPLAIN to check the SQL execution plan. (If that isn't clear, go take a look at the MySQL chapter.)

Distributed transactions

Why did I leave this to the very end instead of covering it with the back end?

Because any of the steps above can fail, and the failures happen in different services, which brings in distributed transactions. But do we need every distributed transaction to succeed no matter what? Not here. Losing a few requests is acceptable as long as timeliness and service availability are guaranteed.

So TCC and eventual consistency are not really a good fit. TCC is expensive to develop: every interface has to be written three times, once for each of TCC's three phases.

Eventual consistency basically relies on polling and retries to make sure an operation eventually succeeds, and that costs a great deal of timeliness.

This is where **two-phase commit (2PC) and three-phase commit (3PC)**, which people consider less reliable, come in handy. They don't guarantee the data ends up consistent, but their efficiency is acceptable.
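For readers who haven't seen it, the shape of two-phase commit is roughly the following. This is a toy, in-memory sketch: the `Participant` class and its states are invented for the example, and it ignores timeouts, coordinator crashes and network failures, which are exactly where 2PC can leave data inconsistent.

```python
class Participant:
    """A stand-in for one service taking part in the transaction."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes or no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(participants) -> bool:
    """Coordinator: commit only if every participant votes yes in phase 1."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            # One "no" vote aborts the transaction; undo everyone already prepared.
            for q in prepared:
                q.rollback()
            return False
    # Phase 2: everyone voted yes, so tell them all to commit.
    for p in participants:
        p.commit()
    return True
```

In a flash sale the participants might be the order service and the inventory service; if either one can't prepare, the request is simply dropped.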


By lzz
