WebSocket: From Beginner to Expert in 5 Minutes

I. Content overview

WebSocket gives browsers real-time, two-way communication with a server. This article works from the basics to the details: how WebSocket establishes a connection, how it exchanges data, and how its data frames are laid out. It also briefly covers attacks against WebSocket and how the protocol defends against them.

II. What is WebSocket

WebSocket is a web technology, introduced with HTML5, for full-duplex communication between the browser and the server. It is an application-layer protocol that runs over TCP and reuses the HTTP handshake channel.

For most web developers that description is rather dry. In practice, just remember a few things:

WebSocket can be used in the browser
Supports two-way communication
It’s easy to use.

1. What are the advantages

The point of comparison here is HTTP. In a nutshell, WebSocket supports two-way communication and is more flexible, more efficient, and more extensible.

Supports bi-directional communication for better real-time performance.
Better binary support.
Less control overhead. Once the connection is established, the frames exchanged between the ws client and server carry only a small control header. Without extensions, a frame from the server to the client has a header of only 2~10 bytes (depending on the payload length); a frame from the client to the server carries an additional 4-byte masking key. HTTP, by contrast, has to carry a full set of headers on every request.
Support for extensions. The ws protocol defines extensions, and users can extend the protocol or implement custom sub-protocols (for example, support for custom compression algorithms).
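
To make the "2~10 bytes" figure concrete, here is a minimal sketch that computes the header size of a single frame when no extension is in use. The function name frameHeaderSize is made up for this illustration and is not part of any library.

```javascript
// Size in bytes of a WebSocket frame header for a given payload length.
// `masked` is true for client-to-server frames, which carry a 4-byte masking key.
function frameHeaderSize(payloadLength, masked) {
  let size = 2; // byte 1: FIN/RSV/opcode, byte 2: MASK bit + 7-bit length
  if (payloadLength > 65535) {
    size += 8; // 64-bit extended payload length
  } else if (payloadLength > 125) {
    size += 2; // 16-bit extended payload length
  }
  if (masked) {
    size += 4; // masking key
  }
  return size;
}

console.log(frameHeaderSize(5, false));      // 2
console.log(frameHeaderSize(1000, false));   // 4
console.log(frameHeaderSize(100000, false)); // 10
console.log(frameHeaderSize(100000, true));  // 14
```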

The last two points may not be intuitive to readers who have not studied the WebSocket specification, but that does not get in the way of learning and using WebSocket.

2. What needs to be learned

When studying an application-layer network protocol, the most important parts are usually connection establishment and data exchange. The data format cannot be skipped either, since it directly determines what the protocol can do. A good data format makes a protocol more efficient and more extensible.

The rest of the article is organized around the following points:

How to establish a connection
How to exchange data
Data frame format
How to maintain the connection

III. Introductory examples

Before getting into the protocol details, let’s look at a simple example to get an intuitive feel. The example consists of a WebSocket server and a WebSocket client (a web page). The full code can be found here.

The server side uses the ws library. Compared with the familiar socket.io, ws is lighter and better suited to learning.

1. Server

The code is as follows. The WebSocket server listens on port 8080. When a new connection arrives, it prints a log line and sends a message to the client. When a message arrives from the client, it prints a log line as well. A small express app on port 3000 serves the client page.

var app = require('express')();
var WebSocket = require('ws');

// WebSocket server on port 8080
var wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', function connection(ws) {
    console.log('server: receive connection.');

    ws.on('message', function incoming(message) {
        // ws 8+ delivers a Buffer here, so convert it to a string before logging
        console.log('server: received: %s', message.toString());
    });

    ws.send('world');
});

// Plain HTTP server on port 3000 that serves the client page
app.get('/', function (req, res) {
  res.sendFile(__dirname + '/index.html');
});

app.listen(3000);

2. Client

The code is as follows. The page opens a WebSocket connection to port 8080. Once the connection is open, it prints a log line and sends a message to the server. When a message arrives from the server, it prints a log line as well.

<script>
  var ws = new WebSocket('ws://localhost:8080');
  ws.onopen = function () {
    console.log('ws onopen');
    ws.send('from client: hello');
  };
  ws.onmessage = function (e) {
    console.log('ws onmessage');
    console.log('from server: ' + e.data);
  };
</script>

3. Running results

The server and client logs are shown below without further comment.

Server-side output:

server: receive connection.
server: received: from client: hello

Client output:

ws onopen
ws onmessage
from server: world

IV. How to establish a connection

As mentioned earlier, WebSocket reuses the HTTP handshake channel. Specifically, the client negotiates a protocol upgrade with the WebSocket server through an HTTP request. Once the upgrade is complete, all further data exchange follows the WebSocket protocol.

1. Client: requesting a protocol upgrade

First, the client sends a protocol upgrade request. As you can see, it uses the standard HTTP message format, and only the GET method is supported.

GET / HTTP/1.1
Host: localhost:8080
Origin: http://127.0.0.1:3000
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: w4v7O6xFTi36lq3RNcgctw==

The key request headers mean the following:

Connection: Upgrade : indicates that the client wants to upgrade the protocol.
Upgrade: websocket : indicates that the upgrade target is the websocket protocol.
Sec-WebSocket-Version: 13 : indicates the websocket version. If the server does not support this version, it must return a Sec-WebSocket-Version header listing the versions it does support.
Sec-WebSocket-Key : pairs with the Sec-WebSocket-Accept header in the server response and provides basic protection against, for example, malicious or accidental connections.

Note that the request above omits some non-essential headers. Since this is a standard HTTP request, headers such as Host, Origin, and Cookie are sent as usual. During the handshake, the server can use them for security restrictions, permission checks, and so on.
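
As a rough illustration of the checks a server could run on these headers, here is a minimal sketch. The helper name checkUpgradeRequest and the choice of 405 and 400 for malformed requests are assumptions made for this example; the 426 answer for an unsupported version follows the protocol specification. Header names are assumed to be lower-cased already, as Node's http module does for req.headers.

```javascript
// Hypothetical helper: returns the HTTP status a server could answer with.
function checkUpgradeRequest(method, headers) {
  if (method !== 'GET') return 405; // only GET may start a handshake

  const upgrade = (headers['upgrade'] || '').toLowerCase();
  if (upgrade !== 'websocket') return 400;

  // Connection may be a comma-separated list such as "keep-alive, Upgrade".
  const tokens = (headers['connection'] || '')
    .toLowerCase()
    .split(',')
    .map((t) => t.trim());
  if (!tokens.includes('upgrade')) return 400;

  // The key must be the base64 encoding of a 16-byte value.
  const key = headers['sec-websocket-key'] || '';
  if (Buffer.from(key, 'base64').length !== 16) return 400;

  // Unsupported version: 426 Upgrade Required (plus a Sec-WebSocket-Version header).
  if (headers['sec-websocket-version'] !== '13') return 426;

  return 101; // ready to switch protocols
}

const requestHeaders = {
  host: 'localhost:8080',
  origin: 'http://127.0.0.1:3000',
  connection: 'Upgrade',
  upgrade: 'websocket',
  'sec-websocket-version': '13',
  'sec-websocket-key': 'w4v7O6xFTi36lq3RNcgctw==',
};
console.log(checkUpgradeRequest('GET', requestHeaders)); // 101
```

A real server would also use this step to inspect Origin or Cookie for the permission checks mentioned above.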

2. Server: responding to the protocol upgrade

The server returns the following. Status code 101 indicates a protocol switch. This completes the upgrade, and all subsequent data exchange follows the new protocol.

HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Accept: Oy4NRAQ13jhfONC7bP8dTKb4PTU=

Note: each header line ends with \r\n, and an extra blank line (\r\n) follows the last header. In addition, HTTP status codes can only be used during the handshake phase. After the handshake, only the status codes defined by the WebSocket protocol itself can be used.
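
The \r\n rule is easy to get wrong when writing the response by hand. The following sketch assembles the raw response text; buildHandshakeResponse is a name invented for this example, and the accept value is passed in as a ready-made string.

```javascript
// Assemble the raw 101 response. Every header line ends with \r\n and an
// empty line (\r\n) terminates the header block.
function buildHandshakeResponse(acceptValue) {
  return [
    'HTTP/1.1 101 Switching Protocols',
    'Connection: Upgrade',
    'Upgrade: websocket',
    'Sec-WebSocket-Accept: ' + acceptValue,
    '', // closes the last header line
    '', // produces the terminating blank line
  ].join('\r\n');
}

const response = buildHandshakeResponse('Oy4NRAQ13jhfONC7bP8dTKb4PTU=');
console.log(JSON.stringify(response)); // shows the \r\n sequences explicitly
```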

3. Calculation of Sec-WebSocket-Accept

Sec-WebSocket-Accept is calculated from the Sec-WebSocket-Key in the client request header.

The formula is:

Concatenate Sec-WebSocket-Key with 258EAFA5-E914-47DA-95CA-C5AB0DC85B11.
Compute the SHA-1 digest of the result and encode it as a base64 string.

The pseudo-code is as follows:

toBase64( sha1( Sec-WebSocket-Key + 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 ) )

Verify this against the response shown earlier:

const crypto = require('crypto');
const magic = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';
const secWebSocketKey = 'w4v7O6xFTi36lq3RNcgctw==';

let secWebSocketAccept = crypto.createHash('sha1')
	.update(secWebSocketKey + magic)
	.digest('base64');

console.log(secWebSocketAccept);
// Oy4NRAQ13jhfONC7bP8dTKb4PTU=

V. Data frame format

Data exchange between client and server depends on the data frame format, so before explaining the exchange itself, let’s look at the WebSocket data frame format.

The smallest unit of communication between a WebSocket client and server is the frame. One or more frames make up a complete message.

Sender: cuts the message into one or more frames and sends them to the receiver.
Receiver: receives the frames and reassembles the related frames into a complete message.

The focus of this section is to explain the format of the data frame. Detailed definitions can be found in RFC6455 section 5.2.

1. Overview of data frame formats

The unified format of WebSocket data frames is shown below. Readers familiar with TCP/IP will recognize this kind of diagram.

Read from left to right; the unit is the bit. For example, FIN and RSV1 each occupy 1 bit, and opcode occupies 4 bits.
The fields include flags, opcode, mask, payload length, payload data, and so on (detailed in the next subsection).

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-------+-+-------------+-------------------------------+
 |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
 |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
 |N|V|V|V|       |S|             |   (if payload len==126/127)   |
 | |1|2|3|       |K|             |                               |
 +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
 |     Extended payload length continued, if payload len == 127  |
 + - - - - - - - - - - - - - - - +-------------------------------+
 |                               |Masking-key, if MASK set to 1  |
 +-------------------------------+-------------------------------+
 | Masking-key (continued)       |          Payload Data         |
 +-------------------------------- - - - - - - - - - - - - - - - +
 :                     Payload Data continued ...                :
 + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
 |                     Payload Data continued ...                |
 +---------------------------------------------------------------+
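
As a quick way to read the diagram, the first two bytes of any frame can be decoded with a few bit masks. This is only a sketch; parseFirstTwoBytes is a name made up for this illustration.

```javascript
// Decode the fixed first two bytes of a frame according to the diagram above.
function parseFirstTwoBytes(buf) {
  const b0 = buf[0];
  const b1 = buf[1];
  return {
    fin: (b0 & 0x80) !== 0,    // bit 0
    rsv1: (b0 & 0x40) !== 0,   // bit 1
    rsv2: (b0 & 0x20) !== 0,   // bit 2
    rsv3: (b0 & 0x10) !== 0,   // bit 3
    opcode: b0 & 0x0f,         // bits 4-7
    masked: (b1 & 0x80) !== 0, // bit 8
    payloadLen: b1 & 0x7f,     // bits 9-15
  };
}

// 0x81 0x85: FIN=1, opcode=0x1 (text), MASK=1, payload len=5
console.log(parseFirstTwoBytes(Buffer.from([0x81, 0x85])));
```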

2. Data frame format details

Here is a field-by-field explanation of the diagram above. If anything is ambiguous, refer to the protocol specification, or leave a comment to discuss.

FIN: 1 bit.

If it is 1, this is the last fragment of the message; if it is 0, it is not.

RSV1, RSV2, RSV3: 1 bit each.

Normally all zero. When the client and server negotiate a WebSocket extension, these three flags can be non-zero, with the meaning defined by the extension. If a non-zero value arrives and no negotiated extension defines it, the receiver must fail the connection.

Opcode: 4 bits.

The value of the Opcode determines how the subsequent data payload should be parsed. If the Opcode is not recognized, then the receiver should fail the connection. The optional opcodes are as follows:

%x0: Indicates a continuation frame. When Opcode is 0, the message is being sent in fragments and the current frame is one of those fragments.
%x1: Indicates a text frame.
%x2: Indicates a binary frame.
%x3-7: Reserved for non-control frames to be defined later.
%x8: Indicates that the connection is being closed.
%x9: Indicates a ping.
%xA: Indicates a pong.
%xB-F: Reserved for control frames to be defined later.
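
The opcode table translates directly into a lookup. The sketch below also uses the fact that control opcodes (0x8 and above) have the highest of the four opcode bits set; describeOpcode and OPCODE_NAMES are names chosen for this example only.

```javascript
// Names of the opcodes defined in the list above.
const OPCODE_NAMES = {
  0x0: 'continuation',
  0x1: 'text',
  0x2: 'binary',
  0x8: 'close',
  0x9: 'ping',
  0xA: 'pong',
};

function describeOpcode(opcode) {
  const isControl = (opcode & 0x8) !== 0; // %x8-F are control frames
  const name =
    OPCODE_NAMES[opcode] ||
    (isControl ? 'reserved control' : 'reserved non-control');
  return { name, isControl };
}

console.log(describeOpcode(0x1)); // { name: 'text', isControl: false }
console.log(describeOpcode(0x9)); // { name: 'ping', isControl: true }
console.log(describeOpcode(0x3)); // { name: 'reserved non-control', isControl: false }
```

A receiver that gets one of the reserved values would fail the connection, as described above.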

Mask: 1 bit.

Indicates whether the payload is masked. Data sent from the client to the server must be masked; data sent from the server to the client is not masked.

If the server receives a frame that has not been masked, it must close the connection.

If Mask is 1, the Masking-key field carries a masking key, which is used to unmask the payload. All data frames sent from the client to the server have Mask set to 1.

The masking algorithm and its purpose are explained later.

Payload length: the length of the payload in bytes. It is 7 bits, 7+16 bits, or 7+64 bits.

Let x be the value of the 7-bit Payload length field:

x is 0~125: the length of the data is x bytes.
x is 126: the following 2 bytes represent a 16-bit unsigned integer whose value is the length of the data.
x is 127: the following 8 bytes represent a 64-bit unsigned integer (the most significant bit must be 0) whose value is the length of the data.

In addition, when the payload length occupies more than one byte, it is written in network byte order (big endian, most significant byte first).
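
The three cases can be sketched as a small encoder that produces the length portion of the header (with the MASK bit left at 0). The name encodePayloadLength is an assumption for this example.

```javascript
// Encode a payload length as 1, 3, or 9 bytes, in network byte order.
function encodePayloadLength(length) {
  if (length <= 125) {
    return Buffer.from([length]); // 7-bit length holds the value itself
  }
  if (length <= 0xffff) {
    const buf = Buffer.alloc(3);
    buf[0] = 126;                 // marker: 16-bit length follows
    buf.writeUInt16BE(length, 1);
    return buf;
  }
  const buf = Buffer.alloc(9);
  buf[0] = 127;                   // marker: 64-bit length follows
  buf.writeBigUInt64BE(BigInt(length), 1);
  return buf;
}

console.log(encodePayloadLength(5).toString('hex'));     // 05
console.log(encodePayloadLength(256).toString('hex'));   // 7e0100
console.log(encodePayloadLength(65536).toString('hex')); // 7f0000000000010000
```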

Masking-key: 0 or 4 bytes (32 bits)

Every data frame sent from the client to the server has its payload masked: Mask is 1 and the frame carries a 4-byte Masking-key. If Mask is 0, there is no Masking-key.

Note: the payload length does not include the length of the masking key.

Payload data: (x+y) bytes

Payload data consists of extension data (x bytes) followed by application data (y bytes).

Extension data: 0 bytes if no extension was negotiated. Every extension must declare the length of its extension data or how that length can be calculated, and how an extension is used must be negotiated during the handshake. If extension data is present, the payload length includes it.

Application data: arbitrary application data that follows the extension data (if any) and occupies the rest of the frame. Its length is the payload length minus the length of the extension data.

3. Masking algorithm

The masking key is a 32-bit random value chosen by the client. Masking does not change the length of the payload. The same algorithm is used for both masking and unmasking:

First, some definitions:

original-octet-i: the i-th byte of the original data.
transformed-octet-i: the i-th byte of the transformed data.
j: the result of i mod 4.
masking-key-octet-j: the j-th byte of the masking key.

The algorithm: XOR original-octet-i with masking-key-octet-j to obtain transformed-octet-i.

j = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j
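
The formula maps almost line for line onto code. Here is a minimal sketch (applyMask is a name chosen for this example) using the key 0x37 0xfa 0x21 0x3d and the text "Hello", the same values as the masked-frame example in RFC 6455. Applying the function twice with the same key restores the original bytes.

```javascript
// XOR every payload byte with the masking-key byte at index (i mod 4).
// The same function both masks and unmasks.
function applyMask(data, maskingKey) {
  const out = Buffer.alloc(data.length);
  for (let i = 0; i < data.length; i++) {
    out[i] = data[i] ^ maskingKey[i % 4];
  }
  return out;
}

const key = Buffer.from([0x37, 0xfa, 0x21, 0x3d]);
const masked = applyMask(Buffer.from('Hello'), key);

console.log(masked.toString('hex'));            // 7f9f4d5158
console.log(applyMask(masked, key).toString()); // Hello
```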

VI. Data transfer

Once the WebSocket client and server have established a connection, subsequent operations are based on the passing of data frames.

WebSocket distinguishes operation types by opcode. For example, 0x8 means close the connection, and 0x0 to 0x2 carry data.

1. Data fragmentation

A WebSocket message may be split into multiple data frames. When the receiver gets a data frame, it uses the value of FIN to decide whether it has received the last frame of the message.

FIN=1 means the current frame is the last frame of the message; the receiver now has the complete message and can process it. FIN=0 means the receiver must keep listening for the remaining frames.

In addition, opcode indicates the type of data being exchanged: 0x01 means text and 0x02 means binary. 0x00 is special: it marks a continuation frame, which, as the name suggests, continues a message that started in an earlier frame.

2. Data fragmentation example

An example makes this concrete. The following example from MDN demonstrates fragmentation well. The client sends two messages to the server, and the server responds to each. Here we focus on the messages sent by the client.

First message.

FIN=1 indicates that this is the last data frame of the current message, so the server can process the message as soon as it receives this frame. opcode=0x1 indicates that the client is sending text.

Second message.

FIN=0, opcode=0x1: text is being sent and the message is not yet complete; more data frames follow.
FIN=0, opcode=0x0: the message is still not complete and more data frames follow; this frame's data is appended to the previous frame's data.
FIN=1, opcode=0x0: the message is complete and no more data frames follow; this frame's data is appended to the previous frame's data. The server can now assemble the related frames into a complete message.

Client: FIN=1, opcode=0x1, msg="hello"
Server: (process complete message immediately) Hi.
Client: FIN=0, opcode=0x1, msg="and a"
Server: (listening, new message containing text started)
Client: FIN=0, opcode=0x0, msg="happy new"
Server: (listening, payload concatenated to previous message)
Client: FIN=1, opcode=0x0, msg="year!"
Server: (process complete message) Happy new year to you too!
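
The receiving side of this exchange can be sketched as a small reassembler. MessageAssembler is a hypothetical class for this illustration, and it makes two simplifying assumptions: payloads arrive as already-decoded strings, and frames arrive in a valid order. A real implementation would concatenate raw bytes and decode text only after the last fragment, since a multi-byte character may be split across frames.

```javascript
class MessageAssembler {
  constructor() {
    this.parts = [];
    this.opcode = null;
  }

  // Feed one data frame. Returns the complete message once FIN=1 arrives,
  // otherwise null.
  push(fin, opcode, payload) {
    if (opcode !== 0x0) {
      // A non-continuation opcode starts a new message and fixes its type.
      this.opcode = opcode;
      this.parts = [];
    }
    this.parts.push(payload);
    if (!fin) return null; // keep listening for more fragments

    const message = { opcode: this.opcode, data: this.parts.join('') };
    this.parts = [];
    this.opcode = null;
    return message;
  }
}

const assembler = new MessageAssembler();
console.log(assembler.push(true, 0x1, 'hello'));      // { opcode: 1, data: 'hello' }
console.log(assembler.push(false, 0x1, 'and a'));     // null
console.log(assembler.push(false, 0x0, 'happy new')); // null
console.log(assembler.push(true, 0x0, 'year!'));      // { opcode: 1, data: 'and ahappy newyear!' }
```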

VII. Connection Hold + Heartbeat

To keep real-time two-way communication going, WebSocket needs the TCP channel between client and server to stay connected. However, keeping a connection open when no data has been exchanged for a long time can waste connection resources.

Still, in some scenarios the client and server need to keep the connection even though no data is exchanged for a long time. A heartbeat can be used for this.

Sender -> Receiver: ping
Receiver -> Sender: pong

Ping and pong correspond to two WebSocket control frames, with opcode 0x9 and 0xA respectively.

For example, a WebSocket server can send a ping to a client with a single line (using the ws module):

ws.ping('', false, true);
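
On the wire, such a ping is a very small frame. The sketch below builds an unmasked (server-to-client) control frame by hand, independent of the ws module; buildControlFrame is a name invented for this example. Control frames always have FIN=1 and a payload of at most 125 bytes, so the header is exactly 2 bytes.

```javascript
// Build an unmasked control frame: byte 1 is FIN=1 plus the opcode,
// byte 2 is MASK=0 plus the payload length.
function buildControlFrame(opcode, payload) {
  const body = Buffer.from(payload);
  if (body.length > 125) {
    throw new RangeError('control frame payload must be at most 125 bytes');
  }
  return Buffer.concat([Buffer.from([0x80 | opcode, body.length]), body]);
}

console.log(buildControlFrame(0x9, '').toString('hex'));      // 8900
console.log(buildControlFrame(0x9, 'Hello').toString('hex')); // 890548656c6c6f
console.log(buildControlFrame(0xA, 'Hello').toString('hex')); // 8a0548656c6c6f
```

The receiver answers a ping with a pong that carries the same payload, which is why the last two lines differ only in the opcode.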

VIII. Role of Sec-WebSocket-Key/Accept

As mentioned earlier, the main role of Sec-WebSocket-Key/Sec-WebSocket-Accept is to provide basic protection and cut down on malicious and accidental connections.

Its roles are roughly as follows:

Prevent the server from accepting invalid websocket connections (for example, an http client that accidentally requests a connection to a websocket service, in which case the server can simply reject the connection).
Ensure that the server understands websocket connections. Because the ws handshake uses http, the ws connection might be handled and answered by a plain http server; the client can use Sec-WebSocket-Key to make sure the server recognizes the ws protocol. (This is not 100% safe: there is always some http server that handles Sec-WebSocket-Key without actually implementing the ws protocol.)
When a browser page sends an ajax request and sets headers, Sec-WebSocket-Key and the related headers are forbidden. This prevents a client from accidentally requesting a websocket upgrade through an ajax request.
Prevent a reverse proxy (one that does not understand the ws protocol) from returning the wrong data. For example, if the proxy receives two ws upgrade requests in a row, it might cache the response to the first and return that cached response for the second (a meaningless response).
The main purpose of Sec-WebSocket-Key is not to secure the data, because the conversion from Sec-WebSocket-Key to Sec-WebSocket-Accept is public and very simple. Its main role is to prevent common accidents (unintentional ones).

To emphasize: the Sec-WebSocket-Key/Sec-WebSocket-Accept exchange provides only a basic guarantee. Whether the connection is secure, whether the data is secure, and whether the client and server are legitimate ws implementations are not actually guaranteed.
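
The client-side half of this check can be sketched as follows: the client remembers the key it sent, recomputes the expected value, and compares it with the header it received. The function name isValidAccept is an assumption for this example. Note that a match only shows that the other side performed the public calculation, nothing more.

```javascript
const crypto = require('crypto');

const MAGIC = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Returns true when the server's Sec-WebSocket-Accept matches the key we sent.
function isValidAccept(sentKey, receivedAccept) {
  const expected = crypto
    .createHash('sha1')
    .update(sentKey + MAGIC)
    .digest('base64');
  return receivedAccept === expected;
}

// Key and accept value taken from the handshake shown earlier in the article.
console.log(isValidAccept('w4v7O6xFTi36lq3RNcgctw==', 'Oy4NRAQ13jhfONC7bP8dTKb4PTU='));

// A server that ignores the key and echoes something else fails the check.
console.log(isValidAccept('w4v7O6xFTi36lq3RNcgctw==', 'not-the-right-value'));
```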

IX. Role of the data mask

The data mask in the WebSocket protocol exists to strengthen the security of the protocol. It is not meant to protect the data itself, since the algorithm is public and the operation is simple. Apart from encrypting the channel itself, there does not seem to be a very effective way to protect the communication.

So why introduce masking at all? Apart from adding computation, it does not seem to gain much (a point that confuses many readers).

The answer is still one word: security. Not to prevent data leakage, but to prevent problems such as the proxy cache poisoning attacks that existed in earlier versions of the protocol.

1. Proxy cache poisoning attack

The following is an excerpt from a 2010 paper on security. It describes the security problems that can result from flaws in how proxy servers implement the protocol.

“We show, empirically, that the current version of the WebSocket consent mechanism is vulnerable to proxy cache poisoning attacks. Even though the WebSocket handshake is based on HTTP, which should be understood by most network intermediaries, the handshake uses the esoteric “Upgrade” mechanism of HTTP [5]. In our experiment, we find that many proxies do not implement the Upgrade mechanism properly, which causes the handshake to succeed even though subsequent traffic over the socket will be misinterpreted by the proxy.”
[TALKING] Huang, L-S., Chen, E., Barth, A., Rescorla, E., and C. Jackson, “Talking to Yourself for Fun and Profit”, 2010.

Before describing the attack steps, assume the following participants:

The attacker, a server controlled by the attacker (the “evil server”), and a resource forged by the attacker (the “evil resource”).
The victim and the resource the victim wants to access (the “justice resource”).
The server the victim actually wants to access (the “justice server”).
An intermediate proxy server.

Attack step one:

The attacker’s browser opens a WebSocket connection to the evil server. As described earlier, this starts with a protocol upgrade request.
The protocol upgrade request actually arrives at the proxy server.
The proxy server forwards the protocol upgrade request to the evil server.
The evil server agrees to connect, and the proxy server forwards the response to the attacker.

Because of a flaw in its Upgrade implementation, the proxy server believes it has just forwarded an ordinary HTTP message. So when the evil server agrees to the connection, the proxy server considers the session finished.

Attack step two:

The attacker sends data to the evil server through the WebSocket interface on the previously established connection. The data is carefully constructed text in HTTP format: it contains the address of the justice resource and a forged Host header pointing to the justice server (see the message at the end of this subsection).
The request reaches the proxy server. Although it reuses the earlier TCP connection, the proxy server treats it as a new HTTP request.
The proxy server requests the evil resource from the evil server.
The evil server returns the evil resource. The proxy server caches it (the url is correct, but the host is the address of the justice server).

At this point the victim enters the scene:

The victim accesses the justice resource on the justice server through the proxy server.
The proxy server checks the url and host of the resource and finds a (forged) copy in its local cache.
The proxy server returns the evil resource to the victim.
Victim: deceased.

P.S. The carefully constructed “HTTP request message” mentioned earlier:

Client → Server:
POST /path/of/attackers/choice HTTP/1.1
Host: host-of-attackers-choice.com
Sec-WebSocket-Key: <connection-key>
Server → Client:
HTTP/1.1 200 OK
Sec-WebSocket-Accept: <connection-key>

2. Current solutions

The initial proposal was to encrypt the data. After weighing security against efficiency, a compromise was adopted: masking the payload.

Note that this only forces browsers to mask the payload. An attacker can still implement a WebSocket client and server that ignore the rule, and the attack proceeds as before.

But enforcing this in browsers greatly raises the difficulty of the attack and limits its reach. Without the restriction, an attacker would only need to put a phishing site online and lure people to it to launch a wide-scale attack in a short time.