Part 6 - A simple HTTP parser in Python
(The changes introduced in this post start here.)
Let’s implement our own little HTTP request parser, so that we’re not reliant on 3rd party libraries and gain some more understanding of the HTTP message format. We’ll just shamelessly copy the flow of the httptools.HttpRequestParser, so that we can drop our replacement in without any other changes to our code.
The general plan
As you can see in this image from MDN, the message structure is quite rigid and there are not a lot of rules necessary to implement a simple parser.
The first line is always the start line with a fixed number of positional arguments (HTTP method, request url/path, HTTP version) delimited by whitespace. Each following line is a Name: Value header pair until an eventual empty line. If there is a Content-Length header, then after the empty line you must also expect a message body of the specified length.
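To make that structure concrete, here is a minimal sketch that takes a request which has already arrived in full and pulls it apart along those rules. The helper name parse_complete_request and the sample POST request are made up for illustration; this is not the parser we build below, which has to cope with partial data.

```python
from typing import Dict, Tuple


def parse_complete_request(raw: bytes) -> Tuple[bytes, bytes, bytes, Dict[bytes, bytes], bytes]:
    """Hypothetical helper: parse a request that was received in one piece."""
    # The empty line (two consecutive CRLFs) separates the head from the body.
    head, _, rest = raw.partition(b"\r\n\r\n")
    start_line, *header_lines = head.split(b"\r\n")
    # Start line: method, request url/path and version, separated by spaces.
    method, url, version = start_line.split(b" ")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    # Only take as many body bytes as Content-Length announces.
    length = int(headers.get(b"content-length", b"0").decode("ascii"))
    return method, url, version, headers, rest[:length]


raw = (
    b"POST /echo HTTP/1.1\r\n"
    b"Host: localhost:5000\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
method, url, version, headers, body = parse_complete_request(raw)
print(method, url, version)
print(headers)
print(body)
```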
So by splitting on lines and parsing the information of every line, while keeping track of where in the message structure we currently are, we can parse out the information step-by-step. One wrinkle that slightly complicates things is that the data can arrive as many sequential chunks. So we always need to be aware that the line we’re currently looking at might not yet be completely received.
For example, instead of receiving the following request in a single chunk:
b"GET /index.html HTTP/1.1\r\nHost: localhost:5000\r\nUser-Agent: curl/7.69.1\r\nAccept: */*\r\n\r\n"
We may instead receive several, sequential chunks like so:
b"GET /index.html "
b"HTTP/1.1\r\nHost"
b": localhost:5000"
b"\r\nUser-Agent: "
b"curl/7.69.1\r\nA"
b"ccept: */*\r\n\r"
b"\n"
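A small sketch of what this means in practice: we accumulate the chunks and only treat text in front of a \r\n as a finished line, while the tail after the last \r\n is kept around because it may still be cut off. The variable names here are only for illustration.

```python
chunks = [
    b"GET /index.html ",
    b"HTTP/1.1\r\nHost",
    b": localhost:5000",
    b"\r\nUser-Agent: ",
    b"curl/7.69.1\r\nA",
    b"ccept: */*\r\n\r",
    b"\n",
]

pending = b""         # bytes received so far that do not form a full line yet
complete_lines = []   # lines we have seen in full

for chunk in chunks:
    pending += chunk
    # Everything before the last CRLF consists of complete lines;
    # whatever follows it may still be incomplete, so it stays pending.
    *lines, pending = pending.split(b"\r\n")
    complete_lines.extend(lines)
    print(f"after {chunk!r}: {len(complete_lines)} complete line(s), pending {pending!r}")
```

Note that the final empty line only shows up as complete after the very last one-byte chunk, because the \r and the \n arrive separately.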
An httptools inspired implementation
Let’s first create a new file splitbuffer.py where we’ll implement a simple buffer utility class. It should be possible to continuously feed data into it and then pop elements out of it based on a separator.
#./splitbuffer.py
class SplitBuffer:
    def __init__(self):
        self.data = b""

    def feed_data(self, data: bytes):
        self.data += data

    def pop(self, separator: bytes):
        first, *rest = self.data.split(separator, maxsplit=1)
        # no split was possible
        if not rest:
            return None
        else:
            self.data = separator.join(rest)
            return first

    def flush(self):
        temp = self.data
        self.data = b""
        return temp
Here is a little example of how it would work.
>>> buffer = SplitBuffer()
>>> buffer.feed_data(b"abc,")
>>> buffer.feed_data(b"defg")
>>> buffer.feed_data(b",hi,")
>>> assert buffer.pop(separator=b",") == b"abc"
>>> assert buffer.pop(separator=b",") == b"defg"
>>> buffer.feed_data(b",jkl")
>>> assert buffer.pop(separator=b",") == b"hi"
>>> assert buffer.pop(separator=b",") == b""
>>> assert buffer.pop(separator=b",") is None
>>> assert buffer.flush() == b"jkl"
We’ll use the .pop behavior while we’re in the start/headers part of the message (where entries are separated by \r\n line breaks) and the .flush behavior while grabbing the body (which should not be split at all, just consumed as is).
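Here is a rough sketch of that division of labor on HTTP-shaped data. The SplitBuffer class is repeated from above only so that the snippet runs on its own, and the POST request and its chunk boundaries are invented for the example.

```python
class SplitBuffer:
    # Same class as in splitbuffer.py, repeated so this snippet is standalone.
    def __init__(self):
        self.data = b""

    def feed_data(self, data: bytes):
        self.data += data

    def pop(self, separator: bytes):
        first, *rest = self.data.split(separator, maxsplit=1)
        if not rest:
            return None
        self.data = separator.join(rest)
        return first

    def flush(self):
        temp = self.data
        self.data = b""
        return temp


buffer = SplitBuffer()
buffer.feed_data(b"POST /echo HTTP/1.1\r\nContent-Le")
buffer.feed_data(b"ngth: 5\r\n\r\nhel")

# Start line and headers: pop line by line until the empty line.
# (The loop would also stop on None, i.e. an incomplete line, but in this
# example the whole head has already arrived.)
head_lines = []
while True:
    line = buffer.pop(separator=b"\r\n")
    if line is None or line == b"":
        break
    head_lines.append(line)

# Body: no separators anymore, just take whatever has arrived so far.
body = buffer.flush()
buffer.feed_data(b"lo")
body += buffer.flush()

print(head_lines)
print(body)
```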
In a new file http_parse.py we can now implement our own little HTTP parser.
#./http_parse.py
from splitbuffer import SplitBuffer


class HttpRequestParser:
    def __init__(self, protocol):
        self.protocol = protocol
        self.buffer = SplitBuffer()
        self.done_parsing_start = False
        self.done_parsing_headers = False
        self.expected_body_length = 0

    def feed_data(self, data: bytes):
        self.buffer.feed_data(data)
        self.parse()

    def parse(self):
        if not self.done_parsing_start:
            self.parse_startline()
        elif not self.done_parsing_headers:
            self.parse_headerline()
        elif self.expected_body_length:
            data = self.buffer.flush()
            # Nothing buffered yet: wait for the next feed_data() call
            # instead of recursing forever on an empty buffer.
            if not data:
                return
            self.expected_body_length -= len(data)
            self.protocol.on_body(data)
            self.parse()
        else:
            self.protocol.on_message_complete()

    def parse_startline(self):
        line = self.buffer.pop(separator=b"\r\n")
        # None means the start line has not been completely received yet
        if line is not None:
            http_method, url, http_version = line.strip().split()
            self.done_parsing_start = True
            self.protocol.on_url(url)
            self.parse()

    def parse_headerline(self):
        line = self.buffer.pop(separator=b"\r\n")
        if line is not None:
            if line:
                name, value = line.strip().split(b": ", maxsplit=1)
                if name.lower() == b"content-length":
                    self.expected_body_length = int(value.decode("utf-8"))
                self.protocol.on_header(name, value)
            else:
                # an empty line marks the end of the headers
                self.done_parsing_headers = True
            self.parse()
We use two flags, self.done_parsing_start and self.done_parsing_headers, to keep track of whether we’ve already parsed the start line and the headers, respectively. If we encounter a Content-Length header while parsing the headers, we set self.expected_body_length so that we know how much of the body is left to read before the message is over. The most important part is the parse() method, which continuously tries to parse out the next thing (start line, header line, a chunk of the body) from the raw data coming in through feed_data().
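The same state machine can also be written with a loop instead of recursive parse() calls. The following is only a sketch of that alternative, not the parser from this post: LoopingRequestParser and RecordingProtocol are hypothetical names, the single state string replaces the two flags, and the body is sliced to the announced Content-Length. The callback names are the ones the parser above already uses.

```python
class RecordingProtocol:
    """Hypothetical protocol that just records which callbacks were invoked."""

    def __init__(self):
        self.events = []

    def on_url(self, url):
        self.events.append(("url", url))

    def on_header(self, name, value):
        self.events.append(("header", name, value))

    def on_body(self, data):
        self.events.append(("body", data))

    def on_message_complete(self):
        self.events.append(("complete",))


class LoopingRequestParser:
    """Loop-based variant of the parser (illustrative sketch only)."""

    def __init__(self, protocol):
        self.protocol = protocol
        self.buffer = b""
        self.state = "start"  # "start" -> "headers" -> "body" -> "done"
        self.expected_body_length = 0

    def feed_data(self, data: bytes):
        self.buffer += data
        # Keep stepping for as long as a step makes progress.
        while self.state != "done" and self._step():
            pass

    def _pop_line(self):
        line, found, rest = self.buffer.partition(b"\r\n")
        if not found:
            return None  # line not completely received yet
        self.buffer = rest
        return line

    def _step(self) -> bool:
        """Try to consume one item; return False if more data is needed."""
        if self.state == "body":
            if self.expected_body_length > 0:
                if not self.buffer:
                    return False
                data = self.buffer[: self.expected_body_length]
                self.buffer = self.buffer[len(data):]
                self.expected_body_length -= len(data)
                self.protocol.on_body(data)
            if self.expected_body_length == 0:
                self.state = "done"
                self.protocol.on_message_complete()
            return True
        line = self._pop_line()
        if line is None:
            return False
        if self.state == "start":
            http_method, url, http_version = line.split()
            self.protocol.on_url(url)
            self.state = "headers"
        elif line:
            name, value = line.split(b":", maxsplit=1)
            value = value.strip()
            if name.lower() == b"content-length":
                self.expected_body_length = int(value.decode("ascii"))
            self.protocol.on_header(name, value)
        else:
            self.state = "body"  # the empty line ends the headers
        return True


protocol = RecordingProtocol()
parser = LoopingRequestParser(protocol)
for chunk in [b"GET /index.html ", b"HTTP/1.1\r\nHost", b": localhost:5000",
              b"\r\nUser-Agent: ", b"curl/7.69.1\r\nA", b"ccept: */*\r\n\r", b"\n"]:
    parser.feed_data(chunk)
print(protocol.events)
```

Feeding the chunked GET request from the beginning of this post should produce one url event, three header events and a final complete event, no matter where the chunk boundaries fall.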
The parser takes a protocol argument on instantiation and expects it to implement the exact same callbacks as httptools.HttpRequestParser does. So, in order to start using our own parser instead of httptools, we only need to replace the import in server.py.
#./server.py
@@ -1,7 +1,7 @@
from typing import Tuple
import socket
import threading
- from httptools import HttpRequestParser
+ from http_parse import HttpRequestParser
from http_request import HttpRequestParserProtocol
from http_response import make_response
Notes
Of course, there is basically no error handling, no checking for a correct format, etc. We’re just assuming that the client will send a well-formed HTTP request, otherwise everything will break. But I think that’s fine, we just want to see how all the basic pieces fit together and not get bogged down in trying to handle all corner cases.
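Just to give an idea of what a first bit of checking could look like, here is a small sketch of a stricter start line parser. The function name parse_startline_strict and the choice to raise ValueError are assumptions made for this example; our parser above does none of this.

```python
def parse_startline_strict(line: bytes):
    """Hypothetical stricter variant of the start line parsing."""
    parts = line.split(b" ")
    # Exactly three space-separated parts: method, url/path, version.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed start line: {line!r}")
    http_method, url, http_version = parts
    if not http_version.startswith(b"HTTP/"):
        raise ValueError(f"unexpected HTTP version: {http_version!r}")
    return http_method, url, http_version


print(parse_startline_strict(b"GET /index.html HTTP/1.1"))

try:
    parse_startline_strict(b"GET /index.html")
except ValueError as error:
    print("rejected:", error)
```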
Also, to emphasize this point again, the message structure we’ve looked at here is HTTP/1.1. HTTP/2 has a somewhat different way of formatting the messages. If you want to read more about that, check here.
Now we can be very proud of ourselves for having reinvented the wheel here. It has resulted in a far slower and buggier version of what httptools already provides. Great job. The next step in this series will be to finally, actually implement our very own WSGI server.