As a favour to a good friend who needed a really simple thing done, I decided to write this Lambda function. Basically: load a file, search through it for a particular string and return a flag indicating whether it was found. Simple, right?
Ok. That wasn't really what I had been asked to do, but for the purposes of illustration and this particular treatise it is the core of the matter and what tripped me up. Anyway, to the point at hand. I figured: store the file in S3, read it into Lambda and do a search. Front it with API Gateway and you've got a super-duper cheap (and quick) way of searching this text. Cheap, easy, fast. Just what the doctor ordered.
Only one small obstacle - it's a big file. 800 megabytes or so. But that's not a big deal, right? Lambda has 1.5 GB of memory available (at the time of writing - September 2017), so no problem! Just retrieve the file from S3 and we're done. As long as that doesn't take too long, we're good.
Let's write a bit of code - giving the Lambda function the full 1.5 GB of memory, setting the timeout to (say) 20 seconds and assigning a suitable role.
```python
import boto3

def LoadObject():
    S3 = boto3.client("s3")
    try:
        Response = S3.get_object(Bucket="randombucket", Key="randomkey")
        print("Object size: " + Response["ResponseMetadata"]["HTTPHeaders"]["content-length"])
        FileContents = Response["Body"].read()
    except Exception as e:
        print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
```
The error message from Lambda is interesting:
```
Execution result: failed
START RequestId: 61fd3a48-a8b2-11e7-af8a-f741ba48d44e Version: $LATEST
Object size: 839775828
END RequestId: 61fd3a48-a8b2-11e7-af8a-f741ba48d44e
REPORT RequestId: 61fd3a48-a8b2-11e7-af8a-f741ba48d44e Duration: 17005.67 ms Billed Duration: 17100 ms Memory Size: 1536 MB Max Memory Used: 1536 MB
RequestId: 61fd3a48-a8b2-11e7-af8a-f741ba48d44e Process exited before completing request
```
Say what? We ran out of memory? But we're only loading just over 800 MB! How did we use 1.5 GB of memory and still not load the file?
With a little bit of investigating, it seems that Python (more likely boto3, or perhaps the underlying HTTP/HTTPS retrieval library) allocates memory for some kind of buffer before handing the object over to be assigned into a string. What happens if we try to read just part of the file? Let's work our way down a bit - starting at 800 MB.
```python
import boto3

def LoadObject():
    S3 = boto3.client("s3")
    try:
        Response = S3.get_object(Bucket="randombucket", Key="randomkey")
        print("Object size: " + Response["ResponseMetadata"]["HTTPHeaders"]["content-length"])
        FileContents = Response["Body"].read(amt=800000000)
    except Exception as e:
        print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
```
```
Execution result: failed
START RequestId: c66fefcc-a8b9-11e7-8a4a-33fdf56d58d4 Version: $LATEST
Object size: 839775828
END RequestId: c66fefcc-a8b9-11e7-8a4a-33fdf56d58d4
REPORT RequestId: c66fefcc-a8b9-11e7-8a4a-33fdf56d58d4 Duration: 14292.27 ms Billed Duration: 14300 ms Memory Size: 1536 MB Max Memory Used: 1536 MB
RequestId: c66fefcc-a8b9-11e7-8a4a-33fdf56d58d4 Process exited before completing request
```
Nope. Turns out the sweet spot is somewhere a bit over 700 MB (which makes sense: 700 * 2 = 1400, which is just under the 1536 MB we have), but let's play it safe and go with 500 MB. Now, can we do two reads of 500 MB and load the whole file?
```python
import boto3

def LoadObject():
    FileContents = ""
    S3 = boto3.client("s3")
    try:
        Response = S3.get_object(Bucket="randombucket", Key="randomkey")
        print("Object size: " + Response["ResponseMetadata"]["HTTPHeaders"]["content-length"])
        for i in range(0, 2):
            FileContents += Response["Body"].read(amt=500000000)
    except Exception as e:
        print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
```
```
Execution result: succeeded
START RequestId: c68b22f7-a8ba-11e7-91b9-d1069e754502 Version: $LATEST
Object size: 839775828
END RequestId: c68b22f7-a8ba-11e7-91b9-d1069e754502
REPORT RequestId: c68b22f7-a8ba-11e7-91b9-d1069e754502 Duration: 12269.91 ms Billed Duration: 12300 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
Yes! Ok. Loaded; now we can go to work and search for the string we're interested in. Note that if we reduce the read buffer to (say) 250 MB and do four reads, memory consumption drops from 1.1 GB to roughly 900 MB, so that might be a factor later (it's crazy difficult to observe momentary memory usage in Python, so I can't really tell how much is in use at any particular second). If the file size increases it may become a problem, but for now let's go with it. And around twelve seconds to pull the file from S3 isn't a terrible hit for a first search in this use case, so we're cooking with gas now!
First, though, we need to put that string into a global variable, and we should make sure that we only load it on cold function starts.
```python
import boto3

GlobalFileContents = ""

def LoadObject():
    global GlobalFileContents
    if len(GlobalFileContents) == 0:
        S3 = boto3.client("s3")
        try:
            Response = S3.get_object(Bucket="randombucket", Key="randomkey")
            for i in range(0, 2):
                GlobalFileContents += Response["Body"].read(amt=500000000)
        except Exception as e:
            print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
```
```
Execution result: failed
START RequestId: 685a7a14-a8bb-11e7-9a18-0b874190d810 Version: $LATEST
END RequestId: 685a7a14-a8bb-11e7-9a18-0b874190d810
REPORT RequestId: 685a7a14-a8bb-11e7-9a18-0b874190d810 Duration: 13491.09 ms Billed Duration: 13500 ms Memory Size: 1536 MB Max Memory Used: 1537 MB
RequestId: 685a7a14-a8bb-11e7-9a18-0b874190d810 Process exited before completing request
```
Huh? Didn't we just solve this problem? What if we reduce the size of each read from S3 and increase the number of loops? Turns out - no, that doesn't work either. It appears that Python might be allocating a local buffer within the function to hold the file (or using some similar mechanism) before it assigns the global variable's space - and we can't fit 2 x 800 MB into the Lambda function's memory. Fine, Python. Have it your way.
```python
import boto3

GlobalFileContents = ""

def LoadObject():
    global GlobalFileContents
    if len(GlobalFileContents) == 0:
        LocalContents = ""
        S3 = boto3.client("s3")
        try:
            Response = S3.get_object(Bucket="randombucket", Key="randomkey")
            for i in range(0, 2):
                LocalContents += Response["Body"].read(amt=500000000)
            GlobalFileContents = LocalContents
        except Exception as e:
            print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
```
```
Execution result: succeeded
START RequestId: e4fb623c-a8bb-11e7-bce8-e7c17002a174 Version: $LATEST
END RequestId: e4fb623c-a8bb-11e7-bce8-e7c17002a174
REPORT RequestId: e4fb623c-a8bb-11e7-bce8-e7c17002a174 Duration: 12320.69 ms Billed Duration: 12400 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
Excellent. Luckily, assignment in Python just binds a name to an existing object - no copy is made - so we only get one copy of the buffer now. The space is allocated inside the function, the global name is then bound to the same object, and (lucky for me) the buffer isn't destroyed when the function exits because the global reference keeps it alive. Ok. NOW we can search.
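If the no-copy behaviour seems surprising, a tiny standalone snippet (nothing to do with boto3) demonstrates that assignment never duplicates the underlying object:

```python
big = "x" * (10 * 1024 * 1024)  # one 10 MB string
alias = big                     # binds a second NAME to the SAME object

# Both names refer to one object; no second 10 MB buffer exists.
print(alias is big)             # True
```

The same rule is what lets `GlobalFileContents = LocalContents` hand the buffer to the global name for free.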
```python
import boto3

GlobalFileContents = ""

def LoadObject():
    global GlobalFileContents
    if len(GlobalFileContents) == 0:
        LocalContents = ""
        S3 = boto3.client("s3")
        try:
            Response = S3.get_object(Bucket="randombucket", Key="randomkey")
            for i in range(0, 2):
                LocalContents += Response["Body"].read(amt=500000000)
            GlobalFileContents = LocalContents
        except Exception as e:
            print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
    SearchText = "FindMe"
    Found = GlobalFileContents.find(SearchText)
    return Found
```
```
Execution result: succeeded
START RequestId: 1eeebdde-a8bc-11e7-b523-530dce425102 Version: $LATEST
END RequestId: 1eeebdde-a8bc-11e7-b523-530dce425102
REPORT RequestId: 1eeebdde-a8bc-11e7-b523-530dce425102 Duration: 13072.65 ms Billed Duration: 13100 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
Sweet! And if we run it again, we can see that we're not loading the file a second time - we're getting the benefit of warm function starts.
```
Execution result: succeeded
START RequestId: 56e8ba66-a8bc-11e7-a60e-a19bf1c32d99 Version: $LATEST
END RequestId: 56e8ba66-a8bc-11e7-a60e-a19bf1c32d99
REPORT RequestId: 56e8ba66-a8bc-11e7-a60e-a19bf1c32d99 Duration: 592.48 ms Billed Duration: 600 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
600 milliseconds to search 800-odd MB of text isn't bad, right? Cool. Now let's see what happens if we call the function from API Gateway. Luckily we can simulate this right from the console by crafting a test event.
```json
{
  "queryStringParameters": {
    "SearchString": "FindMe"
  }
}
```
```python
import boto3

GlobalFileContents = ""

def LoadObject():
    global GlobalFileContents
    if len(GlobalFileContents) == 0:
        LocalContents = ""
        S3 = boto3.client("s3")
        try:
            Response = S3.get_object(Bucket="randombucket", Key="randomkey")
            for i in range(0, 2):
                LocalContents += Response["Body"].read(amt=500000000)
            GlobalFileContents = LocalContents
        except Exception as e:
            print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
    SearchText = event["queryStringParameters"]["SearchString"]
    Found = GlobalFileContents.find(SearchText)
    return Found
```
```
Execution result: failed
START RequestId: 1b909e6a-a8e1-11e7-91b0-852b465f0ad0 Version: $LATEST
: MemoryError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 26, in lambda_handler
    Found = GlobalFileContents.find(SearchText)
MemoryError
END RequestId: 1b909e6a-a8e1-11e7-91b0-852b465f0ad0
REPORT RequestId: 1b909e6a-a8e1-11e7-91b0-852b465f0ad0 Duration: 13520.09 ms Billed Duration: 13600 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
Here we go again. Why are we getting a memory error? According to the Python wisdom online, it's because we have run out of memory. Well, duh. But Lambda says we're only using our regular 1.1 GB or so. Hmmm. What if Python is trying to allocate more memory than is available, and the allocation itself is failing? That would produce this error message. But why would it be asking for more memory? Searching for a hard-coded string works - why would searching for a string that comes from the event passed to Lambda be any different?
```python
import boto3

GlobalFileContents = ""

def LoadObject():
    global GlobalFileContents
    if len(GlobalFileContents) == 0:
        LocalContents = ""
        S3 = boto3.client("s3")
        try:
            Response = S3.get_object(Bucket="randombucket", Key="randomkey")
            for i in range(0, 2):
                LocalContents += Response["Body"].read(amt=500000000)
            GlobalFileContents = LocalContents
        except Exception as e:
            print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
    OriginalSearchText = "FindMe"
    SearchText = event["queryStringParameters"]["SearchString"]
    print(OriginalSearchText)
    print(SearchText)
    Found = GlobalFileContents.find(SearchText)
    return Found
```
```
Execution result: failed
START RequestId: a684d388-a8e1-11e7-b2cd-4543dc3a3fa5 Version: $LATEST
FindMe
FindMe
: MemoryError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 30, in lambda_handler
    Found = GlobalFileContents.find(SearchText)
MemoryError
END RequestId: a684d388-a8e1-11e7-b2cd-4543dc3a3fa5
REPORT RequestId: a684d388-a8e1-11e7-b2cd-4543dc3a3fa5 Duration: 10855.90 ms Billed Duration: 10900 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
No clues there - the two strings look identical to me (and to the string comparison operators too). So what is "find" doing that is so unusual? Much digging happens at this point, with many hours passing while following curious trails down rabbit holes to no avail, until eventually, by accident, a random posting somewhere offers something of interest. So I try this:
```python
import boto3

GlobalFileContents = ""

def LoadObject():
    global GlobalFileContents
    if len(GlobalFileContents) == 0:
        LocalContents = ""
        S3 = boto3.client("s3")
        try:
            Response = S3.get_object(Bucket="randombucket", Key="randomkey")
            for i in range(0, 2):
                LocalContents += Response["Body"].read(amt=500000000)
            GlobalFileContents = LocalContents
        except Exception as e:
            print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
    SearchText = event["queryStringParameters"]["SearchString"]
    print(type(SearchText))
    Found = GlobalFileContents.find(SearchText)
    return Found
```
```
Execution result: failed
START RequestId: edc9e9c4-a8e1-11e7-a949-8d49f899d7ad Version: $LATEST
<type 'unicode'>
: MemoryError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 28, in lambda_handler
    Found = GlobalFileContents.find(SearchText)
MemoryError
END RequestId: edc9e9c4-a8e1-11e7-a949-8d49f899d7ad
REPORT RequestId: edc9e9c4-a8e1-11e7-a949-8d49f899d7ad Duration: 12401.82 ms Billed Duration: 12500 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
So the event JSON payload is unicode? What does "find" do when it's asked to search a byte string for a unicode string? It converts the whole byte string to unicode first - and in the process attempts to consume far more memory than we have available. Aha! I don't really care about losing any special unicode characters (there aren't any in the target string), so:
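For what it's worth, this was Python 2, where the decode happens silently. Python 3 turns the same mismatch into an outright error - searching `bytes` with a `str` needle raises `TypeError` - which forces you to encode the needle up front. A small standalone sketch:

```python
haystack = b"a large byte string containing FindMe somewhere"
needle = "FindMe"  # str (unicode), as it would arrive in the event payload

try:
    haystack.find(needle)  # Python 3: bytes.find() rejects a str needle
except TypeError as e:
    print("mismatch:", e)

# Encoding the needle keeps the whole search in bytes - no giant decode.
position = haystack.find(needle.encode("utf-8"))
print(position >= 0)  # True
```

Same fix, enforced by the type system instead of discovered via MemoryError.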
```python
import boto3

GlobalFileContents = ""

def LoadObject():
    global GlobalFileContents
    if len(GlobalFileContents) == 0:
        LocalContents = ""
        S3 = boto3.client("s3")
        try:
            Response = S3.get_object(Bucket="randombucket", Key="randomkey")
            for i in range(0, 2):
                LocalContents += Response["Body"].read(amt=500000000)
            GlobalFileContents = LocalContents
        except Exception as e:
            print("Error getting object: " + str(e))

def lambda_handler(event, context):
    LoadObject()
    SearchText = event["queryStringParameters"]["SearchString"].encode("utf-8")
    Found = GlobalFileContents.find(SearchText)
    return Found
```
```
Execution result: succeeded
START RequestId: 6b3c2b34-a8e2-11e7-9528-9721462c0542 Version: $LATEST
END RequestId: 6b3c2b34-a8e2-11e7-9528-9721462c0542
REPORT RequestId: 6b3c2b34-a8e2-11e7-9528-9721462c0542 Duration: 12167.18 ms Billed Duration: 12200 ms Memory Size: 1536 MB Max Memory Used: 1163 MB
```
Phew. Finally. There endeth the lesson.