Compute MD5 or SHA hash of large file efficiently on iOS and Mac OS X September 7, 2010

Computing cryptographic hashes of files on iOS and Mac OS X using the CommonCrypto APIs is fairly easy, but doing it in a way that minimizes memory consumption even with large files can be a little more difficult… The other day, I was reading what some people were saying about this on a forum about iPhone development, and they thought they had found the trick, but they still had a growing memory footprint with large files because they forgot something fundamental about memory management in Cocoa.

Updated

  • Friday, October 1, 2010: removed comment about the fact that I used character arrays on the heap with the more modular solution described at the end of the post; this is now fixed, and that more general solution is now as efficient as the simple one described here.
  • Sunday, October 17, 2010: added link to a simple GitHub repository that I created to show exactly how to integrate my function FileMD5HashCreateWithPath with a simple iOS or Mac application.

What was wrong with that solution?

Even though they had a solution that read bytes from the file progressively instead of reading everything at once, it did not improve the memory consumption of their program when computing hashes of large files. Their mistake was that the bytes read in each pass of the while loop came back in an autoreleased instance of NSData. So, unless they create a local autorelease pool within the while loop, the memory just accumulates until the enclosing autorelease pool is drained. But I think it would be very inefficient to add an autorelease pool in the while loop, because you would end up allocating a new pool object in every pass of the loop.

So, in my opinion, the right question is: how do we read those bytes without getting an autoreleased object?

How to get around that problem?

I looked for a solution, and I couldn’t find anything that would do the same thing as -[NSFileHandle readDataOfLength:] at the Foundation level without returning an autoreleased object. So I thought: we have to go deeper. I looked for something similar in Core Foundation, and sure enough, I found the CFReadStream API.

And since I was going to do this using Core Foundation to read those bytes, I decided to go all the way with Core Foundation, with a solution in pure C.
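Stripped of any Apple-specific API, the idea is simply to reuse one fixed-size buffer and feed each chunk to an incremental hash, so that memory use does not depend on the file size. The sketch below illustrates only that pattern in portable C. It is an assumption-laden stand-in, not the code from this post: it uses stdio instead of CFReadStream, and the function names and the FNV-1a hash (which is not cryptographic) are chosen purely so the example is self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a (64-bit) is only a tiny stand-in for a real incremental digest
 * such as MD5; it is NOT a cryptographic hash. */
#define FNV1A64_OFFSET_BASIS 0xcbf29ce484222325ULL
#define FNV1A64_PRIME        0x100000001b3ULL

/* Fold `length` bytes into a running FNV-1a state and return the new state. */
uint64_t fnv1a64_update(uint64_t state, const void *data, size_t length) {
    const unsigned char *bytes = data;
    for (size_t i = 0; i < length; ++i) {
        state ^= bytes[i];
        state *= FNV1A64_PRIME;
    }
    return state;
}

/* Hash everything readable from `stream`, one chunk at a time.
 * A single fixed-size stack buffer is reused for every chunk, so the
 * memory footprint is the same for a 1 KB file and a 10 GB file.
 * Returns false if a read error occurred. */
bool fnv1a64_hash_stream(FILE *stream, uint64_t *out_hash) {
    assert(stream != NULL && out_hash != NULL);
    unsigned char buffer[4096];
    uint64_t state = FNV1A64_OFFSET_BASIS;
    size_t count;
    while ((count = fread(buffer, 1, sizeof(buffer), stream)) > 0) {
        state = fnv1a64_update(state, buffer, count);
    }
    if (ferror(stream)) return false;
    *out_hash = state;
    return true;
}
```

Because the hash state is updated incrementally, the result is the same whether the data arrives in one piece or in many chunks, which is exactly the property that CC_MD5_Update relies on.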

Here’s how you can efficiently compute the MD5 hash of a large file with CommonCrypto and Core Foundation:

FileMD5Hash.c

// Standard library
#include <stdint.h>
#include <stdio.h>

// Core Foundation
#include <CoreFoundation/CoreFoundation.h>

// Cryptography
#include <CommonCrypto/CommonDigest.h>

// In bytes
#define FileHashDefaultChunkSizeForReadingData 4096

// Function
CFStringRef FileMD5HashCreateWithPath(CFStringRef filePath,
                                      size_t chunkSizeForReadingData) {
   
    // Declare needed variables
    CFStringRef result = NULL;
    CFReadStreamRef readStream = NULL;
   
    // Get the file URL
    CFURLRef fileURL =
        CFURLCreateWithFileSystemPath(kCFAllocatorDefault,
                                      (CFStringRef)filePath,
                                      kCFURLPOSIXPathStyle,
                                      (Boolean)false);
    if (!fileURL) goto done;
   
    // Create and open the read stream
    readStream = CFReadStreamCreateWithFile(kCFAllocatorDefault,
                                            (CFURLRef)fileURL);
    if (!readStream) goto done;
    bool didSucceed = (bool)CFReadStreamOpen(readStream);
    if (!didSucceed) goto done;
   
    // Initialize the hash object
    CC_MD5_CTX hashObject;
    CC_MD5_Init(&hashObject);
   
    // Make sure chunkSizeForReadingData is valid
    if (!chunkSizeForReadingData) {
        chunkSizeForReadingData = FileHashDefaultChunkSizeForReadingData;
    }
   
    // Feed the data to the hash object
    bool hasMoreData = true;
    while (hasMoreData) {
        uint8_t buffer[chunkSizeForReadingData];
        CFIndex readBytesCount = CFReadStreamRead(readStream,
                                                  (UInt8 *)buffer,
                                                  (CFIndex)sizeof(buffer));
        if (readBytesCount == -1) break;
        if (readBytesCount == 0) {
            hasMoreData = false;
            continue;
        }
        CC_MD5_Update(&hashObject,
                      (const void *)buffer,
                      (CC_LONG)readBytesCount);
    }
   
    // Check if the read operation succeeded
    didSucceed = !hasMoreData;
   
    // Compute the hash digest
    unsigned char digest[CC_MD5_DIGEST_LENGTH];
    CC_MD5_Final(digest, &hashObject);
   
    // Abort if the read operation failed
    if (!didSucceed) goto done;
   
    // Compute the string result
    char hash[2 * sizeof(digest) + 1];
    for (size_t i = 0; i < sizeof(digest); ++i) {
        snprintf(hash + (2 * i), 3, "%02x", (int)(digest[i]));
    }
    result = CFStringCreateWithCString(kCFAllocatorDefault,
                                       (const char *)hash,
                                       kCFStringEncodingUTF8);
   
done:
   
    if (readStream) {
        CFReadStreamClose(readStream);
        CFRelease(readStream);
    }
    if (fileURL) {
        CFRelease(fileURL);
    }
    return result;
}

Then, from your Objective-C code, you can just use that function like this:

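NSString *filePath = ...; // Let's assume filePath is defined...
CFStringRef md5hash =
    FileMD5HashCreateWithPath((CFStringRef)filePath,
                              FileHashDefaultChunkSizeForReadingData);
if (md5hash) {
    NSLog(@"MD5 hash of file at path \"%@\": %@",
          filePath, (NSString *)md5hash);
    CFRelease(md5hash);
} else {
    // NULL means the file could not be opened or read;
    // calling CFRelease(NULL) would crash
    NSLog(@"Could not compute MD5 hash of file at path \"%@\"", filePath);
}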

Remember that FileMD5HashCreateWithPath transfers ownership of the returned string, so you must release it yourself.

I also created a small GitHub repository that may help you understand how to integrate that code in your project. It contains a very simple Xcode project, with a target for iOS and another one for Mac OS X. In both cases, the application just provides a simple button to compute the MD5 hash of the executable file (the binary). Here is where you can find that repository: FileMD5Hash GitHub repository.

Advantages of this solution

There are several nice things about this implementation:

  • first, it works as advertised: it computes the MD5 hash of the file correctly, and it doesn’t make the memory footprint of your app grow, even if you give it the path to a huge file;
  • even though the path argument is a CFStringRef, it’s really easy to use this from Objective-C, thanks to the fact that NSString and CFStringRef are toll-free bridged; cf. example above for usage;
  • it works just fine both on iOS and on Mac OS X;
  • by reusing sizeof(digest), I avoided the pitfall of repeating CC_MD5_DIGEST_LENGTH (or its literal value) throughout the function, which would make it more difficult to adapt to other cryptographic algorithms.
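To make the last point concrete, here is a minimal sketch of the hex-encoding step pulled out into its own helper. The name digest_to_hex and its signature are hypothetical (they are not part of FileMD5Hash.c); the point is only that every size is derived from the digest array itself, so nothing in it is specific to MD5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write the lowercase hex form of `digest` into `out`.
 * `out_size` must be at least 2 * digest_length + 1 (for the trailing NUL).
 * Returns 0 on success, -1 if `out` is too small.
 * (Hypothetical helper, for illustration only.) */
int digest_to_hex(const unsigned char *digest, size_t digest_length,
                  char *out, size_t out_size) {
    assert(digest != NULL && out != NULL);
    if (out_size < 2 * digest_length + 1) return -1;
    for (size_t i = 0; i < digest_length; ++i) {
        /* Each byte becomes two hex characters; snprintf also writes a NUL,
         * which the next iteration overwrites. */
        snprintf(out + 2 * i, 3, "%02x", (unsigned int)digest[i]);
    }
    out[2 * digest_length] = '\0';
    return 0;
}

/* Typical call site: both sizes come from sizeof, never from a literal.
 *
 *     unsigned char digest[CC_MD5_DIGEST_LENGTH];   // or any other length
 *     char hex[2 * sizeof(digest) + 1];
 *     digest_to_hex(digest, sizeof(digest), hex, sizeof(hex));
 */
```

Switching to another algorithm then only changes the declaration of the digest array; the buffer size and the loop bound follow automatically.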

How about SHA1, SHA256, and others?

It’s really simple to adapt this function to other algorithms. Say you want to adapt it to get the SHA1 hash instead. Here’s what you need to do:

  • replace CC_MD5_CTX with CC_SHA1_CTX;
  • replace CC_MD5_Init with CC_SHA1_Init;
  • replace CC_MD5_Update with CC_SHA1_Update;
  • replace CC_MD5_Final with CC_SHA1_Final;
  • replace CC_MD5_DIGEST_LENGTH with CC_SHA1_DIGEST_LENGTH.

Or more simply, just do a find and replace to transform every occurrence of the string “MD5” into “SHA1”. Voilà, you got it!

Another way to extend this to other algorithms is to make this function more modular, and basically take all of those things as arguments. This is a little more difficult, but I did it for my project TagAdA. With this more advanced and more modular solution, you have a third argument that represents the algorithm that you wish to use, and you only have one instance of the code associated with that logic in your binary, even if you use several of those cryptographic algorithms in your app. I even went to great lengths using the preprocessor to minimize the amount of duplicated code in my source file.
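One possible shape for such a modular design is a small table of function pointers, indexed by an algorithm identifier. The sketch below is not the TagAdA code, which is not shown in this post; every name in it is hypothetical, and two toy 8-bit checksums stand in for the CommonCrypto functions so that it compiles anywhere. With CommonCrypto, each table entry would instead point at thin wrappers around CC_MD5_Init, CC_MD5_Update, CC_MD5_Final (and their SHA counterparts), with the digest length taken from the matching CC_*_DIGEST_LENGTH constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per supported algorithm: its digest size plus the three
 * init/update/final steps, all operating on an opaque context. */
typedef struct {
    size_t digest_length;
    void (*init)(void *context);
    void (*update)(void *context, const void *data, size_t length);
    void (*final)(unsigned char *digest, void *context);
} HashAlgorithm;

/* Toy algorithm 1: 8-bit additive checksum (stand-in, not cryptographic). */
static void sum8_init(void *context) { *(uint8_t *)context = 0; }
static void sum8_update(void *context, const void *data, size_t length) {
    const uint8_t *bytes = data;
    uint8_t *sum = context;
    for (size_t i = 0; i < length; ++i) *sum = (uint8_t)(*sum + bytes[i]);
}
static void sum8_final(unsigned char *digest, void *context) {
    digest[0] = *(uint8_t *)context;
}

/* Toy algorithm 2: 8-bit XOR checksum (stand-in, not cryptographic). */
static void xor8_init(void *context) { *(uint8_t *)context = 0; }
static void xor8_update(void *context, const void *data, size_t length) {
    const uint8_t *bytes = data;
    uint8_t *acc = context;
    for (size_t i = 0; i < length; ++i) *acc ^= bytes[i];
}
static void xor8_final(unsigned char *digest, void *context) {
    digest[0] = *(uint8_t *)context;
}

typedef enum { HASH_ALGORITHM_SUM8, HASH_ALGORITHM_XOR8 } HashAlgorithmID;

static const HashAlgorithm kHashAlgorithms[] = {
    [HASH_ALGORITHM_SUM8] = {1, sum8_init, sum8_update, sum8_final},
    [HASH_ALGORITHM_XOR8] = {1, xor8_init, xor8_update, xor8_final},
};

/* The single shared code path: the algorithm is just another argument.
 * Returns the number of digest bytes written, or 0 if `digest` is too small. */
size_t hash_bytes(HashAlgorithmID id, const void *data, size_t length,
                  unsigned char *digest, size_t digest_size) {
    assert(data != NULL && digest != NULL);
    const HashAlgorithm *algorithm = &kHashAlgorithms[id];
    if (digest_size < algorithm->digest_length) return 0;
    /* The toy contexts fit in one byte; real ones would need a union
     * large enough for every supported context type. */
    uint8_t context;
    algorithm->init(&context);
    algorithm->update(&context, data, length);
    algorithm->final(digest, &context);
    return algorithm->digest_length;
}
```

The reading loop and the hex conversion would live once in the shared function, and adding an algorithm would mean adding one row to the table.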

Anyway, there you go! I hope you will find this useful.

25 Comments
Pierre September 17th, 2010

Thanks a lot for the great library and your help getting it to work ;). It’s doing exactly what I needed and it’s lightning fast too.

For those like me struggling to make it work in their iOS projects, I created (with Joel’s blessing) a GitHub repo with the necessary files. You can find it at http://github.com/Fuitad/FileMD5Hash

Joel September 17th, 2010

My pleasure! I’m glad you found that useful Pierre!

Andrey September 18th, 2010

How to use it with iOS? I’m getting error:

Undefined symbols:
"FileMD5HashCreateWithPath(__CFString const*, unsigned long)", referenced from:
-[MyAppDelegate createEditableCopyOfDatabaseIfNeeded] in MyAppDelegate.o
ld: symbol(s) not found
collect2: ld returned 1 exit status

Joel September 18th, 2010

@Andrey

Please make sure to add FileMD5Hash.c to the list of files that Xcode is supposed to compile for your target. One way to do that is to drag and drop FileMD5Hash.c to the “Compile Sources” build phase of your target.

Andrey September 18th, 2010

This didn’t work for me because I tried to use it in a .mm file. The solution is simple:

Just add this code to FileMD5Hash.h:

#if defined(__cplusplus)
    #define MYAPP_EXTERN extern "C"
#else
    #define MYAPP_EXTERN extern
#endif

and declare the function in FileMD5Hash.h like this:

MYAPP_EXTERN CFStringRef FileMD5HashCreateWithPath(CFStringRef filePath,
                                                   size_t chunkSizeForReadingData);

Thank you Joel for your MD5 code and for the solution on how to use it in .mm files!

Joel September 19th, 2010

No problem Andrey, thanks for reposting this trick in your comment!

Joel October 18th, 2010

I decided to fork Pierre’s GitHub repository, and to add a simple Xcode project that shows how to integrate this code with a simple iOS or Mac application. This should document in more detail things that I intentionally omitted in the blog post (to keep it simple, and more readable).

So, if you can’t figure out how to make this code work in your project, please, take a look at the FileMD5Hash GitHub repository.

And many thanks to Pierre for coming up with this great idea of a simple GitHub repository for this code!

Neil MacKenzie December 30th, 2010

Joel:

I am using your FileMD5Hash.c file (compiled only) as is in a product of ours. It is unclear to me what I need to include in order to be in compliance with your license.

We have a ReadMe file for our product, is putting a notice in there sufficient?

I do not speak legalese very well. How should this notice read?

Regards
Neil

Joan Lluch March 19th, 2011

Joel. Why can’t you just use the original code and wrap the readDataOfLength and related code into an autorelease pool allocation and release pair? Wouldn’t that be much clearer with the same effect?

Joan

Joel March 20th, 2011

@Neil You don’t even need to mention me in your README. The only thing I care about is that you keep my copyright notice in the source files, and that if you change the files in any way, you mention that in a comment in the source file. So just enjoy!

Joel March 20th, 2011

@Joan Your idea would work too, that’s true. However, when you say “much clearer”, I just want to say that it has to do with how familiar you are with Foundation and CoreFoundation. Some people might prefer to use CoreFoundation.

I don’t mind using CoreFoundation for some things, and this implementation is actually a little more efficient than what you’re suggesting. Cf. the Cocoa Fundamentals Guide:

“Because in iOS an application executes in a more memory-constrained environment, the use of autorelease pools is discouraged in methods or blocks of code (for example, loops) where an application creates many objects. Instead, you should explicitly release objects whenever possible.”

So I guess what I should tell you is this: if you feel more comfortable using Foundation level APIs and you don’t mind or can’t notice the slight performance hit, then you should definitely do it your way.

Joan Lluch March 20th, 2011

Joel. Actually I feel very comfortable with CoreFoundation. My background is raw ‘C’ and I even programmed in assembler, so you can imagine what kind of things I am used to. I even have a strong preference (we could call it obsession) for using Core Foundation collections instead of their Cocoa equivalents; specifically, I use NULL retain/release callbacks all the time on CFArrays and CFDictionaries.

Even when using cocoa I avoid explicit autorelease calls in my code. If I have to return a new object I always implement ‘create’ methods. I only leave implicit autoreleases when the object will be immediately retained anyway so the memory overhead is zero.

So my post was not really about what I would do but about what most developers might consider doing.

That said, I still believe that using Cocoa tends to be easier and more convenient for most developers. What the docs recommend about autorelease pools is precisely to avoid doing what the original code did, that is, actually *using* autorelease pools. By creating and draining an inner autorelease pool as per my suggestion, what we achieve is to release the objects right there, in fact avoiding the use of the global autorelease pool, which is what really has to be prevented.

At the end of the day we both are thinking alike and possibly using the same coding patterns, so that’s the important thing.

Joan

Federico Cappelli April 20th, 2011

Great work ;)

Alessandro Maestri May 5th, 2011

Hi Joel.

I use your trick with success. Great job.
But now I’ve a question for you: can I use your trick with a file on a remote site, then with a ‘filePath’ that is similar to ‘http://…’ ?

Thanks,
Alex.

danny May 12th, 2011

thanks for the code, it’s very helpful!

Best Regards,

Daniel Oliva

danny May 12th, 2011

Hey Everyone!,

Here is the easiest way that I had everything up and running,
1.) Download the FileMD5Hash.c & FileMD5Hash.h from the linked github.
2.) Xcode -> New Project -> Foundation Tool (Command Utility) ->Drag both .c & .h Files into Source folder in Xcode
3.) Follow Andrey's advice in regards to the modification of FileMD5Hash.h;

4.) Set the correct filePath (example: NSString *filePath = @"/Users/YourUserName/YourUserFile.pdf";)

5.) Run !

*If an EXC_BAD_ACCESS error occurs, it's probably because you're trying to CFRelease(md5hash) when md5hash is NULL; and md5hash would be NULL because CFReadStreamOpen probably failed…

Okay, lastly thanks! and sorry for blowing up your forum with an error message !

Regards,

Daniel

vinnyt August 31st, 2011

On line 49 (the uint8_t buffer[chunkSizeForReadingData]; declaration), is there any reason you are declaring the array inside the while loop? I would hope the compiler is smart enough to allocate that array once and keep it around. It just scares me a bit that the C compiler might be dumb enough to reallocate it at every iteration of the loop, and even though that is a small chunk, when you run that loop thousands of times it is going to be troublesome.

jacksonadams December 13th, 2011

thank you very, very much! lifesaver

jacksonadams December 13th, 2011

THANK YOU!!! lifesaver

Matt March 15th, 2013

Hey Joel, this is fantastic! One issue — I can’t seem to compile your sample app for Mac 64-bit. Any plans to get that working? Thanks so much!

Paul de Lange October 3rd, 2013

I modified your TGDFileHash program to include crc32 checksum. You can find it here:

https://gist.github.com/paul-delange/6808278

Feel free to take it or use as you want

Koichi November 10th, 2013

Great !
You are absolutely genius !!

Stefan December 18th, 2013

I do have compile problems/errors when compiling with ARC… are you experiencing the same?

e.g. implicit declaration as well as needed casts…

Joel January 20th, 2014

Hi Stefan,

Thanks for reporting this problem. Indeed, FileMD5Hash wasn’t ready to be used with ARC. Can you try again with the new API I just pushed to the GitHub project? It should just work now.

Thanks!

Brendan November 6th, 2014

Thanks so much for this. Esp. Paul with your CRC32 checksum addition. This saved me at least a day :)
