Multi-threaded applications have traditionally been complex to build.  However, the .NET Framework 4.0 has greatly simplified the process of creating threaded applications.  Given that most machines today have multiple processor cores, it is especially important to start writing applications that take full advantage of that processing power.  The example below uses the “System.Threading.Tasks.Parallel” class to report on file usage across a file system.  This was written for a client who wanted to report on several pieces of information related to file shares allocated to company departments:

  1. Number of files and kilobytes used per file owner (who is using the space)
  2. Number of files and kilobytes per file extension (useful to determine how many audio files, video files, Office documents, etc.)
  3. Age of files grouped into several buckets (0 to 6 months, 6 to 12 months, 12 to 24 months, and over 24 months)

The first method, “CrawlDirectories”, accepts the root directory where you want to start crawling and a flag indicating whether you want to retrieve file owner information for each file.

private void CrawlDirectories(string RootDirectory, bool getFileOwners)
{
    // get a directory object
    try
    {
        DirectoryInfo di = new DirectoryInfo(RootDirectory);

        // loop through directories and look at files
        try
        {
            DirectoryInfo[] myDi = di.GetDirectories();

            Parallel.ForEach(myDi, folder =>
            {
                //look at files
                AddFiles(folder.GetFiles(), getFileOwners);

                //see if subdirectories to crawl
                CrawlDirectories(folder.FullName, getFileOwners);
            });
        }

        catch (Exception ex)
        {
            myLog("Error getting subdirectories: " + di.FullName + "; " + ex.Message, true);
        }
    }
    catch (Exception ex)
    {
        myLog("Error getting directory Info: " + RootDirectory + "; error: " + ex.Message, true);
    }
}

Notice the “Parallel.ForEach” loop above.  This defines a parallel loop: the framework schedules the loop body across multiple threads to crawl through the directories.  Within this loop, we call the “AddFiles” method, which loops through the files within the current directory, and then “CrawlDirectories” calls itself to recursively crawl any subdirectories within the current directory.  Monitoring my machine resources, I can see on average 70 threads spin up to crawl the directories.  All of this is handled automatically by the framework!
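To see the parallel loop in isolation, here is a minimal sketch that is not part of the crawler.  The class name, the “ProcessAll” helper, and the folder names are made up for illustration.  It records which managed thread handled each item and optionally caps the number of workers with “ParallelOptions.MaxDegreeOfParallelism”.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelLoopSketch
{
    // Hypothetical helper: processes every item in parallel and records
    // which managed thread handled it.
    public static ConcurrentDictionary<string, int> ProcessAll(string[] items, int maxWorkers)
    {
        var handledBy = new ConcurrentDictionary<string, int>();

        // Optional cap on concurrent workers; without it the framework decides.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };

        // Parallel.ForEach returns only after every item has been processed.
        Parallel.ForEach(items, options, item =>
        {
            handledBy[item] = Thread.CurrentThread.ManagedThreadId;
        });

        return handledBy;
    }

    public static void Main()
    {
        // Made-up folder names standing in for real directories.
        string[] folders = { "Accounting", "Engineering", "Marketing", "Sales" };

        var handledBy = ProcessAll(folders, Environment.ProcessorCount);
        foreach (var pair in handledBy)
        {
            Console.WriteLine(pair.Key + " was processed on thread " + pair.Value);
        }
    }
}
```

The thread IDs and their order will differ from run to run, since the framework decides how to distribute the work.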

The next method, “AddFiles”, will use a parallel loop to report on the files within the directories.  When multiple threads update arrays or other types of collections at the same time, you must use thread-safe collections!  I have chosen to use “System.Collections.Concurrent.ConcurrentDictionary” to store the file data I’m collecting.  This is basically a collection of key/value pairs that is thread-safe and can be accessed by multiple threads concurrently.

private void AddFiles(FileInfo[] filesInDir, bool getFileOwners)
{
    Parallel.ForEach(filesInDir, fi =>
    {
        try
        {
            //calculate # of files and bytes per file owner
            if (getFileOwners)
            {
                IdentityReference NTAccountName = fi.GetAccessControl(AccessControlSections.Owner).GetOwner(typeof(NTAccount));
                //declared inside the loop body so threads do not overwrite each other's value
                string ownerName = NTAccountName.Value.ToUpper();
                ownerBytesDC.AddOrUpdate(ownerName, fi.Length, (key, old) => old + fi.Length);
                ownerCountDC.AddOrUpdate(ownerName, 1, (key, old) => old + 1);
            }

            //calculate file extension # of files and bytes
            //(also a local, for the same thread-safety reason)
            string fileExtension = fi.Extension.Replace(".", "").Replace(",", "").ToUpper();
            extensionBytesDC.AddOrUpdate(fileExtension, fi.Length, (key, old) => old + fi.Length);
            extensionCountDC.AddOrUpdate(fileExtension, 1, (key, old) => old + 1);

            //calculate ages of files
            if (fi.LastAccessTime >= DateTime.Now.AddMonths(-6))
            {
                System.Threading.Interlocked.Add(ref TotalByte0to6mo, fi.Length);
                System.Threading.Interlocked.Add(ref NumberOfFiles0to6mo, 1);
            }
            else if (fi.LastAccessTime < DateTime.Now.AddMonths(-6) && fi.LastAccessTime >= DateTime.Now.AddMonths(-12))
            {
                System.Threading.Interlocked.Add(ref TotalByte6to12mo, fi.Length);
                System.Threading.Interlocked.Add(ref NumberOfFiles6to12mo, 1);
            }
            else if (fi.LastAccessTime < DateTime.Now.AddMonths(-12) && fi.LastAccessTime >= DateTime.Now.AddMonths(-24))
            {
                System.Threading.Interlocked.Add(ref TotalByte12to24mo, fi.Length);
                System.Threading.Interlocked.Add(ref NumberOfFiles12to24mo, 1);
            }
            else if (fi.LastAccessTime < DateTime.Now.AddMonths(-24))
            {
                System.Threading.Interlocked.Add(ref TotalByteOver24mo, fi.Length);
                System.Threading.Interlocked.Add(ref NumberOfFilesOver24mo, 1);
            }
        }
        catch (Exception ex)
        {
            myLog("Cannot report on file: " + fi.FullName + "; error: " + ex.Message, true);
        }
    });
}

We end up with several key/value collections:

  1. ownerBytesDC and ownerCountDC.  Both collections are keyed by user name; the values are the total bytes for that user and the count of that user’s files.
  2. extensionBytesDC and extensionCountDC.  Key is the file extension, with values of total bytes and file count.
  3. The remaining data is stored in Int64 variables for the age of files in each age bucket.
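The age counters rely on “System.Threading.Interlocked” rather than a collection.  The sketch below shows that technique on its own, simplified to two buckets.  The “AgeTotals” class, its field names, and the fixed sample date are assumptions for illustration and are not the fields used in the crawler.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public class AgeTotals
{
    // Hypothetical fields; they are long (Int64) so Interlocked.Add can accept file sizes.
    public long RecentBytes;
    public long RecentFiles;
    public long OlderBytes;
    public long OlderFiles;

    // Safe to call from many threads at once: each update is a single atomic operation.
    public void Record(DateTime lastAccess, long length, DateTime cutoff)
    {
        if (lastAccess >= cutoff)
        {
            Interlocked.Add(ref RecentBytes, length);
            Interlocked.Increment(ref RecentFiles);
        }
        else
        {
            Interlocked.Add(ref OlderBytes, length);
            Interlocked.Increment(ref OlderFiles);
        }
    }
}

public static class AgeBucketSketch
{
    // Hypothetical helper: tallies files into "last 6 months" and "older" in parallel.
    public static AgeTotals Tally(DateTime[] lastAccessTimes, long[] sizes, DateTime now)
    {
        var totals = new AgeTotals();
        DateTime cutoff = now.AddMonths(-6); // computed once, not once per file

        Parallel.For(0, sizes.Length, i => totals.Record(lastAccessTimes[i], sizes[i], cutoff));
        return totals;
    }

    public static void Main()
    {
        // Made-up sample data with a fixed "now" so the result is repeatable.
        DateTime now = new DateTime(2012, 1, 1);
        DateTime[] times = { now.AddMonths(-1), now.AddMonths(-9), now.AddMonths(-2) };
        long[] sizes = { 10, 20, 30 };

        AgeTotals t = Tally(times, sizes, now);
        Console.WriteLine(t.RecentFiles + " recent files, " + t.RecentBytes + " bytes"); // 2 recent files, 40 bytes
        Console.WriteLine(t.OlderFiles + " older files, " + t.OlderBytes + " bytes");    // 1 older files, 20 bytes
    }
}
```

A plain “RecentBytes += length” would not be safe here, because two threads could read the same old value and one of the updates would be lost.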

The .NET Framework 4.0 provides a great way to update a collection in a thread-safe manner.  The “ConcurrentDictionary” collection has a method “AddOrUpdate()” that will add the key if it doesn’t exist in the collection, or update its value if it’s already there.
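Here is a small stand-alone sketch of “AddOrUpdate()” being called from a parallel loop.  The “SumByExtension” helper and the sample data are invented for illustration; only the “AddOrUpdate” call mirrors what the crawler does.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class AddOrUpdateSketch
{
    // Hypothetical helper: sums sizes per extension while many threads update the same dictionary.
    public static ConcurrentDictionary<string, long> SumByExtension(string[] extensions, long[] sizes)
    {
        var totals = new ConcurrentDictionary<string, long>();

        Parallel.For(0, extensions.Length, i =>
        {
            long size = sizes[i];

            // First sighting of a key stores "size"; afterwards the lambda adds to the old value.
            totals.AddOrUpdate(extensions[i], size, (key, old) => old + size);
        });

        return totals;
    }

    public static void Main()
    {
        // Made-up sample data.
        string[] extensions = { "DOCX", "MP3", "DOCX", "MP3", "DOCX" };
        long[] sizes = { 100, 4000, 250, 6000, 50 };

        var totals = SumByExtension(extensions, sizes);
        Console.WriteLine("DOCX: " + totals["DOCX"]); // DOCX: 400
        Console.WriteLine("MP3: " + totals["MP3"]);   // MP3: 10000
    }
}
```

One thing to keep in mind: the update lambda can be invoked more than once when threads collide, so it should only compute the new value and not have side effects.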

You may be wondering why we have the “getFileOwners” boolean parameter on the method.  Getting the file owner of a file requires a call to the “GetAccessControl” method – which is part of the file security methods within Windows.  This operation is expensive!  It slows down the crawl by several-fold.  Thus, if you don’t require this information, you can turn it off.

This application can crawl through about two million files and three terabytes of data per hour (tested on a two-proc VMware machine).  More robust hardware could speed this up.

Easy multi-threaded applications – finally!