Saturday, April 4, 2015

Partitioning an enumearble into fixed size chunks

Recently at work, i was doing some db querying for items having a list of ids in the hand. Since I know that MSSQL Server allows only a limited number of parameters, i had to split my ids into fixed chunks, and execute several queries. The simple task of splitting an IEnumearble to fixed size chunks turned out to be not so straightforward, so i decided it is worth writing a short blog post about it. My initial solution was along the lines of:
     public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> source, int size)  
     {  
       // check arguments ...  
       var result = new List<T>(size);  
       foreach (var item in source)  
       {  
         result.Add(item);  
         if (result.Count == size)  
         {  
           yield return result;  
           result = new List<T>(size);  
         }  
       }  
       if (result.Count > 0)  
         yield return result;  
     }  
Now this seems quite straightforward, but I didn't like it too much, because it eagerly fills a result list before returning it. I was sure there is a lazier solution, so googled the thing, and got pretty surprised, that on SO most of the answers were either as bad as my initial solution, or even much worse.

My disappointment quickly turned into the decision, that I will do an implementation that satisfies what I would expect of this method, and it's result. I'll let you follow me through with the assumptions and the way I think is the best to grab this problem.
I want to simply write a unit test for each assumption I am making (I use NUnit), and then provide a simple solution that passes all the tests, and then invite you to make it more elegant/compact/readable/professional, or whatever else you think could be done to the code. So lets start:

1) I always like to start with my corner cases, as they are the easiest to implement, and once you implemented a functionality, it is boring to add them. Somehow before I don't mind that much. First assumption: for null enumerable I want an appropriate exception.
     [Test]  
     public void Partition_ForNullEnumerable_ThrowsArgumentNullException()  
     {  
       IEnumerable<long> source = null;  
       var ex = Assert.Throws<ArgumentNullException>(() => source.Partition(5).ToList(), "null source is not allowed");  
       Assert.AreEqual("source", ex.ParamName, "Error should reflect input field");  
     }  
Note that I need to call tolist, otherwise my method is never actually called.

2) For partition size of less than 1 I want to have an appropriate error:
     [Test]  
     public void Partition_ForPartitionSizeBelow1_ThrowsArgumentException()  
     {  
       var source = new List<long>();  
       var ex = Assert.Throws<ArgumentException>(() => source.Partition(0).ToList(), "partition cannot be below 1");  
       StringAssert.Contains("0", ex.Message, "Message should reflect error");  
       StringAssert.Contains("size", ex.Message, "Message should reflect error");  
     }  
so far so good.

3) So now that we have the boring stuff behind us, let's write a series of tests that reflect how I expect my method to behave. Since most of my tests will test lazyness, I will make an enumerable, where I can easily check which items have been "produced". I will use a simple context class:
     class TestContext  
     {  
       /// <summary>  
       /// the collection of all our letters  
       /// </summary>  
       public const string SourceLetters = "0123456789ABCDEF";  
       public IEnumerable<char> Source { get; private set; }  
       public string ProducedItems { get { return builder.ToString(); } }  
       public TestContext()  
       {  
         builder = new StringBuilder();  
         Source = SourceLetters  
           .ToCharArray()  
           .Select(x =>  
           {  
             builder.Append(x);  
             return x;  
           });  
       }  
       private StringBuilder builder;  
     }  
     private TestContext GetContext()  
     {  
       return new TestContext();  
     }  
This way i can easily assert on the ProducedItems.

4) Next assumption: just calling my method should not produce anything.
     [Test]  
     public void Partition_CallingMethod_DoesNotEnumerateSource()  
     {  
       var ctx = GetContext();  
       var result = ctx.Source.Partition(5);  
       Assert.AreEqual(string.Empty, ctx.ProducedItems, "No items should have been produced before enumeration");  
     }  
Since I think all of the solutions using yield return would pass this, lets quickly move on.

5) Check for correctness: Partitioning to the whole set will give me the correct items.
     [Test]  
     public void Partition_ToSizeOfSet_GivesCorrectResult()  
     {  
       var ctx = GetContext();  
       var result = ctx.Source.Partition(TestContext.SourceLetters.Length);  
       Assert.AreEqual(1, result.Count(), "we should get only one partition");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.ToCharArray(), result.First().ToArray(), "The first partition should have all the items");  
     }  

6) what about when size is even bigger?
     [Test]  
     public void Partition_ToSizeBiggerThanSet_GivesCorrectResult()  
     {  
       var ctx = GetContext();  
       var result = ctx.Source.Partition(TestContext.SourceLetters.Length + 5);  
       Assert.AreEqual(1, result.Count(), "we should get only one partition");  
       CollectionAssert.AreEqual(TestContext.SourceLetters, result.First(), "The first partition should have all the items");  
     }  

7) ok, now partition to equal parts, that divide the total count
     [Test]  
     public void Partition_To4Parts_GivesCorrectResult()  
     {  
       var ctx = GetContext();  
       var result = ctx.Source.Partition(4);  
       Assert.AreEqual(4, result.Count(), "splitting 16 letters into sizes of 4 should result in 4 partitions");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(0, 4).ToCharArray(), result.First().ToArray(), "The first partition has invalid result");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(4, 4).ToCharArray(), result.Skip(1).First().ToArray(), "The second partition has invalid result");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(8, 4).ToCharArray(), result.Skip(2).First().ToArray(), "The third partition has invalid result");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(12, 4).ToCharArray(), result.Skip(3).First().ToArray(), "The fourth partition has invalid result");  
     }  

8) sweet, and the last: partition to parts that don't divide exactly
     [Test]  
     public void Partition_ToPartsOf5_GivesCorrectResult()  
     {  
       var ctx = GetContext();  
       var result = ctx.Source.Partition(5);  
       Assert.AreEqual(4, result.Count(), "splitting 16 letters into sizes of 5 should result in 4 partitions");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(0, 5), result.First(), "The first partition has invalid result");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(5, 5), result.Skip(1).First(), "The second partition has invalid result");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(10, 5), result.Skip(2).First(), "The third partition has invalid result");  
       CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(15, 1), result.Skip(3).First(), "The fourth partition has invalid result");  
     }  
So far so good. If a solution doesn't pass all these tests, then it is definately wrong.

9) Now I want to make sure, that I get correct results even if i enumerate my results in a different order than expected.
     [Test]  
     public void Partition_EnumeartingInReverseOrder_GivesCorrectResult()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(8).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first element  
         var result1 = enumerator.Current;  
         enumerator.MoveNext(); // go to second element  
         var result2 = enumerator.Current;  
         CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(8, 8).ToCharArray(), result2.ToArray(), "The second partition has invalid result");  
         CollectionAssert.AreEqual(TestContext.SourceLetters.Substring(0, 8).ToCharArray(), result1.ToArray(), "The first partition has invalid result");  
       }  
     }  
The funny thing is, I have seen accepted SO answers, that would not pass this test.

Now it is time to start testing the lazyness of our solution. This is where my initial solution starts to fail, and where I will have to start writing a new solution.

10) So if i get the first partition, I want to produce only the first item from the original enumerable. I need to produce it, otherwise the system cannot know, if there is at least one partition, or none...
     [Test]  
     public void Partition_GettingFirstPartition_OnlyProduces1Item()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(8).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first element  
         var result1 = enumerator.Current;  
         Assert.AreEqual("0", ctx.ProducedItems, "Only the first item should have been produced");  
       }  
     }  
Here we will already fail. Our initial solution will produce all 8 items in the partition, before returning it. And this is exactly my concern. If the "production" of these items is expensive, and I am looking only for the first item that fulfills my conditions, I do not want to produce all items in the partition. Makes no sense.

11) Now I want to test if I start enumerating the first partition, getting the first item will not reproduce that item, and will not produce any other ones.
     [Test]  
     public void Partition_GettingFirstElementOfFirstPartition_OnlyProduces1Item()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(8).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first element  
         var result1 = enumerator.Current;  
         using(var enumerator1 = result1.GetEnumerator()){  
           enumerator1.MoveNext(); // go to first element  
           var result1_1 = enumerator1.Current;  
         }  
         Assert.AreEqual("0", ctx.ProducedItems, "Only the first item should have been produced");  
       }  
     }  

12) I want to make sure, that when I enumerate the first partition, I am producing the items in a lazy manner. So that getting the second Item from the first partition creates only 2 items:
     [Test]  
     public void Partition_GettingFirst2ElementsOfFirstPartition_OnlyProduces2Items()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(8).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first element  
         var result1 = enumerator.Current;  
         using(var enumerator1 = result1.GetEnumerator()){  
           enumerator1.MoveNext(); // go to first element  
           enumerator1.MoveNext(); // go to second element  
           var result1_1 = enumerator1.Current;  
         }  
         Assert.AreEqual("01", ctx.ProducedItems, "Only the first 2 items should have been produced");  
       }  
     }  

13) and the same for the third
     [Test]  
     public void Partition_GettingFirst3ElementsOfFirstPartition_OnlyProduces3Items()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(8).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first element  
         var result1 = enumerator.Current;  
         using(var enumerator1 = result1.GetEnumerator()){  
           enumerator1.MoveNext(); // go to first element  
           enumerator1.MoveNext(); // go to second element  
           enumerator1.MoveNext(); // go to third element  
           var result1_1 = enumerator1.Current;  
         }  
         Assert.AreEqual("012", ctx.ProducedItems, "Only the first 3 items should have been produced");  
       }  
     }  

14) now I want to make sure, that enumerating through the first partition will not produce any items from the second partition, as it is not necessary
     [Test]  
     public void Partition_EnumeratingAllFirstPartition_OnlyProducesItemsFromThere()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(3).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first element  
         var result1 = enumerator.Current.ToList();  
         Assert.AreEqual("012", ctx.ProducedItems, "Only the first 3 items should have been produced, as that is our partition size");  
       }  
     }  

15) so far so good. I understand that to get to the second partition, all items of the first partition have to be produced, since an enumerable only allows goind forward by one, but I want to make sure, that the second partition is not enumerated to the end as the first one
     [Test]  
     public void Partition_GettingSecondPartition_OnlyProducesItemsToTheFirstItemOfPartition2()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(3).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first partition  
         enumerator.MoveNext(); // go to second partition  
         var result2 = enumerator.Current;  
         Assert.AreEqual("0123", ctx.ProducedItems, "Only the first 4 items should have been produced, as the second partition starts at item 4");  
       }  
     }  

16) now i want to have the same lazy production for partition 2 as for partition 1
     [Test]  
     public void Partition_EnumerationgSecondPartition_ProducesItemsLazily()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(4).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first partition  
         enumerator.MoveNext(); // go to second partition  
         using (var enumerator2 = enumerator.Current.GetEnumerator())  
         {  
           enumerator2.MoveNext(); // go to first item  
           enumerator2.MoveNext(); // go to second item  
           enumerator2.MoveNext(); // go to third item  
           var result2_3 = enumerator.Current;  
         }  
         Assert.AreEqual("0123456", ctx.ProducedItems, "Only the first 7 items should have been produced");  
       }  
     }  

17) and now I want to make sure, that even if I enumerate first the second partition, and then the first, the items of the first partition are not reproduced (meaning my original enumeratble is not restarted). I think in most of the cases, that is the expected behavior, although I could also imagine, that when the items are cheap to produce, but require a lot of memory, then maybe the expected behavior would be to reproduce the items. But in our usecase here, we assume that producing is more expensive than having the items around.
     [Test]  
     public void Partition_EnumerationgFirstPartitionAfterSecond_DoesNotReproduceItem()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(4).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first partition  
         var result1 = enumerator.Current;  
         enumerator.MoveNext(); // go to second partition  
         var result2 = enumerator.Current.ToList(); // enumerate 2nd partition  
         var list1 = result1.ToList(); // enumerate first partition  
         Assert.AreEqual("01234567", ctx.ProducedItems, "The first partition items should not be reproduced");  
       }  
     }  

18) now we are alredy quite far in our requirements. I would actually also like to reenumerate any partition without enumerating my original source, so no reproduction of items.
     [Test]  
     public void Partition_MultiplePartitionEnumeartion_DoesNotReproduceItem()  
     {  
       var ctx = GetContext();  
       using (var enumerator = ctx.Source.Partition(4).GetEnumerator())  
       {  
         enumerator.MoveNext(); // go to first partition  
         var result1 = enumerator.Current;  
         var list1 = result1.ToList(); // enumerate first partition  
         var list2 = result1.ToList(); // enumerate first partition again  
         Assert.AreEqual("0123", ctx.ProducedItems, "The first partition items should not be reproduced");  
       }  
     }  

I think this would conclude my assumptions about the partitioning. What about the solution? I would invite you to take these tests, and come up with a method that passes all of them.
If you are done with it, you can come back, and compare it to my solution. If you find yours compacter, easier to understand, or simply more elegant, I would be very interested to see your code.

Mine looks like:
   public static class Extensions  
   {  
     /// <summary>  
     /// cached enumeration with push possibilities  
     /// </summary>  
     private class CachedEnumeration<T> : IEnumerable<T>  
     {  
       /// <summary>  
       /// enumerator for the cachedEnumeration class  
       /// </summary>  
       class CachedEnumerator : IEnumerator<T>  
       {  
         private readonly CachedEnumeration<T> m_source;  
         private int m_index;  
         public CachedEnumerator(CachedEnumeration<T> source)  
         {  
           m_source = source;  
           // start at index -1, since an enumerator needs to start with MoveNext before calling current  
           m_index = -1;  
         }  
         public T Current  
         {  
           get { return m_source.m_items[m_index]; }  
         }  
         public void Dispose()  
         {  
         }  
         object System.Collections.IEnumerator.Current  
         {  
           get { return Current; }  
         }  
         public bool MoveNext()  
         {  
           // if we have cached items, just increase our index  
           if (m_source.m_items.Count > m_index + 1)  
           {  
             m_index++;  
             return true;  
           }  
           else  
           {  
             var result = m_source.FetchOne();  
             if (result)  
               m_index++;  
             return result;  
           }  
         }  
         public void Reset()  
         {  
           m_index = -1;  
         }  
       }  
       /// <summary>  
       /// list containing all the items  
       /// </summary>  
       private readonly List<T> m_items;  
       /// <summary>  
       /// callback how to fetch an item  
       /// </summary>  
       private readonly Func<Tuple<bool, T>> m_fetchMethod;  
       private readonly int m_targetSize;  
       public CachedEnumeration(int size, T firstItem, Func<Tuple<bool, T>> fetchMethod)  
       {  
         m_items = new List<T>(size);  
         m_items.Add(firstItem);  
         m_fetchMethod = fetchMethod;  
         m_targetSize = size;  
       }  
       public IEnumerator<T> GetEnumerator()  
       {  
         return new CachedEnumerator(this);  
       }  
       System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()  
       {  
         return GetEnumerator();  
       }  
       private bool FetchOne()  
       {  
         if (IsFull)  
           return false;  
         var result = m_fetchMethod();  
         if (result.Item1)  
           m_items.Add(result.Item2);  
         return result.Item1;  
       }  
       /// <summary>  
       /// fetches all items to the cached enumerable  
       /// </summary>  
       public void FetchAll()  
       {  
         while (FetchOne()) { }  
       }  
       /// <summary>  
       /// tells weather the enumeration is already full  
       /// </summary>  
       public bool IsFull { get { return m_targetSize == m_items.Count; } }  
     }  
     /// <summary>  
     /// partitions the <paramref name="source"/> to parts of size <paramref name="size"/>  
     /// </summary>  
     public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> source, int size)  
     {  
       if (source == null)  
         throw new ArgumentNullException("source");  
       if (size < 1)  
         throw new ArgumentException(string.Format("The specified size ({0}) is invalid, it needs to be at least 1.", size), "size");  
       var enumerator = source.GetEnumerator();  
       while (enumerator.MoveNext())  
       {  
         var lastResult = new CachedEnumeration<T>(size, enumerator.Current, () => Tuple.Create(enumerator.MoveNext(), enumerator.Current));  
         yield return lastResult;  
         lastResult.FetchAll();  
       }  
     }  
   }  
What do you think?
I dont like too much, that the enumerator is not in a using statement, but if it would be in there, then the call to .First() on the result of this method would destroy the enumerator in the using, and then the first set would not produce any more items. Maybe somehow it is possible to wrap the enumerator in another Disposable class, that would remember somehow that this enumerator is in use somewhere else, and then would allow disposing of the enumerator, if it is not needed anymore at all, and not just let it hang waiting for the GC to find it....
I'll explore it in another blog entry :)

Well, that was it for today. You can download the source code here.
And as always, I'm open for feedback!!!!

Thanks for reading.