1. Slow for larger queries
2. Error messages need to be more informative
3. Community support is limited
4. A source and a sink always need to be present (every script reads via LOAD and writes via STORE or DUMP)
5. The errors Pig produces for Python UDFs are especially unhelpful. When something goes wrong, it just reports a generic exec error in the UDF, even when the actual problem is a syntax or type error, let alone a logical one. This is a big one.
If you have UDFs that you want to parallelize over large amounts of data, you are in luck. Use Pig as the base pipeline that does the hard work, and just apply your UDF in the step where you need it.
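As a minimal sketch of such a UDF: the function below is a hypothetical example (the name `normalize` and the file `udfs.py` are my own illustration, not from the original). In a real deployment the file would start with `from pig_util import outputSchema` and the function would carry an `@outputSchema(...)` decorator; that dependency is omitted here so the sketch stands alone.

```python
# udfs.py -- a hypothetical Python UDF that Pig could register and then
# apply in parallel across the cluster. In real Pig usage this would be
# decorated with @outputSchema('clean:chararray') from pig_util.

def normalize(raw):
    """Trim whitespace and lowercase a text field; pass None through."""
    if raw is None:
        return None
    return raw.strip().lower()
```

In the Pig script you would then register and call it with something like `REGISTER 'udfs.py' USING jython AS myfuncs;` followed by `cleaned = FOREACH records GENERATE myfuncs.normalize(name);`, and Pig handles the parallelization.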
Lazy evaluation: unless you produce an output file or emit some output (e.g. via STORE or DUMP), nothing actually gets evaluated. This is an advantage for the logical plan: the optimizer can look at the program from beginning to end and produce an efficient plan to execute.
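A rough analogy, assuming nothing Pig-specific: Python generator pipelines behave the same way, in that each step only describes a transformation and no work happens until a sink consumes the result.

```python
# Rough analogy to Pig's lazy evaluation using Python generators:
# each step merely describes a transformation; nothing runs until a
# "sink" (here, list()) actually consumes the pipeline.

def load(lines):
    # analogous to LOAD: describe how records are read
    return (line.strip() for line in lines)

def filter_nonempty(records):
    # analogous to FILTER: describe a condition, don't apply it yet
    return (r for r in records if r)

raw = ["  a ", "", " b"]
pipeline = filter_nonempty(load(raw))  # no work has been done yet

result = list(pipeline)  # evaluation happens here, like STORE/DUMP
# result == ["a", "b"]
```

Because the whole pipeline is visible before anything runs, an optimizer (Pig's logical plan) is free to reorder or fuse the steps.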
It enjoys everything that Hadoop offers, parallelization and fault tolerance, combined with many relational-database features.
It is also a good fit if you want to apply statistics to your dataset. The functional programming paradigm maps quite naturally onto pipeline processing, so I expect it to be quite successful there.
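To make the statistics point concrete, here is a hedged sketch (the data and names are my own illustration) of the kind of per-group statistic Pig expresses with `GROUP ... FOREACH ... GENERATE AVG(...)`, written in plain Python to show how naturally the pipeline style fits:

```python
# Per-group average, the sort of statistic Pig computes declaratively.
# The rows here are made-up (key, value) records for illustration.

from collections import defaultdict

rows = [("a", 2.0), ("a", 4.0), ("b", 10.0)]

# group step: collect values by key (Pig: GROUP rows BY key)
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# aggregate step (Pig: FOREACH grouped GENERATE group, AVG(rows.value))
averages = {key: sum(vals) / len(vals) for key, vals in groups.items()}
# averages == {"a": 3.0, "b": 10.0}
```

In Pig Latin the same computation is roughly `grouped = GROUP rows BY key;` then `stats = FOREACH grouped GENERATE group, AVG(rows.value);`, and Pig parallelizes it across the cluster for you.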