Pig Over Hive

by admin on January 7th, 2012

Hive is one of my favorite tools for crunching data on Hadoop. Its SQL-like interface makes it really easy to get started. It supports many input file formats. It is pretty robust and has excellent features such as dynamic partitions and filter pushdown.

Pig is another tool you can use to run analytics on Hadoop. I never quite knew why one would ever use Pig over Hive. I recently got a chance to explore Pig, and I used the opportunity to look for reasons that favor Pig over Hive. I did manage to put together a small list:

  1. Pig lets you commit data at arbitrary points in a script. It does not process any data until you call store or dump. Hive, by contrast, processes data and produces output for every query executed; the output is either stored or streamed to stdout.
  2. Pig supports unstructured and semi-structured data. Hive always requires imposing a schema on the input.
  3. For a series of steps that form an ETL process, Pig’s procedural syntax looks cleaner than Hive’s declarative syntax. Expressing an ETL process as either a series of Hive queries or one huge composite query quickly becomes complex.
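
The first and third points can be illustrated with a toy model. The sketch below is plain Python, not Pig or Hive: the class `LazyRelation` and its methods `filter`, `foreach` and `store` are hypothetical names chosen to echo Pig's operators, and the sample rows are made up. It only shows the general idea of recording steps one at a time and running nothing until a store is requested.

```python
class LazyRelation:
    """Toy stand-in for a Pig relation (hypothetical, not a real Pig API).

    Each transformation is only recorded; nothing runs until store() is called.
    """

    runs = 0  # how many times a recorded pipeline was actually executed

    def __init__(self, rows, steps=()):
        self._rows = list(rows)
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record the filter step and return a new relation without running it.
        step = lambda rows: [r for r in rows if predicate(r)]
        return LazyRelation(self._rows, self._steps + (step,))

    def foreach(self, func):
        # Record a per-row transformation, again without running it.
        step = lambda rows: [func(r) for r in rows]
        return LazyRelation(self._rows, self._steps + (step,))

    def store(self):
        # Only here are the recorded steps applied, in order.
        LazyRelation.runs += 1
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
        return rows


# Made-up sample data: (page, load time in ms).
logs = LazyRelation([("home", 120), ("cart", 30), ("home", 45)])

# Each step is named and written on its own line, in procedural style.
slow = logs.filter(lambda r: r[1] > 40)
labelled = slow.foreach(lambda r: (r[0].upper(), r[1]))

runs_before_store = LazyRelation.runs  # nothing has executed yet
result = labelled.store()              # the whole pipeline runs here
```

In this model `slow` could also be stored on its own, which loosely mirrors committing an intermediate result at an arbitrary point in a script.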
