Pig-cookbook

Published: January 7, 2013
Pig Cookbook
Table of contents
1 Overview
2 Performance Enhancers

Copyright © 2007 The Apache Software Foundation. All rights reserved.

1. Overview
This document provides hints and tips for Pig users.

2. Performance Enhancers
2.1. Use Optimization
Pig supports various optimization rules which are turned on by default. Become familiar with these rules.

2.2. Use Types
If types are not specified in the load statement, Pig assumes the type double for numeric computations. Much of the time your data fits a much smaller type, such as integer or long. Specifying the real type speeds up arithmetic computation and has the additional advantage of early error detection.
-- Query 1
A = load 'myfile' as (t, u, v);
B = foreach A generate t + u;
-- Query 2
A = load 'myfile' as (t: int, u: int, v);
B = foreach A generate t + u;

The second query will run more efficiently than the first. In some of our queries we have seen a 2x speedup.
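
As a rough sketch of what the untyped query amounts to, the arithmetic in Query 1 behaves approximately as if each operand had been cast to double. When the schema cannot be declared in the load statement, an explicit cast is another way to state the real type. The aliases below are illustrative and reuse the 'myfile' input from the queries above.

```pig
A = load 'myfile' as (t, u, v);
-- Untyped fields in arithmetic are treated as double,
-- roughly equivalent to writing the casts out by hand:
B = foreach A generate (double)t + (double)u;
-- Casting to the real type of the data instead:
C = foreach A generate (int)t + (int)u;
```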

2.3. Project Early and Often
Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like:
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
D = group C by u;
E = foreach D generate group, COUNT($1);

There is no need for v, y, or z to participate in this query. And there is no need to carry both t and x past the join; just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by Pig.

A = load 'myfile' as (t, u, v);
A1 = foreach A generate t, u;
B = load 'myotherfile' as (x, y, z);
B1 = foreach B generate x;
C = join A1 by t, B1 by x;
C1 = foreach C generate t, u;
D = group C1 by u;
E = foreach D generate group, COUNT($1);

Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.

2.4. Filter Early and Often
As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline.
-- Query 1
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = filter A by t == 1;
D = join C by t, B by x;
E = group D by u;
F = foreach E generate group, COUNT($1);
-- Query 2 (filter applied only after the join)
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
D = filter C by t == 1;
E = group D by u;
F = foreach E generate group, COUNT($1);

The first query is clearly more efficient than the second one because it reduces the amount of data going into the join. One case where pushing filters up might not be a good idea is when the cost of applying the filter is very high and only a small amount of data is filtered out.
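
The same reasoning can be applied to the other input of the join. In the sketch below (the aliases A1 and B1 are illustrative), every joined row has t equal to x, so filtering 'myotherfile' on x == 1 removes only rows that could never match, and it shrinks both sides of the join.

```pig
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
-- Filter both inputs before the join.
A1 = filter A by t == 1;
B1 = filter B by x == 1;
C = join A1 by t, B1 by x;
D = group C by u;
E = foreach D generate group, COUNT($1);
```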

2.5. Reduce Your Operator Pipeline
For clarity of your script, you might choose to split your projections into several steps, for instance:
A = load 'data' as (in: map[]);
-- get key out of the map
B = foreach A generate in#k1 as k1, in#k2 as k2;
-- concatenate the keys
C = foreach B generate CONCAT(k1, k2);
.......

While the example above is easier to read, you might want to consider combining the two foreach statements to improve your query performance:
A = load 'data' as (in: map[]);
-- concatenate the keys from the map
B = foreach A generate CONCAT(in#k1, in#k2);
....

The same goes for filters.
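
As a minimal sketch of the filter case, assuming the 'myfile' input used earlier and purely illustrative thresholds, two consecutive filter statements can be merged into a single filter with a combined condition:

```pig
-- Variant 1: two separate filter steps
A = load 'myfile' as (t: int, u: int, v);
B = filter A by t > 0;
C = filter B by u < 10;

-- Variant 2: the same condition in one filter
A = load 'myfile' as (t: int, u: int, v);
B = filter A by t > 0 and u < 10;
```

Both variants keep the same rows; the second expresses the condition in one operator.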

2.6. Make Your UDFs Algebraic
Queries that can take advantage of the combiner generally run much faster (sometimes several...