commit c7d8b47db50e65a2fa09f036ccfb5a5f8a82a951
Author: Jenkins
Date:   Fri May 15 10:15:01 2015 -0700

    Branch for CDH5.3.4

commit 2957cf6fb78a832c2e330771a68de7cac697a1cb
Author: Jenkins
Date:   Mon Mar 30 17:38:22 2015 -0700

    Updating Maven version to 5.3.4-SNAPSHOT

commit fb719d34b930b2092c2449e10becba11b0f747a5
Author: Ryan Blue
Date:   Thu Mar 12 11:07:20 2015 -0700

    CLOUDERA-BUILD. Fix parquet-scala parent version.

commit c54647e43f3dcb0899364e8b7cc067c1aab6d16a
Author: Jenkins slave
Date:   Wed Feb 11 13:04:02 2015 -0800

    Preparing for CDH5.3.3 development

commit 046693240581618403a02e5d7614b93576ec91c5
Author: Cheng Lian
Date:   Tue Feb 3 12:53:37 2015 -0800

    PARQUET-173: Fixes `StatisticsFilter` for `And` filter predicate

    [Review on Reviewable](https://reviewable.io/reviews/apache/incubator-parquet-mr/108)

    Author: Cheng Lian

    Closes #108 from liancheng/PARQUET-173 and squashes the following commits:

    d188f0b [Cheng Lian] Fixes test case
    be2c8a1 [Cheng Lian] Fixes `StatisticsFilter` for `And` filter predicate

commit 7854baa4b342c951427e8a705c3d0738f116efed
Author: Jim Carroll
Date:   Thu Jan 29 17:32:54 2015 -0800

    PARQUET-157: Divide by zero fix

    There is a divide-by-zero error in logging code inside the
    InternalParquetRecordReader. I've been running with this fixed for a
    while, but every time I revert I hit the problem again. I can't believe
    no one else has run into this problem. I submitted a JIRA ticket a few
    weeks ago but didn't hear anything on the list, so here's the fix.

    This also avoids building log statements inside the checkRead method of
    InternalParquetRecordReader in some cases where it's unnecessary. Also
    added a .gitignore entry to clean up a build artifact.

    Author: Jim Carroll

    Closes #102 from jimfcarroll/divide-by-zero-fix and squashes the following commits:

    423200c [Jim Carroll] Filter out parquet-scrooge build artifact from git.
    22337f3 [Jim Carroll] PARQUET-157: Fix a divide by zero error when Parquet runs quickly. Also avoid compiling log statements in some cases where it's unnecessary.

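The failure mode in PARQUET-157 is easy to picture: if a batch of records is read in under one millisecond, the elapsed time is 0 and computing a records-per-millisecond rate divides by zero. A minimal sketch of the kind of guard the fix describes — the names below are illustrative, not parquet-mr's actual InternalParquetRecordReader code:

```java
// Illustrative sketch only; names do not match parquet-mr's code.
public class ReadProgressLogger {
  private static final boolean LOG_ENABLED = true; // stand-in for LOG.isInfoEnabled()

  public static void logThroughput(long recordCount, long elapsedMillis) {
    if (!LOG_ENABLED) {
      return; // skip building the log message entirely when logging is off
    }
    if (elapsedMillis != 0) {
      System.out.println("read " + recordCount + " records in " + elapsedMillis
          + " ms: " + (recordCount / elapsedMillis) + " rec/ms");
    } else {
      // a fast read can finish in under one millisecond; dividing would throw
      System.out.println("read " + recordCount + " records in under 1 ms");
    }
  }

  public static void main(String[] args) {
    logThroughput(1000, 0); // previously: ArithmeticException: / by zero
  }
}
```
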
commit 7e189d320e3c94d87d917264075ab35d23b1c9ad
Author: Neville Li
Date:   Thu Jan 29 17:31:04 2015 -0800

    PARQUET-142: add path filter in ParquetReader

    Currently the parquet-tools command fails when the input is a directory
    containing a _SUCCESS file from MapReduce. Filtering those files out,
    like ParquetFileReader does, fixes the problem.

    ```
    parquet-cat /tmp/parquet_write_test
    Could not read footer: java.lang.RuntimeException: file:/tmp/parquet_write_test/_SUCCESS is not a Parquet file (too small)

    $ tree /tmp/parquet_write_test
    /tmp/parquet_write_test
    ├── part-m-00000.parquet
    └── _SUCCESS
    ```

    Author: Neville Li

    Closes #89 from nevillelyh/gh/path-filter and squashes the following commits:

    7377a20 [Neville Li] PARQUET-142: add path filter in ParquetReader

    Conflicts:
        parquet-hadoop/src/main/java/parquet/hadoop/ParquetReader.java
    Resolution:
        Line changed in PARQUET-84 (not backported) was next to a changed line.

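The fix applies the same kind of hidden-file filtering the commit credits to ParquetFileReader: skip files whose names mark them as bookkeeping output rather than data. A sketch of that idea (not the exact parquet-mr code):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Sketch: when listing a directory's data files, skip "hidden" entries such
// as _SUCCESS written by MapReduce. Mirrors the behavior described above;
// not the actual parquet-mr implementation.
public class HiddenPathFilter implements PathFilter {
  @Override
  public boolean accept(Path p) {
    String name = p.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}
```
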
commit 58e4d78325bb8e32884fb8dd794f402dc63ada72
Author: Chris Albright
Date:   Thu Jan 29 17:29:06 2015 -0800

    PARQUET-124: normalize path checking to prevent mismatch between URI and path

    Author: Chris Albright

    Closes #79 from chrisalbright/master and squashes the following commits:

    b1b0086 [Chris Albright] Merge remote-tracking branch 'upstream/master'
    9669427 [Chris Albright] PARQUET-124: Adding test (thanks Ryan Blue) that proves mergeFooters was failing
    8e342ed [Chris Albright] PARQUET-124: normalize path checking to prevent mismatch between URI and path

    Conflicts:
        parquet-hadoop/src/test/java/parquet/hadoop/TestParquetFileWriter.java
    Resolution:
        Import conflict only.

commit ae29756f28988aabf3c3f83e385ceda49402a7bb
Author: Yash Datta
Date:   Mon Jan 26 18:21:11 2015 -0800

    PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column chunk are null

    When a binary column contains only nulls, the statistics object read from
    the file metadata is empty, and the all-nulls check for the column should
    return true. Even if the column has no values, it can be ignored. The
    other way is to fix this behaviour in the writer, but is that what we
    want?

    Author: Yash Datta
    Author: Alex Levenson
    Author: Yash Datta

    Closes #99 from saucam/npe and squashes the following commits:

    5138e44 [Yash Datta] PARQUET-136: Remove unreachable block
    b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests"
    82209e6 [Yash Datta] PARQUET-161: Trigger tests
    aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case
    2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column
    c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column
    97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java"
    a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136
    b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter
    0c88be0 [Alex Levenson] revert TestStatisticsFilter.java
    1ac9192 [Yash Datta] PARQUET-136: It's better to not filter chunks for which an empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written by the current writer, which has a bug: it does not write the statistics if the column has all nulls
    e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column"
    8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls"
    c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls
    974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column

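The rule the discussion converges on (commit 1ac9192): an empty statistics object proves nothing, so a chunk with empty statistics must be kept rather than filtered. A simplified, hypothetical model of that check — types and names here are illustrative, not parquet-mr's:

```java
// Hypothetical, simplified model of the conservative rule: empty statistics
// cannot rule anything out, so "can we drop this chunk?" must answer no.
public class StatsFilterSketch {
  static class IntStats {
    final boolean empty;   // no statistics were written for this chunk
    final int min, max;
    IntStats(boolean empty, int min, int max) {
      this.empty = empty; this.min = min; this.max = max;
    }
  }

  // "value == v" predicate evaluated against a chunk's statistics
  static boolean canDropEq(IntStats stats, int v) {
    if (stats.empty) {
      return false; // can't prove the value is absent; keep the chunk
    }
    return v < stats.min || v > stats.max;
  }

  public static void main(String[] args) {
    System.out.println(canDropEq(new IntStats(true, 0, 0), 7));  // false: keep
    System.out.println(canDropEq(new IntStats(false, 1, 5), 7)); // true: drop
  }
}
```
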
commit f5773277dd8de1a294fe96c7d93ff412dd550544
Author: Cheng Lian
Date:   Fri Jan 23 16:20:10 2015 -0800

    PARQUET-168: Fixes parquet-tools command line option description

    [Review on Reviewable](https://reviewable.io/reviews/apache/incubator-parquet-mr/106)

    Author: Cheng Lian

    Closes #106 from liancheng/PARQUET-168 and squashes the following commits:

    4524f2d [Cheng Lian] Fixes command line option description

commit 682aaf7a409620d221ac2bfa1ba825e45af47b84
Author: Jenkins slave
Date:   Mon Jan 26 08:52:20 2015 -0800

    Preparing for CDH5.3.2 development

commit 9c200977af281e9baa74c19788bca0148f37a132
Author: Wolfgang Hoschek
Date:   Thu Dec 11 14:01:27 2014 -0800

    PARQUET-145: InternalParquetRecordReader.close() should not throw an exception if initialization has failed

    Author: Wolfgang Hoschek

    Closes #93 from whoschek/PARQUET-145-3 and squashes the following commits:

    52a6acb [Wolfgang Hoschek] PARQUET-145 InternalParquetRecordReader.close() should not throw an exception if initialization has failed

commit a61b1994a9022b0141ae7bc1ab3ba6d1d03a752c
Author: Josh Wills
Date:   Tue Dec 2 16:19:14 2014 +0000

    PARQUET-140: Allow clients to control the GenericData instance used to read Avro records

    Author: Josh Wills

    Closes #90 from jwills/master and squashes the following commits:

    044cf54 [Josh Wills] PARQUET-140: Allow clients to control the GenericData object that is used to read Avro records

commit c95c368235ed78b7574faad1f3c66e40e217cec9
Author: Jenkins slave
Date:   Mon Dec 1 17:57:45 2014 -0800

    Preparing for CDH5.3.1 development

commit 47d25c01db9c076c54bf5e72560e4376dd77ab42
Author: Brock Noland
Date:   Thu Nov 20 09:19:25 2014 -0800

    PARQUET-114: Sample NanoTime class serializes and deserializes Timestamp incorrectly

    I ran the Parquet Column tests and they passed. FYI @rdblue

    Author: Brock Noland

    Closes #71 from brockn/master and squashes the following commits:

    69ba484 [Brock Noland] PARQUET-114 - Sample NanoTime class serializes and deserializes Timestamp incorrectly

commit 2bd7475fbcc9038234dc701a8659a0d7a4ff809b
Author: Ryan Blue
Date:   Tue Nov 18 20:20:04 2014 -0800

    PARQUET-132: Add type parameter to AvroParquetInputFormat.

    Author: Ryan Blue

    Closes #84 from rdblue/PARQUET-132-parameterize-avro-inputformat and squashes the following commits:

    63114b0 [Ryan Blue] PARQUET-132: Add type parameter to AvroParquetInputFormat.

commit d02714ae823a70bc83562a51a178f9fadc6a01dc
Author: elif dede
Date:   Mon Nov 17 16:53:08 2014 -0800

    PARQUET-135: Input location is not getting set for getStatistics in ParquetLoader when using two different loaders within a Pig script.

    Author: elif dede

    Closes #86 from elifdd/parquetLoader_error_PARQUET-135 and squashes the following commits:

    b0150ee [elif dede] fixed white space
    bdb381a [elif dede] PARQUET-135: Call setInput from getStatistics in ParquetLoader to fix ReduceEstimator errors in pig jobs

    Conflicts:
        parquet-hadoop/src/test/java/parquet/format/converter/TestParquetMetadataConverter.java
    Resolution:
        Upstream patch 251a495 has only a whitespace change for this file; the conflict was in an area not backported. No change to the file.

commit 8556c86f6410dd0fcfbb63cd6af6c6dd095e2832
Author: Alex Levenson
Date:   Mon Sep 22 11:11:08 2014 -0700

    PARQUET-94: Fix bug in ParquetScroogeScheme constructor, minor cleanup

    I noticed that ParquetScroogeScheme's constructor ignores the provided
    klass argument. I also added in missing type parameters for the Config
    object where they were missing.

    Author: Alex Levenson

    Closes #61 from isnotinvain/alexlevenson/parquet-scrooge-cleanup and squashes the following commits:

    2b16007 [Alex Levenson] Fix bug in ParquetScroogeScheme constructor, minor cleanup

commit c3b673614a909313fa928f5fdfe8f41376ba2862
Author: Tianshuo Deng
Date:   Wed Sep 10 10:37:51 2014 -0700

    PARQUET-87: Add API for projection pushdown on the cascading scheme level

    JIRA: https://issues.apache.org/jira/browse/PARQUET-87

    Previously, the projection pushdown configuration was global, not bound
    to a specific tap. After adding this API, projection pushdown can be done
    more "naturally", which may benefit Scalding. Code that uses this API
    looks like:

    ```
    Scheme sourceScheme = new ParquetScroogeScheme(new Config().withProjection(projectionFilter));
    Tap source = new Hfs(sourceScheme, PARQUET_PATH);
    ```

    Author: Tianshuo Deng

    Closes #51 from tsdeng/projection_from_scheme and squashes the following commits:

    2c72757 [Tianshuo Deng] make config class final
    813dc1a [Tianshuo Deng] Merge branch 'master' into projection_from_scheme
    b587b79 [Tianshuo Deng] make constructor of Config private, fix format
    3aa7dd2 [Tianshuo Deng] remove builder
    9348266 [Tianshuo Deng] use builder()
    7c91869 [Tianshuo Deng] make fields of Config private, create builder method for Config
    5fdc881 [Tianshuo Deng] builder for setting projection pushdown and predicate pushdown
    a47f271 [Tianshuo Deng] immutable
    3d514b1 [Tianshuo Deng] done

commit b25683579764033f8049e64f35e820e96391042b
Author: julien
Date:   Tue Sep 9 15:45:20 2014 -0700

    upgrade scalatest_version to depend on scala 2.10.4

    Author: julien

    Closes #52 from julienledem/scalatest_version and squashes the following commits:

    945fa75 [julien] upgrade scalatest_version to depend on scala 2.10.4

commit 5f5f2bd37425d250e7045b79cdaa26bc7138db76
Author: Tianshuo Deng
Date:   Mon Sep 8 14:12:11 2014 -0700

    update scala 2.10

    Try to upgrade to scala 2.10

    Author: Tianshuo Deng

    Closes #35 from tsdeng/update_scala_2_10 and squashes the following commits:

    1b7e55f [Tianshuo Deng] fix comment
    bed9de3 [Tianshuo Deng] remove twitter artifactory
    2bce643 [Tianshuo Deng] publish fix
    06b374e [Tianshuo Deng] define scala.binary.version
    fcf6965 [Tianshuo Deng] Merge branch 'master' into update_scala_2_10
    e91d9f7 [Tianshuo Deng] update version
    5d18b88 [Tianshuo Deng] version
    83df898 [Tianshuo Deng] update scala 2.10

    Conflicts:
        pom.xml
    Resolution:
        Newline addition caused a spurious conflict; deconflicted CDH version changes with the Scala version update.

commit 0e0e78e8087b3f013fc25e6b00d85d5285801423
Author: Alex Levenson
Date:   Mon Aug 18 10:38:11 2014 -0700

    PARQUET-73: Add support for FilterPredicates to cascading schemes

    Author: Alex Levenson

    Closes #34 from isnotinvain/alexlevenson/filter-cascading-scheme and squashes the following commits:

    cd69a8e [Alex Levenson] Add support for FilterPredicates to cascading schemes

commit 0d39b37e97f55ef363ded6f433350212ba3131ec
Author: Alex Levenson
Date:   Wed Jul 30 13:49:00 2014 -0700

    Only call put() when needed in SchemaCompatibilityValidator#validateColumn()

    This is some minor cleanup suggested by @tsdeng

    Author: Alex Levenson

    Closes #24 from isnotinvain/alexlevenson/columnTypesEncountered and squashes the following commits:

    7f05d90 [Alex Levenson] Only call put() when needed in SchemaCompatibilityValidator#validateColumn()

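The cleanup referenced above is the common get-then-put pattern: record a column's type only on first encounter instead of unconditionally calling put() on every validation. A sketch under assumed names — only `columnTypesEncountered` comes from the branch name; the rest is illustrative, not the actual SchemaCompatibilityValidator code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "only call put() when needed".
public class ValidateColumnSketch {
  private final Map<String, String> columnTypesEncountered = new HashMap<String, String>();

  public void validateColumn(String columnPath, String type) {
    String previous = columnTypesEncountered.get(columnPath);
    if (previous == null) {
      columnTypesEncountered.put(columnPath, type); // first encounter: remember it
    } else if (!previous.equals(type)) {
      throw new IllegalArgumentException(
          columnPath + " was used as both " + previous + " and " + type);
    }
  }
}
```
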
commit aa97326614a648ed7b8e00cbd354151ce79c3b2b
Author: Alex Levenson
Date:   Tue Jul 29 14:38:59 2014 -0700

    Add a unified and optionally more constrained API for expressing filters on columns

    This is a re-opened version of: https://github.com/Parquet/parquet-mr/pull/412

    The idea behind this pull request is to add a way to express filters on
    columns using a DSL that gives Parquet visibility into what is being
    filtered and how. This visibility allows optimizations at read time, the
    biggest one being filtering entire row groups or pages of records without
    even reading them, based on the statistics / metadata stored along with
    each row group or page.

    Included in this API are interfaces for user-defined predicates, which
    must operate at the value level but may opt in to operating at the row
    group / page level as well. This should make this new API a superset of
    the `parquet.filter` package. This new API will need to be reconciled
    with the column filters currently in the `parquet.filter` package, but I
    wanted to get feedback on this first.

    A limitation in both this API and the old one is that you can't do
    cross-column filters, e.g.: columnX > columnY.

    Author: Alex Levenson

    Closes #4 from isnotinvain/alexlevenson/filter-api and squashes the following commits:

    c1ab7e3 [Alex Levenson] Address feedback
    c1bd610 [Alex Levenson] cleanup dotString in ColumnPath
    418bfc1 [Alex Levenson] Update version, add temporary hacks for semantic enforcer
    6643bd3 [Alex Levenson] Fix some more non backward incompatible changes
    39f977f [Alex Levenson] Put a bunch of backwards compatible stuff back in, add @Deprecated
    13a02c6 [Alex Levenson] Fix compile errors, add back in overloaded getRecordReader
    f82edb7 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api
    9bd014f [Alex Levenson] clean up TODOs and reference jiras
    4cc7e87 [Alex Levenson] Add some comments
    30e3d61 [Alex Levenson] Create a common interface for both kinds of filters
    ac153a6 [Alex Levenson] Create a Statistics class for use in UDPs
    fbbf601 [Alex Levenson] refactor IncrementallyUpdatedFilterPredicateGenerator to only generate the parts that require generation
    5df47cd [Alex Levenson] Static imports of checkNotNull
    c1d1823 [Alex Levenson] address some of the minor feedback items
    67a3ba0 [Alex Levenson] update binary's toString
    3d7372b [Alex Levenson] minor fixes
    fed9531 [Alex Levenson] Add skipCurrentRecord method to clear events in thrift converter
    2e632d5 [Alex Levenson] Make Binary Serializable
    09c024f [Alex Levenson] update comments
    3169849 [Alex Levenson] fix compilation error
    0185030 [Alex Levenson] Add integration test for value level filters
    4fde18c [Alex Levenson] move to right package
    ae36b37 [Alex Levenson] Handle merge issues
    af69486 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api
    0665271 [Alex Levenson] Add tests for value inspector
    c5e3b07 [Alex Levenson] Add tests for resetter and evaluator
    29f677a [Alex Levenson] Fix scala DSL
    8897a28 [Alex Levenson] Fix some tests
    b448bee [Alex Levenson] Fix mistake in MessageColumnIO
    c8133f8 [Alex Levenson] Fix some tests
    4cf686d [Alex Levenson] more null checks
    69e683b [Alex Levenson] check all the nulls
    220a682 [Alex Levenson] more cleanup
    aad5af3 [Alex Levenson] rm generated src file from git
    5075243 [Alex Levenson] more minor cleanup
    9966713 [Alex Levenson] Hook generation into maven build
    8282725 [Alex Levenson] minor cleanup
    fea3ea9 [Alex Levenson] minor cleanup
    9e35406 [Alex Levenson] move statistics filter
    c52750c [Alex Levenson] finish moving things around
    97a6bfd [Alex Levenson] Move things around pt2
    843b9fe [Alex Levenson] Move some files around pt 1
    5eedcc0 [Alex Levenson] turn off dictionary support for AtomicConverter
    541319e [Alex Levenson] various cleanup and fixes
    08e9638 [Alex Levenson] rm ColumnPathUtil
    bfe6795 [Alex Levenson] Add type bounds to FilterApi
    6c831ab [Alex Levenson] don't double log exception in SerializationUtil
    a7a58d1 [Alex Levenson] use ColumnPath instead of String
    8f11a6b [Alex Levenson] Move ColumnPath and Canonicalizer to parquet-common
    9164359 [Alex Levenson] stash
    abc2be2 [Alex Levenson] Add null handling to record filters -- this impl is still broken though
    90ba8f7 [Alex Levenson] Update Serialization Util
    0a261f1 [Alex Levenson] Add compression in SerializationUtil
    f1278be [Alex Levenson] Add comment, fix tests
    cbd1a85 [Alex Levenson] Replace some specialization with generic views
    e496cbf [Alex Levenson] Fix short circuiting in StatisticsFilter
    db6b32d [Alex Levenson] Address some comments, fix constructor in ParquetReader
    fd6f44d [Alex Levenson] Fix semver backward compat
    2fdd304 [Alex Levenson] Some more cleanup
    d34fb89 [Alex Levenson] Cleanup some TODOs
    544499c [Alex Levenson] stash
    7b32016 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api
    0e31251 [Alex Levenson] First pass at values filter, needs reworking
    470e409 [Alex Levenson] fix java6/7 bug, minor cleanup
    ee7b221 [Alex Levenson] more InputFormat tests
    5ef849e [Alex Levenson] Add guards for not specifying both kinds of filter
    0186b1f [Alex Levenson] Add logging to ParquetInputFormat and tests for configuration
    a622648 [Alex Levenson] cleanup imports
    9b1ea88 [Alex Levenson] Add tests for statistics filter
    d517373 [Alex Levenson] tests for filter validator
    b25fc44 [Alex Levenson] small cleanup of filter validator
    32067a1 [Alex Levenson] add test for collapse logical nots
    1efc198 [Alex Levenson] Add tests for invert filter predicate
    046b106 [Alex Levenson] some more fixes
    d3c4d7a [Alex Levenson] fix some more types, add in test for SerializationUtil
    cc51274 [Alex Levenson] fix generics in FilterPredicateInverter
    ea08349 [Alex Levenson] First pass at rowgroup filter, needs testing
    156d91b [Alex Levenson] Add runtime type checker
    4dfb4f2 [Alex Levenson] Add serialization util
    8f80b20 [Alex Levenson] update comment
    7c25121 [Alex Levenson] Add class to Column struct
    58f1190 [Alex Levenson] Remove filterByUniqueValues
    7f20de6 [Alex Levenson] rename user predicates
    af14b42 [Alex Levenson] Update dsl
    04409c5 [Alex Levenson] Add generic types into Visitor
    ba42884 [Alex Levenson] rm getClassName
    65f8af9 [Alex Levenson] Add in support for user defined predicates on columns
    6926337 [Alex Levenson] Add explicit tokens for notEq, ltEq, gtEq
    667ec9f [Alex Levenson] remove test for collapsing double negation
    db2f71a [Alex Levenson] rename FilterPredicatesTest
    a0a0533 [Alex Levenson] Address first round of comments
    b2bca94 [Alex Levenson] Add scala DSL and tests
    bedda87 [Alex Levenson] Add tests for FilterPredicate building
    238cbbe [Alex Levenson] Add scala dsl
    39f7b24 [Alex Levenson] add scala mvn boilerplate
    2ec71a7 [Alex Levenson] Add predicate API

    Conflicts:
        parquet-column/src/main/java/parquet/io/api/Binary.java
        parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
    Resolution:
        InternalParquetRecordReader: conflicts from not backporting PARQUET-2, which were minor.
        Binary: changed several anonymous classes to private static. The conflict appears to be an artifact of major changes. The important thing to verify is that these don't break binary compatibility.
    Version conflicts:
        parquet-avro/pom.xml
        parquet-cascading/pom.xml
        parquet-column/pom.xml
        parquet-common/pom.xml
        parquet-encoding/pom.xml
        parquet-generator/pom.xml
        parquet-hadoop-bundle/pom.xml
        parquet-hadoop/pom.xml
        parquet-hive-bundle/pom.xml
        parquet-hive/parquet-hive-binding/parquet-hive-0.10-binding/pom.xml
        parquet-hive/parquet-hive-binding/parquet-hive-0.12-binding/pom.xml
        parquet-hive/parquet-hive-binding/parquet-hive-binding-bundle/pom.xml
        parquet-hive/parquet-hive-binding/parquet-hive-binding-factory/pom.xml
        parquet-hive/parquet-hive-binding/parquet-hive-binding-interface/pom.xml
        parquet-hive/parquet-hive-binding/pom.xml
        parquet-hive/parquet-hive-storage-handler/pom.xml
        parquet-hive/pom.xml
        parquet-jackson/pom.xml
        parquet-pig-bundle/pom.xml
        parquet-pig/pom.xml
        parquet-protobuf/pom.xml
        parquet-scrooge/pom.xml
        parquet-test-hadoop2/pom.xml
        parquet-thrift/pom.xml
        parquet-tools/pom.xml
        pom.xml

commit 836ff0336a0d58948d87baaa248c88d7a1e3fd4f
Author: Matt Massie
Date:   Mon Nov 3 14:00:33 2014 +0000

    PARQUET-123: Enable dictionary support in AvroIndexedRecordConverter

    If consumers are loading Parquet records into an immutable structure like
    an Apache Spark RDD, being able to configure string reuse in
    AvroIndexedRecordConverter can drastically reduce the overall memory
    footprint of strings.

    NOTE: This isn't meant to be a merge-able PR (yet). I want to use this PR
    as a way to discuss: (1) if this is a reasonable approach and (2) to
    learn if PrimitiveConverter needs to be thread-safe, as I'm currently
    using a ConcurrentHashMap. If there's agreement that this would be
    worthwhile, I'll create a JIRA and write some unit tests.

    Author: Matt Massie

    Closes #76 from massie/immutable-strings and squashes the following commits:

    88ce5bf [Matt Massie] PARQUET-123: Enable dictionary support in AvroIndexedRecordConverter

commit e6209b6a458811829ac9e5cd4032ae1c58c0ce48
Author: Ryan Blue
Date:   Wed Oct 1 13:44:45 2014 -0700

    PARQUET-64: Add new OriginalTypes in parquet-format 2.2.0.

    This implements the restrictions for those types documented in the
    parquet-format logical types spec. This requires a release of
    parquet-format 2.2.0 with the new types. I'll rebase and update the
    dependency when it is released.

    Author: Ryan Blue

    Closes #31 from rdblue/PARQUET-64-add-new-types and squashes the following commits:

    10feab9 [Ryan Blue] PARQUET-64: Add new OriginalTypes in parquet-format 2.2.0.

commit 61a3152abd16ccf93e4401cab6d93abfc0fa841a
Author: Tianshuo Deng
Date:   Mon Sep 29 12:00:03 2014 -0700

    PARQUET-104: Fix writing empty row group at the end of the file

    At the end of a Parquet file, the writer could emit an empty row group.
    This happens when: numberOfRecords mod sizeOfRowGroup = 0

    Author: Tianshuo Deng

    Closes #66 from tsdeng/fix_empty_row_group and squashes the following commits:

    10b93fb [Tianshuo Deng] rename
    e3a5896 [Tianshuo Deng] format
    91fa0d4 [Tianshuo Deng] fix empty row group

    Conflicts:
        parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordWriter.java
    Resolution:
        close() had a conflict from the extra metadata addition in 792b149, PARQUET-67. This applied just the rename changes for the flush method and the file writer.

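For a sense of the DSL this commit introduces, here is a small example written against the filter2 API as it exists on this branch (imports use the pre-rename `parquet.` namespace visible throughout this log; column names and values are hypothetical). A predicate like this is handed to the read path, e.g. via ParquetInputFormat, so row groups and pages whose statistics rule it out can be skipped without being read:

```java
import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
import parquet.filter2.predicate.Operators.IntColumn;
import parquet.filter2.predicate.Operators.LongColumn;

// status == 200 AND bytes > 1024, expressed in the new predicate DSL.
// Column paths and values are hypothetical.
public class FilterDslExample {
  public static FilterPredicate example() {
    IntColumn status = FilterApi.intColumn("http.status");
    LongColumn bytes = FilterApi.longColumn("http.bytes");
    return FilterApi.and(
        FilterApi.eq(status, 200),
        FilterApi.gt(bytes, 1024L));
  }
}
```
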
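The PARQUET-104 boundary condition is easy to model: with a flush triggered every N records, a writer that flushes unconditionally on close() emits a final, empty row group whenever the record count is an exact multiple of N. A simplified sketch of the guard — illustrative names, not InternalParquetRecordWriter's code:

```java
// Simplified model of the fix: never flush an empty row group on close.
public class FlushSketch {
  private long buffered = 0;
  private final long rowGroupRecords = 100; // flush threshold

  public void write(Object record) {
    buffered++;
    if (buffered >= rowGroupRecords) {
      flushRowGroup();
    }
  }

  public void close() {
    if (buffered > 0) { // the fix: skip the flush when nothing is buffered
      flushRowGroup();
    }
  }

  private void flushRowGroup() {
    // ... write the buffered records as a row group ...
    buffered = 0;
  }
}
```
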
commit cf4e189db12a2df5a325c7156a7e1b737425970e
Author: Colin Marc
Date:   Thu Sep 25 16:45:56 2014 -0700

    PARQUET-96: fill out some missing methods on parquet.example classes

    I'm slightly embarrassed to say that we use these, and we'd really like
    to stop needing a fork, so here we are.

    Author: Colin Marc

    Closes #59 from colinmarc/missing-group-methods and squashes the following commits:

    af8ea08 [Colin Marc] fill out some missing methods on parquet.example classes

    Conflicts:
        parquet-column/src/main/java/parquet/example/data/GroupValueSource.java
        parquet-column/src/main/java/parquet/example/data/simple/SimpleGroup.java
    Resolution:
        Method additions, not real conflicts.

commit 3f68518845e0d907866d0b556d7463f457f43b14
Author: julien
Date:   Thu Sep 25 11:25:53 2014 -0700

    PARQUET-90: integrate field ids in schema

    This integrates support for the field ids that were introduced in the
    Parquet format. Thrift and Protobuf ids will now be saved in the Parquet
    schema.

    Author: julien

    Closes #56 from julienledem/field_ids and squashes the following commits:

    62c2809 [julien] remove withOriginalType; use Types builder more
    8ff0034 [julien] review feedback
    084c8be [julien] binary compat
    85d785c [julien] add proto id in schema; fix schema parsing for ids
    d4be488 [julien] integrate field ids in schema

    Conflicts:
        parquet-column/src/main/java/parquet/schema/GroupType.java
        parquet-column/src/main/java/parquet/schema/MessageType.java
        parquet-column/src/main/java/parquet/schema/Type.java
    Resolution:
        The conflicting methods were added in 9ad5485, PARQUET-2, with type persuasion. Because nothing calls these methods, they are not needed.

commit d863d582d0a0d74a204cdffedd73de96f62d8cdf
Author: Tom White
Date:   Wed Oct 29 20:48:23 2014 +0000

    Enforce CDH-wide version of Jackson.

commit a42db5a596023a3ea2b6203fbe9f17f535687ec6
Author: Tom White
Date:   Mon Nov 3 14:37:17 2014 +0000

    CLOUDERA-BUILD. Add javaVersion property and enforce it.

commit ead451b95f7768377ddeff3e435d6d095d5afc89
Author: Tom White
Date:   Mon Nov 3 14:11:03 2014 +0000

    PARQUET-121: Allow Parquet to build with Java 8

    There are test failures running with Java 8 due to
    http://openjdk.java.net/jeps/180, which changed the retrieval order for
    HashMap. Here's how I tested this:

    ```bash
    use-java8
    mvn clean install -DskipTests -Dmaven.javadoc.skip=true
    mvn test
    mvn test -P hadoop-2
    ```

    I also compiled the main code with Java 7 (target=1.6 bytecode), compiled
    the tests with Java 8, and ran them with Java 8. The idea here is to
    simulate users who want to run Parquet with JRE 8.

    ```bash
    use-java7
    mvn clean install -DskipTests -Dmaven.javadoc.skip=true
    use-java8
    find . -name test-classes | grep target/test-classes | grep -v 'parquet-scrooge' | xargs rm -rf
    mvn test -DtargetJavaVersion=1.8 -Dmaven.main.skip=true -Dscala.maven.test.skip=true
    ```

    A couple of notes about this:

    * The targetJavaVersion property is used since other Hadoop projects use
      the same name.
    * I couldn't get parquet-scrooge to compile with target=1.8, which is why
      I introduced scala.maven.test.skip (and updated scala-maven-plugin to
      the latest version, which supports the property). Compiling with
      target=1.8 should be fixed in another JIRA as it looks pretty involved.

    Author: Tom White

    Closes #77 from tomwhite/PARQUET-121-java8 and squashes the following commits:

    8717e13 [Tom White] Fix tests to run under Java 8.
    35ea670 [Tom White] PARQUET-121. Allow Parquet to build with Java 8.

commit 2d6301176789f35c05d052d2a6299dd4e86afc64
Author: Ryan Blue
Date:   Wed Oct 1 14:14:24 2014 -0700

    PARQUET-107: Add option to disable summary metadata.

    This adds an option to the commitJob phase of the MR OutputCommitter,
    parquet.enable.summary-metadata (default true), that can be used to
    disable the summary metadata files generated from the footers of all of
    the files produced. This enables more control over when those summary
    files are produced and makes it possible to rename MR outputs and then
    generate the summaries.

    Author: Ryan Blue

    Closes #68 from rdblue/PARQUET-107-add-summary-metadata-option and squashes the following commits:

    261e5e4 [Ryan Blue] PARQUET-107: Add option to disable summary metadata.

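Using the new option from job setup looks like the following. The property name comes from the commit message itself; the surrounding job code is an illustrative sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Turn off the summary metadata files generated during commitJob from the
// footers of the job's output files (default is true, i.e. enabled).
public class DisableSummaryMetadata {
  public static void configure(Job job) {
    Configuration conf = job.getConfiguration();
    conf.setBoolean("parquet.enable.summary-metadata", false);
  }
}
```
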
commit 5ee6d69592bd77d5be33ffe5b419a93631a92775
Author: Ryan Blue
Date:   Tue Sep 30 17:00:50 2014 -0700

    CLOUDERA-BUILD. Enable parquet-scrooge module.

commit 6452d639c4a6b51480bfad20a124fa346afd3b3a
Author: Jenkins slave
Date:   Fri Sep 26 09:26:30 2014 -0700

    Preparing for CDH5.3.0 development

commit a925b9b749be4413240fb260b115401cddb8746e
Author: Ryan Blue
Date:   Tue Sep 23 12:14:17 2014 -0700

    PARQUET-82: Check page size is valid when writing.

    Author: Ryan Blue

    Closes #48 from rdblue/PARQUET-82-check-page-size and squashes the following commits:

    9f31402 [Ryan Blue] PARQUET-82: Check page size is valid when writing.

commit 5d750fb03446f1c7e6bb20da3b6cc182794cb472
Author: Daniel Weeks
Date:   Mon Sep 22 11:21:20 2014 -0700

    PARQUET-92: Pig parallel control

    The parallelism for reading footers was fixed at '5', which isn't optimal
    for using Pig with S3. Just adding a property to adjust the parallelism.

    JIRA: https://issues.apache.org/jira/browse/PARQUET-92

    Author: Daniel Weeks

    Closes #57 from dcw-netflix/pig-parallel-control and squashes the following commits:

    e49087c [Daniel Weeks] Update ParquetFileReader.java
    ec4f8ca [Daniel Weeks] Added configurable control of parallelism
    d37a6de [Daniel Weeks] Resetting pom to main
    0c1572e [Daniel Weeks] Merge remote-tracking branch 'upstream/master'
    98c6607 [Daniel Weeks] Merge remote-tracking branch 'upstream/master'
    96ba602 [Daniel Weeks] Disabled projects that don't compile

commit 374c4c482c39411e7cfeb04e14ba163e77db3d6f
Author: Ryan Blue
Date:   Thu Sep 4 11:28:03 2014 -0700

    PARQUET-63: Enable dictionary encoding for FIXED.

    This uses the existing dictionary support introduced for int96. Encoding
    and ParquetProperties have been updated to use the dictionary-supporting
    classes when requested for write or present during read. This also fixes
    a bug in the fixed dictionary values writer, where the length was
    hard-coded to int96's 12 bytes.

    Author: Ryan Blue

    Closes #30 from rdblue/PARQUET-63-add-fixed-dictionary-support and squashes the following commits:

    bc34a34 [Ryan Blue] PARQUET-63: Enable dictionary encoding for FIXED.

commit 2a0b165e058c83323d370ca87151b7cefccb1621
Author: Tianshuo Deng
Date:   Wed Sep 3 15:37:00 2014 -0700

    do ProtocolEvents fixing only when there are required fields missing in the requested schema

    https://issues.apache.org/jira/browse/PARQUET-61

    This PR redoes https://github.com/apache/incubator-parquet-mr/pull/7

    It fixes the protocol events under a more precise condition: only when
    the requested schema is missing some required fields that are present in
    the full schema. So even if there is a projection, as long as the
    projection does not get rid of a required field, the protocol events
    amender will not be called.

    Could you take a look at this? @dvryaboy @yan-qi

    Author: Tianshuo Deng

    Closes #28 from tsdeng/fix_protocol_when_required_field_missing and squashes the following commits:

    ba778b9 [Tianshuo Deng] add continue for readability
    d5639df [Tianshuo Deng] fix unused import
    090e894 [Tianshuo Deng] format
    13a609d [Tianshuo Deng] comment format
    ef1fe58 [Tianshuo Deng] little refactor, remove the hasMissingRequiredFieldFromProjection method
    7c2c158 [Tianshuo Deng] format
    83a5655 [Tianshuo Deng] do ProtocolEvents fixing only when there is required fields missing in the requested schema

commit 0e9f24b8e2ff096b6e26093f263c5e8c8c95948e
Author: Daniel Weeks
Date:   Thu Aug 28 11:30:50 2014 -0700

    PARQUET-75: Fixed string decode performance issue

    Switch to using 'UTF8.decode' as opposed to 'new String'

    https://issues.apache.org/jira/browse/PARQUET-75

    Author: Daniel Weeks

    Closes #40 from dcw-netflix/string-decode and squashes the following commits:

    2cf53e7 [Daniel Weeks] Fixed string decode performance issue

    Conflicts:
        parquet-column/src/main/java/parquet/io/api/Binary.java
    Resolution:
        Conflict because anonymous classes are now static classes in master. Just backported the fix, which is small.

commit 2be528e2533ed2645cbd407f47071b4de3ce95b2
Author: julien
Date:   Thu Aug 28 10:35:19 2014 -0700

    PARQUET-80: upgrade semver plugin version to 0.9.27

    To include the fix in:
    https://github.com/jeluard/semantic-versioning/pull/39

    Author: julien

    Closes #46 from julienledem/upgrade_semver_plugin and squashes the following commits:

    30e7247 [julien] upgrade semver plugin version to 0.9.27

commit 2606d36b3e8e03170c5c7167885a7109cdfb61cb
Author: Eric Snyder
Date:   Wed Aug 20 14:09:38 2014 -0700

    PARQUET-66: Upcast blockSize to long to prevent integer overflow.

    Author: Eric Snyder

    Closes #33 from snyderep/master and squashes the following commits:

    c99802e [Eric Snyder] PARQUET-66: Upcast blockSize to long to prevent integer overflow.

commit fe8228d2bf4a7f6638cc8cbfe8282d94f643c984
Author: Ryan Blue
Date:   Wed Aug 20 14:02:01 2014 -0700

    PARQUET-62: Fix binary dictionary write bug.

    The binary dictionary writers keep track of written values in memory to
    deduplicate them and write dictionary pages periodically. If a written
    value is changed by the caller, this corrupts the dictionary without an
    error message. This adds a defensive copy to fix the problem.

    Author: Ryan Blue

    Closes #29 from rdblue/PARQUET-62-fix-dictionary-bug and squashes the following commits:

    42b6920 [Ryan Blue] PARQUET-62: Fix binary dictionary write bug.

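The PARQUET-62 bug is easy to picture: the writer deduplicates values by keeping references to them, so a caller that reuses its byte buffer silently corrupts the dictionary unless the writer copies first. An illustrative sketch, not the actual parquet-mr writer:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the defensive copy: never keep a reference to a byte[] the
// caller may mutate after the call returns.
public class DictionaryPageSketch {
  private final Set<Key> dictionary = new HashSet<Key>();

  public void writeBytes(byte[] value) {
    // copy before storing; without this, a caller reusing its buffer
    // changes values already recorded in the dictionary
    dictionary.add(new Key(Arrays.copyOf(value, value.length)));
    // ... emit the value's dictionary id ...
  }

  private static final class Key {
    private final byte[] bytes;
    Key(byte[] bytes) { this.bytes = bytes; }
    @Override public boolean equals(Object o) {
      return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes);
    }
    @Override public int hashCode() { return Arrays.hashCode(bytes); }
  }
}
```
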
commit 2ff0ca66310e2b7a53796f81d39c2ca5a21ce7b8
Author: Daniel Weeks
Date:   Wed Aug 20 13:52:42 2014 -0700

    PARQUET-70: Fixed storing pig schema to udfcontext for non projection case and moved column index access setting to udfcontext so as not to affect other loaders.

    I found a problem that affects both column name access and column index
    access, due to the way the pig schema is stored by the loader.

    ## Column Name Access

    The ParquetLoader was only storing the pig schema in the UDFContext when
    push projection was applied. In the full load case, the schema was not
    stored, which triggered a full reload of the schema during task
    execution. You can see that initSchema references the UDFContext for the
    schema, but that is only set in push projection. However, the schema
    needs to be set in both the job context (so the TupleReadSupport can
    access the schema) and the UDFContext (so the task side loader can access
    it), which is why it is set in both locations. This also meant the
    requested schema was never sent to the task side either, which could
    cause other problems as well.

    ## Column Index Access

    For index based access, the problem was that the column index access
    setting and the requested schema were not stored in the udfcontext and
    sent to the task side (unless pushProjection was called). The schema was
    stored in the job context, but this would be overwritten if another
    loader was executed first. Also, the property to use column index access
    was only being set at the job context level, so subsequent loaders would
    use column index access even if they didn't request it.

    This fix now ensures that both the schema and column index access are set
    in the udfcontext and loaded in the initSchema method.

    JIRA: https://issues.apache.org/jira/browse/PARQUET-70

    -Dan

    Author: Daniel Weeks

    Closes #36 from dcw-netflix/pig-schema-context and squashes the following commits:

    f896a25 [Daniel Weeks] Moved property loading into setInput
    8f3dc28 [Daniel Weeks] Changed to set job conf settings in both front and backend
    d758de0 [Daniel Weeks] Updated to use isFrontend() for setting context properties
    b7ef96a [Daniel Weeks] Fixed storing pig schema to udfcontext for non projection case and moved column index access setting to udfcontext so as not to affect other loaders.

commit e800d419700a67a344d3b9c347fc6a9e0ede6e3d
Author: Cheng Lian
Date:   Fri Aug 1 16:38:03 2014 -0700

    PARQUET-13: The `-d` option for `parquet-schema` shouldn't have optional argument

    Author: Cheng Lian

    Closes #11 from liancheng/fix-cli-arg and squashes the following commits:

    85a5453 [Cheng Lian] Reverted the dummy change
    47ce817 [Cheng Lian] Dummy change to trigger Travis
    1c0a244 [Cheng Lian] The `-d` option for `parquet-schema` shouldn't have optional argument

commit 7a3609693e9a016c9c622021f9f6ef6baa59210e
Author: Daniel Weeks
Date:   Mon Jul 28 18:07:07 2014 -0700

    Column index access support

    This patch adds the ability to use column index based access to parquet
    files in pig, which allows for rename capability similar to other file
    formats. This is achieved by using the parameterized loader with an
    alternate schema. Example:

    ```
    p = LOAD '/data/parquet/' USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:chararray', 'true');
    ```

    In this example, the names from the requested schema will be translated
    to the column positions from the file and will produce tuples based on
    the index position. Two test cases are included that exercise index
    based access for both full file reads and column projected reads.

    Note: This patch also disables the enforcer plugin on the pig project,
    per discussion at the parquet meetup. The justification for this is that
    the enforcer is too strict for internal classes and results in dead code,
    because duplicating methods is required to add parameters where there is
    only one usage of the constructor/method. The interface for the pig
    loader is imposed by LoadFunc and StoreFunc from the pig project, and the
    implementation internals should not be used directly.

    Author: Daniel Weeks

    Closes #12 from dcw-netflix/column-index-access and squashes the following commits:

    1b5c5cf [Daniel Weeks] Refactored based on review comments
    12b53c1 [Daniel Weeks] Fixed some formatting and the missing filter method sig
    e5553f1 [Daniel Weeks] Adding back default constructor to satisfy other project requirements
    69d21e0 [Daniel Weeks] Merge branch 'master' into column-index-access
    f725c6f [Daniel Weeks] Removed enforcer for pig support
    d182dc6 [Daniel Weeks] Introduces column index access
    1c3c0c7 [Daniel Weeks] Fixed test with strict checking off
    f3cb495 [Daniel Weeks] Added type persuasion for primitive types with a flag to control strict type checking for conflicting schemas, which is strict by default.

    Conflicts:
        parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java
    Resolution:
        Removed parts of 9ad5485 (not backported) in the tests.

commit ec8f54af732ebc2c3439a260d9e7205b8234d0cf
Author: Sandy Ryza
Date:   Wed Jul 23 14:29:35 2014 +0100

    PARQUET-25. Pushdown predicates only work with hardcoded arguments.

    Pull request for Sandy Ryza's fix for PARQUET-25.

    Author: Sandy Ryza

    Closes #22 from tomwhite/PARQUET-25-unbound-record-filter-configurable and squashes the following commits:

    a9d3fdc [Sandy Ryza] PARQUET-25. Pushdown predicates only work with hardcoded arguments.

commit a7c05be4e0d5b0cae4b583a57bee7ac663278ebc
Author: Ryan Blue
Date:   Fri Jul 18 16:19:25 2014 -0700

    PARQUET-18: Fix all-null value pages with dict encoding.

    TestDictionary#testZeroValues demonstrates the problem, where a page of
    all null values is decoded using the DictionaryValuesReader. Because
    there are no non-null values, the page's values section is 0 bytes, but
    the DictionaryValuesReader assumes there is at least one encoded value
    and attempts to read a bit width. The test passes a byte array to
    initFromPage with the offset equal to the array's length.

    The fix is to detect that there are no input bytes to read. To avoid
    adding validity checks to the read path, this sets the internal decoder
    to one that will throw an exception if any reads are attempted.

    Author: Ryan Blue

    Closes #18 from rdblue/PARQUET-18-fix-nulls-with-dictionary and squashes the following commits:

    0711766 [Ryan Blue] PARQUET-18: Fix all-null value pages with dict encoding.

commit ba5bc9d9851acd4f325f8d1988f24debcafef823
Author: Matthieu Martin
Date:   Fri Jul 18 16:02:09 2014 -0700

    PARQUET-4: Use LRU caching for footers in ParquetInputFormat.

    Reopening https://github.com/Parquet/parquet-mr/pull/403 against the new
    Apache repository.

    Author: Matthieu Martin

    Closes #2 from matt-martin/master and squashes the following commits:

    99bb5a3 [Matthieu Martin] Minor javadoc and whitespace changes. Also added the FileStatusWrapper class to ParquetInputFormat to make sure that the debugging log statements print out meaningful paths.
    250a398 [Matthieu Martin] Be less aggressive about checking whether the underlying file has been appended to/overwritten/deleted in order to minimize the number of namenode interactions.
    d946445 [Matthieu Martin] Add javadocs to parquet.hadoop.LruCache. Rename cache "entries" as cache "values" to avoid confusion with java.util.Map.Entry (which contains key value pairs, whereas our old "entries" really only refer to the values).
    a363622 [Matthieu Martin] Use LRU caching for footers in ParquetInputFormat.

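The commit's parquet.hadoop.LruCache is its own implementation; purely to illustrate the caching strategy it describes, the standard JVM idiom for a bounded LRU map looks like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Classic LinkedHashMap-based LRU cache, shown only to illustrate the
// strategy; this is not parquet.hadoop.LruCache.
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
  private final int maxValues;

  public LruCacheSketch(int maxValues) {
    super(16, 0.75f, true); // accessOrder=true: iteration order = LRU order
    this.maxValues = maxValues;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxValues; // evict the least recently used value
  }
}
```
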
commit be4fdbfcd7008f363f53594f1a50611105681e07
Author: Tom White
Date:   Wed Jul 16 14:50:29 2014 +0100

    PARQUET-9: Filtering records across multiple blocks

    Update of the minimal fix discussed in
    https://github.com/apache/incubator-parquet-mr/pull/1, with the recursive
    call changed to a loop.

    Author: Tom White
    Author: Steven Willis

    Closes #9 from tomwhite/filtering-records-across-multiple-blocks and squashes the following commits:

    afb08a4 [Tom White] Minimal fix
    9e723ee [Steven Willis] Test for filtering records across multiple blocks

commit 02642a739fed6b4771309545fca3075d0d2acb88
Author: Maxwell Swadling
Date:   Mon May 12 10:55:19 2014 +1000

    Fixed hadoop WriteSupportClass loading

commit d5f5f226378a773a8020e9821afa66c0b2641db0
Author: Ryan Blue
Date:   Fri Sep 12 10:54:11 2014 -0700

    CLOUDERA-BUILD. Add protoc.executable property.

commit b13f38116fc26aac625625cecec4dd0bef9ba24d
Author: Ryan Blue
Date:   Mon Sep 8 15:43:12 2014 -0700

    CLOUDERA-BUILD. Add mr1 profile for tests.

commit 12346b3cfb934f74e213f3adde7e2028815452e3
Author: Ryan Blue
Date:   Thu Aug 28 17:56:07 2014 -0700

    CLOUDERA-BUILD. Add back ctors removed since 1.2.5.

    Jdiff reports that from 1.2.5-cdh5.0.0 to 1.5.0-cdh5.2.0, the API has had
    4 removals:

    * constructor ParquetThriftBytesOutputFormat(TProtocolFactory, Class>, boolean)
    * constructor ParquetWriter(Path, WriteSupport, CompressionCodecName, int, int, int, boolean, boolean, Configuration)
    * constructor ThriftBytesWriteSupport(TProtocolFactory, Class>, boolean)
    * constructor ThriftToParquetFileWriter(Path, TaskAttemptContext, TProtocolFactory, Class>, boolean)

    This commit adds these constructors back to ensure compatibility.

commit b4c75a0790d747da529e55c7c5f5bc2aa1d6176f
Author: Ryan Blue
Date:   Fri Aug 1 14:36:12 2014 -0700

    CLOUDERA-BUILD. Add jdiff to POM.

commit 5cad32c4ff0a60f22462cbc597416ad74c35ca8f
Author: Ryan Blue
Date:   Mon Jul 28 18:16:57 2014 -0700

    CLOUDERA-BUILD. Update to CDH avro version.

commit 2be8acb0cf57a628ab5a9c3f0a068f697de3578b
Author: Ryan Blue
Date:   Mon Jul 28 18:14:36 2014 -0700

    CLOUDERA-BUILD. Update to CDH protobuf version.

commit 532b752e96ed2251add1ca366363582902c80667
Author: Ryan Blue
Date:   Mon Aug 4 19:04:18 2014 -0700

    PARQUET-59: Fix parquet-scrooge test on hadoop-2.

    Author: Ryan Blue

    Closes #27 from rdblue/PARQUET-59-fix-scrooge-test-on-hadoop-2 and squashes the following commits:

    ac34369 [Ryan Blue] PARQUET-59: Fix parquet-scrooge test on hadoop-2.

commit 0c532c156ad7ac609c8c0998ae139d6a63a14339
Author: Ryan Blue
Date:   Sun Jul 27 15:54:30 2014 -0700

    CLOUDERA-BUILD. CDH-16396: Comment out parquet-hive* from parquet pom.

commit 45b6975cfc6485bdb6046bd639b393bd63b713db
Author: Ryan Blue
Date:   Sun Jul 27 15:02:07 2014 -0700

    CLOUDERA-BUILD. Update to CDH5 thrift version.

commit b665f08cfc77469b337a492676c7cbff43d13383
Author: Ryan Blue
Date:   Sun Jul 27 15:01:06 2014 -0700

    CLOUDERA-BUILD. Update to parquet-format 2.1.0-cdh5.

commit 9f9aee2153f2b73d1e9f283c7264d2af63380e76
Author: Ryan Blue
Date:   Sun Jul 27 14:59:51 2014 -0700

    CLOUDERA-BUILD. Disable semantic versioning checks.

commit 2cafa0ae0911459bee0ba296238f9817ab733ab6
Author: Ryan Blue
Date:   Sat Jul 26 16:37:22 2014 -0700

    CLOUDERA-BUILD. Update Pig to CDH dependency.

    TestSummary needed to be modified because null is no longer allowed in a
    Bag. Three nulls were removed and the validation method updated to
    reflect the new structure of the test data.

commit fa46bde2e360b35ea6a1aa7dbec3478ded7f7b04
Author: Ryan Blue
Date:   Sat Jul 26 16:23:15 2014 -0700

    CLOUDERA-BUILD. Update to CDH Hadoop version.

commit 8078d97ad3e18d8678c5d2b1394bf730c9739342
Author: Ryan Blue
Date:   Mon Jul 21 15:36:16 2014 -0700

    CLOUDERA-BUILD. Update root POM for CDH packaging.