spark-instrumented-optimizer/docs/sql-programming-guide.md

SPARK-1251 Support for optimizing and executing structured queries

This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL.

*This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.*

The code is broken into three primary components:

- Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
- Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured Scala queries against existing RDDs and Parquet files.
- Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.

A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251).

[An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review.

With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations.

```scala
// Define the schema using a case class.
case class Person(name: String, age: Int)
val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt))

// The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd
```

RDDs can also be registered as Tables, allowing SQL queries to be written over them.

```scala
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19")
```

The results of queries are themselves RDDs and support standard RDD operations:

```scala
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```

Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL.

```scala
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT key, value FROM src").collect().foreach(println)
```

## Relationship to Shark

Unlike Shark, Spark SQL does not act as a drop-in replacement for Hive or the HiveServer. Instead, this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project.

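
To make the Catalyst idea from the component list above more concrete (a framework for manipulating trees of relational operators and expressions), here is a minimal, self-contained sketch of a rule-based tree rewrite. It does not use Spark, and none of it is Catalyst's actual API: the names `Expr`, `Literal`, `Attribute`, `Add`, `GreaterOrEqual`, `transformUp` and `foldConstants` are hypothetical, chosen only for illustration. The sketch assumes that an optimization rule can be modeled as a partial function applied bottom-up over an immutable tree.

```scala
// Hypothetical, simplified expression tree. These are NOT Catalyst's real classes.
sealed trait Expr {
  // Apply `rule` bottom-up: rewrite the children first, then this node.
  // Nodes the rule does not match are returned unchanged.
  def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren: Expr = this match {
      case Add(l, r)            => Add(l.transformUp(rule), r.transformUp(rule))
      case GreaterOrEqual(l, r) => GreaterOrEqual(l.transformUp(rule), r.transformUp(rule))
      case leaf                 => leaf
    }
    rule.applyOrElse(withNewChildren, identity[Expr])
  }
}
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class GreaterOrEqual(left: Expr, right: Expr) extends Expr

object ConstantFoldingSketch {
  // An illustrative rule: replace the sum of two literals with a single literal.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
  }

  def main(args: Array[String]): Unit = {
    // Represents the predicate: age >= (5 + 5)
    val predicate = GreaterOrEqual(Attribute("age"), Add(Literal(5), Literal(5)))
    val optimized = predicate.transformUp(foldConstants)
    println(optimized)
  }
}
```

Because the tree is immutable and a rule only describes the nodes it cares about, rules of this shape can be written and tested independently of how the resulting plan is eventually executed.
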
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
---
layout: global
displayTitle: Spark SQL, DataFrames and Datasets Guide
title: Spark SQL and DataFrames
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
---
* This will become a table of contents (this text will be scraped).
{:toc}
# Overview
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally,
Spark SQL uses this extra information to perform additional optimizations. There are several ways to
interact with Spark SQL, including SQL and the Dataset API. When computing a result,
the same execution engine is used, independent of which API or language you use to express the
computation. This unification means that developers can easily switch back and forth between
different APIs, choosing whichever provides the most natural way to express a given transformation.
All of the examples on this page use sample data included in the Spark distribution and can be run in
the `spark-shell`, `pyspark` shell, or `sparkR` shell.
## SQL
One use of Spark SQL is to execute SQL queries.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to
configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
SQL from within another programming language, the results will be returned as a [DataFrame](#datasets-and-dataframes).
You can also interact with the SQL interface using the [command line](#running-the-spark-sql-cli)
or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
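
As a minimal sketch of running SQL from Scala, the snippet below registers a small in-memory table and queries it. The application name, the `local[*]` master, the `people` view, and the sample rows are assumptions made for illustration; they are not part of this guide's datasets.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for illustration only; the app name and master are assumptions.
val spark = SparkSession.builder()
  .appName("sql-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical in-memory rows standing in for a real table.
val people = Seq(("Alice", 29), ("Bob", 15)).toDF("name", "age")
people.createOrReplaceTempView("people")

// spark.sql returns its result as a DataFrame.
val teenagers: DataFrame =
  spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.show()
```

The result can be transformed further with any DataFrame operation before it is collected or displayed.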
## Datasets and DataFrames
A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
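
The following sketch shows one way a Dataset might be built from JVM objects and then transformed with typed lambdas. The `Person` case class, its sample values, and the session settings are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for this sketch.
case class Person(name: String, age: Long)

// Local session for illustration only; the app name and master are assumptions.
val spark = SparkSession.builder()
  .appName("dataset-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Construct a strongly typed Dataset from ordinary JVM objects.
val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 15)).toDS()

// Functional transformations keep static types: Dataset[Person] => Dataset[String].
val adultNames: Dataset[String] = people.filter(_.age >= 18).map(_.name)
adultNames.show()
```

Because the lambdas operate on `Person` values, a misspelled field name is rejected at compile time rather than at run time.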
The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
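
To illustrate the type alias in Scala, the sketch below assigns a `DataFrame` to a `Dataset[Row]` variable without any conversion. The sample data and session settings are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Local session for illustration only; the app name and master are assumptions.
val spark = SparkSession.builder()
  .appName("alias-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical single-row DataFrame.
val df: DataFrame = Seq(("Alice", 29)).toDF("name", "age")

// No conversion is needed: DataFrame and Dataset[Row] are the same type in Scala.
val rows: Dataset[Row] = df

// Fields of a generic Row are looked up by name at run time.
val first: Row = rows.head()
println(first.getAs[String]("name"))
```

Since `Row` is untyped, the type passed to `getAs` is not checked until the value is used, unlike the fields of a typed Dataset.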
Python does not support the Dataset API, but because of its dynamic nature many of the
benefits are already available (for example, you can access a field of a row by name:
`row.columnName`). The case for R is similar.
Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
# Getting Started
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as input expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribution for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
## Starting Point: SparkSession
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
<div class="codetabs">
<div data-lang="scala" markdown="1">
The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, use `SparkSession.builder()`:
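As a minimal sketch of the builder pattern described above, the snippet below creates a session with `SparkSession.builder()` and `getOrCreate()`. The application name and the `local[*]` master are illustrative assumptions, not values taken from this guide. The snippet also assumes that the `spark-sql` artifact is on the classpath, for example when pasted into `spark-shell` or run from a project that depends on it.

```scala
import org.apache.spark.sql.SparkSession

// Build a session, or reuse one that already exists in this JVM.
// Assumption: the app name and the local master are placeholders for
// illustration; a real deployment would usually supply the master
// through spark-submit rather than hard-coding it here.
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local[*]")
  .getOrCreate()

// Bring in implicit conversions, such as turning RDDs into DataFrames.
import spark.implicits._

// Print the Spark version the session is running against.
println(spark.version)
```

Because `getOrCreate()` returns an already running session when one exists, the builder settings may not all take effect in an environment such as `spark-shell` that has created a session beforehand. Calling `spark.stop()` when the application finishes releases the underlying resources.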
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
{% highlight scala %}
import org.apache.spark.sql.SparkSession

// Build a new SparkSession, or reuse an existing one if it is already running.
val spark = SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as input expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treatment for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. It's kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribution for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be used for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for metastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
// This import enables implicit conversions, such as converting an RDD to a DataFrame.
import spark.implicits._
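// A minimal sketch of what the import enables, assuming `spark` is an
// existing SparkSession; the sample rows and column names are hypothetical.
val peopleRDD = spark.sparkContext.parallelize(Seq(("Alice", 29), ("Bob", 31)))
// toDF becomes available on the RDD only because spark.implicits._ is in scope.
val peopleDF = peopleRDD.toDF("name", "age")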
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
{% highlight java %}
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
{% highlight python %}
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
{% highlight r %}
sparkR.session()
{% endhighlight %}
Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around.
</div>
</div>
`SparkSession` in Spark 2.0 provides built-in support for Hive features, including the ability to
write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
To use these features, you do not need to have an existing Hive setup.
## Creating DataFrames
<div class="codetabs">
<div data-lang="scala" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](#data-sources).
As an example, the following creates a DataFrame based on the content of a JSON file:
{% highlight scala %}
val spark: SparkSession // An existing SparkSession.
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](#data-sources).
As an example, the following creates a DataFrame based on the content of a JSON file:
{% highlight java %}
SparkSession spark = ...; // An existing SparkSession.
Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
// Displays the content of the DataFrame to stdout
df.show();
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](#data-sources).
As an example, the following creates a DataFrame based on the content of a JSON file:
{% highlight python %}
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
With a `SparkSession`, applications can create DataFrames from a local R data.frame,
from a Hive table, or from [Spark data sources](#data-sources).
As an example, the following creates a DataFrame based on the content of a JSON file:
{% highlight r %}
df <- read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame
showDF(df)
{% endhighlight %}
</div>
</div>
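The examples above read `people.json`, which by default is expected to contain one JSON object per line rather than a single JSON array. As a rough sketch of that layout, the following plain-Python snippet (no Spark required) writes such a file and parses it back. The three records and the temporary file path are assumptions, chosen to match the output shown in the examples of this guide.

```python
import json
import os
import tempfile

# Assumed sample records, chosen to match the output shown in the examples.
records = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# Write one self-contained JSON object per line (not a single JSON array).
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Parse the file back line by line; a missing field is treated as null (None).
with open(path) as f:
    rows = [json.loads(line) for line in f if line.strip()]

ages = [row.get("age") for row in rows]
names = [row["name"] for row in rows]
print(ages)
print(names)
```

A record that omits a field, such as the first one here, is why the corresponding `age` value appears as `null` when the DataFrame is displayed.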
## Untyped Dataset Operations (aka DataFrame Operations)
DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/DataFrame.html).
As mentioned above, in Spark 2.0, DataFrames are just Datasets of `Row`s in the Scala and Java APIs. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.
Here we include some basic examples of structured data processing using Datasets:
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
val spark: SparkSession // An existing SparkSession
// Create the DataFrame
val df = spark.read.json("examples/src/main/resources/people.json")
// Show the content of the DataFrame
df.show()
// age name
// null Michael
// 30 Andy
// 19 Justin
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name (age + 1)
// Michael null
// Andy 31
// Justin 20
// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30 Andy
// Count people by age
df.groupBy("age").count().show()
// age count
// null 1
// 19 1
// 30 1
{% endhighlight %}
For a complete list of the types of operations that can be performed on a Dataset refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset).
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
SparkSession spark = ...; // An existing SparkSession
// Create the DataFrame
Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
// Show the content of the DataFrame
df.show();
// age name
// null Michael
// 30 Andy
// 19 Justin
// Print the schema in a tree format
df.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show();
// name
// Michael
// Andy
// Justin
// Select everybody, but increment the age by 1
df.select(df.col("name"), df.col("age").plus(1)).show();
// name (age + 1)
// Michael null
// Andy 31
// Justin 20
// Select people older than 21
df.filter(df.col("age").gt(21)).show();
// age name
// 30 Andy
// Count people by age
df.groupBy("age").count().show();
// age count
// null 1
// 19 1
// 30 1
{% endhighlight %}
For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/java/org/apache/spark/sql/Dataset.html).
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
</div>
<div data-lang="python" markdown="1">
In Python, it's possible to access a DataFrame's columns either by attribute
(`df.age`) or by indexing (`df['age']`). While the former is convenient for
interactive data exploration, users are strongly encouraged to use the
latter form, which is future-proof and won't break with column names that
are also attributes on the DataFrame class.
{% highlight python %}
# spark is an existing SparkSession
# Create the DataFrame
df = spark.read.json("examples/src/main/resources/people.json")
# Show the content of the DataFrame
df.show()
## age name
## null Michael
## 30 Andy
## 19 Justin
# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)
# Select only the "name" column
df.select("name").show()
## name
## Michael
## Andy
## Justin
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
## name (age + 1)
## Michael null
## Andy 31
## Justin 20
# Select people older than 21
df.filter(df['age'] > 21).show()
## age name
## 30 Andy
# Count people by age
df.groupBy("age").count().show()
## age count
## null 1
## 19 1
## 30 1
{% endhighlight %}
For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).
</div>
<div data-lang="r" markdown="1">
{% highlight r %}
# Create the DataFrame
df <- read.json("examples/src/main/resources/people.json")
# Show the content of the DataFrame
showDF(df)
## age name
## null Michael
## 30 Andy
## 19 Justin
# Print the schema in a tree format
printSchema(df)
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)
# Select only the "name" column
showDF(select(df, "name"))
## name
## Michael
## Andy
## Justin
# Select everybody, but increment the age by 1
showDF(select(df, df$name, df$age + 1))
## name (age + 1)
## Michael null
## Andy 31
## Justin 20
# Select people older than 21
showDF(where(df, df$age > 21))
## age name
## 30 Andy
# Count people by age
showDF(count(groupBy(df, "age")))
## age count
## null 1
## 19 1
## 30 1
{% endhighlight %}
For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/SparkDataFrame.html).
</div>
</div>
## Running SQL Queries Programmatically
<div class="codetabs">
<div data-lang="scala" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
{% highlight scala %}
val spark = ... // An existing SparkSession
val df = spark.sql("SELECT * FROM table")
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `Dataset<Row>`.
{% highlight java %}
SparkSession spark = ...; // An existing SparkSession
Dataset<Row> df = spark.sql("SELECT * FROM table");
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
{% highlight python %}
# spark is an existing SparkSession
df = spark.sql("SELECT * FROM table")
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
{% highlight r %}
df <- sql("SELECT * FROM table")
{% endhighlight %}
</div>
</div>
## Creating Datasets
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use
a specialized [Encoder](api/scala/index.html#org.apache.spark.sql.Encoder) to serialize the objects
for processing or transmitting over the network. While both encoders and standard serialization are
responsible for turning an object into bytes, encoders are code generated dynamically and use a format
that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
the bytes back into an object.
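To make the idea concrete, here is a toy sketch in plain Python of filtering and sorting records while they are still encoded. The fixed-width layout below is an assumption invented for this illustration and is not the format Spark's encoders actually use; the helper names `encode`, `decode` and `age_of` are likewise hypothetical.

```python
import struct

# Assumed toy layout (NOT Spark's real format): an 8-byte signed age
# followed by a 16-byte, NUL-padded UTF-8 name.
RECORD = struct.Struct("<q16s")

def encode(name, age):
    """Serialize one record into the fixed-width layout."""
    return RECORD.pack(age, name.encode("utf-8"))

def decode(buf):
    """Fully deserialize a record back into a (name, age) tuple."""
    age, raw_name = RECORD.unpack(buf)
    return raw_name.rstrip(b"\x00").decode("utf-8"), age

def age_of(buf):
    """Read only the age field at its known offset; the name is never decoded."""
    return struct.unpack_from("<q", buf, 0)[0]

records = [encode("Michael", 45), encode("Andy", 30), encode("Justin", 19)]

# Filter and sort directly on the encoded bytes.
older_than_21 = [buf for buf in records if age_of(buf) > 21]
by_age = sorted(records, key=age_of)

# Decode only at the end, when actual objects are needed.
filtered_names = [decode(buf)[0] for buf in older_than_21]
sorted_names = [decode(buf)[0] for buf in by_age]
```

Because every field sits at a known offset, the filter and the sort never build a full object. An opaque, general-purpose serialization format would not allow this.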
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._
val ds = Seq(1, 2, 3).toDS()
ds.map(_ + 1).collect() // Returns: Array(2, 3, 4)
// Encoders are also created for case classes.
case class Person(name: String, age: Long)
val caseClassDS = Seq(Person("Andy", 32)).toDS()
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name.
val path = "examples/src/main/resources/people.json"
val people = spark.read.json(path).as[Person]
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
SparkSession spark = ...; // An existing SparkSession
// Encoders for most common types are provided in class Encoders.
Dataset<Integer> ds = spark.createDataset(Arrays.asList(1, 2, 3), Encoders.INT());
ds.map(new MapFunction<Integer, Integer>() {
  @Override
  public Integer call(Integer value) throws Exception {
    return value + 1;
  }
}, Encoders.INT()); // Returns: [2, 3, 4]
Person person = new Person();
person.setName("Andy");
person.setAge(32);
// Encoders are also created for Java beans.
// A distinct variable name is used because `ds` is already declared above.
Dataset<Person> personDS = spark.createDataset(
  Collections.singletonList(person),
  Encoders.bean(Person.class)
);
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name.
String path = "examples/src/main/resources/people.json";
Dataset<Person> people = spark.read().json(path).as(Encoders.bean(Person.class));
{% endhighlight %}
</div>
</div>
## Interoperating with RDDs
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection-based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct Datasets when the columns and their types are not known until runtime.
### Inferring the Schema Using Reflection
<div class="codetabs">
<div data-lang="scala" markdown="1">
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
to a DataFrame. The case class
defines the schema of the table. The names of the arguments to the case class are read using
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as `Seq`s or `Array`s. This RDD can be implicitly converted to a DataFrame and then be
registered as a table. Tables can be used in subsequent SQL statements.
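
The flow described above can be sketched as follows. This is a minimal illustration, not the guide's own example: it assumes the Spark 1.3-era API (`SQLContext`, `import sqlContext.implicits._`, `toDF()`, `registerTempTable`), a local master, and a hypothetical `Person` case class with in-memory sample rows in place of an input file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema for illustration: the constructor argument names
// `name` and `age` are read via reflection and become the column names.
// The case class is declared at top level so that reflection can see it.
case class Person(name: String, age: Int)

object ReflectionSchemaSketch {

  // Builds a small table from case-class instances and queries it with SQL.
  def teenagerNames(): Array[String] = {
    // Assumption: a local master is sufficient for this sketch.
    val conf = new SparkConf()
      .setAppName("reflection-schema-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      // Brings the RDD-to-DataFrame conversions into scope.
      import sqlContext.implicits._

      // Sample rows (made up for this sketch) instead of reading a file.
      val people = sc.parallelize(Seq(
        Person("Michael", 29),
        Person("Andy", 30),
        Person("Justin", 19)
      )).toDF()

      // Register the DataFrame so later SQL statements can refer to it by name.
      people.registerTempTable("people")

      val teenagers = sqlContext.sql(
        "SELECT name FROM people WHERE age >= 13 AND age <= 19")

      // Collect the rows to the driver, then read the first column by ordinal.
      teenagers.collect().map(_.getString(0))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    teenagerNames().foreach(println)
}
```

With the sample rows above, only the row for `Justin` satisfies the age filter. Later Spark releases may deprecate or rename some of these calls, so check the API documentation for the version in use.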
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
{% highlight scala %}
val spark: SparkSession // An existing SparkSession
// This import enables the implicit conversion of an RDD to a DataFrame.
import spark.implicits._
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribution for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be used for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for metastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing.
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since it's hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading.
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
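// A minimal sketch of that workaround. `WideRecord` and its fields are hypothetical names used
// only for illustration, and just two fields are shown for brevity; a class with more than 22
// fields would list each one in productElement in the same way.
class WideRecord(val id: Int, val label: String) extends Product with Serializable {
  // Required by Product (via Equals): restrict equality checks to the same class.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
  // Number of fields exposed through the Product interface.
  override def productArity: Int = 2
  // Positional access to each field, in declaration order.
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => label
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}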
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a temporary view.
val people = sc
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.createOrReplaceTempView("people")
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribution for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be used for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for metastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since it's hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive metastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
// SQL statements can be run by using the sql method provided by spark.
val teenagers = spark.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
83adb9d [Yin Huai] add DataProperty
5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen...
6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
058ec15 [Michael Armbrust] handle more writeables.
ffa9f25 [Michael Armbrust] blacklist some more MR tests.
aa2239c [Michael Armbrust] filter test lines containing Owner:
f71a325 [Michael Armbrust] Update golden jar.
a3003ae [Michael Armbrust] Update makefile to use better sharding support.
568d150 [Michael Armbrust] Updates to white/blacklist.
8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details.
09c6300 [Michael Armbrust] Add nullability information to StructFields.
5460b2d [Michael Armbrust] load srcpart by default.
3695141 [Michael Armbrust] Lots of parser improvements.
965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions.
1814ed3 [Michael Armbrust] use childrenResolved function.
f2ec57e [Michael Armbrust] Begin supporting decimal.
6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
d0124f3 [Michael Armbrust] Correctly type null literals.
b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
a90efda [Michael Armbrust] utility function for outputing string stacktraces.
7102f33 [Michael Armbrust] methods with side-effects should use ().
3ccaef7 [Michael Armbrust] add renaming TODO.
bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
c8e89d5 [Michael Armbrust] memoize inputSet calculation.
6aefa46 [Michael Armbrust] Skip folding literals.
a72e540 [Michael Armbrust] Add IN operator.
04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
12fd52d [Michael Armbrust] support for sorting longs.
0606520 [Michael Armbrust] drop old comment.
859200a [Michael Armbrust] support for reading more types from the metastore.
1fedd18 [Michael Armbrust] coercion from null to numeric types
71e902d [Michael Armbrust] fix test cases.
cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
a92919d [Michael Armbrust] add alter view as to native commands
f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float
e9f4588 [Michael Armbrust] fix > 100 char.
755b229 [Michael Armbrust] blacklist some ddl tests.
9ae740a [Michael Armbrust] blacklist more tests that require MR.
4cfc11a [Michael Armbrust] more test coverage.
0d9d56a [Michael Armbrust] add more native commands to parser
78d730d [Michael Armbrust] Load src test table on RESET.
8364ec2 [Michael Armbrust] whitelist all possible partition values.
b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
516481c [Michael Armbrust] Ignore requests to explain native commands.
68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
9627616 [Michael Armbrust] Use current database as default database.
9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
6f64cee [Michael Armbrust] don't line wrap string literal
eafaeed [Michael Armbrust] add type documentation
f54c94c [Michael Armbrust] make golden answers file a test dependency
5362365 [Michael Armbrust] push conditions into join
0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
73b29cd [Michael Armbrust] fix bad casting
9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
7eff191 [Michael Armbrust] link all the expression names.
83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
01c00c2 [Michael Armbrust] new golden
5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
1a393da [Yin Huai] folded -> foldable
1e964ea [Yin Huai] update
a43d41c [Michael Armbrust] more tests passing!
8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
ab8bbd1 [Michael Armbrust] parsing % operator
c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
367fb9e [Yin Huai] update
0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
61b266f [Michael Armbrust] comment for eliminate subqueries.
d72a5a2 [Michael Armbrust] add long to literal factory object.
b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
e06fd38 [Michael Armbrust] black list map reduce tests.
8e7ce30 [Michael Armbrust] blacklist some env specific tests.
6250cbd [Michael Armbrust] Do not exit on test failure
b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
b6e4899 [Yin Huai] formatting
e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
9e190f5 [Michael Armbrust] drop unneeded ()
68b58c1 [Michael Armbrust] drop a few more tests.
b0aa400 [Michael Armbrust] update whitelist.
c99012c [Michael Armbrust] skip tests with hooks
db00ebf [Michael Armbrust] more types for hive udfs
dbc3678 [Michael Armbrust] update ghpages repo
138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure.
6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i"
8d47ed4 [Yin Huai] comment
2795f05 [Yin Huai] comment
e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
0bd1688 [Yin Huai] update
6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
52864da [Reynold Xin] Added executeCollect method to SharkPlan.
f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
38124bd [Yin Huai] formatting
2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
555d839 [Reynold Xin] More cleaning ...
d48d0e1 [Reynold Xin] Code review feedback.
aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
479e055 [Reynold Xin] A set of minor changes, including:
  - import order
  - limit some lines to 100 character wide
  - inline code comment
  - more scaladocs
  - minor spacing (i.e. add a space after if)
da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution.
0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename
e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore
3f9fee1 [Michael Armbrust] rewrite push filter through join optimization.
c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory.
c9777d8 [Reynold Xin] Put all source files in a catalyst directory.
019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files.
80ca4be [Timothy Chen] Address comments
0079392 [Michael Armbrust] support for multiple insert commands in a single query
75b5a01 [Michael Armbrust] remove space.
4283400 [Timothy Chen] Add limited predicate push down
e547e50 [Michael Armbrust] implement First.
e77c9b6 [Michael Armbrust] more work on unique join.
c795e06 [Michael Armbrust] improve star expansion
a26494e [Michael Armbrust] allow aliases to have qualifiers
d078333 [Michael Armbrust] remove extra space
a75c023 [Michael Armbrust] implement Coalesce
3a018b6 [Michael Armbrust] fix up docs.
ab6f67d [Michael Armbrust] import the string "null" as actual null.
5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved.
191ce3e [Michael Armbrust] analyze rewrite test query.
60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved.
2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues.
e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD
c086a35 [Michael Armbrust] docs, spacing
c4060e4 [Michael Armbrust] cleanup
3b85462 [Michael Armbrust] more tests passing
bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data.
c944a95 [Michael Armbrust] First aggregate expression.
1e28311 [Michael Armbrust] make tests execute in alpha order again
a287481 [Michael Armbrust] spelling
8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing.
a6ab6c7 [Michael Armbrust] add !=
4529594 [Michael Armbrust] draft of coalesce
70f253f [Michael Armbrust] more tests passing!
7349e7b [Michael Armbrust] initial support for test thrift table
d3c9305 [Michael Armbrust] fix > 100 char line
93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE"
06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths
6355d0e [Michael Armbrust] match actual return type of count with expected
cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty.
fd4b096 [Michael Armbrust] fix casing of null strings as well.
4632695 [Michael Armbrust] support for megastore bigint
67b88cf [Michael Armbrust] more verbose debugging of evaluation return types
c680e0d [Michael Armbrust] Failed string => number conversion should return null.
2326be1 [Michael Armbrust] make getClauses case insensitive.
dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types.
045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
fb5ddfd [Michael Armbrust] move ViewExamples to examples/
83833e8 [Michael Armbrust] more tests passing!
47c98d6 [Michael Armbrust] add query tests for like and hash.
1724c16 [Michael Armbrust] clear lines that contain last updated times.
cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse.
9b2642b [Michael Armbrust] make the blacklist support regexes
1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs
910e33e [Michael Armbrust] basic support for building an assembly jar.
d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore.
495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf.
65f4e69 [Michael Armbrust] remove incorrect comments
0831a3c [Michael Armbrust] support for parsing some operator udfs.
6c27aa7 [Michael Armbrust] more cast parsing.
43db061 [Michael Armbrust] significant generalization of hive udf functionality.
3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines.
e5690a6 [Michael Armbrust] add BinaryType
adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called.
d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]).
8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage.
21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function.
0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons.
2b70abf [Michael Armbrust] true and false literals.
ef8b0a5 [Michael Armbrust] more tests.
14d070f [Michael Armbrust] add support for correctly extracting partition keys.
0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
69a0bd4 [Michael Armbrust] promote strings in predicates with number too.
3946e31 [Michael Armbrust] don't build strings unless assertion fails.
90c453d [Michael Armbrust] more tests passing!
6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting.
8000504 [Michael Armbrust] Improve type coercion.
9087152 [Michael Armbrust] fix toString of Not.
58b111c [Michael Armbrust] fix bad scaladoc tag.
d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there.
ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness.
1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types.
d873b2b [Yin Huai] Remove numbers associated with test cases.
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown
d1e7b8e [Michael Armbrust] Update README.md
c8b1553 [Michael Armbrust] Update README.md
9307ef9 [Michael Armbrust] update list of passing tests.
934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers.
a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now.
ae0024a [Yin Huai] update
cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions
21976ae [Yin Huai] update
b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions
dedbf0c [Yin Huai] support Boolean literals
eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals
37817b5 [Yin Huai] add a comment to EvaluateLiterals.
468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
b1d1843 [Michael Armbrust] more work on big data benchmark tests.
cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark
7d7fa9f [Michael Armbrust] support for create table as
5f54f03 [Michael Armbrust] parsing for ASC
d42b725 [Michael Armbrust] Sum of strings requires cast
34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.)
81659cb [Michael Armbrust] implement transform operator.
5cd76d6 [Michael Armbrust] break up the file based test case code for reuse
1031b65 [Michael Armbrust] support for case insensitive resolution.
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots)
b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt
d9d18b4 [Michael Armbrust] debug logging implicit.
669089c [Yin Huai] support Boolean literals
ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals
73a05fd [Yin Huai] add a comment to EvaluateLiterals.
191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
80039cc [Yin Huai] Merge pull request #1 from yhuai/master
cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide
5c518e4 [Michael Armbrust] fix bug in test.
b50dd0e [Michael Armbrust] fix return type of overloaded method
05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview.
8c60cc0 [Michael Armbrust] Update README.md
03b9526 [Michael Armbrust] First draft of optimizer tests.
f392755 [Michael Armbrust] Add flatMap to TreeNode
6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings
15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions.
06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression.
4b0a888 [Michael Armbrust] implement string comparison and more casts.
356b321 [Michael Armbrust] Update README.md
3776395 [Michael Armbrust] Update README.md
304d17d [Michael Armbrust] Create README.md
b7d8be0 [Michael Armbrust] more tests passing.
b82481f [Michael Armbrust] add todo comment.
02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist.
cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support.
c43a259 [Michael Armbrust] comments
15ff448 [Michael Armbrust] better error message when a dsl test throws an exception
76ec650 [Michael Armbrust] fix join conditions
e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan.
91573a4 [Michael Armbrust] initial type promotion
e2ef4a5 [Michael Armbrust] logging
e43dc1e [Michael Armbrust] add string => int cast evaluation
f1f7e96 [Michael Armbrust] fix incorrect generation of join keys
2b27230 [Michael Armbrust] add depth based subtree access
0f6279f [Michael Armbrust] broken tests.
389bc0b [Michael Armbrust] support for partitioned columns in output.
12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name.
b67a225 [Michael Armbrust] better errors when types don't match up.
9e74808 [Michael Armbrust] add children resolved.
6d03ce9 [Michael Armbrust] defaults for unresolved relation
2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions
be5ae2c [Michael Armbrust] better resolution logging
cb7b5af [Michael Armbrust] views example
420e05b [Michael Armbrust] more tests passing!
6916c63 [Michael Armbrust] Reading from partitioned hive tables.
a1245f9 [Michael Armbrust] more tests passing
956e760 [Michael Armbrust] extended explain
5f14c35 [Michael Armbrust] more test tables supported
175c43e [Michael Armbrust] better errors for parse exceptions
480ade5 [Michael Armbrust] don't use partial cached results.
8a9d21c [Michael Armbrust] fix evaluation
7aee69c [Michael Armbrust] parsing for joins, boolean logic
7fcf480 [Michael Armbrust] test for and logic
3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines.
6902490 [Michael Armbrust] fix boolean logic evaluation
4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic
8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals
a5c0a1b [Michael Armbrust] more test harness improvements:
  * regex whitelist
  * side by side answer comparison (still needs formatting work)
60ec19d [Michael Armbrust] initial support for udfs
c45b440 [Michael Armbrust] support for is (not) null and boolean logic
7f4a1dc [Michael Armbrust] add NoRelation logical operator
72e183b [Michael Armbrust] support for null values in tree node args.
ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs
e5c9d1a [Michael Armbrust] use nonEmpty
dcc4fe1 [Michael Armbrust] support for src1 test table.
c78b587 [Michael Armbrust] casting.
75c3f3f [Michael Armbrust] add support for logging with scalalogging.
da2c011 [Michael Armbrust] make it more obvious when results are being truncated.
96b73ba [Michael Armbrust] more docs in TestShark
18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive.
e6d063b [Michael Armbrust] more join tests.
664c1c3 [Michael Armbrust] make parsing of function names case insensitive.
0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome.
1a6db68 [Michael Armbrust] spelling
7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes.
859d4c9 [Michael Armbrust] better argString printing of nested trees.
fc53615 [Michael Armbrust] add same instance comparisons for tree nodes.
a026e6b [Michael Armbrust] move out hive specific operators
fff4d1c [Michael Armbrust] add simple query execution debugging
e2120ab [Michael Armbrust] sorting for strings
da06eb6 [Michael Armbrust] Parsing for sortby and joins
9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId.
8eb2460 [Michael Armbrust] add system property to override whitelist.
88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet
d25b171 [Michael Armbrust] Add AND and OR expressions
67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll
12acf0a [Michael Armbrust] add .DS_Store for macs
f7da6ce [Michael Armbrust] add agg with grouping expr in select test
36805b3 [Michael Armbrust] pull out and improve aggregation
75613e1 [Michael Armbrust] better evaluations failure messages.
4789a35 [Michael Armbrust] weaken type since its hard to create pure references.
e89dd36 [Michael Armbrust] no newline for online trees
d0590d4 [Michael Armbrust] include stack trace for catalyst failures.
081c0d9 [Michael Armbrust] more generic computation of agg functions.
31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser
ecd45b2 [Michael Armbrust] Add more passing tests.
97d5419 [Michael Armbrust] fix alignment.
565cc13 [Michael Armbrust] make the canary query optional.
a95e65c [Michael Armbrust] support for resolving qualified attribute references.
e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails.
4640a0b [Michael Armbrust] handle test tables when database is specified.
bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis.
fec5158 [Michael Armbrust] add hive / idea files to .gitignore
3f97ffe [Michael Armbrust] Rename Hive => HiveQl
656b836 [Michael Armbrust] Support for parsing select clause aliases.
3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs.
3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception.
8cbef8a [Michael Armbrust] spelling
aa8c37c [Michael Armbrust] Better toString for SortOrder
1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions
a2e0327 [Michael Armbrust] add a bunch of tests.
4a3e1ea [Michael Armbrust] docs and use shark for data loading.
339bb8f [Michael Armbrust] better docs, Not support
1d7b2d9 [Michael Armbrust] Add NaN conversions.
46a2534 [Michael Armbrust] only run canary query on failure.
8996066 [Michael Armbrust] remove protected from makeCopy
53bcf41 [Michael Armbrust] testing improvements:
  * reset hive vars
  * delete indexes and tables
  * delete database
  * reset to use default database
  * record tests that pass
04a372a [Michael Armbrust] add a flag for running all tests.
3b2235b [Michael Armbrust] More general implementation of arithmetic.
edd7795 [Michael Armbrust] More testing improvements:
  * Check that results match for native commands
  * Ensure explain commands can be planned
  * Cache hive "golden" results
da6c577 [Michael Armbrust] add string <==> file utility functions.
3adf5ca [Michael Armbrust] Initial support for groupBy and count.
7bcd8a4 [Michael Armbrust] Improvements to comparison tests:
  * Sort answer when query doesn't contain an order by.
  * Display null values the same as Hive.
  * Print full query results in easy to read format when they differ.
a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product.
d66ba7e [Michael Armbrust] drop printlns.
88f2efd [Michael Armbrust] Add sum / count distinct expressions.
05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark
07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands.
b8a9910 [Michael Armbrust] quote tests passing.
8e5e267 [Michael Armbrust] handle aliased select expressions.
4286a96 [Michael Armbrust] drop debugging println
ac34aeb [Michael Armbrust] proof of concept for hive ast transformations.
2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments
ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general.
74a6337 [Michael Armbrust] use fastEquals when doing transformations.
1184a23 [Michael Armbrust] add native test for escapes.
b972b18 [Michael Armbrust] create BaseRelation class
fa6bce9 [Michael Armbrust] implement union
6391a87 [Michael Armbrust] count aggregate.
d47c317 [Michael Armbrust] add unary minus, more tests passing.
c7114e4 [Michael Armbrust] first draft of star expansion.
044c43d [Michael Armbrust] better support for numeric literal parsing.
1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view.
61503c5 [Michael Armbrust] add cached toRdd
2036883 [Michael Armbrust] skip explain queries when testing.
ebac4b1 [Michael Armbrust] fix bug in sort reference calculation
ca0dee0 [Michael Armbrust] docs.
1ee0471 [Michael Armbrust] string literal parsing.
357278b [Michael Armbrust] add limit support
9b3e479 [Michael Armbrust] creation of string literals.
02efa30 [Michael Armbrust] alias evaluation
cb68b33 [Michael Armbrust] parsing for random sample in hive ql.
126dd36 [Michael Armbrust] include query plans in failure output
bb59ae9 [Michael Armbrust] doc fixes
7e68286 [Michael Armbrust] fix confusing naming
768bb25 [Michael Armbrust] handle errors in shark query toString
829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore.
ad02e41 [Michael Armbrust] comment jdo dependency
7bc89fe [Michael Armbrust] add collect to TreeNode.
438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs.
09679ee [Michael Armbrust] fix bug in TreeNode foreach
2930b27 [Michael Armbrust] more specific name for del query tests.
8842549 [Michael Armbrust] docs.
da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL.
a8969b9 [Michael Armbrust] Factor out hive query comparison test framework.
1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations.
a36dd9a [Michael Armbrust] evaluation for other > data types.
cae729b [Michael Armbrust] remove unnecessary lazy vals.
d8e12af [Michael Armbrust] docs
3a60d67 [Michael Armbrust] implement average, placeholder for count
f05c106 [Michael Armbrust] checkAnswer handles single row results.
2730534 [Michael Armbrust] implement inputSet
a9aa79d [Michael Armbrust] debugging for sort exec
8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors.
554b4b2 [Michael Armbrust] BoundAttribute pretty printing.
754f5fa [Michael Armbrust] dsl for setting nullability
a206d7a [Michael Armbrust] clean up query tests.
84ad6ef [Michael Armbrust] better sort implementation and tests.
de24923 [Michael Armbrust] add double type.
9611a2c [Michael Armbrust] literal creation for doubles.
7358313 [Michael Armbrust] sort order returns child type.
b544715 [Michael Armbrust] implement eval for rand, and > for doubles
7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols)
1c1a35e [Michael Armbrust] add simple Rand expression.
3ca51de [Michael Armbrust] add orderBy to dsl
7ae41ab [Michael Armbrust] more literal implicit conversions
b18b675 [Michael Armbrust] First cut at native query tests for shark.
d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark.
5eac895 [Michael Armbrust] better error when descending is specified.
2b16f86 [Michael Armbrust] add todo
e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization
9dde3c8 [Michael Armbrust] add project and filter operations.
ad9037b [Michael Armbrust] Add support for local relations.
6227143 [Michael Armbrust] evaluation of Equals.
7526290 [Michael Armbrust] BoundReference should also be an Attribute.
bd33e26 [Michael Armbrust] more documentation
5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements.
0ae292b [Michael Armbrust] implement calculation of sort expressions.
9fd5011 [Michael Armbrust] First cut at expression evaluation.
6259e3a [Michael Armbrust] cleanup
787e5a2 [Michael Armbrust] use fastEquals
f90da36 [Michael Armbrust] better printing of optimization exceptions
b05dd67 [Michael Armbrust] Application of rules to fixed point.
bb2e0db [Michael Armbrust] pretty print for literals.
1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals.
d3a3687 [Michael Armbrust] add fastEquals
2b4935b [Michael Armbrust] set sbt.version explicitly
46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests.
c79f2fd [Michael Armbrust] insert operator should return an empty rdd.
14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input.
ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually.
84082f9 [Michael Armbrust] add sbt binaries and scripts
15371a8 [Michael Armbrust] First draft of simple Hive DDL parser.
063bf44 [Michael Armbrust] Periods should end all comments.
e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack.
ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing!
b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust.
e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected.
26f410a [Michael Armbrust] physical traits should extend PhysicalPlan.
dc72469 [Michael Armbrust] beginning of hive compatibility testing framework.
0763490 [Michael Armbrust] support for hive native command pass-through.
d8a924f [Michael Armbrust] scaladoc
29a7163 [Michael Armbrust] Insert into hive table physical operator.
633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy.
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
// The columns of a row in the result can be accessed by field index:
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
Spark SQL supports automatically converting an RDD of
[JavaBeans](http://stackoverflow.com/questions/3295496/what-is-a-javabean-exactly) into a DataFrame.
The `BeanInfo`, obtained using reflection, defines the schema of the table. Currently, Spark SQL
does not support JavaBeans that contain `Map` fields; nested JavaBeans and `List` or `Array`
fields are supported. You can create a JavaBean by writing a class that implements
`Serializable` and has getters and setters for all of its fields.
{% highlight java %}
public static class Person implements Serializable {
  private String name;
  private int age;

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public int getAge() {
    return age;
  }

  public void setAge(int age) {
    this.age = age;
  }
}
{% endhighlight %}
A schema can be applied to an existing RDD by calling `createDataFrame` and providing the Class object
for the JavaBean.
{% highlight java %}
SparkSession spark = ...; // An existing SparkSession
// Load a text file and convert each line to a JavaBean.
JavaRDD<Person> people = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(new Function<String, Person>() {
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });
// Apply a schema to an RDD of JavaBeans and register it as a table.
Dataset<Row> schemaPeople = spark.createDataFrame(people, Person.class);
schemaPeople.createOrReplaceTempView("people");
// SQL can be run over DataFrames that have been registered as temporary views.
Dataset<Row> teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
// The columns of a row in the result can be accessed by ordinal.
// Dataset.map needs an explicit Encoder for the result type.
List<String> teenagerNames = teenagers.map(new MapFunction<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}, Encoders.STRING()).collectAsList();
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the data types. Rows are constructed by passing a list of
key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
{% highlight python %}
# spark is an existing SparkSession.
from pyspark.sql import Row
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")
# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are DataFrames; use .rdd to apply normal RDD operations.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
  print(teenName)
{% endhighlight %}
</div>
</div>
### Programmatically Specifying the Schema
<div class="codetabs">
<div data-lang="scala" markdown="1">
When case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a `DataFrame` can be created programmatically in three steps.
1. Create an RDD of `Row`s from the original RDD.
2. Create the schema, represented by a `StructType`, that matches the structure of
the `Row`s in the RDD created in Step 1.
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.
For example:
{% highlight scala %}
val spark: SparkSession // An existing SparkSession
// Create an RDD (the SparkContext is obtained from the existing SparkSession)
val people = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Row.
import org.apache.spark.sql.Row
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Generate the schema based on the string of schema
val schema = StructType(schemaString.split(" ").map { fieldName =>
  StructField(fieldName, StringType, true)
})
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleDataFrame = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame.
peopleDataFrame.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by spark.
val results = spark.sql("SELECT name FROM people")
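// The columns of a row in the result can be accessed by field index or by field name.
// Dataset.map needs an implicit Encoder for the result type, so import the implicits first.
import spark.implicits._
results.map(t => "Name: " + t(0)).collect().foreach(println)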
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
When JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `Dataset<Row>` can be created programmatically in three steps.
1. Create an RDD of `Row`s from the original RDD.
2. Create the schema, represented by a `StructType`, that matches the structure of
the `Row`s in the RDD created in Step 1.
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided
by `SparkSession`.
For example:
{% highlight java %}
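import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
// Import factory methods provided by DataTypes.
import org.apache.spark.sql.types.DataTypes;
// Import StructType and StructField
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StructField;
// Import Dataset and SparkSession.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;
// Import Row.
import org.apache.spark.sql.Row;
// Import RowFactory.
import org.apache.spark.sql.RowFactory;
SparkSession spark = ...; // An existing SparkSession.
// Wrap the session's SparkContext so that the Java RDD API can be used.
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
// Load a text file as an RDD of lines.
JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt");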
// The schema is encoded in a string
String schemaString = "name age";
// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName: schemaString.split(" ")) {
  fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);
// Convert records of the RDD (people) to Rows.
JavaRDD<Row> rowRDD = people.map(
  new Function<String, Row>() {
    @Override
    public Row call(String record) throws Exception {
      // Named "attributes" so that it does not shadow the schema's "fields" list above.
      String[] attributes = record.split(",");
      return RowFactory.create(attributes[0], attributes[1].trim());
    }
  });
// Apply the schema to the RDD.
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);
// Creates a temporary view using the DataFrame.
peopleDataFrame.createOrReplaceTempView("people");
// SQL can be run over a temporary view created using DataFrames.
Dataset<Row> results = spark.sql("SELECT name FROM people");
// The results of SQL queries are DataFrames; convert one to a JavaRDD to use the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
List<String> names = results.javaRDD().map(new Function<Row, String>() {
  @Override
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
When a dictionary of kwargs cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `DataFrame` can be created programmatically in three steps.
1. Create an RDD of tuples or lists from the original RDD.
2. Create the schema, represented by a `StructType`, that matches the structure of
the tuples or lists in the RDD created in Step 1.
3. Apply the schema to the RDD via the `createDataFrame` method provided by `SparkSession`.
For example:
{% highlight python %}
# Import data types
from pyspark.sql.types import *
# spark is an existing SparkSession.
sc = spark.sparkContext
# Load a text file and convert each line to a tuple.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
# Creates a temporary view using the DataFrame
schemaPeople.createOrReplaceTempView("people")
# SQL can be run over DataFrames that have been registered as a temporary view.
results = spark.sql("SELECT name FROM people")
# The results of SQL queries are DataFrames; use .rdd to apply the normal RDD operations.
names = results.rdd.map(lambda p: "Name: " + p.name)
for name in names.collect():
    print(name)
{% endhighlight %}
</div>
</div>
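The examples above declare every column as a string, so `age` stays textual. A schema built at runtime can also carry other types, provided the values in each row are converted to match. The following is a minimal Python sketch of that idea. It assumes a hypothetical `name:type` notation for the schema string, and the helper names (`parse_schema`, `to_typed_row`, `to_dataframe`) are illustrative only and not part of the Spark API.

```python
# Hypothetical helpers for illustration; none of these names come from Spark.

# Maps a type name used in the schema string to a Python conversion function.
CASTS = {"string": str, "int": int}


def parse_schema(schema_string):
    """Split 'name:string age:int' into [('name', 'string'), ('age', 'int')]."""
    return [tuple(field.split(":")) for field in schema_string.split()]


def to_typed_row(line, fields):
    """Turn one comma-separated line into a tuple whose values match the declared types."""
    values = [value.strip() for value in line.split(",")]
    return tuple(CASTS[type_name](value)
                 for value, (_, type_name) in zip(values, fields))


def to_dataframe(spark, lines_rdd, schema_string):
    """Apply the three steps to an RDD of text lines; requires an existing SparkSession."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark_types = {"string": StringType(), "int": IntegerType()}
    fields = parse_schema(schema_string)
    # Step 1: an RDD of tuples whose values already have the right Python types.
    rows = lines_rdd.map(lambda line: to_typed_row(line, fields))
    # Step 2: a StructType that matches those tuples field by field.
    schema = StructType([StructField(name, spark_types[type_name], True)
                         for name, type_name in fields])
    # Step 3: apply the schema to the RDD.
    return spark.createDataFrame(rows, schema)
```

With an existing `SparkSession` named `spark`, a call such as `to_dataframe(spark, spark.sparkContext.textFile("examples/src/main/resources/people.txt"), "name:string age:int")` would be expected to produce a DataFrame whose `age` column is an integer, which allows numeric comparisons in SQL without a cast.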
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
# Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
A DataFrame can be operated on using relational transformations and can also be used to create a temporary view.
Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section
describes the general methods for loading and saving data using the Spark Data Sources and then
goes into specific options that are available for the built-in data sources.
## Generic Load/Save Functions
In the simplest form, the default data source (`parquet` unless otherwise configured by
`spark.sql.sources.default`) will be used for all operations.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
val df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
Dataset<Row> df = spark.read().load("examples/src/main/resources/users.parquet");
df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}
df <- read.df("examples/src/main/resources/users.parquet")
write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
{% endhighlight %}
</div>
</div>
### Manually Specifying Options
You can also manually specify the data source that will be used along with any extra options
that you would like to pass to the data source. Data sources are specified by their fully qualified
name (e.g., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short
names (`json`, `parquet`, `jdbc`). A DataFrame loaded from any data source type can be written out
in any other supported format using this syntax.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
val df = spark.read.format("json").load("examples/src/main/resources/people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
Dataset<Row> df = spark.read().format("json").load("examples/src/main/resources/people.json");
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
df = spark.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}
df <- read.df("examples/src/main/resources/people.json", "json")
write.df(select(df, "name", "age"), "namesAndAges.parquet", "parquet")
{% endhighlight %}
</div>
</div>
### Run SQL on files directly
Instead of using the read API to load a file into a DataFrame and then querying it, you can
query the file directly with SQL.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
val df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
Dataset<Row> df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}
df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
{% endhighlight %}
</div>
</div>
### Save Modes
Save operations can optionally take a `SaveMode` that specifies how to handle existing data if
present. It is important to realize that these save modes do not use any locking and are not
atomic. Additionally, when performing an `Overwrite`, the existing data is deleted before the
new data is written out.
<table class="table">
<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
<tr>
<td><code>SaveMode.ErrorIfExists</code> (default)</td>
<td><code>"error"</code> (default)</td>
<td>
When saving a DataFrame to a data source, if data already exists,
an exception is expected to be thrown.
</td>
</tr>
<tr>
<td><code>SaveMode.Append</code></td>
<td><code>"append"</code></td>
<td>
When saving a DataFrame to a data source, if data/table already exists,
contents of the DataFrame are expected to be appended to existing data.
</td>
</tr>
<tr>
<td><code>SaveMode.Overwrite</code></td>
<td><code>"overwrite"</code></td>
<td>
Overwrite mode means that when saving a DataFrame to a data source,
if data/table already exists, existing data is expected to be overwritten by the contents of
the DataFrame.
</td>
</tr>
<tr>
<td><code>SaveMode.Ignore</code></td>
<td><code>"ignore"</code></td>
<td>
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
the save operation is expected to not save the contents of the DataFrame and to not
change the existing data. This is similar to a <code>CREATE TABLE IF NOT EXISTS</code> in SQL.
</td>
</tr>
</table>
### Saving to Persistent Tables
DataFrames can also be saved as persistent tables in the Hive metastore using the `saveAsTable`
command. Note that an existing Hive deployment is not necessary to use this feature; Spark will
create a default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceTempView` command,
`saveAsTable` will materialize the contents of the DataFrame and create a pointer to the data in the
Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
long as you maintain your connection to the same metastore. A DataFrame for a persistent table can
be created by calling the `table` method on a `SparkSession` with the name of the table.
By default, `saveAsTable` will create a "managed table", meaning that the location of the data will
be controlled by the metastore. Managed tables will also have their data deleted automatically
when the table is dropped.
## Parquet Files
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as input expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribution for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
[SPARK-1566] consolidate programming guide, and general doc updates This is a fairly large PR to clean up and update the docs for 1.0. The major changes are: * A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs * New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark * Spark-submit guide moved to a separate page and expanded slightly * Various cleanups of the menu system, security docs, and others * Updated look of title bar to differentiate the docs from previous Spark versions You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html. Author: Matei Zaharia <matei@databricks.com> Closes #896 from mateiz/1.0-docs and squashes the following commits: 03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs 0779508 [Matei Zaharia] tweak ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks 1bf4112 [Matei Zaharia] Review comments 4414f88 [Matei Zaharia] tweaks d04e979 [Matei Zaharia] Fix some old links to Java guide a34ed33 [Matei Zaharia] tweak 541bb3b [Matei Zaharia] miscellaneous changes fcefdec [Matei Zaharia] Moved submitting apps to separate doc 61d72b4 [Matei Zaharia] stuff 181f217 [Matei Zaharia] migration guide, remove old language guides e11a0da [Matei Zaharia] Add more API functions 6a030a9 [Matei Zaharia] tweaks 8db0ae3 [Matei Zaharia] Added key-value pairs section 318d2c9 [Matei Zaharia] tweaks 1c81477 [Matei Zaharia] New section on basics and function syntax e38f559 [Matei Zaharia] Actually added programming guide to Git a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout 3b6a876 [Matei Zaharia] More CSS tweaks 01ec8bf [Matei Zaharia] More CSS tweaks e6d252e [Matei Zaharia] Change color of doc 
title bar to differentiate from 0.9.0
2014-05-30 03:34:33 -04:00
[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
Spark SQL supports both reading and writing Parquet files, automatically preserving the schema
of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
compatibility reasons.
### Loading Data Programmatically
Using the data from the above example:
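
A minimal sketch of the round trip described above, not the guide's own listing. It assumes a Spark release that provides the `SparkSession` API; older releases expose the same operations through `SQLContext`. The `Person` case class, the inline sample rows, the `people.parquet` path, and the `parquetFile` view name are placeholders standing in for the data from the earlier example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for the rows of the earlier example.
case class Person(name: String, age: Int)

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .appName("ParquetRoundTrip")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF() and the String encoder used below

    // Placeholder rows; in the guide these would come from the earlier example's data.
    val people = Seq(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)).toDF()

    // Write the DataFrame as Parquet. The path is a placeholder.
    people.write.mode("overwrite").parquet("people.parquet")

    // Read it back. Parquet files are self-describing, so no schema is supplied here.
    val parquetFile = spark.read.parquet("people.parquet")

    // Print the schema recovered from the file, including each column's nullability.
    parquetFile.printSchema()

    // A Parquet-backed DataFrame can be registered as a view and queried with SQL.
    parquetFile.createOrReplaceTempView("parquetFile")
    val teenagers = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
    teenagers.map(row => "Name: " + row.getString(0)).collect().foreach(println)

    spark.stop()
  }
}
```

Printing the schema after reading is a quick way to check the two claims above: the column names and types survive the round trip, and the columns come back marked as nullable.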
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as input expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribution for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be used for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 characters wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
<div class="codetabs">
<div data-lang="scala" markdown="1">
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
{% highlight scala %}
// spark from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
// Implicit conversions turn the RDD into a DataFrame, allowing it to be stored using Parquet.
people.write.parquet("people.parquet")
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
// Read in the Parquet file created above. Parquet files are self-describing, so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
val parquetFile = spark.read.parquet("people.parquet")
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be used for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 characters wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for metastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since it's hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive metastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
// Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView("parquetFile")
val teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
// spark from the previous example is used in this example.
Dataset<Row> schemaPeople = ... // The DataFrame from the previous example.
// DataFrames can be saved as Parquet files, maintaining the schema information.
schemaPeople.write().parquet("people.parquet");
// Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a parquet file is also a DataFrame.
Dataset<Row> parquetFile = spark.read().parquet("people.parquet");
// Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView("parquetFile");
Dataset<Row> teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
# spark from the previous example is used in this example.
schemaPeople # The DataFrame from the previous example.
# DataFrames can be saved as Parquet files, maintaining the schema information.
schemaPeople.write.parquet("people.parquet")
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile = spark.read.parquet("people.parquet")
# Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView("parquetFile")
teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
# Go through the underlying RDD to apply a Python function to each Row.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}
schemaPeople # The SparkDataFrame from the previous example.
# SparkDataFrames can be saved as Parquet files, maintaining the schema information.
write.parquet(schemaPeople, "people.parquet")
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a SparkDataFrame.
parquetFile <- read.parquet("people.parquet")
# Parquet files can also be used to create a temporary view and then used in SQL statements.
createOrReplaceTempView(parquetFile, "parquetFile")
teenagers <- sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
schema <- structType(structField("name", "string"))
teenNames <- dapply(teenagers, function(p) { cbind(paste("Name:", p$name)) }, schema)
for (teenName in collect(teenNames)$name) {
  cat(teenName, "\n")
}
{% endhighlight %}
</div>
<div data-lang="sql" markdown="1">
{% highlight sql %}
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "examples/src/main/resources/people.parquet"
)
SELECT * FROM parquetTable
{% endhighlight %}
</div>
</div>
### Partition Discovery
Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. The Parquet data source is now able to discover and infer
partitioning information automatically. For example, we can store all our previously used
population data into a partitioned table using the following directory structure, with two extra
columns, `gender` and `country` as partitioning columns:
{% highlight text %}
path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...
{% endhighlight %}
By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
will automatically extract the partitioning information from the paths.
Now the schema of the returned DataFrame becomes:
{% highlight text %}
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)
{% endhighlight %}
Notice that the data types of the partitioning columns are automatically inferred. Currently,
numeric data types and the string type are supported. Sometimes users may not want to automatically
infer the data types of the partitioning columns. For these use cases, the automatic type inference
can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to
`true`. When type inference is disabled, the string type is used for the partitioning columns.
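
The effect of this setting can be sketched in plain Python. The snippet is a simplified illustration rather than Spark's implementation: `infer_partition_value` is a hypothetical helper, and the int-then-float-then-string order is an assumption made for the example.

```python
def infer_partition_value(raw, infer_types=True):
    """Convert a raw partition value taken from a directory name.

    Simplified illustration (not Spark code): tries int, then float, and
    otherwise keeps the string. Spark's own inference rules are more detailed.
    """
    if not infer_types:
        # Inference disabled: every partition value stays a string.
        return raw
    for convert in (int, float):
        try:
            return convert(raw)
        except ValueError:
            continue
    return raw


print(infer_partition_value("2015"))
print(infer_partition_value("US"))
print(infer_partition_value("2015", infer_types=False))
```

With inference enabled, a directory such as `year=2015` would yield a numeric column, while `country=US` stays a string. With inference disabled, both stay strings.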
Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
by default. For the above example, if users pass `path/to/table/gender=male` to either
`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a
partitioning column. If users need to specify the base path that partition discovery
should start with, they can set `basePath` in the data source options. For example,
when `path/to/table/gender=male` is the path of the data and
users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
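
As a rough illustration of this rule, the following plain-Python sketch collects the `key=value` directory names that lie between the base path and the data path. `partition_columns` is a hypothetical helper, not a Spark API. It also ignores partition directories below the data path (such as `country=US`), which Spark would still discover.

```python
def partition_columns(data_path, base_path=None):
    """Return the key=value directory names between base_path and data_path.

    Hypothetical helper for illustration only. When no base path is given,
    discovery starts at the data path itself, so nothing above it is used.
    """
    data_path = data_path.rstrip("/")
    start = (base_path or data_path).rstrip("/")
    if not data_path.startswith(start):
        raise ValueError("data path must be located under the base path")
    columns = {}
    for segment in data_path[len(start):].strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns


# Without a base path, gender is not treated as a partitioning column.
print(partition_columns("path/to/table/gender=male"))
# With the base path set to the table root, gender becomes a partitioning column.
print(partition_columns("path/to/table/gender=male", base_path="path/to/table/"))
```

With Spark itself, the second case corresponds to `spark.read.option("basePath", "path/to/table/").parquet("path/to/table/gender=male")`.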
### Schema Merging
Like Protocol Buffers, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
source can automatically detect this case and merge the schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, it
has been turned off by default since Spark 1.5.0. You may enable it by
1. setting data source option `mergeSchema` to `true` when reading Parquet files (as shown in the
   examples below), or
2. setting the global SQL option `spark.sql.parquet.mergeSchema` to `true`.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
// spark from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._
// Create a simple DataFrame, stored into a partition directory
val df1 = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val df2 = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")
// Read the partitioned table
val df3 = spark.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column that appears in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key: int (nullable = true)
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
# spark from the previous example is used in this example.
from pyspark.sql import Row
sc = spark.sparkContext
# Create a simple DataFrame, stored into a partition directory
df1 = spark.createDataFrame(sc.parallelize(range(1, 6))
                            .map(lambda i: Row(single=i, double=i * 2)))
df1.write.parquet("data/test_table/key=1")
# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
df2 = spark.createDataFrame(sc.parallelize(range(6, 11))
                            .map(lambda i: Row(single=i, triple=i * 3)))
df2.write.parquet("data/test_table/key=2")
# Read the partitioned table
df3 = spark.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()
# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column that appears in the partition directory paths.
# root
# |-- single: int (nullable = true)
# |-- double: int (nullable = true)
# |-- triple: int (nullable = true)
# |-- key: int (nullable = true)
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}
# Create a simple DataFrame, stored into a partition directory
df1 <- createDataFrame(data.frame(single = 1:5, double = (1:5) * 2L))
write.df(df1, "data/test_table/key=1", "parquet", "overwrite")
# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
df2 <- createDataFrame(data.frame(single = 6:10, triple = (6:10) * 3L))
write.df(df2, "data/test_table/key=2", "parquet", "overwrite")
# Read the partitioned table
df3 <- read.df("data/test_table", "parquet", mergeSchema="true")
printSchema(df3)
# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column that appears in the partition directory paths.
# root
# |-- single: int (nullable = true)
# |-- double: int (nullable = true)
# |-- triple: int (nullable = true)
# |-- key: int (nullable = true)
{% endhighlight %}
</div>
</div>
### Hive metastore Parquet table conversion
When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
#### Hive/Parquet Schema Reconciliation
There are two key differences between Hive and Parquet from the perspective of table schema
processing.
1. Hive is case insensitive, while Parquet is not.
1. Hive considers all columns nullable, while nullability in Parquet is significant.
For these reasons, we must reconcile the Hive metastore schema with the Parquet schema when converting a
Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:
1. Fields that have the same name in both schemas must have the same data type regardless of
   nullability. The reconciled field should have the data type of the Parquet side, so that
   nullability is respected.
1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
     reconciled schema.
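
The rules above can be sketched in plain Python. This is an illustration of the rules as stated, not Spark's internal code. `reconcile` and the `(name, data_type, nullable)` tuple layout are hypothetical. Matching names case-insensitively and keeping the Hive spelling are assumptions made for the example.

```python
def reconcile(hive_schema, parquet_schema):
    """Reconcile two schemas given as lists of (name, data_type, nullable) tuples.

    Illustrative only. Assumption: field names are matched case-insensitively
    and the spelling from the Hive metastore schema is kept.
    """
    parquet_by_name = {name.lower(): (dtype, nullable)
                       for name, dtype, nullable in parquet_schema}
    reconciled = []
    for name, hive_type, _ in hive_schema:
        match = parquet_by_name.get(name.lower())
        if match is None:
            # Only in the Hive metastore schema: added as a nullable field.
            reconciled.append((name, hive_type, True))
            continue
        parquet_type, parquet_nullable = match
        if parquet_type != hive_type:
            raise ValueError("conflicting data types for field %r" % name)
        # In both schemas: take the Parquet side so nullability is respected.
        reconciled.append((name, parquet_type, parquet_nullable))
    # Fields found only in the Parquet schema are never visited, so they are dropped.
    return reconciled


hive = [("id", "int", True), ("name", "string", True), ("email", "string", True)]
parquet = [("ID", "int", False), ("name", "string", True), ("age", "int", True)]
print(reconcile(hive, parquet))
```

In this example `id` keeps the non-nullable flag from the Parquet side, `email` is added as nullable, and `age` is dropped because the Hive metastore schema does not define it.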
#### Metadata Refreshing
Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
conversion is enabled, metadata of those converted tables is also cached. If these tables are
updated by Hive or other external tools, you need to refresh them manually to ensure consistent
metadata.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
// spark is an existing SparkSession
spark.catalog.refreshTable("my_table")
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
// spark is an existing SparkSession
spark.catalog().refreshTable("my_table");
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
# spark is an existing SparkSession
spark.catalog.refreshTable("my_table")
{% endhighlight %}
</div>
<div data-lang="sql" markdown="1">
{% highlight sql %}
REFRESH TABLE my_table;
{% endhighlight %}
</div>
</div>
### Configuration
Configuration of Parquet can be done using the `conf.set` method of a `SparkSession` or by running
`SET key=value` commands using SQL.
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.sql.parquet.binaryAsString</code></td>
<td>false</td>
<td>
Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do
not differentiate between binary data and strings when writing out the Parquet schema. This
flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.int96AsTimestamp</code></td>
<td>true</td>
<td>
Some Parquet-producing systems, in particular Impala and Hive, store timestamps as INT96. This
flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.cacheMetadata</code></td>
<td>true</td>
<td>
Turns on caching of Parquet schema metadata. Can speed up querying of static data.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.compression.codec</code></td>
<td>gzip</td>
<td>
Sets the compression codec used when writing Parquet files. Acceptable values include:
uncompressed, snappy, gzip, lzo.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.filterPushdown</code></td>
<td>true</td>
<td>Enables Parquet filter push-down optimization when set to true.</td>
</tr>
<tr>
<td><code>spark.sql.hive.convertMetastoreParquet</code></td>
<td>true</td>
<td>
When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in
support.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.mergeSchema</code></td>
<td>false</td>
<td>
<p>
When true, the Parquet data source merges schemas collected from all data files. Otherwise, the
schema is picked from the summary file, or from a random data file if no summary file is available.
</p>
</td>
</tr>
</table>
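
As a minimal sketch of the two ways to set these options, assuming a Spark 2.x `SparkSession` where runtime options are set through `spark.conf.set`. The application name and the chosen values are placeholders for illustration, not recommendations.

```python
from pyspark.sql import SparkSession

# Assumption: a local or cluster Spark installation is available.
spark = SparkSession.builder.appName("parquet-config-sketch").getOrCreate()

# Option 1: set a property through the session's runtime configuration.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Option 2: run a SET command through SQL.
spark.sql("SET spark.sql.parquet.filterPushdown=true")

# Read a value back to confirm what the session will use.
print(spark.conf.get("spark.sql.parquet.compression.codec"))
```

Both forms change the setting for the current session only. Properties that are not set explicitly keep the defaults listed in the table.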
## JSON Datasets
<div class="codetabs">
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
[SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL JIRA: https://issues.apache.org/jira/browse/SPARK-2060 Programming guide: http://yhuai.github.io/site/sql-programming-guide.html Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext Author: Yin Huai <huai@cse.ohio-state.edu> Closes #999 from yhuai/newJson and squashes the following commits: 227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ce8eedd [Yin Huai] rxin's comments. bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 94ffdaa [Yin Huai] Remove "get" from method names. ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 79ea9ba [Yin Huai] Fix typos. 5428451 [Yin Huai] Newline 1f908ce [Yin Huai] Remove extra line. d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 7ea750e [Yin Huai] marmbrus's comments. 6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 83013fb [Yin Huai] Update Java Example. e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map. 6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4fbddf0 [Yin Huai] Programming guide. 9df8c5a [Yin Huai] Python API. 7027634 [Yin Huai] Java API. cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset. d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ab810b0 [Yin Huai] Make JsonRDD private. 6df0891 [Yin Huai] Apache header. 8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema. 8ffed79 [Yin Huai] Update the example. 
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution. 65b87f0 [Yin Huai] Fix sampling... 8846af5 [Yin Huai] API doc. 52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 0387523 [Yin Huai] Address PR comments. 666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson a2313a6 [Yin Huai] Address PR comments. f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used. 0576406 [Yin Huai] Add Apache license header. af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD. f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
2014-06-17 22:14:59 -04:00
<div data-lang="scala" markdown="1">
Spark SQL can automatically infer the schema of a JSON dataset and load it as a `Dataset[Row]`.
This conversion can be done using `SparkSession.read.json()` on either an RDD of String
or a JSON file.

Note that the file offered as _a JSON file_ is not a typical JSON file. Each
line must contain a separate, self-contained, valid JSON object. As a consequence,
a regular multi-line JSON file will most often fail to load.
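
The following is a minimal sketch of both input forms, not the guide's official example. The temporary file, its three records, the `people` view name, and the local-mode session are assumptions made for illustration; in `spark-shell` a `spark` session already exists and the builder call can be dropped.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .appName("JsonLinesSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input: one complete JSON object per line.
val jsonLines =
  """{"name":"Michael"}
    |{"name":"Andy","age":30}
    |{"name":"Justin","age":19}""".stripMargin
val jsonPath = Files.createTempFile("people", ".json")
Files.write(jsonPath, jsonLines.getBytes(StandardCharsets.UTF_8))

// Load the file; the schema is inferred from the records.
val people = spark.read.json(jsonPath.toUri.toString)
people.printSchema()

// Register the result as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
val teenagers = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.show()

// Alternatively, start from an RDD[String] holding one JSON object per element.
// Note: newer Spark releases mark this overload as deprecated.
val otherPeopleRDD = spark.sparkContext.makeRDD(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
```

Records that lack a field, such as the first one above, simply get a null in that column, so the `age` filter skips them.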
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
{% highlight scala %}
val spark: SparkSession // An existing SparkSession
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
[SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL JIRA: https://issues.apache.org/jira/browse/SPARK-2060 Programming guide: http://yhuai.github.io/site/sql-programming-guide.html Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext Author: Yin Huai <huai@cse.ohio-state.edu> Closes #999 from yhuai/newJson and squashes the following commits: 227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ce8eedd [Yin Huai] rxin's comments. bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 94ffdaa [Yin Huai] Remove "get" from method names. ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 79ea9ba [Yin Huai] Fix typos. 5428451 [Yin Huai] Newline 1f908ce [Yin Huai] Remove extra line. d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 7ea750e [Yin Huai] marmbrus's comments. 6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 83013fb [Yin Huai] Update Java Example. e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map. 6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4fbddf0 [Yin Huai] Programming guide. 9df8c5a [Yin Huai] Python API. 7027634 [Yin Huai] Java API. cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset. d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ab810b0 [Yin Huai] Make JsonRDD private. 6df0891 [Yin Huai] Apache header. 8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema. 8ffed79 [Yin Huai] Update the example. 
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution. 65b87f0 [Yin Huai] Fix sampling... 8846af5 [Yin Huai] API doc. 52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 0387523 [Yin Huai] Address PR comments. 666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson a2313a6 [Yin Huai] Address PR comments. f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used. 0576406 [Yin Huai] Add Apache license header. af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD. f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
2014-06-17 22:14:59 -04:00
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
val people = spark.read.json(path)
// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people")
// SQL statements can be run using the sql method provided by spark.
val teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
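// Added illustration (not part of the original example): the query result is
// itself a DataFrame, so one way to inspect it is show(), which prints the
// first rows in tabular form.
teenagers.show()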
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = spark.read.json(anotherPeopleRDD)
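// Added illustration (not part of the original example): the nested "address"
// object should be inferred as a struct column, which can be checked by
// printing the schema of the new DataFrame.
anotherPeople.printSchema()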
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license
1d0eb63 [Michael Armbrust] update changes with spark core
e5e1d6b [Michael Armbrust] Remove travis configuration.
c2efad6 [Michael Armbrust] First draft of SQL documentation.
013f62a [Michael Armbrust] Fix documentation / code style.
c01470f [Michael Armbrust] Clean up example
2f22454 [Michael Armbrust] WIP: Parquet example.
ce8073b [Michael Armbrust] clean up implicits.
f7d992d [Michael Armbrust] Naming / spelling.
9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext.
d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present.
8b35e0a [Michael Armbrust] address feedback, work on DSL
5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes
f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation
1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven
3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes
3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context.
7233a74 [Michael Armbrust] initial support for maven builds
f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven
7386a9f [Michael Armbrust] Initial example programs using spark sql.
aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow
7ca4b4e [Andre Schumacher] Improving checks in Parquet tests
5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport
54637ec [Andre Schumacher] First part of second round of code review feedback
c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows
ba28849 [Michael Armbrust] code review comments.
d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization.
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box.
959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream.
d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path.
c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support
3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR.
7d0f13e [Michael Armbrust] Update parquet support with master.
9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support
0040ae6 [Andre Schumacher] Feedback from code review
1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow
70e489d [Cheng Lian] Fixed a spelling typo
6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object.
8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly
99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval
7b9d142 [Michael Armbrust] Update travis to increase permgen size.
da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS.
6fdefe6 [Michael Armbrust] Port sbt improvements from master.
296fe50 [Michael Armbrust] Address review feedback.
d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework.
3bda72d [Andre Schumacher] Adding license banner to new files
3ac9eb0 [Andre Schumacher] Rebasing to new main branch
c863bed [Andre Schumacher] Codestyle checks
61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite
3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala
3a0a552 [Andre Schumacher] Reorganizing Parquet table operations
18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables
75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies
f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection
6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan
0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport
a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport
6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core
eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase
99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types
b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types
c334386 [Michael Armbrust] Initial support for generating schema's based on case classes.
608a29e [Michael Armbrust] Add hive as a repl dependency
7413ac2 [Michael Armbrust] make test downloading quieter.
4d57d0e [Michael Armbrust] Fix test execution on travis.
5f2963c [Michael Armbrust] naming and continuous compilation fixes.
f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent.
3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject.
2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes
d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query.
24eaa79 [Michael Armbrust] fix > 100 chars
6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL.
df88f01 [Michael Armbrust] add a simple test for aggregation
18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data.
b922511 [Michael Armbrust] Fix insertion of nested types into hive tables.
5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function.
a430895 [Michael Armbrust] Planning for logical Repartition operators.
532dd37 [Michael Armbrust] Allow the local warehouse path to be specified.
4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter.
8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package.
c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation.
29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables.
9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning
f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe
cf4db59 [Lian, Cheng] Added golden answers for PruningSuite
54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names
2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning
c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning
f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query.
128a9f8 [Yin Huai] Minor changes.
017872c [Yin Huai] Remove stats20 from whitelist.
a1a4776 [Yin Huai] Update comments.
feb022c [Yin Huai] Partitioning key should be case insensitive.
555fb1d [Yin Huai] Correctly set the extension for a text file.
d00260b [Yin Huai] Strips backticks from partition keys.
334aace [Yin Huai] New golden files.
a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query.
428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
eea75c5 [Yin Huai] Correctly set codec.
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
e089627 [Yin Huai] Code style.
563bb22 [Yin Huai] Set compression info in FileSinkDesc.
35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback
bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables.
5495fab [Yin Huai] Remove cloneRecords which is no longer needed.
1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy.
3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location.
8506c17 [Michael Armbrust] Address review feedback.
3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master
9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling
566fd66 [Timothy Chen] Whitelist tests and add support for Binary type
69adf72 [Yin Huai] Set cloneRecords to false.
a9c3188 [Timothy Chen] Fix udaf struct return
346f828 [Yin Huai] Move SharkHadoopWriter to the correct location.
59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
ed3a1d1 [Yin Huai] Load data directly into Hive.
7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT.
b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile
1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile
5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii
678341a [Mark Hamstra] Replaced non-ascii text
887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents
7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package.
bc9a12c [Michael Armbrust] Move hive test files.
5720d2b [Lian, Cheng] Fixed comment typo
f0c3742 [Lian, Cheng] Refactored PhysicalOperation
f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered
cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings
2407a21 [Lian, Cheng] Added optimized logical plan to debugging output
a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens
9329820 [Michael Armbrust] add golden answer files to repository
dce0593 [Michael Armbrust] move golden answer to the source code directory.
964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView
7785ee6 [Michael Armbrust] Tighten visibility based on comments.
341116c [Michael Armbrust] address comments.
0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS
2897deb [Michael Armbrust] fix scaladoc
7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries.
b376d15 [Michael Armbrust] fix newlines at EOF
5cc367c [Michael Armbrust] use berkeley instead of cloudbees
ff5ea3f [Michael Armbrust] new golden
db92adc [Michael Armbrust] more tests passing. clean up logging.
740febb [Michael Armbrust] Tests for tgfs.
0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf.
ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
dd00b7e [Michael Armbrust] initial implementation of generators.
ea76cf9 [Michael Armbrust] Add NoRelation to planner.
bea4b7f [Michael Armbrust] Add SumDistinct.
016b489 [Michael Armbrust] fix typo.
acb9566 [Michael Armbrust] Correctly type attributes of CTAS.
8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation.
02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query.
5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes
5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg
8017afb [Michael Armbrust] fix copy paste error.
dc6353b [Michael Armbrust] turn off deprecation
cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance.
883006d [Michael Armbrust] improve tests.
32b615b [Michael Armbrust] add override to asPartial.
e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe.
f94345c [Michael Armbrust] fix doc link
d8cb805 [Michael Armbrust] Implement partial aggregation.
ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line.
b4be6a5 [Michael Armbrust] better logging when applying rules.
67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex
cb57459 [Michael Armbrust] blacklist machine specific test.
2f27604 [Michael Armbrust] Address comments / style errors.
389525d [Michael Armbrust] update golden, blacklist mr.
e3c10bd [Michael Armbrust] update whitelist.
44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex
42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs.
ab5bff3 [Michael Armbrust] Support for get item of map types.
1679554 [Michael Armbrust] add toString for if and IS NOT NULL.
ab9a131 [Michael Armbrust] when UDFs fail they should return null.
25288d0 [Michael Armbrust] Implement [] for arrays and maps.
e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions.
010accb [Michael Armbrust] add tinyint to metastore type parser.
7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes.
ac9d7de [Michael Armbrust] Resolve *s in Transform clauses.
692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables.
92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects
9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector.
72a003d [Michael Armbrust] revert regex change
7661b6c [Michael Armbrust] blacklist machines specific tests
aa430e7 [Michael Armbrust] Update .travis.yml
e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs.
5e54aa6 [Michael Armbrust] quotes for struct field names.
bbec500 [Michael Armbrust] update test coverage, new golden
3734a94 [Michael Armbrust] only quote string types.
3f9e519 [Michael Armbrust] use names w/ boolean args
5b3d2c8 [Michael Armbrust] implement distinct.
5b33216 [Michael Armbrust] work on decimal support.
2c6deb3 [Michael Armbrust] improve printing compatibility.
35a70fb [Michael Armbrust] multi-letter field names.
a9388fb [Michael Armbrust] printing for map types.
c3feda7 [Michael Armbrust] use toArray.
c654f19 [Michael Armbrust] Support for list and maps in hive table scan.
cf8d992 [Michael Armbrust] Use built in functions for creating temp directory.
1579eec [Michael Armbrust] Only cast unresolved inserts.
6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression.
da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before...
6709441 [Michael Armbrust] Evaluation for accessing nested fields.
dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation.
d670e41 [Michael Armbrust] Print nested fields like hive does.
efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan.
9c22b4e [Michael Armbrust] Support for parsing nested types.
82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables
ea6f37f [Michael Armbrust] fix style.
7845364 [Michael Armbrust] deactivate concurrent test.
b649c20 [Michael Armbrust] fix test logging / caching.
1590568 [Michael Armbrust] add log4j.properties
19bfd74 [Michael Armbrust] store hive output in circular buffer
dfb67aa [Michael Armbrust] add test case
cb775ac [Michael Armbrust] get rid of SharkContext singleton
2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master
63003e9 [Michael Armbrust] Fix spacing.
41b41f3 [Michael Armbrust] Only cast unresolved inserts.
6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs
5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator
b1151a8 [Timothy Chen] Fix load data regex
8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API.
e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction
45b334b [Yin Huai] fix comments
235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning.
6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style
271e483 [Michael Armbrust] Update build status icon.
d3a3d48 [Michael Armbrust] add testing to travis
807b2d7 [Michael Armbrust] check style and publish docs with travis
d20b565 [Michael Armbrust] fix if style
bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide.
d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set.
f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin
41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file).
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference
7213a2c [Reynold Xin] style fix for Hive.scala.
08e4d05 [Reynold Xin] First round of style cleanup.
605255e [Reynold Xin] Added scalastyle checker.
61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases
2486fb7 [Lian, Cheng] Fixed spelling
8ee41be [Lian, Cheng] Minor refactoring
ebb56fa [Michael Armbrust] add travis config
4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests
d4f539a [Michael Armbrust] blacklist mr and user specific tests.
677eb07 [Michael Armbrust] Update test whitelist.
5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning
c263c84 [Michael Armbrust] Only push predicates into partitioned table scans.
ab77882 [Michael Armbrust] upgrade spark to RC5.
c98ede5 [Lian, Cheng] Response to comments from @marmbrus
83d4520 [Yin Huai] marmbrus's comments
70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes
9ebff47 [Yin Huai] remove unnecessary .toSeq
e811d1a [Yin Huai] markhamstra's comments
4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations.
040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators.
9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions.
e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark
57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable
cb49af0 [Lian, Cheng] Fixed Scaladoc links
4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion
111ffdc [Lian, Cheng] More comments and minor reformatting
9e0d840 [Lian, Cheng] Added partition pruning optimization
761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan
04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
9dd3b26 [Michael Armbrust] Fix scaladoc.
6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst
7c92a41 [Lian, Cheng] Added Hive SerDe support
ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
2957f31 [Yin Huai] addressed comments on PR
907db68 [Michael Armbrust] Space after while.
04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts
4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile
5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23
fd084a4 [Michael Armbrust] implement casts binary <=> string.
0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type
548e479 [Yin Huai] merge master into exchangeOperator and fix code style
5b11db0 [Reynold Xin] Added Void to Boolean type widening.
9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear.
9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion
a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20
b20a4d4 [Lian, Cheng] Fix issue #20
6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc.
4285962 [Michael Armbrust] Remove temporary test cases
167162f [Michael Armbrust] more merge errors, cleanup.
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge.
6377d0b [Michael Armbrust] Drop empty files, fix if ().
c0b0e60 [Michael Armbrust] cleanup broken doc links.
330a88b [Michael Armbrust] Fix bugs in AddExchange.
4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering.
043e296 [Michael Armbrust] Make physical union nodes variadic.
ece15e1 [Michael Armbrust] update unit tests
5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width.
9804eb5 [Michael Armbrust] upgrade spark
053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow
5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow
ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes
bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg
563053f [Michael Armbrust] Address @rxin's comments.
6537c66 [Michael Armbrust] Address @rxin's comments.
2a76fc6 [Michael Armbrust] add notes from @rxin.
685bfa1 [Michael Armbrust] fix spelling
69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions.
7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage.
fc22e01 [Michael Armbrust] whitelist newly passing union test.
3f547b8 [Michael Armbrust] Add support for widening types in unions.
53b95f8 [Michael Armbrust] coercion should not occur until children are resolved.
b892e32 [Michael Armbrust] Union is not resolved until the types match up.
95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved.
81a109d [Michael Armbrust] fix link.
f143f61 [Michael Armbrust] Implement sampling.
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!"
6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar.
c800798 [Michael Armbrust] Add build status icon.
0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown
05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row.
f2fdd77 [Michael Armbrust] fix required distribtion for aggregate.
658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala
583a337 [Michael Armbrust] break apart distribution and partitioning.
e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator
0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links.
73c70de [Yin Huai] add a first set of unit tests for data properties.
fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements.
2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans.
fcbc03b [Michael Armbrust] Fix if ().
7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators.
b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes
b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization.
e286d20 [Michael Armbrust] address code review comments.
80d0681 [Michael Armbrust] fix scaladoc links.
de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString.
3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution.
2abb0bc [Michael Armbrust] better debug messages, use exists.
098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
83adb9d [Yin Huai] add DataProperty
5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen...
6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
058ec15 [Michael Armbrust] handle more writeables.
ffa9f25 [Michael Armbrust] blacklist some more MR tests.
aa2239c [Michael Armbrust] filter test lines containing Owner:
f71a325 [Michael Armbrust] Update golden jar.
a3003ae [Michael Armbrust] Update makefile to use better sharding support.
568d150 [Michael Armbrust] Updates to white/blacklist.
8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details.
09c6300 [Michael Armbrust] Add nullability information to StructFields.
5460b2d [Michael Armbrust] load srcpart by default.
3695141 [Michael Armbrust] Lots of parser improvements.
965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions.
1814ed3 [Michael Armbrust] use childrenResolved function.
f2ec57e [Michael Armbrust] Begin supporting decimal.
6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
d0124f3 [Michael Armbrust] Correctly type null literals.
b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
a90efda [Michael Armbrust] utility function for outputing string stacktraces.
7102f33 [Michael Armbrust] methods with side-effects should use ().
3ccaef7 [Michael Armbrust] add renaming TODO.
bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
c8e89d5 [Michael Armbrust] memoize inputSet calculation.
6aefa46 [Michael Armbrust] Skip folding literals.
a72e540 [Michael Armbrust] Add IN operator.
04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
12fd52d [Michael Armbrust] support for sorting longs.
0606520 [Michael Armbrust] drop old comment.
859200a [Michael Armbrust] support for reading more types from the metastore.
1fedd18 [Michael Armbrust] coercion from null to numeric types
71e902d [Michael Armbrust] fix test cases.
cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
a92919d [Michael Armbrust] add alter view as to native commands
f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float
e9f4588 [Michael Armbrust] fix > 100 char.
755b229 [Michael Armbrust] blacklist some ddl tests.
9ae740a [Michael Armbrust] blacklist more tests that require MR.
4cfc11a [Michael Armbrust] more test coverage.
0d9d56a [Michael Armbrust] add more native commands to parser
78d730d [Michael Armbrust] Load src test table on RESET.
8364ec2 [Michael Armbrust] whitelist all possible partition values.
b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
516481c [Michael Armbrust] Ignore requests to explain native commands.
68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
9627616 [Michael Armbrust] Use current database as default database.
9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
6f64cee [Michael Armbrust] don't line wrap string literal
eafaeed [Michael Armbrust] add type documentation
f54c94c [Michael Armbrust] make golden answers file a test dependency
5362365 [Michael Armbrust] push conditions into join
0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
73b29cd [Michael Armbrust] fix bad casting
9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
7eff191 [Michael Armbrust] link all the expression names.
83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
01c00c2 [Michael Armbrust] new golden
5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
1a393da [Yin Huai] folded -> foldable
1e964ea [Yin Huai] update
a43d41c [Michael Armbrust] more tests passing!
8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
ab8bbd1 [Michael Armbrust] parsing % operator
c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
367fb9e [Yin Huai] update
0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
61b266f [Michael Armbrust] comment for eliminate subqueries.
d72a5a2 [Michael Armbrust] add long to literal factory object.
b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
e06fd38 [Michael Armbrust] black list map reduce tests.
8e7ce30 [Michael Armbrust] blacklist some env specific tests.
6250cbd [Michael Armbrust] Do not exit on test failure
b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
b6e4899 [Yin Huai] formatting
e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
9e190f5 [Michael Armbrust] drop unneeded ()
68b58c1 [Michael Armbrust] drop a few more tests.
b0aa400 [Michael Armbrust] update whitelist.
c99012c [Michael Armbrust] skip tests with hooks
db00ebf [Michael Armbrust] more types for hive udfs
dbc3678 [Michael Armbrust] update ghpages repo
138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure.
6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i"
8d47ed4 [Yin Huai] comment
2795f05 [Yin Huai] comment
e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
0bd1688 [Yin Huai] update
6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
52864da [Reynold Xin] Added executeCollect method to SharkPlan.
f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
38124bd [Yin Huai] formatting
2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
555d839 [Reynold Xin] More cleaning ...
d48d0e1 [Reynold Xin] Code review feedback.
aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
479e055 [Reynold Xin] A set of minor changes, including:
- import order
- limit some lines to 100 character wide
- inline code comment
- more scaladocs
- minor spacing (i.e. add a space after if)
da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
{% endhighlight %}
</div>
SPARK-1251 Support for optimizing and executing structured queries This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL. *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.* The code is broken into three primary components: - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions. - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs. This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files. - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs. A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251). [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review. With this PR comes support for inferring the schema of existing RDDs that contain case classes. Using this information, developers can now express structured queries that are automatically compiled into RDD operations. ```scala // Define the schema using a case class. 
case class Person(name: String, age: Int) val people: RDD[Person] = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)) // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19' val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd ``` RDDs can also be registered as Tables, allowing SQL queries to be written over them. ```scala people.registerAsTable("people") val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19") ``` The results of queries are themselves RDDs and support standard RDD operations: ```scala teenagers.map(t => "Name: " + t(0)).collect().foreach(println) ``` Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL. ```scala sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sql("SELECT key, value FROM src").collect().foreach(println) ``` ## Relationship to Shark Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project. 
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license
1d0eb63 [Michael Armbrust] update changes with spark core
e5e1d6b [Michael Armbrust] Remove travis configuration.
c2efad6 [Michael Armbrust] First draft of SQL documentation.
013f62a [Michael Armbrust] Fix documentation / code style.
c01470f [Michael Armbrust] Clean up example
2f22454 [Michael Armbrust] WIP: Parquet example.
ce8073b [Michael Armbrust] clean up implicits.
f7d992d [Michael Armbrust] Naming / spelling.
9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext.
d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present.
8b35e0a [Michael Armbrust] address feedback, work on DSL
5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes
f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation
1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven
3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes
3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context.
7233a74 [Michael Armbrust] initial support for maven builds
f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven
7386a9f [Michael Armbrust] Initial example programs using spark sql.
aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow
7ca4b4e [Andre Schumacher] Improving checks in Parquet tests
5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport
54637ec [Andre Schumacher] First part of second round of code review feedback
c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows
ba28849 [Michael Armbrust] code review comments.
d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization.
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box.
959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream.
d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path.
c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support
3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR.
7d0f13e [Michael Armbrust] Update parquet support with master.
9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support
0040ae6 [Andre Schumacher] Feedback from code review
1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow
70e489d [Cheng Lian] Fixed a spelling typo
6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object.
8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly
99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval
7b9d142 [Michael Armbrust] Update travis to increase permgen size.
da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS.
6fdefe6 [Michael Armbrust] Port sbt improvements from master.
296fe50 [Michael Armbrust] Address review feedback.
d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework.
3bda72d [Andre Schumacher] Adding license banner to new files
3ac9eb0 [Andre Schumacher] Rebasing to new main branch
c863bed [Andre Schumacher] Codestyle checks
61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite
3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala
3a0a552 [Andre Schumacher] Reorganizing Parquet table operations
18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables
75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies
f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection
6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan
0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport
a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport
6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core
eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase
99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types
b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types
c334386 [Michael Armbrust] Initial support for generating schema's based on case classes.
608a29e [Michael Armbrust] Add hive as a repl dependency
7413ac2 [Michael Armbrust] make test downloading quieter.
4d57d0e [Michael Armbrust] Fix test execution on travis.
5f2963c [Michael Armbrust] naming and continuous compilation fixes.
f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent.
3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject.
2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes
d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query.
24eaa79 [Michael Armbrust] fix > 100 chars
6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL.
df88f01 [Michael Armbrust] add a simple test for aggregation
18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data.
b922511 [Michael Armbrust] Fix insertion of nested types into hive tables.
5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function.
a430895 [Michael Armbrust] Planning for logical Repartition operators.
532dd37 [Michael Armbrust] Allow the local warehouse path to be specified.
4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter.
8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package.
c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation.
29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables.
9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning
f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe
cf4db59 [Lian, Cheng] Added golden answers for PruningSuite
54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names
2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning
c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning
f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query.
128a9f8 [Yin Huai] Minor changes.
017872c [Yin Huai] Remove stats20 from whitelist.
a1a4776 [Yin Huai] Update comments.
feb022c [Yin Huai] Partitioning key should be case insensitive.
555fb1d [Yin Huai] Correctly set the extension for a text file.
d00260b [Yin Huai] Strips backticks from partition keys.
334aace [Yin Huai] New golden files.
a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query.
428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
eea75c5 [Yin Huai] Correctly set codec.
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
e089627 [Yin Huai] Code style.
563bb22 [Yin Huai] Set compression info in FileSinkDesc.
35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback
bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables.
5495fab [Yin Huai] Remove cloneRecords which is no longer needed.
1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy.
3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location.
8506c17 [Michael Armbrust] Address review feedback.
3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master
9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling
566fd66 [Timothy Chen] Whitelist tests and add support for Binary type
69adf72 [Yin Huai] Set cloneRecords to false.
a9c3188 [Timothy Chen] Fix udaf struct return
346f828 [Yin Huai] Move SharkHadoopWriter to the correct location.
59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
ed3a1d1 [Yin Huai] Load data directly into Hive.
7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT.
b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile
1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile
5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii
678341a [Mark Hamstra] Replaced non-ascii text
887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents
7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package.
bc9a12c [Michael Armbrust] Move hive test files.
5720d2b [Lian, Cheng] Fixed comment typo
f0c3742 [Lian, Cheng] Refactored PhysicalOperation
f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered
cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings
2407a21 [Lian, Cheng] Added optimized logical plan to debugging output
a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens
9329820 [Michael Armbrust] add golden answer files to repository
dce0593 [Michael Armbrust] move golden answer to the source code directory.
964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView
7785ee6 [Michael Armbrust] Tighten visibility based on comments.
341116c [Michael Armbrust] address comments.
0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS
2897deb [Michael Armbrust] fix scaladoc
7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries.
b376d15 [Michael Armbrust] fix newlines at EOF
5cc367c [Michael Armbrust] use berkeley instead of cloudbees
ff5ea3f [Michael Armbrust] new golden
db92adc [Michael Armbrust] more tests passing. clean up logging.
740febb [Michael Armbrust] Tests for tgfs.
0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf.
ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
dd00b7e [Michael Armbrust] initial implementation of generators.
ea76cf9 [Michael Armbrust] Add NoRelation to planner.
bea4b7f [Michael Armbrust] Add SumDistinct.
016b489 [Michael Armbrust] fix typo.
acb9566 [Michael Armbrust] Correctly type attributes of CTAS.
8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation.
02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query.
5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes
5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg
8017afb [Michael Armbrust] fix copy paste error.
dc6353b [Michael Armbrust] turn off deprecation
cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance.
883006d [Michael Armbrust] improve tests.
32b615b [Michael Armbrust] add override to asPartial.
e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe.
f94345c [Michael Armbrust] fix doc link
d8cb805 [Michael Armbrust] Implement partial aggregation.
ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line.
b4be6a5 [Michael Armbrust] better logging when applying rules.
67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex
cb57459 [Michael Armbrust] blacklist machine specific test.
2f27604 [Michael Armbrust] Address comments / style errors.
389525d [Michael Armbrust] update golden, blacklist mr.
e3c10bd [Michael Armbrust] update whitelist.
44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex
42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs.
ab5bff3 [Michael Armbrust] Support for get item of map types.
1679554 [Michael Armbrust] add toString for if and IS NOT NULL.
ab9a131 [Michael Armbrust] when UDFs fail they should return null.
25288d0 [Michael Armbrust] Implement [] for arrays and maps.
e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions.
010accb [Michael Armbrust] add tinyint to metastore type parser.
7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes.
ac9d7de [Michael Armbrust] Resolve *s in Transform clauses.
692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables.
92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects
9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector.
72a003d [Michael Armbrust] revert regex change
7661b6c [Michael Armbrust] blacklist machines specific tests
aa430e7 [Michael Armbrust] Update .travis.yml
e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs.
5e54aa6 [Michael Armbrust] quotes for struct field names.
bbec500 [Michael Armbrust] update test coverage, new golden
3734a94 [Michael Armbrust] only quote string types.
3f9e519 [Michael Armbrust] use names w/ boolean args
5b3d2c8 [Michael Armbrust] implement distinct.
5b33216 [Michael Armbrust] work on decimal support.
2c6deb3 [Michael Armbrust] improve printing compatibility.
35a70fb [Michael Armbrust] multi-letter field names.
a9388fb [Michael Armbrust] printing for map types.
c3feda7 [Michael Armbrust] use toArray.
c654f19 [Michael Armbrust] Support for list and maps in hive table scan.
cf8d992 [Michael Armbrust] Use built in functions for creating temp directory.
1579eec [Michael Armbrust] Only cast unresolved inserts.
6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression.
da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before...
6709441 [Michael Armbrust] Evaluation for accessing nested fields.
dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation.
d670e41 [Michael Armbrust] Print nested fields like hive does.
efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan.
9c22b4e [Michael Armbrust] Support for parsing nested types.
82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables
ea6f37f [Michael Armbrust] fix style.
7845364 [Michael Armbrust] deactivate concurrent test.
b649c20 [Michael Armbrust] fix test logging / caching.
1590568 [Michael Armbrust] add log4j.properties
19bfd74 [Michael Armbrust] store hive output in circular buffer
dfb67aa [Michael Armbrust] add test case
cb775ac [Michael Armbrust] get rid of SharkContext singleton
2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master
63003e9 [Michael Armbrust] Fix spacing.
41b41f3 [Michael Armbrust] Only cast unresolved inserts.
6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs
5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator
b1151a8 [Timothy Chen] Fix load data regex
8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API.
e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction
45b334b [Yin Huai] fix comments
235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning.
6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style
271e483 [Michael Armbrust] Update build status icon.
d3a3d48 [Michael Armbrust] add testing to travis
807b2d7 [Michael Armbrust] check style and publish docs with travis
d20b565 [Michael Armbrust] fix if style
bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style
    Disable if brace checking as it errors in single line functional cases unlike the style guide.
d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set.
f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin
41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file).
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference
7213a2c [Reynold Xin] style fix for Hive.scala.
08e4d05 [Reynold Xin] First round of style cleanup.
605255e [Reynold Xin] Added scalastyle checker.
61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases
2486fb7 [Lian, Cheng] Fixed spelling
8ee41be [Lian, Cheng] Minor refactoring
ebb56fa [Michael Armbrust] add travis config
4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests
d4f539a [Michael Armbrust] blacklist mr and user specific tests.
677eb07 [Michael Armbrust] Update test whitelist.
5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning
c263c84 [Michael Armbrust] Only push predicates into partitioned table scans.
ab77882 [Michael Armbrust] upgrade spark to RC5.
c98ede5 [Lian, Cheng] Response to comments from @marmbrus
83d4520 [Yin Huai] marmbrus's comments
70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes
9ebff47 [Yin Huai] remove unnecessary .toSeq
e811d1a [Yin Huai] markhamstra's comments
4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations.
040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators.
9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions.
e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark
57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable
cb49af0 [Lian, Cheng] Fixed Scaladoc links
4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion
111ffdc [Lian, Cheng] More comments and minor reformatting
9e0d840 [Lian, Cheng] Added partition pruning optimization
761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan
04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
9dd3b26 [Michael Armbrust] Fix scaladoc.
6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst
7c92a41 [Lian, Cheng] Added Hive SerDe support
ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
2957f31 [Yin Huai] addressed comments on PR
907db68 [Michael Armbrust] Space after while.
04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts
4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile
5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23
fd084a4 [Michael Armbrust] implement casts binary <=> string.
0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type
548e479 [Yin Huai] merge master into exchangeOperator and fix code style
5b11db0 [Reynold Xin] Added Void to Boolean type widening.
9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear.
9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion
a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20
b20a4d4 [Lian, Cheng] Fix issue #20
6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc.
4285962 [Michael Armbrust] Remove temporary test cases
167162f [Michael Armbrust] more merge errors, cleanup.
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge.
6377d0b [Michael Armbrust] Drop empty files, fix if ().
c0b0e60 [Michael Armbrust] cleanup broken doc links.
330a88b [Michael Armbrust] Fix bugs in AddExchange.
4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering.
043e296 [Michael Armbrust] Make physical union nodes variadic.
ece15e1 [Michael Armbrust] update unit tests
5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator
    Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width.
9804eb5 [Michael Armbrust] upgrade spark
053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow
5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow
ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes
bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg
563053f [Michael Armbrust] Address @rxin's comments.
6537c66 [Michael Armbrust] Address @rxin's comments.
2a76fc6 [Michael Armbrust] add notes from @rxin.
685bfa1 [Michael Armbrust] fix spelling
69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions.
7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage.
fc22e01 [Michael Armbrust] whitelist newly passing union test.
3f547b8 [Michael Armbrust] Add support for widening types in unions.
53b95f8 [Michael Armbrust] coercion should not occur until children are resolved.
b892e32 [Michael Armbrust] Union is not resolved until the types match up.
95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved.
81a109d [Michael Armbrust] fix link.
f143f61 [Michael Armbrust] Implement sampling.
    Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!"
6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar.
c800798 [Michael Armbrust] Add build status icon.
0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown
05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row.
f2fdd77 [Michael Armbrust] fix required distribtion for aggregate.
658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala
583a337 [Michael Armbrust] break apart distribution and partitioning.
e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator
0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links.
73c70de [Yin Huai] add a first set of unit tests for data properties.
fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown
    Minor doc improvements.
2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans.
fcbc03b [Michael Armbrust] Fix if ().
7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators.
b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes
b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization.
e286d20 [Michael Armbrust] address code review comments.
80d0681 [Michael Armbrust] fix scaladoc links.
de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString.
3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution.
2abb0bc [Michael Armbrust] better debug messages, use exists.
098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
83adb9d [Yin Huai] add DataProperty
5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen...
6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
058ec15 [Michael Armbrust] handle more writeables.
ffa9f25 [Michael Armbrust] blacklist some more MR tests.
aa2239c [Michael Armbrust] filter test lines containing Owner:
f71a325 [Michael Armbrust] Update golden jar.
a3003ae [Michael Armbrust] Update makefile to use better sharding support.
568d150 [Michael Armbrust] Updates to white/blacklist.
8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details.
09c6300 [Michael Armbrust] Add nullability information to StructFields.
5460b2d [Michael Armbrust] load srcpart by default.
3695141 [Michael Armbrust] Lots of parser improvements.
965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions.
1814ed3 [Michael Armbrust] use childrenResolved function.
f2ec57e [Michael Armbrust] Begin supporting decimal.
6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
d0124f3 [Michael Armbrust] Correctly type null literals.
b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
a90efda [Michael Armbrust] utility function for outputing string stacktraces.
7102f33 [Michael Armbrust] methods with side-effects should use ().
3ccaef7 [Michael Armbrust] add renaming TODO.
bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
c8e89d5 [Michael Armbrust] memoize inputSet calculation.
6aefa46 [Michael Armbrust] Skip folding literals.
a72e540 [Michael Armbrust] Add IN operator.
04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
12fd52d [Michael Armbrust] support for sorting longs.
0606520 [Michael Armbrust] drop old comment.
859200a [Michael Armbrust] support for reading more types from the metastore.
1fedd18 [Michael Armbrust] coercion from null to numeric types
71e902d [Michael Armbrust] fix test cases.
cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
a92919d [Michael Armbrust] add alter view as to native commands
f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float
e9f4588 [Michael Armbrust] fix > 100 char.
755b229 [Michael Armbrust] blacklist some ddl tests.
9ae740a [Michael Armbrust] blacklist more tests that require MR.
4cfc11a [Michael Armbrust] more test coverage.
0d9d56a [Michael Armbrust] add more native commands to parser
78d730d [Michael Armbrust] Load src test table on RESET.
8364ec2 [Michael Armbrust] whitelist all possible partition values.
b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
516481c [Michael Armbrust] Ignore requests to explain native commands.
68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
9627616 [Michael Armbrust] Use current database as default database.
9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
6f64cee [Michael Armbrust] don't line wrap string literal
eafaeed [Michael Armbrust] add type documentation
f54c94c [Michael Armbrust] make golden answers file a test dependency
5362365 [Michael Armbrust] push conditions into join
0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
73b29cd [Michael Armbrust] fix bad casting
9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
7eff191 [Michael Armbrust] link all the expression names.
83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
01c00c2 [Michael Armbrust] new golden
5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
1a393da [Yin Huai] folded -> foldable
1e964ea [Yin Huai] update
a43d41c [Michael Armbrust] more tests passing!
8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
ab8bbd1 [Michael Armbrust] parsing % operator
c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
367fb9e [Yin Huai] update
0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
61b266f [Michael Armbrust] comment for eliminate subqueries.
d72a5a2 [Michael Armbrust] add long to literal factory object.
b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
e06fd38 [Michael Armbrust] black list map reduce tests.
8e7ce30 [Michael Armbrust] blacklist some env specific tests.
6250cbd [Michael Armbrust] Do not exit on test failure
b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
b6e4899 [Yin Huai] formatting
e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
9e190f5 [Michael Armbrust] drop unneeded ()
68b58c1 [Michael Armbrust] drop a few more tests.
b0aa400 [Michael Armbrust] update whitelist.
c99012c [Michael Armbrust] skip tests with hooks
db00ebf [Michael Armbrust] more types for hive udfs
dbc3678 [Michael Armbrust] update ghpages repo
138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure.
6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i"
8d47ed4 [Yin Huai] comment
2795f05 [Yin Huai] comment
e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
0bd1688 [Yin Huai] update
6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
52864da [Reynold Xin] Added executeCollect method to SharkPlan.
f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
38124bd [Yin Huai] formatting
2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
555d839 [Reynold Xin] More cleaning ...
d48d0e1 [Reynold Xin] Code review feedback.
aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
479e055 [Reynold Xin] A set of minor changes, including:
  - import order
  - limit some lines to 100 character wide
  - inline code comment
  - more scaladocs
  - minor spacing (i.e. add a space after if)
da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing! 
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yes. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy. 
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
<div data-lang="java" markdown="1">
Spark SQL can automatically infer the schema of a JSON dataset and load it as a `Dataset<Row>`.
The conversion is done by calling `SparkSession.read().json()` on either an RDD of String
or a JSON file.
Note that the file offered as _a json file_ is not a typical JSON file. Each
line must contain a separate, self-contained, valid JSON object. As a consequence,
a regular multi-line (pretty-printed) JSON file will most often fail to load.
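
The one-object-per-line requirement can be met by flattening pretty-printed records before handing the file to Spark. The following is a minimal sketch that uses only the Java standard library; `JsonLinesExample`, `toJsonLine`, and `writeJsonLines` are hypothetical names introduced for illustration and are not part of Spark. It assumes every input string is a valid JSON object, so any raw line break is whitespace between tokens and can be dropped safely.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Spark API): rewrites pretty-printed JSON objects
// into the one-object-per-line layout described above.
public class JsonLinesExample {

    // Collapse one JSON object onto a single line.
    // Assumption: the input is valid JSON, so a raw line break can only occur
    // between tokens, never inside a string value.
    public static String toJsonLine(String prettyJson) {
        return prettyJson.trim().replaceAll("\\R\\s*", "");
    }

    // Write each record as exactly one line of the target file.
    public static Path writeJsonLines(List<String> records, Path target) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String record : records) {
            lines.add(toJsonLine(record));
        }
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> records = Arrays.asList(
            "{\n  \"name\": \"Andy\",\n  \"age\": 30\n}",
            "{\"name\": \"Justin\", \"age\": 19}");
        Path out = writeJsonLines(records, Files.createTempFile("people", ".json"));
        // Each printed line is a self-contained JSON object.
        for (String line : Files.readAllLines(out, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

The path of the resulting file is what would be passed to the JSON reader; a file that still spreads one object over several lines is the case the note above warns about.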
{% highlight java %}
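// sc is an existing JavaSparkContext.
// SparkSession has no public constructor; getOrCreate() reuses the running SparkContext.
SparkSession spark = SparkSession.builder().config(sc.getConf()).getOrCreate();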
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");
// The inferred schema can be visualized using the printSchema() method.
people.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark.
Dataset<Row> teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a JavaRDD<String> storing one JSON object per string.
List<String> jsonData = Arrays.asList(
"{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
JavaRDD<String> anotherPeopleRDD = sc.parallelize(jsonData);
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleRDD);
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
This conversion can be done using `SparkSession.read.json` on a JSON file.

Note that the file offered as _a json file_ is not a typical JSON file. Each
line must contain a separate, self-contained, valid JSON object. As a consequence,
a regular multi-line JSON file (for example, a pretty-printed array) will most often fail to load.
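
As a rough illustration of the one-object-per-line layout described above, the sketch below converts a pretty-printed JSON array into that form using only the Python standard library. The helper name `to_json_lines` and the sample records are made up for this example and are not part of Spark.

```python
import json


def to_json_lines(records):
    """Serialize each record as one compact JSON object per line."""
    # json.dumps without `indent` never emits a raw newline, so every
    # record is guaranteed to occupy exactly one line.
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# A "regular" multi-line JSON document: a pretty-printed array of objects.
# (Hypothetical sample data, used only for illustration.)
pretty = """[
  {"name": "Michael"},
  {"name": "Andy", "age": 30},
  {"name": "Justin", "age": 19}
]"""

records = json.loads(pretty)
json_lines = to_json_lines(records)
print(json_lines)
```

Each line of the printed text is a self-contained JSON object, so text in this shape, once written to a file, matches the layout that `spark.read.json` expects.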
{% highlight python %}
# spark is an existing SparkSession.
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
people = spark.read.json("examples/src/main/resources/people.json")
# The inferred schema can be visualized using the printSchema() method.
people.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Creates a temporary view using the DataFrame.
people.createOrReplaceTempView("people")
# SQL statements can be run by using the sql method provided by `spark`.
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string.
anotherPeopleRDD = sc.parallelize([
'{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}'])
anotherPeople = spark.read.json(anotherPeopleRDD)
{% endhighlight %}
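The "one JSON object per line" requirement can be illustrated without Spark. The following is a minimal sketch using only Python's standard `json` module; the helper names `parse_json_lines` and `to_json_lines` and the sample records are hypothetical, chosen for illustration, and this is not how Spark itself reads the data. It shows why a pretty-printed, multi-line JSON document fails when read line by line, and how re-serializing it with one object per line gives the expected layout.

```python
import json


def parse_json_lines(text):
    """Parse line-delimited JSON: one self-contained object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_json_lines(records):
    """Serialize records so that each one occupies exactly one line."""
    return "\n".join(json.dumps(record) for record in records)


# Illustrative line-delimited input (sample records are assumptions).
line_delimited = (
    '{"name":"Michael"}\n'
    '{"name":"Andy","age":30}\n'
    '{"name":"Justin","age":19}'
)
people = parse_json_lines(line_delimited)

# A pretty-printed document spreads one object over several lines,
# so no single line is valid JSON on its own.
pretty_printed = json.dumps({"name": "Yin", "address": {"city": "Columbus"}}, indent=2)
try:
    parse_json_lines(pretty_printed)
    multi_line_ok = True
except json.JSONDecodeError:
    multi_line_ok = False

# Re-serializing with one object per line restores the expected layout.
repaired = to_json_lines([json.loads(pretty_printed)])
```

Writing the output of a helper like `to_json_lines` to a text file would give an input in the line-delimited shape that the examples above assume.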
</div>
<div data-lang="r" markdown="1">
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using
the `read.json()` function, which loads data from a file or a directory of JSON files where each
line of the files is a JSON object.

Note that a file offered as _a JSON file_ is not a typical JSON file. Each
line must contain a separate, self-contained, valid JSON object. As a consequence,
a regular multi-line JSON file will most often fail.
{% highlight r %}
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
people <- read.json(path)
# The inferred schema can be visualized using the printSchema() method.
printSchema(people)
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Register this DataFrame as a temporary view.
createOrReplaceTempView(people, "people")
# SQL statements can be run by using the sql function.
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
{% endhighlight %}
</div>
<div data-lang="sql" markdown="1">
{% highlight sql %}
CREATE TEMPORARY VIEW jsonTable
USING org.apache.spark.sql.json
OPTIONS (
path "examples/src/main/resources/people.json"
)
SELECT * FROM jsonTable
{% endhighlight %}
</div>
</div>
## Hive Tables
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
However, because Hive has a large number of dependencies, they are not included in the
default Spark distribution. If the Hive dependencies are found on the classpath, Spark loads them
automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
access data stored in Hive.
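
To see whether the Hive classes are visible to a given JVM, you can probe the classpath directly. The sketch below is illustrative only: the object name `HiveClasspathCheck` and the helper `isClassPresent` are hypothetical, and `org.apache.hadoop.hive.conf.HiveConf` is simply assumed to be a representative Hive class to look for. It is not necessarily the check Spark itself performs.

```scala
object HiveClasspathCheck {
  // Returns true if the named class can be located by the current class loader.
  // The class is looked up without running its static initializers.
  def isClassPresent(className: String): Boolean = {
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    try {
      Class.forName(className, false, loader)
      true
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => false
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: HiveConf is used here as a representative Hive class to probe for.
    val hiveConf = "org.apache.hadoop.hive.conf.HiveConf"
    if (isClassPresent(hiveConf)) {
      println(s"Found $hiveConf on the classpath.")
    } else {
      println(s"$hiveConf not found; Hive dependencies appear to be missing.")
    }
  }
}
```

Running this on the driver only tells you about the driver's classpath. The worker nodes need the same Hive jars, so a check like this would have to be run there as well.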
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
Hive is configured by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
and `hdfs-site.xml` (for HDFS configuration) files in `conf/`.
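
As a minimal sketch of the step above, the helper below copies the three files into a Spark `conf/` directory. The function name `install_hive_conf` and every path in the example invocation are hypothetical; where the Hive and Hadoop client configuration actually lives depends on your deployment.

```bash
# Hypothetical helper: copy Hive and Hadoop client configs into Spark's conf/.
#   $1 = directory containing hive-site.xml                (assumed location)
#   $2 = directory containing core-site.xml, hdfs-site.xml (assumed location)
#   $3 = Spark's conf/ directory
install_hive_conf() {
  # Create the target directory if it does not exist yet.
  mkdir -p "$3" || return 1
  # Hive metastore and client settings.
  cp "$1/hive-site.xml" "$3/" || return 1
  # Security and HDFS settings.
  cp "$2/core-site.xml" "$2/hdfs-site.xml" "$3/" || return 1
}

# Example invocation (paths are assumptions, adjust to your cluster):
# install_hive_conf /etc/hive/conf /etc/hadoop/conf "$SPARK_HOME/conf"
```

The function fails on the first missing file rather than leaving a partial configuration unnoticed.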
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
<div class="codetabs">
<div data-lang="scala" markdown="1">
When working with Hive, one must instantiate `SparkSession` with Hive support, including
connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. When Hive is not
configured by `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
`spark-warehouse` in the directory where the Spark application is started. Note that
the `hive.metastore.warehouse.dir` property in `hive-site.xml` has been deprecated since Spark 2.0.0.
Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in the warehouse.
You may need to grant write privileges to the user who starts the Spark application.
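
As a minimal sketch of the setup described above, the following program builds a `SparkSession` with Hive support and sets `spark.sql.warehouse.dir` explicitly. It assumes the Spark SQL and Spark Hive modules are on the classpath; the object name `HiveSessionSketch`, the application name, the `local[*]` master, and the table name `src` are illustrative choices, not requirements.

```scala
import java.io.File

import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object HiveSessionSketch {
  // Mirrors the documented default: a `spark-warehouse` directory
  // under the directory the application is started from.
  val warehouseLocation: String = new File("spark-warehouse").getAbsolutePath

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Hive session sketch")
      // Assumption: a local run; drop this line when submitting with spark-submit.
      .master("local[*]")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      // Fails at getOrCreate() if the Hive classes are not on the classpath.
      .enableHiveSupport()
      .getOrCreate()

    // Without a hive-site.xml, this also creates `metastore_db` in the current directory.
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    spark.sql("SHOW TABLES").show()

    // Print the warehouse location the session actually uses.
    println(spark.conf.get("spark.sql.warehouse.dir"))

    spark.stop()
  }
}
```

Run from a directory the launching user can write to, since both `metastore_db` and the warehouse directory are created there when no `hive-site.xml` is present.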
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links. 
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
{% highlight scala %}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
// warehouse_location points to the default location for managed databases and tables
val conf = new SparkConf().setAppName("HiveFromSpark").set("spark.sql.warehouse.dir", warehouse_location)
val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribtion for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again. 
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be use for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support. 
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputing string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators. 
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope. 
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <huaiyin.thu@gmail.com> Author: Reynold Xin <rxin@apache.org> Author: Lian, Cheng <rhythm.mail@gmail.com> Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Yin Huai <huai@cse.ohio-state.edu> Author: Timothy Chen <tnachen@gmail.com> Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Timothy Chen <tnachen@apache.org> Author: Henry Cook <henry.m.cook+github@gmail.com> Author: Mark Hamstra <markhamstra@gmail.com> Closes #146 from marmbrus/catalyst and squashes the following commits: 458bd1b [Michael Armbrust] Update people.txt 0d638c3 [Michael Armbrust] Typo fix from @ash211. bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests 778299a [Michael Armbrust] Fix some old links to spark-project.org fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations. This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user. This change also makes it slightly less verbose to run language integrated queries. fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD. 48a99bc [Michael Armbrust] Address first round of feedback. 461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour adcf1a4 [Henry Cook] Update sql-programming-guide.md 9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins. 
6978dd8 [Michael Armbrust] update docs, add apache license 1d0eb63 [Michael Armbrust] update changes with spark core e5e1d6b [Michael Armbrust] Remove travis configuration. c2efad6 [Michael Armbrust] First draft of SQL documentation. 013f62a [Michael Armbrust] Fix documentation / code style. c01470f [Michael Armbrust] Clean up example 2f22454 [Michael Armbrust] WIP: Parquet example. ce8073b [Michael Armbrust] clean up implicits. f7d992d [Michael Armbrust] Naming / spelling. 9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext. d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar. Create a separate, optional Hive assembly that is used when present. 8b35e0a [Michael Armbrust] address feedback, work on DSL 5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation 1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven 3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes 3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context. 7233a74 [Michael Armbrust] initial support for maven builds f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven 7386a9f [Michael Armbrust] Initial example programs using spark sql. aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow 7ca4b4e [Andre Schumacher] Improving checks in Parquet tests 5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport 54637ec [Andre Schumacher] First part of second round of code review feedback c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows ba28849 [Michael Armbrust] code review comments. d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization. 
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates. Also add a constructor so that it can be serialized out-of-the-box. 959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream. d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path. c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support 3c3f962 [Michael Armbrust] Fix a bug due to array reuse. This will need to be revisited after we merge the mutable row PR. 7d0f13e [Michael Armbrust] Update parquet support with master. 9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support 0040ae6 [Andre Schumacher] Feedback from code review 1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow 70e489d [Cheng Lian] Fixed a spelling typo 6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object. 8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly 99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval 7b9d142 [Michael Armbrust] Update travis to increase permgen size. da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS. 6fdefe6 [Michael Armbrust] Port sbt improvements from master. 296fe50 [Michael Armbrust] Address review feedback. d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework. 
3bda72d [Andre Schumacher] Adding license banner to new files 3ac9eb0 [Andre Schumacher] Rebasing to new main branch c863bed [Andre Schumacher] Codestyle checks 61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite 3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala 3a0a552 [Andre Schumacher] Reorganizing Parquet table operations 18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables 75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection 6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan 0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport 6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase 99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types c334386 [Michael Armbrust] Initial support for generating schema's based on case classes. 608a29e [Michael Armbrust] Add hive as a repl dependency 7413ac2 [Michael Armbrust] make test downloading quieter. 4d57d0e [Michael Armbrust] Fix test execution on travis. 5f2963c [Michael Armbrust] naming and continuous compilation fixes. f5e7492 [Michael Armbrust] Add Apache license. Make naming more consistent. 3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject. 2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query. 24eaa79 [Michael Armbrust] fix > 100 chars 6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL. 
df88f01 [Michael Armbrust] add a simple test for aggregation 18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data. b922511 [Michael Armbrust] Fix insertion of nested types into hive tables. 5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function. a430895 [Michael Armbrust] Planning for logical Repartition operators. 532dd37 [Michael Armbrust] Allow the local warehouse path to be specified. 4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter. 8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package. c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation. 29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables. 9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe cf4db59 [Lian, Cheng] Added golden answers for PruningSuite 54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names 2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query. 128a9f8 [Yin Huai] Minor changes. 017872c [Yin Huai] Remove stats20 from whitelist. a1a4776 [Yin Huai] Update comments. feb022c [Yin Huai] Partitioning key should be case insensitive. 555fb1d [Yin Huai] Correctly set the extension for a text file. d00260b [Yin Huai] Strips backticks from partition keys. 334aace [Yin Huai] New golden files. a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query. 428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`. eea75c5 [Yin Huai] Correctly set codec. 
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew e089627 [Yin Huai] Code style. 563bb22 [Yin Huai] Set compression info in FileSinkDesc. 35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables. 5495fab [Yin Huai] Remove cloneRecords which is no longer needed. 1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy. 3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location. 8506c17 [Michael Armbrust] Address review feedback. 3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master 9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling 566fd66 [Timothy Chen] Whitelist tests and add support for Binary type 69adf72 [Yin Huai] Set cloneRecords to false. a9c3188 [Timothy Chen] Fix udaf struct return 346f828 [Yin Huai] Move SharkHadoopWriter to the correct location. 59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew ed3a1d1 [Yin Huai] Load data directly into Hive. 7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT. b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile 1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile 5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii 678341a [Mark Hamstra] Replaced non-ascii text 887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew 1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents 7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package. bc9a12c [Michael Armbrust] Move hive test files. 
5720d2b [Lian, Cheng] Fixed comment typo f0c3742 [Lian, Cheng] Refactored PhysicalOperation f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings 2407a21 [Lian, Cheng] Added optimized logical plan to debugging output a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens 9329820 [Michael Armbrust] add golden answer files to repository dce0593 [Michael Armbrust] move golden answer to the source code directory. 964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView 7785ee6 [Michael Armbrust] Tighten visibility based on comments. 341116c [Michael Armbrust] address comments. 0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS 2897deb [Michael Armbrust] fix scaladoc 7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries. b376d15 [Michael Armbrust] fix newlines at EOF 5cc367c [Michael Armbrust] use berkeley instead of cloudbees ff5ea3f [Michael Armbrust] new golden db92adc [Michael Armbrust] more tests passing. clean up logging. 740febb [Michael Armbrust] Tests for tgfs. 0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf. ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView dd00b7e [Michael Armbrust] initial implementation of generators. ea76cf9 [Michael Armbrust] Add NoRelation to planner. bea4b7f [Michael Armbrust] Add SumDistinct. 016b489 [Michael Armbrust] fix typo. acb9566 [Michael Armbrust] Correctly type attributes of CTAS. 8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation. 02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query. 5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes 5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg 8017afb [Michael Armbrust] fix copy paste error. 
dc6353b [Michael Armbrust] turn off deprecation cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance. 883006d [Michael Armbrust] improve tests. 32b615b [Michael Armbrust] add override to asPartial. e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe. f94345c [Michael Armbrust] fix doc link d8cb805 [Michael Armbrust] Implement partial aggregation. ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings. Remove a blank line. b4be6a5 [Michael Armbrust] better logging when applying rules. 67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex cb57459 [Michael Armbrust] blacklist machine specific test. 2f27604 [Michael Armbrust] Address comments / style errors. 389525d [Michael Armbrust] update golden, blacklist mr. e3c10bd [Michael Armbrust] update whitelist. 44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex 42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs. ab5bff3 [Michael Armbrust] Support for get item of map types. 1679554 [Michael Armbrust] add toString for if and IS NOT NULL. ab9a131 [Michael Armbrust] when UDFs fail they should return null. 25288d0 [Michael Armbrust] Implement [] for arrays and maps. e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions. 010accb [Michael Armbrust] add tinyint to metastore type parser. 7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes. ac9d7de [Michael Armbrust] Resolve *s in Transform clauses. 692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables. 92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects 9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector. 
72a003d [Michael Armbrust] revert regex change 7661b6c [Michael Armbrust] blacklist machines specific tests aa430e7 [Michael Armbrust] Update .travis.yml e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs. 5e54aa6 [Michael Armbrust] quotes for struct field names. bbec500 [Michael Armbrust] update test coverage, new golden 3734a94 [Michael Armbrust] only quote string types. 3f9e519 [Michael Armbrust] use names w/ boolean args 5b3d2c8 [Michael Armbrust] implement distinct. 5b33216 [Michael Armbrust] work on decimal support. 2c6deb3 [Michael Armbrust] improve printing compatibility. 35a70fb [Michael Armbrust] multi-letter field names. a9388fb [Michael Armbrust] printing for map types. c3feda7 [Michael Armbrust] use toArray. c654f19 [Michael Armbrust] Support for list and maps in hive table scan. cf8d992 [Michael Armbrust] Use built in functions for creating temp directory. 1579eec [Michael Armbrust] Only cast unresolved inserts. 6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression. da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test. Not sure how this was passing before... 6709441 [Michael Armbrust] Evaluation for accessing nested fields. dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation. d670e41 [Michael Armbrust] Print nested fields like hive does. efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan. 9c22b4e [Michael Armbrust] Support for parsing nested types. 82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables ea6f37f [Michael Armbrust] fix style. 7845364 [Michael Armbrust] deactivate concurrent test. b649c20 [Michael Armbrust] fix test logging / caching. 
1590568 [Michael Armbrust] add log4j.properties 19bfd74 [Michael Armbrust] store hive output in circular buffer dfb67aa [Michael Armbrust] add test case cb775ac [Michael Armbrust] get rid of SharkContext singleton 2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master 63003e9 [Michael Armbrust] Fix spacing. 41b41f3 [Michael Armbrust] Only cast unresolved inserts. 6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs 5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator b1151a8 [Timothy Chen] Fix load data regex 8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API. e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction 45b334b [Yin Huai] fix comments 235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning. 6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style 271e483 [Michael Armbrust] Update build status icon. d3a3d48 [Michael Armbrust] add testing to travis 807b2d7 [Michael Armbrust] check style and publish docs with travis d20b565 [Michael Armbrust] fix if style bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide. d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests. This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive. These are used by default when HIVE_HOME is not set. f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin 41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file). 
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference 7213a2c [Reynold Xin] style fix for Hive.scala. 08e4d05 [Reynold Xin] First round of style cleanup. 605255e [Reynold Xin] Added scalastyle checker. 61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases 2486fb7 [Lian, Cheng] Fixed spelling 8ee41be [Lian, Cheng] Minor refactoring ebb56fa [Michael Armbrust] add travis config 4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests d4f539a [Michael Armbrust] blacklist mr and user specific tests. 677eb07 [Michael Armbrust] Update test whitelist. 5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning c263c84 [Michael Armbrust] Only push predicates into partitioned table scans. ab77882 [Michael Armbrust] upgrade spark to RC5. c98ede5 [Lian, Cheng] Response to comments from @marmbrus 83d4520 [Yin Huai] marmbrus's comments 70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes 9ebff47 [Yin Huai] remove unnecessary .toSeq e811d1a [Yin Huai] markhamstra's comments 4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations. 040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators. 9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as input expressions. e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark 57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable cb49af0 [Lian, Cheng] Fixed Scaladoc links 4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion 111ffdc [Lian, Cheng] More comments and minor reformatting 9e0d840 [Lian, Cheng] Added partition pruning optimization 761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan 04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 9dd3b26 [Michael Armbrust] Fix scaladoc. 6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst 7c92a41 [Lian, Cheng] Added Hive SerDe support ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator 2957f31 [Yin Huai] addressed comments on PR 907db68 [Michael Armbrust] Space after while. 04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts 4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile 5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23 fd084a4 [Michael Armbrust] implement casts binary <=> string. 0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type 548e479 [Yin Huai] merge master into exchangeOperator and fix code style 5b11db0 [Reynold Xin] Added Void to Boolean type widening. 9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear. 9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20 b20a4d4 [Lian, Cheng] Fix issue #20 6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc. 4285962 [Michael Armbrust] Remove temporary test cases 167162f [Michael Armbrust] more merge errors, cleanup. 
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge. 6377d0b [Michael Armbrust] Drop empty files, fix if (). c0b0e60 [Michael Armbrust] cleanup broken doc links. 330a88b [Michael Armbrust] Fix bugs in AddExchange. 4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering. 043e296 [Michael Armbrust] Make physical union nodes variadic. ece15e1 [Michael Armbrust] update unit tests 5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width. 9804eb5 [Michael Armbrust] upgrade spark 053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow 5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg 563053f [Michael Armbrust] Address @rxin's comments. 6537c66 [Michael Armbrust] Address @rxin's comments. 2a76fc6 [Michael Armbrust] add notes from @rxin. 685bfa1 [Michael Armbrust] fix spelling 69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions. 7859a86 [Michael Armbrust] Remove SparkAggregate. Its kinda broken and breaks RDD lineage. fc22e01 [Michael Armbrust] whitelist newly passing union test. 3f547b8 [Michael Armbrust] Add support for widening types in unions. 53b95f8 [Michael Armbrust] coercion should not occur until children are resolved. b892e32 [Michael Armbrust] Union is not resolved until the types match up. 95ab382 [Michael Armbrust] Use resolved instead of custom function. This is better because some nodes override the notion of resolved. 81a109d [Michael Armbrust] fix link. f143f61 [Michael Armbrust] Implement sampling. 
Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!" 6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar. c800798 [Michael Armbrust] Add build status icon. 0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown 05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row. f2fdd77 [Michael Armbrust] fix required distribution for aggregate. 658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala 583a337 [Michael Armbrust] break apart distribution and partitioning. e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator 0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links. 73c70de [Yin Huai] add a first set of unit tests for data properties. fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements. 2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans. fcbc03b [Michael Armbrust] Fix if (). 7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators. b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization. e286d20 [Michael Armbrust] address code review comments. 80d0681 [Michael Armbrust] fix scaladoc links. de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString. 3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation. fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution. 2abb0bc [Michael Armbrust] better debug messages, use exists. 098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable. a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString. dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode. 037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation. ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes. b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen 83adb9d [Yin Huai] add DataProperty 5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics. f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator. This can probably also be used for codegen... 6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet. ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts. 058ec15 [Michael Armbrust] handle more writeables. ffa9f25 [Michael Armbrust] blacklist some more MR tests. aa2239c [Michael Armbrust] filter test lines containing Owner: f71a325 [Michael Armbrust] Update golden jar. a3003ae [Michael Armbrust] Update makefile to use better sharding support. 568d150 [Michael Armbrust] Updates to white/blacklist. 8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right. c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure. See comments for details. 09c6300 [Michael Armbrust] Add nullability information to StructFields. 5460b2d [Michael Armbrust] load srcpart by default. 3695141 [Michael Armbrust] Lots of parser improvements. 965ac9a [Michael Armbrust] Add expressions that allow access into complex types. 3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences. 8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning. e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions. 1814ed3 [Michael Armbrust] use childrenResolved function. f2ec57e [Michael Armbrust] Begin supporting decimal. 6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs 7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters. d0124f3 [Michael Armbrust] Correctly type null literals. b65626e [Michael Armbrust] Initial support for parsing BigDecimal. a90efda [Michael Armbrust] utility function for outputting string stacktraces. 7102f33 [Michael Armbrust] methods with side-effects should use (). 3ccaef7 [Michael Armbrust] add renaming TODO. bc282c7 [Michael Armbrust] fix bug in getNodeNumbered c8e89d5 [Michael Armbrust] memoize inputSet calculation. 6aefa46 [Michael Armbrust] Skip folding literals. a72e540 [Michael Armbrust] Add IN operator. 04f885b [Michael Armbrust] literals are only non-nullable if they are not null. 35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output. 12fd52d [Michael Armbrust] support for sorting longs. 0606520 [Michael Armbrust] drop old comment. 859200a [Michael Armbrust] support for reading more types from the metastore. 1fedd18 [Michael Armbrust] coercion from null to numeric types 71e902d [Michael Armbrust] fix test cases. cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer 8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment 86355a6 [Michael Armbrust] throw error if there are unexpected join clauses. c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute. 0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling a92919d [Michael Armbrust] add alter view as to native commands f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float e9f4588 [Michael Armbrust] fix > 100 char. 755b229 [Michael Armbrust] blacklist some ddl tests. 9ae740a [Michael Armbrust] blacklist more tests that require MR. 4cfc11a [Michael Armbrust] more test coverage. 0d9d56a [Michael Armbrust] add more native commands to parser 78d730d [Michael Armbrust] Load src test table on RESET. 8364ec2 [Michael Armbrust] whitelist all possible partition values. b01468d [Michael Armbrust] support path rewrites when the query begins with a comment. 4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail. 4c5fb0f [Michael Armbrust] makefile target for building new whitelist. 4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO. 516481c [Michael Armbrust] Ignore requests to explain native commands. 68aa2e6 [Michael Armbrust] Stronger type for Token extractor. ca4ea26 [Michael Armbrust] Support for parsing UDF(*). 1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset. 9627616 [Michael Armbrust] Use current database as default database. 9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode. 6f64cee [Michael Armbrust] don't line wrap string literal eafaeed [Michael Armbrust] add type documentation f54c94c [Michael Armbrust] make golden answers file a test dependency 5362365 [Michael Armbrust] push conditions into join 0d2388b [Michael Armbrust] Point at databricks hosted scaladoc. 73b29cd [Michael Armbrust] fix bad casting 9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes 7eff191 [Michael Armbrust] link all the expression names. 83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules 9de6b74 [Michael Armbrust] fix language feature and deprecation warnings. 0b1960a [Michael Armbrust] Fix broken scala doc links / warnings. 
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions 01c00c2 [Michael Armbrust] new golden 5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching 66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork 1a393da [Yin Huai] folded -> foldable 1e964ea [Yin Huai] update a43d41c [Michael Armbrust] more tests passing! 8ca38d0 [Michael Armbrust] begin support for varchar / binary types. ab8bbd1 [Michael Armbrust] parsing % operator c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests. 3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline. 5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 367fb9e [Yin Huai] update 0cd5cc6 [Michael Armbrust] add BIGINT cast parsing 61b266f [Michael Armbrust] comment for eliminate subqueries. d72a5a2 [Michael Armbrust] add long to literal factory object. b3bd15f [Michael Armbrust] blacklist more mr requiring tests. e06fd38 [Michael Armbrust] black list map reduce tests. 8e7ce30 [Michael Armbrust] blacklist some env specific tests. 6250cbd [Michael Armbrust] Do not exit on test failure b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath. b6e4899 [Yin Huai] formatting e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12 5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing? 9e190f5 [Michael Armbrust] drop unneeded () 68b58c1 [Michael Armbrust] drop a few more tests. b0aa400 [Michael Armbrust] update whitelist. 
c99012c [Michael Armbrust] skip tests with hooks db00ebf [Michael Armbrust] more types for hive udfs dbc3678 [Michael Armbrust] update ghpages repo 138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure. 6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests. 46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel. usage: "make -j 8 -i" 8d47ed4 [Yin Huai] comment 2795f05 [Yin Huai] comment e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer 2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 0bd1688 [Yin Huai] update 6a7bd75 [Michael Armbrust] fix partition column delimiter configuration. e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0. b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean 52864da [Reynold Xin] Added executeCollect method to SharkPlan. f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan. b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException. 38124bd [Yin Huai] formatting 2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively 555d839 [Reynold Xin] More cleaning ... d48d0e1 [Reynold Xin] Code review feedback. aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency. 479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if) da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution. 0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore 3f9fee1 [Michael Armbrust] rewrite push filter through join optimization. c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory. c9777d8 [Reynold Xin] Put all source files in a catalyst directory. 019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files. 80ca4be [Timothy Chen] Address comments 0079392 [Michael Armbrust] support for multiple insert commands in a single query 75b5a01 [Michael Armbrust] remove space. 4283400 [Timothy Chen] Add limited predicate push down e547e50 [Michael Armbrust] implement First. e77c9b6 [Michael Armbrust] more work on unique join. c795e06 [Michael Armbrust] improve star expansion a26494e [Michael Armbrust] allow aliases to have qualifiers d078333 [Michael Armbrust] remove extra space a75c023 [Michael Armbrust] implement Coalesce 3a018b6 [Michael Armbrust] fix up docs. ab6f67d [Michael Armbrust] import the string "null" as actual null. 5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved. 191ce3e [Michael Armbrust] analyze rewrite test query. 60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved. 2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues. e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD c086a35 [Michael Armbrust] docs, spacing c4060e4 [Michael Armbrust] cleanup 3b85462 [Michael Armbrust] more tests passing bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data. c944a95 [Michael Armbrust] First aggregate expression. 1e28311 [Michael Armbrust] make tests execute in alpha order again a287481 [Michael Armbrust] spelling 8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing. 
a6ab6c7 [Michael Armbrust] add != 4529594 [Michael Armbrust] draft of coalesce 70f253f [Michael Armbrust] more tests passing! 7349e7b [Michael Armbrust] initial support for test thrift table d3c9305 [Michael Armbrust] fix > 100 char line 93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE" 06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths 6355d0e [Michael Armbrust] match actual return type of count with expected cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty. fd4b096 [Michael Armbrust] fix casing of null strings as well. 4632695 [Michael Armbrust] support for megastore bigint 67b88cf [Michael Armbrust] more verbose debugging of evaluation return types c680e0d [Michael Armbrust] Failed string => number conversion should return null. 2326be1 [Michael Armbrust] make getClauses case insensitive. dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types. 045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions fb5ddfd [Michael Armbrust] move ViewExamples to examples/ 83833e8 [Michael Armbrust] more tests passing! 47c98d6 [Michael Armbrust] add query tests for like and hash. 1724c16 [Michael Armbrust] clear lines that contain last updated times. cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse. 9b2642b [Michael Armbrust] make the blacklist support regexes 1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs 910e33e [Michael Armbrust] basic support for building an assembly jar. d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore. 495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf. 65f4e69 [Michael Armbrust] remove incorrect comments 0831a3c [Michael Armbrust] support for parsing some operator udfs. 6c27aa7 [Michael Armbrust] more cast parsing. 
43db061 [Michael Armbrust] significant generalization of hive udf functionality. 3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines. e5690a6 [Michael Armbrust] add BinaryType adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called. d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]). 8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage. 21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function. 0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons. 2b70abf [Michael Armbrust] true and false literals. ef8b0a5 [Michael Armbrust] more tests. 14d070f [Michael Armbrust] add support for correctly extracting partition keys. 0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions 69a0bd4 [Michael Armbrust] promote strings in predicates with number too. 3946e31 [Michael Armbrust] don't build strings unless assertion fails. 90c453d [Michael Armbrust] more tests passing! 6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting. 8000504 [Michael Armbrust] Improve type coercion. 9087152 [Michael Armbrust] fix toString of Not. 58b111c [Michael Armbrust] fix bad scaladoc tag. d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there. ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness. 1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types. d873b2b [Yin Huai] Remove numbers associated with test cases. 
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown d1e7b8e [Michael Armbrust] Update README.md c8b1553 [Michael Armbrust] Update README.md 9307ef9 [Michael Armbrust] update list of passing tests. 934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers. a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now. ae0024a [Yin Huai] update cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions 21976ae [Yin Huai] update b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions dedbf0c [Yin Huai] support Boolean literals eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals 37817b5 [Yin Huai] add a comment to EvaluateLiterals. 468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. b1d1843 [Michael Armbrust] more work on big data benchmark tests. cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark 7d7fa9f [Michael Armbrust] support for create table as 5f54f03 [Michael Armbrust] parsing for ASC d42b725 [Michael Armbrust] Sum of strings requires cast 34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.) 81659cb [Michael Armbrust] implement transform operator. 5cd76d6 [Michael Armbrust] break up the file based test case code for reuse 1031b65 [Michael Armbrust] support for case insensitive resolution. 
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots) b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt d9d18b4 [Michael Armbrust] debug logging implicit. 669089c [Yin Huai] support Boolean literals ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals 73a05fd [Yin Huai] add a comment to EvaluateLiterals. 191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is. 80039cc [Yin Huai] Merge pull request #1 from yhuai/master cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide 5c518e4 [Michael Armbrust] fix bug in test. b50dd0e [Michael Armbrust] fix return type of overloaded method 05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview. 8c60cc0 [Michael Armbrust] Update README.md 03b9526 [Michael Armbrust] First draft of optimizer tests. f392755 [Michael Armbrust] Add flatMap to TreeNode 6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings 15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions. 06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression. 4b0a888 [Michael Armbrust] implement string comparison and more casts. 356b321 [Michael Armbrust] Update README.md 3776395 [Michael Armbrust] Update README.md 304d17d [Michael Armbrust] Create README.md b7d8be0 [Michael Armbrust] more tests passing. b82481f [Michael Armbrust] add todo comment. 02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist. cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support. 
c43a259 [Michael Armbrust] comments 15ff448 [Michael Armbrust] better error message when a dsl test throws an exception 76ec650 [Michael Armbrust] fix join conditions e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan. 91573a4 [Michael Armbrust] initial type promotion e2ef4a5 [Michael Armbrust] logging e43dc1e [Michael Armbrust] add string => int cast evaluation f1f7e96 [Michael Armbrust] fix incorrect generation of join keys 2b27230 [Michael Armbrust] add depth based subtree access 0f6279f [Michael Armbrust] broken tests. 389bc0b [Michael Armbrust] support for partitioned columns in output. 12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name. b67a225 [Michael Armbrust] better errors when types don't match up. 9e74808 [Michael Armbrust] add children resolved. 6d03ce9 [Michael Armbrust] defaults for unresolved relation 2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions be5ae2c [Michael Armbrust] better resolution logging cb7b5af [Michael Armbrust] views example 420e05b [Michael Armbrust] more tests passing! 6916c63 [Michael Armbrust] Reading from partitioned hive tables. a1245f9 [Michael Armbrust] more tests passing 956e760 [Michael Armbrust] extended explain 5f14c35 [Michael Armbrust] more test tables supported 175c43e [Michael Armbrust] better errors for parse exceptions 480ade5 [Michael Armbrust] don't use partial cached results. 8a9d21c [Michael Armbrust] fix evaluation 7aee69c [Michael Armbrust] parsing for joins, boolean logic 7fcf480 [Michael Armbrust] test for and logic 3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines. 6902490 [Michael Armbrust] fix boolean logic evaluation 4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic 8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work) 60ec19d [Michael Armbrust] initial support for udfs c45b440 [Michael Armbrust] support for is (not) null and boolean logic 7f4a1dc [Michael Armbrust] add NoRelation logical operator 72e183b [Michael Armbrust] support for null values in tree node args. ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs e5c9d1a [Michael Armbrust] use nonEmpty dcc4fe1 [Michael Armbrust] support for src1 test table. c78b587 [Michael Armbrust] casting. 75c3f3f [Michael Armbrust] add support for logging with scalalogging. da2c011 [Michael Armbrust] make it more obvious when results are being truncated. 96b73ba [Michael Armbrust] more docs in TestShark 18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive. e6d063b [Michael Armbrust] more join tests. 664c1c3 [Michael Armbrust] make parsing of function names case insensitive. 0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome. 1a6db68 [Michael Armbrust] spelling 7638cb4 [Michael Armbrust] simple join execution with dsl tests. no hive tests yet. 859d4c9 [Michael Armbrust] better argString printing of nested trees. fc53615 [Michael Armbrust] add same instance comparisons for tree nodes. a026e6b [Michael Armbrust] move out hive specific operators fff4d1c [Michael Armbrust] add simple query execution debugging e2120ab [Michael Armbrust] sorting for strings da06eb6 [Michael Armbrust] Parsing for sortby and joins 9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId. 8eb2460 [Michael Armbrust] add system property to override whitelist. 88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet d25b171 [Michael Armbrust] Add AND and OR expressions 67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll 12acf0a [Michael Armbrust] add .DS_Store for macs f7da6ce [Michael Armbrust] add agg with grouping expr in select test 36805b3 [Michael Armbrust] pull out and improve aggregation 75613e1 [Michael Armbrust] better evaluations failure messages. 4789a35 [Michael Armbrust] weaken type since its hard to create pure references. e89dd36 [Michael Armbrust] no newline for online trees d0590d4 [Michael Armbrust] include stack trace for catalyst failures. 081c0d9 [Michael Armbrust] more generic computation of agg functions. 31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser ecd45b2 [Michael Armbrust] Add more passing tests. 97d5419 [Michael Armbrust] fix alignment. 565cc13 [Michael Armbrust] make the canary query optional. a95e65c [Michael Armbrust] support for resolving qualified attribute references. e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails. 4640a0b [Michael Armbrust] handle test tables when database is specified. bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis. fec5158 [Michael Armbrust] add hive / idea files to .gitignore 3f97ffe [Michael Armbrust] Rename Hive => HiveQl 656b836 [Michael Armbrust] Support for parsing select clause aliases. 3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs. 3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception. 8cbef8a [Michael Armbrust] spelling aa8c37c [Michael Armbrust] Better toString for SortOrder 1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions a2e0327 [Michael Armbrust] add a bunch of tests. 4a3e1ea [Michael Armbrust] docs and use shark for data loading. 
339bb8f [Michael Armbrust] better docs, Not support 1d7b2d9 [Michael Armbrust] Add NaN conversions. 46a2534 [Michael Armbrust] only run canary query on failure. 8996066 [Michael Armbrust] remove protected from makeCopy 53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass 04a372a [Michael Armbrust] add a flag for running all tests. 3b2235b [Michael Armbrust] More general implementation of arithmetic. edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results da6c577 [Michael Armbrust] add string <==> file utility functions. 3adf5ca [Michael Armbrust] Initial support for groupBy and count. 7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ. a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product. d66ba7e [Michael Armbrust] drop printlns. 88f2efd [Michael Armbrust] Add sum / count distinct expressions. 05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark 07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands. b8a9910 [Michael Armbrust] quote tests passing. 8e5e267 [Michael Armbrust] handle aliased select expressions. 4286a96 [Michael Armbrust] drop debugging println ac34aeb [Michael Armbrust] proof of concept for hive ast transformations. 2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general. 74a6337 [Michael Armbrust] use fastEquals when doing transformations. 1184a23 [Michael Armbrust] add native test for escapes. 
b972b18 [Michael Armbrust] create BaseRelation class fa6bce9 [Michael Armbrust] implement union 6391a87 [Michael Armbrust] count aggregate. d47c317 [Michael Armbrust] add unary minus, more tests passing. c7114e4 [Michael Armbrust] first draft of star expansion. 044c43d [Michael Armbrust] better support for numeric literal parsing. 1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view. 61503c5 [Michael Armbrust] add cached toRdd 2036883 [Michael Armbrust] skip explain queries when testing. ebac4b1 [Michael Armbrust] fix bug in sort reference calculation ca0dee0 [Michael Armbrust] docs. 1ee0471 [Michael Armbrust] string literal parsing. 357278b [Michael Armbrust] add limit support 9b3e479 [Michael Armbrust] creation of string literals. 02efa30 [Michael Armbrust] alias evaluation cb68b33 [Michael Armbrust] parsing for random sample in hive ql. 126dd36 [Michael Armbrust] include query plans in failure output bb59ae9 [Michael Armbrust] doc fixes 7e68286 [Michael Armbrust] fix confusing naming 768bb25 [Michael Armbrust] handle errors in shark query toString 829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark. Make test shark a singleton to avoid weirdness with the hive megastore. ad02e41 [Michael Armbrust] comment jdo dependency 7bc89fe [Michael Armbrust] add collect to TreeNode. 438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs. 09679ee [Michael Armbrust] fix bug in TreeNode foreach 2930b27 [Michael Armbrust] more specific name for del query tests. 8842549 [Michael Armbrust] docs. da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL. a8969b9 [Michael Armbrust] Factor out hive query comparison test framework. 1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations. a36dd9a [Michael Armbrust] evaluation for other > data types. 
cae729b [Michael Armbrust] remove unnecessary lazy vals. d8e12af [Michael Armbrust] docs 3a60d67 [Michael Armbrust] implement average, placeholder for count f05c106 [Michael Armbrust] checkAnswer handles single row results. 2730534 [Michael Armbrust] implement inputSet a9aa79d [Michael Armbrust] debugging for sort exec 8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors. 554b4b2 [Michael Armbrust] BoundAttribute pretty printing. 754f5fa [Michael Armbrust] dsl for setting nullability a206d7a [Michael Armbrust] clean up query tests. 84ad6ef [Michael Armbrust] better sort implementation and tests. de24923 [Michael Armbrust] add double type. 9611a2c [Michael Armbrust] literal creation for doubles. 7358313 [Michael Armbrust] sort order returns child type. b544715 [Michael Armbrust] implement eval for rand, and > for doubles 7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols) 1c1a35e [Michael Armbrust] add simple Rand expression. 3ca51de [Michael Armbrust] add orderBy to dsl 7ae41ab [Michael Armbrust] more literal implicit conversions b18b675 [Michael Armbrust] First cut at native query tests for shark. d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark. 5eac895 [Michael Armbrust] better error when descending is specified. 2b16f86 [Michael Armbrust] add todo e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization 9dde3c8 [Michael Armbrust] add project and filter operations. ad9037b [Michael Armbrust] Add support for local relations. 6227143 [Michael Armbrust] evaluation of Equals. 7526290 [Michael Armbrust] BoundReference should also be an Attribute. bd33e26 [Michael Armbrust] more documentation 5de0ea3 [Michael Armbrust] Move all shark specific into a separate package. Lots of documentation improvements. 0ae292b [Michael Armbrust] implement calculation of sort expressions. 
9fd5011 [Michael Armbrust] First cut at expression evaluation. 6259e3a [Michael Armbrust] cleanup 787e5a2 [Michael Armbrust] use fastEquals f90da36 [Michael Armbrust] better printing of optimization exceptions b05dd67 [Michael Armbrust] Application of rules to fixed point. bb2e0db [Michael Armbrust] pretty print for literals. 1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals. d3a3687 [Michael Armbrust] add fastEquals 2b4935b [Michael Armbrust] set sbt.version explicitly 46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests. c79f2fd [Michael Armbrust] insert operator should return an empty rdd. 14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input. ae7b4c3 [Michael Armbrust] remove implicit dependencies. now compiles without copying things into lib/ manually. 84082f9 [Michael Armbrust] add sbt binaries and scripts 15371a8 [Michael Armbrust] First draft of simple Hive DDL parser. 063bf44 [Michael Armbrust] Periods should end all comments. e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack. ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing! b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust. e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected. 26f410a [Michael Armbrust] physical traits should extend PhysicalPlan. dc72469 [Michael Armbrust] beginning of hive compatibility testing framework. 0763490 [Michael Armbrust] support for hive native command pass-through. d8a924f [Michael Armbrust] scaladoc 29a7163 [Michael Armbrust] Insert into hive table physical operator. 633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy. 
59ac444 [Michael Armbrust] add unary expression 3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName' 665f7d0 [Michael Armbrust] add logical nodes for hive data sinks. 64d2923 [Michael Armbrust] Add classes for representing sorts. f72b7ce [Michael Armbrust] first trivial end to end query execution. 5c7d244 [Michael Armbrust] first draft of references implementation. 7bff274 [Michael Armbrust] point at new shark. c7cd57f [Michael Armbrust] docs for util function. 910811c [Michael Armbrust] check each item of the sequence ef21a0b [Michael Armbrust] line up comments. 4b765d5 [Michael Armbrust] docs, drop println 6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution. a703c49 [Michael Armbrust] this order works better until fixed point is implemented. ec1d7c0 [Michael Armbrust] Simple attribute resolution. 069df02 [Michael Armbrust] parsing binary predicates a1cf754 [Michael Armbrust] add joins and equality. 3f5bc98 [Michael Armbrust] add optiq to sbt. 54f3460 [Michael Armbrust] initial optiq parsing. d9161ce [Michael Armbrust] add join operator 1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs 24ef6fb [Michael Armbrust] toString for alias. ae7d776 [Michael Armbrust] add nullability changing function d49dc02 [Michael Armbrust] scaladoc for named exprs 7c45dd7 [Michael Armbrust] pretty printing of trees. 78e34bf [Michael Armbrust] simple git ignore. 7ba19be [Michael Armbrust] First draft of interface to hive metastore. 7e7acf0 [Michael Armbrust] physical placeholder. 1c11136 [Michael Armbrust] first draft of error handling / plans for debugging. 3766a41 [Michael Armbrust] rearrange utility functions. 7fb3d5e [Michael Armbrust] docs and equality improvements. 45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions. 002d4d4 [Michael Armbrust] default to no alias. be25003 [Michael Armbrust] add repl initialization to sbt. 
0608a00 [Michael Armbrust] tighten public interface a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms. daa71ca [Michael Armbrust] foreach, maps, and scaladoc 6a158cb [Michael Armbrust] simple transform working. db0299f [Michael Armbrust] basic analysis of relations minus transform function. f74c4ee [Michael Armbrust] parsing a simple query. 08e4f57 [Michael Armbrust] upgrade scala include shark. d3c6404 [Michael Armbrust] initial commit
2014-03-20 21:03:20 -04:00
// Queries are expressed in HiveQL
spark.sql("FROM src SELECT key, value").collect().foreach(println)
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
When working with Hive, one must instantiate `SparkSession` with Hive support, including
connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. When not configured
by `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
`spark-warehouse` in the directory where the Spark application is started. Note that
the `hive.metastore.warehouse.dir` property in `hive-site.xml` is deprecated since Spark 2.0.0.
Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in the warehouse.
You may need to grant write privileges to the user who starts the Spark application.
{% highlight java %}
SparkSession spark = SparkSession.builder().appName("JavaSparkSQL").enableHiveSupport().getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");
// Queries are expressed in HiveQL.
List<Row> results = spark.sql("FROM src SELECT key, value").collectAsList();
{% endhighlight %}
</div>
SPARK-1374: PySpark API for SparkSQL An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries. ``` from pyspark.context import SQLContext sqlCtx = SQLContext(sc) rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}]) srdd = sqlCtx.applySchema(rdd) sqlCtx.registerRDDAsTable(srdd, "table1") srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1") srdd2.collect() ``` The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]``` Author: Ahir Reddy <ahirreddy@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #363 from ahirreddy/pysql and squashes the following commits: 0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns 307d6e0 [Ahir Reddy] Style fix 6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies 3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py 29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD f2312c7 [Ahir Reddy] Moved everything into sql.py a19afe4 [Ahir Reddy] Doc fixes 6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL 521ff6d [Ahir Reddy] Trying to get spark to build with hive ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins ded03e7 [Ahir Reddy] Added doc test for HiveContext 22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency e4da06c [Ahir Reddy] Display message if hive is not built into spark 227a0be [Michael Armbrust] Update API links. Fix Hive example. 58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api. Minor fixes. 4285340 [Michael Armbrust] Fix building of Hive API Docs. 
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs. 337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build 40491c9 [Ahir Reddy] PR Changes + Method Visibility 1836944 [Michael Armbrust] Fix comments. e00980f [Michael Armbrust] First draft of python sql programming guide. b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test f98a422 [Ahir Reddy] HiveContexts 79621cf [Ahir Reddy] cleaning up cruft b406ba0 [Ahir Reddy] doctest formatting 20936a5 [Ahir Reddy] Added tests and documentation e4d21b4 [Ahir Reddy] Added pyrolite dependency 79f739d [Ahir Reddy] added more tests 7515ba0 [Ahir Reddy] added more tests :) d26ec5e [Ahir Reddy] added test e9f5b8d [Ahir Reddy] adding tests 906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python 251f99d [Ahir Reddy] for now only allow dictionaries as input 09b9980 [Ahir Reddy] made jrdd explicitly lazy c608947 [Ahir Reddy] SchemaRDD now has all RDD operations 725c91e [Ahir Reddy] awesome row objects 55d1c76 [Ahir Reddy] return row objects 4fe1319 [Ahir Reddy] output dictionaries correctly be079de [Ahir Reddy] returning dictionaries works cd5f79f [Ahir Reddy] Switched to using Scala SQLContext e948bd9 [Ahir Reddy] yippie 4886052 [Ahir Reddy] even better c0fb1c6 [Ahir Reddy] more working 043ca85 [Ahir Reddy] working 5496f9f [Ahir Reddy] doesn't crash b8b904b [Ahir Reddy] Added schema rdd class 67ba875 [Ahir Reddy] java to python, and python to java bcc0f23 [Ahir Reddy] Java to python ab6025d [Ahir Reddy] compiling
2014-04-15 03:07:55 -04:00
<div data-lang="python" markdown="1">
When working with Hive, one must instantiate `SparkSession` with Hive support, including
connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
Users who do not have an existing Hive deployment can still enable Hive support. When not configured
by `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and
creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory
`spark-warehouse` in the directory where the Spark application is started. Note that
the `hive.metastore.warehouse.dir` property in `hive-site.xml` is deprecated since Spark 2.0.0.
Instead, use `spark.sql.warehouse.dir` to specify the default location of databases in the warehouse.
You may need to grant write privileges to the user who starts the Spark application.
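
As a minimal sketch of the configuration described above, a Hive-enabled session with an explicit
warehouse location could be built roughly as follows. The helper name `create_hive_session`, the
application name, and the choice of an absolute `spark-warehouse` path are illustrative assumptions,
not part of the Spark API; `SparkSession.builder`, `config`, `enableHiveSupport`, and `getOrCreate`
are the standard builder calls.

```python
import os

# Assumption: keep the warehouse under the directory the application is started from.
warehouse_location = os.path.abspath("spark-warehouse")


def create_hive_session(app_name, warehouse_dir):
    """Build (or reuse) a SparkSession with Hive support enabled.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the path handling above does not require PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Default location for managed databases and tables.
        .config("spark.sql.warehouse.dir", warehouse_dir)
        # Enables the Hive metastore connection, serdes, and Hive UDFs.
        .enableHiveSupport()
        .getOrCreate()
    )
```

Calling `spark = create_hive_session("PythonSparkSQLHive", warehouse_location)` would then give a
session that can be used with `spark.sql(...)`, provided Spark was built with Hive support and the
user has write access to the warehouse directory.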
{% highlight python %}
# spark is an existing SparkSession created with Hive support enabled
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
# Queries can be expressed in HiveQL.
results = spark.sql("FROM src SELECT key, value").collect()
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
When working with Hive one must instantiate `SparkSession` with Hive support. This
adds support for finding tables in the MetaStore and writing queries using HiveQL.
{% highlight r %}
# enableHiveSupport defaults to TRUE
sparkR.session(enableHiveSupport = TRUE)
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
# Queries can be expressed in HiveQL.
results <- collect(sql("FROM src SELECT key, value"))
{% endhighlight %}
</div>
</div>
### Interacting with Different Versions of Hive Metastore
One of the most important pieces of Spark SQL's Hive support is interaction with the Hive metastore,
which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
The following options can be used to configure the version of Hive that is used to retrieve metadata:
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.sql.hive.metastore.version</code></td>
<td><code>1.2.1</code></td>
<td>
Version of the Hive metastore. Available
options are <code>0.12.0</code> through <code>1.2.1</code>.
</td>
</tr>
<tr>
<td><code>spark.sql.hive.metastore.jars</code></td>
<td><code>builtin</code></td>
<td>
Location of the jars that should be used to instantiate the HiveMetastoreClient. This
property can be one of three options:
<ol>
<li><code>builtin</code>
Use Hive 1.2.1, which is bundled with the Spark assembly when <code>-Phive</code> is
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
either <code>1.2.1</code> or not defined.</li>
<li><code>maven</code>
Use Hive jars of the specified version downloaded from Maven repositories. This configuration
is not generally recommended for production deployments.</li>
<li>A classpath in the standard format for the JVM. This classpath must include all of Hive
and its dependencies, including the correct version of Hadoop. These jars only need to be
present on the driver, but if you are running in yarn cluster mode then you must ensure
they are packaged with your application.</li>
</ol>
</td>
</tr>
<tr>
<td><code>spark.sql.hive.metastore.sharedPrefixes</code></td>
<td><code>com.mysql.jdbc,<br/>org.postgresql,<br/>com.microsoft.sqlserver,<br/>oracle.jdbc</code></td>
<td>
<p>
A comma separated list of class prefixes that should be loaded using the classloader that is
shared between Spark SQL and a specific version of Hive. An example of classes that should
be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need
to be shared are those that interact with classes that are already shared. For example,
custom appenders that are used by log4j.
</p>
</td>
</tr>
<tr>
<td><code>spark.sql.hive.metastore.barrierPrefixes</code></td>
<td><code>(empty)</code></td>
<td>
<p>
A comma separated list of class prefixes that should explicitly be reloaded for each version
of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a
prefix that typically would be shared (i.e. <code>org.apache.spark.*</code>).
</p>
</td>
</tr>
</table>
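As a rough illustration of how the first two settings in the table interact, the following sketch renders them as `--conf` arguments and enforces the rule that `builtin` jars only go with version `1.2.1`. The helper name `metastore_conf_args` is hypothetical and not part of Spark; only the property names and the `builtin` rule come from the table above.

```python
def metastore_conf_args(version=None, jars="builtin"):
    """Render --conf arguments for the Hive metastore version and jars settings.

    Hypothetical helper for illustration; not part of Spark.
    """
    # Rule from the table: with `builtin` jars the version must be 1.2.1 or unset.
    if jars == "builtin" and version not in (None, "1.2.1"):
        raise ValueError("builtin jars require metastore version 1.2.1 or no version")

    conf = {"spark.sql.hive.metastore.jars": jars}
    if version is not None:
        conf["spark.sql.hive.metastore.version"] = version

    # Emit the settings in a stable (sorted) order.
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(metastore_conf_args(version="0.13.1", jars="maven")))
```

The resulting arguments could be appended to a `bin/spark-shell` or `bin/spark-submit` command line; whether `maven` or an explicit classpath is appropriate depends on the deployment.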
## JDBC To Other Databases
Spark SQL also includes a data source that can read data from other databases using JDBC. This
functionality should be preferred over using [JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD).
This is because the results are returned
as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
The JDBC data source is also easier to use from Java or Python as it does not require the user to
provide a ClassTag.
(Note that this is different than the Spark SQL JDBC server, which allows other applications to
run queries using Spark SQL).
To get started you will need to include the JDBC driver for your particular database on the
Spark classpath. For example, to connect to PostgreSQL from the Spark shell you would run the
following command:
{% highlight bash %}
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
{% endhighlight %}
Tables from the remote database can be loaded as a DataFrame or Spark SQL Temporary table using
the Data Sources API. The following options are supported:
<table class="table">
<tr><th>Property Name</th><th>Meaning</th></tr>
<tr>
<td><code>url</code></td>
<td>
The JDBC URL to connect to.
</td>
</tr>
<tr>
<td><code>dbtable</code></td>
<td>
The JDBC table that should be read. Note that anything that is valid in a <code>FROM</code> clause of
a SQL query can be used. For example, instead of a full table you could also use a
subquery in parentheses.
</td>
</tr>
<tr>
<td><code>driver</code></td>
<td>
The class name of the JDBC driver to use to connect to this URL.
</td>
</tr>
<tr>
<td><code>partitionColumn, lowerBound, upperBound, numPartitions</code></td>
<td>
These options must all be specified if any of them is specified. They describe how to
partition the table when reading in parallel from multiple workers.
<code>partitionColumn</code> must be a numeric column from the table in question. Notice
that <code>lowerBound</code> and <code>upperBound</code> are just used to decide the
partition stride, not for filtering the rows in the table. So all rows in the table will be
partitioned and returned.
</td>
</tr>
<tr>
<td><code>fetchSize</code></td>
<td>
The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g. Oracle with 10 rows).
</td>
</tr>
</table>
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
val jdbcDF = spark.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:dbserver",
"dbtable" -> "schema.tablename")).load()
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
Map<String, String> options = new HashMap<>();
options.put("url", "jdbc:postgresql:dbserver");
options.put("dbtable", "schema.tablename");
Dataset<Row> jdbcDF = spark.read().format("jdbc"). options(options).load();
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
df = spark.read.format('jdbc').options(url='jdbc:postgresql:dbserver', dbtable='schema.tablename').load()
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight r %}
df <- read.jdbc("jdbc:postgresql:dbserver", "schema.tablename", user = "username", password = "password")
{% endhighlight %}
</div>
<div data-lang="sql" markdown="1">
{% highlight sql %}
CREATE TEMPORARY VIEW jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:postgresql:dbserver",
dbtable "schema.tablename"
)
{% endhighlight %}
</div>
</div>
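The options table above notes that `lowerBound` and `upperBound` only determine the partition stride and do not filter rows. A minimal sketch of that idea follows, assuming a simple even split; the function `partition_predicates` is hypothetical and is not Spark's actual implementation. The first and last partitions are left open-ended, so rows outside the bounds (and NULLs) are still read.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause predicate per partition.

    Illustrative only: assumes upper_bound - lower_bound >= num_partitions.
    """
    if num_partitions < 2:
        return ["1 = 1"]  # single partition: read everything

    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    boundary = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {boundary}" if i > 0 else None
        boundary += stride
        upper = f"{column} < {boundary}" if i < num_partitions - 1 else None

        if lower is None:
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for predicate in partition_predicates("id", 0, 100, 4):
    print(predicate)
```

Because the outer partitions are unbounded, choosing bounds far from the real data range does not lose rows, but it can leave most of the work in one or two partitions.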
## Troubleshooting
* The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java's DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
* Some databases, such as H2, convert all names to upper case. You'll need to use upper case to refer to those names in Spark SQL.
# Performance Tuning
For some workloads it is possible to improve performance by either caching data in memory, or by
turning on some experimental options.
## Caching Data In Memory
Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
Then Spark SQL will scan only required columns and will automatically tune compression to minimize
memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory.
Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running
`SET key=value` commands using SQL.
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
<td>true</td>
<td>
When set to true Spark SQL will automatically select a compression codec for each column based
on statistics of the data.
</td>
</tr>
<tr>
<td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
<td>10000</td>
<td>
Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
and compression, but risk OOMs when caching data.
</td>
</tr>
</table>
## Other Configuration Options
The following options can also be used to tune the performance of query execution. It is possible
that these options will be deprecated in a future release as more optimizations are performed automatically.
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.sql.files.maxPartitionBytes</code></td>
<td>134217728 (128 MB)</td>
<td>
The maximum number of bytes to pack into a single partition when reading files.
</td>
</tr>
<tr>
<td><code>spark.sql.files.openCostInBytes</code></td>
<td>4194304 (4 MB)</td>
<td>
The estimated cost to open a file, measured by the number of bytes that could be scanned in the same
time. This is used when putting multiple files into a partition. It is better to overestimate it;
then the partitions with small files will be faster than partitions with bigger files (which are
scheduled first).
</td>
</tr>
<tr>
<td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
<td>10485760 (10 MB)</td>
<td>
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
statistics are only supported for Hive Metastore tables where the command
<code>ANALYZE TABLE &lt;tableName&gt; COMPUTE STATISTICS noscan</code> has been run.
</td>
</tr>
<tr>
<td><code>spark.sql.tungsten.enabled</code></td>
<td>true</td>
<td>
When true, use the optimized Tungsten physical execution backend which explicitly manages memory
and dynamically generates bytecode for expression evaluation.
</td>
</tr>
<tr>
<td><code>spark.sql.shuffle.partitions</code></td>
<td>200</td>
<td>
Configures the number of partitions to use when shuffling data for joins or aggregations.
</td>
</tr>
</table>
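To make the interplay between `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` more concrete, here is a simplified sketch of a greedy packing scheme in which every file added to a partition is charged its size plus the open cost. The function `pack_files` is hypothetical, and the exact algorithm Spark uses may differ; the default values are the ones listed in the table above.

```python
MB = 1024 * 1024


def pack_files(file_sizes, max_partition_bytes=128 * MB, open_cost_in_bytes=4 * MB):
    """Greedily pack files (largest first) into partitions.

    Illustrative only: each file is charged its size plus a fixed open cost.
    """
    partitions = []
    current, current_size = [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would push it over the limit.
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions


# Forty 1 MB files: the open cost keeps them from landing in a single partition.
print(len(pack_files([1 * MB] * 40)))
print(len(pack_files([1 * MB] * 40, open_cost_in_bytes=0)))
```

In this sketch a higher open cost spreads many small files across more partitions, which is the effect the `openCostInBytes` description is getting at.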
# Distributed SQL Engine
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
without the need to write any code.
## Running the Thrift JDBC/ODBC server
The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
To start the JDBC/ODBC server, run the following in the Spark directory:
./sbin/start-thriftserver.sh
This script accepts all `bin/spark-submit` command line options, plus a `--hiveconf` option to
specify Hive properties. You may run `./sbin/start-thriftserver.sh --help` for a complete list of
all available options. By default, the server listens on localhost:10000. You may override this
behaviour via either environment variables, i.e.:
{% highlight bash %}
export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
--master <master-uri> \
...
{% endhighlight %}
or system properties:
{% highlight bash %}
./sbin/start-thriftserver.sh \
--hiveconf hive.server2.thrift.port=<listening-port> \
--hiveconf hive.server2.thrift.bind.host=<listening-host> \
--master <master-uri> \
...
{% endhighlight %}
Now you can use beeline to test the Thrift JDBC/ODBC server:
./bin/beeline
Connect to the JDBC/ODBC server in beeline with:
beeline> !connect jdbc:hive2://localhost:10000
Beeline will ask you for a username and password. In non-secure mode, simply enter the username on
your machine and a blank password. For secure mode, please follow the instructions given in the
[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients).
Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
You may also use the beeline script that comes with Hive.
The Thrift JDBC server also supports sending thrift RPC messages over HTTP transport.
Use the following settings to enable HTTP mode, either as system properties or in the `hive-site.xml` file in `conf/`:
hive.server2.transport.mode - Set this to value: http
hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
hive.server2.http.endpoint - HTTP endpoint; default is cliservice
To test, use beeline to connect to the JDBC/ODBC server in http mode with:
beeline> !connect jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
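The HTTP-mode connection string above can be assembled programmatically. The sketch below only formats the URL shown in the preceding line; the helper name `beeline_http_url` is hypothetical, the port and endpoint defaults are the ones listed above, and the `default` database name is an assumption.

```python
def beeline_http_url(host, port=10001, database="default", http_endpoint="cliservice"):
    """Build a JDBC URL for connecting to the Thrift server in HTTP mode.

    Hypothetical helper; `database="default"` is an assumed value.
    """
    return (
        f"jdbc:hive2://{host}:{port}/{database}"
        "?hive.server2.transport.mode=http;"
        f"hive.server2.thrift.http.path={http_endpoint}"
    )


print(beeline_http_url("localhost"))
```

The printed string can be pasted after `!connect` at the beeline prompt.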
## Running the Spark SQL CLI
The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute
queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.
To start the Spark SQL CLI, run the following in the Spark directory:
./bin/spark-sql
Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
You may run `./bin/spark-sql --help` for a complete list of all available
options.
# Migration Guide
## Upgrading From Spark SQL 1.6 to 2.0
- `SparkSession` is now the new entry point of Spark that replaces the old `SQLContext` and
`HiveContext`. Note that the old SQLContext and HiveContext are kept for backward compatibility.
- Dataset API and DataFrame API are unified. In Scala, `DataFrame` becomes a type alias for
`Dataset[Row]`, while Java API users must replace `DataFrame` with `Dataset<Row>`. Both the typed
transformations (e.g. `map`, `filter`, and `groupByKey`) and untyped transformations (e.g.
`select` and `groupBy`) are available on the Dataset class. Since compile-time type-safety in
Python and R is not a language feature, the concept of Dataset does not apply to these languages'
APIs. Instead, `DataFrame` remains the primary programming abstraction, which is analogous to the
single-node data frame notion in these languages.
## Upgrading From Spark SQL 1.5 to 1.6
- From Spark 1.6, by default the Thrift server runs in multi-session mode, which means each JDBC/ODBC
connection owns its own copy of the SQL configuration and temporary function registry. Cached
tables are still shared though. If you prefer to run the Thrift server in the old single-session
mode, please set option `spark.sql.hive.thriftServer.singleSession` to `true`. You may either add
this option to `spark-defaults.conf`, or pass it to `start-thriftserver.sh` via `--conf`:
{% highlight bash %}
./sbin/start-thriftserver.sh \
--conf spark.sql.hive.thriftServer.singleSession=true \
...
{% endhighlight %}
- Since 1.6.1, withColumn method in sparkR supports adding a new column to or replacing existing columns
of the same name of a DataFrame.
- From Spark 1.6, LongType casts to TimestampType expect seconds instead of microseconds. This
change was made to match the behavior of Hive 1.2 for more consistent type casting to TimestampType
from numeric types. See [SPARK-11724](https://issues.apache.org/jira/browse/SPARK-11724) for
details.
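The change to `LongType` casts described above is easiest to see with a concrete number. The following plain-Python sketch (it does not call Spark; the function names are hypothetical) interprets the same integer first as seconds and then as microseconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def long_to_timestamp_seconds(value):
    """Interpret the integer as seconds since the epoch (behavior from 1.6)."""
    return EPOCH + timedelta(seconds=value)


def long_to_timestamp_micros(value):
    """Interpret the integer as microseconds since the epoch (older behavior)."""
    return EPOCH + timedelta(microseconds=value)


value = 1451606400
print(long_to_timestamp_seconds(value).isoformat())  # 2016-01-01T00:00:00+00:00
print(long_to_timestamp_micros(value).isoformat())   # a moment early on 1970-01-01
```

Code that relied on the earlier interpretation would need to divide such values by 1,000,000 before casting to get the same timestamps.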
## Upgrading From Spark SQL 1.4 to 1.5
- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
code generation for expression evaluation. These features can both be disabled by setting
`spark.sql.tungsten.enabled` to `false`.
- Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
`spark.sql.parquet.mergeSchema` to `true`.
- Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
access nested values, for example `df['table.column.nestedField']`. However, this means that if
your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
- In-memory columnar storage partition pruning is on by default. It can be disabled by setting
`spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
- Unlimited precision decimal columns are no longer supported. Instead, Spark SQL enforces a maximum
precision of 38. When inferring schema from `BigDecimal` objects, a precision and scale of (38, 18) is now
used. When no precision is specified in DDL, the default remains `Decimal(10, 0)`.
- Timestamps are now stored at a precision of 1us, rather than 1ns.
- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
unchanged.
- The canonical names of SQL/DataFrame functions are now lower case (e.g., `sum` vs `SUM`).
- The JSON data source will not automatically load new files that are created by other applications
(i.e. files that are not inserted into the dataset through Spark SQL).
For a JSON persistent table (i.e. the metadata of the table is stored in Hive Metastore),
users can use the `REFRESH TABLE` SQL command or `HiveContext`'s `refreshTable` method
to include those new files in the table. For a DataFrame representing a JSON dataset, users need to recreate
the DataFrame, and the new DataFrame will include the new files.
- The `DataFrame.withColumn` method in PySpark supports adding a new column or replacing existing columns of the same name.
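As a rough illustration of what a precision of 38 with a scale of 18 allows, the sketch below checks whether a value fits in a given precision and scale using Python's standard `decimal` module. The `fits` helper is hypothetical and is not part of Spark; it only models the digit budget (at most `precision - scale` integer digits and `scale` fraction digits) and says nothing about how Spark itself treats values that do not fit.

```python
from decimal import Decimal

MAX_PRECISION = 38  # maximum precision stated in the list above


def fits(value, precision=38, scale=18):
    """Return True if `value` is exactly representable with the given precision and scale."""
    if not 0 <= scale <= precision <= MAX_PRECISION:
        raise ValueError("expected 0 <= scale <= precision <= 38")
    if not value.is_finite():
        return False
    if value.is_zero():
        return True
    _, digits, exponent = value.as_tuple()
    fraction_digits = max(0, -exponent)              # digits right of the decimal point
    integer_digits = max(0, len(digits) + exponent)  # digits left of the decimal point
    return fraction_digits <= scale and integer_digits <= precision - scale


# 20 integer digits and 18 fraction digits: exactly the (38, 18) budget.
print(fits(Decimal("12345678901234567890.123456789012345678")))
# 21 integer digits: one more than (38, 18) leaves room for.
print(fits(Decimal("123456789012345678901")))
# A fractional value does not fit the DDL default of (10, 0).
print(fits(Decimal("4.2"), precision=10, scale=0))
```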
## Upgrading from Spark SQL 1.3 to 1.4
#### DataFrame data reader/writer interface
Based on user feedback, we created a new, more fluid API for reading data in (`SQLContext.read`)
and writing data out (`DataFrame.write`),
and deprecated the old APIs (e.g. `SQLContext.parquetFile`, `SQLContext.jsonFile`).
See the API docs for `SQLContext.read` (
<a href="api/scala/index.html#org.apache.spark.sql.SQLContext@read:DataFrameReader">Scala</a>,
<a href="api/java/org/apache/spark/sql/SQLContext.html#read()">Java</a>,
<a href="api/python/pyspark.sql.html#pyspark.sql.SQLContext.read">Python</a>
) and `DataFrame.write` (
<a href="api/scala/index.html#org.apache.spark.sql.DataFrame@write:DataFrameWriter">Scala</a>,
<a href="api/java/org/apache/spark/sql/DataFrame.html#write()">Java</a>,
<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrame.write">Python</a>
) for more information.
#### DataFrame.groupBy retains grouping columns
Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the
grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"))
// Revert to 1.3 behavior (not retaining grouping column) by:
sqlContext.setConf("spark.sql.retainGroupColumns", "false")
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"));
// Revert to 1.3 behavior (not retaining grouping column) by:
sqlContext.setConf("spark.sql.retainGroupColumns", "false");
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}
import pyspark.sql.functions as func
# In 1.3.x, in order for the grouping column "department" to show up,
# it must be included explicitly as part of the agg function call.
df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense"))
# In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(func.max("age"), func.sum("expense"))
# Revert to 1.3.x behavior (not retaining grouping column) by:
sqlContext.setConf("spark.sql.retainGroupColumns", "false")
{% endhighlight %}
</div>
</div>
#### Behavior change on DataFrame.withColumn
Prior to 1.4, `DataFrame.withColumn()` supported adding a column only. The column was always added
as a new column with its specified name in the result DataFrame, even if existing
columns had the same name. Since 1.4, `DataFrame.withColumn()` supports adding a column whose
name differs from the names of all existing columns, or replacing existing columns of the same name.
Note that this change is only for the Scala API, not for PySpark and SparkR.
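The difference between the two behaviors can be modeled without Spark. The sketch below is a toy model in plain Python, representing a frame as a list of `(name, values)` pairs; the functions `with_column_pre_1_4` and `with_column_since_1_4` are hypothetical names for this illustration and do not correspond to any Spark API.

```python
def with_column_pre_1_4(columns, name, values):
    """Always append the column, even if `name` already exists."""
    return columns + [(name, values)]


def with_column_since_1_4(columns, name, values):
    """Replace every existing column called `name`; append only if none exists."""
    if any(existing == name for existing, _ in columns):
        return [(n, values if n == name else v) for n, v in columns]
    return columns + [(name, values)]


frame = [("id", [1, 2]), ("age", [30, 41])]

old = with_column_pre_1_4(frame, "age", [31, 42])
new = with_column_since_1_4(frame, "age", [31, 42])

print([n for n, _ in old])  # ['id', 'age', 'age']
print([n for n, _ in new])  # ['id', 'age']
```

Under the older behavior the result carries two columns named `age`, which makes later references to `age` ambiguous; under the newer behavior the column is replaced in place and keeps its position.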
## Upgrading from Spark SQL 1.0-1.2 to 1.3
In Spark 1.3 we removed the "Alpha" label from Spark SQL and as part of this did a cleanup of the
available APIs. From Spark 1.3 onwards, Spark SQL will provide binary compatibility with other
releases in the 1.X series. This compatibility guarantee excludes APIs that are explicitly marked
as unstable (i.e., DeveloperAPI or Experimental).
#### Rename of SchemaRDD to DataFrame
The largest change that users will notice when upgrading to Spark SQL 1.3 is that `SchemaRDD` has
been renamed to `DataFrame`. This is primarily because DataFrames no longer inherit from RDD
directly, but instead provide most of the functionality that RDDs provide through their own
implementation. DataFrames can still be converted to RDDs by calling the `.rdd` method.
In Scala there is a type alias from `SchemaRDD` to `DataFrame` to provide source compatibility for
some use cases. It is still recommended that users update their code to use `DataFrame` instead.
Java and Python users will need to update their code.
#### Unification of the Java and Scala APIs
Prior to Spark 1.3 there were separate Java compatible classes (`JavaSQLContext` and `JavaSchemaRDD`)
that mirrored the Scala API. In Spark 1.3 the Java API and Scala API have been unified. Users
of either language should use `SQLContext` and `DataFrame`. In general, these classes try to
use types that are usable from both languages (e.g., `Array` instead of language-specific collections).
In some cases where no common type exists (e.g., for passing in closures or Maps) function overloading
is used instead.
Additionally, the Java-specific types API has been removed. Users of both Scala and Java should
use the classes present in `org.apache.spark.sql.types` to describe schema programmatically.
#### Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)
Many of the code examples prior to Spark 1.3 started with `import sqlContext._`, which brought
all of the functions from sqlContext into scope. In Spark 1.3 we have isolated the implicit
conversions for converting `RDD`s into `DataFrame`s into an object inside of the `SQLContext`.
Users should now write `import sqlContext.implicits._`.
Additionally, the implicit conversions now only augment RDDs that are composed of `Product`s (i.e.,
case classes or tuples) with a method `toDF`, instead of applying automatically.
When using functions inside of the DSL (now replaced with the `DataFrame` API), users used to import
`org.apache.spark.sql.catalyst.dsl`. Instead, the public DataFrame functions API should be used:
`import org.apache.spark.sql.functions._`.
#### Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)
Spark 1.3 removes the type aliases that were present in the base sql package for `DataType`. Users
should instead import the classes in `org.apache.spark.sql.types`.
#### UDF Registration Moved to `sqlContext.udf` (Java & Scala)
Functions that are used to register UDFs, either for use in the DataFrame DSL or SQL, have been
moved into the `udf` object in `SQLContext`.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
sqlContext.udf.register("strLen", (s: String) => s.length())
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}
sqlContext.udf().register("strLen", (String s) -> s.length(), DataTypes.IntegerType);
{% endhighlight %}
</div>
</div>
Python UDF registration is unchanged.
#### Python DataTypes No Longer Singletons
When using DataTypes in Python you will need to construct them (e.g., `StringType()`) instead of
referencing a singleton.
## Compatibility with Apache Hive
Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.
Currently Hive SerDes and UDFs are based on Hive 1.2.1,
and Spark SQL can be connected to different versions of Hive Metastore
(from 0.12.0 to 1.2.1. Also see [Interacting with Different Versions of Hive Metastore](#interacting-with-different-versions-of-hive-metastore)).
#### Deploying in Existing Hive Warehouses
The Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive
installations. You do not need to modify your existing Hive Metastore or change the data placement
or partitioning of your tables.
### Supported Hive Features
Spark SQL supports the vast majority of Hive features, such as:
* Hive query statements, including:
* `SELECT`
* `GROUP BY`
* `ORDER BY`
* `CLUSTER BY`
* `SORT BY`
* All Hive operators, including:
* Relational operators (`=`, `<=>`, `==`, `<>`, `<`, `>`, `>=`, `<=`, etc)
* Arithmetic operators (`+`, `-`, `*`, `/`, `%`, etc)
* Logical operators (`AND`, `&&`, `OR`, `||`, etc)
* Complex type constructors
* Mathematical functions (`sign`, `ln`, `cos`, etc)
* String functions (`instr`, `length`, `printf`, etc)
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
* Window functions
* Joins
* `JOIN`
* `{LEFT|RIGHT|FULL} OUTER JOIN`
* `LEFT SEMI JOIN`
* `CROSS JOIN`
* Unions
* Sub-queries
* `SELECT col FROM ( SELECT a + b AS col from t1) t2`
* Sampling
* Explain
* Partitioned tables including dynamic partition insertion
* View
* All Hive DDL Functions, including:
* `CREATE TABLE`
* `CREATE TABLE AS SELECT`
* `ALTER TABLE`
* Most Hive Data types, including:
* `TINYINT`
* `SMALLINT`
* `INT`
* `BIGINT`
* `BOOLEAN`
* `FLOAT`
* `DOUBLE`
* `STRING`
* `BINARY`
* `TIMESTAMP`
* `DATE`
* `ARRAY<>`
* `MAP<>`
* `STRUCT<>`
### Unsupported Hive Functionality
Below is a list of Hive features that we don't support yet. Most of these features are rarely used
in Hive deployments.
**Major Hive Features**
* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL
doesn't support buckets yet.
**Esoteric Hive Features**
* `UNION` type
* Unique join
* Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at
the moment and only supports populating the `sizeInBytes` field of the Hive metastore.
**Hive Input/Output Formats**
* File format for CLI: For results displayed in the CLI, Spark SQL only supports TextOutputFormat.
* Hadoop archive
**Hive Optimizations**
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are
less important due to Spark SQL's in-memory computational model. Others are slated for future
releases of Spark SQL.
* Block level bitmap indexes and virtual columns (used to build indexes)
* Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you
need to control the degree of parallelism post-shuffle using "`SET spark.sql.shuffle.partitions=[num_tasks];`".
* Meta-data only query: For queries that can be answered by using only meta data, Spark SQL still
launches tasks to compute the result.
* Skew data flag: Spark SQL does not follow the skew data flags in Hive.
* `STREAMTABLE` hint in join: Spark SQL does not follow the `STREAMTABLE` hint.
* Merge multiple small files for query results: if the result output contains multiple small files,
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
metadata. Spark SQL does not support that.
# Reference
## Data Types
Spark SQL and DataFrames support the following data types:
* Numeric types
- `ByteType`: Represents 1-byte signed integer numbers.
The range of numbers is from `-128` to `127`.
- `ShortType`: Represents 2-byte signed integer numbers.
The range of numbers is from `-32768` to `32767`.
- `IntegerType`: Represents 4-byte signed integer numbers.
The range of numbers is from `-2147483648` to `2147483647`.
- `LongType`: Represents 8-byte signed integer numbers.
The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
- `FloatType`: Represents 4-byte single-precision floating point numbers.
- `DoubleType`: Represents 8-byte double-precision floating point numbers.
- `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
* String type
- `StringType`: Represents character string values.
* Binary type
- `BinaryType`: Represents byte sequence values.
* Boolean type
- `BooleanType`: Represents boolean values.
* Datetime type
- `TimestampType`: Represents values comprising values of fields year, month, day,
hour, minute, and second.
- `DateType`: Represents values comprising values of fields year, month, day.
* Complex types
- `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
elements with the type of `elementType`. `containsNull` is used to indicate if
elements in an `ArrayType` value can have `null` values.
- `MapType(keyType, valueType, valueContainsNull)`:
Represents values comprising a set of key-value pairs. The data type of keys is
described by `keyType` and the data type of values is described by `valueType`.
For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
is used to indicate if values of a `MapType` value can have `null` values.
- `StructType(fields)`: Represents values with the structure described by
a sequence of `StructField`s (`fields`).
* `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
The name of a field is indicated by `name`. The data type of a field is indicated
by `dataType`. `nullable` is used to indicate if values of this field can have
`null` values.
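The ranges quoted for the integral types follow directly from their byte widths under two's-complement representation. The short sketch below derives them in plain Python; the `signed_range` helper and the `INTEGRAL_TYPES` mapping are hypothetical names used only for this illustration.

```python
def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of the given width."""
    bits = 8 * num_bytes
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


# Byte widths as listed above for each integral type.
INTEGRAL_TYPES = {"ByteType": 1, "ShortType": 2, "IntegerType": 4, "LongType": 8}

for name, width in INTEGRAL_TYPES.items():
    low, high = signed_range(width)
    print(f"{name}: {low} to {high}")
```

Python integers are unbounded, so a check of this kind is a convenient way to validate values before storing them in one of the narrower types.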
<div class="codetabs">
<div data-lang="scala" markdown="1">
All data types of Spark SQL are located in the package `org.apache.spark.sql.types`.
You can access them by doing
{% highlight scala %}
import org.apache.spark.sql.types._
{% endhighlight %}
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in Scala</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td> Byte </td>
<td>
ByteType
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td> Short </td>
<td>
ShortType
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> Int </td>
<td>
IntegerType
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td> Long </td>
<td>
LongType
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td> Float </td>
<td>
FloatType
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> Double </td>
<td>
DoubleType
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> java.math.BigDecimal </td>
<td>
DecimalType
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> String </td>
<td>
StringType
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> Array[Byte] </td>
<td>
BinaryType
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> Boolean </td>
<td>
BooleanType
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> java.sql.Timestamp </td>
<td>
TimestampType
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> java.sql.Date </td>
<td>
DateType
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> scala.collection.Seq </td>
<td>
ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
<b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>.
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> scala.collection.Map </td>
<td>
MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>.
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> org.apache.spark.sql.Row </td>
<td>
StructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in Scala of the data type of this field
(For example, Int for a StructField with the data type IntegerType) </td>
<td>
StructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
</td>
</tr>
</table>
</div>
<div data-lang="java" markdown="1">
All data types of Spark SQL are located in the package of
`org.apache.spark.sql.types`. To access or create a data type,
please use factory methods provided in
`org.apache.spark.sql.types.DataTypes`.
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in Java</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td> byte or Byte </td>
<td>
DataTypes.ByteType
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td> short or Short </td>
<td>
DataTypes.ShortType
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> int or Integer </td>
<td>
DataTypes.IntegerType
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td> long or Long </td>
<td>
DataTypes.LongType
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td> float or Float </td>
<td>
DataTypes.FloatType
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> double or Double </td>
<td>
DataTypes.DoubleType
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> java.math.BigDecimal </td>
<td>
DataTypes.createDecimalType()<br />
DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>).
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> String </td>
<td>
DataTypes.StringType
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> byte[] </td>
<td>
DataTypes.BinaryType
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> boolean or Boolean </td>
<td>
DataTypes.BooleanType
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> java.sql.Timestamp </td>
<td>
DataTypes.TimestampType
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> java.sql.Date </td>
<td>
DataTypes.DateType
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> java.util.List </td>
<td>
DataTypes.createArrayType(<i>elementType</i>)<br />
<b>Note:</b> The value of <i>containsNull</i> will be <i>true</i><br />
DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>).
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> java.util.Map </td>
<td>
DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br />
<b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br />
DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)<br />
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> org.apache.spark.sql.Row </td>
<td>
DataTypes.createStructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a List or an array of StructFields.
Also, two fields with the same name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in Java of the data type of this field
(For example, int for a StructField with the data type IntegerType) </td>
<td>
DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
</td>
</tr>
</table>
</div>
<div data-lang="python" markdown="1">
All data types of Spark SQL are located in the package of `pyspark.sql.types`.
You can access them by doing
{% highlight python %}
from pyspark.sql.types import *
{% endhighlight %}
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in Python</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td>
int or long <br />
<b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -128 to 127.
</td>
<td>
ByteType()
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td>
int or long <br />
<b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -32768 to 32767.
</td>
<td>
ShortType()
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> int or long </td>
<td>
IntegerType()
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td>
long <br />
<b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of
-9223372036854775808 to 9223372036854775807.
Otherwise, please convert data to decimal.Decimal and use DecimalType.
</td>
<td>
LongType()
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td>
float <br />
<b>Note:</b> Numbers will be converted to 4-byte single-precision floating
point numbers at runtime.
</td>
<td>
FloatType()
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> float </td>
<td>
DoubleType()
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> decimal.Decimal </td>
<td>
DecimalType()
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> string </td>
<td>
StringType()
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> bytearray </td>
<td>
BinaryType()
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> bool </td>
<td>
BooleanType()
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> datetime.datetime </td>
<td>
TimestampType()
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> datetime.date </td>
<td>
DateType()
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> list, tuple, or array </td>
<td>
ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
<b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> dict </td>
<td>
MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> list or tuple </td>
<td>
StructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a list of StructFields. Also, two fields with the same
name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in Python of the data type of this field
(For example, int for a StructField with the data type IntegerType) </td>
<td>
StructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
</td>
</tr>
</table>
</div>
<div data-lang="r" markdown="1">
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in R</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td>
integer <br />
<b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -128 to 127.
</td>
<td>
"byte"
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td>
integer <br />
<b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -32768 to 32767.
</td>
<td>
"short"
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> integer </td>
<td>
"integer"
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td>
integer <br />
<b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of
-9223372036854775808 to 9223372036854775807.
Otherwise, please convert data to decimal.Decimal and use DecimalType.
</td>
<td>
"long"
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td>
numeric <br />
<b>Note:</b> Numbers will be converted to 4-byte single-precision floating
point numbers at runtime.
</td>
<td>
"float"
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> numeric </td>
<td>
"double"
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> Not supported </td>
<td>
Not supported
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> character </td>
<td>
"string"
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> raw </td>
<td>
"binary"
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> logical </td>
<td>
"bool"
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> POSIXct </td>
<td>
"timestamp"
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> Date </td>
<td>
"date"
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> vector or list </td>
<td>
list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br />
<b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>.
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> environment </td>
<td>
list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>.
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> named list</td>
<td>
list(type="struct", fields=<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a list of StructFields. Also, two fields with the same
name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in R of the data type of this field
(For example, integer for a StructField with the data type IntegerType) </td>
<td>
list(name=<i>name</i>, type=<i>dataType</i>, nullable=<i>nullable</i>)
</td>
</tr>
</table>
</div>
</div>
## NaN Semantics
There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
does not exactly match standard floating point semantics.
Specifically:
- NaN = NaN returns true.
- In aggregations all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
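To make the contrast with standard floating point semantics concrete, the sketch below models three of the rules listed above in plain Python, whose `float` type follows IEEE 754. The helpers `sql_equal`, `sort_key`, and `group_counts` are hypothetical names for this illustration; they model the described behavior and are not Spark APIs.

```python
import math
from collections import defaultdict

nan = float("nan")

# Standard IEEE 754 semantics, as implemented by Python floats:
# NaN is not equal to itself.
print(nan == nan)  # False


def sql_equal(a, b):
    """Equality under the semantics listed above: NaN = NaN is true."""
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b


def sort_key(x):
    """Ascending sort key that places NaN after every other value, including +inf."""
    return (math.isnan(x), 0.0 if math.isnan(x) else x)


def group_counts(values):
    """Count occurrences per value, putting all NaN values into a single group."""
    counts = defaultdict(int)
    for v in values:
        # Distinct NaN objects never compare equal, so a sentinel key is used.
        counts["NaN" if math.isnan(v) else v] += 1
    return dict(counts)


values = [3.0, nan, float("inf"), -1.0, float("nan")]

print(sql_equal(nan, nan))            # True
print(sorted(values, key=sort_key))   # NaN values come last
print(group_counts(values))           # both NaN values share one group
```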