-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME.rmd
253 lines (203 loc) · 10.7 KB
/
README.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
---
title: "The xml2relational Package"
author: "Joachim Zuckarelli"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "C:\\Users\\Joachim\\Desktop\\Dokumente\\500 Projekte\\510 R-Packages\\516 xml2relational\\")
```
## What `xml2relational` does
`xml2relational` is designed to convert XML documents with nested object hierarchies into a set of R dataframes. These dataframes represent the different tables in a relational data model and are connected amongst each other by foreign keys. Essentially, `xml2relational` flattens an object-oriented data structure into a relational data structure.
Once the relational structure is created (and that is basically a list of dataframes representing the different tables) you can export both the data model (as SQL `CREATE` statements) and the data (either as SQL `INSERT` statements or as CSV files) to get the data easily into a relational database.
## Getting started
## Installing and loading `xml2relational`
You can install the `xml2relational` package from CRAN by executing the following code in your script or in the R console:
```{r eval=FALSE}
install.packages("xml2relational", dependencies = TRUE)
```
After having installed the package you need to load it (attach it to the search path) by calling `library()`:
```{r eval=FALSE}
library(xml2relational)
```
### Our example data set
To demonstrate how `xml2relational` works, we will use a small sample dataset that is shipped together with the `xml2relational` package: the `customer` dataset.
Here is how it looks like:
```
<xml>
<customer>
<customerno>C0023751</customerno>
<givenname>Sarah</givenname>
<surname>Durbin</surname>
<email>[email protected]</email>
<address>
<street>139 W Jackson Blvd</street>
<postalcode>60604</postalcode>
<city>
<name>Chicago</name>
<state>Illinois</state>
</city>
<country>
<name>United States of America</name>
<isocode>US</isocode>
</country>
</address>
<username>queenofqueens</username>
</customer>
<customer>
<customerno>C0017439</customerno>
<givenname>Mark</givenname>
<surname>Durbin</surname>
<email>[email protected]</email>
<address>
<street>139 W Jackson Blvd</street>
<postalcode>60604</postalcode>
<city>
<name>Chicago</name>
<state>Illinois</state>
</city>
<country>
<name>United States of America</name>
<isocode>US</isocode>
</country>
</address>
<username>durby82</username>
</customer>
<customer>
<customerno>C0248538</customerno>
<givenname>Max</givenname>
<surname>Brunner</surname>
<email>[email protected]</email>
<address>
<street>Rotkreuzplatz 5</street>
<postalcode>80634</postalcode>
<city>
<name>Munich</name>
<state>Bavaria</state>
</city>
<country>
<name>Germany</name>
<isocode>DE</isocode>
</country>
</address>
<username>brunnermax_69</username>
</customer>
<customer>
<customerno>C0271182</customerno>
<givenname>Urs</givenname>
<surname>Richli</surname>
<email>[email protected]</email>
<address>
<street>Seestrasse 43</street>
<postalcode>6052</postalcode>
<city>
<name>Hergiswil</name>
<state>Luzern</state>
</city>
<country>
<name>Switzerland</name>
<isocode>CH</isocode>
</country>
</address>
<username>ursrichli</username>
</customer>
<customer>
<customerno>C0019935</customerno>
<givenname>Clara-Sophie</givenname>
<surname>Dr. Hellmann</surname>
<email>[email protected]</email>
<address>
<street>Brienner Strasse 11</street>
<postalcode>80333</postalcode>
<city>
<name>Munich</name>
<state>Bavaria</state>
</city>
<country>
<name>Germany</name>
<isocode>DE</isocode>
</country>
</address>
<username>helli</username>
</customer>
<customer>
<customerno>C0019935</customerno>
<givenname>Thomas</givenname>
<surname>Chang</surname>
<email>[email protected]</email>
<address>
<street>539 Lombard St</street>
<postalcode>94133</postalcode>
<city>
<name>San Francisco</name>
<state>California</state>
</city>
<country>
<name>United States of America</name>
<isocode>US</isocode>
</country>
</address>
<username>tchango123</username>
</customer>
</xml>
```
In this dataset we have a nested object structure. Specifically, each customer has an address consisting of several elements. Among those elements is the city which is again an object of its own, with a city name and state. The same applies to the country which is included with its name and its ISO country code.
When you look at the (completely made-up) customers here, you will notice that the customers Sarah Durbin and Mark Durbin (the first two customers) share the same address. Also, Max Brunner and Clara-Sophie Hellmann both live in Munich, Germany (although at different addresses). Thomas Chang of San Francisco lives in the USA, as do the Durbins.
When we now process the data and derive the relational data model, `xml2relational` will take care of these 'duplicates'.
### Processing the data
```{r eval=TRUE, include=FALSE}
source("xml2relational_v9.r")
```
Deriving the relational data model from this XML data is fairly simple:
```{r eval=TRUE}
customer.data <- toRelational("customers.xml")
```
The `toRelational()` function flattens the hierarchical structure of the XML data and distributes the data to a set of dataframes representing the tables of our relational data model. It returns these dataframes as a list (`customer.data`). We can now inspect this list to see the tables that have been generated:
```{r eval=TRUE}
class(customer.data)
names(customer.data)
class(customer.data$customer)
```
Let us have a closer look at the `customer` dataframe:
```{r eval=TRUE}
customer.data$customer
```
As you can see, each customer record has been assigned a primary key, `ID_customer`. The argument `prefix.primary` of the `toRelational()` function lets you change the prefix that is used to identify primary key fields. Its default value is `"ID_"`.
Similiarly, using the `prefix.foreign` argument you can change the prefix used for the names of foreign key fields from its default value `"FKID_"` to whatever you like. The name of the key fields always consists of the prefix and the name of the table.
In the `customer` table we have a foreign key that relates to the address. You may have noticed that, as expected, the data records of Sarah and Mark Durbin point to the same `address` record as they live in the same place.
Let us now look into the address table:
```{r eval=TRUE}
customer.data$address
```
Again, the address points to other tables, namely the `city` and the `country` table. As we would have expected, the two Munich addresses point to the same city and the same country, and the two US addresses point to the same record in the `country` table.
You see how easy it is to flatten a hierarchical, objected-oriented XML data structure to a relational data model using the `toRelational()` function.
### Saving the results
In the next step, we want to export our results. That can mean two things:
* exporting the data model (i.e. the structure of the tables)
* exporting the data, the content of the tables.
For the first task, `xml2relational` provides the `getCreateSQL()` function. This function returns ready-to-excecute SQL `CREATE` statements. It supports three built-in SQL flavors, `MySQL`, `TransactSQL` and `Oracle`. You add additional SQL flavors, if you like. In this case, you would use `sql.style` argument to provide a special dataframe containing the required definitions for the new SQL dialect. Please consult the online help texts for more information on how this is done.
In order to generate proper SQL `CREATE` statements, `getCreateSQL()` guesses the data types of the table fields from the data. If you do not like the results, you can provide your own function to derive the data types as `datatype.func` argument. This function would need to accept exactly one argument, a vector with the field vales of the field for which a datatype needs to be guessed. It then must return the datatype as a one-element character vector.
If you are not going to change the behavior of `getCreateSQL()` using these options, generating the SQL `CREATE` statements is very straightforward:
```{r eval=TRUE}
create.sql <- getCreateSQL(customer.data, "MySQL")
cat(create.sql, sep="\n\n")
```
`xml2relational` tries to guess the datatype from the actual data. When you are working with the `MySQL`, `Transact SQL` (`T-SQL`) and `Oracle` dialects/flavors of SQL, this should be alright. Nevertheless, using the `datatype.func` argument of `getcreateSQL()` you can also provide your own function to determine the data type. This function would need to take exactly one argument, a data vector from a data table, and return the appropriate SQL data type as a one-element character vector.
Alternatively, you can also use the built-in mechanism for determining the data type and just supply additional information on the SQL flavor that you use. Please consult the online help with `?getCreateSQL` to learn more on providing the necessary information.
By setting the logical `one.statement` argument to `TRUE` you can let `getcreateSQL()` return the `CREATE` statements in one character value instead of a vector with one element per `CREATE` statement. In this case you can use the `line.break` argument to define how the different `CREATE` statement are to be separated (apart from a semicolon that is added by default).
To export the data as such you have two options:
* you export ready-to-execute SQL `INSERT` statements using `getInsertSQL()` function
* you save the data to CSV files using `savetofiles()`.
Producing SQL `INSERT` statements for the data in one of the tables is very easy with `getInsertSQL()`:
```{r eval=TRUE}
insert.sql <- getInsertSQL(customer.data, table.name = "city")
cat(insert.sql, sep="\n")
```
You can also export all the tables of your relational model with `savetofiles()`:
```{r eval=TRUE}
savetofiles(customer.data)
```
This will save as many CSV files to your current working directory as you have tables in your model (`customer.data`). Each file is named for the name of the dataframe connected to the respective table, so `city.csv` will store the data from the `city` table.
More optional arguments for most of the functions discussed here are available. Please check the online help for more details.
## Contact the author
I appreciate your questions, issues and feature requests. Contact me on [email protected], visit the GitHub repository on https://github.com/jsugarelli/xml2relational for the package source and [follow me on Twitter](https://twitter.com/jsugarelli) to stay up-to-date!