1. Background
2. Basic merges
A.Merge multiple sheets of one Excel file
january_2013:
february_2013:
The merging requires getting two fields, Customer Name & Sales Amount, from each worksheet. Below is the final merging result:
esProc SPL script:
A | |
---|---|
1 | =file("D:/sales_2013.xlsx").xlsopen() |
2 | =A1.conj(A1.xlsimport@t('Customer Name','Sale Amount';~.stname)) |
3 | >file("D:/result_2013.xlsx"). xlsexport@t(A2;"merge_sheets") |
Explanation:
A1: Open the Excel file and generate a sequence of 3 worksheets.
A2: conj() function traverses the member worksheets in A1’s sequence to import the specified fields ‘Customer Name’ and ‘Sale Amount’ and merges them. xlsimport() function imports the fields, where a semicolon follows the last field and parameter ~.stname represents the current worksheet. @t option specifies that the first row in each worksheet will be the field names. The function is nested within conj() function’s loop traversal to import data from every worksheet.
A3: Save A2’s table sequence as a new worksheet in the original Excel file with the name “merge_sheets”. @t option is used to make the first row the column headers.
Only three lines of code, the script is short, concise yet clear and easy to understand.
B.Merge multiple sheets from different Excel files
Here are several Excel files in which every worksheet containing a year’s sales data has the same structure as the worksheets in the previous instance:
esProc SPL script:
A | B | |
---|---|---|
1 | for directory@p("d:/excel/*.xlsx") | =file(A1).xlsopen() |
2 | =B1.conj(B1.xlsimport@t('Customer Name','Sale Amount','Purchase Date';~.stname)) | |
3 | =@|B2 | |
4 | > file("d:/result.xlsx"). xlsexport@t(B3;"merge_data") |
The desired merging effect:
Explanation:
A1: The for loop traverses every Excel file under the given directory to perform a series of operations over every worksheet in B1~B3.
B1: Open an Excel file and generate a sequence.
B2: Import the specified fields, ‘Customer Name’, ‘Sale Amount’ and ‘Purchase Date’, from every worksheet of the current file and merge them. This is similar to the operations in A2 in the previous instance.
B3: Combine B2’s table sequence with the current value of B3.
A4: Save B3’s table sequence as a worksheet, named merge_data, in file result.xlsx.
The program achieves multiple Excel file merging with only two rounds loop. The outer loop traverses each of the Excel files under the given directory while the inner loop B1.conj() function merges data from every worksheet in an Excel file.
C.File merging with big result set
Both A2 in the first instance and B3 in the second instance receive all merged data in the memory and then write it out at once. But a big merged result set can use up too much memory space or the available memory space won’t be enough to hold the data. In that case, stream-style appending can be used to generate a big file.
esProc SPL script:
A | B | |
---|---|---|
1 | =file("D:/out.xlsx") | |
2 | for directory@p("d:/excel/*.xlsx") | =file(A2).xlsopen() |
3 | =if(A1.exists(),B2.xlsimport@t(),B2.xlsimport()) | |
4 | >A1.xlsexport@s(B3;"merger") |
The desired merging effect:
Explanation:
A1: Open a specified file into which data will be output.
A2: Traverse all to-be-merged Excel files under the given directory.
B2: Open a to-be-merged Excel file.
B3: If the file into which data will be output exists, import all rows, including the column headers, of a worksheet; if the file exists, use @t option to make the first row the column headers and import data from the second row.
B4: Import data in a stream style in B3 and append it to the worksheet named merger in A1’s specified file.
The stream-style importing and appending is suited to combine many small excel files into a big one.
3. Grouping & aggregation
A. Group data by one or more fields
A | |
---|---|
1 | =file("D:/sales_2013.xlsx").xlsopen() |
2 | =A1.conj(A1.xlsimport@t(;~.stname)) |
3 | =A2.groups('Customer ID';sum('Sale Amount'):Total,avg('Sale Amount'):Average) |
4 | =A2.groups('Customer ID','Purchase Date';sum('Sale Amount'):Total) |
A3’s grouping result:
A4’s grouping result:
A1: Open the specified Excel file.
A2: Import all data from the worksheets in the file and perform merging.
A3: Group the merged data by ‘Customer ID’ and calculate total and average for each group.
A4: Group the merged data by ‘Customer ID’ and ‘Purchase Date’, and calculate total for each group.
B. Group data in a certain order
esProc SPL script:
A | B | |
---|---|---|
1 | for directory@p("d:/excel/*.xlsx") | =file(A1).xlsopen() |
2 | =B1.conj(B1.xlsimport@t(;~.stname)) | |
3 | =@|B2 | |
4 | =B3.derive(year('Purchase Date'):Year,month('Purchase Date'):Month) | |
5 | =A4.groups (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average) | |
6 | =A4.groups@o (month('Purchase Date'):Month;sum('Sale Amount'):Total,avg('Sale Amount'):Average) |
A5’s grouping result:
A1~B3: Merge all worksheets in the Excel files of same structure under the specified directory.
A4: Generate a new table sequence base on B3’s table sequence by splitting the Purchase Date field into a new Year field and a new Month field.
A5: groups() function groups data by month and calculates total and average for each group.
A6: groups@o groups data in each year by month, calculates total and average for each group, and, with @o present, merges the summarized results.
A4 contains detailed sales records; A5 summarizes data only by month, ignoring the year; A6 summarizes data by both year and month. The three cells shows summarized data of different levels.
C. Group data by segment
esProc SPL script:
A | B | |
---|---|---|
1 | for directory@p("d:/excel/*.xlsx") | =file(A1).xlsopen() |
2 | =B1.conj(B1.xlsimport@t(;~.stname)) | |
3 | =@|B2 | |
4 | =B3.groups(if ('Sale Amount'<1000,"1::<1000", if ('Sale Amount'<1500,"2::1000~~1500", if ('Sale Amount'<2000,"3::1500~~2000", if ('Sale Amount'<2500,"4::2000~~2500", "5::>=2500")))):Segment; count(1):Number,sum('Sale Amount'):Total) |
Grouping result:
About the code block A1~B3, refer to the explanations of previous instances.
A4: It defines 5 intervals for the value of ‘Sale Amount’ field and calculates number and total for within each range.
But the above script is still not convenient-to-use enough. When we want to change the way of dividing the data, we need to modify the parameter expressions in groups() function. This is a little complicated. Another function, pseg(), helps to achieve an easier modification. Here’s the SPL code:
A | B | |
---|---|---|
1 | [0,1000,1500,2000,2500] | |
2 | for directory@p("d:/excel/*.xlsx") | =file(A1).xlsopen() |
3 | =B1.conj(B1.xlsimport@t(;~.stname)) | |
4 | =@|B2 | |
5 | =B4.groups(A1.pseg(~.'Sale Amount'):Segment; count(1):Number,sum('Sale Amount'):Total) |
We can also group data by multiple fields according to a set of conditions and then summarize each group. For example, we divide the scores of subjects into 5 grades – Excellent, Good, Satisfactory, Poor, Passing, and summarize each group. For more uses of groups() function, refer to esProc Function Reference.
D. Group big data
esProc SPL script:
A | B | |
---|---|---|
1 | =file("d:/tdata.xlsx").xlsopen@r() | |
2 | for A1.count() | =A1.xlsimport@ct(;A1(A2). stname) |
3 | =@|B2 | |
4 | = B3.conjx() | =A4.groups('Customer ID';sum('Sale Amount'):SaleTotal) |
5 | >file(“d:/out.xlsx”).exportxls@bt(B4;"Customer&Sales") |
The result of filtering and then grouping data:
Explanation:
A1: Use @r option to open an Excel file in a stream style.
A2: Traverse every worksheet in the Excel file.
B2: Use @c option to import data with cursor.
B3: Union B2’s cursor into B3.
A4: Union members of B3’s cursor sequence into a new cursor.
B4: Group A4’s sequence by Customer ID and calculate Sale Amount for each group.
A5: Export and save B4’s result set into an Excel file.
Each time, the stream retrieval reads a part of data with cursor and groups and summarizes it.
4.Deduplication
A. Via primary key
A | |
---|---|
1 | =file("d:/sales_2013.xlsx").xlsopen() |
2 | =A1.conj(A1.xlsimport@t('Customer Name', 'Invoice Number', 'Sale Amount';~. stname)) |
3 | =A2.group@1('Invoice Number') |
4 | >file(“d:/out.xlsx”). xlsexport@t(A3;"result") |
Concatenated deduplicated data:
A1: Open a certain Excel file.
A2: Import specified columns from every worksheet.
A3: Group A2’s table sequence by primary key ‘Invoice Number’ and deduplicate each group.
A4: Export and save A3’s result set into an Excel file.
Data is distinct in each worksheet, but probably not once the worksheets are concatenated. Primary key is one of the ways of deduplicating data.
B. Via a certain field
A | |
---|---|
1 | =file("d:/sales_2013.xlsx").xlsopen() |
2 | =A1.conj(A1.xlsimport@t('Customer ID', 'Customer Name';~. stname)) |
3 | =A2.id('Customer Name') |
4 | =A2.group@1('Customer Name') |
5 | >file("d:/out.xlsx"). xlsexport@t(A4;"result") |
Explanation:
A1: Open a certain Excel file.
A2: Import specified columns from every worksheet.
A3: Get a sequence of distinct names from A2’s table sequence.
A4: Get a table sequence of distinct names from A2’s table sequence.
A5: Export and save data of A4’s table sequence into an Excel file, with the first row as the column headers.
A3’s deduplication result:
C. Via multiple fields
A | |
---|---|
1 | =file("d:/sales_2013.xlsx").importxls@t() |
2 | =file("d:/sales_2014.xlsx").importxls@t() |
3 | =[A1,A2].merge('Customer ID', 'Purchase Date') |
4 | =A3.group@1('Customer ID', 'Purchase Date') |
5 | >file("d:/out.xlsx"). xlsexport@t(A4;"result") |
Explanation:
A1: Import a specified Excel file.
A2: Import another specified Excel file.
A3: Merge A1’s table sequence and A2’s table sequence by fields ‘Customer ID’ and ‘Purchase Date’ and return a new table sequence.
A4: Group and deduplicate A3’s table sequence by ‘Customer ID’ and ‘Purchase Date’.
A5: Export and save 43’s result set into an Excel file.
We can perform the merge and grouping and deduplication by more fields as needed.
D. Deduplicate records
A | B | |
---|---|---|
1 | =file("d:/sales_2013.xlsx").importxls@t() | =A1.group@1('Invoice Number') |
2 | =file("d:/sales_2014.xlsx").importxls@t() | =A2.group@1('Invoice Number') |
3 | =[B1,B2].merge@u() | =A3.count() |
Explanation:
A1: Import a specified Excel file.
B1: Deduplicate A1’s table sequence by ‘Invoice Number’.
A2&B2: Same as A1&B1.
A3: Merge B1’s table sequence and B2’s table sequence, deduplicate data, and return a table sequence. @u option is used to merge multiple table sequences in order, remove duplicates, and generate a new table sequence.
B3: Count the records in the merged table sequence. merge@u() function merges multiple ordered, deduplicated table sequences.
In the above, we explain how to handle merging, grouping and summarization, and deduplication over same-structure Excel files in esProc SPL. For Excel files of different structures, we just need to read to-be-merged fields as a sequence object and then process it in same ways. It is similar to handle other types of text files.
No comments:
Post a Comment