FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Ubuntu > Ubuntu User

 
 
LinkBack Thread Tools
 
Old 11-08-2009, 12:31 AM
Patton Echols
 
Default scripting / data problem.

I have a data file problem that I hope I can get an assist on. Not an
"Ubuntu" question really, but this is about the smartest group I know to
ask!

Note: I'm not asking anyone to do my homework. In fact, I'd be happy
with the RTFM answer, as long as you could tell me which FM and why you
say so. Unfortunately, it takes a bit to explain the problem, so this
is kind of long.

Here is the issue:

I have a number of different, partially overlapping in the form of (2)
MS Outlook contact databases, (1) MS Access database, (1) quickbooks and
(1) SQL db.

Though we don't have access the SQL available yet, we currently think
that CSV files are the common denominator that everything can import and
export.

The problem is that the various lists don't all have the same
information so I can't just cat them together and sort with a "unique"
operator. That's a vague statement. Here is what I mean by file and field:

file 1 - | email |
file 2 - | email | f-name | l-name |
file 3 - | f-name | l-name | company | phone |
file 4 - | company | address | city | state |

(By the way, each one has some duplicates with less than complete info.)

What we need is to be able to populate all of these files with all the
available data for each contact. The concept I have in mind is to
compare the files in pairs. So file 1 would go into file-L (for Last)
Then file 1 would be compared with each line of file-L and if there is a
record with matching email, then any extra info is added , but not
overwriting anything that is already in any field So the final file
would look like this:

| file-1 | file-2 | file-3 | file-4 | email | f-name | l-name | company
| address | city | state | phone |

The first four fields would have an "x" or something that could be
subsequently used to parse them back to the originals.

I have thought about scripting something that would, "for each line in
file 1, use the fields in the line for variables, feed them to "awk"
which tests file 2 and returns results to the script to be added to file-L"

I have also thought that awk could do the whole thing, but I'd hate to
plow through the entire user guide if someone told me it was hopeless.

I also wondered whether I could create a mysql database with this info,
and there are commands that can merge records (rather than just
eliminate what looks like a duplicate).

Or perhaps there is another strategy???

Obviously I have a lot of learning to do, but I am wondering about what?

Thanks very much for all of you who have read this far. I'd appreciate
any insight you may have into the best approach.

--PE


--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
Old 11-08-2009, 01:26 AM
Hal Burgiss
 
Default scripting / data problem.

On Sat, Nov 07, 2009 at 05:31:25PM -0800, Patton Echols wrote:

> file 1 - | email |
> file 2 - | email | f-name | l-name |
> file 3 - | f-name | l-name | company | phone |
> file 4 - | company | address | city | state |
>
> (By the way, each one has some duplicates with less than complete info.)
>
> What we need is to be able to populate all of these files with all the
> available data for each contact. The concept I have in mind is to
> compare the files in pairs. So file 1 would go into file-L (for Last)
> Then file 1 would be compared with each line of file-L and if there is a
> record with matching email, then any extra info is added , but not
> overwriting anything that is already in any field So the final file
> would look like this:
>
> | file-1 | file-2 | file-3 | file-4 | email | f-name | l-name | company
> | address | city | state | phone |
>

[...]

> I also wondered whether I could create a mysql database with this info,
> and there are commands that can merge records (rather than just
> eliminate what looks like a duplicate).

I'll vote for the mysql approach. Create a csv for each file, import that into
mysql so you have 4 tables. Then create a query that will dump out the data
the way you want. You can dump that into another table or export it. For this
to work, the fields will have to match in each table. So Comany in file3 is
equivalent to Comany in file4 (not something like Acme, Inc vs Acme,
Incorporated).

SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a,
file2 b, file3 c, file4 d WHERE a.email = b.email and
concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company
= d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested).

Data manipulation is much more flexible once its in a database. At least when
you have something this squirrelly.

--
Hal

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
Old 11-09-2009, 06:41 AM
Patton Echols
 
Default scripting / data problem.

Thanks for the reply.


On 11/07/2009 06:26 PM, Hal Burgiss wrote:
> On Sat, Nov 07, 2009 at 05:31:25PM -0800, Patton Echols wrote:
>
>
>> file 1 - | email |
>> file 2 - | email | f-name | l-name |
>> file 3 - | f-name | l-name | company | phone |
>> file 4 - | company | address | city | state |
>>
>> (By the way, each one has some duplicates with less than complete info.)
>>
>> What we need is to be able to populate all of these files with all the
>> available data for each contact. The concept I have in mind is to
>> compare the files in pairs. So file 1 would go into file-L (for Last)
>> Then file 1 would be compared with each line of file-L and if there is a
>> record with matching email, then any extra info is added , but not
>> overwriting anything that is already in any field So the final file
>> would look like this:
>>
>> | file-1 | file-2 | file-3 | file-4 | email | f-name | l-name | company
>> | address | city | state | phone |
>>
>>
>
> [...]
>
>
>> I also wondered whether I could create a mysql database with this info,
>> and there are commands that can merge records (rather than just
>> eliminate what looks like a duplicate).
>>
>
> I'll vote for the mysql approach. Create a csv for each file, import that into
> mysql so you have 4 tables. Then create a query that will dump out the data
> the way you want. You can dump that into another table or export it. For this
> to work, the fields will have to match in each table. So Comany in file3 is
> equivalent to Comany in file4 (not something like Acme, Inc vs Acme,
> Incorporated).
>

Do you mean that all the field data must match? Or just, for example,
the email fields? Because part of the problem is that I have non
matching / missing data depending on file.

> SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a,
> file2 b, file3 c, file4 d WHERE a.email = b.email and
> concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company
> = d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested).
>
> Data manipulation is much more flexible once its in a database. At least when
> you have something this squirrelly.
>
>

Does the concat commands mean that there would no longer be a separate
first name and last name?


Also, it looks like the "=" is a requirement. Is that what you meant by
the fields having to be the same? Or could I arrange for it to have
file 1 overwrite file 2 for certain fields. Or only blank fields?

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
Old 11-09-2009, 07:08 AM
Justin Gruenberg
 
Default scripting / data problem.

On Sat, Nov 7, 2009 at 7:31 PM, Patton Echols <p.echols@comcast.net> wrote:
> The problem is that the various lists don't all have the same
> information so I can't just cat them together and sort with a "unique"
> operator. *That's a vague statement. *Here is what I mean by file and field:
>

I assume you're going to need to get this data back out and into the
original applications, eventually.

The basic strategy I'd take is to import everything into seperate
tables in mysql. Create additional tables that you will export out
of. Massage the data from the import tables into your output tables,
merging and correcting as you can (this may be really easy or really
hard depending on how clean your data is). Chances are you're going
to need some cheap labor to clean the data up (got interns?) depending
on the amount of data.

I'd also suggest adding a unique ID to each record so that if you have
to do this merge again, this will be a bit easier to handle.

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
Old 11-09-2009, 10:31 AM
Hal Burgiss
 
Default scripting / data problem.

On Sun, Nov 08, 2009 at 11:41:34PM -0800, Patton Echols wrote:
> >
> >> file 1 - | email |
> >> file 2 - | email | f-name | l-name |
> >> file 3 - | f-name | l-name | company | phone |
> >> file 4 - | company | address | city | state |
> >>
> >> (By the way, each one has some duplicates with less than complete info.)
> >>
> >> What we need is to be able to populate all of these files with all the
> >> available data for each contact. The concept I have in mind is to
> >> compare the files in pairs. So file 1 would go into file-L (for Last)
> >> Then file 1 would be compared with each line of file-L and if there is a
> >> record with matching email, then any extra info is added , but not
> >> overwriting anything that is already in any field So the final file
> >> would look like this:

[...]

> > I'll vote for the mysql approach. Create a csv for each file, import that into
> > mysql so you have 4 tables. Then create a query that will dump out the data
> > the way you want. You can dump that into another table or export it. For this
> > to work, the fields will have to match in each table. So Comany in file3 is
> > equivalent to Comany in file4 (not something like Acme, Inc vs Acme,
> > Incorporated).
> >
>
> Do you mean that all the field data must match? Or just, for example,
> the email fields? Because part of the problem is that I have non
> matching / missing data depending on file.

In order for the relations to work across all 4 files, the fields that are
being keyed on, *should* match exactly. You can get around the missing data by
using a 'left join' syntax.

> > SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a,
> > file2 b, file3 c, file4 d WHERE a.email = b.email and
> > concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company
> > = d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested).
> >
> > Data manipulation is much more flexible once its in a database. At least when
> > you have something this squirrelly.

> Does the concat commands mean that there would no longer be a separate
> first name and last name?

No, that's just part of the query that contstructs a relationship across all
four tables. The output part is the 'b.firstname'. The concat thing is just
one way to make sure the columns you are trying to match, actually do match.
This will be problematic if one table has 'John Doe' and the other has
'Johnathon Q. Doe, III' type of stuff.

>
> Also, it looks like the "=" is a requirement. Is that what you meant by
> the fields having to be the same? Or could I arrange for it to have
> file 1 overwrite file 2 for certain fields. Or only blank fields?

An "equivalency" is a lot cleaner for the query part anyway (where you are
matching data in one table against another). But you can do anything you want
with the output part. So, instead of using b.firstname, you could use
c.firstname instead. They would both be in their respective tables (a, b, c,
and d).


--
Hal

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
Old 11-09-2009, 08:13 PM
Patton Echols
 
Default scripting / data problem.

On 11/09/2009 03:31 AM, Hal Burgiss wrote:
> On Sun, Nov 08, 2009 at 11:41:34PM -0800, Patton Echols wrote:
>
>>>
>>>
>>>> file 1 - | email |
>>>> file 2 - | email | f-name | l-name |
>>>> file 3 - | f-name | l-name | company | phone |
>>>> file 4 - | company | address | city | state |
>>>>
>>>> (By the way, each one has some duplicates with less than complete info.)
>>>>
>>>> What we need is to be able to populate all of these files with all the
>>>> available data for each contact. The concept I have in mind is to
>>>> compare the files in pairs. So file 1 would go into file-L (for Last)
>>>> Then file 1 would be compared with each line of file-L and if there is a
>>>> record with matching email, then any extra info is added , but not
>>>> overwriting anything that is already in any field So the final file
>>>> would look like this:
>>>>
>
> [...]
>
>
>>> I'll vote for the mysql approach. Create a csv for each file, import that into
>>> mysql so you have 4 tables. Then create a query that will dump out the data
>>> the way you want. You can dump that into another table or export it. For this
>>> to work, the fields will have to match in each table. So Comany in file3 is
>>> equivalent to Comany in file4 (not something like Acme, Inc vs Acme,
>>> Incorporated).
>>>
>>>
>> Do you mean that all the field data must match? Or just, for example,
>> the email fields? Because part of the problem is that I have non
>> matching / missing data depending on file.
>>
>
> In order for the relations to work across all 4 files, the fields that are
> being keyed on, *should* match exactly. You can get around the missing data by
> using a 'left join' syntax.
>

Ok, I'll read up on that syntax. Since the point of the exercise is to
merge records where there is overlap AND no conflicting data.

>
>
>>> SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a,
>>> file2 b, file3 c, file4 d WHERE a.email = b.email and
>>> concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company
>>> = d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested).
>>>
>>> Data manipulation is much more flexible once its in a database. At least when
>>> you have something this squirrelly.
>>>
>
>
>> Does the concat commands mean that there would no longer be a separate
>> first name and last name?
>>
>
> No, that's just part of the query that contstructs a relationship across all
> four tables. The output part is the 'b.firstname'. The concat thing is just
> one way to make sure the columns you are trying to match, actually do match.
> This will be problematic if one table has 'John Doe' and the other has
> 'Johnathon Q. Doe, III' type of stuff.
>

My current sense of the data is that the missing info is the basic
problem, not conflicting info. What may need to happen is that we merge
the missing info and leave "duplicate" records to be dealt with by hand
where "John" has his name spelled different ways.

>
>
>> Also, it looks like the "=" is a requirement. Is that what you meant by
>> the fields having to be the same? Or could I arrange for it to have
>> file 1 overwrite file 2 for certain fields. Or only blank fields?
>>
>
> An "equivalency" is a lot cleaner for the query part anyway (where you are
> matching data in one table against another). But you can do anything you want
> with the output part. So, instead of using b.firstname, you could use
> c.firstname instead. They would both be in their respective tables (a, b, c,
> and d).
>
>
>

Ok, so the output is the part between the SELECT and FROM statements?

So much to learn, so little time! (kidding, this is really helpful)

Thanks.

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
Old 11-09-2009, 08:22 PM
Patton Echols
 
Default scripting / data problem.

On 11/09/2009 12:08 AM, Justin Gruenberg wrote:
> On Sat, Nov 7, 2009 at 7:31 PM, Patton Echols <p.echols@comcast.net> wrote:
>
>> The problem is that the various lists don't all have the same
>> information so I can't just cat them together and sort with a "unique"
>> operator. That's a vague statement. Here is what I mean by file and field:
>>
>>
>
> I assume you're going to need to get this data back out and into the
> original applications, eventually.
>
> The basic strategy I'd take is to import everything into seperate
> tables in mysql. Create additional tables that you will export out
> of. Massage the data from the import tables into your output tables,
> merging and correcting as you can (this may be really easy or really
> hard depending on how clean your data is). Chances are you're going
> to need some cheap labor to clean the data up (got interns?) depending
> on the amount of data.
>

To "massage" would you use the same basic approach that Hal did? Or was
there some other way? Note: Where there are blank fields, then a full
one would always "win". Where there is conflicting data, I'd probably
want separate records to hand massage. but the owners of all this data
may tell me that one of the tables is of sufficiently poor quality that
it gets overwritten unless there is no better answer. Not sure about
that yet.

Interns? That'd be great, but I'm already the volunteer! (And while the
computer guru in comparison, obviously no great database expert.
</understatement>



> I'd also suggest adding a unique ID to each record so that if you have
> to do this merge again, this will be a bit easier to handle.
>
>

Yeah, I don't want to do this again. The goal is to clean it up so that
it can be used and (pray) kept clean until the organization can migrate
everything to a more "professional" solution.

Thanks for the reply.

--PE

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
Old 11-10-2009, 02:25 AM
Hal Burgiss
 
Default scripting / data problem.

On Mon, Nov 09, 2009 at 01:13:54PM -0800, Patton Echols wrote:
> Ok, so the output is the part between the SELECT and FROM statements?

Yes, you 'select' which columns from which tables. As long as you build your
relationships (the stuff being done with 'where) correctly the data should
line up correctly and you can output it however you like, eg to a tab
delimited file that can be read easily by a spreadsheet program.

--
Hal

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 

Thread Tools




All times are GMT. The time now is 10:42 AM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright 2007 - 2008, www.linux-archive.org