Sunday, July 16, 2017

Find and remove duplicate data

Find and remove duplicate data

Duplicate data can cause confusion that might cost you money, trust or goodwill. For example, a duplicate record might result in a perishable delivery to the wrong address. Access can help you avoid duplicate data by providing ways to reuse data, and help prevent duplicate data input by requiring that new values be unique.

Note:  The methods in this article do not apply to web databases. However, you can use them in a desktop database, and then use that database as a source for creating a new web database. For more information about web databases, see the article Build an Access database to share on the Web.

In this article

Overview

Before you begin

Prepare sample data to follow along with some examples

Delete or edit duplicate data in one table

Resolve duplicate data in multiple tables

Help prevent users from entering duplicate data in a table

Overview

One of the primary reasons for using a relational database such as Access is to avoid duplicate data, which can cause confusion and problems. Sometimes if you create a database by gathering data from existing systems, say a set of Excel workbooks, you may end up with duplicate data that needs to be resolved before you start using the database. Moreover, if certain design precautions are not taken (or are impractical for some reason), a database can sometimes acquire duplicate values, especially when multiple users enter data simultaneously.

What duplicate data is

The term duplicate data doesn't mean that there are two exact copies of the same data. Rather, duplicate data is data that is slightly different but which refers to the same thing as some other data. It may not be incorrect – for example a full name with and without a middle initial.

Duplicate data can occur at various levels:

  • Two or more records might contain duplicate fields.    Two records might be considered duplicates, even though not all fields of the records contain matching values. For example, in this figure, you see two records for Antonio Moreno Taquería.

    Duplicate records for a customer in the Customers table

    Even though each record has a unique customer ID (the value in the left-most column), the values in the Name, Address, and City fields match. Perhaps a record was created twice for one customer, or maybe there is a legitimate reason that two customers are with the same company. You will have to use your knowledge of your business to review such records to see if they are duplicates.

  • Two or more tables might contain similar data.    For example, you might find that a Customers table and a Clients table both contain records for the same customers.

    Clients and Customers tables with overlapping data

    Even though the tables might have somewhat different structures, both tables contain the same kind of information — customer data — so you should consider creating a new table by combining them.

  • Two or more databases might contain similar data.    If you find that two or more databases contain similar data, or you inherit a database that overlaps with your current database, you should compare the data and structure of the databases and then take necessary steps to consolidate them.

    Manually comparing databases, even small ones, is not an easy task. If you need help with a comparison, several third-party vendors sell tools for comparing the contents and structure of Access databases.

    Important:  If you take steps to eliminate duplicate data from your tables and you still see duplicate data in a form or report, it might be due to incorrect design of a form or report. Make sure that the underlying tables and queries are joined properly, and that the form or report sections do not include more than one control that is bound to the same control source. For more information about report design, see the article Introduction to reports.

What duplicate data is not

  • Sometimes two or more records will share a set of values but they will not refer to the same thing. For example, many customers might have the full name John Smith, but the values are not duplicates because they refer to different people. When that happens, you are not in danger of duplicating data as long as other fields in your database (such as ID fields, user names, addresses, or postal codes) contain enough unique information to keep the records from being considered duplicates.

  • Data returned by a query might appear to contain duplicate records if the query does not include fields that uniquely identify each record. For example, this figure shows data returned by a query:

    A view that that does not include a field that could uniquely identify the records

    Several rows in the query results appear to be duplicates. However, the same query with the primary key field included shows that each record is actually unique:

    View now includes the OrderID field to eliminate duplicate records

    To avoid this confusion, when you create a query include fields that identify each record uniquely. A primary key field can serve that function, but you can also use combinations of other fields instead. For example, you have many instances of a common contact name and many instances of a city name, but if you include a phone number and an address, the combination of data might be enough to make each record unique.

Resolving and preventing duplicate data

To resolve duplicate data, you either edit it or you delete it.

  • Edit a duplicate record if it actually represents a distinct entity – for example, a customer for whom there is no other record – and has become a duplicate of some other record by some error, perhaps erroneous input.

  • Delete a duplicate record if it does not represent some other entity; that is, there is another record for the same entity that you prefer to use.

  • In some cases, you will delete one duplicate and edit the other, because neither record is completely accurate. If you want to delete from a table that is related to other tables, you might need to enable cascade deletes for some of the relationships.

To help prevent duplicate data, consider the following:

  • Make sure that each of your tables has a primary key

  • If there are fields that are not part of a primary key but must contain unique values, create unique indexes for those fields

  • Consider creating forms for data input, and using list boxes on those forms to make finding existing values easier

Top of Page

Before you begin

Before you start deleting duplicate data, take the following steps:

  • Make sure that the database file is not read-only.

  • If other people use your database and you want to minimize their service disruption, ask them to close the objects that you want to work with. Otherwise, data conflicts might occur.

  • If it is acceptable for the database to be unavailable while you work on it, open the database in exclusive mode. To do so, click the File tab, and then click Open. In the dialog box, browse to and select the database, click the arrow next to the Open button, and then click Open Exclusive.

    Opening a file in Exclusive mode

  • Back up your database before you delete records. You cannot reverse or undo a delete operation. The only way to recover deleted records is to restore them from a backup. A delete operation might also delete records in related tables, so it is best to back up the entire database before starting the operation.

Back up a database

  1. Click the File tab, click Save & Publish, and then under Save Database As click Back Up Database.
    The Save As dialog box appears, and Access appends the current date to the file name. For example, if you have a database named Assets, Access creates the following type of file name: Assets_2011-10-29.

  2. Accept the default name and location, or select another name or location, and then click Save.
    Access closes the original file, creates a backup, and then reopens the original file.

Note:  To restore from a backup, close and rename the original file so that the backup copy can use the name of the original version. Assign the name of the original version to the backup copy, and open it in Access.

Top of Page

Prepare sample data to follow along with some examples

The how-to steps in this article use the following sample tables. To follow along, copy the sample tables into a blank new database by using the steps at the end of this section.

The Customers table:

CompanyName

ContactName

Address

City

PostalCode

Phone

Baldwin Museum of Science

Josh Barnhill

1 Main St.

New York

12345

(505) 555-2122

Blue Yonder Airlines

Waleed Heloo

52 1st St.

Boston

23456

(104) 555-2123

Coho Winery

Pica Guido

3122 75th Ave. S.W.

Seattle

34567

(206) 555-2124

Contoso Pharmaceuticals

Jean Philippe Bagel

1 Contoso Blvd.

London

NS1 EW2

(171) 555-2125

Fourth Coffee

Julian Price

Calle Smith 2

Mexico City

56789

(7) 555-2126

Coho Winery

Christine Hughes

3122 75th St. S.

Seattle

34567

(206) 555-2125

Humongous Insurance

Steve Riley

67 Big St.

Tampa

01234

(916) 555-2128

Trey Research

Dana Birkby

2 Nosey Pkwy

Portland

43210

(503) 555-2129

Fourth Coffee

Reshma Patel

Calle Smith 2

Mexico City

56789

(7) 555-2233

The Employees table:

Last Name

FirstName

Address

City

BirthDate

HireDate

Barnhill

Josh

1 Main St.

New York

05-Feb-1968

10-Jun-1994

Heloo

Waleed

52 1st St.

Boston

22-May-1957

22-Nov-1996

Guido

Pica

3122 75th Ave. S.W.

Seattle

11-Nov-1960

11-Mar-2000

Bagel

Jean Philippe

1 Contoso Blvd.

London

22-Mar-1964

22-Jun-1998

Price

Julian

Calle Smith 2

Mexico City

05-Jun-1972

05-Jan-2002

Hughes

Christine

3122 75th St. S.

Seattle

23-Jan-1970

23-Apr-1999

Riley

Steve

67 Big St.

Tampa

14-Apr-1964

14-Oct-2004

Birkby

Dana

2 Nosey Pkwy

Portland

29-Oct-1959

29-Mar-1997

Bagel

Jean Philippe

1 Contoso Blvd.

London

22-Mar-1964

20-Jun-1998

The Amount Owed table:

CompanyName

ContactName

Address

City

PostalCode

Phone

AmountDue

Baldwin Museum of Science

Josh Barnhill

1 Main St.

New York

12345

(505) 555-2122

$556.78

Blue Yonder Airlines

Waleed Heloo

52 1st St.

Boston

23456

(104) 555-2123

$1,893.24

Coho Winery

Pica Guido

3122 75th Ave. S.W.

Seattle

34567

(206) 555-2124

$321.79

Contoso Pharmaceuticals

Jean Philippe Bagel

1 Contoso Blvd.

London

NS1 EW2

(171) 555-2125

£457.68

Fourth Coffee

Julian Price

Calle Smith 2

Mexico City

56789

(7) 555-2126

$98.75

Coho Winery

Christine Hughes

3122 75th St. S.

Seattle

34567

(206) 555-2125

$321.79

Humongous Insurance

Steve Riley

67 Big St.

Tampa

01234

(916) 555-2128

$297.45

Trey Research

Dana Birkby

2 Nosey Pkwy

Portland

43210

(503) 555-2129

$509.09

Fourth Coffee

Reshma Patel

2 Calle Smith

Mexico City

56789

(7) 555-2233

$98.75

The Clients table:

Name

Address

City

Phone

Fax

Baldwin Museum of Science

1 Main St.

New York

(505) 555-2122

(505) 555-2122

Blue Yonder Airlines

52 1st St.

Boston

(104) 555-2123

(104) 555-2123

Coho Winery

3122 75th Ave. S.W.

Seattle

(206) 555-2124

(206) 555-2124

Contoso Pharmaceuticals

1 Contoso Blvd.

London

(171) 555-2125

(171) 555-2125

Fourth Coffee

Calle Smith 2

Mexico City

(7) 555-2126

(7) 555-2126

Consolidated Messenger

3122 75th St. S.

Seattle

(206) 555-2125

(206) 555-2129

Graphic Design Institute

67 Big St.

Tampa

(916) 555-2128

(916) 555-2128

Litware, Inc.

3 Microsoft Way

Portland

(503) 555-2129

(503) 555-2110

Tailspin Toys

4 Microsoft Way

Portland

(503) 555-2233

(503) 555-2239

Paste the sample data into Excel worksheets

  1. Start Excel. An empty workbook opens.

  2. Press SHIFT+F11 to insert a worksheet (you will need four).

  3. Copy the data from each sample table into an empty worksheet. Include the column headings (the first row).

Create database tables from the worksheets

  1. Select the data from the first worksheet, including the column headings.

  2. Right-click the Navigation Pane, and then click Paste.

  3. Click Yes to confirm that the first row contains column headings.

  4. Repeat steps 1-3 for each of the remaining worksheets.

Top of Page

Delete or edit duplicate data in one table

Find records where some fields wholly or partially match

Use a query to delete duplicate records from one table and their related records in other tables

Create and run a delete query

Find records where some fields wholly or partially match

  • Create a query by using the Find Duplicates Wizard. By default, the query returns matching records only when the values in each field match character for character. If you need to find partial matches, you can use an expression in your query, or you can alter the Structured Query Language (SQL) code.

  • Optionally, edit the field values or delete records when viewing the query results in Datasheet view.

  • Optionally, use an expression in the query to include values that partially match.

The following steps use the sample Customers table provided in the section Prepare sample data to follow along with some examples.

Create a find duplicates query

  1. On the Create tab, in the Queries group, click Query Wizard.

  2. In the New Query dialog box, click Find Duplicates Query Wizard, and then click OK.

  3. In the list of tables, select the Customers table, and then click Next.

  4. In the list of available fields, select the CompanyName, Address, and City fields – the fields that you want to use to match. Click Next.

  5. In the next list of available fields, select the field or fields that contain the data that you want to inspect or update, or those that contain data that can help you distinguish duplicate from non-duplicate records. If you use the Customers table, you add the Contact Name and Phone fields, because the data in those fields can help you find the duplicate values and possibly identify why the values were entered. Click Next.

  6. Accept the suggested name (Find duplicates for Customers) or enter your own name, and then click Finish to run the query:

    CompanyName

    Address

    City

    ContactName

    Phone

    Fourth Coffee

    Calle Smith 2

    Mexico City

    Julian Price

    (7) 555-2126

    Fourth Coffee

    Calle Smith 2

    Mexico City

    Reshma Patel

    (7) 555-2233

  7. When you created the Customers table, you might have noticed more than two duplicate records (the table contains four). You don't see the other duplicates because the values in the Address field don't match character-for-character. You can modify the query to return values that partially match — the next set of steps explains how.

Customize the duplicates query so that it includes partial matches

  1. Switch the query to SQL view. To do so, you can:

    • Right-click the document tab for the query, and then click SQL View.

    • In the Navigation Pane, right-click the query, and then click SQL View.

  2. Modify the SQL code to find and compare partial values.

    If you used the Customers table in the previous steps, you see the following SQL statement:

    SELECT Customers.[Company Name], Customers.[Address], Customers.[City], Customers.[Contact Name], Customers.[Phone]
    FROM Customers
    WHERE (((Customers.[CompanyName]) In (SELECT [CompanyName] FROM [Customers] As Tmp GROUP BY [CompanyName],[Address],[City] HAVING Count(*)>1 And [Address] = [Customers].[Address] And [City] = [Customers].[City])))
    ORDER BY Customers.[CompanyName], Customers.[Address], Customers.[City];

    Note that the WHERE clause uses a SELECT statement to identify sets of duplicates. The SELECT statement produces a list of company names that occur in more than one record with the same address and city, and the WHERE clause uses that list of names to limit the results of the duplicates query.

    Suppose that you decide to modify the Address field to return partial matches – say the first seven characters of the address must match – so that you will find potential duplicates arising from simple typographical differences. You modify the WHERE clause as follows:
    WHERE (((Customers.[CompanyName]) In (SELECT [CompanyName] FROM [Customers] As Tmp GROUP BY [CompanyName],Left([Address],7),[City] HAVING Count(*)>1 And Left([Address],7) = Left([Customers].[Address],7) And [City] = [Customers].[City])))If you want to make the criteria more or less restrictive, you change the number used by the Left function. For example, you might use Left([Address],5) to match using only the first five characters.

  3. After you finish modifying the statement, click Run Button image to run the query and display the results in Datasheet view.

    The query now returns all four duplicate records:

Company Name

Address

City

ContactName

Phone

Coho Winery

3122 75th Ave. S.W.

Seattle

Pica Guido

(206) 555-2124

Coho Winery

3122 75th St. S.W

Seattle

Christine Hughes

(206) 555-2125

Fourth Coffee

Calle Smith 2

Mexico City

Julian Price

(7) 555-2126

Fourth Coffee

Calle Smith 2

Mexico City

Reshma Patel

(7) 555-2233

Edit or delete the records

Do one of the following:

  • To edit the value in a field, select the field and enter a new value. In the example data, if Reshma Patel works at the Fourth Coffee in a different Mexico City location, you'd change the value of the Address field for that record.

  • To delete an entire record (a row), click the row selector (the empty box next to the row) and press DELETE. In the example data, if there is only one Coho Winery location and the contact is Pica Guido, you'd delete the other Coho Winery record.

Use a query to delete duplicate records from one table and their related records in other tables

If you have too many duplicates to resolve them all by hand, consider using a delete query. Before you use a delete query, check the table relationships. If the table that has duplicate data is related to another table, you should determine whether the duplicate data is in the table on the "one" side of a one-to-many relationship. You can use the Relationships tool to check:

  1. On the Database Tools tab, in the Relationships group, click Relationships.

  2. Find the tables in the Relationships window and look at the symbols next to each table. If one table has a "1" and the other has an infinity symbol, the relationship is one-to-many.

  3. If the data is on the "one" side (there is a "1" next to the table), you should enable cascade delete for the relationship. Cascade delete deletes records that are related to the record that you are deleting. For example, a customer has many orders, and you don't want to keep any of the order records when you delete the customer records to which they belong.

  4. If the data resides on the "many" side (there is an infinity symbol next to the table), and it does not matter whether cascade delete is enabled.

Create and run a delete query

Important: You cannot undo the results of a delete query. You should back up your database before you run your delete query.

  1. On the Create tab, in the Queries group, click Query Design.

  2. In the Show Table dialog box, double-click the table from which you want to delete records.
    The table appears in the query designer. This figure shows a typical table in the query designer.

    A table in the query designer

    1. Table in the query designer

    2. Query design grid

  3. Double-click the asterisk (*) to add all of the fields in the table to the design grid. Adding all the table fields enables the delete query to delete entire records (rows) from the table.

  4. Add one or more fields that you will use to identify the records to delete. For example, suppose that a customer goes out of business and you need to delete all of the pending orders for that customer. To find just those records, you might the Customer ID and Order Date fields to the design grid.

  5. Under the fields that you just added, enter criteria in the Criteria row of the design grid. For example, you might enter the Customer ID of the customer that went out of business and the date after which that customer's orders are invalid.
    For more information about using criteria, see the article Examples of query criteria.

  6. Clear the Show check box for each criteria field.

  7. On the Design tab, in the Results group, click Run. Verify that the query returns the records that you want to delete.

  8. On the Design tab, in the Results group, click Design View.

  9. On the Design tab, in the Query Type group, click Delete.
    Access changes the select query to a delete query, hides the Show row in the lower section of the design grid, and adds the Delete row.

  10. Make sure that the Delete row in the * (all fields) column displays From. The word Where should appear in any criteria columns.

  11. On the Design tab, in the Results group, click Run.

  12. Click Yes to confirm that you want delete the data.

Hide the confirmation message

Follow these steps if you do not want to see a confirmation message each time you run a delete query or other action query.

  1. On the File tab, click Options.

  2. In the Access Options dialog box, click Client Settings, and in the Editing section, under Confirm, clear the Action queries check box.

Top of Page

Resolve duplicate data in multiple tables

Data that is duplicated across tables is usually a case of overlapping records: a record exists for the same entity in more than one table. This can be complicated if the tables don't have exactly the same fields. For example, one table might use full names but the other has separate fields for first and last names. This kind of situation sometimes occurs:

  • When you create a new database from a variety of existing data sources

  • When you integrate a new data source into an existing database

  • When you merge two similar databases

To resolve this, combine all the data from the overlapping tables into a new table, decide which records to delete, and then delete the records that you don't want from that new table. Then, use the new table to replace both of the old tables.

Create a query that shows all the records from the overlapping tables together

Create a new table from the records in the overlapping tables

Decide and indicate which records to delete from the new table

Create a query to delete the unwanted duplicates

Create a query that shows all the records from the overlapping tables together

To bring together overlapping records from multiple tables, you create a union query. A union query usually filters out duplicates, but you can use the ALL keyword to let Access know that you want to see duplicate records. You should also sort the records by a field that has values you want to compare, such as names. The following procedure uses the Clients and Customers sample tables provided in the section Prepare sample data to follow along with some examples.

Note:  You can use more than two tables in a union query; just add a UNION ALL clause between each pair of SELECT statements.

  1. On the Create tab, in the Queries group, click Query Design.

  2. Close the Show Table dialog box without adding any tables or queries to the design grid.

  3. On the Design tab, in the Results group, click View, and then click SQL View.

  4. Paste the following SQL statement in the window.

    SELECT CompanyName, ContactName, Address, City, PostalCode, Phone, "" AS Remove FROM Customers 
    UNION ALL
    SELECT Name, "" AS ContactName, Address, City, "" AS PostalCode, Phone, "" AS Remove FROM Clients
    ORDER BY [Company Name];

    The first SELECT statement retrieves records from the Customers table, and the second SELECT statement retrieves records from the Clients table, and also creates empty values for the fields that are in Customers but not Clients. Both SELECT statements also contain a new field that you can use later to indicate which records to delete. The UNION ALL clause brings all records from both tables together. The ORDER BY clause sorts the records so that duplicate records are placed together for easy reviewing.

    Note: Each SELECT statement in a union query must return the same number of fields, and in the same order. The corresponding fields must have compatible data types, but with one exception: you can use a Number field and a Text field as corresponding fields. Also, remember that field can names differ: the query results will use the fields name from the first SELECT statement.

  5. On the Design tab, in the Results group, click Run to review the results.

  6. Save the query. You will use this query as a data source for a new table.

Create a new table from the records in the overlapping tables

You can use a union query as the data source of a new query that makes a new table. The new table will contain all the records from the overlapping tables. You can then edit the records and delete the ones that you don't want

  1. On the Create tab, in the Queries group, click Query Design.

  2. In the Show Table dialog box, click the Queries tab, and then double-click the union query that you created in the previous section.

  3. In the query designer, double-click the asterisk on the union query to add all its fields to the query results.

  4. On the Design tab, in the Query Type group, click Make Table.

  5. In the Make Table dialog box, enter a name for the new table.

  6. On the Design tab, in the Results group, click Run.

Decide and indicate which records to delete from the new table

After you combine the data into a new table, review it, and mark the records that you want to delete

  1. Open the new table.

  2. Review each set of duplicates, and enter an "x" in the Remove field for the duplicate records that you want to delete. You should only keep one record out of each set of duplicates.

Create a query to delete the unwanted duplicates

To delete the unwanted duplicates, follow the steps in the section Create and run a delete query, modified as follows:

  1. At step 4 (add fields to use as deletion criteria), add the Remove field.

  2. At step 5 (enter criteria), enter "x" in the Criteria row of the Remove field.

Top of Page

Help prevent users from entering duplicate data in a table

Here are some tips to help you prevent users from entering duplicate data in a table.

  • Set a field to contain only unique values.    Do one of the following:

    • Designate the field as a primary key. You can do this only if the table does not already have a primary key. To set the field as a primary key, open the table in Design view, right-click the field name, and then click Primary Key Button image . A primary key field accepts only unique values and will prompt users if they enter a duplicate.

    • If the table already has a primary key, and a different field that should contain unique values is a potential source of duplicate data, set the Indexed property of that field to Yes (No Duplicates) as follows:

      1. Right-click the table in Navigation Pane, and then click Design view.

      2. In the Field Properties pane, click Indexed, and then select Yes (No Duplicates) from the drop-down list.

  • Specify that a combination of values in two or more fields must be unique for each record.    For example, you might designate a combination of fields, such as Name, Address, and City, as the primary key for a table. Note that if you do ever need to enter a record that duplicates these values, you must remove the restriction, which could incur significant follow-up work.

    To create a multi-field primary key, follow these steps:

    1. Open the table in Design view.

    2. Click the row selector to the left of the first field so that the entire row is selected.

    3. Hold down CTRL while you click the row selector for the other fields that you want to include in the primary key.

    4. Right-click the selected fields, and then click Primary Key Button image .

      This step sets the Indexed property of each field in the primary key to Yes (Duplicates OK) so that, for example, you can enter the same name, or address, or city in multiple records, but you cannot enter an identical combination of name, address, and city in more than one record. For example, you can have two contacts named Jean Philippe Bagel who live at 2345 Main Street, as long as they don't both live in London.

Top of Page

No comments:

Post a Comment