- Processing data is a fundamental operation in Computer Science
- Two fundamental operations in processing data are searching and sorting
- Form the basis or preprocessing step of many algorithms
- Large variety of algorithms have been developed
- What are you searching? strings? numbers? files? objects?
- How are you searching? (Not algorithmically, but by what criteria or criterion)
- You could be searching for the first or the last or all instances of a particular criterion
- You could search for extremal elements: minimum, maximum, median, i-th order statistic
- There are many basic and advanced criteria that you can search on
- We want to develop a general, abstract search algorithm
- Pseudocode: a general, generic "code" language that allows you to specify a general abstract algorithm without getting to the specifics of any particular programming language
- Linear search pseudocode: see book
- Let's translate the linear search algorithm into actual C code
/**
* This function searches the given array for
* the given key value and returns the first
* index at which it finds a matching element.
*
* This function returns -1 if no matching element
* is found
*/
int linearSearch(const int *arr, int n, int key) {
for(int i=0; i<n; i++) {
if(a[i] == key) {
return i;
}
}
return -1;
}
//alternatively:
double * linearSearch(const double *arr, int n, double key) {
for(int i=0; i<n; i++) {
if(a[i] == key) {
return &a[i];
}
}
return NULL;
}- Problem: the above functions are only good for: arrays of integers.
- If we wanted a version for arrays of
doubleorchar *(strings) orStudentstructures, etc. - Goal (eventually): design/implement ONE search/sort function to search/sort any type of variable
- Suppose that the input array (collection) were sorted: how can you exploit this structure?
- You can exploit this structure by checking the middle element
$m$ (searching for$k$ ):- if equal
$m = k$ , you stop because you've found your needle - if
$k < m$ : you "recursively" search on the left hand side - if
$m < k$ : you recursively search on the right hand side - You stop when the size of the array is zero
- if equal
int binarySearch(const int *a, int i, int j, int key) {
if(i > j) {
//no such element exists, the array is of "size" zero
return -1;
} else {
int m = (i + j) / 2; //m is the middle index
if(a[m] == key) {
return m;
} else if(key < a[m]) {
//recurse on the left hand side
binarySearch(a, i, m-1, key);
} else if(a[m] < key) {
//recurse on the right hand side
binarySearch(a, m+1, j, key);
}
}
}- Suppose you have an array of
$n$ elements - How many comparisons does linear search perform to search the given array?
- Best case scenario: you find the key at the first element: one comparison!
- Worst case scenario: you find the key at the end of the array (or you don't find it at all):
$n$ comparisons - Average case scenario: naive analysis:
$$\frac{n + 1}{2} \approx \frac{n}{2}$$ - Both the average and worst cases are linear
- Binary Search?
- Derivation: whitepaper and in the book
- Only takes about
$\log_2{(n)}$ comparisons
-
Suppose you have a database of
$10^{12}$ (1 trillion) elements- If unsorted you would have to use linear search to find a particular element:
$$500 \times 10^{11}$$ 500 billion comparisons - If sorted (indexed) then you can exploit the sort using binary search:
$$\log_2{(10^{12})} < 40$$
- If unsorted you would have to use linear search to find a particular element:
-
Another perspective: suppose you "double" the input size of an array
$$n \rightarrow 2n$$ - linear search: the number of comparisons goes from
$$\frac{n}{2} \rightarrow n$$ The number of comparisons doubles - binary search:
$$\log_2{(n)} \rightarrow \log_2{(2n)} = \log_2{(n)} + \log_2{(2)} = \log_2{(n)} + 1$$
- linear search: the number of comparisons goes from
- In practice: you don't generally roll your own searching and sorting algorithms unless you have a Very Good Reason to do so
- In practice you use built-in search/sort algorithms/functions in a standard library
- To avoid multiple solutions you use generic searching/sorting functions
- Problem: a sorting algorithm/function may know how to "sort" but if it is generic, it may not know how to "order"
- You can use a comparator function that specifies how to "order" two elements,
$a, b$ - Given two elements
$a, b$ which goes first? Are$a, b$ in order or out of order? Should they be swapped? - A comparator takes two elements
$a, b$ as input and returns- something negative if
$a < b$ - zero if
$a = b$ - something positive if
$a > b$
- something negative if
- In C, a comparator function has the following signature:
int cmp(const void *a, const void *b)
void *is a "generic void pointer": it can point to anything!- the compartor returns an integer according to the "contract" outlined above
- Standard pattern:
- cast the pointers to the specific type of data you want to compare
- Use the state of the "object" or structure to determine the proper ordering
- Return a valid integer value that expresses the order of the two elements
- Best practices:
- Use descriptive names for your functions
- Be explicit in your comparisons
- Avoid "tricks" (difference tricks)
- Reuse comaprator functionality whenever possible
- Examples:
- Write a comparator to order integers in non-decreasing order
- Write a comparator to order integers in non-increasing order
- Write a comparator to order
Studentstructures by NUID - Write a comparator to order
Studentstructures by GPA - Write a comparator to order
Studentstructures by last name/first name
- The standard C library provides a function:
void qsort(void *base,
size_t nel,
size_t size,
int (*compar)(const void *, const void *));baseis the array of elements to be sorted (also note: it is notconst)nelis the number of elements in the arraysizeis the number of bytes each element in the array takes, usually you use:sizeoffor this parametercompar- a function pointer to a comparator function that defines the order that you want to sort in
- How can we "pass" a function to another function as an argument?
- Generally these are referred to as "callbacks"
- Ex: GUIs: you can define a button but how do you specify the code that runs when someone presses the button: you provide a callback
- Ex:
qsortneeds a way to order elements, so it needs a comparator function - In C pointers can point to data stored in memory
- But, a pointer is just a memory location
- What else is stored in memory? Your code!
- It makes sense that you can create a pointer that "points" to your code; points to a function in your code
//create a function pointer called ptrToFunc that can
//point to any function that returns
//an integer and takes three arguments:
//(int, double, char)
int (*ptrToFunc)(int, double, char) = NULL;
//declare a pointer that can point to math's sqrt function:
double (*ptrToSqrt)(double) = NULL;
//make our new pointer point to math's sqrt function:
ptrToSqrt = sqrt;
//cool: you can also invoke the function via its pointer:
double y = ptrToSqrt(2.0);
//a whole world of hurt:
sqrt = sin;
ptrToSqrt = sin;- Searching:
search.hcontains linear search functions,lfind,lsearch - Standard library:
void * bsearch(const void *key,
const void *base,
size_t nel,
size_t size,
int (*compar) (const void *, const void *));#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>
typedef struct {
int nuid;
char *firstName;
char *lastName;
double gpa;
//forget the date for now
} Student;
int cmpInt(const void *a, const void *b) {
const int *x = (const int *)a;
const int *y = (const int *)b;
if(*x < *y) {
//avoid the difference trick: return *x - *y
//suceptible to overflow
return -1;
} else if(*x > *y) {
return 1;
} else {
return 0;
}
return 0;
}
int cmpIntDesc(const void *a, const void *b) {
return cmpInt(b, a);
}
int cmpDouble(const void *a, const void *b) {
const double *x = (const double *)a;
const double *y = (const double *)b;
if(*x < *y) {
//avoid the difference trick: return *x - *y
//.5 - .75 = -.25 -> 0
return -1;
} else if(*x > *y) {
return 1;
} else {
return 0;
}
return 0;
}
int cmpStudentByNuid(const void *a, const void *b) {
const Student *x = (const Student *) a;
const Student *y = (const Student *) b;
return cmpInt( &(x->nuid), &(y->nuid) );
}
int cmpStudentByGpa(const void *a, const void *b) {
const Student *x = (const Student *) a;
const Student *y = (const Student *) b;
return cmpDouble( &(y->gpa), &(x->gpa) );
}
int cmpStudentByName(const void *a, const void *b) {
const Student *x = (const Student *) a;
const Student *y = (const Student *) b;
int result = strcmp(x->lastName, y->lastName);
if(result != 0) {
return result;
} else if(result == 0) {
//tie: same last name!
return strcmp(x->firstName, y->firstName);
//there may still be a tie, but that's okay
}
}
int main(void) {
int n = 10;
int arr[] = {6, 3, 9, 4, 1, 8, 3, 12, 2, 21};
for(int i=0; i<n; i++) {
printf("%d ", arr[i]);
}
printf("\n");
qsort(arr, n, sizeof(int), cmpIntDesc);
for(int i=0; i<n; i++) {
printf("%d ", arr[i]);
}
printf("\n");
int k = 8;
int *ptr = (int *) bsearch(&k, arr, n, sizeof(int), cmpDouble);
if(ptr != NULL) {
printf("%d found at memory location %p, which contains %d\n", k, ptr, *ptr);
} else {
printf("not found!\n");
}
return 0;
}