File Organization and Access; Programmer's Perspective

One of the key elements of a file system is the way the files are organized. File organization is the logical structuring as well as the access method(s) of files.
Given an operating system supporting unstructured files that are stream-of-bytes, how should one organize the contents of the files?
Performance considerations:
- File system performance affects overall system performance
- Organization of the file system affects performance
- File organization (data layout) affects performance; depends on access patterns
Possible access patterns:
- Read the whole file
- Read individual blocks or records from a file
- Read blocks or records preceding or following the current one
- Retrieve a set of records
- Write a whole file sequentially
- Insert/delete/update records in a file
- Update blocks in a file
Criteria for File Organization
- Rapid access
  - Needed when accessing a single record
  - Not needed for batch mode
- Ease of update; File on CDROM will not be updated, so this is not a concern
- Economy of storage
  - Should be minimum redundancy in the data
  - Redundancy can be used to speed access such as an index
- Simple maintenance
- Reliability
Fundamental File Organizations; Common file organization schemes are:
- Pile
- Sequential
- Indexed Sequential
- Indexed
- Direct or Hashed
Pile (see Fig. 8)
- Data are collected in the order they arrive
- Purpose is to accumulate a mass of data and save it
- Records may have different fields
- No structure
- Record access is by exhaustive search
- Update:
  - Same size record; okay
  - Variable size; poor
- Retrieval:
  - Single record; poor
  - Subset; poor
  - Exhaustive; okay
Sequential (see Fig. 8)
- Fixed format used for records
- Records are the same length
- Field names and lengths are attributes of the file
- One field is the key filed
  - Uniquely identifies the record
  - Records are stored in key sequence
- New records are placed in a log file or transaction file
- Batch update is performed to merge the log file with the master file
- Update:
  - Same size record; good
  - Variable size; No
- Retrieval:
  - Single record; poor
  - Subset; poor
  - Exhaustive; okay
Figure 8: Fundamental File Organizations; (a) Pile (b) Sequential (c) Indexed Sequential (d) Indexed.

$\includegraphics[scale=0.5]{figures/pile.ps}$ $\includegraphics[scale=0.5]{figures/sequential.ps}$ $\includegraphics[scale=0.4]{figures/indexedseq.ps}$ $\includegraphics[scale=0.4]{figures/indexed.ps}$
Indexed Sequential (see Figs. 8,9)
- Index provides a lookup capability to quickly reach the vicinity of the desired record
  - Contains key field and a pointer to the main file
  - Indexed is searched to find highest key value that is equal or less than the desired key value
  - Search continues in the main file at the location indicated by the pointer
- New records are added to an overflow file
- Record in main file that precedes it is updated to contain a pointer to the new record
- The overflow is merged with the main file during a batch update
- Update:
  - Same size record; good
  - Variable size; No
- Retrieval:
  - Single record; good
  - Subset; poor
  - Exhaustive; okay
Comparison of sequential and indexed sequential lookup
- Example: a file contains 1 million records
- On average 500,00 accesses are required to find a record in a sequential file
- If an index contains 1000 entries, it will take on average 500 accesses to find the key, followed by 500 accesses in the main file. Now on average it is 1000 accesses
Figure 9: IBM indexed-sequential access method (ISAM).

$\includegraphics[scale=0.7]{figures/indexedibm.ps}$
Indexed File (see Fig. 8)
- Uses multiple indexes for different key fields
- May contain an exhaustive index that contains one entry for every record in the main file
- May contain a partial index
- Update:
  - Same size record; good
  - Variable size; good
- Retrieval:
  - Single record; good
  - Subset; good (Assuming the selecting attribute is indexed on)
  - Exhaustive; okay
The Direct, or Hashed File
- Key field required for each record
- Key maps directly or via a hash mechanism to an address within the file
- Directly access a block at a the known address
- Update:
  - Same size record; good
  - Variable size; No (Fixed sized records used)
- Retrieval:
  - Single record; excellent
  - Subset; poor
  - Exhaustive; poor

**Figure 9:** IBM indexed-sequential access method (ISAM).
$\includegraphics[scale=0.7]{figures/indexedibm.ps}$

2004-05-16