RAID is a method that takes independent drives and lets a system group them together for redundancy, speed, increased storage space, or all three. One of the long-time stalwarts of the RAID world is RAID 5. In RAID 5 you need at least 3 identically sized disks. They are combined so that the usable storage space is N-1 drives (i.e., in a 3-drive system, total space is 2x the drive size); the remaining drive's worth of space holds parity data, which RAID 5 spreads across all of the drives. With that parity, you can lose any one of the drives and still have access to your data. If you lose two or more of the drives though, you'd better have a good backup.
How does this work? Through the magic of XOR. The following statements are all true:
A XOR B = PAR
PAR XOR B = A
A XOR PAR = B
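If you want to see it in action, here's a minimal sketch (plain Object Pascal, with made-up byte values) that computes the parity of two data bytes and then uses it to rebuild one of them:

program XorParityDemo;
{$APPTYPE CONSOLE}
uses SysUtils;
var
  A, B, Par, Rebuilt: Byte;
begin
  A := $5A;             //data byte from "drive" A (hypothetical value)
  B := $C3;             //data byte from "drive" B (hypothetical value)
  Par := A xor B;       //parity byte stored on the third drive
  Rebuilt := Par xor B; //pretend drive A died; XOR the survivors to get it back
  Assert(Rebuilt = A, 'recovered byte should equal the original');
  WriteLn(Format('A=%2.2x B=%2.2x PAR=%2.2x recovered A=%2.2x', [A, B, Par, Rebuilt]));
end.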
That's how parity lets you lose one disk and still recover your data. The same rules also apply in larger sets; note that RAID 5 rotates the position of the parity data from stripe to stripe. So a 6-drive system looks like:
A XOR B XOR C XOR D XOR E = PAR
PAR XOR B XOR C XOR D XOR E = A
A XOR PAR XOR C XOR D XOR E = B
A XOR B XOR PAR XOR D XOR E = C
A XOR B XOR C XOR PAR XOR E = D
A XOR B XOR C XOR D XOR PAR = E
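The same trick scales to any stripe width: XOR together everything that survived (the remaining data bytes plus the parity byte) and the result is whatever went missing. Here's a small sketch of that idea as a hypothetical helper function:

//Rebuilds one missing byte of a stripe from the surviving bytes.
//survivors should hold every data byte except the lost one, plus the parity byte.
function RecoverMissingByte(const survivors: array of Byte): Byte;
var
  ix: Integer;
begin
  Result := 0;
  for ix := Low(survivors) to High(survivors) do
    Result := Result xor survivors[ix]; //XOR accumulates; order doesn't matter
end;

So for the 6-drive stripe above, losing C means RecoverMissingByte([A, B, D, E, PAR]) hands C back.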
The other interesting thing to note is that the amount of data stored on each drive is 1/(n-1) of the total file size. Thus a 100-byte file in the 6-drive RAID 5 system above stores only 20 bytes on each drive. This is where the storage-space increase comes from.
Now, the question is, why do we care? Other than that it's nice to know how something works, this technique can be applied to cloud storage. If you've followed the online storage market, you may have seen a number of providers come and go. The problem is, if they disappear or lose your data, what happens then?
If you are using them as a convenient off-site data storage pool for large amounts of infrequently used data, you could implement a RAID 5-style data split. Keep in mind there's a bit of overhead in doing so, but the point of this exercise is to reduce your dependence on any one provider. Coincidentally, it will likely cost you the same or less than using a single cloud storage provider.
Take for instance Amazon's S3 service and Rackspace's Mosso Cloud Files. Both services charge per gigabyte per month. For argument's sake, let's assume we have a 10GB RAR file (BIGFILE.RAR) that we want to back up. If we split it into a RAID 5, 3-drive format, that leaves us with BIGFILE.RAR.0, BIGFILE.RAR.1 and BIGFILE.RAR.RAID. Each file will be 5GB (10GB / (3-1)) in size. I can upload one file to Amazon, one file to Rackspace and retain one file on my hard drive. I've now backed up the file in an online form that can be recovered even if I lose my hard drive, or Amazon or Rackspace has an outage at the very moment I need access to my data. So long as I have access to any two of the three parts, I can recreate the original file.
Obviously, Amazon and Rackspace are large enough that it is unlikely they'd actually lose the data over the long term. The same can't be said of some of the smaller players in the market. Companies like Streamload managed to wipe out about half of their customers' data during a reorganization before finally closing their doors. Anyone caught unaware lost all of their data.
I should also note that you could use a PAR or PAR2 system instead. The easiest-to-use implementation is probably QuickPAR. It uses a system similar to what I've shown here, but the source is not very approachable for Delphi developers. From what I can tell, it was originally developed to push binary files around Usenet, but it would work equally well for cloud storage. If you're just looking for a good off-the-shelf tool, QuickPAR is probably the way to go. If you're interested in a parity solution that you can embed in your own code, the source is included below.
procedure File2RaidFiles(fileName:string; raidLength:integer);
var FS:TFileStream;
outputFS:array of TFileStream;
byteArray:array of byte;
ix:integer;
begin
//test to make sure we were called correctly
if not FileExists(fileName) then
raise Exception.Create(Format('File %s doesn''t exist', [fileName]));
if raidLength<2 then
raise Exception.Create(Format('Raid Length must be greater than 1. Given value was %d',[raidLength]));
FS:=TFileStream.Create(fileName,fmOpenRead);
try
setLength(outputFS, raidLength+1); //+1 for the parity byte
setLength(byteArray, raidLength+1);
for ix := 0 to raidLength do //create an output file for each stripe (named .<stripe number>) and the parity file (named .raid)
begin
if ix=raidLength then
outputFS[ix]:=TFileStream.Create(fileName+'.raid',fmCreate)
else
outputFS[ix]:=TFileStream.Create(fileName+'.'+IntToStr(ix),fmCreate);
end;
try
while FS.Position<FS.Size do //while we haven't hit the end of the file
begin
FillChar(byteArray[0],raidLength,0); //zero-fill first so a short final read pads the stripe with zeros
FS.Read(byteArray[0],raidLength); //read in the bytes to the byteArray
byteArray[raidLength]:=0; //this calcs the parity byte by XORing all of the data bytes together
for ix := 0 to raidLength-1 do
byteArray[raidLength]:=byteArray[raidLength] xor byteArray[ix];
for ix := 0 to raidLength do //write out the bytes to the respective stripes (stripes are < raidlength) and the parity file (outputFS[raidlength])
outputFS[ix].Write(byteArray[ix],1);
end;
finally
for ix := 0 to raidLength do //clean up the output streams
outputFS[ix].Free;
end;
finally
FS.Free; //clean up the input stream
end;
end;
procedure RaidFiles2File(fileName, outputName:string; raidLength:integer);
var FS:TFileStream;
inputFS:array of TFileStream;
testFS:TFileStream;
byteArray:array of byte;
ix, damage:integer;
countOfMissing:integer;
checkByte:byte;
checkFails:boolean;
begin
//test to make sure we were called correctly
if FileExists(outputName) then
raise Exception.Create(Format('File %s already exists', [outputName]));
if raidLength<2 then
raise Exception.Create(Format('Raid Length must be greater than 1. Given value was %d',[raidLength]));
//init the basics
checkFails:=false;
testFS:=nil;
countOfMissing:=0;
damage:=-1; //-1 means no data stripe is missing
setLength(inputFS, raidLength+1); //+1 is the parity byte
setLength(byteArray, raidLength+1);
for ix := 0 to raidLength do
inputFS[ix]:=nil;
//setup the output file stream
FS:=TFileStream.Create(outputName,fmCreate);
try
//create the input file streams. make sure we pick up the count of missing streams and a pointer to an input stream for use later on (any stream will do, we're just testing for eof)
for ix := 0 to raidLength do
begin
if ix=raidLength then //this is the parity stream
begin
if FileExists(fileName+'.raid') then
inputFS[ix]:=TFileStream.Create(fileName+'.raid',fmOpenRead)
end
else //this is a stripe stream
begin
if FileExists(fileName+'.'+IntToStr(ix)) then
begin
inputFS[ix]:=TFileStream.Create(fileName+'.'+IntToStr(ix),fmOpenRead);
testFS:=inputFS[ix]; //this is just to test for eof. all files are the same size
end;
end;
if inputFS[ix]=nil then inc(countOfMissing); //if we didn't get an input stream, add to the missing count
end;
//you are only allowed to have 1 missing input file
if countOfMissing>1 then
raise Exception.Create('Unable to recover file! At most one of the part files may be missing');
assert(testFS<>nil, 'testFS=nil? This should never happen');
while testFS.Position<testFS.Size do //while we are not at the end of the input file
begin
if countOfMissing=0 then //if there are no missing streams, we can just remerge the data together
begin
for ix := 0 to raidLength do
inputFS[ix].Read(byteArray[ix],1);
checkByte:=0;
//calc a checkByte to make sure it all still agrees
for ix := 0 to raidLength-2 do //0 based, stop 1 short of the end
if ix=0 then
checkByte:=byteArray[ix] xor byteArray[ix+1] //seed check byte by XORing the 1st two bytes together
else
checkByte:=checkbyte xor byteArray[ix+1]; //XOR the bytes together
if checkByte<>byteArray[raidLength] then //the new checkByte doesn't match the old parity byte. That means we have a problem, the file doesn't match
checkFails:=true;
end
else //if there are missing streams, we have to calculate the missing data
begin //you have to reverse the XOR with the parity bit to determine the result
for ix := 0 to raidLength-1 do //not counting the end of the array, that's where we'll store the damaged byte
if inputFS[ix]<>nil then //if this is a valid stream, read its data
inputFS[ix].Read(byteArray[ix],1)
else
begin //this isn't a valid stream, so remember this is damaged and read from the parity stream for this position
damage:=ix;
inputFS[raidLength].Read(byteArray[ix],1);
end;
//this calcs the missing byte into the end of the array (raidLength) by XORing together everything we did read (the parity byte sits in the damaged slot)
byteArray[raidLength]:=0;
for ix := 0 to raidLength-1 do
byteArray[raidLength]:=byteArray[raidLength] xor byteArray[ix];
if damage>=0 then //only a missing data stripe needs restoring; a missing parity file needs no repair
byteArray[damage]:=byteArray[raidLength]; //replace the damaged byte with the restored byte
end;
FS.Write(byteArray[0],raidLength); //write out the merged (and psbly restored) data
end;
finally
for ix := 0 to raidLength do //clean up the memory from the input streams
if inputFS[ix]<>nil then inputFS[ix].Free;
FS.Free; //clean up the memory for the output stream
end;
if checkFails then
raise Exception.Create(Format('This file (%s) does not match its parity check. Damage may be present', [outputName]));
end;
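To tie it back to the BIGFILE.RAR example, a hypothetical calling sequence might look like this (the paths are made up, and the delete just simulates losing the Rackspace copy; recovery only needs the two remaining parts):

//split BIGFILE.RAR into two 5GB stripes plus a parity file
File2RaidFiles('C:\Backup\BIGFILE.RAR', 2);
//...upload BIGFILE.RAR.0 to Amazon and BIGFILE.RAR.1 to Rackspace, keep BIGFILE.RAR.raid locally...

//later: pretend the Rackspace copy is gone and rebuild from the other two parts
DeleteFile('C:\Backup\BIGFILE.RAR.1');
RaidFiles2File('C:\Backup\BIGFILE.RAR', 'C:\Backup\BIGFILE_RESTORED.RAR', 2);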
Comments
RAID 5 incurs a significant write penalty (due to parity calculation and writing), while read performance is good.
@Thomas: You are correct. Sorry, I made an inline edit and didn't catch the number change. Thanks!
1) for ix := 0 to (rl-2) do
ba[rl] := ba[ix] xor ba[ix+1];
Each pass of loop (1) overwrites ba[rl], so it reduces to:
2) ba[rl] := ba[rl-2] xor ba[rl-1];
So I think your parity calculation is incorrect. You're not accounting for the result of the previous loop iteration.