

RAID, files and cloud storage

RAID is a method that takes independent drives and lets a system group them together for security (redundancy via parity), speed, increased storage space, or all three. One of the long-time stalwarts of the RAID world is RAID 5. RAID 5 needs at least 3 identically sized disks. They are combined so that the usable space is N-1 drives (i.e., in a 3-drive system, total space is 2x the drive size). The remaining drive's worth of space holds parity data, distributed across all of the disks. With that parity data, you can lose any one of the drives and still have access to your data. If you lose two or more of the drives, though, you'd better have a good backup.

How does this work? Through the magic of XOR. The following statements are all true:

A XOR B = PAR
PAR XOR B = A
A XOR PAR = B
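These identities are easy to verify directly. Here's a quick sketch in Python (Python rather than Delphi, purely for illustration; the byte values are arbitrary examples):

```python
# Verify the XOR parity identities for a pair of data bytes.
A, B = 0b10110100, 0b01101001
PAR = A ^ B          # parity of the two data bytes

assert PAR ^ B == A  # lose "drive" A: recover it from PAR and B
assert A ^ PAR == B  # lose "drive" B: recover it from A and PAR
print("identities hold")
```

XOR is its own inverse, which is the whole trick: XORing the parity with the surviving data cancels everything except the missing value.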


That's how parity lets you lose one disk and still recover your data. The same rules apply to larger sets; note that real RAID 5 rotates the position of the parity data across the drives. A system with five data drives plus a parity drive looks like:

A XOR B XOR C XOR D XOR E = PAR
PAR XOR B XOR C XOR D XOR E = A
A XOR PAR XOR C XOR D XOR E = B
A XOR B XOR PAR XOR D XOR E = C
A XOR B XOR C XOR PAR XOR E = D
A XOR B XOR C XOR D XOR PAR = E


The other interesting thing to note is that each drive stores only 1/(n-1) of the total file size. Thus a 100-byte file split across six drives (five data stripes plus parity) stores only 20 bytes on each drive. This is where the space increase comes from.
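The arithmetic is easy to check; a one-function sketch in Python (drive counts and file sizes here are just examples):

```python
def stripe_size(file_size: int, n_drives: int) -> int:
    # Each drive holds 1/(n-1) of the file: n-1 data stripes plus one
    # stripe's worth of parity, all the same size.
    return file_size // (n_drives - 1)

print(stripe_size(100, 6))  # 100-byte file on 6 drives -> 20 bytes per drive
print(stripe_size(100, 3))  # 3 drives -> 50 bytes per drive
```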

Now, the question is: why do we care? Aside from the fact that it's nice to know how something works, this technique can be applied to cloud storage. If you follow the online storage market, you may have seen a number of providers come and go. The problem is, if they disappear or lose your data, what happens then?

If you are using them as a convenient off-site data storage pool for large amounts of infrequently used data, you could implement a RAID 5-style data split. Keep in mind there's a bit of overhead in doing so, but the point of this exercise is to reduce your dependence on any one provider. Coincidentally, it will likely cost you the same or less than using a single cloud storage provider.

Take, for instance, Amazon's S3 service and Rackspace's Mosso Cloud Files. Both services charge per gigabyte per month. For argument's sake, let's assume we have a 10GB RAR file (BIGFILE.RAR) that we want to back up. If we split it into a RAID 5, 3-drive format, that leaves us with BIGFILE.RAR.0, BIGFILE.RAR.1 and BIGFILE.RAR.RAID. Each file will be 5GB (10GB / (3-1)) in size. I can upload one file to Amazon, one file to Rackspace and retain one file on my hard drive. I've now backed up the file in a form that can be recovered even if I lose my hard drive, or if Amazon or Rackspace has an outage at the moment I need my data. As long as I have access to any two parts of the file, I can recreate the original.
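The split itself can be sketched in a few lines of Python (toy data instead of a 10GB RAR; the function name is mine, and padding for files whose size isn't a multiple of the stripe count is ignored):

```python
from functools import reduce

def split_raid5(data: bytes, stripes: int = 2) -> list[bytes]:
    """Round-robin the bytes across `stripes` data parts, plus one parity part.
    Assumes len(data) is a multiple of `stripes` (no padding handled)."""
    parts = [data[i::stripes] for i in range(stripes)]
    # Parity byte for each position is the XOR across all data parts.
    parity = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*parts))
    return parts + [parity]

parts = split_raid5(b"HELLO!", 2)
print([p.hex() for p in parts])  # two 3-byte data parts plus 3 bytes of parity
```

With `stripes=2` this mirrors the BIGFILE.RAR example: two data parts and one parity part, each half the size of the original file.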

Obviously, Amazon and Rackspace are large enough that it's unlikely they'd actually lose data over the long term. The same can't be said of some of the smaller players in the market. Companies like Streamload managed to wipe out about half of their customers' data during a reorganization before finally closing their doors. Anyone caught unaware lost all of their data.

I should also note that you could use a PAR or PAR2 system instead. The easiest implementation to use is probably QuickPAR. It uses a system similar to what I've shown here, but its source is not very conducive to Delphi developers. From what I can tell, it was originally developed to push binary files around Usenet, but it works equally well for cloud storage. If you're just looking for a good off-the-shelf tool, QuickPAR is probably the way to go. If you're interested in a parity solution you can embed in your own code, the source is included below.



procedure File2RaidFiles(fileName: string; raidLength: integer);
var
  FS: TFileStream;
  outputFS: array of TFileStream;
  byteArray: array of byte;
  ix: integer;
begin
  //test to make sure we were called correctly
  if not FileExists(fileName) then
    raise Exception.Create(Format('File %s doesn''t exist', [fileName]));
  if raidLength < 2 then
    raise Exception.Create(Format('Raid Length must be greater than 1. Given value was %d', [raidLength]));

  FS := TFileStream.Create(fileName, fmOpenRead);
  try
    SetLength(outputFS, raidLength + 1); //+1 for the parity byte
    SetLength(byteArray, raidLength + 1);
    for ix := 0 to raidLength do //create an output stream for each stripe (named .stripe#) and the parity file (named .raid)
    begin
      if ix = raidLength then
        outputFS[ix] := TFileStream.Create(fileName + '.raid', fmCreate)
      else
        outputFS[ix] := TFileStream.Create(fileName + '.' + IntToStr(ix), fmCreate);
    end;
    try
      //note: assumes the file size is a multiple of raidLength; a trailing
      //partial block would leave stale bytes at the end of byteArray
      while FS.Position < FS.Size do //while we haven't hit the end of the file
      begin
        FS.Read(byteArray[0], raidLength); //read one byte per stripe into byteArray
        byteArray[raidLength] := byteArray[0]; //seed the parity byte with the first stripe's byte...
        for ix := 1 to raidLength - 1 do //...then XOR in each remaining stripe's byte
          byteArray[raidLength] := byteArray[raidLength] xor byteArray[ix];
        for ix := 0 to raidLength do //write out the bytes to the respective stripes and the parity file (outputFS[raidLength])
          outputFS[ix].Write(byteArray[ix], 1);
      end;
    finally
      for ix := 0 to raidLength do //clean up the output streams
        outputFS[ix].Free;
    end;
  finally
    FS.Free; //clean up the input stream
  end;
end;

procedure RaidFiles2File(fileName, outputName: string; raidLength: integer);
var
  FS: TFileStream;
  inputFS: array of TFileStream;
  testFS: TFileStream;
  byteArray: array of byte;
  ix, damage: integer;
  countOfMissing: integer;
  checkByte: byte;
  checkFails: boolean;
begin
  //test to make sure we were called correctly
  if FileExists(outputName) then
    raise Exception.Create(Format('File %s already exists', [outputName]));
  if raidLength < 2 then
    raise Exception.Create(Format('Raid Length must be greater than 1. Given value was %d', [raidLength]));
  //init the basics
  checkFails := false;
  testFS := nil;
  countOfMissing := 0;
  damage := -1; //-1 means no stripe is missing (a missing file may be the parity file instead)
  SetLength(inputFS, raidLength + 1); //+1 is the parity stream
  SetLength(byteArray, raidLength + 1);
  for ix := 0 to raidLength do
    inputFS[ix] := nil;

  //setup the output file stream
  FS := TFileStream.Create(outputName, fmCreate);
  try
    //create the input file streams. track the count of missing streams and keep a pointer
    //to one valid stripe stream for testing eof (any stripe will do; all are the same size)
    for ix := 0 to raidLength do
    begin
      if ix = raidLength then //this is the parity stream
      begin
        if FileExists(fileName + '.raid') then
          inputFS[ix] := TFileStream.Create(fileName + '.raid', fmOpenRead);
      end
      else //this is a stripe stream
      begin
        if FileExists(fileName + '.' + IntToStr(ix)) then
        begin
          inputFS[ix] := TFileStream.Create(fileName + '.' + IntToStr(ix), fmOpenRead);
          testFS := inputFS[ix]; //just used to test for eof; all files are the same size
        end;
      end;
      if inputFS[ix] = nil then
        Inc(countOfMissing); //if we didn't get an input stream, add to the missing count
    end;
    //you are only allowed to have 1 missing input file
    if countOfMissing > 1 then
      raise Exception.Create('Unable to recover file! All but one of the stripe and parity files must be present');
    Assert(testFS <> nil, 'testFS=nil? This should never happen');

    while testFS.Position < testFS.Size do //while we are not at the end of the input files
    begin
      if countOfMissing = 0 then //if there are no missing streams, we can just remerge the data
      begin
        for ix := 0 to raidLength do
          inputFS[ix].Read(byteArray[ix], 1);

        //recalculate the parity byte to make sure it all still agrees
        checkByte := byteArray[0];
        for ix := 1 to raidLength - 1 do
          checkByte := checkByte xor byteArray[ix];
        if checkByte <> byteArray[raidLength] then //recalculated parity doesn't match the stored parity byte: the file is damaged
          checkFails := true;
      end
      else //if a stream is missing, we have to recalculate the missing data
      begin //reversing the XOR with the parity byte recovers the lost byte
        for ix := 0 to raidLength - 1 do //the end of the array is reserved for the rebuilt byte
          if inputFS[ix] <> nil then //if this is a valid stripe, read its data
            inputFS[ix].Read(byteArray[ix], 1)
          else
          begin //this stripe is missing; remember which one and read the parity byte into its slot
            damage := ix;
            inputFS[raidLength].Read(byteArray[ix], 1);
          end;
        if damage >= 0 then //a stripe (rather than the parity file) is missing; rebuild it
        begin
          byteArray[raidLength] := byteArray[0]; //XOR the surviving stripes with the parity byte...
          for ix := 1 to raidLength - 1 do
            byteArray[raidLength] := byteArray[raidLength] xor byteArray[ix];
          byteArray[damage] := byteArray[raidLength]; //...which yields the damaged byte
        end;
      end;
      FS.Write(byteArray[0], raidLength); //write out the merged (and possibly restored) data
    end;
  finally
    for ix := 0 to raidLength do //clean up the input streams
      if inputFS[ix] <> nil then
        inputFS[ix].Free;
    FS.Free; //clean up the output stream
  end;
  if checkFails then
    raise Exception.Create(Format('This file (%s) does not match its parity checks. Damage may be present', [outputName]));
end;
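The same round trip is easy to sanity-check outside Delphi. This Python sketch (illustrative names, in-memory bytes instead of file streams) splits a buffer into two stripes plus parity, discards a stripe, and rebuilds it, mirroring what the two procedures above do on disk:

```python
from functools import reduce

def xor_bytes(*parts: bytes) -> bytes:
    # Positionwise XOR of equal-length byte strings.
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*parts))

# Split: two data stripes plus parity (bytes striped round-robin).
data = b"CLOUDSTORE"
s0, s1 = data[0::2], data[1::2]
par = xor_bytes(s0, s1)

# Pretend stripe 0 was lost; rebuild it from the surviving stripe and parity.
rebuilt = xor_bytes(par, s1)
assert rebuilt == s0

# Re-merge the stripes back into the original byte order.
merged = bytes(b for pair in zip(rebuilt, s1) for b in pair)
assert merged == data
print("recovered", merged.decode())
```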

Comments

Bob Swart said…
You write that the size stored on each drive is 1/(n-1) of the total file size. Thus, you conclude that a 100 byte file in a RAID 5 system only stores 20 bytes on each drive. This is not true IMHO. In a RAID 5 system, the entire file is stored 4 times (4 times and the parity checksum). If you have N disks of M size, your total diskspace is N * M but the amount of files you can store is still only M.
Thomas Mueller said…
wouldn't the example 10 gb file result in 3x 5 gb files for raid 5 with 3 "disks" rather than 3x 3.33 gb?
Anonymous said…
RAID 5 implements striping with parity. Data is "striped" across disks, and parity is distributed across disks too. Available disk space is n-1. It will tolerate one disk failure, but two disk failures will lead to data loss.
RAID 5 incurs a significant write penalty (due to parity calculation and writing), while read performance is good.
Marshall Fryman said…
@Bob: RAID 5 does not mirror the data to each drive. You can see the documentation here discussing RAID 5 striping with parity. Perhaps you're thinking of RAID 1 (mirroring)? With RAID 1, there is no parity drive since an exact copy is stored on each drive.

@Thomas: You are correct. Sorry, I made an inline edit and didn't catch the number change. Thanks!
Marshall Fryman said…
Ok, apparently href's aren't allowed in the comment section. Here's the link: http://en.wikipedia.org/wiki/RAID_5#RAID_5
Dale McKay said…
The following 2 statements are equivalent:

1) for ix := 0 to ( rl- 2 ) do
ba[rl] := ba[ix] xor ba[ix+1];

2) ba[rl] := ba[rl- 2] xor ba[rl-1];

So I think your parity calculation is incorrect.
You're not accounting for the result of the previous loop iteration.
Marshall Fryman said…
@Dale : You are, indeed, correct. I tested the code using a 3-file emulation which just so happens to work. When you move beyond 3-files (3 "drives") the parity code does not work. I've corrected the post so it contains the correct source. The only change is in the parity calculation. Thanks for catching that. I was so focused on writing a utility for Amazon and Rackspace I didn't think to test it for more files.


Copyright 2008-2022, Marshall Fryman